Re: Appropriate use of HTTP status codes for application health checks

Willy Tarreau <w@1wt.eu> Mon, 27 February 2017 06:23 UTC

Date: Mon, 27 Feb 2017 07:19:37 +0100
From: Willy Tarreau <w@1wt.eu>
To: Amos Jeffries <squid3@treenet.co.nz>
Cc: ietf-http-wg@w3.org
Message-ID: <20170227061937.GA5797@1wt.eu>
References: <CADfyV-Pa0fu2SDwLYzMrUe4D0Tv0wu27pmHpLjCxQXR3ev4mmA@mail.gmail.com> <119d9b4e-8587-0d8b-d292-3be61cd1ea72@treenet.co.nz> <20170223102431.GC30956@1wt.eu> <d2b11486-267e-230f-bf3d-821ee9036f56@treenet.co.nz>
In-Reply-To: <d2b11486-267e-230f-bf3d-821ee9036f56@treenet.co.nz>
Subject: Re: Appropriate use of HTTP status codes for application health checks
Archived-At: <http://www.w3.org/mid/20170227061937.GA5797@1wt.eu>

On Mon, Feb 27, 2017 at 05:38:49PM +1300, Amos Jeffries wrote:
> On 23/02/2017 11:24 p.m., Willy Tarreau wrote:
> > Hi Amos,
> > 
> > On Thu, Feb 23, 2017 at 10:53:07PM +1300, Amos Jeffries wrote:
> >> IMHO a more efficient way for a polling system is to use 204 as "All
> >> okay", and 200 as "some problem(s)". No bandwidth wasted with payload on
> >> the common Up status, and ability to deliver details about the outage on
> >> the Down status.
> > 
> > In fact it's common to see health check applications return 5xx for a
> > very simple reason, the front equipment performing the check (often a
> > load balancer) has to deal with these situations anyway, and most use
> > cases just want to return "completely up" or "completely dead". But I
> > agree that when you want to support the gray area in between, it's much
> > better to support intermediary codes. FWIW haproxy also supports a
> > special case of 404 to mean "closing soon, no more requests please" so
> > that admins can simply touch/rm a file in a docroot. That's just to say
> > that there are many valid use cases and that common sense adapted to what
> > components *reliably* support is often the best here.
> > 
> 
> For an individual health-check you are right. But that is not the
> use-case Matt has.
> 
> The use-case in question is for the response coming from some aggregator
> process, which uses health-checks as its input/data. One status code
> summarizing the situation of N endpoints.  No 4xx or 5xx is going to be
> adequate for that, simply because of what the 400 and 500 defaults mean
> to the general HTTP ecosystem.

I totally get your point, but I see a big difference between what would
be perfect and what components can actually do. For example, for over a
decade haproxy was not able to consider anything but the status code, so
many people made their aggregated tests return 500 on failure (it's more
flexible now). And I've had to deal with other products which could only
look at the status code as well.

Also, even for an aggregated test, you may end up with real 5xx errors
because of timeouts or failure to deal with unexpected responses, so
the LB still has to deal with this case normally.

So, seen from the front component, I'd summarize it like this:

   - 200 => status is OK
   - <something> => status is faulty (partially or totally)
   - 5xx => a technical error occurred during processing

Given that 5xx has to be dealt with anyway, if there is no need for a
clear distinction between a failure in the health check component and a
faulty test, 5xx will work fine. If a distinction is needed (e.g. all
responses are logged), then something else would be better (including
some 2xx as you proposed).
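
To make the three cases above concrete, here is a minimal sketch in
Python of how a front component might classify a check response. It is
only an illustration, not how haproxy or any other product does it: the
names Health and classify are made up, and faulty_codes stands for
whatever "<something>" a given deployment picks (the 202 in the usage
note is an arbitrary example, not a recommendation).

```python
from enum import Enum


class Health(Enum):
    OK = "ok"                    # 200: status is OK
    FAULTY = "faulty"            # <something>: partially or totally faulty
    CHECK_ERROR = "check_error"  # 5xx: the check itself failed
    UNKNOWN = "unknown"          # anything else, left to local policy


def classify(status: int, faulty_codes: frozenset = frozenset()) -> Health:
    """Map a health-check status code onto the three cases (hypothetical helper).

    faulty_codes is the deployment's own choice for "<something>".
    It is tested before the generic 5xx rule, so a site that does not
    need the distinction can simply put a 5xx code in it.
    """
    if status == 200:
        return Health.OK
    if status in faulty_codes:
        return Health.FAULTY
    if 500 <= status <= 599:
        return Health.CHECK_ERROR
    return Health.UNKNOWN
```

With faulty_codes={202}, a 202 is reported as FAULTY and a 503 as
CHECK_ERROR, so the two can be logged separately. With
faulty_codes={503}, or when the caller treats everything but OK as
down, the distinction disappears and plain 5xx does the job.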

Cheers,
willy