Re: RFC 9113 and :authority header field

Willy Tarreau <w@1wt.eu> Thu, 30 June 2022 07:05 UTC

Return-Path: <ietf-http-wg-request+bounce-httpbisa-archive-bis2juki=lists.ie@listhub.w3.org>
X-Original-To: ietfarch-httpbisa-archive-bis2Juki@ietfa.amsl.com
Delivered-To: ietfarch-httpbisa-archive-bis2Juki@ietfa.amsl.com
Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id CF49CC15B279 for <ietfarch-httpbisa-archive-bis2Juki@ietfa.amsl.com>; Thu, 30 Jun 2022 00:05:44 -0700 (PDT)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -2.66
X-Spam-Level:
X-Spam-Status: No, score=-2.66 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, HEADER_FROM_DIFFERENT_DOMAINS=0.25, MAILING_LIST_MULTI=-1, RCVD_IN_DNSWL_BLOCKED=0.001, RCVD_IN_MSPIKE_H2=-0.001, RCVD_IN_ZEN_BLOCKED_OPENDNS=0.001, SPF_PASS=-0.001, T_SCC_BODY_TEXT_LINE=-0.01] autolearn=ham autolearn_force=no
Received: from mail.ietf.org ([50.223.129.194]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id 5dUz6s6vGbfr for <ietfarch-httpbisa-archive-bis2Juki@ietfa.amsl.com>; Thu, 30 Jun 2022 00:05:42 -0700 (PDT)
Received: from lyra.w3.org (lyra.w3.org [128.30.52.18]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange ECDHE (P-256) server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id 12315C15CF42 for <httpbisa-archive-bis2Juki@lists.ietf.org>; Thu, 30 Jun 2022 00:05:41 -0700 (PDT)
Received: from lists by lyra.w3.org with local (Exim 4.94.2) (envelope-from <ietf-http-wg-request@listhub.w3.org>) id 1o6oBP-00Gydo-Py for ietf-http-wg-dist@listhub.w3.org; Thu, 30 Jun 2022 07:01:47 +0000
Resent-Date: Thu, 30 Jun 2022 07:01:47 +0000
Resent-Message-Id: <E1o6oBP-00Gydo-Py@lyra.w3.org>
Received: from titan.w3.org ([128.30.52.76]) by lyra.w3.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from <w@1wt.eu>) id 1o6oBO-00Gycq-Or for ietf-http-wg@listhub.w3.org; Thu, 30 Jun 2022 07:01:46 +0000
Received: from wtarreau.pck.nerim.net ([62.212.114.60] helo=1wt.eu) by titan.w3.org with esmtp (Exim 4.94.2) (envelope-from <w@1wt.eu>) id 1o6oBM-0076KQ-Hw for ietf-http-wg@w3.org; Thu, 30 Jun 2022 07:01:45 +0000
Received: (from willy@localhost) by pcw.home.local (8.15.2/8.15.2/Submit) id 25U71N59020621; Thu, 30 Jun 2022 09:01:23 +0200
Date: Thu, 30 Jun 2022 09:01:23 +0200
From: Willy Tarreau <w@1wt.eu>
To: "Roy T. Fielding" <fielding@gbiv.com>
Cc: Tatsuhiro Tsujikawa <tatsuhiro.t@gmail.com>, HTTP <ietf-http-wg@w3.org>
Message-ID: <20220630070123.GA20552@1wt.eu>
References: <CAPyZ6=+q+MoOOwoCxbtFjt+gqsjHBqTzz9KXNVcs3EP-4VFp=Q@mail.gmail.com> <D7142A8A-5B80-46F5-A653-2307EE2DC5D8@gbiv.com> <CAPyZ6=LCSDAsPoFCQ2cRO-i+dpo5vnp2L5A7ZLw8dvRtDs6HUg@mail.gmail.com> <20220629055254.GA18881@1wt.eu> <34B74169-9A07-4003-8F76-1B518DE3A3A0@gbiv.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Disposition: inline
In-Reply-To: <34B74169-9A07-4003-8F76-1B518DE3A3A0@gbiv.com>
User-Agent: Mutt/1.10.1 (2018-07-13)
Received-SPF: pass client-ip=62.212.114.60; envelope-from=w@1wt.eu; helo=1wt.eu
X-W3C-Hub-Spam-Status: No, score=-4.9
X-W3C-Hub-Spam-Report: BAYES_00=-1.9, SPF_HELO_PASS=-0.001, SPF_PASS=-0.001, T_SCC_BODY_TEXT_LINE=-0.01, W3C_AA=-1, W3C_IRA=-1, W3C_WL=-1
X-W3C-Scan-Sig: titan.w3.org 1o6oBM-0076KQ-Hw f23c60033c3ecca5d758013f6bc0f9c5
X-Original-To: ietf-http-wg@w3.org
Subject: Re: RFC 9113 and :authority header field
Archived-At: <https://www.w3.org/mid/20220630070123.GA20552@1wt.eu>
Resent-From: ietf-http-wg@w3.org
X-Mailing-List: <ietf-http-wg@w3.org> archive/latest/40223
X-Loop: ietf-http-wg@w3.org
Resent-Sender: ietf-http-wg-request@w3.org
Precedence: list
List-Id: <ietf-http-wg.w3.org>
List-Help: <https://www.w3.org/Mail/>
List-Post: <mailto:ietf-http-wg@w3.org>
List-Unsubscribe: <mailto:ietf-http-wg-request@w3.org?subject=unsubscribe>

Hi Roy,

On Wed, Jun 29, 2022 at 10:49:58AM -0700, Roy T. Fielding wrote:
> > With HTTP/1.1 there are fewer ambiguities since Host is mandatory, but
> > the distinction between "proxy requests" and origin requests is still
> > relevant, especially when you don't know whether or not the origin
> > server supports HTTP/1.1 or only 1.0 (and may be confused by the
> > presence of an authority in the request line). For example, if a
> > client sends:
> > 
> >  GET / HTTP/1.1
> >  Host: example.com
> > 
> > to an HTTP/1.0 server that parses Host, it will work. If it sends
> > 
> >  GET http://example.com/ HTTP/1.1
> >  Host: example.com
> > 
> > To an HTTP/1.1 server, it will work as well, but it may fail with an HTTP/1.0
> > server (or worse, loop onto itself if it supports proxying requests and
> > resolves example.com to itself).
> 
> Well, this ship has sailed, but I must have missed that original discussion.
> 
> The premise is incorrect in all respects, since all of those HTTP/1.1
> requests are also valid HTTP/1.0 requests (even with an absolute URI)
> and so is the presence of Host in those requests.

That's what I mentioned as well (sorry if I was not clear); it's just
that the expectations are not the same, in that HTTP/1.0 is more lenient.

> Host is an HTTP/1.x field that was used in HTTP/1.0 requests (in 1995)
> as soon as we reached consensus on the field name. That was long before
> 1.1 was finished and 1.0 obsoleted.

Oh, I'm well aware of this as well, and indeed 1.1 was mostly an update
writing down what was already being practiced in the field.

> Host is a required part of HTTP/1.0 now just by virtue of the Internet as
> deployed, regardless of the informational RFC.
>
> [The idea was originally proposed in 1994 by John Franks
> 
>    https://lists.w3.org/Archives/Public/ietf-http-wg-old/1994SepDec/0019.html
> 
> but it took a long time to converge on a single syntax
> 
>    https://lists.w3.org/Archives/Public/ietf-http-wg-old/1995JanApr/0067.html
>    https://lists.w3.org/Archives/Public/ietf-http-wg-old/1995JanApr/0084.html
>    https://lists.w3.org/Archives/Public/ietf-http-wg-old/1995JanApr/0130.html
>    https://lists.w3.org/Archives/Public/ietf-http-wg-old/1995SepDec/0291.html

Indeed.

> and while we still talk about it as an important addition of HTTP/1.1 (because
> that's where we chose to document it), the feature is required for 1.0 to
> work with deployed servers.]

That's one point I disagree with. Actually, the *vast* majority of servers
I'm seeing do not require a Host on HTTP/1.0 requests. And I'm pretty sure
that this hasn't changed much over time, because most of our users continue
to use HTTP/1.0 to send health checks to servers, precisely because it
doesn't require configuring a host. Thus you just send "HEAD / HTTP/1.0"
and nothing more, and if you get a response it indicates the server is
not dead. E.g.:

  $ telnet www 80
  Trying 10.x.x.x...
  Connected to www.
  Escape character is '^]'.
  HEAD / HTTP/1.0
  
  HTTP/1.1 200 OK
  Date: Thu, 30 Jun 2022 06:15:39 GMT
  Server: Apache
  Last-Modified: Wed, 18 Nov 2015 19:41:20 GMT
  Accept-Ranges: bytes
  Content-Length: 15019
  Cache-Control: max-age=28800
  Expires: Thu, 30 Jun 2022 14:15:39 GMT
  Vary: Accept-Encoding
  Connection: close
  Content-Type: text/html

I quite often use this to find the site name for an IP address that
doesn't resolve, based on a redirect present in the response or on the
domain of a Set-Cookie field, for example. And one could think that
this only concerns internal hosts, but you can still find plenty of
them on the net, probably kept that way to satisfy scripts or
low-quality tools:

  $ telnet google.com 80
  Trying 216.xx.xxx.xxx...
  Connected to google.com.
  Escape character is '^]'.
  HEAD / HTTP/1.0
  
  HTTP/1.0 200 OK
  Content-Type: text/html; charset=ISO-8859-1
  Date: Thu, 30 Jun 2022 06:18:14 GMT
  Server: gws
  X-XSS-Protection: 0
  X-Frame-Options: SAMEORIGIN
  Expires: Thu, 30 Jun 2022 06:18:14 GMT
  Cache-Control: private
  Set-Cookie: AEC=...; expires=Tue, 27-Dec-2022 06:18:14 GMT; path=/; domain=.google.com; Secure; HttpOnly; SameSite=lax
  
And there is a wide variety of small, convenient servers such as thttpd
that admins still use to distribute packages or deliver status pages
from their monitoring solutions, as well as many trivial servers coming
from generic language frameworks that only support 1.0 and sometimes do
not even look at Host (and in any case do not require it).

E.g.:
  $ python -m SimpleHTTPServer 8080 &
  Serving HTTP on 0.0.0.0 port 8080 ...
  $ telnet 0 8080
  Connected to 0.
  Escape character is '^]'.
  HEAD / HTTP/1.0
  
  127.0.0.1 - - [30/Jun/2022 08:24:53] "HEAD / HTTP/1.0" 200 -
  HTTP/1.0 200 OK
  Server: SimpleHTTP/0.6 Python/2.7.17
  Date: Thu, 30 Jun 2022 06:24:53 GMT
  Content-type: text/html; charset=ISO-8859-1
  Content-Length: 4882
  
  Connection closed by foreign host.

  $ telnet 0 8080
  Trying 0.0.0.0...
  Connected to 0.
  Escape character is '^]'.
  HEAD / HTTP/1.0
  Host: _this isn't even a valid host field_
  
  127.0.0.1 - - [30/Jun/2022 08:25:48] "HEAD / HTTP/1.0" 200 -
  HTTP/1.0 200 OK
  Server: SimpleHTTP/0.6 Python/2.7.17
  Date: Thu, 30 Jun 2022 06:25:48 GMT
  Content-type: text/html; charset=ISO-8859-1
  Content-Length: 4882
  
  Connection closed by foreign host.

> So, an HTTP proxy recipient that receives any form of authority/host
> information must forward that information in either Host or :authority,
> no matter what version it is using.

That's not what I've been used to seeing from early HTTP proxies, and
I can even still verify it right here on an old squid that I no longer
use but that is still running here:

  $ telnet proxy 3128
  Trying 10.x.x.x...
  Connected to proxy.
  Escape character is '^]'.
  HEAD / HTTP/1.0
  Host: google.com
  
  HTTP/1.0 400 Bad Request
  Server: squid/2.6.STABLE13
  Date: Thu, 30 Jun 2022 06:29:59 GMT
  Content-Type: text/html
  Content-Length: 1204
  Expires: Thu, 30 Jun 2022 06:29:59 GMT
  X-Squid-Error: ERR_INVALID_REQ 0
  X-Cache: MISS from px.home.local
  Via: 1.0 px.home.local:3128 (squid/2.6.STABLE13)
  Proxy-Connection: close

Conversely, with a full request and no Host:

  $ telnet proxy 3128
  Trying 10.x.x.x...
  Connected to proxy.
  Escape character is '^]'.
  HEAD http://google.com/ HTTP/1.0
  
  HTTP/1.0 301 Moved Permanently
  Location: http://www.google.com/
  Content-Type: text/html; charset=UTF-8
  Date: Thu, 30 Jun 2022 06:31:29 GMT
  Expires: Sat, 30 Jul 2022 06:31:29 GMT
  Cache-Control: public, max-age=2592000
  Server: gws
  Content-Length: 219
  X-XSS-Protection: 0
  X-Frame-Options: SAMEORIGIN
  X-Cache: MISS from px.home.local
  Via: 1.0 px.home.local:3128 (squid/2.6.STABLE13)
  Proxy-Connection: close

Of course, modern proxies get this right. But based on what you see
above, it's extremely important to preserve the distinction between
these respective fields (i.e. :authority goes to :authority and Host
goes to Host), and here I agree that this holds regardless of the
HTTP version.

> Failure to do so introduces a
> security bypass because L7 routers act on that information whether
> or not the client/server pair is aware of their presence.

Normally the L7 routers decide what to do when that info is absent, or
pick it from the authority field, and they must reject requests where
both are present and mismatch. But I'm extremely careful not to move
one field into the other, in either direction.
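
To make that rule concrete, here is a minimal sketch of the check (in
illustrative shell, not haproxy's actual code; the function name and
messages are made up):

```shell
# check_authority AUTHORITY HOST
# Reject a request when both an :authority pseudo-header and a Host
# field are present and disagree; otherwise route on whichever field
# exists and forward both unchanged (never copying one into the other).
check_authority() {
  if [ -n "$1" ] && [ -n "$2" ] && [ "$1" != "$2" ]; then
    echo "400 Bad Request: :authority/Host mismatch"
    return 1
  fi
  echo "OK: route on whichever field is present; forward both unchanged"
}
```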

> Hence, an HTTP/1.0 proxy that receives your first example should forward
> that as
> 
>     GET / HTTP/1.0
>     Host: example.com
>     Proxy-connection: keep-alive
> 
> because the routing doesn't work otherwise due to name-based hosts
> being deployed before HTTP/1.1.

A proxy aware of HTTP/1.1 will likely do that because it knows such
rules, but an older proxy will not necessarily (as seen above). If you
remember, these were among the issues we all faced in the late '90s
when chaining proxies or starting to mix proxies and servers. And when
you're developing a gateway that can be placed in front of any type of
agent, you have to be extremely careful not to distort the messages
that pass through it.

> And, no, there is absolutely no reason to concern ourselves with proxies
> that loop over their own hostnames, since that is a self-correcting error
> whenever a full URI is received as the request target.

In fact you're right here. I remember the exact case where I was facing
this recurring problem: it was when configuring a component to act as
both a forward and a reverse proxy, precisely because the reverse-proxy
role allowed it to forward requests back to itself. I still remember
blocking requests carrying a Via header from itself to break such
loops, and insisting on not deploying forward and reverse proxies
together... painful times if you ask me. With 1.1 requiring origin
servers to accept absolute-form requests, making Host mandatory, and
having proxies route on regular requests, all of that was solved, but
it took quite some time to spread fully.

> > What we're
> > doing in haproxy is that both Host and :authority are used interchangeably
> > after having been checked for proper matching, and are modified at the
> > same time if needed, and we have a flag indicating if an authority was
> > present in the incoming request to know if we have to produce one on
> > output or not. That's in the end what seems to preserve the most accurate
> > representation along a chain of multiple versions. This allows us to emit
> > a Host field only if one was present, and an authority only if one was
> > present, regardless of the HTTP version. I don't think that RFC9113 brings
> > any changes regarding this, it might only be a matter of what constitutes
> > "control data".
> 
> Sorry, that is a broken implementation. You need to send Host regardless
> of the original request version.

I can guarantee you that each time we accidentally failed to do this,
because of a tiny change or some strengthening of the checks of Host
vs :authority, we got instant reports of various 1.0 applications
breaking. And I did verify carefully that the updated set of RFCs
continues to cover that compatibility requirement with these old
components, i.e. Host remains Host and :authority remains :authority
along the whole chain; only when both are set must they match, and
then we can simplify (e.g. drop :authority when passing to an
HTTP/1.x server).
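
That simplification toward an HTTP/1.x server can be sketched like this
(again illustrative shell, not haproxy's code; the function name is
hypothetical):

```shell
# h2_to_h1_host AUTHORITY HOST
# When emitting toward an HTTP/1.x server: if both fields are set they
# must match (else reject); then :authority is dropped and only Host is
# emitted, synthesized from :authority when no Host was present.
h2_to_h1_host() {
  authority="$1"; host="$2"
  if [ -n "$authority" ] && [ -n "$host" ] && [ "$authority" != "$host" ]; then
    echo "reject: Host/:authority mismatch"
    return 1
  fi
  if [ -n "$host" ]; then
    printf 'Host: %s\n' "$host"
  elif [ -n "$authority" ]; then
    printf 'Host: %s\n' "$authority"
  fi
}
```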

And there's a reason why HTTP/1.0 remains quite popular for internal
tools: it has the benefit of requiring zero processing after the end
of the headers. This is extremely convenient for scripts; you read
until the empty line and stream the rest, until the connection closes,
into a file (or a pipe or whatever; you just need "sed '1,/^$/d'" to
strip the headers). You can also find plenty of simple update scripts
that download and install a package using just netcat, or even just
/dev/tcp in bash or zsh. As soon as you start speaking HTTP/1.1 there,
you run the risk that the server responds with chunked encoding, and
then you need curl or wget. Thus, as much as I would like it to
disappear, I regularly discover new implementations of it :-/
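
For example, such a script can boil down to a tiny function (a sketch
assuming bash for its /dev/tcp pseudo-device; host, port and path are
parameters, and the sed expression also tolerates CRLF line endings):

```shell
# http10_get HOST PORT PATH > body
# Send a minimal HTTP/1.0 request and print only the body: since 1.0
# never chunks, everything after the first empty line up to the close
# of the connection is the payload.
http10_get() {
  exec 3<>"/dev/tcp/$1/$2"
  printf 'GET %s HTTP/1.0\r\nHost: %s\r\n\r\n' "$3" "$1" >&3
  # drop the status line and headers; [[:space:]]* absorbs the CR of CRLF
  sed '1,/^[[:space:]]*$/d' <&3
  exec 3<&-
}
```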

Regards,
Willy