Re: Consensus call to include Display Strings in draft-ietf-httpbis-sfbis

"Roy T. Fielding" <fielding@gbiv.com> Fri, 26 May 2023 19:41 UTC

Content-Type: text/plain; charset="us-ascii"
Mime-Version: 1.0 (Mac OS X Mail 16.0 \(3731.500.231\))
From: "Roy T. Fielding" <fielding@gbiv.com>
In-Reply-To: <F84B0780-7710-4F74-9830-ECBD4A926C3D@mnot.net>
Date: Fri, 26 May 2023 12:38:06 -0700
Cc: Tommy Pauly <tpauly@apple.com>, HTTP Working Group <ietf-http-wg@w3.org>
Content-Transfer-Encoding: quoted-printable
Message-Id: <B38AA4F7-1F75-4690-9706-B8C7538B4DCC@gbiv.com>
References: <FC5270AF-509C-4331-AE8F-1F2D51BBC5F2@apple.com> <C687C218-7793-4B74-BB51-B7C34059F9C4@gbiv.com> <F84B0780-7710-4F74-9830-ECBD4A926C3D@mnot.net>
To: Mark Nottingham <mnot@mnot.net>
X-Mailer: Apple Mail (2.3731.500.231)
Subject: Re: Consensus call to include Display Strings in draft-ietf-httpbis-sfbis
Archived-At: <https://www.w3.org/mid/B38AA4F7-1F75-4690-9706-B8C7538B4DCC@gbiv.com>
Resent-From: ietf-http-wg@w3.org
X-Mailing-List: <ietf-http-wg@w3.org> archive/latest/51102
X-Loop: ietf-http-wg@w3.org
Resent-Sender: ietf-http-wg-request@w3.org
Precedence: list
List-Id: <ietf-http-wg.w3.org>
List-Help: <https://www.w3.org/Mail/>
List-Post: <mailto:ietf-http-wg@w3.org>
List-Unsubscribe: <mailto:ietf-http-wg-request@w3.org?subject=unsubscribe>

On May 25, 2023, at 3:38 PM, Mark Nottingham <mnot@mnot.net> wrote:

> Hi Roy,
> 
>> On 26 May 2023, at 3:21 am, Roy T. Fielding <fielding@gbiv.com> wrote:
>> 
>> I think (b) is unnecessary given that HTTP is 8-bit clean for UTF-8
>> and we are specifically talking about new fields for which there
>> are no deployed parsers. Yes, I know what it says in RFC 9110.
> 
> Yes, the parsers may be new, but in some contexts, they may not have access to the raw bytes of the field value. Many HTTP libraries and abstractions (e.g., CGI) assume an encoding and expose strings; some of those may apply the advice that HTTP has documented for many years and assume ISO-8859-1.

That's not a problem in practice, since the data does not change.
It just looks like messy characters on display.
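
A minimal Python sketch of that point, assuming a library that decodes
the field value as ISO-8859-1 (the sample value is illustrative, not
from the thread). The octets survive the round trip even though the
exposed string is mojibake:

```python
# UTF-8 octets as they might appear on the wire (illustrative value).
wire = "caf\u00e9".encode("utf-8")

# A legacy abstraction assumes ISO-8859-1 and exposes a string.
# Every octet maps to exactly one code point, so nothing is lost.
as_latin1 = wire.decode("iso-8859-1")

# Re-encoding under the same assumption recovers the original octets,
# which a UTF-8-aware consumer can then decode correctly.
recovered = as_latin1.encode("iso-8859-1")
restored = recovered.decode("utf-8")

print(repr(as_latin1))  # messy on display
print(repr(restored))   # intact once decoded as UTF-8
```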

What would be a problem is if an implementation transcoded the values
incorrectly while parsing them, or used code-point lengths instead
of octet lengths when sizing the memory allocated for copies.
But again, we are not breaking such systems: they are already broken
and insecure, and at worst we are doing folks a service by surfacing
the bad code in a visible way.
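
As a hedged illustration of the length mismatch (Python, with an
illustrative value), the code-point count and the octet count differ
as soon as the value contains non-ASCII characters:

```python
# Illustrative display text: one 2-octet and one 3-octet character in UTF-8.
value = "na\u00efve \u2603"
octets = value.encode("utf-8")

code_points = len(value)   # what a string API typically reports
octet_count = len(octets)  # what a copy of the field value actually needs

# Sizing a buffer by code_points would under-allocate here.
print(code_points, octet_count)
```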

The valid systems we might be breaking would be those that parse
for high-bit octets and reject the message as invalid. I do not
know of any such systems because of the legacy of ISO-8859-*
(especially among Cyrillic servers). In any case, such systems
don't use display strings.

However, I agree that it is hard for me to argue against my
own long history of being unable to adopt UTF-8 in HTTP.
I just find it annoying to assume that a totally new parser
of a totally new field should somehow be constrained in the
parsing of its values by a mere perception of what might be
the case for legacy parsers that shouldn't even be looking
at new fields.

It would be different if we knew of an example that fails.

> Yes, in many cases you can use UTF-8 on the wire successfully. However, making that assumption is a local convention; we can't assume that it holds for the entire Internet, because we don't know all of the various implementations that have been deployed and how they behave. All we know is a) how the implementations we've seen behave, and b) what we've written down before.

I prefer to think locally and act globally.

> In the past we've made decisions like this and chosen to be conservative. We could certainly break that habit now, but we'd need (at the least) to have a big warning that this type might not be interoperable with deployed systems. Personally, I don't think that's worth it, given the relative rarity that we expect for this particular type, and the relatively low overhead of encoding.

If this were an important use case, I would agree with you.
We are talking about a display string, which seems to be
the perfect opportunity to find out what we can get away
with changing.

>> The PR doesn't clearly express any of these points. It says the
>> strings contain Unicode (a character set) but they obviously don't;
>> they contain sequences of unvalidated pct-encoded octets.
>> This allows arbitrary octets to be encoded for something that
>> is supposed to be a display string.
> [...]
>> If this is truly for a display string, the feature must be
>> specific about the encoding and allowed characters.
>> My suggestion would be to limit the string to non-CNTRL
>> ASCII and non-control valid UTF-8. We don't want to allow
>> anything that would twist the feature to some other ends.
>> 
>> Assuming we do this with pct-encoding, we should not allow
>> arbitrary octets to be encoded. We should disallow encodings
>> that are unnecessary (normal printable ASCII aside from % and "),
>> control characters, or octets not valid for UTF-8. That can
>> be specified by prose and reference to the IETF specs, or
>> we could specify the allowed ranges with a regular expression.
>> Either one is better than allowing arbitrary octets to be encoded.
> 
> I think that's reasonable and we can discuss improvements after adopting the PR.

I think the pct-encoding feature is actively dangerous without
those constraints because it encourages a means to bypass HTTP's
normal safeguards. I don't want to discuss them as improvements.

>> In general, it is safer to send raw UTF-8 over the wire in HTTP
>> than it is to send arbitrary pct-encoded octets, simply because
>> pct-encoding is going to bypass most security checks long enough
>> for the data to reach an application where people do stupid
>> things with strings that they assume contain something that is
>> safe to display.
> 
> That's an odd assertion - where are those security checks taking place?

In places like the Fastly config, right now, though I only do that
for an incoming request-target when I don't need a premium WAF.
For example (extracted from an error snippet):

   if (var.path ~ {"%[0-7][0-9A-Fa-f]"}) {
     set obj.http.x-error = "Forbidden encoded ASCII in URL path";
     set obj.status = 403;
     set obj.response = "Forbidden";
     return (deliver);
   }

[Note that this is making assumptions about what is allowed
in a URL path that is specific to the origin servers behind
this CDN. It is not a universal config.]
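
For readers who don't use VCL, a rough Python equivalent of the same
check might look like the sketch below. The function name and the
return shape are hypothetical; only the pattern and the error text
come from the snippet above.

```python
import re

# Same pattern as the VCL snippet: a pct-encoded octet in the ASCII range (%00-%7F).
ENCODED_ASCII = re.compile(r"%[0-7][0-9A-Fa-f]")

def check_path(path: str) -> tuple[int, str]:
    """Return (status, reason) for a raw, still-encoded URL path."""
    # VCL's ~ operator is an unanchored match, hence search() rather than match().
    if ENCODED_ASCII.search(path):
        return 403, "Forbidden encoded ASCII in URL path"
    return 200, "OK"

print(check_path("/docs/a%2Fb"))    # encoded "/" is rejected
print(check_path("/docs/caf%C3%A9"))  # encoded non-ASCII UTF-8 passes this check
```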

Others use a WAF (or mod_security rules) applied to various
parts of a request message, or just Bayesian analysis of
example failures.

What I mean by this odd assertion is that raw UTF-8 sent
through the message parsing algorithm of HTTP will result
in a very obvious message for recipients on the backend,
even if it contains unwanted characters, whereas pct-encoding
makes the message look safe until it passes through the checks
and reaches a point in later processing where an application
(perhaps unaware of the source of that data) foolishly
decodes the string without expecting it to contain
arbitrary octets that might become command invocations,
request smuggling, or cache poisoning.
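
A small Python sketch of that difference, under the assumption that
the front-end check only looks for control octets in the raw field
value (the helper name and the sample values are hypothetical):

```python
from urllib.parse import unquote_to_bytes

def has_raw_controls(value: bytes) -> bool:
    """Front-end style check: any C0 control or DEL octet in the raw value."""
    return any(b < 0x20 or b == 0x7F for b in value)

# The same payload, once as raw UTF-8 and once pct-encoded.
raw_utf8 = "caf\u00e9\r\nX-Injected: 1".encode("utf-8")
encoded = b"caf%C3%A9%0D%0AX-Injected:%201"

print(has_raw_controls(raw_utf8))  # the CR LF is visible to the check
print(has_raw_controls(encoded))   # nothing but printable ASCII is seen

# A downstream application decodes without re-validating.
decoded = unquote_to_bytes(encoded)
print(has_raw_controls(decoded))   # the CR LF is back
```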

Of course, there is nothing preventing such pct-encoding from
being included in any non-literal part of an HTTP message,
which is what pentesters and script kiddies are constantly
running against our Web properties (and those of our CMS
customers) in the hope of finding some application, somewhere
downstream, that will fail to validate the data it receives.
This feature won't change that.

The problem is that it takes what is normally considered
an evil encoding (if found anywhere other than an expected
URI-reference or application/x-www-form-urlencoded content) and calls it a
"good encoding" for a display string, which means we will
have to worry about breaking a new feature of HTTP instead
of just blocking all bad strings.

Even so, I can live with pct-encodings when they are restricted
to a reasonably safe range of characters for display.

For example,

% pcre2grep -e '^([\x20-\x21\x23-\x24\x26-\x5B\x5D-\x7E]|\x5C[\x22\x5C]|%((2[25])|([Cc][2-9A-Fa-f]%[89ABab][0-9A-Fa-f])|([Dd][0-9A-Fa-f]%[89ABab][0-9A-Fa-f])|([Ee][0-9A-Fa-f](%[89ABab][0-9A-Fa-f]){2})|([Ff][0-4](%[89ABab][0-9A-Fa-f]){3})))*$'

which, IIRC, is a safe subset of display string characters
that allows printable ASCII (aside from ", %, and unescaped \),
safe non-ASCII UTF-8 as pct-escapes (regardless of current
Unicode code points), and disallows the unsafe UTF-8.

Alternatively, require that pct-encoding be limited to %22, %25,
and pct-encoded sequences of valid non-ASCII, non-control, UTF-8
octets, as defined by [UTF-8].
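
One way the prose alternative could be implemented is sketched below
in Python. It is not taken from the PR or any existing parser: the
function name is hypothetical, "control" is read as Unicode general
category Cc (an assumption), and backslash escapes in the literal
part are not modeled.

```python
import re
import unicodedata

# Splits a value into literal runs and maximal runs of pct-encoded octets.
PCT_RUN = re.compile(r"((?:%[0-9A-Fa-f]{2})+)")

def is_safe_display_string(value: str) -> bool:
    """Hypothetical validator for the restricted pct-encoding described above."""
    for i, part in enumerate(PCT_RUN.split(value)):
        if i % 2 == 0:
            # Literal text: printable ASCII only, with no bare '%' or '"'.
            if any(not (0x20 <= ord(c) <= 0x7E) or c in '%"' for c in part):
                return False
            continue
        octets = bytes.fromhex(part.replace("%", ""))
        try:
            # Strict decoding rejects overlong forms, surrogates, and stray octets.
            text = octets.decode("utf-8")
        except UnicodeDecodeError:
            return False
        for ch in text:
            if ch in '"%':
                continue  # %22 and %25 are the only encoded ASCII allowed
            if ord(ch) < 0x80 or unicodedata.category(ch) == "Cc":
                return False  # other encoded ASCII, or C0/C1 controls
    return True

print(is_safe_display_string("caf%C3%A9 is 100%25 %22good%22"))
print(is_safe_display_string("caf%C3%A9%0D%0AX-Injected: 1"))
```

Decoding each pct-encoded run as a whole, rather than octet by octet,
is what lets the strict UTF-8 decoder catch truncated or malformed
multi-octet sequences.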

It's somewhat pedantic, but guides implementations toward
detecting such errors rather than ignoring them as someone
else's problem. Also, it is something people can implement with
interoperability, rather than a string of Unicode characters
in general (which isn't).

Cheers,

....Roy