Re: support for non-ASCII in strings, was: signatures vs sf-date

Willy Tarreau <w@1wt.eu> Sat, 03 December 2022 09:53 UTC

Date: Sat, 03 Dec 2022 10:52:48 +0100
From: Willy Tarreau <w@1wt.eu>
To: Julian Reschke <julian.reschke@gmx.de>
Cc: ietf-http-wg@w3.org
Message-ID: <20221203095248.GB7078@1wt.eu>
References: <202212021129.2B2BTY9f005362@critter.freebsd.dk> <b1d3af79-373f-a9af-7ff9-39f5f44915f0@gmx.de> <202212021214.2B2CEUQx005654@critter.freebsd.dk> <7a93fa17-38fe-5fa8-54ed-a726ab9d5a39@gmx.de> <841DC85E-F936-4350-A74F-170D22E6ADCE@gbiv.com> <202212021918.2B2JIBHC007228@critter.freebsd.dk> <65070e79-5429-a4cd-abe2-667b526badf1@gmx.de> <202212022147.2B2LlcqP008154@critter.freebsd.dk> <53D8E497-284A-4B2C-91D8-367542AA0A7C@mnot.net> <c6b41b93-23b0-f3b8-5d7f-05e52614070a@gmx.de>
MIME-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <c6b41b93-23b0-f3b8-5d7f-05e52614070a@gmx.de>
User-Agent: Mutt/1.10.1 (2018-07-13)
Archived-At: <https://www.w3.org/mid/20221203095248.GB7078@1wt.eu>

Hi Julian,

On Sat, Dec 03, 2022 at 08:47:10AM +0100, Julian Reschke wrote:
> > There are some cases where non-ASCII strings are needed in header fields; mostly, when you're presenting something to a human from the fields. Those cases are not as common. However, there's a catch to adding them: if full unicode strings were available in the protocol, many designers will understandably use them because it's been drilled into all our heads that unicode is what you use for strings.
> > 
> > Hence, footgun.
> 
> I would appreciate if you would explain why there is a problem we need
> to prevent, and what exactly that problem is. Do you have an example?

The main problem I personally have with this is that lots of text-based
processing (regex etc.) that is designed to apply to a subset of the input
set will first have to pass through some non-bijective transformation
(typically iconv), and that's where problems start to happen. The usual
example is accented letters which lose their accents and turn into the
plain letter, sometimes only after being converted to upper case, and so
on, making it possible for invalid contents to match certain rules on
certain components. I am particularly worried about letting this enter
the protocol. If I set up a rule saying that /static always routes to the
static server, it means that /stàtic will not go there. But what if down
the chain this gets turned into /STATIC and then back into /static, to
finally match an existing directory on the default server? You will of
course tell me that this is a bad example as I'm putting it in the URL,
but the problem is exactly the same with other headers. Causing such
trouble to Link, Content-Type (for content analysis evasion), the path or
domain in Set-Cookie etc. is really problematic. On the request path we
could imagine such things landing as far as in logs or databases, with
some diacritics being accidentally turned into language symbols or
delimiters.
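
To make the /static scenario above concrete, here is a minimal Python
sketch. It is only an illustration under stated assumptions: the function
names are hypothetical, and the accent-stripping step (Unicode
decomposition followed by dropping combining marks) merely stands in for
whatever lossy conversion, iconv or otherwise, a real component might
apply.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Hypothetical lossy step: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def front_rule_matches(path: str) -> bool:
    """Hypothetical front-end rule: plain prefix match on /static."""
    return path.startswith("/static")


def backend_view(path: str) -> str:
    """Hypothetical downstream chain: lose accents, upper-case, lower-case."""
    return strip_accents(path).upper().lower()


request_path = "/st\u00e0tic/logo.png"  # "/stàtic/logo.png"

# The front end does not see /static, so the request goes to the default
# server, yet what the default server ends up with does start with /static.
seen_by_front = front_rule_matches(request_path)
seen_by_backend = front_rule_matches(backend_view(request_path))
```

Here seen_by_front is False while seen_by_backend is True: two components
disagree about the same request, which is exactly the kind of mismatch
that lets a rule be bypassed.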

I actually find it very nice that anything that is not computer-safe has
to be percent-encoded, it clearly sets a limit between the two worlds,
the one that must match bytes, and the one that interprets characters,
including homoglyphs, emojis, RTL vs LTR etc. The world has had several
decades to adapt to this, and web development frameworks now make it
seamless for developers to deal with this. People set up blogs, shopping
carts and discussion boards with a few lines of code without ever having
to wonder how data are encoded over the wire.
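
As a rough sketch of that separation, the following Python snippet uses the
standard urllib.parse functions. It assumes the character is encoded as
UTF-8 before percent-encoding (the default for quote()); the variable
names are made up for illustration.

```python
from urllib.parse import quote, unquote

raw_path = "/st\u00e0tic"    # "/stàtic", as a human would type it
wire_path = quote(raw_path)  # UTF-8 bytes, then percent-encoding

# Byte-matching side: only ASCII is seen, and no case round trip can
# turn the encoded form into /static.
is_ascii_on_wire = wire_path.isascii()
matches_static_rule = wire_path.startswith("/static")
case_round_trip = wire_path.upper().lower()

# Character side: only the component that presents the value to a human
# decodes it.
displayed = unquote(wire_path)
```

The byte-matching components only ever handle "/st%C3%A0tic", and the
accented form reappears only where someone explicitly decodes it.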

Computers don't need to know what characters *look like* but how they
are encoded. Humans mostly don't need to know how they are encoded but
are only interested in what they look like. The current situation serves
both worlds perfectly fine, and a move in either direction would break
this important balance in my opinion.

We could of course imagine passing some info indicating how contents are
supposed to be interpreted when that's not obvious from the header field
name, but if applications use non-standard fields, they're expected to
either know how they are supposed to exploit their contents, or to ignore
the header. It has always been like this, and that has been fine. After
all, nothing prevents one from passing percent-encoded sounds, images, or
even shell code in headers if they want. Right now it's reliably
transported to its target.
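
For instance, arbitrary bytes survive such a trip unchanged. The sketch
below is only an illustration: the payload is the 8-byte PNG signature,
chosen arbitrarily, and how an application would name or interpret such a
field is left out on purpose.

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes

payload = b"\x89PNG\r\n\x1a\n"            # arbitrary binary data
field_value = quote_from_bytes(payload)   # ASCII-only, safe for a field
restored = unquote_to_bytes(field_value)  # what the recipient recovers
```

Intermediaries only see the ASCII string in field_value, and restored is
byte-for-byte identical to payload.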

Just my two cents,
Willy