Re: support for non-ASCII in strings, was: signatures vs sf-date

Julian Reschke <julian.reschke@gmx.de> Sat, 03 December 2022 13:48 UTC

Archived-At: <https://www.w3.org/mid/b585826a-db75-9fff-b2c6-63e808356928@gmx.de>

On 03.12.2022 10:52, Willy Tarreau wrote:
> Hi Julian,
>
> On Sat, Dec 03, 2022 at 08:47:10AM +0100, Julian Reschke wrote:
>>> There are some cases where non-ASCII strings are needed in header fields; mostly, when you're presenting something to a human from the fields. Those cases are not as common. However, there's a catch to adding them: if full Unicode strings were available in the protocol, many designers would understandably use them because it's been drilled into all our heads that Unicode is what you use for strings.
>>>
>>> Hence, footgun.
>>
>> I would appreciate it if you would explain why there is a problem we need
>> to prevent, and what exactly that problem is. Do you have an example?
>
> The main problem I'm personally having with this is that lots of text-based
> processing (regex etc) that is designed to apply on a subset of the input
> set will first have to pass through some non-bijective transformation
> (typically iconv) and that's where problems start to happen, with the
> usual stuff such as accented letters which lose their accents and
> turn into the plain ones, sometimes only after being turned to upper case,
> and so on, making it possible to make some invalid contents match certain
> rules on certain components. I am particularly worried about letting this
> enter the protocol. If I'm setting up a rule saying that /static always
> routes to the static server, it means that /stàtic will not go there. But
> what if down the chain this gets turned to /STATIC then back to /static,
> to finally match an existing directory on the default server? You will
> of course tell me that this is a bad example as I'm putting it on the URL
> but the problem is exactly the same with other headers. Causing such trouble
> to Link, Content-Type (for content analysis evasion), the path or domain in
> Set-Cookie etc is really problematic. On the request path we could imagine
> such things landing as far as into logs or databases, with some diacritics
> being accidentally turned into language symbols or delimiters.
>
> I actually find it very nice that anything that is not computer-safe has
> to be percent-encoded, it clearly sets a limit between the two worlds,
> the one that must match bytes, and the one that interprets characters,
> including homoglyphs, emojis, RTL vs LTR etc. The world has had several
> decades to adapt to this, and web development frameworks now make it
> seamless for developers to deal with this. People set up blogs, shopping
> carts and discussion boards with a few lines of code without ever having
> to wonder how data are encoded over the wire.
>
> Computers don't need to know what characters *look like* but how they
> are encoded. Humans mostly don't need to know how they are encoded but
> are only interested in what they look like. The current situation serves
> both worlds perfectly fine, and a move in either direction would break
> this important balance in my opinion.
>
> We could of course imagine passing some info indicating how contents are
> supposed to be interpreted when that's not obvious from the header field
> name, but if applications use non-standard fields, they're expected to
> either know how they are supposed to exploit their contents, or to ignore
> the header. It has always been like this, and that has been fine. After all,
> nothing prevents one from passing percent-encoded sounds, images, or
> even shell code in headers if they want. Right now it's reliably
> transported to its target.
>
> Just my two cents,
> Willy

More than 2 cents, actually :-)

Willy, let me ask a clarifying question. As you mentioned that percent
escaping is fine, it seems what you're worried about are actual octets
with the highest bit set appearing in an HTTP field value? Or do I
misread that?
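
To make the distinction concrete, here is a minimal Python sketch (not taken from any implementation discussed in this thread). It assumes UTF-8 as the encoding of the raw form, and the helper names has_high_bit_octets and strip_accents are hypothetical, chosen only for illustration. It shows that the percent-encoded form of /stàtic is pure ASCII while the raw form contains octets with the highest bit set, and how one possible lossy step (dropping combining marks after NFD decomposition, standing in for an iconv-style transliteration) folds /stàtic into /static.

```python
import unicodedata
from urllib.parse import quote


def has_high_bit_octets(field_value: bytes) -> bool:
    """Return True if any octet in the value has its highest bit set."""
    return any(octet >= 0x80 for octet in field_value)


def strip_accents(text: str) -> str:
    """Illustrative lossy transformation: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


path = "/stàtic"

# Raw form: UTF-8 is an assumption here; it puts non-ASCII octets on the wire.
raw = path.encode("utf-8")

# Percent-encoded form: every octet stays within ASCII.
escaped = quote(path).encode("ascii")

print(raw, has_high_bit_octets(raw))
print(escaped, has_high_bit_octets(escaped))

# A byte-matching rule for /static does not match the original path,
# but it does match once the lossy step has been applied downstream.
print(path.startswith("/static"), strip_accents(path).startswith("/static"))
```

In this sketch the byte-matching rule and the character-interpreting step disagree only after the lossy transformation, which is the kind of mismatch described in the quoted message.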

Best regards, Julian