Libraries assuming iso-8859-1 (was: Re: Consensus call to include Display Strings in draft-ietf-httpbis-sfbis)

"Martin J. Dürst" <duerst@it.aoyama.ac.jp> Sun, 28 May 2023 06:47 UTC

Message-ID: <c81e6562-7927-a342-9032-df69aba4ad43@it.aoyama.ac.jp>
Date: Sun, 28 May 2023 15:44:41 +0900
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101 Thunderbird/102.10.1
Content-Language: en-US
To: Mark Nottingham <mnot@mnot.net>, Roy Fielding <fielding@gbiv.com>
Cc: Tommy Pauly <tpauly@apple.com>, HTTP Working Group <ietf-http-wg@w3.org>
References: <FC5270AF-509C-4331-AE8F-1F2D51BBC5F2@apple.com> <C687C218-7793-4B74-BB51-B7C34059F9C4@gbiv.com> <F84B0780-7710-4F74-9830-ECBD4A926C3D@mnot.net>
From: "Martin J. Dürst" <duerst@it.aoyama.ac.jp>
Organization: Aoyama Gakuin University
In-Reply-To: <F84B0780-7710-4F74-9830-ECBD4A926C3D@mnot.net>
Content-Type: text/plain; charset="UTF-8"; format="flowed"
Content-Transfer-Encoding: 8bit
Subject: Libraries assuming iso-8859-1 (was: Re: Consensus call to include Display Strings in draft-ietf-httpbis-sfbis)
Archived-At: <https://www.w3.org/mid/c81e6562-7927-a342-9032-df69aba4ad43@it.aoyama.ac.jp>

On 2023-05-26 07:38, Mark Nottingham wrote:
> Hi Roy,
> 
>> On 26 May 2023, at 3:21 am, Roy T. Fielding <fielding@gbiv.com> wrote:
>>
>> I think (b) is unnecessary given that HTTP is 8-bit clean for UTF-8
>> and we are specifically talking about new fields for which there
>> are no deployed parsers. Yes, I know what it says in RFC 9110.
> 
> Yes, the parsers may be new, but in some contexts, they may not have access to the raw bytes of the field value. Many HTTP libraries and abstractions (e.g., CGI) assume an encoding and expose strings; some of those may apply the advice that HTTP has documented for many years and assume ISO-8859-1.

This is a valid point, but I think it can be addressed rather easily.

The solution is simply to encode the string back to bytes using
ISO-8859-1 (which recovers the original bytes exactly) and then to
decode those bytes as UTF-8. This can be done by the parser for
Display Strings.

In Ruby, assuming that the string's encoding is ISO-8859-1 (Ruby carries 
an encoding for each string, and can to some extent deal with multiple 
strings in different encodings, although these days, it's mostly just 
everything in UTF-8), this would just be
    display_string.force_encoding('UTF-8')
(this just changes the interpretation of the underlying bytes).

In most other languages, where the actual string encoding is opaque and 
uniform (such as Python), this would be done e.g. by something like the 
following:
    display_string.encode('iso-8859-1').decode('utf-8')
Although I probably have written less than a dozen lines of Python code 
in my whole life, I checked this with
    >>> b'\xe2\x82\xac'.decode("iso-8859-1").encode('iso-8859-1').decode('utf-8')
which successfully printed
'€'
(the first "decode" is what the general HTTP library would do; the 
following encode/decode is what the structured header parser would do).
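
The same round trip can be wrapped in a small helper. This is only a
minimal sketch: the function name redecode_as_utf8 is made up for
illustration, and it assumes the generic HTTP library really did decode
the field value as ISO-8859-1. A string containing characters above
U+00FF, or bytes that are not valid UTF-8, will raise an exception
rather than be silently accepted.

```python
def redecode_as_utf8(field_value: str) -> str:
    """Undo an ISO-8859-1 decode and reinterpret the bytes as UTF-8.

    Hypothetical helper, not part of any HTTP library.
    """
    # ISO-8859-1 maps U+0000..U+00FF one-to-one onto byte values, so this
    # encode recovers exactly the bytes that the library received.
    raw = field_value.encode("iso-8859-1")
    # Strict decoding: malformed UTF-8 raises UnicodeDecodeError.
    return raw.decode("utf-8")


# What a generic HTTP library that assumes ISO-8859-1 would hand over:
from_library = b"\xe2\x82\xac".decode("iso-8859-1")
print(redecode_as_utf8(from_library))
```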

Of course, in a language such as Java, the whole thing would be a bit 
longer, having to instantiate a CharsetEncoder and a CharsetDecoder and 
so on :-(.

> Yes, in many cases you can use UTF-8 on the wire successfully. However, making that assumption is a local convention; we can't assume that it holds for the entire Internet, because we don't know all of the various implementations that have been deployed and how they behave. All we know is a) how the implementations we've seen behave, and b) what we've written down before.

The 'wire' (which for the moment I assume to be TCP and below, or TLS 
and below) just transports bytes, and therefore should not be a problem.
Problems may occur at places where (contrary to the HTTP specs)
iso-8859-1 isn't passed through in headers. There may also be problems
in cases where iso-8859-1 is interpreted as excluding the bytes in the
range 0x80 to 0x9F. But I think it's safe to say that such cases should
be very rare.
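
To see which text such an implementation would affect, here is a rough
sketch. The decoder strict_latin1_decode is purely hypothetical (I am
not pointing at any real library); it simply rejects the bytes 0x80 to
0x9F. UTF-8 uses that range for continuation bytes, so some characters
get through and others do not.

```python
def strict_latin1_decode(data: bytes) -> str:
    """Hypothetical decoder that treats bytes 0x80..0x9F as invalid."""
    for b in data:
        if 0x80 <= b <= 0x9F:
            raise ValueError(f"byte 0x{b:02X} rejected")
    return data.decode("iso-8859-1")


def passes_through(text: str) -> bool:
    """Return True if the UTF-8 bytes of text survive the strict decoder."""
    try:
        strict_latin1_decode(text.encode("utf-8"))
        return True
    except ValueError:
        return False


# U+00E9 is C3 A9 in UTF-8; the euro sign is E2 82 AC and contains 0x82.
for sample in ("\u00e9", "\u20ac"):
    print(repr(sample), passes_through(sample))
```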

There may also be implementations that just cut off the most significant 
bit in each byte, or otherwise don't let non-ASCII bytes through. It 
would be good to know whether such cases actually have been reported, or 
whether that's just something we think might be out there but isn't 
actually confirmed.
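
For illustration only, this is what cutting off the most significant bit
would do to UTF-8 bytes. The function strip_high_bit is a made-up
stand-in for such an implementation, not a description of any known
one.

```python
def strip_high_bit(data: bytes) -> bytes:
    """Clear the most significant bit of every byte (hypothetical 7-bit path)."""
    return bytes(b & 0x7F for b in data)


wire = "\u20ac".encode("utf-8")   # b'\xe2\x82\xac'
damaged = strip_high_bit(wire)    # every byte loses its top bit
print(wire, damaged)
```

ASCII passes through unchanged, but the three bytes of the euro sign
turn into unrelated ASCII bytes, and the original cannot be recovered.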


> In the past we've made decisions like this and chosen to be conservative. We could certainly break that habit now, but we'd need (at the least) to have a big warning that this type might not be interoperable with deployed systems. Personally, I don't think that's worth it, given the relative rarity that we expect for this particular type, and the relatively low overhead of encoding.

If by overhead, you mean processing, then I agree that's low. If you
mean size, I think the situation is different. Here's a little table
with some of the most important scripts, giving the byte count per
character in a legacy encoding, in pure UTF-8, and in the proposed
percent-encoded form, plus the expansion factor of the proposed form
relative to the legacy encoding:

                             Legacy  UTF-8   proposed  expansion
ASCII                        1       1       1         1
Latin+Accents, e.g. Polish   1       ~1.5    ~2        2
Arabic/Cyrillic/...          1       2       6         6
Indic scripts,...            1       3       9         9
Chinese/Japanese/...         2       3       9         4.5

So some text in an Indic or South Asian script gets expanded by a factor
of 9 when compared to a legacy single-byte encoding.
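
The per-character numbers can be checked with a short script. This is a
sketch under some assumptions of mine: the sample characters and legacy
codecs are my own picks, Thai (TIS-620) stands in for the row with a
single-byte legacy encoding and three-byte UTF-8, and urllib.parse.quote
stands in for the percent-encoding in the draft (three bytes per
non-ASCII UTF-8 byte). It measures single non-ASCII characters, so the
Polish letter shows the worst case rather than the average over running
text.

```python
from urllib.parse import quote

# (label, sample character, legacy codec): illustrative choices only.
SAMPLES = [
    ("ASCII", "a", "ascii"),
    ("Polish", "\u0142", "iso-8859-2"),
    ("Cyrillic", "\u044f", "koi8-r"),
    ("Thai", "\u0e01", "tis-620"),
    ("Japanese", "\u3042", "shift_jis"),
]


def sizes(char: str, legacy_codec: str) -> tuple[int, int, int]:
    """Return byte counts for (legacy, UTF-8, percent-encoded UTF-8)."""
    legacy = len(char.encode(legacy_codec))
    utf8 = len(char.encode("utf-8"))
    # quote() percent-encodes the UTF-8 bytes; safe="" leaves only
    # unreserved ASCII characters unescaped.
    proposed = len(quote(char, safe=""))
    return legacy, utf8, proposed


for label, char, codec in SAMPLES:
    legacy, utf8, proposed = sizes(char, codec)
    print(f"{label:10} {legacy:6} {utf8:6} {proposed:9} {proposed / legacy:9.1f}")
```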


>> In general, it is safer to send raw UTF-8 over the wire in HTTP
>> than it is to send arbitrary pct-encoded octets, simply because
>> pct-encoding is going to bypass most security checks long enough
>> for the data to reach an applications where people do stupid
>> things with strings that they assume contain something that is
>> safe to display.
> 
> That's an odd assertion - where are those security checks taking place?

I don't know about headers in general, but I hope people remember the 
attack where it was possible to smuggle a path like
/abc/def/../../xyz.html by percent-encoding (part of) "/../../" and 
access the file xyz.html which was access-protected, because the access 
check happened before the decoding.
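
A rough sketch of that ordering problem follows. The rule that
everything under /abc/ is public, and the function names, are
assumptions made up for this example; real servers are of course more
involved.

```python
import posixpath
from typing import Optional
from urllib.parse import unquote


def is_public(path: str) -> bool:
    """Hypothetical access rule: only paths under /abc/ are public."""
    return path.startswith("/abc/")


def resolve_check_first(raw_path: str) -> Optional[str]:
    """Vulnerable order: the access check sees the still-encoded path."""
    if not is_public(raw_path):
        return None
    return posixpath.normpath(unquote(raw_path))


def resolve_decode_first(raw_path: str) -> Optional[str]:
    """Safer order: decode and normalise first, then check access."""
    path = posixpath.normpath(unquote(raw_path))
    return path if is_public(path) else None


attack = "/abc/def/%2E%2E/%2E%2E/xyz.html"
print(resolve_check_first(attack))   # the protected path leaks through
print(resolve_decode_first(attack))  # rejected
```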

Regards,   Martin.