Re: AD Review of draft-ietf-httpbis-sfbis-05

Carsten Bormann <cabo@tzi.org> Thu, 01 February 2024 08:02 UTC

From: Carsten Bormann <cabo@tzi.org>
In-Reply-To: <5614883D-76A9-4157-A9B6-694AB1B5FB63@mnot.net>
Date: Thu, 01 Feb 2024 09:02:30 +0100
Cc: Francesca Palombini <francesca.palombini@ericsson.com>, "draft-ietf-httpbis-sfbis@ietf.org" <draft-ietf-httpbis-sfbis@ietf.org>, HTTP Working Group <ietf-http-wg@w3.org>
Message-Id: <7E5DCED8-557C-459D-A80F-B47BF3D09998@tzi.org>
References: <AM0PR07MB6019C5D8DF60CE53F27E0722987E2@AM0PR07MB6019.eurprd07.prod.outlook.com> <56617A72-D775-41DC-88E8-3A82DC5225C7@tzi.org> <5614883D-76A9-4157-A9B6-694AB1B5FB63@mnot.net>
To: Mark Nottingham <mnot@mnot.net>
Subject: Re: AD Review of draft-ietf-httpbis-sfbis-05
Archived-At: <https://www.w3.org/mid/7E5DCED8-557C-459D-A80F-B47BF3D09998@tzi.org>

On 2024-02-01, at 08:05, Mark Nottingham <mnot@mnot.net> wrote:
> 
> Hi Carsten!
> 
>> On 30 Jan 2024, at 2:38 am, Carsten Bormann <cabo@tzi.org> wrote:
>> 
>> On 2024-01-29, at 15:41, Francesca Palombini <francesca.palombini@ericsson.com> wrote:
>>> 
>>> What parts of [I-D.draft-bray-unichars] is the reader supposed to look at? Or if it is the whole document, could we have some context around it?
>> 
>> It seems that sfbis refers to Unicode codepoints where it should have referred to Unicode scalar values (what are said to be codepoints now, need to allow encoding in UTF-8, which only applies to Unicode scalar values).
> 
> People seem to have strong and conflicting beliefs about the correct terminology here -- others have asserted the opposite in my recollection.
> 
> So I'm afraid that before I'm willing to change the spec (again) I need to see a reference supporting any assertions, and agreement on its interpretation.

Hi Mark,

as sfbis is based on UTF-8, your main reference should be STD 63, specifically RFC 3629 [1].
Obviously, UTF-8 is based on Unicode standardization work, so the other reference is [2].

[1]: https://www.rfc-editor.org/rfc/rfc3629.html
[2]: https://www.unicode.org/versions/Unicode15.0.0/UnicodeStandard-15.0.pdf

Unicode terminology is sometimes confusing, and it doesn’t help that at the time RFC 3629 was written, there wasn’t a term defined for what the Unicode Consortium now clumsily calls “Unicode scalar values”: the set of Unicode characters that Unicode encoding forms (née Unicode transformation formats) such as UTF-8 can encode.  See this definition (page 119 of [2]):

D76 
  Unicode scalar value: Any Unicode code point except high-surrogate and low-surrogate code points.
  As a result of this definition, the set of Unicode scalar values consists of the ranges 0 to D7FF(16) and E000(16) to 10FFFF(16), inclusive.
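
For illustration only, D76 can be turned into a small range check. This is a sketch in Python; the names is_code_point, is_scalar_value and count_scalar_values are made up for this mail and do not come from the Unicode Standard or from sfbis.

```python
# Illustrative sketch of definition D76; all names here are hypothetical.
SURROGATE_FIRST = 0xD800   # first high-surrogate code point
SURROGATE_LAST = 0xDFFF    # last low-surrogate code point
MAX_CODE_POINT = 0x10FFFF  # end of the Unicode code space


def is_code_point(n: int) -> bool:
    """True for any value in the Unicode code space, surrogates included."""
    return 0 <= n <= MAX_CODE_POINT


def is_scalar_value(n: int) -> bool:
    """True for code points that are not surrogates (D76)."""
    return is_code_point(n) and not (SURROGATE_FIRST <= n <= SURROGATE_LAST)


def count_scalar_values() -> int:
    """Size of the code space minus the 2048 surrogate code points."""
    return (MAX_CODE_POINT + 1) - (SURROGATE_LAST - SURROGATE_FIRST + 1)
```

The two ranges in D76 show up as the boundary cases: 0xD7FF and 0xE000 pass the check, 0xD800 and 0xDFFF do not.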

When Unicode created the term “Unicode scalar values”, they thought that they could not use the more natural wording “Unicode characters” because Unicode scalar values include some code values that are called “non-characters” (*) in Unicode…  “Unicode characters” is still what most people would understand and what I therefore tend to use in informal conversation.

The term “Unicode code points” encompasses the Unicode scalar values as well as some code points that are used inside UTF-16 only.  Before “Unicode scalar values” was defined, “Unicode code points” was often used in its place because it is the encompassing concept, and it often still is, either because “Unicode scalar values” is so clumsy or simply because documentation is created by copying from old sources.
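
To make the "used inside UTF-16 only" part concrete, here is a hedged sketch of how UTF-16 uses the surrogate code points: a scalar value above FFFF(16) is split into one high and one low surrogate code unit. The function name utf16_code_units is an assumption of this example, not an existing API.

```python
# Illustrative sketch; utf16_code_units is a hypothetical helper.
def utf16_code_units(scalar: int) -> list[int]:
    """Return the UTF-16 code units that represent one Unicode scalar value."""
    if not 0 <= scalar <= 0x10FFFF or 0xD800 <= scalar <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if scalar < 0x10000:
        # Scalar values up to FFFF fit into a single 16-bit code unit.
        return [scalar]
    # Everything above is spread over a surrogate pair, 10 bits each.
    offset = scalar - 0x10000
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]
```

For U+1F600 this yields the pair D83D(16), DE00(16), which matches what Python's own "utf-16-be" codec emits.  The surrogates only ever appear as halves of such pairs; on their own they do not stand for a character.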

This all seems pretty obvious, until you encounter the problem that a number of platforms are living on a legacy character model that was created as a transition strategy from the original pure 16-bit Unicode they adopted early on.  Applications that work in this space tend to leak out UTF-16 internals, causing a lot of pain [3].  For interchange, we could (and should) ignore that, except that there are people who are convinced that we should share that pain.
On page 5 (middle of Section 3, [4]), RFC 3629 [1] specifically calls out that the Unicode code points that are not Unicode scalar values (today’s words) cannot be encoded in UTF-8.
To minimize the confusion (and to reduce the number of hooks that the pain-sharers can use to muddy the issue) a standard like yours should try to avoid the generalism “Unicode code points” and talk about “Unicode scalar values” throughout, possibly after copying D76.
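
As a quick check of the RFC 3629 point, the following sketch probes Python's built-in strict "utf-8" codec, which refuses surrogate code points in both directions.  The helper names utf8_encodable and utf8_decodable are invented for this example, and other implementations may be more lenient than Python's codec.

```python
# Illustrative sketch; the helper names are hypothetical.
def utf8_encodable(code_point: int) -> bool:
    """True if Python's strict UTF-8 codec can encode this code point.

    Expects a value in the range 0..0x10FFFF; chr() rejects anything else.
    """
    try:
        chr(code_point).encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


def utf8_decodable(data: bytes) -> bool:
    """True if Python's strict UTF-8 codec accepts this byte sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

With these, utf8_encodable(0xD800) is false while utf8_encodable(0xE000) is true, and the three-byte pattern ED A0 80 (what a naive encoder would produce for D800(16)) is rejected on decoding.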

[3]: https://www.ietf.org/archive/id/draft-bormann-dispatch-modern-network-unicode-03.html#name-history-legacy
[4]: https://www.rfc-editor.org/rfc/rfc3629.html#page-5

Grüße, Carsten

(*) There is a lot of structure in the range covered by Unicode scalar values.  A specification that is not intricately bound to those details, but really mostly wants to encode Unicode, is best off simply using the stable term “Unicode scalar values” in its explanations and ignoring those details, which are evolving.