Re: [art] draft-bray-unichars

Carsten Bormann <cabo@tzi.org> Tue, 29 August 2023 18:35 UTC


Hi Tim,

it is certainly useful to write a backgrounder on how to use Unicode in today’s network protocols.

Actually, I started writing such a document [1], and it seems I’ll need to pick up where I left off before the pandemic.

[1]: https://datatracker.ietf.org/doc/html/draft-bormann-dispatch-modern-network-unicode-02

(I received some very good feedback at the time that I can use to create the next revision of this document.)

The document [2] being announced here has a slightly different background: It seems to have been motivated by the discussion of an errata report that is trying to change RFC 8259 [3] and was discussed at length in [4].

[2]: https://datatracker.ietf.org/doc/draft-bray-unichars/
[3]: https://www.rfc-editor.org/errata/eid7603 
[4]: https://mailarchive.ietf.org/arch/msg/json/Hkks1atRTycjGi0Hh2NWhdef8W8

The change requested was:

Original Text
-------------
  A string is a sequence of zero or more Unicode characters [UNICODE].

Corrected Text
--------------
  A string is a sequence of zero or more Unicode code points [UNICODE].

Even if this may not be obvious at first glance, this would have been a rather significant change of an approved document, so there was a lot of discussion.

## Backgrounder

The IETF decided in the late 1990s in favor of Unicode, with UTF-8 as its interchange format.  That decision has been upheld in the IETF for almost a quarter of a century now.

One problem with introducing Unicode, and replacing what was in the marketplace before, was that Unicode was initially based on 16-bit characters (UCS-2).  When it became clear that this wouldn’t be enough, a number of environments had already picked up UCS-2 and built platforms around it.  The extension to the current ~21 bits that Unicode then underwent was realized on these platforms by switching to UTF-16, a “Unicode transformation format” based on 16-bit code units that reserves certain code points (“surrogates”) for use in pairs to represent characters that don’t fit into 16 bits.
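The surrogate mechanism can be sketched in a few lines of Python. This is only an illustration of the arithmetic; the function name `to_surrogate_pair` is made up for this example, and U+1F600 is just a convenient code point beyond 16 bits.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a code point in U+10000..U+10FFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits -> high (leading) surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits -> low (trailing) surrogate
    return high, low


high, low = to_surrogate_pair(0x1F600)
print(f"U+1F600 -> {high:04X} {low:04X}")
```

Neither half of the pair stands for a character on its own; only the pair does.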

The UCS-2-based character models of the legacy 16-bit platforms in many cases couldn’t be repaired right away to fully embrace UTF-16; e.g., only much later did ECMAScript introduce the “u” (Unicode) flag for regular expressions to have them actually match “Unicode” characters.  So, on these platforms, UTF-16 is transported in a UCS-2 character model, and sometimes orphaned surrogates turn up instead of Unicode characters as “code points” in interfaces that are not meant to leak these implementation limitations to the outside world.
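How such an orphaned surrogate can surface at an interface is easy to demonstrate with Python's standard `json` module, which passes a lone `\uD83D` escape through as a one-element string. The helper `contains_surrogate` is a hypothetical name for this sketch, not an API from any specification.

```python
import json


def contains_surrogate(text: str) -> bool:
    """Return True if any element of the string is a surrogate code point."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in text)


# A properly paired escape decodes to a single Unicode character.
paired = json.loads('"\\ud83d\\ude00"')

# An orphaned escape is accepted too and leaks a surrogate into the string.
orphan = json.loads('"\\ud83d"')

print(contains_surrogate(paired), contains_surrogate(orphan))
```

A receiver that wants only Unicode characters would reject any string for which such a check returns true.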

UTF-8 of course doesn’t support encoding surrogates (UTF-8 is careful to allow a single representation only for each Unicode character, and surrogate pairs would violate that, while isolated surrogates don’t mean anything in Unicode), so IETF protocols typically do not have to consider these problems of specific platforms.
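As a small check of this point, Python's strict UTF-8 codec can be used as a stand-in for a conforming UTF-8 implementation (an assumption of this sketch; the helper names are invented). It refuses to encode an isolated surrogate and refuses to decode the three bytes that a naive encoder would produce for one.

```python
def encodes_as_utf8(text: str) -> bool:
    """Return True if the strict UTF-8 encoder accepts the string."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


def decodes_as_utf8(data: bytes) -> bool:
    """Return True if the strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(encodes_as_utf8("\U0001F600"))      # a character beyond 16 bits
print(encodes_as_utf8("\ud83d"))          # an isolated surrogate
print(decodes_as_utf8(b"\xed\xa0\xbd"))   # naive 3-byte form of U+D83D
```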

## The current discussion

The IETF-wide consensus to use Unicode and UTF-8 as designed has held for almost a quarter of a century.  Now, for some reason, there is some mood to open this up without need.

I am not going to repeat the content of RFC 9413 [5], which discusses the harm from protocols being “flexible”.  But it is good that this has been written up, because it shows that effort is often required to avoid protocols turning into what I call “soup”.

[5]: https://www.rfc-editor.org/rfc/rfc9413.html

> So, this tries to say “here’s how an RFC should specify which Unicode characters it supports”.  

Replacing Unicode by “Unicode plus some leakage from legacy UCS-2 platforms” MUST NOT be a “choice” that is open to a protocol designer.  True, in some cases there may be no alternative to integrating a widely used protocol that gets this wrong in some way, but promulgating this as a choice that every protocol designer can make on a whim is deeply wrong.

I would like to help make sure that we don’t make mistakes that would create the appearance that IETF protocols are now free to fall back to enabling the use of surrogates in place of characters (except for what they are meant for: in pairs in UTF-16, which, however, we normally do not use).

Grüße, Carsten


PS.:
https://unicode.org/glossary/
points to
https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf
for the definition of an (abstract) character.
Page forward to page 88, Definition 7 (D7), and do read.
Unfortunately, the whole document really is required reading for discussing the fine points people will bring up.
Terms such as “Unicode scalar value”, “noncharacter”, etc. come up, and it is important to understand the meaning of these terms in Unicode-based protocols.