Re: [art] draft-bray-unichars

Carsten Bormann <cabo@tzi.org> Tue, 29 August 2023 18:35 UTC

From: Carsten Bormann <cabo@tzi.org>
In-Reply-To: <CAHBU6iuDwquhacp1r7qREfaA1CGLR5LjqdasMdOQUQim6NeJsw@mail.gmail.com>
Date: Tue, 29 Aug 2023 20:35:35 +0200
Cc: art@ietf.org
Message-Id: <D870487D-0398-4C91-A1F3-69F1C5E6D036@tzi.org>
References: <CAHBU6iuDwquhacp1r7qREfaA1CGLR5LjqdasMdOQUQim6NeJsw@mail.gmail.com>
To: Tim Bray <tbray@textuality.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/art/q37e8UuvGXpw1VR9hIcw5syFhCA>

Hi Tim,

it is certainly useful to write a backgrounder on how to use Unicode in today’s network protocols.

Actually, I started writing such a document [1], and it seems I’ll need to pick up where I left off before the pandemic.

[1]: https://datatracker.ietf.org/doc/html/draft-bormann-dispatch-modern-network-unicode-02

(I received some very good feedback at the time that I can use to create the next revision of this document.)

The document [2] being announced here has a slightly different background: It seems to have been motivated by the discussion of an errata report that is trying to change RFC 8259 [3] and was discussed at length in [4].

[2]: https://datatracker.ietf.org/doc/draft-bray-unichars/
[3]: https://www.rfc-editor.org/errata/eid7603
[4]: https://mailarchive.ietf.org/arch/msg/json/Hkks1atRTycjGi0Hh2NWhdef8W8

The change requested was:

Original Text
-------------
A string is a sequence of zero or more Unicode characters [UNICODE].

Corrected Text
--------------
A string is a sequence of zero or more Unicode code points [UNICODE].

Even if this may not be obvious at first glance, this would have been a rather significant change of an approved document, so there was a lot of discussion.

## Backgrounder

The IETF took a decision in the late 1990s favoring Unicode, with UTF-8 as its interchange format. That decision has been upheld in the IETF for almost a quarter of a century now.

One problem with the introduction of Unicode, and the replacement of what was there in the marketplace before, was that initially Unicode was based on 16-bit characters (UCS-2). When it became clear that this wouldn’t be enough, a number of environments had already picked up UCS-2 and built platforms around it. The extension to about 21 bits that Unicode then underwent was realized on these platforms by switching to UTF-16, a “Unicode transformation format” based on 16-bit code units that reserves certain code points (“surrogates”) for use in pairs to represent characters that don’t fit into 16 bits.
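To make the surrogate mechanism concrete, here is a minimal sketch in Python of how a code point beyond 16 bits is split into a surrogate pair. The function name `to_surrogate_pair` is made up for this illustration; the arithmetic is the standard UTF-16 one, and the last line cross-checks it against Python's own UTF-16 encoder.

```python
def to_surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a supplementary code point (U+10000..U+10FFFF) into UTF-16 surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only supplementary code points need a surrogate pair")
    offset = cp - 0x10000            # 20 bits remain after removing the BMP offset
    high = 0xD800 + (offset >> 10)   # top 10 bits -> high (lead) surrogate
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits -> low (trail) surrogate
    return high, low


high, low = to_surrogate_pair(0x1F600)
# Cross-check against Python's UTF-16 encoder (big-endian, no byte order mark).
assert chr(0x1F600).encode("utf-16-be") == high.to_bytes(2, "big") + low.to_bytes(2, "big")
```

Neither half of the pair means anything on its own; only the two 16-bit code units together stand for the character.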

The UCS-2-based character models of the legacy 16-bit platforms in many cases couldn’t be repaired to fully embrace UTF-16 right away; e.g., only much later did ECMAScript introduce the “u” (Unicode) flag for regular expressions to have them actually match “Unicode” characters. So, on these platforms, UTF-16 is transported in a UCS-2 character model, and sometimes orphaned surrogates turn up instead of Unicode characters as “code points” in interfaces that are not meant to leak these implementation limitations to the outside world.
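One way an orphaned surrogate can come about is sketched below in Python, assuming a platform that truncates a string by counting 16-bit code units. The helper names `utf16_code_units` and `lone_surrogates` are invented for this example and do not come from any particular platform.

```python
import struct


def utf16_code_units(s: str) -> list[int]:
    """Return the 16-bit code units of s, as a UCS-2-style platform would store it."""
    data = s.encode("utf-16-le")
    return list(struct.unpack(f"<{len(data) // 2}H", data))


def lone_surrogates(units: list[int]) -> list[int]:
    """Return the indices of surrogates that are not part of a well-formed pair."""
    bad = []
    i = 0
    while i < len(units):
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        if is_high and i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
            i += 2  # well-formed pair: skip both halves
            continue
        if 0xD800 <= u <= 0xDFFF:
            bad.append(i)  # high without low, or low without high
        i += 1
    return bad


text = "a\U0001F600"            # one BMP character plus one supplementary character
units = utf16_code_units(text)  # three 16-bit code units
truncated = units[:2]           # "keep the first two characters", UCS-2 style
assert lone_surrogates(units) == []
assert lone_surrogates(truncated) == [1]  # the pair was cut in half
```

The truncated sequence is what then leaks out of the platform as a “code point” that is not a Unicode character.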

UTF-8 of course doesn’t support encoding surrogates (UTF-8 is careful to allow a single representation only for each Unicode character, and surrogate pairs would violate that, while isolated surrogates don’t mean anything in Unicode), so IETF protocols typically do not have to consider these problems of specific platforms.
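This can be observed with any strict UTF-8 implementation. The following sketch uses Python's built-in codec; the wrapper names `utf8_encodable` and `utf8_decodable` are made up for this illustration.

```python
def utf8_encodable(s: str) -> bool:
    """True if s can be encoded with Python's strict UTF-8 codec."""
    try:
        s.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


def utf8_decodable(data: bytes) -> bool:
    """True if data is accepted by Python's strict UTF-8 decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


assert utf8_encodable("\U0001F600")             # a real character is fine
assert not utf8_encodable("\ud83d")             # an isolated surrogate is refused
assert utf8_decodable(b"\xf0\x9f\x98\x80")      # the one valid encoding of U+1F600
assert not utf8_decodable(b"\xed\xa0\xbd")      # a surrogate squeezed into 3 bytes
assert not utf8_decodable(b"\xed\xa0\xbd\xed\xb8\x80")  # a surrogate pair, likewise
```

The last case is the second representation of U+1F600 that UTF-8 is careful to exclude.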

## The current discussion

The IETF-wide consensus to use Unicode and UTF-8 as designed has held for almost a quarter of a century. Now, for some reason, there is some mood to open this up without need.

I am not going to repeat the content of RFC 9413 [5], which discusses the harm from protocols being “flexible”. But it is good that this has been written up, because it shows that effort is often required to avoid protocols turning into what I call “soup”.

[5]: https://www.rfc-editor.org/rfc/rfc9413.html

> So, this tries to say “here’s how an RFC should specify which Unicode characters it supports”.

Replacing Unicode by “Unicode plus some leakage from legacy UCS-2 platforms” MUST NOT be a “choice” that is open to a protocol designer. True, in some cases there may be no alternative to integrating a widely used protocol that gets this wrong in some way, but promulgating this as a choice that every protocol designer can make on a whim is deeply wrong.

I would like to help make sure that we don’t make mistakes that would create the appearance that IETF protocols are now free to fall back to enabling the use of surrogates in place of characters (except for what they are meant for: in pairs in UTF-16, which, however, we normally do not use).

Grüße, Carsten

PS.:
https://unicode.org/glossary/
points to
https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf
for the definition of an (abstract) character.
Page forward to page 88, Definition 7 (D7), and do read.
Unfortunately, the whole document really is required reading for discussing the fine points people will bring up.
Terms such as “Unicode scalar value”, “noncharacter”, etc. come up, and it is important to understand the meaning of these terms in Unicode-based protocols.
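As a rough aid for two of these terms, the ranges involved can be written down in a few lines of Python. This is only a sketch under my reading of the ranges; the function names are invented here, and the Unicode chapter cited above remains the authority.

```python
def is_surrogate(cp: int) -> bool:
    """Surrogate code points: U+D800..U+DFFF."""
    return 0xD800 <= cp <= 0xDFFF


def is_scalar_value(cp: int) -> bool:
    """Unicode scalar value: any code point except the surrogates."""
    return 0 <= cp <= 0x10FFFF and not is_surrogate(cp)


def is_noncharacter(cp: int) -> bool:
    """Noncharacters: U+FDD0..U+FDEF plus the last two code points of each plane."""
    if not 0 <= cp <= 0x10FFFF:
        return False
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


assert not is_scalar_value(0xD83D)  # a surrogate is a code point, not a scalar value
assert is_scalar_value(0xFFFE)      # a noncharacter is still a scalar value
assert is_noncharacter(0xFFFE)
assert sum(is_noncharacter(cp) for cp in range(0x110000)) == 66
```

Note that the two categories are different things: noncharacters can be encoded in UTF-8, surrogates cannot.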
