Re: [I18ndir] [art] Just uploaded draft-bray-unichars-03

Tim Bray <tbray@textuality.com> Sun, 10 September 2023 17:52 UTC

From: Tim Bray <tbray@textuality.com>
Date: Sun, 10 Sep 2023 10:52:09 -0700
Message-ID: <CAHBU6ivc4W3KyYtbK2H7PQUa8C4+g=73nSTgBK+xLXnzH7V6GA@mail.gmail.com>
To: "Manger, James" <James.H.Manger@team.telstra.com>, Asmus Freytag <asmusf@ix.netcom.com>
Cc: "i18ndir@ietf.org" <i18ndir@ietf.org>, ART Area <art@ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/i18ndir/NPsOzYJ00xkjvwkKsrgAG_SGSFc>
Subject: Re: [I18ndir] [art] Just uploaded draft-bray-unichars-03

On Sep 10, 2023 at 6:51:33 AM, "Manger, James" <
James.H.Manger@team.telstra.com> wrote:

> Comments on draft-bray-unichars-03
> <https://www.ietf.org/archive/id/draft-bray-unichars-03.html>
>
>
>
> Section 3.1. Unicode Code Points
>
> The default repertoire of CBOR is unicode-scalar-values, not
> unicode-code-points. RFC8949 CBOR states that its string type “major type
> 3” is “a text string encoded as UTF-8”. That (since it is UTF-8) can’t
> include surrogates. It also states that “characters in this type are never
> escaped” so a JSON "\uDEAD" escape cannot be used to sneak in a surrogate.
> RFC8949 does use the phrase “Unicode code point” but appends “(scalar
> value)” at one point.
>

Interesting. I read that exact same text and came away with the impression
that if I were constructing a conforming CBOR reader, I would have to
accept all the code points. Do you believe the repeated use of “code
points” or do you index off the single trailing parenthesized “scalar
values”? Also, I bet that if I had a JSON text {"example": "\uDEAD"} and
fed it to JSON-to-CBOR converters, a lot of them would emit CBOR containing
ill-formed UTF-8. However, you’ve established that the reading of 8949 is
at least ambiguous on this point. So we should probably take the
CBOR-related repertoire assertion out?
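
A minimal sketch of that scenario in Python (chosen only for illustration; the thread names no particular converter). It assumes the standard-library json module and stands in for the text-string step of a hypothetical JSON-to-CBOR converter; to_utf8_strict is an invented helper name, not part of any CBOR library.

```python
import json

# The JSON text from the message, written with straight quotes so it parses.
doc = json.loads('{"example": "\\uDEAD"}')
value = doc["example"]

# Python's json module accepts the escape and hands back a lone surrogate.
lone_surrogate_parsed = len(value) == 1 and 0xD800 <= ord(value) <= 0xDFFF


def to_utf8_strict(s: str):
    """Hypothetical helper: UTF-8 bytes, or None if s has no well-formed encoding."""
    try:
        return s.encode("utf-8")
    except UnicodeEncodeError:
        return None


# Policy 1: refuse, because no well-formed UTF-8 exists for a surrogate.
strict_bytes = to_utf8_strict(value)

# Policy 2: push the surrogate through the UTF-8 bit layout anyway.
# These bytes are what an ill-formed CBOR text string payload would contain.
lenient_bytes = value.encode("utf-8", "surrogatepass")
```

Which policy a given converter follows is exactly the open question; the sketch only shows that both are easy to reach from the same parsed value.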

BTW this assertion that “UTF-8 can’t include surrogates”, which has been
made repeatedly, needs to be taken with a grain of salt. The UTF-8
procedures for converting between code points and byte sequences work
perfectly well for surrogates, and a whole lot of software out there will
silently convert both ways. The resulting bytes are in fact not well-formed
and do not conform to the definition of UTF-8, but they exist in the wild
and can’t really be defined as “non-existent”.
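
To illustrate, here is a rough Python sketch of the UTF-8 bit layout applied with no surrogate check. The names naive_utf8, naive_decode3 and is_well_formed are invented for this example; real encoders differ in how (and whether) they validate.

```python
def naive_utf8(cp: int) -> bytes:
    """Apply the UTF-8 bit layout to one code point, with no surrogate check."""
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def naive_decode3(b: bytes) -> int:
    """Reverse the three-byte layout, again without rejecting surrogates."""
    return ((b[0] & 0x0F) << 12) | ((b[1] & 0x3F) << 6) | (b[2] & 0x3F)


def is_well_formed(b: bytes) -> bool:
    """Ask Python's strict decoder whether the bytes are well-formed UTF-8."""
    try:
        b.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


surrogate_bytes = naive_utf8(0xDEAD)          # the arithmetic works fine
round_tripped = naive_decode3(surrogate_bytes)
accepted_by_strict_decoder = is_well_formed(surrogate_bytes)
```

The naive pair round-trips U+DEAD without complaint, while a strict decoder rejects the same three bytes.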

> The default repertoire of JSON is not unicode-code-points since JSON
> excludes controls except tab, newline and carriage return. Given this spec
> distinguishes useful-assignables from unicode-scalar-values it should
> distinguish JSON’s actual subset from unicode-code-points if it is going to
> mention JSON.
>

You’re thinking of XML?
https://datatracker.ietf.org/doc/html/rfc8259#section-7 says that C0
controls must be expressed in \u notation but they’re allowed.
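
A quick way to see the distinction, sketched with Python's standard json module (one parser among many; others may be more lenient by default):

```python
import json

# Escaped form: the JSON text contains the six characters \u0001.
escaped_value = json.loads('"\\u0001"')

# Raw form: the JSON text contains an actual U+0001 inside the string.
try:
    json.loads('"\x01"')
    raw_accepted = True
except json.JSONDecodeError:
    raw_accepted = False

# Serializing goes the other way: the control character comes out escaped.
serialized = json.dumps("\x01")
```

So the control character is part of the repertoire a JSON string can carry; only its raw, unescaped appearance in the text is ruled out.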

> 3.1. needs to explicitly state that this unicode-code-points cannot be
> encoded in well-formed UTF-8 (or UTF-16 or UTF-32). It can only be used via
> higher-level escape sequences in protocols that offer those (such as
> JSON). This is mentioned in 2.2.1 (“it is impossible to represent a
> surrogate in well-formed UTF-8”), but also needs to be in 3.1. Otherwise,
> 3.1 and 3.2 appear as two similar choices, which elides their huge
> difference.
>

Agreed. Check out the forthcoming -04.

> Section 2.2.3. Noncharacters
>
> This spec highlights noncharacters for exclusion. However, Unicode
> explicitly warns against that: Corrigendum #9 Clarification About
> Noncharacters <https://www.unicode.org/versions/corrigendum9.html> says
> “the real intent of noncharacters is that they are permanently prohibited
> from being assigned standard, interchangeable meanings, rather than that
> they are prohibited from occurring in Unicode strings which happen to be
> interchanged”.
>
> So an IETF spec is never going to define a string that needs a
> noncharacter; but it’s also never going to define a string that needs a
> private-use character either. If a spec defines an element that can hold
> any string, should that allow private-use characters but exclude
> noncharacters and non-useful controls? I’m not sure. That still leaves a
> lot of junk (eg BOM).
>

That’s a real issue.  Are we confident in saying that no IETF spec could
ever find a use for PUA code points?  If somebody wants to add structure to
text I think they should use CBOR or JSON or something, but if some WG
wanted to use a BMP PUA for some reason or other, I could see that being
OK. But if a bunch of people call for the exclusion of PUA, I guess I could
live with that.  I’ve cc’ed Asmus Freytag, the best Unicode expert I know
of, for his opinion.
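
For reference while weighing this, a small Python sketch of the ranges involved. The function names are made up for the example; the ranges are the ones Unicode defines for noncharacters and the private-use areas.

```python
def is_noncharacter(cp: int) -> bool:
    """U+FDD0..U+FDEF, plus the last two code points of each of the 17 planes."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def is_private_use(cp: int) -> bool:
    """BMP private-use area, plus planes 15 and 16 minus their noncharacters."""
    return (0xE000 <= cp <= 0xF8FF
            or 0xF0000 <= cp <= 0xFFFFD
            or 0x100000 <= cp <= 0x10FFFD)


def is_surrogate(cp: int) -> bool:
    """High and low surrogates, U+D800..U+DFFF."""
    return 0xD800 <= cp <= 0xDFFF


def count(predicate) -> int:
    """Count the code points in U+0000..U+10FFFF that satisfy predicate."""
    return sum(1 for cp in range(0x110000) if predicate(cp))
```

Note that U+FEFF (the BOM) falls in none of these classes, which is why excluding noncharacters alone still leaves it in.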

> Section 5. Refining Character Repertoires
>
> "\u7FFFF" is NOT a JSON escape for U+7FFFF; it is a JSON escape for U+7FFF
> followed by an F character (as a few others have pointed out).
>
> A proper JSON escape for U+7FFFF is "\uD9BF\uDFFF".
>

Ouch, of course, you’re right. And thanks for providing the UTF-16 so I
don’t have to remember how to compute it.
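
For the record, the computation is short. A sketch in Python, where json_escape_pair is a name invented for this example:

```python
import json


def json_escape_pair(cp: int) -> str:
    """Return the JSON \\uXXXX\\uXXXX escape for a supplementary code point."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    v = cp - 0x10000                 # 20-bit offset into the supplementary planes
    high = 0xD800 + (v >> 10)        # top 10 bits select the high surrogate
    low = 0xDC00 + (v & 0x3FF)       # bottom 10 bits select the low surrogate
    return "\\u%04X\\u%04X" % (high, low)


pair = json_escape_pair(0x7FFFF)

# The five-hex-digit form parses as U+7FFF followed by a literal "F".
misparsed = json.loads('"\\u7FFFF"')

# The surrogate-pair form parses back to the single code point U+7FFFF.
reparsed = json.loads('"' + pair + '"')
```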

> I agree that “many libraries will silently parse” "\uDEAD", but I’m not
> sure how many “generate an ill-formed UTF-8 string”. In Java, for instance,
> "\uDEAD".getBytes("UTF-8") returns a single byte 0x3F “?” – it’s valid
> UTF-8, just no longer a 1-to-1 representation of the in-memory unpaired
> surrogate.
>

Wow. I had no idea. As with many aspects of Java+Unicode, this feels deeply
wrong. It should either round-trip or throw a damn exception. Anyhow, that
ship sailed a long time ago.  I think we should include the Java example to
illustrate another way that surrogates can lead to breakage.
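
The same failure mode can be reproduced outside Java. A rough Python analogue follows; note that Python only substitutes when asked to via errors="replace", whereas the Java call quoted above does so by default. That difference is the point of the sketch.

```python
lone = "\udead"

# Substituting encoder: the unpaired surrogate silently becomes "?".
replaced = lone.encode("utf-8", errors="replace")

# The output is valid UTF-8, but decoding it does not give the input back.
round_trip = replaced.decode("utf-8")
lost_information = round_trip != lone

# Python's default (strict) encoder takes the other option and raises.
try:
    lone.encode("utf-8")
    strict_raised = False
except UnicodeEncodeError:
    strict_raised = True
```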
