Re: [I18ndir] [art] Just uploaded draft-bray-unichars-03

Tim Bray <tbray@textuality.com> Sun, 10 September 2023 17:52 UTC

References: <CAHBU6is50TkpDsqXTp6WxdVSgE66j3gGHZ60ey2jFYbefaHFJw@mail.gmail.com> <ME3PR01MB59730B45D9339180AF00E941E5F3A@ME3PR01MB5973.ausprd01.prod.outlook.com>
In-Reply-To: <ME3PR01MB59730B45D9339180AF00E941E5F3A@ME3PR01MB5973.ausprd01.prod.outlook.com>
From: Tim Bray <tbray@textuality.com>
Date: Sun, 10 Sep 2023 10:52:09 -0700
Message-ID: <CAHBU6ivc4W3KyYtbK2H7PQUa8C4+g=73nSTgBK+xLXnzH7V6GA@mail.gmail.com>
To: "Manger, James" <James.H.Manger@team.telstra.com>, Asmus Freytag <asmusf@ix.netcom.com>
Cc: "i18ndir@ietf.org" <i18ndir@ietf.org>, ART Area <art@ietf.org>
Content-Type: multipart/alternative; boundary="00000000000078d7fc060504de49"
Archived-At: <https://mailarchive.ietf.org/arch/msg/i18ndir/NPsOzYJ00xkjvwkKsrgAG_SGSFc>
Subject: Re: [I18ndir] [art] Just uploaded draft-bray-unichars-03

On Sep 10, 2023 at 6:51:33 AM, "Manger, James" <
James.H.Manger@team.telstra.com> wrote:

> Comments on draft-bray-unichars-03
> <https://www.ietf.org/archive/id/draft-bray-unichars-03.html>
>
>
>
> Section 3.1. Unicode Code Points
>
> The default repertoire of CBOR is unicode-scalar-values, not
> unicode-code-points. RFC8949 CBOR states that its string type “major type
> 3” is “a text string encoded as UTF-8”. That (since it is UTF-8) can’t
> include surrogates. It also states that “characters in this type are never
> escaped” so a JSON "\uDEAD" escape cannot be used to sneak in a surrogate.
> RFC8949 does use the phrase “Unicode code point” but appends “(scalar
> value)” at one point.
>

Interesting. I read that exact same text and came away with the impression
that if I were constructing a conforming CBOR reader, I would have to
accept all the code points. Do you believe the repeated use of “code
points” or do you index off the single trailing parenthesized “scalar
values”? Also, I bet that if I had a JSON text {"example": "\uDEAD"} and
fed it to JSON-to-CBOR converters, a lot of them would emit CBOR containing
ill-formed UTF-8. However, you’ve established that the reading of 8949 is
at least ambiguous on this point. So we should probably take the
CBOR-related repertoire assertion out?
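
As a rough illustration of that bet, here is a minimal Python sketch. The
naive_cbor_text helper is hypothetical, written only for this example; it is
not taken from any real CBOR library, and it deliberately skips validation to
mimic a careless converter. It only handles payloads shorter than 24 bytes.

```python
import json


def naive_cbor_text(s: str) -> bytes:
    """Hypothetical, minimal CBOR text-string encoder (major type 3).

    Assumption for illustration: the converter does not validate its input,
    which is modelled here with the 'surrogatepass' error handler.
    """
    payload = s.encode("utf-8", errors="surrogatepass")
    if len(payload) >= 24:
        raise ValueError("this sketch only supports short strings")
    # Major type 3 is 0b011 in the top three bits; the length goes in the rest.
    return bytes([0x60 | len(payload)]) + payload


# Python's json module parses the lone-surrogate escape without complaint.
doc = json.loads('{"example": "\\uDEAD"}')
value = doc["example"]
print(len(value), hex(ord(value)))  # 1 0xdead

cbor = naive_cbor_text(value)
print(cbor.hex())  # 63edbaad

# The payload is not well-formed UTF-8, so a strict decoder rejects it.
try:
    cbor[1:].decode("utf-8")
    payload_is_valid = True
except UnicodeDecodeError:
    payload_is_valid = False
print(payload_is_valid)  # False
```

Whether real converters behave this way would have to be tested one by one;
the sketch only shows that nothing in the mechanics prevents it.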

BTW this assertion that “UTF-8 can’t include surrogates”, which has been
made repeatedly, needs to be taken with a grain of salt. The UTF-8
procedures for converting between code points and byte sequences work
perfectly well for surrogates, and a whole lot of software out there will
silently convert both ways. The resulting UTF-8 is neither well-formed nor
conformant to the definition of UTF-8, but it exists in the wild and can’t
really be defined as “non-existent”.
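
To make the "procedures work perfectly well" point concrete, here is a small
Python sketch that applies the UTF-8 bit layout with no validity check at all.
The function name utf8_encode_unchecked is made up for this example. Python's
built-in 'surrogatepass' error handler stands in for lenient software.

```python
def utf8_encode_unchecked(cp: int) -> bytes:
    """Apply the UTF-8 bit layout to a code point, with no validity checks."""
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


# The arithmetic goes through for a surrogate just as for any other value.
raw = utf8_encode_unchecked(0xDEAD)
print(raw.hex())  # edbaad

# A lenient codec produces the same bytes and converts them back again.
lenient = chr(0xDEAD).encode("utf-8", errors="surrogatepass")
round_trip = raw.decode("utf-8", errors="surrogatepass")

# A strict decoder treats the same bytes as ill-formed.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
print(raw == lenient, round_trip == "\udead", strict_ok)  # True True False
```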

> The default repertoire of JSON is not unicode-code-points since JSON
> excludes controls except tab, newline and carriage return. Given this spec
> distinguishes useful-assignables from unicode-scalar-values it should
> distinguish JSON’s actual subset from unicode-code-points if it is going to
> mention JSON.
>

You’re thinking of XML?
https://datatracker.ietf.org/doc/html/rfc8259#section-7 says that C0
controls must be escaped (for example in \u notation), but they’re allowed.
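
A quick check with Python's standard json module (one parser among many,
so treat this as a sketch rather than proof of what RFC 8259 requires) shows
the distinction: an escaped C0 control is accepted, a raw one is rejected.

```python
import json

# An escaped C0 control is accepted and yields the actual control character.
escaped = json.loads('"\\u0001"')
print(escaped == "\x01")  # True

# The same control character appearing raw inside the string is rejected
# by the parser in its default (strict) mode.
try:
    json.loads('"\x01"')
    raw_accepted = True
except json.JSONDecodeError:
    raw_accepted = False
print(raw_accepted)  # False

# On output, controls are escaped; tab gets the short escape \t.
serialized = json.dumps("\x01\t")
print(serialized)  # "\u0001\t"
```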

> 3.1. needs to explicitly state that this unicode-code-points cannot be
> encoded in well-formed UTF-8 (or UTF-16 or UTF-32). It can only be used via
> higher-level escape sequences in protocols that offer those (such as
> JSON). This is mentioned in 2.2.1 (“it is impossible to represent a
> surrogate in well-formed UTF-8”), but also needs to be in 3.1. Otherwise,
> 3.1 and 3.2 appear as two similar choices, which elides their huge
> difference.
>

Agreed. Check out the forthcoming -04.

> Section 2.2.3. Noncharacters
>
> This spec highlights noncharacters for exclusion. However, Unicode
> explicitly warns against that: Corrigendum #9 Clarification About
> Noncharacters <https://www.unicode.org/versions/corrigendum9.html> says
> “the real intent of noncharacters is that they are permanently prohibited
> from being assigned standard, interchangeable meanings, rather than that
> they are prohibited from occurring in Unicode strings which happen to be
> interchanged”.
>
> So an IETF spec is never going to define a string that needs a
> noncharacter; but it’s also never going to define a string that needs a
> private-use character either. If a spec defines an element that can hold
> any string, should that allow private-use characters but exclude
> noncharacters and non-useful controls? I’m not sure. That still leaves a
> lot of junk (e.g., BOM).
>

That’s a real issue. Are we confident in saying that no IETF spec could
ever find a use for PUA code points? If somebody wants to add structure to
text I think they should use CBOR or JSON or something, but if some WG
wanted to use a BMP PUA for some reason or other, I could see that being
OK. But if a bunch of people call for the exclusion of PUA, I guess I could
live with that. I’ve cc’ed Asmus Freytag, the best Unicode expert I know
of, for his opinion.

> Section 5. Refining Character Repertoires
>
> "\u7FFFF" is NOT a JSON escape for U+7FFFF; it is a JSON escape for U+7FFF
> followed by an F character (as a few others have pointed out).
>
> A proper JSON escape for U+7FFFF is "\uD9BF\uDFFF".
>

Ouch, of course, you’re right. And thanks for providing the UTF-16 so I
don’t have to remember how to compute it.
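
For the record, the computation is short. The json_escape helper below is a
hypothetical name used only for this sketch; it derives the surrogate pair
with the standard UTF-16 arithmetic and cross-checks the result against
Python's json module.

```python
import json


def json_escape(cp: int) -> str:
    """Return the JSON \\u escape for a code point (hypothetical helper)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if cp < 0x10000:
        return f"\\u{cp:04X}"
    # Supplementary planes: subtract 0x10000, then split the 20 remaining bits.
    v = cp - 0x10000
    high = 0xD800 + (v >> 10)    # top 10 bits
    low = 0xDC00 + (v & 0x3FF)   # bottom 10 bits
    return f"\\u{high:04X}\\u{low:04X}"


print(json_escape(0x7FFFF))  # \uD9BF\uDFFF

# The pair decodes back to the single code point U+7FFFF.
pair = json.loads('"\\uD9BF\\uDFFF"')
print(pair == chr(0x7FFFF))  # True

# Whereas "\u7FFFF" is U+7FFF followed by a literal "F".
mistaken = json.loads('"\\u7FFFF"')
print(mistaken == "\u7fff" + "F")  # True
```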

> I agree that “many libraries will silently parse” "\uDEAD", but I’m not
> sure how many “generate an ill-formed UTF-8 string”. In Java, for instance,
> "\uDEAD".getBytes("UTF-8") returns a single byte 0x3F “?” – it’s valid
> UTF-8, just no longer a 1-to-1 representation of the in-memory unpaired
> surrogate.
>

Wow. I had no idea. As with many aspects of Java+Unicode, this feels deeply
wrong. It should either round-trip or throw a damn exception. Anyhow, that
ship sailed a long time ago. I think we should include the Java example to
illustrate another way that surrogates can lead to breakage.
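
For comparison, here is a Python sketch of the two behaviours contrasted
above. It is an analogy, not Java: Python's default strict mode throws, while
the 'replace' error handler produces the same silent "?" substitution that
the quoted Java call is reported to return.

```python
lone = "\udead"  # an unpaired surrogate held in an in-memory string

# Default (strict) encoding refuses to produce UTF-8 for it.
try:
    lone.encode("utf-8")
    strict_result = "encoded"
except UnicodeEncodeError:
    strict_result = "raised"
print(strict_result)  # raised

# With errors="replace" the surrogate silently becomes "?" (byte 0x3F):
# valid UTF-8, but no longer a faithful copy of the original string.
replaced = lone.encode("utf-8", errors="replace")
print(replaced)  # b'?'
print(replaced.decode("utf-8") == lone)  # False
```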

