Re: [I18ndir] [art] Fwd: New Version Notification for draft-bray-unichars-06.txt

Rob Sayre <sayrer@gmail.com> Sat, 30 September 2023 22:13 UTC

From: Rob Sayre <sayrer@gmail.com>
Date: Sat, 30 Sep 2023 15:13:33 -0700
Message-ID: <CAChr6SwsSWo+0ozJ3dbOurD-8ES=1rCpwX8FS2tfVuYr11s1gQ@mail.gmail.com>
To: Tim Bray <tbray@textuality.com>
Cc: "Manger, James" <James.H.Manger@team.telstra.com>, "i18ndir@ietf.org" <i18ndir@ietf.org>, ART Area <art@ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/i18ndir/FQaYQZk7xNKRcKFBu8TnxI2czJs>

On Sat, Sep 30, 2023 at 2:46 PM Tim Bray <tbray@textuality.com> wrote:

> This is interesting, thank you James.
>
> James proposes a restructure that cleverly removes mentions of code points
> and surrogates, on the basis that scalars are more important than code
> points.
>

I think it's a bad edit, although not a wrong one. Readers who are learning
about these issues do need to know the difference between code points and
scalars.
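
For anyone new to the distinction, here is a minimal Python sketch. The
helper names are made up for illustration and do not come from the draft.
Every integer from 0 to 0x10FFFF is a code point; the scalars are the code
points left after removing the surrogate range U+D800 to U+DFFF.

```python
# Hypothetical helpers, not taken from the draft.
SURROGATES = range(0xD800, 0xE000)  # U+D800..U+DFFF


def is_code_point(n: int) -> bool:
    """True for any integer in the Unicode code space."""
    return 0 <= n <= 0x10FFFF


def is_scalar(n: int) -> bool:
    """True for code points that are not surrogates."""
    return is_code_point(n) and n not in SURROGATES


for cp in (0x41, 0xD800, 0xDFFF, 0xE000, 0x10FFFF, 0x110000):
    print(f"U+{cp:04X} code_point={is_code_point(cp)} scalar={is_scalar(cp)}")
```

U+D800 is the interesting row: it is a code point but not a scalar.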

> He also points out an important error; the I-JSON repertoire is *not* scalars,
> it’s scalars minus noncharacters. I had entirely forgotten this.  Should we
> introduce an I-JSON subset for people who might want to use control codes?
> I personally think using control codes is a bad idea, do others disagree?
>

No disagreement. It might be better to describe the issue, and explain that
you might not be dealing with text if you hit these.
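
One way to describe the issue concretely is a small checker that reports
what it found instead of silently dropping it. This is only a sketch in
Python: the function names are hypothetical, and treating tab, LF and CR as
acceptable controls is an assumption made here, not a rule from the draft.

```python
def is_noncharacter(cp: int) -> bool:
    """U+FDD0..U+FDEF plus the last two code points of every plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def is_control(cp: int) -> bool:
    """C0 controls, DEL, and C1 controls."""
    return cp <= 0x1F or 0x7F <= cp <= 0x9F


def suspicious(s: str) -> list:
    """List the code points that suggest the input may not be text."""
    found = []
    for ch in s:
        cp = ord(ch)
        if is_noncharacter(cp):
            found.append(f"U+{cp:04X} noncharacter")
        elif is_control(cp) and ch not in "\t\n\r":  # assumption: keep these
            found.append(f"U+{cp:04X} control")
    return found


print(suspicious("ok\x00\ufffe\n"))
```

A non-empty result is a hint that the field may be carrying binary data or
an internal sentinel rather than text.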


> While I agree that scalars are more important than code points, I have a
> problem with some of this, which is not technical.  I think the most common
> scenario in which this document would be used (should it progress), would
> be someone who is working on an internet draft with text fields and gets
> told “you need to specify your character repertoire, go look at
> [Unichars]”.
>

I realize I have not said this yet: this document should progress, even if
I disagree with the parts that attempt to sweep the really messy stuff
under the rug. The options the document does present are good, and would
save a lot of trouble.


> So the document exists at least in part to educate people about what’s
> lurking in Unicode.   For that reason, I don’t want to have a spec in the
> style of “just do this, don’t worry about why, trust us”. I think the
> explanation of the problem works better if you start from code points and
> explicitly caution against surrogates. Also, it’s hard to motivate why
> scalars exist without introducing surrogates.
>

Yes, it builds up an understanding from the more meaningless building
blocks. It's done well.


> I don’t think the discussion about scalars and UTF-8/16/32 is appropriate.
> UTF-16 and UTF-32 shouldn’t be used in protocols, the Internet has
> converged on UTF-8, RFC2277 mandates it, and this document should assume
> that characters on the wire are UTF-8.
>

I agree that it's not helpful. If you do need to know about these, you also
need to know about WTF-8 etc. These things will appear on the wire until
the underlying data is fixed. That will take some time.
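
To show how such data ends up on the wire, here is a small Python sketch.
It uses CPython's "surrogatepass" error handler purely as an illustration.
That handler is not a WTF-8 codec; for a lone surrogate it happens to
produce the same three bytes that WTF-8 would.

```python
from typing import Optional


def encode_strict(s: str) -> Optional[bytes]:
    """Encode as UTF-8, returning None if the string is not all scalars."""
    try:
        return s.encode("utf-8")
    except UnicodeEncodeError:
        return None


def decode_strict(data: bytes) -> Optional[str]:
    """Decode as UTF-8, returning None if the bytes are ill-formed."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


lone = "\ud800"  # a Python str can hold a lone surrogate code point
leaky = lone.encode("utf-8", "surrogatepass")  # illustration only

print(encode_strict(lone))   # None: strict UTF-8 refuses the surrogate
print(leaky)                 # b'\xed\xa0\x80'
print(decode_strict(leaky))  # None: a strict decoder rejects these bytes
print(leaky.decode("utf-8", "surrogatepass") == lone)  # True
```

A sender with lenient settings can emit bytes that a strict receiver will
reject, which is why this keeps showing up until the stored data is cleaned.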


> What do people think about the BOM?  Officially, it’s ZERO WIDTH NO-BREAK
> SPACE.  It probably doesn't add value to a UTF-8 field but it also probably
> won’t break anything.  I can see people using it to pad out fixed-length
> fields?  If we remove it from “Unicode Assignables” then we probably have
> to include an explanation about the BOM functionality.
>

I don't think the audience for this document will need to worry about this
one. Anyone that does worry about it will have their nose in the Unicode
specification itself. It is also rarely used as a character in practice,
since ZWJ (U+200D) is the more common thing (even used for compound emoji).
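
For reference, a short Python sketch of what the BOM looks like in a UTF-8
field. It relies only on the standard library; "utf-8-sig" is Python's name
for a decoder that drops one leading BOM.

```python
import unicodedata

BOM = "\ufeff"
print(unicodedata.name(BOM))  # ZERO WIDTH NO-BREAK SPACE

data = (BOM + "hi").encode("utf-8")
print(data)  # b'\xef\xbb\xbfhi'

kept = data.decode("utf-8")          # the BOM stays as an invisible character
stripped = data.decode("utf-8-sig")  # a single leading BOM is removed
print(len(kept), len(stripped))      # 3 2

# Only a leading BOM is treated as a signature; one in the middle survives.
inner = ("a" + BOM + "b").encode("utf-8").decode("utf-8-sig")
print(len(inner))  # 3
```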

In a few places, the message said "Drop section [N.N]" with no rationale.
Those sections didn't tell me anything I didn't know, but most people do
not know these annoying things, so I think the sections should stay in.

thanks,
Rob




> On Sep 30, 2023 at 2:15:01 AM, "Manger, James" <
> James.H.Manger@team.telstra.com> wrote:
>
>> Comments on draft-bray-unichars-06
>> <https://datatracker.ietf.org/doc/html/draft-bray-unichars>.
>>
>>
>>
>> §2 “Characters and Code Points” should start with scalars; they are far
>> more important than code points. Suggested replacement text.
>>
>> [UNICODE <http://www.unicode.org/versions/latest/>] defines the
>> 1,112,064 integers in the ranges 0 to 0xD7FF and 0xE000 to 0x10FFFF as
>> "Unicode scalars". Every character is assigned to one scalar. As of Unicode
>> 15.1 (2023), 149,813 characters have been assigned, leaving at most 962,251
>> scalars available for assignment in future versions.
>>
>> unicode-scalar = %x0-D7FF / %xE000-10FFFF
>>
>> Scalars are the complete set of values that can be uniquely represented
>> in all 3 Unicode encoding forms – UTF-8, UTF-16, and UTF-32 – which use
>> 8-bit, 16-bit and 32-bit code units respectively. So scalars are the
>> repertoire that works for representing characters in memory, storage, and
>> in network protocols regardless of choices to use 8, 16 or 32-bit words.
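
A quick Python check of the two claims in this passage, as a sketch only.
The sample characters are arbitrary choices. The two ABNF ranges work out
to 1,112,064 values, and each sample round-trips through all three encoding
forms, using a different number of code units in each.

```python
# The two ranges from the unicode-scalar ABNF rule.
SCALAR_RANGES = [(0x0, 0xD7FF), (0xE000, 0x10FFFF)]
scalar_count = sum(hi - lo + 1 for lo, hi in SCALAR_RANGES)
print(scalar_count)  # 1112064


def code_units(ch: str) -> dict:
    """Number of code units needed for one scalar in each encoding form."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")) // 2,
        "utf-32": len(ch.encode("utf-32-le")) // 4,
    }


for cp in (0x41, 0xE9, 0x20AC, 0x1F600):  # arbitrary sample scalars
    ch = chr(cp)
    for enc in ("utf-8", "utf-16-le", "utf-32-le"):
        assert ch.encode(enc).decode(enc) == ch  # lossless round trip
    print(f"U+{cp:04X}", code_units(ch))
```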
>>
>> I’d rename the section to be “Characters and scalars”.
>>
>> Drop §2.1 “Transformation formats”.
>>
>> §2.2 “Problematic Code Point Types” could be renamed “Problematic
>> characters”.
>>
>> BOM should be considered a problematic character (and excluded from
>> unicode-assignables) as it is used as an encoding-layer signal.
>>
>> Surrogates are handled at the encoding layer, not the character layer, so
>> drop §2.2.1 “Surrogates”. I suggest a new §2.3 “Ill-formed encodings” to
>> replace §2.2.1 and most of §3.
>>
>> 2.3. Ill-formed encodings
>>
>> A sequence of 8-bit, 16-bit or 32-bit code units representing scalars is
>> a well-formed UTF-8, UTF-16, or UTF-32 encoding respectively. However,
>> there are other code unit sequences in each of these 3 encodings that don’t
>> map to scalars (e.g. 0xC0 0x80 in UTF-8; 0xD800 in UTF-16; 0x20FFFF in
>> UTF-32). Such sequences are called ill-formed. They can exist in practice.
>> Reasonable options when interpreting such code unit sequences are
>> signalling an error or treating them as "�" (U+FFFD, REPLACEMENT
>> CHARACTER). Silently ignoring ill-formed code unit sequences is a known
>> security risk.
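
The three examples and the three handling options can be tried with
Python's built-in codecs. This is a sketch; `is_well_formed` is a
hypothetical helper name, and the little-endian byte order is an arbitrary
choice for the 16-bit and 32-bit cases.

```python
def is_well_formed(data: bytes, encoding: str) -> bool:
    """True if a strict decoder accepts the code unit sequence."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


samples = {
    "utf-8": b"\xc0\x80",
    "utf-16-le": (0xD800).to_bytes(2, "little"),
    "utf-32-le": (0x20FFFF).to_bytes(4, "little"),
}
for enc, data in samples.items():
    print(enc, is_well_formed(data, enc))  # False for all three

raw = b"a\xc0\x80b"
replaced = raw.decode("utf-8", errors="replace")  # U+FFFD marks the damage
ignored = raw.decode("utf-8", errors="ignore")    # the risky option
print(repr(replaced))
print(repr(ignored))  # 'ab'
```

With "ignore" the result is "ab", so a filter that ran before decoding never
saw those two characters next to each other. That is the security risk.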
>>
>> Drop §3 “Dealing With Problematic Code Points”.
>>
>> Typo: \U0089 should be \u0089.
>>
>> Typo: RFC19413 should be RFC9413.
>>
>> I’d define unicode-scalar in §2 so we don’t need §4.1 “Unicode Scalars”.
>> §4 can say:
>>
>> Specifications can refer to these by the names “Unicode scalars” (section
>> 2), “XML Characters”, and “Unicode Assignables”.
>>
>> I-JSON can’t be used as an example using unicode-scalar as it explicitly
>> excludes noncharacters; and the difference between the repertoire for the
>> JSON vs the repertoire for the logical string that can be represented by a
>> JSON string is not explained.
>>
>> i-json-value-repertoire = %x9 / %xA / %xD / %x20-D7FF / %xE000-FFFD /
>> %x10000-1FFFD / %x20000-2FFFD / … / %x100000-10FFFD
>>
>> i-json-logical-string-repertoire = %x0-D7FF / %xE000-FFFD /
>> %x10000-1FFFD / %x20000-2FFFD / … / %x100000-10FFFD
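
The two rules can be turned into a membership test. The Python below is a
sketch that follows the ABNF literally; it reads the "…" as planes 3
through 15 following the same pattern, which is an assumption. The function
names are hypothetical. As written, the ranges still admit U+FDD0 to
U+FDEF, which Unicode also classes as noncharacters.

```python
def i_json_ranges(logical_string: bool) -> list:
    """Build (low, high) pairs for one of the two quoted ABNF rules."""
    if logical_string:
        ranges = [(0x0, 0xD7FF)]
    else:
        ranges = [(0x9, 0x9), (0xA, 0xA), (0xD, 0xD), (0x20, 0xD7FF)]
    ranges.append((0xE000, 0xFFFD))
    for plane in range(1, 17):  # assumption: "…" covers planes 3..15
        base = plane << 16
        ranges.append((base, base + 0xFFFD))
    return ranges


def in_repertoire(s: str, ranges: list) -> bool:
    """True if every code point of s falls inside one of the ranges."""
    return all(any(lo <= ord(ch) <= hi for lo, hi in ranges) for ch in s)


VALUE = i_json_ranges(logical_string=False)
LOGICAL = i_json_ranges(logical_string=True)

print(in_repertoire("a\x00b", VALUE))    # False: NUL is not in the value rule
print(in_repertoire("a\x00b", LOGICAL))  # True
print(in_repertoire("\uffff", LOGICAL))  # False: noncharacter
```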
>>
>> --
>>
>> James Manger
>>
>> *From: *art <art-bounces@ietf.org> on behalf of Tim Bray <
>> tbray@textuality.com>
>> *Date: *Tuesday, 26 September 2023 at 2:51 am
>> *To: *i18ndir@ietf.org <i18ndir@ietf.org>, ART Area <art@ietf.org>
>> *Subject: *[art] Fwd: New Version Notification for
>> draft-bray-unichars-06.txt
>>
>> What’s new and different here.
>>
>>
>>
>>    1. Locked down definition of “problematic”
>>    2. Locked down definition of “character repertoire”
>>    3. Changed “Useful Assignables” to “Unicode Assignables” (checked
>>    with Asmus first)
>>
>>
>>
>> A new version of Internet-Draft draft-bray-unichars-06.txt has been
>> successfully submitted by Paul Hoffman and posted to the
>> IETF repository.
>>
>> Name:     draft-bray-unichars
>> Revision: 06
>> Title:    Unicode Character Repertoire Subsets
>> Date:     2023-09-25
>> Group:    Individual Submission
>> Pages:    10
>> URL:      https://www.ietf.org/archive/id/draft-bray-unichars-06.txt
>> Status:   https://datatracker.ietf.org/doc/draft-bray-unichars/
>> HTML:     https://www.ietf.org/archive/id/draft-bray-unichars-06.html
>> HTMLized: https://datatracker.ietf.org/doc/html/draft-bray-unichars
>> Diff:     https://author-tools.ietf.org/iddiff?url2=draft-bray-unichars-06
>>
>> Abstract:
>>
>>   This document discusses specifying subsets of the Unicode character
>>   repertoire for use in protocols and data formats.
>>
>>
>>
>> The IETF Secretariat
>>