Re: [I18ndir] [art] Fwd: New Version Notification for draft-bray-unichars-06.txt

Tim Bray <tbray@textuality.com> Sat, 30 September 2023 21:46 UTC

From: Tim Bray <tbray@textuality.com>
Date: Sat, 30 Sep 2023 21:46:22 +0000
Message-ID: <CAHBU6iueqtd5T1T-ciYUMWvmo8XqBQqO5LkWbdRaoXQzPYSQOQ@mail.gmail.com>
To: "Manger, James" <James.H.Manger@team.telstra.com>
Cc: "i18ndir@ietf.org" <i18ndir@ietf.org>, ART Area <art@ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/i18ndir/QbFOT7W3lFKARYQAFcgmtM0uRHQ>

This is interesting, thank you, James.

James proposes a restructure that cleverly removes mentions of code points
and surrogates, on the basis that scalars are more important than code
points.

He also points out an important error: the I-JSON repertoire is *not* scalars,
it’s scalars minus noncharacters. I had entirely forgotten this. Should we
introduce an I-JSON subset for people who might want to use control codes?
I personally think using control codes is a bad idea; do others disagree?
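
To make "scalars minus noncharacters" concrete, here is a minimal Python
sketch. It is an illustration only, not text from the draft: the function
names are made up for this message, and the noncharacter test assumes
Unicode's definition (U+FDD0..U+FDEF plus the last two code points of each
of the 17 planes, 66 in total).

```python
def is_scalar(cp: int) -> bool:
    """True for Unicode scalar values: every code point except surrogates."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


def is_noncharacter(cp: int) -> bool:
    """True for the 66 noncharacters (assumed Unicode definition)."""
    in_arabic_block = 0xFDD0 <= cp <= 0xFDEF
    # U+nFFFE and U+nFFFF for every plane n from 0 to 16.
    plane_tail = 0 <= cp <= 0x10FFFF and (cp & 0xFFFE) == 0xFFFE
    return in_arabic_block or plane_tail


def in_ijson_repertoire(cp: int) -> bool:
    """Hypothetical helper: scalars minus noncharacters."""
    return is_scalar(cp) and not is_noncharacter(cp)


if __name__ == "__main__":
    # Control codes such as U+0000 stay in; U+FFFE and surrogates do not.
    for cp in (0x00, 0x41, 0xD800, 0xFDD0, 0xFFFE):
        print(f"U+{cp:04X}", in_ijson_repertoire(cp))
```

Note that control codes pass this check, which is why the question about
them matters.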

While I agree that scalars are more important than code points, I have a
non-technical problem with some of this.  I think the most common scenario
in which this document would be used (should it progress) would be someone
who is working on an internet draft with text fields and gets told “you
need to specify your character repertoire, go look at [Unichars]”.

So the document exists at least in part to educate people about what’s
lurking in Unicode. For that reason, I don’t want to have a spec in the
style of “just do this, don’t worry about why, trust us”. I think the
explanation of the problem works better if you start from code points and
explicitly caution against surrogates. Also, it’s hard to explain why
scalars exist without introducing surrogates.
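
As an illustration of why surrogates have to be introduced before scalars
make sense, here is a small Python sketch. The helper names are invented
for this message. It shows the standard UTF-16 surrogate-pair arithmetic,
and that Python's UTF-8 encoder refuses a lone surrogate because it is a
code point but not a scalar.

```python
def utf16_code_units(cp: int) -> list:
    """Encode one scalar value as a list of UTF-16 code units."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"U+{cp:04X} is not a Unicode scalar value")
    if cp < 0x10000:
        return [cp]  # fits in a single 16-bit code unit
    v = cp - 0x10000  # 20 bits, split across two surrogates
    return [0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)]


def encodable_as_utf8(s: str) -> bool:
    """True if Python's strict UTF-8 encoder accepts the string."""
    try:
        s.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    print([hex(u) for u in utf16_code_units(0x1F600)])  # ['0xd83d', '0xde00']
    print(encodable_as_utf8("\ud800"))  # False: lone surrogate
```

The range D800..DFFF is reserved for those two halves, which is exactly the
hole in the scalar range.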

I don’t think the discussion about scalars and UTF-8/16/32 is appropriate.
UTF-16 and UTF-32 shouldn’t be used in protocols: the Internet has
converged on UTF-8, RFC 2277 mandates it, and this document should assume
that characters on the wire are UTF-8.
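
If the wire format is assumed to be UTF-8, a receiver still has to decide
what to do with bytes that are not well-formed. The sketch below uses
Python's built-in codec and shows the two options James mentions in the
quoted text (signal an error, or substitute U+FFFD). The function name and
the `strict` flag are assumptions made for this example.

```python
def decode_field(raw: bytes, strict: bool = True) -> str:
    """Decode a protocol text field that is assumed to be UTF-8.

    strict=True  raises UnicodeDecodeError on ill-formed input.
    strict=False substitutes U+FFFD REPLACEMENT CHARACTER instead.
    """
    return raw.decode("utf-8", errors="strict" if strict else "replace")


if __name__ == "__main__":
    ill_formed = b"\xc0\x80"  # overlong encoding, never valid UTF-8
    try:
        decode_field(ill_formed)
    except UnicodeDecodeError as exc:
        print("rejected:", exc.reason)
    # Lenient mode keeps going but makes the damage visible.
    print(ascii(decode_field(ill_formed, strict=False)))
```

Either behaviour avoids the silent-drop case, which is the one that causes
security problems.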

What do people think about the BOM?  Officially, it’s ZERO WIDTH NO-BREAK
SPACE (U+FEFF).  It probably doesn’t add value to a UTF-8 field, but it
also probably won’t break anything.  I can imagine people using it to pad
out fixed-length fields.  If we remove it from “Unicode Assignables” then
we probably have to include an explanation about the BOM functionality.
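
For reference, this is how the BOM shows up in a UTF-8 field in practice.
The sketch uses only Python's standard codecs; the helper names are made up
here. The plain "utf-8" codec treats U+FEFF as an ordinary character, while
"utf-8-sig" drops one leading U+FEFF and leaves any later ones alone.

```python
import unicodedata

BOM = "\ufeff"


def decode_keep_bom(raw: bytes) -> str:
    """Decode UTF-8 and keep any U+FEFF as ordinary content."""
    return raw.decode("utf-8")


def decode_strip_leading_bom(raw: bytes) -> str:
    """Decode UTF-8, dropping a single leading U+FEFF if present."""
    return raw.decode("utf-8-sig")


if __name__ == "__main__":
    print(unicodedata.name(BOM))  # ZERO WIDTH NO-BREAK SPACE
    print(BOM.encode("utf-8"))  # b'\xef\xbb\xbf'
    field = b"\xef\xbb\xbfabc"
    print(ascii(decode_keep_bom(field)))  # '\ufeffabc'
    print(ascii(decode_strip_leading_bom(field)))  # 'abc'
```

So whether a field "contains" the BOM can depend on which decoder the
receiver happened to pick, which is one argument for saying something
explicit about it.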



On Sep 30, 2023 at 2:15:01 AM, "Manger, James" <
James.H.Manger@team.telstra.com> wrote:

> Comments on draft-bray-unichars-06
> <https://datatracker.ietf.org/doc/html/draft-bray-unichars>.
>
>
>
> §2 “Characters and Code Points” should start with scalars; they are far
> more important than code points. Suggested replacement text.
>
> [UNICODE <http://www.unicode.org/versions/latest/>] defines the 1,112,064
> integers in the ranges 0 to 0xD7FF and 0xE000 to 0x10FFFF as "Unicode
> scalars". Every character is assigned to one scalar. As of Unicode 15.1
> (2023), 149,813 characters have been assigned, leaving 962,251 scalars
> available for assignment in future versions.
>
> unicode-scalar = %x0-D7FF / %xE000-10FFFF
>
> Scalars are the complete set of values that can be uniquely represented in
> all 3 Unicode encoding forms – UTF-8, UTF-16, and UTF-32 – which use 8-bit,
> 16-bit and 32-bit code units respectively. So scalars are the repertoire
> that works for representing characters in memory, storage, and in network
> protocols regardless of choices to use 8, 16 or 32-bit words.
>
> I’d rename the section to be “Characters and scalars”.
>
> Drop §2.1 “Transformation formats”.
>
> §2.2 “Problematic Code Point Types” could be renamed “Problematic
> characters”.
>
> BOM should be considered a problematic character (and excluded from
> unicode-assignables) as it is used as an encoding-layer signal.
>
> Surrogates are handled at the encoding layer, not the character layer, so
> drop §2.2.1 “Surrogates”. I suggest a new §2.3 “Ill-formed encodings” to
> replace §2.2.1 and most of §3.
>
> 2.3. Ill-formed encodings
>
> A sequence of 8-bit, 16-bit or 32-bit code units representing scalars is a
> well-formed UTF-8, UTF-16, or UTF-32 encoding respectively. However, there
> are other code unit sequences in each of these 3 encodings that don’t map
> to scalars (e.g., 0xC0 0x80 in UTF-8; 0xD800 in UTF-16; 0x20FFFF in UTF-32).
> Such sequences are called ill-formed. They can exist in practice. Reasonable
> options when interpreting such code unit sequences are signalling an error
> or treating them as "�" (U+FFFD, REPLACEMENT CHARACTER). Silently ignoring
> ill-formed code unit sequences is a known security risk.
>
> Drop §3 “Dealing With Problematic Code Points”.
>
> Typo: \U0089 should be \u0089.
>
> Typo: RFC19413 should be RFC9413.
>
> I’d define unicode-scalar in §2 so we don’t need §4.1 “Unicode Scalars”.
> §4 can say:
>
> Specifications can refer to these by the names “Unicode scalars” (section
> 2), “XML Characters”, and “Unicode Assignables”.
>
> I-JSON can’t be used as an example using unicode-scalar as it explicitly
> excludes noncharacters; and the difference between the repertoire for the
> JSON vs the repertoire for the logical string that can be represented by a
> JSON string is not explained.
>
> i-json-value-repertoire = %x9 / %xA / %xD / %x20-D7FF / %xE000-FFFD /
> %x10000-1FFFD / %x20000-2FFFD / … / %x100000-10FFFD
>
> i-json-logical-string-repertoire = %x0-D7FF / %xE000-FFFD / %x10000-1FFFD
> / %x20000-2FFFD / … / %x100000-10FFFD
>
> --
>
> James Manger
>
> *From: *art <art-bounces@ietf.org> on behalf of Tim Bray <
> tbray@textuality.com>
> *Date: *Tuesday, 26 September 2023 at 2:51 am
> *To: *i18ndir@ietf.org <i18ndir@ietf.org>, ART Area <art@ietf.org>
> *Subject: *[art] Fwd: New Version Notification for
> draft-bray-unichars-06.txt
>
> What’s new and different here.
>
>
>
>    1. Locked down definition of “problematic”
>    2. Locked down definition of “character repertoire”
>    3. Changed “Useful Assignables” to “Unicode Assignables” (checked with
>    Asmus first)
>
>
>
> A new version of Internet-Draft draft-bray-unichars-06.txt has been
> successfully submitted by Paul Hoffman and posted to the
> IETF repository.
>
> Name:     draft-bray-unichars
> Revision: 06
> Title:    Unicode Character Repertoire Subsets
> Date:     2023-09-25
> Group:    Individual Submission
> Pages:    10
> URL:      https://www.ietf.org/archive/id/draft-bray-unichars-06.txt
> Status:   https://datatracker.ietf.org/doc/draft-bray-unichars/
> HTML:     https://www.ietf.org/archive/id/draft-bray-unichars-06.html
> HTMLized: https://datatracker.ietf.org/doc/html/draft-bray-unichars
> Diff:     https://author-tools.ietf.org/iddiff?url2=draft-bray-unichars-06
>
> Abstract:
>
>   This document discusses specifying subsets of the Unicode character
>   repertoire for use in protocols and data formats.
>
>
>
> The IETF Secretariat
>
>
