Re: [I18ndir] [art] Just uploaded draft-bray-unichars-03

Asmus Freytag <asmusf@ix.netcom.com> Sat, 09 September 2023 19:42 UTC

Message-ID: <bb9d009b-427a-bf4d-952e-263deabe5d94@ix.netcom.com>
Date: Sat, 09 Sep 2023 12:42:23 -0700
To: i18ndir@ietf.org, Tim Bray <tbray@textuality.com>, Paul Hoffman <paul.hoffman@vpnc.org>
References: <CAHBU6is50TkpDsqXTp6WxdVSgE66j3gGHZ60ey2jFYbefaHFJw@mail.gmail.com> <20230909165843.GlTJy%steffen@sdaoden.eu>
From: Asmus Freytag <asmusf@ix.netcom.com>
In-Reply-To: <20230909165843.GlTJy%steffen@sdaoden.eu>
Archived-At: <https://mailarchive.ietf.org/arch/msg/i18ndir/9ElUfqJ6Db9BeW_Aluxjs4jVx_s>
Subject: Re: [I18ndir] [art] Just uploaded draft-bray-unichars-03
Precedence: list

On 9/9/2023 9:58 AM, Steffen Nurpmeso wrote:
> Tim Bray wrote in
>   <CAHBU6is50TkpDsqXTp6WxdVSgE66j3gGHZ60ey2jFYbefaHFJw@mail.gmail.com>:
>   |See https://www.ietf.org/archive/id/draft-bray-unichars-03.html
>   |
>   |A bunch of minor corrections and improvements, thanks to everyone for that,
>   |especially James Manger for noticing that the ABNF was entirely wrong in
>   |one place.
>   |
>   |The word “useless” has been replaced by “legacy”.
>   |
>   |I think the feedback was pretty clear that the draft needed to be more
>   |opinionated; just because we document the existence of the default JSON
>   |repertoire (“all the code points”) doesn’t mean that anyone should use it
>   |in the present or future. So, introduced a new section “Refining Character
>   |Repertoires” to highlight those issues and offer a suggestion.
>
> In 2.2 i would not give the count on code point types.
> Instead i would only give the problem statement "among Unicode
> code point types .. are questionable".  This seems more generic.
Unicode lists 7 code point types, and using a count makes clear that
those types (and no other classification) are meant.
>
> In 2.2.2.2 i would not say "legacy controls", and that they are
> "mostly obsolete".  ECMA-48 is very alive in at least the POSIX
> aka Linux world, for many purposes, for example terminal
> interaction.  "Likely to occur in data as a result of
> a programming error"?  Any preformatted Unix manual page will come
> with lots of CSI sequences, or backspace-based ones.
> ASCII NUL is the base of ISO C-style strings.  In fact many
> network protocols (not enough!!) still seem to use
> KEY=VALUE\0KEY=VALUE\0\0 style transports.
There is a tension between the needs of protocols for text data and 
binary data (including serialized structured data containing text fields).

As I commented earlier, the draft could be improved with better
guidance on how to specify "text plus" style repertoires. Some of these
may need more of the controls than either the XML subset or the new
subset provides. It would not be useful to have "canned" subsets for
each and every possible permutation, but having a specification declare
that it uses the "useful assignables" repertoire augmented by something
like "the following set of ..." individually listed code points would go
a long way toward retaining the benefit of a common approach to
noncharacters and surrogates.
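
As a rough illustration of the "base repertoire plus individually listed
code points" idea, here is a minimal Python sketch. The base predicate is
only an approximation of the draft's subset (no surrogates, no
noncharacters, no C0/C1 controls other than tab, LF and CR); the exact
ranges should be taken from the draft's ABNF. All function names are
hypothetical and not part of the draft.

```python
def is_surrogate(cp: int) -> bool:
    return 0xD800 <= cp <= 0xDFFF


def is_noncharacter(cp: int) -> bool:
    # U+FDD0..U+FDEF, plus the last two code points of every plane.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def is_legacy_control(cp: int) -> bool:
    # Assumption: tab, LF and CR are kept; other C0, DEL and C1 are not.
    if cp in (0x09, 0x0A, 0x0D):
        return False
    return cp < 0x20 or 0x7F <= cp <= 0x9F


def in_base_repertoire(cp: int) -> bool:
    return not (is_surrogate(cp) or is_noncharacter(cp) or is_legacy_control(cp))


def make_repertoire(extra=frozenset()):
    """Return a checker for the base repertoire augmented by `extra`.

    `extra` is an explicit set of code points a specification adds back.
    Surrogates and noncharacters are refused, so the common treatment of
    those two classes is preserved whatever a specification lists.
    """
    extra = frozenset(extra)
    for cp in extra:
        if is_surrogate(cp) or is_noncharacter(cp):
            raise ValueError(f"U+{cp:04X} cannot be added to a repertoire")

    def allowed(text: str) -> bool:
        return all(in_base_repertoire(ord(c)) or ord(c) in extra for c in text)

    return allowed


# A hypothetical terminal-oriented protocol adds ESC and BS back in.
terminal_text = make_repertoire({0x1B, 0x08})
plain_text = make_repertoire()
```

With this, `terminal_text("\x1b[1mbold\x1b[0m")` is accepted while
`plain_text` rejects the same string, and neither accepts a
noncharacter such as U+FFFE.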

Beyond that, one of the other subsets might well be what you would need; 
the draft does list them for a reason.

>
> In 5.:
>
>    [JSON..] It cannot be serialized into legal UTF-8, but many
>    libraries will silently parse this and generate an ill-formed
>    UTF-8 string. Implementors must be prepared to deal with these
>    sorts of problematic code points.
>
> But RFC 3629 is very clear and says in 3. (being lengthy)
>
>     The definition of UTF-8 prohibits encoding character numbers between
>     U+D800 and U+DFFF, which are reserved for use with the UTF-16
>     encoding form (as surrogate pairs) and do not directly represent
> []
>     characters.  When encoding in UTF-8 from UTF-16 data, it is necessary
>     to first decode the UTF-16 data to obtain character numbers, which
>     are then encoded in UTF-8 as described above.  This contrasts with
>     CESU-8 [CESU-8], which is a UTF-8-like encoding that is not meant for
>     ...
>
> So even the weird JSON "string" can be made valid UTF-8, one just
> has to walk around the corner.  (Possibly.)
> Sorry, but _I_ do not get that JSON supports _that_ "string",
> RFC 8259, 7.:
>
>     To escape an extended character that is not in the Basic Multilingual
>     Plane, the character is represented as a 12-character sequence,
>     encoding the UTF-16 surrogate pair.
>
> And then in 8.
>
>    8.  String and Character Issues
>    8.1.  Character Encoding
>       JSON text exchanged between systems that are not part of a
>       closed ecosystem MUST be encoded using UTF-8 [RFC3629].
>
> This is a total contradiction, sorry.  I. Hate. JSON.
> But that does not help anyone.
>
> So i mean _if_ i would write such a RFC _i_ would not hammer your
> sentence on the table, but i would then simply refer to RFC 3629
> and say that implementors shall be prepared to convert the JSON
> standard (grrr) string .. to the UTF-8 standard?
>
> 5. also says
>
>     It is unlikely that anyone specifying a new data format would
>     choose to allow this character repertoire.
>
> And
>
>     A protocol based on JSON could be made more robust and
>     implementor-friendly by requiring that the contents of member
>     names and string values contain only Useful Assignables
>
> No.  Not me.  Sorry .. we are talking string data?
> I mean, with your restriction one (possibly) cannot even generate
> a protocol that carries around Linux/POSIX path names?  Except by
> mangling them to something likely non-reproducible (by leaving off
> "evil" characters, or converting them to a replacement character;
> which one, the Unicode one, or question mark?  Ah, it must be
> ASCII question mark because the Unicode replacement character is
> of the evil sort?).  Or have i misunderstood something ...
> which can very well be the truth, of course.
> So, even if you wipe away all of the above, a hint on replacement
> characters in a document that restricts the usable set of Unicode
> characters is well worth a thought.
>
I understood this comment to mean that there may be reasons not to use
the full flexibility of the JSON repertoire in a given situation (in
many situations, perhaps). And that seems fine. It's not that dissimilar
from specifications that use an XML schema, but then define further
constraints on the values of elements or attributes than can be
expressed in the schema. In such a case, a file could be valid under the
schema, but not valid under the full specification.
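
The same two-layer situation can be sketched for JSON in Python. This
is an illustration only: it relies on the standard `json` module
accepting an escaped lone surrogate, and the helper names
(`iter_strings`, `repertoire_violations`) are hypothetical. The
stricter rule shown here forbids only surrogates; a real specification
would substitute its own repertoire check.

```python
import json


def iter_strings(value):
    """Yield every member name and string value in a parsed JSON document."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for key, item in value.items():
            yield key
            yield from iter_strings(item)
    elif isinstance(value, list):
        for item in value:
            yield from iter_strings(item)


def has_surrogate(text: str) -> bool:
    # A properly paired \uD83D\uDE00 escape is decoded to one code point,
    # so any surrogate still present in the string is unpaired.
    return any(0xD800 <= ord(c) <= 0xDFFF for c in text)


def repertoire_violations(raw_json: str):
    """Layer 1: JSON syntax (json.loads). Layer 2: the stricter rule."""
    document = json.loads(raw_json)
    return [s for s in iter_strings(document) if has_surrogate(s)]


def can_encode_utf8(text: str) -> bool:
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False
```

A text such as `{"name": "\ud800"}` passes the first layer but is
reported by `repertoire_violations`, and the offending string cannot be
encoded as UTF-8.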

If, on the other hand, you need to write a protocol that needs to be 
able to transport any string that is conformant to Unicode, and have no 
control over the source of your data, then you are better off with a 
subset that is more comprehensive.

The draft could be improved by discussing and contrasting these
scenarios more explicitly.

A./
