Re: [I18ndir] [art] Just uploaded draft-bray-unichars-03

Your comments triggered some reflections on the drafts and on comments I 
submitted earlier. See below.
A./

On 9/9/2023 1:56 PM, Rob Sayre wrote:
> I gave it a close read again. I came up with this:
>
>
> 5. Refining Character Repertoires
>
> The IETF typically uses well-known data formats such as JSON, I-JSON, 
> CBOR, YAML, and XML. These formats have default character repertoires. 
> For example, JSON allows member names and string values to include any 
> Unicode code points, including all the problematic types; the 
> following is a legal JSON document:
>
> [ big edit from the current draft, shorter, but take it or leave it. ]
>
>
> {"example": "\u0000\U0089\uDEAD\u7FFFF"}
>
> The value of the "example" field contains the C0 Control NUL, the C1 
> Control "CHARACTER TABULATION WITH JUSTIFICATION", an unpaired 
> surrogate, and the noncharacter U+7FFFF. It is unlikely to be useful 
> as the value of a text field. It cannot be serialized into legal 
> UTF-8, but many libraries will silently parse this and generate an 
> ill-formed UTF-8 string. Implementors must be prepared to deal with 
> these sorts of problematic code points
>
> [ The first part, "unlikely to be useful as the value of a text 
> field", is good. But, the next part mixes "legal" and "ill-formed", 
> and I don't think that is a good idea. There is still a lowercase 
> requirement after that, and I think I disagree. Implementors do not 
> have to be "prepared to deal with these sorts of problematic code 
> points". Maybe: "Some messages will contain these 
> problematic code points". That is true, but you don't have to deal 
> with them. ]

I agree with you that "legal" should be changed to "well-formed", which
is a defined term.

I disagree a bit with your conclusion. Implementers SHOULD always be 
prepared to deal with ill-formed input. It's the nature of how to deal 
with ill-formed input that can vary based on the circumstances and context.

If a protocol declares the "problematic" code points off limits, then
an implementation might reject such ill-formed input or modify it in a
way that flags it as erroneous. For an example, see the way the Unicode
Standard suggests dealing with ill-formed UTF-8.
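
To make the two options concrete, here is a minimal Python sketch. It is
my own illustration, not text from the draft or from the Unicode
Standard; the function names are hypothetical and the byte string is an
assumed sample of ill-formed UTF-8. Strict decoding rejects the input,
while the "replace" error handler substitutes U+FFFD REPLACEMENT
CHARACTER, which flags the damage instead of passing it along.

```python
# Assumed sample input: 0xC0 0xAF is an overlong form of "/", which is
# ill-formed in UTF-8.
ILL_FORMED = b"ab\xc0\xafcd"


def decode_or_reject(raw: bytes) -> str:
    """Reject ill-formed UTF-8 by raising UnicodeDecodeError."""
    return raw.decode("utf-8", errors="strict")


def decode_and_flag(raw: bytes) -> str:
    """Replace ill-formed sequences with U+FFFD so the damage stays visible."""
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    try:
        decode_or_reject(ILL_FORMED)
    except UnicodeDecodeError as err:
        print("rejected:", err.reason)
    # ascii() makes the U+FFFD substitutions visible as \ufffd escapes.
    print(ascii(decode_and_flag(ILL_FORMED)))
```

Which of the two is appropriate depends on the protocol; neither one
passes the ill-formed bytes through unchanged.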

There are security concerns attached to how an implementation deals with 
ill-formed input, but we should be clear that producing ill-formed 
output is not "dealing" with the situation. In fact, even "behavior is 
unspecified" is a bad choice from a security point of view.

If you have a data format that puts limits on what is allowed, and you
are guaranteed that your input is "well-formed", it would be redundant
to take elaborate measures to "deal" with ill-formed input. However,
even in that case, rejecting the input or throwing an exception would
still be a more appropriate reaction than ignoring the fact that the
input is ill-formed.
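
As a rough illustration of "reject rather than ignore", the following
Python sketch parses a JSON text and then raises an exception if any
member name or string value cannot be encoded as well-formed UTF-8. The
helper names are hypothetical and the sample document is a simplified
assumption, not the exact example from the draft. It relies on the fact
that Python's json module accepts an escaped unpaired surrogate and
hands back a string that cannot be encoded as UTF-8.

```python
import json


def _check(value: object) -> None:
    """Raise ValueError if any string inside value is not encodable as UTF-8."""
    if isinstance(value, str):
        try:
            value.encode("utf-8")
        except UnicodeEncodeError as err:
            raise ValueError(f"unpaired surrogate at index {err.start}") from err
    elif isinstance(value, dict):
        for key, item in value.items():
            _check(key)
            _check(item)
    elif isinstance(value, list):
        for item in value:
            _check(item)


def parse_or_reject(document: str) -> object:
    """Parse JSON, then reject strings that would yield ill-formed UTF-8."""
    value = json.loads(document)
    _check(value)
    return value


if __name__ == "__main__":
    # Assumed sample: an escaped NUL followed by an unpaired surrogate.
    sample = r'{"example": "\u0000\uDEAD"}'
    try:
        parse_or_reject(sample)
    except ValueError as err:
        print("rejected:", err)
```

Note that this check only catches the unpaired surrogate. The NUL is
well-formed UTF-8, so excluding it would take an additional repertoire
check.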

>
>
>
> It is unlikely that anyone specifying a new data format would choose 
> to allow this character repertoire.
>
> [ Instead: The JSON character repertoire is too permissive, so it's 
> best for new specifications to require that the contents of member 
> names and string values contain only Useful Assignables (see Section 
> 4.2). ]
recte: code point repertoire
>
>
>
> Then, I got to the end, and noticed that "character repertoire" might 
> not be the best choice. "Character encoding" or "character set"? 
> "Vocabulary"? No shade for the authors here, writing about language 
> itself is really difficult.

I'm a proponent of the term "repertoire" for precisely the purpose of 
indicating the contents of a subset of characters.

An "encoding" is associated both with the repertoire and the mapping to 
identifying numbers for each character. As the mapping is supplied by 
the Unicode Standard, a subset would not be a new "encoding".

The term "vocabulary" would seem more appropriate for a collection of 
symbolic names, than individual characters.

As I've pointed out in earlier comments to the authors, there are other
IETF standards that already use the term "repertoire" in the sense of a
collection of elements that are based on Unicode characters. In some
cases the repertoire elements are characters, but depending on the
specification they may also be character sequences.

Generally, when we speak about a character set, we use that term in a
way that's not fully congruent with "set of characters"; most commonly,
the term "character set" refers to a formal specification like ASCII,
ISO 8859-1, Unicode, etc., each of which defines both a maximal
repertoire and an encoding. Sometimes they also define subsets and
different encoding forms.

The one change I would make (and I admit that I overlooked it) would be 
to change "character repertoire" to "code point repertoire" for any 
repertoire that is not limited to code points assigned to characters.
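
To show why the distinction matters, here is a small Python sketch that
sorts code points into the kinds discussed in this thread. The function
name and the category labels are my own hypothetical choices, not the
draft's ABNF or terminology. Surrogates and noncharacters are code
points that are never assigned to characters, so a repertoire that
includes them is a repertoire of code points, not of characters.

```python
def classify(cp: int) -> str:
    """Return a coarse label for a Unicode code point (labels are illustrative)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {cp:#x}")
    if 0xD800 <= cp <= 0xDFFF:
        return "surrogate"
    # Noncharacters: U+FDD0..U+FDEF plus the last two code points of each plane.
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return "noncharacter"
    # C0 controls, DEL, and C1 controls.
    if cp <= 0x1F or 0x7F <= cp <= 0x9F:
        return "control"
    return "other"


if __name__ == "__main__":
    # The four code points named in the quoted paragraph, plus "A".
    for cp in (0x0000, 0x0089, 0xDEAD, 0x7FFFF, 0x0041):
        print(f"U+{cp:04X}: {classify(cp)}")
```

The "other" bucket is deliberately coarse: it lumps together assigned
characters, private-use code points, and unassigned code points.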

A./

>
> thanks,
> Rob
>
>
>
> On Sat, Sep 9, 2023 at 10:15 AM Steffen Nurpmeso <steffen@sdaoden.eu> 
> wrote:
>
>     Tim Bray wrote in
>      <CAHBU6is50TkpDsqXTp6WxdVSgE66j3gGHZ60ey2jFYbefaHFJw@mail.gmail.com>:
>      |See https://www.ietf.org/archive/id/draft-bray-unichars-03.html
>      |
>      |A bunch of minor corrections and improvements, thanks to
>     everyone for that,
>      |especially James Manger for noticing that the ABNF was entirely
>     wrong in
>      |one place.
>      |
>      |The word “useless” has been replaced by “legacy”.
>      |
>      |I think the feedback was pretty clear that the draft needed to
>     be more
>      |opinionated; just because we document the existence of the
>     default JSON
>      |repertoire (“all the code points”) doesn’t mean that anyone
>     should use it
>      |in the present or future. So, introduced a new section “Refining
>     Character
>      |Repertoires” to highlight those issues and offer a suggestion.
>
>     In 2.2 i would not give the count on code point types.
>     Instead i would only give the problem statement "among Unicode
>     code point types .. are questionable".  This seems more generic.
>
>     In 2.2.2.2 i would not say "legacy controls", and that they are
>     "mostly obsolete".  ECMA-48 is very alive in at least the POSIX
>     aka Linux world, for many purposes, for example terminal
>     interaction.  "Likely to occur in data as a result of
>     a programming error"?  Any preformatted Unix manual page will come
>     with lots of CSI sequences, or backspace-based ones.
>     ASCII NUL is the base of ISO C-style strings.  In fact many
>     network protocols (not enough!!) still seem to use
>     KEY=VALUE\0KEY=VALUE\0\0 style transports.
>
>     In 5.:
>
>       [JSON..] It cannot be serialized into legal UTF-8, but many
>       libraries will silently parse this and generate an ill-formed
>       UTF-8 string. Implementors must be prepared to deal with these
>       sorts of problematic code points.
>
>     But RFC 3629 is very clear and says in 3. (being lengthy)
>
>        The definition of UTF-8 prohibits encoding character numbers
>     between
>        U+D800 and U+DFFF, which are reserved for use with the UTF-16
>        encoding form (as surrogate pairs) and do not directly represent
>     []
>        characters.  When encoding in UTF-8 from UTF-16 data, it is
>     necessary
>        to first decode the UTF-16 data to obtain character numbers, which
>        are then encoded in UTF-8 as described above.  This contrasts with
>        CESU-8 [CESU-8], which is a UTF-8-like encoding that is not
>     meant for
>        ...
>
>     So even the weird JSON "string" can be made valid UTF-8, one just
>     has to walk around the corner.  (Possibly.)
>     Sorry, but _I_ do not get that JSON supports _that_ "string",
>     RFC 8259, 7.:
>
>        To escape an extended character that is not in the Basic
>     Multilingual
>        Plane, the character is represented as a 12-character sequence,
>        encoding the UTF-16 surrogate pair.
>
>     And then in 8.
>
>       8.  String and Character Issues
>       8.1.  Character Encoding
>          JSON text exchanged between systems that are not part of a
>          closed ecosystem MUST be encoded using UTF-8 [RFC3629].
>
>     This is a total contradiction, sorry.  I. Hate. JSON.
>     But that does not help anyone.
>
>     So i mean _if_ i would write such a RFC _i_ would not hammer your
>     sentence on the table, but i would then simply refer to RFC 3629
>     and say that implementors shall be prepared to convert the JSON
>     standard (grrr) string .. to the UTF-8 standard?
>
>     5. also says
>
>        It is unlikely that anyone specifying a new data format would
>        choose to allow this character repertoire.
>
>     And
>
>        A protocol based on JSON could be made more robust and
>        implementor-friendly by requiring that the contents of member
>        names and string values contain only Useful Assignables
>
>     No.  Not me.  Sorry .. we are talking string data?
>     I mean, with your restriction one (possibly) cannot even generate
>     a protocol that carries around Linux/POSIX path names?  Except by
>     mangling them to something likely non-reproducible (by leaving off
>     "evil" characters, or converting them to a replacement character;
>     which one, the Unicode one, or question mark?  Ah, it must be
>     ASCII question mark because the Unicode replacement character is
>     of the evil sort?).  Or have i misunderstood something ...
>     which can very well be the truth, of course.
>     So, even if you wipe away all of the above, a hint on replacement
>     characters in a document that restricts the usable set of Unicode
>     characters is well worth a thought.
>
>     Thank you.
>
>     --steffen
>     |
>     |Der Kragenbaer,                The moon bear,
>     |der holt sich munter           he cheerfully and one by one
>     |einen nach dem anderen runter  wa.ks himself off
>     |(By Robert Gernhardt)
>