Re: [I18ndir] [art] Just uploaded draft-bray-unichars-03

I gave it a close read again. I came up with this:

5. Refining Character Repertoires

The IETF typically uses well-known data formats such as JSON, I-JSON, CBOR,
YAML, and XML. These formats have default character repertoires. For
example, JSON allows member names and string values to include any Unicode
code points, including all the problematic types; the following is a legal
JSON document:

[ big edit from the current draft, shorter, but take it or leave it. ]

{"example": "\u0000\u0089\uDEAD\uD9BF\uDFFF"}

The value of the "example" field contains the C0 Control NUL, the C1
Control "CHARACTER TABULATION WITH JUSTIFICATION", an unpaired surrogate,
and the noncharacter U+7FFFF. It is unlikely to be useful as the value of a
text field. It cannot be serialized into legal UTF-8, but many libraries
will silently parse this and generate an ill-formed UTF-8 string.
Implementors must be prepared to deal with these sorts of problematic code
points.

[ The first part, "unlikely to be useful as the value of a text field", is
good. But the next part mixes "legal" and "ill-formed" in one sentence, and
I don't think that is a good idea. After that there is still a lowercase
"must" requirement, and I think I disagree with it. Implementors do not
have to be "prepared to deal with these sorts of problematic code points".
Maybe: "Some messages will contain these problematic code points". That is
true, but you don't have to deal with them. ]
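
A minimal Python sketch of the behavior that paragraph describes, offered as
an illustration and not as proposed draft text. It assumes CPython's standard
json module; other libraries may behave differently. The escapes are written
in the form RFC 8259 accepts (lowercase \u, four hex digits, U+7FFFF as a
surrogate pair).

```python
import json

# The example document, with escapes spelled the way RFC 8259 requires.
doc = r'{"example": "\u0000\u0089\uDEAD\uD9BF\uDFFF"}'

# The parser accepts all four problematic code points without complaint.
value = json.loads(doc)["example"]
code_points = [f"U+{ord(c):04X}" for c in value]

# A strict UTF-8 encoder refuses the unpaired surrogate.
try:
    value.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# "surrogatepass" forces the surrogate out anyway; the resulting bytes are
# ill-formed UTF-8 and a strict decoder will reject them.
raw = value.encode("utf-8", "surrogatepass")

print(code_points, strict_ok, raw)
```

So the parse succeeds, and it is only at the encoding step that a program is
forced to notice, or not, depending on the error handler it picked.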

It is unlikely that anyone specifying a new data format would choose to
allow this character repertoire.

[ Instead: The JSON character repertoire is too permissive, so it's best
for new specifications to require that the contents of member names and
string values contain only Useful Assignables (see Section 4.2). ]
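
To show what such a requirement could look like in code, here is a rough
Python sketch. The function names are made up for illustration, and the
ranges are my reading of the categories discussed in this thread (surrogates,
C0/C1 controls other than tab, LF and CR, DEL, and noncharacters); the
draft's ABNF is the authority, not this.

```python
def is_useful_assignable(cp: int) -> bool:
    """Hypothetical check; ranges are an assumption, not the draft's ABNF."""
    if 0xD800 <= cp <= 0xDFFF:          # surrogates
        return False
    if cp < 0x20:                       # C0 controls, except tab, LF, CR
        return cp in (0x09, 0x0A, 0x0D)
    if 0x7F <= cp <= 0x9F:              # DEL and the C1 controls
        return False
    if 0xFDD0 <= cp <= 0xFDEF:          # noncharacter block
        return False
    if (cp & 0xFFFE) == 0xFFFE:         # U+nFFFE and U+nFFFF in every plane
        return False
    return cp <= 0x10FFFF


def problematic(text: str) -> list[str]:
    """List the code points in text that fall outside the subset above."""
    return [f"U+{ord(c):04X}" for c in text if not is_useful_assignable(ord(c))]


print(problematic("\x00\x89\udead\U0007ffff"))
print(problematic("plain text\n"))
```

A receiver could run a check like this on member names and string values
after parsing and reject the message, or flag it, when the list is non-empty.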

Then I got to the end, and noticed that "character repertoire" might not
be the best choice. "Character encoding" or "character set"? "Vocabulary"?
No shade on the authors here; writing about language itself is really
difficult.

thanks,
Rob

On Sat, Sep 9, 2023 at 10:15 AM Steffen Nurpmeso <steffen@sdaoden.eu> wrote:

> Tim Bray wrote in
>  <CAHBU6is50TkpDsqXTp6WxdVSgE66j3gGHZ60ey2jFYbefaHFJw@mail.gmail.com>:
>  |See https://www.ietf.org/archive/id/draft-bray-unichars-03.html
>  |
>  |A bunch of minor corrections and improvements, thanks to everyone for that,
>  |especially James Manger for noticing that the ABNF was entirely wrong in
>  |one place.
>  |
>  |The word “useless” has been replaced by “legacy”.
>  |
>  |I think the feedback was pretty clear that the draft needed to be more
>  |opinionated; just because we document the existence of the default JSON
>  |repertoire (“all the code points”) doesn’t mean that anyone should use it
>  |in the present or future. So, introduced a new section “Refining Character
>  |Repertoires” to highlight those issues and offer a suggestion.
>
> In 2.2 i would not give the count on code point types.
> Instead i would only give the problem statement "among Unicode
> code point types .. are questionable".  This seems more generic.
>
> In 2.2.2.2 i would not say "legacy controls", and that they are
> "mostly obsolete".  ECMA-48 is very alive in at least the POSIX
> aka Linux world, for many purposes, for example terminal
> interaction.  "Likely to occur in data as a result of
> a programming error"?  Any preformatted Unix manual page will come
> with lots of CSI sequences, or backspace-based ones.
> ASCII NUL is the base of ISO C-style strings.  In fact many
> network protocols (not enough!!) still seem to use
> KEY=VALUE\0KEY=VALUE\0\0 style transports.
>
> In 5.:
>
>   [JSON..] It cannot be serialized into legal UTF-8, but many
>   libraries will silently parse this and generate an ill-formed
>   UTF-8 string. Implementors must be prepared to deal with these
>   sorts of problematic code points.
>
> But RFC 3629 is very clear and says in 3. (being lengthy)
>
>    The definition of UTF-8 prohibits encoding character numbers between
>    U+D800 and U+DFFF, which are reserved for use with the UTF-16
>    encoding form (as surrogate pairs) and do not directly represent
>    characters.  When encoding in UTF-8 from UTF-16 data, it is necessary
>    to first decode the UTF-16 data to obtain character numbers, which
>    are then encoded in UTF-8 as described above.  This contrasts with
>    CESU-8 [CESU-8], which is a UTF-8-like encoding that is not meant for
>    ...
>
> So even the weird JSON "string" can be made valid UTF-8, one just
> has to walk around the corner.  (Possibly.)
> Sorry, but _I_ do not get that JSON supports _that_ "string",
> RFC 8259, 7.:
>
>    To escape an extended character that is not in the Basic Multilingual
>    Plane, the character is represented as a 12-character sequence,
>    encoding the UTF-16 surrogate pair.
>
> And then in 8.
>
>   8.  String and Character Issues
>   8.1.  Character Encoding
>      JSON text exchanged between systems that are not part of a
>      closed ecosystem MUST be encoded using UTF-8 [RFC3629].
>
> This is a total contradiction, sorry.  I. Hate. JSON.
> But that does not help anyone.
>
> So i mean _if_ i would write such an RFC _i_ would not hammer your
> sentence on the table, but i would then simply refer to RFC 3629
> and say that implementors shall be prepared to convert the JSON
> standard (grrr) string .. to the UTF-8 standard?
>
> 5. also says
>
>    It is unlikely that anyone specifying a new data format would
>    choose to allow this character repertoire.
>
> And
>
>    A protocol based on JSON could be made more robust and
>    implementor-friendly by requiring that the contents of member
>    names and string values contain only Useful Assignables
>
> No.  Not me.  Sorry .. we are talking string data?
> I mean, with your restriction one (possibly) cannot even generate
> a protocol that carries around Linux/POSIX path names?  Except by
> mangling them to something likely non-reproducible (by leaving off
> "evil" characters, or converting them to a replacement character;
> which one, the Unicode one, or question mark?  Ah, it must be
> ASCII question mark because the Unicode replacement character is
> of the evil sort?).  Or have i misunderstood something ...
> which can very well be the truth, of course.
> So, even if you wipe away all of the above, a hint on replacement
> characters in a document that restricts the usable set of Unicode
> characters is well worth a thought.
>
> Thank you.
>
> --steffen
> |
> |Der Kragenbaer,                The moon bear,
> |der holt sich munter           he cheerfully and one by one
> |einen nach dem anderen runter  wa.ks himself off
> |(By Robert Gernhardt)