Re: [I18ndir] [art] Just uploaded draft-bray-unichars-03

Steffen Nurpmeso <steffen@sdaoden.eu> Sat, 09 September 2023 17:15 UTC

Date: Sat, 09 Sep 2023 18:58:43 +0200
Author: Steffen Nurpmeso <steffen@sdaoden.eu>
From: Steffen Nurpmeso <steffen@sdaoden.eu>
To: Tim Bray <tbray@textuality.com>
Cc: i18ndir@ietf.org, ART Area <art@ietf.org>, Steffen Nurpmeso <steffen@sdaoden.eu>
Message-ID: <20230909165843.GlTJy%steffen@sdaoden.eu>
In-Reply-To: <CAHBU6is50TkpDsqXTp6WxdVSgE66j3gGHZ60ey2jFYbefaHFJw@mail.gmail.com>
References: <CAHBU6is50TkpDsqXTp6WxdVSgE66j3gGHZ60ey2jFYbefaHFJw@mail.gmail.com>
Mail-Followup-To: Tim Bray <tbray@textuality.com>, i18ndir@ietf.org, ART Area <art@ietf.org>, Steffen Nurpmeso <steffen@sdaoden.eu>
User-Agent: s-nail v14.9.24-507-g0e7e3e8c46
OpenPGP: id=EE19E1C1F2F7054F8D3954D8308964B51883A0DD; url=https://ftp.sdaoden.eu/steffen.asc; preference=signencrypt
BlahBlahBlah: Any stupid boy can crush a beetle. But all the professors in the world can make no bugs.
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
Archived-At: <https://mailarchive.ietf.org/arch/msg/i18ndir/nz6mIPrK_RJNzcxtxvAV637sssg>

Tim Bray wrote in
 <CAHBU6is50TkpDsqXTp6WxdVSgE66j3gGHZ60ey2jFYbefaHFJw@mail.gmail.com>:
 |See https://www.ietf.org/archive/id/draft-bray-unichars-03.html
 |
 |A bunch of minor corrections and improvements, thanks to everyone for that,
 |especially James Manger for noticing that the ABNF was entirely wrong in
 |one place.
 |
 |The word “useless” has been replaced by “legacy”.
 |
 |I think the feedback was pretty clear that the draft needed to be more
 |opinionated; just because we document the existence of the default JSON
 |repertoire (“all the code points”) doesn’t mean that anyone should use it
 |in the present or future. So, introduced a new section “Refining Character
 |Repertoires” to highlight those issues and offer a suggestion.

In 2.2 i would not give the count of code point types.
Instead i would only give the problem statement "among Unicode
code point types .. are questionable".  This seems more generic.

In 2.2.2.2 i would not say "legacy controls", nor that they are
"mostly obsolete".  ECMA-48 is very much alive in at least the POSIX
(aka Linux) world, for many purposes, for example terminal
interaction.  "Likely to occur in data as a result of
a programming error"?  Any preformatted Unix manual page comes
with lots of CSI sequences, or backspace-based overstrike sequences.
ASCII NUL is the terminator of ISO C-style strings.  In fact many
network protocols (not enough!!) still seem to use
KEY=VALUE\0KEY=VALUE\0\0 style transports.
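
To make the two cases above concrete, here is a minimal sketch in
Python.  The function names (parse_kv_block, strip_overstrike) and
the exact framing rules are assumptions made for illustration only;
they are not taken from any particular protocol or from the draft.
The point is merely that NUL and backspace appear here as
intentional data, not as the result of a programming error.

```python
import re


def parse_kv_block(data: bytes) -> dict[str, str]:
    """Parse a KEY=VALUE\\0KEY=VALUE\\0\\0 block (hypothetical framing)."""
    if not data.endswith(b"\x00\x00"):
        raise ValueError("block must end with an empty record (two NULs)")
    body = data[:-2]
    result: dict[str, str] = {}
    if not body:
        return result
    # Each record is terminated by a NUL; the key ends at the first '='.
    for record in body.split(b"\x00"):
        key, sep, value = record.partition(b"=")
        if not sep or not key:
            raise ValueError(f"malformed record: {record!r}")
        result[key.decode("utf-8")] = value.decode("utf-8")
    return result


def strip_overstrike(text: str) -> str:
    """Remove backspace overstrikes as found in preformatted manual pages.

    Bold is classically rendered as 'X\\bX' and underline as '_\\bX';
    dropping every "character + backspace" pair leaves the plain text.
    """
    return re.sub(r".\x08", "", text)


if __name__ == "__main__":
    print(parse_kv_block(b"HOME=/root\x00TERM=xterm\x00\x00"))
    print(strip_overstrike("b\bbo\bol\bld\bd"))
```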

In 5.:

  [JSON..] It cannot be serialized into legal UTF-8, but many
  libraries will silently parse this and generate an ill-formed
  UTF-8 string. Implementors must be prepared to deal with these
  sorts of problematic code points.

But RFC 3629 is very clear; section 3 says (quoted at length)

   The definition of UTF-8 prohibits encoding character numbers between
   U+D800 and U+DFFF, which are reserved for use with the UTF-16
   encoding form (as surrogate pairs) and do not directly represent
   characters.  When encoding in UTF-8 from UTF-16 data, it is necessary
   to first decode the UTF-16 data to obtain character numbers, which
   are then encoded in UTF-8 as described above.  This contrasts with
   CESU-8 [CESU-8], which is a UTF-8-like encoding that is not meant for
   ...

So even the weird JSON "string" can be made valid UTF-8, one just
has to take the detour of decoding the surrogates first.  (Possibly.)
Sorry, but _I_ do not see that JSON supports _that_ "string" at all;
RFC 8259, 7.:

   To escape an extended character that is not in the Basic Multilingual
   Plane, the character is represented as a 12-character sequence,
   encoding the UTF-16 surrogate pair.

And then in 8.

  8.  String and Character Issues
  8.1.  Character Encoding
     JSON text exchanged between systems that are not part of a
     closed ecosystem MUST be encoded using UTF-8 [RFC3629].

This is a total contradiction, sorry.  I. Hate. JSON.
But that does not help anyone.

So i mean, _if_ i were to write such an RFC, _i_ would not hammer your
sentence on the table; i would instead simply refer to RFC 3629
and say that implementors shall be prepared to convert the JSON
standard (grrr) string .. to the UTF-8 standard?
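
A small sketch of such a conversion, using Python's standard json
module as the example parser.  The helper name json_string_to_utf8
is hypothetical.  It shows the two cases side by side: a proper
surrogate pair in \u escapes decodes to one code point and then
encodes to well-formed UTF-8, whereas a lone surrogate escape is
accepted by this parser but has no UTF-8 form.  Other JSON
libraries may behave differently.

```python
import json
from typing import Optional


def json_string_to_utf8(json_text: str) -> Optional[bytes]:
    """Parse a JSON string literal and encode the result as strict UTF-8.

    Returns None when the parsed string contains a surrogate code point
    that UTF-8 (RFC 3629) does not permit.
    """
    value = json.loads(json_text)
    try:
        return value.encode("utf-8")
    except UnicodeEncodeError:
        return None


if __name__ == "__main__":
    # Surrogate pair for U+1F600, written as the 12-character escape.
    print(json_string_to_utf8(r'"\ud83d\ude00"'))
    # Lone surrogate: parsed without complaint, but not encodable.
    print(json_string_to_utf8(r'"\udead"'))
```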

5. also says

   It is unlikely that anyone specifying a new data format would
   choose to allow this character repertoire.

And

   A protocol based on JSON could be made more robust and
   implementor-friendly by requiring that the contents of member
   names and string values contain only Useful Assignables

No.  Not me.  Sorry .. we are talking string data?
I mean, with your restriction one (possibly) cannot even define
a protocol that carries around Linux/POSIX path names?  Except by
mangling them into something likely non-reproducible (by leaving off
"evil" characters, or converting them to a replacement character;
which one, the Unicode one, or question mark?  Ah, it must be
ASCII question mark because the Unicode replacement character is
of the evil sort?).  Or have i misunderstood something ...
which may very well be the case, of course.
So, even if you wipe away all of the above, a hint on replacement
characters in a document that restricts the usable set of Unicode
characters is well worth a thought.
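
To illustrate the path name concern, a hedged sketch in Python.
POSIX file names are byte strings and need not be valid UTF-8.
Python's "surrogateescape" error handler (PEP 383) is just one
existing convention for carrying such bytes through a text string,
chosen here as an assumption; neither the draft nor this mail
prescribes it.  The helper names are hypothetical.  Decoding with a
replacement character loses the original name, while the reversible
form relies on lone surrogates, which is exactly the kind of code
point a restricted repertoire would forbid.

```python
import json


def decode_lossy(raw: bytes) -> str:
    # Invalid bytes become U+FFFD; the original byte value is gone.
    return raw.decode("utf-8", errors="replace")


def decode_reversible(raw: bytes) -> str:
    # Invalid bytes become lone surrogates U+DC80..U+DCFF (PEP 383).
    return raw.decode("utf-8", errors="surrogateescape")


def roundtrip_via_json(name: str) -> bytes:
    # Send the name through a JSON string and turn it back into bytes.
    wire = json.dumps(name)
    return json.loads(wire).encode("utf-8", errors="surrogateescape")


if __name__ == "__main__":
    raw = b"report-\xff.txt"  # a legal POSIX file name, not valid UTF-8
    print(roundtrip_via_json(decode_lossy(raw)) == raw)
    print(roundtrip_via_json(decode_reversible(raw)) == raw)
```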

Thank you.

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)