Re: [I18ndir] [art] Just uploaded draft-bray-unichars-03

Asmus Freytag <asmusf@ix.netcom.com> Sat, 09 September 2023 19:42 UTC

To: i18ndir@ietf.org, Tim Bray <tbray@textuality.com>, Paul Hoffman <paul.hoffman@vpnc.org>
References: <CAHBU6is50TkpDsqXTp6WxdVSgE66j3gGHZ60ey2jFYbefaHFJw@mail.gmail.com> <20230909165843.GlTJy%steffen@sdaoden.eu>
From: Asmus Freytag <asmusf@ix.netcom.com>
In-Reply-To: <20230909165843.GlTJy%steffen@sdaoden.eu>
Archived-At: <https://mailarchive.ietf.org/arch/msg/i18ndir/9ElUfqJ6Db9BeW_Aluxjs4jVx_s>

On 9/9/2023 9:58 AM, Steffen Nurpmeso wrote:
> Tim Bray wrote in
>   <CAHBU6is50TkpDsqXTp6WxdVSgE66j3gGHZ60ey2jFYbefaHFJw@mail.gmail.com>:
>   |See https://www.ietf.org/archive/id/draft-bray-unichars-03.html
>   |
>   |A bunch of minor corrections and improvements, thanks to everyone for that,
>   |especially James Manger for noticing that the ABNF was entirely wrong in
>   |one place.
>   |
>   |The word “useless” has been replaced by “legacy”.
>   |
>   |I think the feedback was pretty clear that the draft needed to be more
>   |opinionated; just because we document the existence of the default JSON
>   |repertoire (“all the code points”) doesn’t mean that anyone should use it
>   |in the present or future. So, introduced a new section “Refining Character
>   |Repertoires” to highlight those issues and offer a suggestion.
>
> In 2.2 i would not give the count on code point types.
> Instead i would only give the problem statement "among Unicode
> code point types .. are questionable".  This seems more generic.
Unicode lists 7 code point types, and using a count makes clear that
those types (and no other classification) are meant.
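
To make the seven types concrete, here is a minimal sketch that maps a
code point to its type from its General_Category. The function name is
hypothetical, and the result for "Reserved" depends on the Unicode
version shipped with the Python runtime, so treat it as an illustration
rather than a reference implementation.

```python
import unicodedata


def code_point_type(cp: int) -> str:
    """Return one of the seven basic code point types for cp (illustrative)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    # Noncharacters: U+FDD0..U+FDEF plus the last two code points of each plane.
    # They share General_Category Cn with reserved code points, so test first.
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return "Noncharacter"
    category = unicodedata.category(chr(cp))
    if category == "Cc":
        return "Control"
    if category in ("Cf", "Zl", "Zp"):
        return "Format"
    if category == "Co":
        return "Private-use"
    if category == "Cs":
        return "Surrogate"
    if category == "Cn":
        # Unassigned in the Unicode version known to this Python build.
        return "Reserved"
    return "Graphic"
```

For example, code_point_type(0xD800) yields "Surrogate" and
code_point_type(0xFFFE) yields "Noncharacter".
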
>
> In 2.2.2.2 i would not say "legacy controls", and that they are
> "mostly obsolete".  ECMA-48 is very alive in at least the POSIX
> aka Linux world, for many purposes, for example terminal
> interaction.  "Likely to occur in data as a result of
> a programming error"?  Any preformatted Unix manual page will come
> with lots of CSI sequences, or backspace-based ones.
> ASCII NUL is the base of ISO C-style strings.  In fact many
> network protocols (not enough!!) still seem to use
> KEY=VALUE\0KEY=VALUE\0\0 style transports.
There is a tension between the needs of protocols for text data and 
binary data (including serialized structured data containing text fields).

As I commented earlier, the draft could be improved if it had better
guidance on how to specify "text plus" style repertoires. Some of them
may need more of the controls than either the XML subset or the new
subset provides. It would not be useful to have "canned" subsets for each
and every possible permutation, but having a specification declare that
it uses the "useful assignables" repertoire augmented by something like
"the following set of ..." individually listed code points would go a
long way toward retaining the benefit of a common approach to
noncharacters and surrogates.
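
As a rough sketch of what such a declaration could look like in code:
the base set below is only my approximation (all code points except
surrogates, noncharacters, and C0/C1 controls other than tab, LF and
CR), not the draft's normative definition, and the names
in_base_repertoire and make_repertoire are hypothetical.

```python
from typing import Callable, Iterable, Optional


def is_surrogate(cp: int) -> bool:
    return 0xD800 <= cp <= 0xDFFF


def is_noncharacter(cp: int) -> bool:
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def in_base_repertoire(cp: int) -> bool:
    """ASSUMPTION: an approximation of the base subset, not the draft's ABNF."""
    if not 0 <= cp <= 0x10FFFF:
        return False
    if is_surrogate(cp) or is_noncharacter(cp):
        return False
    if cp < 0x20:
        return cp in (0x09, 0x0A, 0x0D)  # tab, LF, CR
    if 0x7F <= cp <= 0x9F:  # DEL and the C1 controls
        return False
    return True


def make_repertoire(extra: Iterable[int] = ()) -> Callable[[str], Optional[int]]:
    """Build a checker for 'base repertoire plus the listed code points'."""
    allowed_extra = frozenset(extra)
    for cp in allowed_extra:
        # Keep the common treatment of surrogates and noncharacters intact.
        if is_surrogate(cp) or is_noncharacter(cp):
            raise ValueError(f"U+{cp:04X} may not be added to the repertoire")

    def first_violation(text: str) -> Optional[int]:
        """Return the index of the first disallowed code point, or None."""
        for index, ch in enumerate(text):
            cp = ord(ch)
            if not (in_base_repertoire(cp) or cp in allowed_extra):
                return index
        return None

    return first_violation


# A hypothetical "text plus" repertoire that also admits NUL, BS and ESC.
terminal_text = make_repertoire(extra=[0x00, 0x08, 0x1B])
```

The point of the design is that a specification only has to list its
additions; the exclusion of surrogates and noncharacters stays the same
for everyone.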

Beyond that, one of the other subsets might well be what you would need; 
the draft does list them for a reason.

>
> In 5.:
>
>    [JSON..] It cannot be serialized into legal UTF-8, but many
>    libraries will silently parse this and generate an ill-formed
>    UTF-8 string. Implementors must be prepared to deal with these
>    sorts of problematic code points.
>
> But RFC 3629 is very clear and says in 3. (being lengthy)
>
>     The definition of UTF-8 prohibits encoding character numbers between
>     U+D800 and U+DFFF, which are reserved for use with the UTF-16
>     encoding form (as surrogate pairs) and do not directly represent
>     characters.  When encoding in UTF-8 from UTF-16 data, it is necessary
>     to first decode the UTF-16 data to obtain character numbers, which
>     are then encoded in UTF-8 as described above.  This contrasts with
>     CESU-8 [CESU-8], which is a UTF-8-like encoding that is not meant for
>     ...
>
> So even the weird JSON "string" can be made valid UTF-8, one just
> has to walk around the corner.  (Possibly.)
> Sorry, but _I_ do not get that JSON supports _that_ "string",
> RFC 8259, 7.:
>
>     To escape an extended character that is not in the Basic Multilingual
>     Plane, the character is represented as a 12-character sequence,
>     encoding the UTF-16 surrogate pair.
>
> And then in 8.
>
>    8.  String and Character Issues
>    8.1.  Character Encoding
>       JSON text exchanged between systems that are not part of a
>       closed ecosystem MUST be encoded using UTF-8 [RFC3629].
>
> This is a total contradiction, sorry.  I. Hate. JSON.
> But that does not help anyone.
>
> So i mean _if_ i would write such a RFC _i_ would not hammer your
> sentence on the table, but i would then simply refer to RFC 3629
> and say that implementors shall be prepared to convert the JSON
> standard (grrr) string .. to the UTF-8 standard?
>
> 5. also says
>
>     It is unlikely that anyone specifying a new data format would
>     choose to allow this character repertoire.
>
> And
>
>     A protocol based on JSON could be made more robust and
>     implementor-friendly by requiring that the contents of member
>     names and string values contain only Useful Assignables
>
> No.  Not me.  Sorry .. we are talking string data?
> I mean, with your restriction one (possibly) cannot even generate
> a protocol that carries around Linux/POSIX path names?  Except by
> mangling them to something likely non-reproducible (by leaving off
> "evil" characters, or converting them to a replacement character;
> which one, the Unicode one, or question mark?  Ah, it must be
> ASCII question mark because the Unicode replacement character is
> of the evil sort?).  Or have i misunderstood something ...
> which can very well be the truth, of course.
> So, even if you wipe away all of the above, a hint on replacement
> characters in a document that restricts the usable set of Unicode
> characters is well worth a thought.
>
I understood this comment to mean that there may be reasons not to use
the full flexibility of the JSON repertoire in a given situation (in
many situations, perhaps). And that seems fine. It's not that dissimilar
from a specification that uses an XML schema but then defines further
constraints on the values of elements or attributes than can be
expressed in the schema. In such a case, a file could be valid under the
schema, but not valid under the full specification.
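
The same two-layer situation can be sketched for JSON. The example
below uses Python's standard json module, which accepts an escaped lone
surrogate, as one instance of a library that parses such input without
complaint. The stricter rule (no lone surrogates in member names or
string values) and the function names are assumptions made for
illustration only.

```python
import json
from typing import Any, Iterator, Tuple


def has_lone_surrogate(text: str) -> bool:
    # json.loads combines valid \uD8xx\uDCxx pairs into one code point,
    # so any surrogate still present in the result is unpaired.
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in text)


def strings_in(value: Any) -> Iterator[str]:
    """Yield every member name and string value in a parsed JSON value."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, list):
        for item in value:
            yield from strings_in(item)
    elif isinstance(value, dict):
        for key, item in value.items():
            yield key
            yield from strings_in(item)


def check_document(raw: str) -> Tuple[bool, bool]:
    """Return (parses as JSON, also meets the stricter repertoire rule)."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return (False, False)
    meets_rule = not any(has_lone_surrogate(s) for s in strings_in(parsed))
    return (True, meets_rule)
```

With this sketch, '{"name": "\\ud800"}' parses but fails the stricter
rule, while '{"name": "\\ud83d\\ude00"}' passes both layers.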

If, on the other hand, you need to write a protocol that must be able
to transport any string that is conformant to Unicode, and you have no
control over the source of your data, then you are better off with a
subset that is more comprehensive.

The draft could be improved by discussing and contrasting these
scenarios more explicitly.

A./