Re: [I18ndir] [art] Just uploaded draft-bray-unichars-03

Asmus Freytag <asmusf@ix.netcom.com> Sun, 10 September 2023 06:35 UTC

Message-ID: <d9d5dee0-24d1-54f0-dde9-4bb9ad2e56e7@ix.netcom.com>
Date: Sat, 09 Sep 2023 23:35:21 -0700
To: Tim Bray <tbray@textuality.com>, Steffen Nurpmeso <steffen@sdaoden.eu>
Cc: i18ndir@ietf.org, ART Area <art@ietf.org>
References: <CAHBU6is50TkpDsqXTp6WxdVSgE66j3gGHZ60ey2jFYbefaHFJw@mail.gmail.com> <20230909165843.GlTJy%steffen@sdaoden.eu> <CAHBU6iuixTeS=X1kccw11zEnHVG5tx9aHUC-pH00ociBmukhGQ@mail.gmail.com>
From: Asmus Freytag <asmusf@ix.netcom.com>
In-Reply-To: <CAHBU6iuixTeS=X1kccw11zEnHVG5tx9aHUC-pH00ociBmukhGQ@mail.gmail.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/i18ndir/Hl1fbhxDJjEm3OQE3_0DW6ilIKg>
Subject: Re: [I18ndir] [art] Just uploaded draft-bray-unichars-03

On 9/9/2023 4:44 PM, Tim Bray wrote:
> On Sep 9, 2023 at 9:58:43 AM, Steffen Nurpmeso <steffen@sdaoden.eu> wrote:
>>
>> In 2.2 i would not give the count on code point types.
>> Instead i would only give the problem statement "among Unicode
>> code point types .. are questionable".  This seems more generic.
>
> Left to myself I’d leave it as-is, but I don’t care that much, if 
> anyone else agrees with Steffen I don’t mind changing it.
>
I tend to disagree. Unicode explicitly defines 7 types and by giving 
that count you reinforce that these are predefined and not some generic 
typology.
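
For readers who want the seven types spelled out, here is a minimal 
sketch that classifies a code point by type. It assumes the mapping from 
General_Category values given in the Unicode Standard's "Types of Code 
Points" table; the function name `code_point_type` is made up for this 
illustration, and the result for unassigned code points depends on the 
Unicode version shipped with your Python.

```python
import unicodedata


def code_point_type(cp: int) -> str:
    """Return one of the seven Unicode code point types for cp (a sketch)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    # Noncharacters: U+FDD0..U+FDEF plus the last two code points of each plane.
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return "Noncharacter"
    gc = unicodedata.category(chr(cp))
    if gc == "Cs":
        return "Surrogate"
    if gc == "Co":
        return "Private-use"
    if gc == "Cc":
        return "Control"
    if gc in ("Cf", "Zl", "Zp"):
        return "Format"
    if gc == "Cn":
        # Unassigned and not a noncharacter: reserved for future assignment.
        return "Reserved"
    # Letters, marks, numbers, punctuation, symbols, and space separators.
    return "Graphic"
```

Counting the code points this reports as "Control" below U+0100 gives 
the 65 that come up later in this thread.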
>
>> In 2.2.2.2 i would not say "legacy controls", and that they are
>> "mostly obsolete".  ECMA-48 is very alive in at least the POSIX
>> aka Linux world, for many purposes, for example terminal
>> interaction.  "Likely to occur in data as a result of
>> a programming error"?  Any preformatted Unix manual page will come
>> with lots of CSI sequences, or backspace-based ones.
>> ASCII NUL is the base of ISO C-style strings.  In fact many
>> network protocols (not enough!!) still seem to use
>> KEY=VALUE\0KEY=VALUE\0\0 style transports.
>
> So, in section 23.1 of [UNICODE] it says "There are 65 code points set 
> aside in the Unicode Standard for compatibility with the C0 and C1 
> control codes defined in the ISO/IEC 2022 framework.”  Some shreds of 
> ISO 2022 practices may survive in legacy systems but I am pretty sure 
> they are not relevant to any protocol or data format the IETF might 
> work on. This clearly feels like “legacy” and “mostly obsolete” to me. 
> (For those who don’t know, ISO 2022 worked with the "ISO-Latin" 8859-1 
> code pages to shift between them in the middle of a string. It is now 
> forgotten for a good reason.) I’d never heard of ECMA 48 and I am 
> dubious that that use of control codes would be appropriate in 
> anything the IETF would specify these days.

Yes, for text data, these are legacy.

However, there are modern terminal implementations, and their protocols 
use ECMA-48. (There's an ISO standard that matches ECMA-48 and is cited 
in the Unicode Standard; I think it's ISO/IEC 6429.)

I keep suggesting that the way out is to acknowledge that some data 
streams need these code points. If your spec must include them, but you 
would like to avoid surrogates and noncharacters, then the answer is to 
construct your own subset by extending the Useful-Assignables with as 
many control codes as you need (in an explicit list).

That gets your draft out of the business of creating subsets for edge 
cases, while simultaneously staying relevant even for people writing 
such specs. (I'd see that as a win-win).
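
To make that concrete, here is a rough sketch of what "extend the base 
subset with an explicit list of controls" could look like. Everything in 
it is an assumption for illustration: the base subset here simply 
excludes surrogates, noncharacters, and all C0/C1 controls except tab, 
LF, and CR (it does not try to track unassigned code points, and it is 
not copied from the draft's ABNF), and the names `make_subset` and 
`BASE_CONTROLS` are hypothetical.

```python
from typing import Callable, Iterable

# Assumption: the only controls the base subset keeps are tab, LF, and CR.
BASE_CONTROLS = frozenset({0x09, 0x0A, 0x0D})


def is_control(cp: int) -> bool:
    # The 65 control code points: C0, DEL, and C1.
    return cp <= 0x1F or 0x7F <= cp <= 0x9F


def is_surrogate(cp: int) -> bool:
    return 0xD800 <= cp <= 0xDFFF


def is_noncharacter(cp: int) -> bool:
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def make_subset(extra_controls: Iterable[int] = ()) -> Callable[[str], bool]:
    """Build a conformance check for the base subset plus listed controls."""
    extra = frozenset(extra_controls)
    for cp in extra:
        if not is_control(cp):
            raise ValueError(f"U+{cp:04X} is not a control code")
    allowed_controls = BASE_CONTROLS | extra

    def conforms(text: str) -> bool:
        for ch in text:
            cp = ord(ch)
            if is_surrogate(cp) or is_noncharacter(cp):
                return False
            if is_control(cp) and cp not in allowed_controls:
                return False
        return True

    return conforms
```

A terminal-oriented spec might then call `make_subset({0x08, 0x1B})` to 
admit backspace and ESC (and thus ESC-introduced CSI sequences), while 
still rejecting NUL, surrogates, and noncharacters.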



>
>> In 5.:
>>
>>  [JSON..] It cannot be serialized into legal UTF-8, but many
>>  libraries will silently parse this and generate an ill-formed
>>  UTF-8 string. Implementors must be prepared to deal with these
>>  sorts of problematic code points.
>>
>> But RFC 3629 is very clear and says in 3. (being lengthy)
>
> You are correct, but the assertion in 5. is also empirically correct, 
> this is exactly what many (most?) existing libraries will do. 
> Presumably on the basis of Postel’s law?
>
>> So even the weird JSON "string" can be made valid UTF-8, one just
>> has to walk around the corner.  (Possibly.)
>
> In fact, does that happen? I haven’t seen it but I suppose it is 
> somewhat possible.
>
>> Sorry, but _I_ do not get that JSON supports _that_ "string",
>> RFC 8259, 7.:
>>
>>   To escape an extended character that is not in the Basic Multilingual
>>   Plane, the character is represented as a 12-character sequence,
>>   encoding the UTF-16 surrogate pair.
>
> Once again, you are probably correct, but the caution that 
> implementers will have to be “prepared to deal with” this is true, 
> because of what implementations do.  You are correct that it would 
> probably be better if JSON parsers were stricter.

The focus is subsets.

Data formats whose implementations allow (whether sanctioned or not) 
strings that are outside the repertoire of the subset (even outside some 
subset like the Useful-Assignables) basically allow the creation of 
ill-formed data.

And that should be made very explicit.

In the context of a syntax that allows an escape to represent a 
noncharacter, any data stream that contains that escape would be 
ill-formed with respect to the "useful-assignable" subset repertoire, 
and would need to be treated as such.
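
As an illustration of treating such data as ill-formed, here is a small 
sketch in Python. Python's standard `json` module decodes escapes such 
as `\ufffe` or a lone `\ud800` without complaint, so the check has to 
run on the decoded strings. The subset test is a stand-in (no 
surrogates, no noncharacters, no controls other than tab, LF, CR), and 
the names `loads_restricted` and `in_subset` are hypothetical; raising 
`ValueError` is just one possible "definite" reaction.

```python
import json


def in_subset(cp: int) -> bool:
    """Stand-in repertoire check; an assumption, not the draft's definition."""
    if 0xD800 <= cp <= 0xDFFF:                                # surrogates
        return False
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:     # noncharacters
        return False
    if (cp <= 0x1F or 0x7F <= cp <= 0x9F) and cp not in (0x09, 0x0A, 0x0D):
        return False                                          # other controls
    return True


def check(value, path="$"):
    """Walk a decoded JSON value and reject strings outside the subset."""
    if isinstance(value, str):
        for ch in value:
            if not in_subset(ord(ch)):
                raise ValueError(f"{path}: U+{ord(ch):04X} is outside the subset")
    elif isinstance(value, list):
        for i, item in enumerate(value):
            check(item, f"{path}[{i}]")
    elif isinstance(value, dict):
        for key, item in value.items():
            check(key, f"{path} (key)")
            check(item, f"{path}.{key}")


def loads_restricted(text: str):
    """Parse JSON, then fail hard if any string violates the restriction."""
    value = json.loads(text)
    check(value)
    return value
```

A properly paired escape such as `"\ud83d\ude00"` decodes to a single 
supplementary code point and passes; the lone surrogate and the 
noncharacter escape are rejected rather than passed along.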


>
>> And then in 8.
>>
>>  8.  String and Character Issues
>>  8.1.  Character Encoding
>>     JSON text exchanged between systems that are not part of a
>>     closed ecosystem MUST be encoded using UTF-8 [RFC3629].
>
> As Rob Sayre said above, the proposed document probably has to address 
> the issue of JSON escapes and emphasize that they are not relevant to 
> code point subsets.
No, they are. If you are in a JSON-based environment, but have restricted 
your repertoire, then even if JSON allows an escape, it's invalid if it 
violates your restriction. And your new specification SHOULD require 
something definite and drastic to happen in that case.
>
>
A./