Re: [I18ndir] [art] Just uploaded draft-bray-unichars-03

Asmus Freytag <asmusf@ix.netcom.com> Sun, 10 September 2023 08:43 UTC

Message-ID: <ff2df364-ecc6-d4f5-2f87-ad94295f102c@ix.netcom.com>
Date: Sun, 10 Sep 2023 01:43:07 -0700
To: Tim Bray <tbray@textuality.com>
Cc: Steffen Nurpmeso <steffen@sdaoden.eu>, i18ndir@ietf.org, ART Area <art@ietf.org>, Rob Sayre <sayrer@gmail.com>
References: <CAHBU6is50TkpDsqXTp6WxdVSgE66j3gGHZ60ey2jFYbefaHFJw@mail.gmail.com> <20230909165843.GlTJy%steffen@sdaoden.eu> <CAHBU6iuixTeS=X1kccw11zEnHVG5tx9aHUC-pH00ociBmukhGQ@mail.gmail.com> <d9d5dee0-24d1-54f0-dde9-4bb9ad2e56e7@ix.netcom.com> <CAChr6Sygs5=fyQ7ZJSVoV5EY9hDZWRkj78r9yH2539vtNTT=aQ@mail.gmail.com>
From: Asmus Freytag <asmusf@ix.netcom.com>
In-Reply-To: <CAChr6Sygs5=fyQ7ZJSVoV5EY9hDZWRkj78r9yH2539vtNTT=aQ@mail.gmail.com>
Authentication-Results: earthlink-vadesecure.net; auth=pass smtp.auth=asmusf@ix.netcom.com smtp.mailfrom=asmusf@ix.netcom.com;
Archived-At: <https://mailarchive.ietf.org/arch/msg/i18ndir/VBPVNquRABnzviiH81qW-EoIu2A>
Subject: Re: [I18ndir] [art] Just uploaded draft-bray-unichars-03

On 9/10/2023 12:05 AM, Rob Sayre wrote:
>
>
>>>     And then in 8.
>>>
>>>      8.  String and Character Issues
>>>      8.1.  Character Encoding
>>>         JSON text exchanged between systems that are not part of a
>>>         closed ecosystem MUST be encoded using UTF-8 [RFC3629].
>>
>>     As Rob Sayre said above, the proposed document probably has to
>>     address the issue of JSON escapes and emphasize that they are not
>>     relevant to code point subsets.
>     No they are. If you are in a JSON-based environment, but have
>     restricted your repertoire, then even if JSON allows an escape,
>     it's invalid if it violates your restriction. And your new
>     specification SHOULD require something definite and drastic to
>     happen in that case.
>
>
> That is exactly the point. The current draft mentions "Transformation 
> Formats", but doesn't mention that these transformation formats can 
> and do further encode questionable Unicode via escape sequences. 
> The draft should mention it.
>
> Unfortunate, but true. So, you can have a perfect UTF-8 document that 
> represents a bunch of unpaired surrogate code points.

The term "transformation format" is not formally defined in the Unicode
Standard. It is noted in chapter 3 as being a (somewhat ambiguous) alias
for two other terms, encoding form and encoding scheme. ("Schemes"
have a fixed byte order for code units and "forms" are generic mappings
to an integral type irrespective of byte order.)

The key is that the definition of both schemes and forms specifies them 
as a mapping from a single code point (actually: scalar value) to a 
"sequence of code units", while the term "transformation format" as
defined in the Unicode glossary is a more general mapping of character 
sequence to sequence of code units.

    Transformation Format
    <https://www.unicode.org/glossary/#transformation_format>. A
    mapping from a coded character sequence to a unique sequence of code
    units (typically bytes).
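
To make the form/scheme distinction concrete, here is a minimal Python sketch. It is only an illustration, not anything taken from the Unicode Standard; the helper name `code_units` is made up for this example. An encoding form maps one scalar value to a sequence of code units, and an encoding scheme additionally fixes how those code units are serialized into bytes.

```python
import struct

def code_units(ch: str, form: str) -> list[int]:
    """Code units for ONE scalar value under an encoding form.

    Hypothetical helper for illustration; `form` is "utf-8", "utf-16" or "utf-32".
    """
    if len(ch) != 1:
        raise ValueError("expected exactly one code point")
    if form == "utf-8":
        # UTF-8 code units are 8 bits wide, so bytes and code units coincide.
        return list(ch.encode("utf-8"))
    if form == "utf-16":
        # Serialize little-endian, then read back as 16-bit integers.
        raw = ch.encode("utf-16-le")
        return list(struct.unpack("<%dH" % (len(raw) // 2), raw))
    if form == "utf-32":
        raw = ch.encode("utf-32-le")
        return list(struct.unpack("<I", raw))
    raise ValueError("unknown encoding form: " + form)

ch = "\U0001F600"
print([hex(u) for u in code_units(ch, "utf-8")])
print([hex(u) for u in code_units(ch, "utf-16")])
print([hex(u) for u in code_units(ch, "utf-32")])

# Encoding schemes: same code units, but the byte order is now fixed.
print(ch.encode("utf-16-be").hex(), ch.encode("utf-16-le").hex())
```

Note that the input in every case is a single scalar value; nothing in this mapping takes a multi-character sequence such as an escape as its source.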

The draft currently states:

> Unicode describes a variety of "transformation formats", ways to 
> encode code points in bytes of computer memory. A survey of 
> transformation formats is beyond the scope of this document.

If we accept the term "transformation format" as an alias for the terms 
that are actually defined in the standard (encoding form or scheme) then 
that usage of "transformation format" _excludes_ escape sequences and
similar mappings where the source of the mapping can be more than one 
code point. The UTF-8, UTF-16 and UTF-32 encoding forms will be formally 
called out as "standard encoding forms" in the next revision of the 
Unicode Standard.

There are non-standard ones, such as UTF-7, CESU-8 and whatnot, but none 
of them have escape sequences either.

While the in-memory form of a string syntax with an escape sequence is
technically an example of a "transformation format" as that term is
described in the glossary, such a syntax is neither an encoding form
nor an encoding scheme, and those are the things the Unicode Standard
actually describes. The reason is that, under those definitions, the
source of the mapping is a single code point, whereas an escape
sequence spans multiple code points and is therefore not a direct
representation of a character under one of the standard encoding forms.
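
A small Python sketch of the situation Rob describes, offered only as an illustration: the behavior shown is that of Python's standard `json` module, which other JSON parsers need not share, and `has_surrogate` is a helper invented for this example. The bytes on the wire are well-formed UTF-8, yet the escape layer above the encoding form yields an unpaired surrogate.

```python
import json

def has_surrogate(s: str) -> bool:
    """True if the string contains any surrogate code point (U+D800..U+DFFF)."""
    return any(0xD800 <= ord(c) <= 0xDFFF for c in s)

# The document is pure ASCII on the wire, hence well-formed UTF-8.
wire = b'{"s": "\\ud800"}'
text = wire.decode("utf-8")  # decodes without error

# Python's json module accepts the lone-surrogate escape as written.
value = json.loads(text)["s"]
print(has_surrogate(value))

# A correctly paired escape is combined into a single scalar value.
paired = json.loads('"\\ud83d\\ude00"')
print(has_surrogate(paired), hex(ord(paired)))

# The lone surrogate cannot be written back out through the UTF-8 encoding form.
try:
    value.encode("utf-8")
except UnicodeEncodeError as exc:
    print("not encodable:", exc.reason)
```

The UTF-8 layer never sees the surrogate; it only sees the six ASCII characters of the escape, which is why the restriction has to be enforced at the syntax level.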

Rob's comments make clear that using the term "transformation formats" 
in the way quoted above can lead to misunderstandings.

I sympathize with the desire not to bring in the entire formalism, so 
the following suggestion might suffice to limit the sense of 
"transformation format" that is intended here.

> Unicode describes a variety of "transformation formats", ways to 
> *uniquely* encode *each scalar value* into bytes of computer memory.
> A survey of transformation formats is beyond the scope of this document.

If you would like to avoid the term "scalar value", it is synonymous
with "non-surrogate code point".
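
Expressed as a check, purely as an illustrative Python sketch (the function name `is_scalar_value` is made up here; the ranges are the code point range U+0000..U+10FFFF minus the surrogate range U+D800..U+DFFF):

```python
def is_scalar_value(cp: int) -> bool:
    """True if cp is a Unicode scalar value, i.e. a non-surrogate code point."""
    in_codespace = 0 <= cp <= 0x10FFFF
    is_surrogate = 0xD800 <= cp <= 0xDFFF
    return in_codespace and not is_surrogate

for cp in (0x41, 0xD7FF, 0xD800, 0xDFFF, 0xE000, 0x10FFFF, 0x110000):
    print(f"U+{cp:04X}", is_scalar_value(cp))
```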

A.


>
> thanks,
> Rob
>
>