Re: [I18ndir] [art] Just uploaded draft-bray-unichars-03

Asmus Freytag <asmusf@ix.netcom.com> Sun, 10 September 2023 06:18 UTC

To: Tim Bray <tbray@textuality.com>, Rob Sayre <sayrer@gmail.com>
Cc: i18ndir@ietf.org, ART Area <art@ietf.org>, Steffen Nurpmeso <steffen@sdaoden.eu>
References: <CAHBU6is50TkpDsqXTp6WxdVSgE66j3gGHZ60ey2jFYbefaHFJw@mail.gmail.com> <20230909165843.GlTJy%steffen@sdaoden.eu> <CAChr6Sz=rMqwp3GOoqGgSsxE8Pqe3GCTqaOLpBO=YN+7v1Ui8Q@mail.gmail.com> <CAHBU6itKdzNdEpvq8m2vGmvFtRKDSeSvAaLM0CFqa3aQJUoheg@mail.gmail.com>
From: Asmus Freytag <asmusf@ix.netcom.com>
In-Reply-To: <CAHBU6itKdzNdEpvq8m2vGmvFtRKDSeSvAaLM0CFqa3aQJUoheg@mail.gmail.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/i18ndir/pCd6AvRvtvBUjaiCwcOtBJvhmnA>
Subject: Re: [I18ndir] [art] Just uploaded draft-bray-unichars-03

On 9/9/2023 4:50 PM, Tim Bray wrote:
> On Sep 9, 2023 at 1:56:47 PM, Rob Sayre <sayrer@gmail.com> wrote:
>> 5. Refining Character Repertoires
>>
>> The IETF typically uses well-known data formats such as JSON, I-JSON, 
>> CBOR, YAML, and XML. These formats have default character 
>> repertoires. For example, JSON allows member names and string values 
>> to include any Unicode code points, including all the problematic 
>> types; the following is a legal JSON document:
>
> I’m OK with Rob’s redraft, if only to lose the fuzzy word 
> “typically”.  Anyone object?

Unless there are obvious exceptions that would otherwise have to be
covered, I hate using weasel words myself.

Sometimes "almost always" is a better fit than "typically"; again,
that depends on whether it fits the facts.

>> {"example": "\u0000\u0089\uDEAD\uD9BF\uDFFF"}
>>
>> The value of the "example" field contains the C0 Control NUL, the C1 
>> Control "CHARACTER TABULATION WITH JUSTIFICATION", an unpaired 
>> surrogate, and the noncharacter U+7FFFF. It is unlikely to be useful 
>> as the value of a text field. It cannot be serialized into legal 
>> UTF-8, but many libraries will silently parse this and generate an 
>> ill-formed UTF-8 string. Implementors must be prepared to deal with 
>> these sorts of problematic code points.
>>
>> [ The first part, "unlikely to be useful as the value of a text 
>> field", is good. But, the next part mixes "legal" and "ill-formed", 
>> and I don't think that is a good idea. There is still a lowercase 
>> requirement after that, and I think I disagree. Implementors do not 
>> have to be "prepared to deal with these sorts of problematic code 
>> points". Maybe: "Some messages will contain these 
>> problematic code points". That is true, but you don't have to deal 
>> with them. ]
>
> Yeah, I’m pretty sure you do, but by “deal with” I include rejecting 
> such input docs, e.g. by raising a well-known exception with a helpful 
> error message.  I guess that should be spelled out explicitly?

I agree that there's a difference between somehow "doing the right 
thing" when input is broken and various ways of "rejecting" ill-formed 
input. I wrote a lengthy comment about this in a message that didn't 
have you on the CC, so I don't know whether it got to you (I later 
forwarded some).

A strong recommendation on how to process ill-formed input is
beneficial from a security perspective. Look up what Unicode says to do
with ill-formed UTF-8 as an example of what to do when you can't throw
an exception.
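
As a minimal sketch of the two options (reject, or repair when throwing
is not possible), the Python below parses the draft's example document,
written here with `\u0089` and the surrogate pair `\uD9BF\uDFFF`
standing in for U+7FFFF; those escapes are an assumption about what the
example intends. The helper names `strict_utf8` and `lenient_utf8` are
hypothetical, and replacing an unpaired surrogate with U+FFFD
REPLACEMENT CHARACTER is one reading of the Unicode guidance, not the
only possible policy.

```python
import json

# Example document; the escapes are an assumption about the intended form.
DOC = r'{"example": "\u0000\u0089\uDEAD\uD9BF\uDFFF"}'


def strict_utf8(text: str) -> bytes:
    """Reject: raises UnicodeEncodeError if text holds an unpaired surrogate."""
    return text.encode("utf-8")


def lenient_utf8(text: str) -> bytes:
    """Repair: swap each unpaired surrogate for U+FFFD, then encode."""
    repaired = "".join(
        "\ufffd" if 0xD800 <= ord(ch) <= 0xDFFF else ch for ch in text
    )
    return repaired.encode("utf-8")


# Python's json module accepts the document without complaint.
value = json.loads(DOC)["example"]

try:
    strict_utf8(value)
    rejected = False
except UnicodeEncodeError:
    rejected = True  # a caller would report a helpful error here
```

Either path avoids emitting ill-formed UTF-8; which one applies depends
on whether the caller is able to surface an error.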


>
>> It is unlikely that anyone specifying a new data format would choose 
>> to allow this character repertoire.
>>
>> [ Instead: The JSON character repertoire is too permissive, so it's 
>> best for new specifications to require that the contents of member 
>> names and string values contain only Useful Assignables (see Section 
>> 4.2). ]
>
> I prefer the current language. Does anyone prefer Rob’s, or want to
> propose another option?

I think that the proposed language has the advantage of being more
concrete. That makes it easier to follow and to remember (unless, of
course, the additional detail is incorrect).

I think "anyone specifying a new data format built on JSON should/SHOULD 
limit the contents...." might be a way to grab the best of both.
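
To make "limit the contents" concrete, here is a rough sketch of a
check that a format built on JSON could run over member names and
string values. The set of excluded code points is an assumption pieced
together from the problematic types named in this thread (surrogates,
C0 controls other than tab, LF and CR, DEL and the C1 controls, and
noncharacters); it is not the draft's definition from Section 4.2. The
function names are hypothetical.

```python
import json


def is_problematic(cp: int) -> bool:
    """True for code points assumed here to be outside the repertoire."""
    if 0xD800 <= cp <= 0xDFFF:                      # surrogates
        return True
    if cp < 0x20 and cp not in (0x09, 0x0A, 0x0D):  # C0 controls, minus tab/LF/CR
        return True
    if 0x7F <= cp <= 0x9F:                          # DEL and C1 controls
        return True
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:  # noncharacters
        return True
    return False


def text_ok(text: str) -> bool:
    return not any(is_problematic(ord(ch)) for ch in text)


def json_ok(node) -> bool:
    """Walk a parsed JSON value, checking every member name and string."""
    if isinstance(node, str):
        return text_ok(node)
    if isinstance(node, dict):
        return all(text_ok(k) and json_ok(v) for k, v in node.items())
    if isinstance(node, list):
        return all(json_ok(v) for v in node)
    return True  # numbers, booleans and null carry no text
```

A receiver could call `json_ok(json.loads(doc))` and reject the
document when it returns `False`.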

>
>> Then, I got to the end, and noticed that "character repertoire" might 
>> not be the best choice. "Character encoding" or "character set"? 
>> "Vocabulary"? No shade for the authors here, writing about language 
>> itself is really difficult.
>
> I personally want to stay with “repertoire” if only by extension from 
> its well-understood meaning when applied to encoding systems.