Re: [I18ndir] [art] Just uploaded draft-bray-unichars-03

Tim Bray <tbray@textuality.com> Sat, 09 September 2023 23:50 UTC

References: <CAHBU6is50TkpDsqXTp6WxdVSgE66j3gGHZ60ey2jFYbefaHFJw@mail.gmail.com> <20230909165843.GlTJy%steffen@sdaoden.eu> <CAChr6Sz=rMqwp3GOoqGgSsxE8Pqe3GCTqaOLpBO=YN+7v1Ui8Q@mail.gmail.com>
In-Reply-To: <CAChr6Sz=rMqwp3GOoqGgSsxE8Pqe3GCTqaOLpBO=YN+7v1Ui8Q@mail.gmail.com>
From: Tim Bray <tbray@textuality.com>
Date: Sat, 09 Sep 2023 16:50:27 -0700
Message-ID: <CAHBU6itKdzNdEpvq8m2vGmvFtRKDSeSvAaLM0CFqa3aQJUoheg@mail.gmail.com>
To: Rob Sayre <sayrer@gmail.com>
Cc: i18ndir@ietf.org, ART Area <art@ietf.org>, Steffen Nurpmeso <steffen@sdaoden.eu>
Content-Type: multipart/alternative; boundary="000000000000faba6d0604f5c165"
Archived-At: <https://mailarchive.ietf.org/arch/msg/i18ndir/MS23QVkoMtPbrehubj0pgDNZ4Fo>

On Sep 9, 2023 at 1:56:47 PM, Rob Sayre <sayrer@gmail.com> wrote:

> 5. Refining Character Repertoires
>
> The IETF typically uses well-known data formats such as JSON, I-JSON,
> CBOR, YAML, and XML. These formats have default character repertoires. For
> example, JSON allows member names and string values to include any Unicode
> code points, including all the problematic types; the following is a legal
> JSON document:
>

I’m OK with Rob’s redraft, if only to lose the fuzzy word “typically”.
Anyone object?

> {"example": "\u0000\u0089\uDEAD\uD9BF\uDFFF"}
>
> The value of the "example" field contains the C0 Control NUL, the C1
> Control "CHARACTER TABULATION WITH JUSTIFICATION", an unpaired surrogate,
> and the noncharacter U+7FFFF. It is unlikely to be useful as the value of a
> text field. It cannot be serialized into legal UTF-8, but many libraries
> will silently parse this and generate an ill-formed UTF-8 string.
> Implementors must be prepared to deal with these sorts of problematic code
> points.
>
> [ The first part, "unlikely to be useful as the value of a text field", is
> good. But, the next part mixes "legal" and "ill-formed", and I don't
> think that is a good idea. There is still a lowercase requirement after
> that, and I think I disagree. Implementors do not have to be "prepared to
> deal with these sorts of problematic code points". Maybe: "Some messages
> will contain these problematic code points". That is true, but you don't
> have to deal with them. ]
>

Yeah, I’m pretty sure you do, but by “deal with” I include rejecting such
input docs, e.g. by raising a well-known exception with a helpful error
message.  I guess that should be spelled out explicitly?
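
As a rough illustration of "deal with" in the rejecting sense, here is a
minimal Python sketch. It parses a JSON text, walks every member name and
string value, and raises an exception naming the first offending code point.
The names ProblematicCodePointError, classify, check_text, and load_checked
are hypothetical. The categories flagged are the ones named in the example
above (C0 and C1 controls, surrogates, noncharacters); exempting tab, LF, and
CR is an assumption of this sketch, not something taken from the draft.

```python
import json
from typing import Any, Optional


class ProblematicCodePointError(ValueError):
    """Hypothetical exception: a string holds a code point this sketch rejects."""


def classify(cp: int) -> Optional[str]:
    """Return a label for a problematic code point, or None if it is accepted."""
    if 0xD800 <= cp <= 0xDFFF:
        return "surrogate"
    # Noncharacters: U+FDD0..U+FDEF and the last two code points of each plane.
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return "noncharacter"
    # Assumption: tab, LF, and CR are tolerated; other C0 controls are not.
    if cp <= 0x1F and cp not in (0x09, 0x0A, 0x0D):
        return "C0 control"
    if 0x7F <= cp <= 0x9F:
        return "DEL or C1 control"
    return None


def check_text(text: str, where: str) -> None:
    """Raise on the first problematic code point found in text."""
    for index, ch in enumerate(text):
        kind = classify(ord(ch))
        if kind is not None:
            raise ProblematicCodePointError(
                f"{where}: U+{ord(ch):04X} ({kind}) at index {index}"
            )


def _walk(value: Any, path: str) -> None:
    """Check every member name and string value reachable from value."""
    if isinstance(value, str):
        check_text(value, path)
    elif isinstance(value, list):
        for i, item in enumerate(value):
            _walk(item, f"{path}[{i}]")
    elif isinstance(value, dict):
        for key, item in value.items():
            check_text(key, f"{path} (member name)")
            _walk(item, f"{path}[{key!r}]")


def load_checked(text: str) -> Any:
    """Parse JSON, then reject documents containing problematic code points."""
    value = json.loads(text)
    _walk(value, "$")
    return value
```

Python's own json.loads accepts the example document without complaint and
hands back a string containing a lone surrogate, so a check like this has to
run after parsing. A stricter design could reject during parsing instead, at
the cost of writing or configuring a custom decoder.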

> It is unlikely that anyone specifying a new data format would choose to
> allow this character repertoire.
>
> [ Instead: The JSON character repertoire is too permissive, so it's best
> for new specifications to require that the contents of member names and
> string values contain only Useful Assignables (see Section 4.2). ]
>

I prefer the current language. Does anyone prefer Rob’s, or want to propose
another option?

> Then, I got to the end, and noticed that "character repertoire" might not
> be the best choice. "Character encoding" or "character set"? "Vocabulary"?
> No shade for the authors here; writing about language itself is really
> difficult.
>

I personally want to stay with “repertoire” if only by extension from its
well-understood meaning when applied to encoding systems.
