Re: [I18ndir] [art] Fwd: New Version Notification for draft-bray-unichars-04.txt

Tim Bray <tbray@textuality.com> Mon, 18 September 2023 18:29 UTC

From: Tim Bray <tbray@textuality.com>
Date: Mon, 18 Sep 2023 11:29:11 -0700
Message-ID: <CAHBU6iv=JhwNXgs1QpCXKJynR+ZU6G9L_R1xCcvqC4+8E4Jnfw@mail.gmail.com>
To: "Manger, James" <James.H.Manger@team.telstra.com>
Cc: ART Area <art@ietf.org>, "i18ndir@ietf.org" <i18ndir@ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/i18ndir/V7cBT78xV2EKL2FwzBuukLTXdpc>
Subject: Re: [I18ndir] [art] Fwd: New Version Notification for draft-bray-unichars-04.txt

OK, I think your core point is that the current structure of the draft
could lead someone to pick U+0000-U+10FFFF as a reasonable character
repertoire.  Neither Paul nor I read it that way, but you’re not crazy and
if it reads that way to not-crazy people, then we should fix it.

We still have to describe U+0000-U+10FFFF, if only because if you specify
the use of JSON and don’t specify a subset repertoire, that’s what you’re
going to get, as for example the people who implement AWS services know,
and have to explicitly defend themselves.  But if we stop calling it a
subset and restructure the document, we can probably address your concern
without pretending that U+0000-U+10FFFF doesn’t exist in the wild.
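
As a rough illustration of "that's what you're going to get", here is a
minimal Python sketch, assuming the standard-library json module as the
parser. The helper name non_scalar_positions is hypothetical and not taken
from the draft; it only shows the kind of explicit check a service would
have to add for itself.

```python
import json

def non_scalar_positions(s):
    """Return the indexes of code points in the surrogate range U+D800-U+DFFF."""
    return [i for i, ch in enumerate(s) if 0xD800 <= ord(ch) <= 0xDFFF]

# Python's json module accepts an escaped NUL and an escaped unpaired
# surrogate without complaint; no repertoire restriction is applied.
nul = json.loads('"\\u0000"')
lone = json.loads('"a\\udead"')

print(len(nul), hex(ord(nul)))
print(non_scalar_positions(lone))
```

A receiver that wants only Unicode scalar values would reject any string
for which the helper returns a non-empty list.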

On Sep 18, 2023 at 7:05:05 AM, "Manger, James" <
James.H.Manger@team.telstra.com> wrote:

> I don’t think draft-bray-unichars-04
> <https://www.ietf.org/archive/id/draft-bray-unichars-04.html> should
> advance in the IETF.
>
>
>
> The draft's unicode-scalar-values, xml-chars, and useful-assignables are 3
> helpful “subsets of Unicode characters” that can be used in protocols and
> data formats.
>
> Defining unicode-code-points as though it is similar is a category error,
> however.
>
>
>
> Sections 3.1 and 3.2 deliberately make %x0-10FFFF and %x0-D7FF /
> %xE000-10FFFF look like similar repertoires, which is misleading.
>
> UTF-8, UTF-16 and UTF-32 are only defined for the latter. There may be
> “obvious” extensions of UTF-8 (WTF-8) and UTF-32 that can cover %x0-10FFFF,
> but they are simply not widely supported in modern software, so those
> extensions are no use in an IETF standard. And the “obvious” extension to
> UTF-16 gives something that no longer works as you expect from a repertoire.
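
The UTF-8 part of this point can be checked with a short Python sketch. It
assumes CPython's built-in codecs; the "surrogatepass" error handler stands
in here for the kind of generalized encoding the quoted text describes, and
is not claimed to be identical to WTF-8 in every case.

```python
lone = "\ud834"  # a single unpaired high-surrogate code point

# Strict UTF-8 is defined only for scalar values, so encoding fails.
try:
    lone.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# "surrogatepass" emits the generalized 3-byte form for the surrogate.
extended = lone.encode("utf-8", "surrogatepass")

# A strict UTF-8 decoder then rejects those bytes.
try:
    extended.decode("utf-8")
    decode_ok = True
except UnicodeDecodeError:
    decode_ok = False

print(strict_ok, extended, decode_ok)
```

Both the strict encode and the strict decode fail, so the extended bytes
only interoperate between peers that have opted in to the extension.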
>
>
>
> An implementation that accepts surrogates cannot distinguish a
> {high-surrogate, low-surrogate} pair from a non-BMP character. ECMA-404 is
> clear on this when it says, “whether a processor of JSON texts interprets
> such a surrogate pair (“\uD834\uDD1E”) as a single code point (U+1D11E) or
> as an explicit surrogate pair is a semantic decision that is determined by
> the specific processor”. That is totally unexpected from seeing %x0-10FFFF
> as a seemingly simple repertoire.
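
To see one processor's "semantic decision" concretely, here is a small
sketch assuming Python's standard-library json module. Other JSON
processors may behave differently, which is exactly the ECMA-404 caveat
quoted above.

```python
import json

# An escaped surrogate pair is combined into one non-BMP code point.
pair = json.loads('"\\uD834\\uDD1E"')
print(len(pair), hex(ord(pair)))

# An escaped unpaired surrogate is passed through as a lone code point.
lone = json.loads('"\\uD834"')
print(len(lone), hex(ord(lone)))
```

In this processor the two-code-point sequence D834 DD1E can never come out
of a JSON string, even though %x0-10FFFF suggests it is a distinct value.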
>
>
>
> It makes sense for a spec to define:
>
>   unicode-scalar-value = %x0-D7FF / %xE000-10FFFF
>
>   string = *unicode-scalar-value
>
>
>
> It does not make sense for a spec to define:
>
>   unicode-code-point = %x0-10FFFF
>
>   string = *unicode-code-point
>
> because, for instance, %xD834 %xDD1E and %x1D11E are separate values of
> that ABNF grammar but will not be treated that way by implementations. In
> implementations they will be indistinguishable strings. An internal 16-bit
> format will store the same two 16-bit words for both. Only 1 form can come
> out.
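
The "same two 16-bit words" claim can be sketched in Python, whose str type
happens to be able to hold both forms as distinct values. The
"surrogatepass" handler is used only to push the surrogate code points
through the UTF-16 codec; variable names are illustrative.

```python
as_surrogates = "\ud834\udd1e"  # two code points: %xD834 %xDD1E
as_scalar = "\U0001D11E"        # one code point:  %x1D11E

# The two strings are distinct values at the code-point level ...
print(as_surrogates == as_scalar)

# ... but they map to the same two 16-bit code units.
units_a = as_surrogates.encode("utf-16-be", "surrogatepass")
units_b = as_scalar.encode("utf-16-be")
print(units_a == units_b, units_a.hex())

# Decoding those code units yields only one of the two forms.
round_trip = units_a.decode("utf-16-be")
print(round_trip == as_scalar, len(round_trip))
```

Once the data has passed through a 16-bit representation, the original
distinction between the two ABNF values is lost.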
>
>
>
> For understandable reasons, JSON supports both *(%x0-D7FF / %xE000-10FFFF)
> and *(%x0-FFFF) (arbitrary 16-bit data) as models for the logical strings
> it can represent. An implementation can pick either. They don’t exactly
> overlap. There is probably a complicated ABNF that can cover both involving
> *(%x0-D7FF / %xE000-10FFFF)  and unpaired-surrogate, but it would be
> non-trivial (and not that practical). And that ABNF for the logical strings
> JSON can represent is different from the ABNF for JSON text itself, which
> excludes controls other than whitespace and has escape sequences.
> *(%x0-10FFFF) – as implied by 3.1 – doesn’t match any concept here.
>
>
>
> The Unicode spec (chapter 2 General Structure
> <https://www.unicode.org/versions/Unicode15.0.0/ch02.pdf>) covers
> abstract characters, code points, codespace, code unit, encoding form,
> encoding scheme etc. draft-bray-unichars-04 is shorter and simpler, but the
> simplifications aren’t quite precise enough to add clarity.
>
>
>
> P.S. The ABNF productions in draft-bray-unichars-04 should be singular, not
> plural, e.g., unicode-scalar-value.
>
>
>
> --
>
> James Manger