Re: [I18ndir] [art] Just uploaded draft-bray-unichars-03

Tim Bray <tbray@textuality.com> Sat, 09 September 2023 23:44 UTC

References: <CAHBU6is50TkpDsqXTp6WxdVSgE66j3gGHZ60ey2jFYbefaHFJw@mail.gmail.com> <20230909165843.GlTJy%steffen@sdaoden.eu>
In-Reply-To: <20230909165843.GlTJy%steffen@sdaoden.eu>
From: Tim Bray <tbray@textuality.com>
Date: Sat, 09 Sep 2023 16:44:23 -0700
Message-ID: <CAHBU6iuixTeS=X1kccw11zEnHVG5tx9aHUC-pH00ociBmukhGQ@mail.gmail.com>
To: Steffen Nurpmeso <steffen@sdaoden.eu>
Cc: i18ndir@ietf.org, ART Area <art@ietf.org>
Content-Type: multipart/alternative; boundary="0000000000004b178d0604f5ac2f"
Archived-At: <https://mailarchive.ietf.org/arch/msg/i18ndir/iS59-WDJcxRdFyPPzLHGnYua7vk>
Subject: Re: [I18ndir] [art] Just uploaded draft-bray-unichars-03

On Sep 9, 2023 at 9:58:43 AM, Steffen Nurpmeso <steffen@sdaoden.eu> wrote:

>
> In 2.2 i would not give the count on code point types.
> Instead i would only give the problem statement "among Unicode
> code point types .. are questionable".  This seems more generic.
>

Left to myself I’d leave it as-is, but I don’t care that much; if anyone
else agrees with Steffen, I don’t mind changing it.

> In 2.2.2.2 i would not say "legacy controls", and that they are
> "mostly obsolete".  ECMA-48 is very alive in at least the POSIX
> aka Linux world, for many purposes, for example terminal
> interaction.  "Likely to occur in data as a result of
> a programming error"?  Any preformatted Unix manual page will come
> with lots of CSI sequences, or backspace-based ones.
> ASCII NUL is the base of ISO C-style strings.  In fact many
> network protocols (not enough!!) still seem to use
> KEY=VALUE\0KEY=VALUE\0\0 style transports.
>

So, in section 23.1 of [UNICODE] it says “There are 65 code points set
aside in the Unicode Standard for compatibility with the C0 and C1 control
codes defined in the ISO/IEC 2022 framework.”  Some shreds of ISO 2022
practices may survive in legacy systems, but I am pretty sure they are not
relevant to any protocol or data format the IETF might work on. This
clearly feels like “legacy” and “mostly obsolete” to me. (For those who
don’t know, ISO 2022 worked with the “ISO-Latin” 8859 code pages to shift
between them in the middle of a string. It is now forgotten for a good
reason.) I’d never heard of ECMA-48, and I am dubious that that use of
control codes would be appropriate in anything the IETF would specify these
days.
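
For anyone who wants to check the figure of 65, here is a minimal Python
sketch. It assumes the usual reading of the C0 range as U+0000..U+001F,
DELETE as U+007F, and the C1 range as U+0080..U+009F; the variable names
are made up for illustration and come from neither document.

```python
import unicodedata

# Assumed ranges: C0 controls, DELETE, and C1 controls.
C0 = range(0x00, 0x20)   # U+0000..U+001F, 32 code points
DEL = (0x7F,)            # U+007F, 1 code point
C1 = range(0x80, 0xA0)   # U+0080..U+009F, 32 code points

control_code_points = [*C0, *DEL, *C1]
count = len(control_code_points)

# Cross-check against the Unicode general category "Cc" (Control).
all_cc = all(unicodedata.category(chr(cp)) == "Cc"
             for cp in control_code_points)
cc_total = sum(1 for cp in range(0x110000)
               if unicodedata.category(chr(cp)) == "Cc")

print(count, all_cc, cc_total)
```

The list and the general-category scan should agree with each other,
which is all this sketch is meant to show.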

> In 5.:
>
>  [JSON..] It cannot be serialized into legal UTF-8, but many
>  libraries will silently parse this and generate an ill-formed
>  UTF-8 string. Implementors must be prepared to deal with these
>  sorts of problematic code points.
>
> But RFC 3629 is very clear and says in 3. (being lengthy)
>

You are correct, but the assertion in 5. is also empirically correct: this
is exactly what many (most?) existing libraries will do. Presumably on the
basis of Postel’s law?
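
A minimal sketch of that behavior, using Python's standard json module
purely as one example of a lenient parser (other libraries may differ).
The lone surrogate escape \uDEAD is chosen here only for illustration.

```python
import json

# JSON text whose string value is a single unpaired surrogate escape.
text = '"\\uDEAD"'

# The parser accepts it silently and returns a one-character string.
value = json.loads(text)

# A strict UTF-8 encoder refuses the result...
try:
    value.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# ...but asking Python to pass surrogates through yields ill-formed bytes.
ill_formed = value.encode("utf-8", errors="surrogatepass")

# A strict UTF-8 decoder rejects those bytes again.
try:
    ill_formed.decode("utf-8")
    round_trips = True
except UnicodeDecodeError:
    round_trips = False

print(repr(value), strict_ok, ill_formed, round_trips)
```

So whether ill-formed UTF-8 actually reaches the wire depends on what the
surrounding code does with the parsed string, not on the parser alone.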

> So even the weird JSON "string" can be made valid UTF-8, one just
> has to walk around the corner.  (Possibly.)
>

In fact, does that happen? I haven’t seen it but I suppose it is somewhat
possible.

> Sorry, but _I_ do not get that JSON supports _that_ "string",
> RFC 8259, 7.:
>
>   To escape an extended character that is not in the Basic Multilingual
>   Plane, the character is represented as a 12-character sequence,
>   encoding the UTF-16 surrogate pair.
>

Once again, you are probably correct, but the caution that implementers
will have to be “prepared to deal with” this still holds, because of what
existing implementations do.  You are correct that it would probably be
better if JSON parsers were stricter.
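
What “stricter” could look like, as a rough sketch only: a wrapper around
Python's json.loads that rejects any string still containing a surrogate
code point after parsing. The names strict_loads, has_surrogate and
check_value are hypothetical helpers invented for this example.

```python
import json

def has_surrogate(s: str) -> bool:
    """True if s contains any code point in U+D800..U+DFFF."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in s)

def check_value(value) -> None:
    """Walk a parsed JSON value and reject strings holding surrogates."""
    if isinstance(value, str):
        if has_surrogate(value):
            raise ValueError("unpaired surrogate in JSON string")
    elif isinstance(value, list):
        for item in value:
            check_value(item)
    elif isinstance(value, dict):
        for key, item in value.items():
            check_value(key)
            check_value(item)

def strict_loads(text: str):
    """Parse JSON, then refuse results that hold unpaired surrogates."""
    value = json.loads(text)
    check_value(value)
    return value

# A properly paired 12-character escape decodes to one code point.
paired = strict_loads('"\\uD834\\uDD1E"')

# A lone surrogate escape is rejected.
try:
    strict_loads('{"a": ["\\uDEAD"]}')
    rejected = False
except ValueError:
    rejected = True

print(hex(ord(paired)), rejected)
```

The check runs after parsing because a correctly paired escape has
already been combined into a single non-surrogate code point by then.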

> And then in 8.
>
>  8.  String and Character Issues
>  8.1.  Character Encoding
>     JSON text exchanged between systems that are not part of a
>     closed ecosystem MUST be encoded using UTF-8 [RFC3629].
>

As Rob Sayre said above, the proposed document probably has to address the
issue of JSON escapes and emphasize that they are not relevant to code
point subsets.
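
To illustrate why escapes are a matter of syntax rather than of code
points, here is a small sketch with Python's json module, assuming its
behavior is representative. The sample strings are made up.

```python
import json

# The same string value, once written with \u escapes and once literally.
escaped = json.loads('"caf\\u00e9 \\uD834\\uDD1E"')
literal = json.loads('"caf\u00e9 \U0001D11E"')

same = escaped == literal
code_points = [hex(ord(ch)) for ch in escaped]

# An escape can also smuggle in a control code; only the decoded string
# shows which code points are really present.
nul = json.loads('"\\u0000"')

print(same, code_points, repr(nul))
```

Any subset check therefore belongs on the decoded string, where the
escaped and literal forms are indistinguishable.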

> 5. also says
>
>   It is unlikely that anyone specifying a new data format would
>   choose to allow this character repertoire.
>
> And
>
>   A protocol based on JSON could be made more robust and
>   implementor-friendly by requiring that the contents of member
>   names and string values contain only Useful Assignables
>
> No.  Not me.  Sorry .. we are talking string data?
> I mean, with your restriction one (possibly) cannot even generate
> a protocol that carries around Linux/POSIX path names?
>

Which characters necessary for Linux/POSIX path names are being excluded?
I went looking for the appropriate specs but ended up getting lost in
opengroup.org pages.  I note that for 25 years, the lifetime of XML, its
exclusion of control codes, with the exception of \r, \t, and \n, seems to
have worked fine.
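
For reference, a minimal sketch of the XML 1.0 Char production as a
Python predicate, so that a candidate path name can be tested against it.
The function names and the sample paths are hypothetical.

```python
def is_xml10_char(cp: int) -> bool:
    """XML 1.0 Char: #x9 | #xA | #xD | [#x20-#xD7FF]
    | [#xE000-#xFFFD] | [#x10000-#x10FFFF]."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF      # note: also admits U+007F..U+009F
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )

def xml10_safe(s: str) -> bool:
    """True if every code point of s matches the Char production."""
    return all(is_xml10_char(ord(ch)) for ch in s)

ordinary_path = xml10_safe("/usr/share/man/man1/ls.1")
whitespace_ok = xml10_safe("a\tb\r\n")
escape_in_name = xml10_safe("odd\x1bname")   # contains ESC (U+001B)
nul_in_name = xml10_safe("bad\x00name")      # contains NUL (U+0000)

print(ordinary_path, whitespace_ok, escape_in_name, nul_in_name)
```

Only names containing C0 controls other than tab, line feed and carriage
return fail this check; ordinary path names pass.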
