Re: [I18ndir] [art] Just uploaded draft-bray-unichars-03

Rob Sayre <sayrer@gmail.com> Sat, 09 September 2023 20:57 UTC

I gave it a close read again. I came up with this:


5. Refining Character Repertoires

The IETF typically uses well-known data formats such as JSON, I-JSON, CBOR,
YAML, and XML. These formats have default character repertoires. For
example, JSON allows member names and string values to include any Unicode
code points, including all the problematic types; the following is a legal
JSON document:

[ big edit from the current draft, shorter, but take it or leave it. ]


{"example": "\u0000\u0089\uDEAD\uD9BF\uDFFF"}

The value of the "example" field contains the C0 Control NUL, the C1
Control "CHARACTER TABULATION WITH JUSTIFICATION", an unpaired surrogate,
and the noncharacter U+7FFFF. It is unlikely to be useful as the value of a
text field. It cannot be serialized into legal UTF-8, but many libraries
will silently parse this and generate an ill-formed UTF-8 string.
Implementors must be prepared to deal with these sorts of problematic code
points.

[ The first part, "unlikely to be useful as the value of a text field", is
good. But the next part mixes "legal" and "ill-formed", and I don't think
that is a good idea. After that comes a lowercase "must" requirement, and
I think I disagree with it: implementors do not have to be "prepared to
deal with these sorts of problematic code points". Maybe: "Some messages
will contain these problematic code points". That is true, but you don't
have to deal with them. ]
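
For anyone who wants to see the "parses fine, but cannot become UTF-8"
behaviour concretely, here is a minimal sketch. It assumes CPython's
standard json module; other libraries may reject the input or substitute
U+FFFD instead. The input spells the noncharacter U+7FFFF as the surrogate
pair \uD9BF\uDFFF, which is how RFC 8259 escapes code points outside the
BMP. The helper name parse_and_encode is made up for illustration.

```python
import json

# Escapes: NUL, a C1 control, a lone low surrogate, and U+7FFFF written
# as the UTF-16 surrogate pair D9BF DFFF.
DOC = r'{"example": "\u0000\u0089\uDEAD\uD9BF\uDFFF"}'


def parse_and_encode(text):
    """Parse JSON text, then try to serialize the string value as UTF-8."""
    value = json.loads(text)["example"]
    try:
        return value, value.encode("utf-8")
    except UnicodeEncodeError:
        # The lone surrogate U+DEAD has no UTF-8 encoding (RFC 3629).
        return value, None


value, encoded = parse_and_encode(DOC)
print([f"U+{ord(ch):04X}" for ch in value])
print("UTF-8:", encoded)
```

With this parser the document is accepted without complaint, and the
failure only shows up later, when something tries to encode the string.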



It is unlikely that anyone specifying a new data format would choose to
allow this character repertoire.

[ Instead: The JSON character repertoire is too permissive, so it's best
for new specifications to require that the contents of member names and
string values contain only Useful Assignables (see Section 4.2). ]
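
A rough sketch of what such a check could look like. This is an
approximation and not the draft's ABNF: it assumes the excluded code
points are the surrogates, the C0/C1 controls and DEL other than tab, LF,
and CR, and the noncharacters. The function names are hypothetical.

```python
def is_problematic(cp):
    """Approximate test for code points a restricted repertoire would exclude."""
    if 0xD800 <= cp <= 0xDFFF:          # surrogates
        return True
    if cp in (0x09, 0x0A, 0x0D):        # tab, LF, CR: assumed still useful
        return False
    if cp <= 0x1F or 0x7F <= cp <= 0x9F:  # C0 controls, DEL, C1 controls
        return True
    # Noncharacters: U+FDD0..U+FDEF and the last two code points of each plane.
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return True
    return False


def problem_code_points(s):
    """List the problematic code points in s, in order, as U+XXXX labels."""
    return [f"U+{ord(ch):04X}" for ch in s if is_problematic(ord(ch))]


print(problem_code_points("ok\ttext\n"))
print(problem_code_points("\x00\x89\udead\U0007ffff"))
```

A receiver could reject a message when the list is non-empty, or just log
it; which one is a policy choice for the protocol, not something this
sketch decides.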



Then I got to the end and noticed that "character repertoire" might not
be the best choice. "Character encoding" or "character set"? "Vocabulary"?
No shade for the authors here; writing about language itself is really
difficult.

thanks,
Rob



On Sat, Sep 9, 2023 at 10:15 AM Steffen Nurpmeso <steffen@sdaoden.eu> wrote:

> Tim Bray wrote in
>  <CAHBU6is50TkpDsqXTp6WxdVSgE66j3gGHZ60ey2jFYbefaHFJw@mail.gmail.com>:
>  |See https://www.ietf.org/archive/id/draft-bray-unichars-03.html
>  |
>  |A bunch of minor corrections and improvements, thanks to everyone for that,
>  |especially James Manger for noticing that the ABNF was entirely wrong in
>  |one place.
>  |
>  |The word “useless” has been replaced by “legacy”.
>  |
>  |I think the feedback was pretty clear that the draft needed to be more
>  |opinionated; just because we document the existence of the default JSON
>  |repertoire (“all the code points”) doesn’t mean that anyone should use it
>  |in the present or future. So, introduced a new section “Refining Character
>  |Repertoires” to highlight those issues and offer a suggestion.
>
> In 2.2 i would not give the count on code point types.
> Instead i would only give the problem statement "among Unicode
> code point types .. are questionable".  This seems more generic.
>
> In 2.2.2.2 i would not say "legacy controls", and that they are
> "mostly obsolete".  ECMA-48 is very alive in at least the POSIX
> aka Linux world, for many purposes, for example terminal
> interaction.  "Likely to occur in data as a result of
> a programming error"?  Any preformatted Unix manual page will come
> with lots of CSI sequences, or backspace-based ones.
> ASCII NUL is the base of ISO C-style strings.  In fact many
> network protocols (not enough!!) still seem to use
> KEY=VALUE\0KEY=VALUE\0\0 style transports.
>
> In 5.:
>
>   [JSON..] It cannot be serialized into legal UTF-8, but many
>   libraries will silently parse this and generate an ill-formed
>   UTF-8 string. Implementors must be prepared to deal with these
>   sorts of problematic code points.
>
> But RFC 3629 is very clear and says in 3. (being lengthy)
>
>    The definition of UTF-8 prohibits encoding character numbers between
>    U+D800 and U+DFFF, which are reserved for use with the UTF-16
>    encoding form (as surrogate pairs) and do not directly represent
> []
>    characters.  When encoding in UTF-8 from UTF-16 data, it is necessary
>    to first decode the UTF-16 data to obtain character numbers, which
>    are then encoded in UTF-8 as described above.  This contrasts with
>    CESU-8 [CESU-8], which is a UTF-8-like encoding that is not meant for
>    ...
>
> So even the weird JSON "string" can be made valid UTF-8, one just
> has to walk around the corner.  (Possibly.)
> Sorry, but _I_ do not get that JSON supports _that_ "string",
> RFC 8259, 7.:
>
>    To escape an extended character that is not in the Basic Multilingual
>    Plane, the character is represented as a 12-character sequence,
>    encoding the UTF-16 surrogate pair.
>
> And then in 8.
>
>   8.  String and Character Issues
>   8.1.  Character Encoding
>      JSON text exchanged between systems that are not part of a
>      closed ecosystem MUST be encoded using UTF-8 [RFC3629].
>
> This is a total contradiction, sorry.  I. Hate. JSON.
> But that does not help anyone.
>
> So i mean _if_ i would write such a RFC _i_ would not hammer your
> sentence on the table, but i would then simply refer to RFC 3629
> and say that implementors shall be prepared to convert the JSON
> standard (grrr) string .. to the UTF-8 standard?
>
> 5. also says
>
>    It is unlikely that anyone specifying a new data format would
>    choose to allow this character repertoire.
>
> And
>
>    A protocol based on JSON could be made more robust and
>    implementor-friendly by requiring that the contents of member
>    names and string values contain only Useful Assignables
>
> No.  Not me.  Sorry .. we are talking string data?
> I mean, with your restriction one (possibly) cannot even generate
> a protocol that carries around Linux/POSIX path names?  Except by
> mangling them to something likely non-reproducible (by leaving off
> "evil" characters, or converting them to a replacement character;
> which one, the Unicode one, or question mark?  Ah, it must be
> ASCII question mark because the Unicode replacement character is
> of the evil sort?).  Or have i misunderstood something ...
> which can very well be the truth, of course.
> So, even if you wipe away all of the above, a hint on replacement
> characters in a document that restricts the usable set of Unicode
> characters is well worth a thought.
>
> Thank you.
>
> --steffen
> |
> |Der Kragenbaer,                The moon bear,
> |der holt sich munter           he cheerfully and one by one
> |einen nach dem anderen runter  wa.ks himself off
> |(By Robert Gernhardt)