Re: [Json] Proposal for strings/Unicode text

Carsten Bormann <cabo@tzi.org> Wed, 12 June 2013 23:41 UTC

From: Carsten Bormann <cabo@tzi.org>
In-Reply-To: <CAHBU6ivw=4WfTyXdBns-i30fvzhkb+Zs_puj=YhFw+fh6n3R7A@mail.gmail.com>
Date: Thu, 13 Jun 2013 01:40:49 +0200
Message-Id: <D6191FD1-7DF9-4433-8427-48F866A3DBBC@tzi.org>
References: <CAHBU6ivNjMUwN2Hsn-E8FKxjqXS6b4qz=_MeeaHahWBWqG_Hgg@mail.gmail.com> <3E2A194C-789F-4E75-ABB0-CE966319463E@tzi.org> <CAHBU6ivw=4WfTyXdBns-i30fvzhkb+Zs_puj=YhFw+fh6n3R7A@mail.gmail.com>
To: Tim Bray <tbray@textuality.com>
Cc: "json@ietf.org" <json@ietf.org>
Subject: Re: [Json] Proposal for strings/Unicode text

On Jun 13, 2013, at 00:49, Tim Bray <tbray@textuality.com> wrote:

> Strings are delimited with quotation marks (U+0022 QUOTATION MARK). They are intended to contain sequences of Unicode characters. Note however that the normative ABNF in this section allows the inclusion of 16-bit quantities, for example unpaired surrogate-block code points, in ways which can never be useful for representing characters and is likely to cause breakage in software designed to process Unicode text.

A kitchen knife is intended to cut food. Note however that the normative cutting edge of the knife also allows the use against intruding burglars, for example by slashing open their guts, which can never be useful for food and is likely to cause you trouble in a number of ways, including bloodstains on your shirt.

I don't want to read this sentence in the introduction of the manual for my kitchen knife, please.

Again, can we focus on the correct usage and banish the avenues for misuse to an appendix?
It could even be called "the JSON character model" to obscure its grisly purpose.

You are not going to get both "correct" and "understandable" if you combine the intended usage with all the arcane legacy issues into one piece of text.

(If I'm the only one who cares about this editorial issue, this is going to be my last comment on this... if this WG does manage to turn one of the nicest-written RFCs into goulash, I can always point my students to read the original.)

Grüße, Carsten

PS.: "16-bit quantities" is highly confusing. Outside of mathematics and physics, "quantity" gives the amount of something. There is absolutely no reason not to use the term "code point" from Unicode.

»In the Unicode Standard, the codespace consists of the integers from 0 to 10FFFF₁₆, comprising 1,114,112 code points available for assigning the repertoire of abstract characters.«

See also Table 2-3.

Yes, all of them can be used in a JSON document...
Some of them must be escaped, some of them are non-characters but otherwise innocuous, some of them (surrogates) actively try to slash your guts, but they are all code points.

»Surrogate code points cannot be conformantly interchanged using Unicode encoding forms. They do not correspond to Unicode scalar values and thus do not have well-formed representations in any Unicode encoding form. (See Section 3.8, Surrogates.)«
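To make the quoted statement concrete, here is a minimal sketch in Python (my choice of language, not something the thread uses). It relies on the behaviour of CPython's standard codecs and json module; the helper name can_encode is hypothetical and only there for illustration.

```python
import json

# A Python str may hold a lone (unpaired) surrogate code point.
lone = "\ud800"

def can_encode(text: str, encoding: str) -> bool:
    """Hypothetical helper: True if `text` has a well-formed form in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# No Unicode encoding form can carry the lone surrogate.
utf8_ok = can_encode(lone, "utf-8")
utf16_ok = can_encode(lone, "utf-16-be")
utf32_ok = can_encode(lone, "utf-32-be")

# The JSON escape syntax can still notate it, using ASCII characters only.
escaped = json.dumps(lone)
round_trip = json.loads(escaped)

print(utf8_ok, utf16_ok, utf32_ok, escaped, round_trip == lone)
```

So the surrogate survives in the JSON text only in escaped form; whether a receiving implementation accepts it is a separate question.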

So they must be escaped (unless you are using them for their intended purpose in a UTF-16-encoded JSON document, a rare species). There also is no way to notate a sequence of a high surrogate and a low surrogate using escape sequences in JSON, because that always stands for the non-BMP code point that results from the values of the two surrogates. All that should be mentioned somewhere, because it takes a couple of hours of analysis to ascertain that. (Maybe also mention that that doesn't hurt the "JavaScript string as vector of 16-bit numbers" use case, because the non-BMP characters can be (will be!) taken apart again by the recipient. So you don't even have to escape them... More stuff for the appendix, or maybe for a special "worst practice" document.)
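The surrogate arithmetic behind that paragraph can be sketched as follows. This is an illustration in Python under my own assumptions: the function name combine_surrogates and the sample character U+1F600 are not from the thread, and the decoding behaviour shown is that of Python's standard json module, not necessarily of every JSON parser.

```python
import json

def combine_surrogates(high: int, low: int) -> int:
    """Hypothetical helper: map a high/low surrogate pair to its non-BMP code point."""
    assert 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# An escaped high surrogate followed by an escaped low surrogate
# decodes to ONE non-BMP code point, not to two separate surrogates.
decoded = json.loads('"\\ud83d\\ude00"')
code_points = [ord(ch) for ch in decoded]

# The same character written unescaped gives the same string.
unescaped = json.loads('"\U0001F600"')

# A recipient working with 16-bit units gets the pair back
# by encoding the string as UTF-16.
utf16 = decoded.encode("utf-16-be")
units = [int.from_bytes(utf16[i:i + 2], "big") for i in range(0, len(utf16), 2)]

print([hex(cp) for cp in code_points], [hex(u) for u in units])
```

The "vector of 16-bit numbers" view loses nothing here: the two code units recovered at the end are exactly the two values written in the escape sequences.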
