Re: [Json] Proposal for strings/Unicode text

Carsten Bormann <cabo@tzi.org> Wed, 12 June 2013 22:30 UTC

From: Carsten Bormann <cabo@tzi.org>
In-Reply-To: <CAHBU6ivNjMUwN2Hsn-E8FKxjqXS6b4qz=_MeeaHahWBWqG_Hgg@mail.gmail.com>
Date: Thu, 13 Jun 2013 00:30:10 +0200
Message-Id: <3E2A194C-789F-4E75-ABB0-CE966319463E@tzi.org>
References: <CAHBU6ivNjMUwN2Hsn-E8FKxjqXS6b4qz=_MeeaHahWBWqG_Hgg@mail.gmail.com>
To: Tim Bray <tbray@textuality.com>
Cc: "json@ietf.org" <json@ietf.org>
Subject: Re: [Json] Proposal for strings/Unicode text
Precedence: list

Hmm.

Somehow I think the JSON specification should focus on describing what the intended usage is.

I strongly prefer adding an appendix M for things that you can do with the ABNF that are almost, but not entirely unlike JSON.

Grüße, Carsten

PS.: I support jettisoning liturgical language in standards, and I applaud Douglas for slipping with respect to the liturgical term "octet" only twice (both in the same paragraph).  But a document must also speak the language of its editor, and if Douglas thinks "reverse solidus" is the best way to speak about what some Germans (but never me) would call "Rückschräger", that's fine, as long as it is consistent.

PPS.: On the specific wording:
> Before:
>    A string is a sequence of zero or more Unicode characters [UNICODE].
> After:
>    A string is intended to contain sequences of zero or more Unicode characters [UNICODE 6.2]

A string is a sequence of characters.  [not sequences of them]
Add something like: "To reduce the burden on implementations, JSON is less selective in what it accepts as a character than Unicode itself is.  See also Appendix M."

> Strings begin and end with quotation marks. 

The representation does, the string rarely does.
RFC4627 got this right consistently, but in a tight language where extracting a single sentence may lose the necessary context.
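
To illustrate the distinction between the representation and the string, here is a minimal sketch using Python's standard json module (my choice of language for illustration; nothing in the draft depends on it). The variable names are made up for the example.

```python
import json

# The JSON text (the representation): five characters, quotation marks included.
representation = '"abc"'

# The string it represents: three characters, no quotation marks.
value = json.loads(representation)

print(len(representation), len(value))
print(value.startswith('"'))
```

The quotation marks delimit the representation; they are not part of the string value that a parser hands back.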

> They are intended to contain sequences of Unicode characters; note, however, that the ABNF in this section allows the inclusion of 16-bit quantities in ways which can never be useful for representing characters and is likely to cause breakage in software designed to process Unicode text.

This is where I would simply point to Appendix M.
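
As a concrete case of what the quoted text warns about, here is a hedged sketch of a lone surrogate escape, which the ABNF permits but which is not a Unicode character. It assumes CPython's standard json module, which (as far as I know) accepts such input without complaint; other parsers may reject or replace it.

```python
import json

# A lone high-surrogate escape: allowed by the ABNF, but not a Unicode character.
text = '"\\uD834"'

# CPython's json module returns a one-element string holding the surrogate.
value = json.loads(text)

# Software that expects real Unicode text breaks at the next step,
# for example when encoding the result as UTF-8.
try:
    value.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    encodable = False

print(len(value), hex(ord(value)), encodable)
```

This is the kind of behaviour that an Appendix M could collect in one place instead of scattering caveats through the main text.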

> The ABNF allows the use of many Unicode code points that could be used in future to represent Unicode characters, but have not yet been assigned. Therefore, this specification should not need revision as the Unicode character repertoire continues to grow.

This is something that could even be said in the introduction.
Or in a section about stability and protocol evolution (the same section that is needed to say that [past and future] changes in JavaScript don't change JSON).

> 16-bit quantities

These are Unicode code points.

> (normally Unicode characters from the Basic Multilingual Plane (U+0000 through U+FFFF)) may be “escaped”, or represented as a six-character sequence: a backslash (U+005C REVERSE SOLIDUS), followed by the lowercase letter u, followed by four hexadecimal digits that encode the character's code point.  The hexadecimal letters A through F can be upper or lower case.  So, for example, a string containing only a single backslash may be represented as "\u005C".
> 
> Alternatively, there are two-character sequence escape representations of some popular characters.  So, for example, a string containing only a single backslash may be represented more compactly as "\\".
> 
>  To escape an extended character

Non-BMP characters aren't "extended".  They have rights, too!
You MUST rid your mind of discriminating against them.
(I know, this was just copied over...)

> that is not in the Basic Multilingual Plane, the character is represented as a twelve-character sequence, encoding the UTF-16 surrogate pair.  So, for example, a string containing only U+1D11E G CLEF may be represented as
>    "\uD834\uDD1E".

(This could add a small apologetic clause pointing out the UTF-16 roots of the weird notation.  Or not.)
This needs another pointer to appendix M.
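
For readers who want to check the notation, here is a small sketch of the escaping rules quoted above: the six-character form for BMP code points and the twelve-character UTF-16 surrogate pair for everything else. The helper name utf16_escape is hypothetical, and Python's json module is used only to confirm the round trip.

```python
import json

def utf16_escape(cp: int) -> str:
    # Build the JSON escape for one code point (hypothetical helper).
    if cp <= 0xFFFF:
        # BMP: backslash, "u", four hexadecimal digits.
        return f"\\u{cp:04X}"
    # Non-BMP: split the 20-bit offset into a UTF-16 surrogate pair.
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)
    low = 0xDC00 + (offset & 0x3FF)
    return f"\\u{high:04X}\\u{low:04X}"

g_clef = utf16_escape(0x1D11E)
print(g_clef)

# The pair decodes back to the single character U+1D11E.
decoded = json.loads('"' + g_clef + '"')
print(hex(ord(decoded)))

# All three representations of a single backslash are equivalent.
backslashes = [json.loads(t) for t in ('"\\u005C"', '"\\u005c"', '"\\\\"')]
print(backslashes == ["\\", "\\", "\\"])
```

The subtraction of 0x10000 and the 10-bit split are exactly where the UTF-16 roots of the notation show through.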
