Re: [Json] Proposal for strings/Unicode text

Norbert Lindenberg <ietf@lindenbergsoftware.com> Wed, 12 June 2013 22:46 UTC

Mime-Version: 1.0 (Apple Message framework v1283)
Content-Type: text/plain; charset="utf-8"
From: Norbert Lindenberg <ietf@lindenbergsoftware.com>
In-Reply-To: <CAHBU6ivNjMUwN2Hsn-E8FKxjqXS6b4qz=_MeeaHahWBWqG_Hgg@mail.gmail.com>
Date: Wed, 12 Jun 2013 15:46:07 -0700
Content-Transfer-Encoding: quoted-printable
Message-Id: <ED62F638-C0C4-411D-BA5B-EB9BA71EDB75@lindenbergsoftware.com>
References: <CAHBU6ivNjMUwN2Hsn-E8FKxjqXS6b4qz=_MeeaHahWBWqG_Hgg@mail.gmail.com>
To: Tim Bray <tbray@textuality.com>
Cc: Norbert Lindenberg <ietf@lindenbergsoftware.com>, "json@ietf.org" <json@ietf.org>
Subject: Re: [Json] Proposal for strings/Unicode text
Precedence: list

On Jun 12, 2013, at 11:46 , Tim Bray wrote:

> Rationale:
> - emphasize the important fact that Strings are *intended* for Unicode characters
> - document the important fact that the rules allow horrible Unicode practices
> - say “backslash” instead of “reverse solidus” :)

The JSON RFC seems to use Unicode character names, in this case "reverse solidus".

> In section 1, introduction
> 
> Before:
>    A string is a sequence of zero or more Unicode characters [UNICODE].
> After:
>    A string is intended to contain sequences of zero or more Unicode characters [UNICODE 6.2]
> 
> Rewrite section 2.5 as follows:
> 
> Strings begin and end with quotation marks.  They are intended to contain sequences of Unicode characters; note, however, that the ABNF in this section allows the inclusion of 16-bit quantities in ways which can never be useful for representing characters and which are likely to cause breakage in software designed to process Unicode text.

This warning is too vague to be useful. Which specific risks do you think need to be discussed here? Also, the ABNF doesn't do anything specifically for 16-bit quantities, as far as I can see.

> The ABNF allows the use of many Unicode code points that could be used in future to represent Unicode characters, but have not yet been assigned. Therefore, this specification should not need revision as the Unicode character repertoire continues to grow.
> 
> 16-bit quantities (normally Unicode characters from the Basic Multilingual Plane, U+0000 through U+FFFF) may be “escaped”, or represented as a six-character sequence: a backslash (U+005C REVERSE SOLIDUS), followed by the lowercase letter u, followed by four hexadecimal digits that encode the character's code point.  The hexadecimal letters A through F can be upper or lower case.  So, for example, a string containing only a single backslash may be represented as "\u005C".

These escape sequences aren't about 16-bit quantities - they represent Unicode BMP code points. If the parser inserts a BMP code point into an output string in UTF-16 (in JavaScript, Java, and others), then the result will be a 16-bit quantity. If it inserts the code point into an output string in UTF-8 (in Python and others), then the result will be a sequence of 1-3 bytes.
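
To illustrate the point about code points versus code units, here is a minimal sketch using Python's standard json module. The helper name encoded_sizes is made up for this example, and the sizes are obtained by explicitly encoding the parsed string, not by inspecting any parser's internal storage.

```python
import json

def encoded_sizes(json_text):
    """Return (code point, UTF-16 code units, UTF-8 bytes) for a JSON
    string that contains exactly one BMP character."""
    s = json.loads(json_text)
    utf16_units = len(s.encode("utf-16-be")) // 2  # 2 bytes per code unit
    utf8_bytes = len(s.encode("utf-8"))
    return ord(s), utf16_units, utf8_bytes

# The same \uXXXX form, three different BMP code points.
sizes_a = encoded_sizes('"\\u0041"')     # U+0041 LATIN CAPITAL LETTER A
sizes_e = encoded_sizes('"\\u00e9"')     # U+00E9 LATIN SMALL LETTER E WITH ACUTE
sizes_euro = encoded_sizes('"\\u20ac"')  # U+20AC EURO SIGN
```

Each escape always yields one UTF-16 code unit, while the UTF-8 length varies from one to three bytes depending on the code point.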

If we discuss these escape sequences in prose, then the sequences for BMP and supplementary characters need to be discussed together so that it's clear what's a surrogate pair and how unpaired surrogates are handled.
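
As a sketch of what such a combined discussion would have to cover, the following Python shows the surrogate-pair arithmetic and what one particular parser does with an unpaired surrogate. The helper names combine_surrogates and can_encode_utf8 are invented for this example; accepting a lone surrogate is the behavior of CPython's json module, not something the JSON RFC prescribes.

```python
import json

def combine_surrogates(high, low):
    """Map a UTF-16 high/low surrogate pair to a supplementary code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

def can_encode_utf8(s):
    """Report whether a string can be serialized as well-formed UTF-8."""
    try:
        s.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False

# Two escapes forming a pair decode to one supplementary character.
paired = json.loads('"\\ud83d\\ude00"')

# A single high surrogate with no partner is accepted by this parser,
# but the resulting string is not valid Unicode text.
lone = json.loads('"\\ud83d"')
```

Here paired is the single code point U+1F600, while lone holds U+D83D on its own and cannot be written out as UTF-8.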

JSON syntax allows almost all Unicode code points (with the exceptions visible in the "unescaped" production) to be inserted into strings directly, so escape sequences are mostly a convenience for using JSON in environments that restrict the set of allowed characters (such as RFCs), don't have the necessary fonts and input methods installed, or benefit from making the code points visible (e.g., in test cases).
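
A short Python sketch of that equivalence, assuming the standard json module: the direct and escaped spellings parse to the same string, and the ensure_ascii option chooses between them on output.

```python
import json

# The character inserted directly, and the same character as an escape.
direct = json.loads('"的"')
escaped = json.loads('"\\u7684"')  # U+7684 is the code point of 的

# ASCII-only output for restricted environments (the module's default).
ascii_only = json.dumps("的")

# Output with the character written directly.
as_is = json.dumps("的", ensure_ascii=False)
```

Both inputs produce the same one-character string; only the serialized form differs.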

> Alternatively, there are two-character sequence escape representations of some popular characters.  So, for example, a string containing only a single backslash may be represented more compactly as "\\".

I'd assume that this is not about popularity, but about the need to represent control characters and characters that are also used within the JSON syntax. I don't see two-character escapes for "e" or "的".
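
For reference, the two-character escapes defined by the JSON grammar can be listed and checked with a small Python sketch. The table name TWO_CHAR_ESCAPES and the helper rejects_raw are invented for this example; the serializer behavior shown is that of Python's json module.

```python
import json

# Every two-character escape in the JSON grammar and the character it denotes:
# JSON syntax characters plus a handful of control characters.
TWO_CHAR_ESCAPES = {
    '\\"': '"',    # quotation mark, delimits strings
    '\\\\': '\\',  # reverse solidus, introduces escapes
    '\\/': '/',    # solidus
    '\\b': '\b',   # backspace (U+0008)
    '\\f': '\f',   # form feed (U+000C)
    '\\n': '\n',   # line feed (U+000A)
    '\\r': '\r',   # carriage return (U+000D)
    '\\t': '\t',   # tab (U+0009)
}

decoded = {esc: json.loads('"' + esc + '"') for esc in TWO_CHAR_ESCAPES}

def rejects_raw(json_text):
    """Report whether the parser refuses the given JSON text."""
    try:
        json.loads(json_text)
        return False
    except json.JSONDecodeError:
        return True

newline_out = json.dumps("\n")    # has a short escape
control_out = json.dumps("\x01")  # control character without a short escape
letter_out = json.dumps("e")      # ordinary letter, no escape needed
```

A raw line feed inside a string is rejected, which is why the escapes are needed at all; an ordinary letter such as "e" is simply written as itself.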

Norbert
