Re: [Json] Proposed minimal change for strings

Norbert Lindenberg <ietf@lindenbergsoftware.com> Thu, 04 July 2013 05:06 UTC

On Jul 2, 2013, at 16:27 , Paul Hoffman <paul.hoffman@vpnc.org> wrote:

> Proposal 1 (allow all code units in their unescaped form):
> 
> In section 1 (Introduction):
> Change the sentence about Unicode characters to:
>   A string is a sequence of zero or more Unicode code units [UNICODE].

This should be "code points", not "code units".

If you allow a sequence of code units, you might get UTF-8 code units in random sequence, or a mix of UTF-8 code units and UTF-16 code units, which wouldn't make sense at all.

Also, JSON documents have to be understood at the level of code points, because their code unit sequences are often changed in transit (e.g., from UTF-16 code units in JavaScript, to UTF-8 code units over the network, and back to UTF-16 code units in Java) with no intended change in content.

See also
http://www.unicode.org/glossary/#code_point
http://www.unicode.org/glossary/#code_unit
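
A minimal Python sketch of the distinction, assuming an arbitrary sample string chosen only for illustration: the code unit sequences differ between UTF-8 and UTF-16, while the code point sequence stays the same across re-encoding.

```python
# Sample text (an assumption for illustration): 'a', U+00E9, and U+1F600,
# the last of which lies outside the Basic Multilingual Plane.
text = "a\u00e9\U0001F600"

# Code points: independent of any encoding form.
code_points = [ord(ch) for ch in text]

# UTF-8 code units are 8 bits wide.
utf8_units = list(text.encode("utf-8"))

# UTF-16 code units are 16 bits wide; split the big-endian bytes into pairs.
utf16_bytes = text.encode("utf-16-be")
utf16_units = [
    int.from_bytes(utf16_bytes[i:i + 2], "big")
    for i in range(0, len(utf16_bytes), 2)
]

# Simulate transit: UTF-16 -> UTF-8 -> back to a string.
round_trip = utf16_bytes.decode("utf-16-be").encode("utf-8").decode("utf-8")

print([f"U+{cp:04X}" for cp in code_points])
print([f"{u:02X}" for u in utf8_units])
print([f"{u:04X}" for u in utf16_units])
print(round_trip == text)
```

The three code points become seven UTF-8 code units but four UTF-16 code units, so a definition in terms of code units would describe different sequences for the same content.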

> In section 2.2 (Strings):
> Leave the production for "unescaped" as-is.
> In section 3 (Encoding):
> Add "Some strings, notably those that have unescaped surrogate code units
> (value 0xD800 to 0xDFFF), cannot be encoded in UTF-8."

As above, "surrogate code units" should be "surrogate code points", and the range should therefore be written U+D800 to U+DFFF. Also, it's exactly the strings containing unescaped surrogate code points that cannot be encoded in well-formed UTF-8, so we can get rid of "some" and "notably those".
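
As a rough check of the "exactly" claim, here is a small Python sketch. It relies on the fact that Python's str type holds code points and that its strict UTF-8 encoder rejects surrogates; the helper names and sample strings are hypothetical, chosen only for illustration.

```python
def has_surrogate(s: str) -> bool:
    """True if s contains any code point in U+D800..U+DFFF."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in s)


def encodable_as_utf8(s: str) -> bool:
    """True if s can be encoded as well-formed UTF-8."""
    try:
        s.encode("utf-8")  # strict error handling by default
        return True
    except UnicodeEncodeError:
        return False


samples = [
    "",                # empty string
    "plain ASCII",
    "\u00e9",          # U+00E9
    "\U0001F600",      # supplementary code point, not a surrogate
    "\ud800",          # lone high surrogate
    "x\udfffy",        # lone low surrogate inside other text
]

for s in samples:
    # A string fails to encode exactly when it contains a surrogate.
    print(has_surrogate(s), encodable_as_utf8(s))
```

For every sample, the two checks disagree with each other in the expected way: a string is encodable precisely when it contains no surrogate code point.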

There are a few additional references to "character" that would have to be changed to "Unicode code point" to make this proposal complete. See
http://www.ietf.org/mail-archive/web/json/current/msg00870.html

Norbert
