Re: [Json] Proposed minimal change for strings

Nico Williams <nico@cryptonector.com> Wed, 03 July 2013 16:41 UTC

On Wed, Jul 3, 2013 at 10:28 AM, Paul Hoffman <paul.hoffman@vpnc.org> wrote:
> On Jul 2, 2013, at 8:44 PM, Nico Williams <nico@cryptonector.com> wrote:
>> Huh?  Do you mean that any code unit may be allowed if escaped?
>
> That is exactly what the current document says, I believe. Do you see anything in the grammar that says differently?

No, but I couldn't understand what you wrote, which was:

| Proposal 1 (allow all code units in their unescaped form):

Surely you did not mean that all 16-bit code units could be sent
unescaped, but that is how I read it.  I figured you'd made a typo.

>>> In section 2.2 (Strings):
>>>  Leave the production for "unescaped" as-is.
>>> In section 3 (Encoding):
>>>  Add "Some strings, notably those that have unescaped surrogate code units
>>>  (value 0xD800 to 0xDFFF), cannot be encoded in UTF-8."
>>
>> Unescaped and *unpaired*.
>
> No, any surrogate code point. RFC 3629, the IETF's definition of UTF-8, says:
>    The definition of UTF-8 prohibits encoding character numbers between
>    U+D800 and U+DFFF, which are reserved for use with the UTF-16
>    encoding form (as surrogate pairs) and do not directly represent
>    characters.
> Similar language is used in The Unicode Standard's definition of UTF-8 (D92), and of UTF-32 (D90).

Yet there are ECMAScript applications that send these.  In their
escaped form there is no conflict with UTF-8, nor with UTF-32, nor even
with UTF-16 when they are unpaired.  In their unescaped form there is
most definitely a conflict.
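
To make the escaped/unescaped distinction concrete, here is a minimal
sketch in Python (my choice of language for illustration; the helper
name can_encode_utf8 is hypothetical).  It assumes the standard json
module's default ensure_ascii=True behaviour, which writes a lone
surrogate as a \uXXXX escape:

```python
import json

lone = "\ud800"  # an unpaired high surrogate

# Escaped form: the JSON text is plain ASCII, so it is valid UTF-8.
escaped = json.dumps(lone)              # ensure_ascii=True by default
escaped_bytes = escaped.encode("utf-8")

# Unescaped form: the surrogate itself would have to be encoded,
# which a strict UTF-8 encoder refuses to do.
def can_encode_utf8(s):
    """Hypothetical helper: True if s encodes as strict UTF-8."""
    try:
        s.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False

print(escaped)                 # the six-character escape, in quotes
print(can_encode_utf8(lone))   # strict UTF-8 rejects the raw surrogate
```

The escaped text travels fine in any UTF; only the raw code unit is
the problem.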

Which brings up: what should a parser do when it sees escaped code
units?  Should it attempt to unescape them in the parsed result?  What
if an escaped code unit cannot be unescaped because doing so would
result in invalid UTF-8/16/32?  RFC 4627 says nothing about this.
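
As one data point (not a recommendation), this is a sketch of what
CPython's json module does today: it combines an escaped surrogate pair
into a single code point, and passes an escaped unpaired surrogate
through into the parsed string.  The helper has_surrogate is a
hypothetical name, added only to show how a caller might detect the
second case:

```python
import json

paired = json.loads('"\\ud83d\\ude00"')   # escaped surrogate pair
lone = json.loads('"\\ud800"')            # escaped unpaired surrogate

def has_surrogate(s):
    """Hypothetical check: does the parsed string still hold a surrogate?"""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in s)

print(len(paired), hex(ord(paired)))  # one code point after unescaping
print(has_surrogate(paired))          # the pair was combined
print(has_surrogate(lone))            # the lone surrogate survived parsing
```

A string for which has_surrogate returns true cannot then be written
out as strict UTF-8, which is exactly the case the questions above are
about.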

My proposal is to allow any code unit, subject to two conditions:
unpaired surrogates must be escaped, and anything else, if unescaped,
must be properly represented in the UTF of the JSON document.  For
example, if a pair of surrogates appears in UTF-16, then re-encoding to
UTF-8 must produce UTF-8, not CESU-8: the pair must be decoded into a
code point, and that code point re-encoded in UTF-8.
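
A minimal sketch of the pair-to-code-point step, again in Python.  The
function name combine is hypothetical, and the "surrogatepass" error
handler is used here only as a stand-in to show what CESU-8-style
output (each surrogate encoded separately) looks like:

```python
def combine(hi, lo):
    """Decode a UTF-16 surrogate pair into one code point."""
    if not (0xD800 <= hi <= 0xDBFF and 0xDC00 <= lo <= 0xDFFF):
        raise ValueError("not a high/low surrogate pair")
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)

cp = combine(0xD83D, 0xDE00)       # U+1F600
utf8 = chr(cp).encode("utf-8")     # correct: one 4-byte sequence

# What must NOT be produced: each surrogate encoded on its own,
# giving two 3-byte sequences (CESU-8 style).
cesu8_style = "\ud83d\ude00".encode("utf-8", "surrogatepass")

print(hex(cp), utf8.hex(), cesu8_style.hex())
```

The four-byte form is the only one a conforming UTF-8 decoder will
accept.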

Nico
--
