Re: [Json] A possible summary of the discussion so far on code points and characters

R S <sayrer@gmail.com> Sun, 09 June 2013 04:48 UTC

Date: Sat, 08 Jun 2013 21:48:04 -0700
Message-ID: <CAChr6SzTHkbfXgUxYWLijyoYz0ug2TMjoVzFgDEF+Mz+idZ1Yg@mail.gmail.com>
From: R S <sayrer@gmail.com>
To: Carsten Bormann <cabo@tzi.org>
Content-Type: multipart/alternative; boundary="047d7ba9751895783d04deb15e02"
Cc: "json@ietf.org" <json@ietf.org>
Subject: Re: [Json] A possible summary of the discussion so far on code points and characters
Precedence: list

On Sat, Jun 8, 2013 at 7:29 PM, Carsten Bormann <cabo@tzi.org> wrote:
>
>
> Changing the JSON spec retroactively to put in a requirement for handling
> strings in UTF-16 code units so that unpaired surrogates might work more
> uniformly is something different.
>

I haven't proposed a change to the spec--have you? I'm fine with the status
quo: vaguely referring to Unicode characters with the full knowledge that
JSON is intended to produce identical results to JavaScript's eval function
for the subset of JavaScript syntax that JSON supports.

>
> (BTW, your examples show that two JSON implementations handle Unicode
> non-characters nicely, which is great and probably something to be
> recommended, but doesn't have anything to do with switching to UTF-16 code
> units.  Now let's put in a couple of (paired!) surrogates to show how well
> the code units work:
>
> >>> print(json.loads('{"a": "\ud800\udd51" }')["a"])
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in
> position 0: surrogates not allowed
> >>> print(json.loads('{"a": "\\ud800\\udd51" }')["a"])
> 𐅑
>
> ... which demonstrates nicely what I have been saying: Don't put unescaped
> surrogates into your JSON texts, because there is no equivalence at the
> UTF-16 code unit level.)
>

That must be Python 3. That shell session is a little misleading, because
it's the print function that throws the exception, not the JSON parser. The
Python 3 JSON parser accepts both inputs, and the Python 3 JSON encoder
produces identical output from them. The two parsed values compare equal in
Python 2.7 but not in Python 3. Either way, both JSON messages seem
unambiguous to Python's JSON encoder.

~$ python
Python 2.7.3 (default, Sep 26 2012, 21:53:58)
[GCC 4.7.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import json
>>> json.loads('{"a": "\\ud800\\udd51"}')["a"] == json.loads('{"a": "\ud800\udd51"}')["a"]
True
>>>

~$ python3
Python 3.2.3 (default, Sep 30 2012, 16:43:30)
[GCC 4.7.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import json
>>> json.loads('{"a": "\\ud800\\udd51"}')["a"] == json.loads('{"a": "\ud800\udd51"}')["a"]
False
>>> json.dumps(json.loads('{"a": "\ud800\udd51"}')["a"])
'"\\ud800\\udd51"'
>>> json.dumps(json.loads('{"a": "\\ud800\\udd51"}')["a"])
'"\\ud800\\udd51"'
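
The Python 3 observations above can be collected into one script. This is
only a sketch, assuming CPython 3.3 or later; the variable names and the
can_encode_utf8 helper are made up for illustration. It shows that the
parser accepts both inputs, that the parsed values differ, that the encoder
output matches, and that only UTF-8 encoding (which print needs) rejects
the unpaired surrogates.

```python
import json

# Python-level escapes: the JSON text itself contains two raw surrogate code points.
raw = '{"a": "\ud800\udd51"}'
# JSON-level escapes: the JSON text contains the twelve ASCII characters \ud800\udd51.
escaped = '{"a": "\\ud800\\udd51"}'

from_raw = json.loads(raw)["a"]          # two separate surrogate code points
from_escaped = json.loads(escaped)["a"]  # the pair is combined into U+10151


def can_encode_utf8(s):
    """Return True if s can be encoded as UTF-8 (hypothetical helper)."""
    try:
        s.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


same_value = from_raw == from_escaped
same_json = json.dumps(from_raw) == json.dumps(from_escaped)

print(len(from_raw), len(from_escaped))
print(same_value, same_json)
print(can_encode_utf8(from_raw), can_encode_utf8(from_escaped))
```

Both json.loads calls succeed here; the failure in the quoted session only
appears once the string has to be encoded for output.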

Here's Node.js / V8 for comparison:

~$ node --version
v0.8.9
~$ node
> JSON.parse('{"a": "\ud800\udd51"}')["a"] == JSON.parse('{"a": "\\ud800\\udd51"}')["a"]
true
> JSON.stringify(JSON.parse('{"a": "\ud800\udd51"}')["a"])
'"𐅑"'
> JSON.stringify(JSON.parse('{"a": "\\ud800\\udd51"}')["a"])
'"𐅑"'

>
> There is only a single place where the UTF-16 legacy of JavaScript shines
> through in JSON today:
>
>    To escape an extended character that is not in the Basic Multilingual
>    Plane, the character is represented as a twelve-character sequence,
>    encoding the UTF-16 surrogate pair.  So, for example, a string
>    containing only the G clef character (U+1D11E) may be represented as
>    "\uD834\uDD1E".
>
> And that is OK because it is just a slightly weird representation of the
> character.
>
> > > (It is not aligned with JSON's main purpose.)
> >
> > I am not sure what the rationale for that statement is.
>
> The first sentence of RFC 4627:
>

I don't find that convincing at all. It is obvious that strings in JSON are
meant to encode all JavaScript strings, as if they were passed to
JavaScript's eval function (there is one unrelated bug here). RFC 4627 even
refers to JSON as a JavaScript subset, and references the eval function in
the security considerations.

If we must "improve" the current text, I have a suggested addition which
borrows from your emails. I'm not sure where to add it, because it doesn't
fit well with the current structure of the document.

"At their most basic level, JSON strings represent a vector of
unconstrained 16-bit values which largely map to UCS-2. Implementations MAY
apply more stringent Unicode validation."
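
To make the "vector of 16-bit values" reading concrete, here is a rough
Python 3 sketch that views a string as UTF-16 code units. It assumes Python
3.4 or later, where the "surrogatepass" error handler works with the UTF-16
codecs; the function name utf16_code_units is made up for illustration and
is not part of any JSON library.

```python
import struct


def utf16_code_units(s):
    """Return the UTF-16 code units of s as a list of 16-bit integers.

    Unpaired surrogates are passed through unchanged instead of raising.
    """
    data = s.encode("utf-16-le", "surrogatepass")
    # Two bytes per code unit, little-endian unsigned 16-bit values.
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


# The G clef example from RFC 4627: U+1D11E becomes the pair D834 DD1E.
g_clef = utf16_code_units("\U0001D11E")

# Two separate surrogate code points and the single character U+10151
# are different Python 3 strings, but they have the same code units.
split_pair = utf16_code_units("\ud800\udd51")
combined = utf16_code_units("\U00010151")

print([hex(u) for u in g_clef])
print(split_pair == combined)
```

At this level the two inputs from the Python 3 session are the same
sequence, which matches what V8 reports when it compares them.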

- Rob
