Re: [Json] A possible summary of the discussion so far on code points and characters

R S <sayrer@gmail.com> Sun, 09 June 2013 04:48 UTC

In-Reply-To: <8C87F4D2-CABE-4F26-A5B1-6BC9C759C7CD@tzi.org>
References: <AF793CAF-B30B-44A7-B864-82CEF79EA34D@vpnc.org> <CAChr6SwLDCUk0DC9pGTKqUu_V5vJHvs7Sgv4EneTJMryn1iKSA@mail.gmail.com> <D27EA9DC-9EFE-419B-BC34-3BF3FC8F5260@vpnc.org> <EF244D9B-29E2-40E4-99FF-810A28091106@tzi.org> <CAChr6Sxwhdn8CshU92y6fcoovzzhcayg3MECP7Hg=UXX390z=w@mail.gmail.com> <8C87F4D2-CABE-4F26-A5B1-6BC9C759C7CD@tzi.org>
Date: Sat, 8 Jun 2013 21:48:04 -0700
Message-ID: <CAChr6SzTHkbfXgUxYWLijyoYz0ug2TMjoVzFgDEF+Mz+idZ1Yg@mail.gmail.com>
From: R S <sayrer@gmail.com>
To: Carsten Bormann <cabo@tzi.org>
Content-Type: multipart/alternative; boundary=047d7ba9751895783d04deb15e02
Cc: "json@ietf.org" <json@ietf.org>
Subject: Re: [Json] A possible summary of the discussion so far on code points and characters

On Sat, Jun 8, 2013 at 7:29 PM, Carsten Bormann <cabo@tzi.org> wrote:
>
>
> Changing the JSON spec retroactively to put in a requirement for handling
> strings in UTF-16 code units so that unpaired surrogates might work more
> uniformly is something different.
>


I haven't proposed a change to the spec--have you? I'm fine with the status
quo: vaguely referring to Unicode characters with the full knowledge that
JSON is intended to produce identical results to JavaScript's eval function
for the subset of JavaScript syntax that JSON supports.



>
> (BTW, your examples show that two JSON implementations handle Unicode
> non-characters nicely, which is great and probably something to be
> recommended, but doesn't have anything to do with switching to UTF-16 code
> units.  Now let's put in a couple of (paired!) surrogates to show how well
> the code units work:
>
> >>> print(json.loads('{"a": "\ud800\udd51" }')["a"])
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in
> position 0: surrogates not allowed
> >>> print(json.loads('{"a": "\\ud800\\udd51" }')["a"])
> 𐅑
>
> ... which demonstrates nicely what I have been saying: Don't put unescaped
> surrogates into your JSON texts, because there is no equivalence at the
> UTF-16 code unit level.)
>


That must be Python 3. The shell session is a little misleading, because
it's the print function that throws the exception, not the parser. The
Python 3 JSON parser accepts both inputs, and the Python 3 JSON encoder
produces identical output from them. The two parsed values compare as equal
in Python 2.7 but not in Python 3. Either way, both JSON messages seem
unambiguous to Python's JSON encoder.

~$ python
Python 2.7.3 (default, Sep 26 2012, 21:53:58)
[GCC 4.7.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import json
>>> json.loads('{"a": "\\ud800\\udd51"}')["a"] == json.loads('{"a": "\ud800\udd51"}')["a"]
True
>>>

~$ python3
Python 3.2.3 (default, Sep 30 2012, 16:43:30)
[GCC 4.7.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import json
>>> json.loads('{"a": "\\ud800\\udd51"}')["a"] == json.loads('{"a": "\ud800\udd51"}')["a"]
False
>>> json.dumps(json.loads('{"a": "\ud800\udd51"}')["a"])
'"\\ud800\\udd51"'
>>> json.dumps(json.loads('{"a": "\\ud800\\udd51"}')["a"])
'"\\ud800\\udd51"'
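
To separate the parsing step from the printing step, here is a minimal
sketch, assuming a recent Python 3. The variable names are only for
illustration. It shows that both inputs parse, that the failure only
appears when the value holding lone surrogates is encoded as UTF-8 (which
is what print has to do on a UTF-8 terminal), and that the encoder emits
the same JSON text for both values.

```python
import json

# Raw (unescaped) surrogates inside the JSON text vs. \u-escaped surrogates.
raw = json.loads('{"a": "\ud800\udd51"}')["a"]
escaped = json.loads('{"a": "\\ud800\\udd51"}')["a"]

# Both parse. The raw form stays as two lone surrogate code points;
# the escaped form is combined into the single code point U+10151.
print(len(raw), len(escaped))

# Encoding is where the raw form fails, not parsing.
try:
    raw.encode("utf-8")
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True
print(encode_failed)

# The encoder still emits the same JSON text for both values.
print(json.dumps(raw) == json.dumps(escaped))
```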

Here's Node.js / V8 for comparison:

~$ node --version
v0.8.9
~$ node
> JSON.parse('{"a": "\ud800\udd51"}')["a"] == JSON.parse('{"a": "\\ud800\\udd51"}')["a"]
true
> JSON.stringify(JSON.parse('{"a": "\ud800\udd51"}')["a"])
'"𐅑"'
> JSON.stringify(JSON.parse('{"a": "\\ud800\\udd51"}')["a"])
'"𐅑"'



>
> There is only a single place where the UTF-16 legacy of JavaScript shines
> through in JSON today:
>
>    To escape an extended character that is not in the Basic Multilingual
>    Plane, the character is represented as a twelve-character sequence,
>    encoding the UTF-16 surrogate pair.  So, for example, a string
>    containing only the G clef character (U+1D11E) may be represented as
>    "\uD834\uDD1E".
>
> And that is OK because it is just a slightly weird representation of the
> character.
>
> > > (It is not aligned with JSON's main purpose.)
> >
> > I am not sure what the rationale for that statement is.
>
> The first sentence of RFC 4627:
>


I don't find that convincing at all. It is obvious that strings in JSON are
meant to encode all JavaScript strings, as if they were passed to
JavaScript's eval function (there is one unrelated bug here). RFC 4627 even
refers to JSON as a JavaScript subset, and references the eval function in
the security considerations.

If we must "improve" the current text, I have a suggested addition which
borrows from your emails. I'm not sure where to add it, because it doesn't
fit well with the current structure of the document.

"At their most basic level, JSON strings represent a vector of
unconstrained 16-bit values which largely map to UCS-2. Implementations MAY
apply more stringent Unicode validation."
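
To make the "vector of 16-bit values" reading concrete, here is a rough
sketch, assuming Python 3.4 or later. The helper name code_units is made up
for illustration, and the "surrogatepass" error handler is used so that
lone surrogates survive the conversion. It is only one way to look at the
values, not a suggestion for how a parser should be built.

```python
import json
import struct

def code_units(s):
    """Return the string as a list of 16-bit UTF-16 code units."""
    # 'surrogatepass' lets lone surrogates through instead of raising.
    data = s.encode("utf-16-be", "surrogatepass")
    return list(struct.unpack(">%dH" % (len(data) // 2), data))

raw = json.loads('{"a": "\ud800\udd51"}')["a"]
escaped = json.loads('{"a": "\\ud800\\udd51"}')["a"]

# The two values differ as Python 3 strings...
print(raw == escaped)
# ...but are the same sequence of 16-bit values.
print(code_units(raw) == code_units(escaped))
print([hex(u) for u in code_units(escaped)])
```

Seen this way, the two inputs from the sessions above are the same string,
which matches what Node.js reports.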

- Rob