Re: [Json] A possible summary of the discussion so far on code points and characters
R S <sayrer@gmail.com> Sun, 09 June 2013 04:48 UTC
In-Reply-To: <8C87F4D2-CABE-4F26-A5B1-6BC9C759C7CD@tzi.org>
References: <AF793CAF-B30B-44A7-B864-82CEF79EA34D@vpnc.org> <CAChr6SwLDCUk0DC9pGTKqUu_V5vJHvs7Sgv4EneTJMryn1iKSA@mail.gmail.com> <D27EA9DC-9EFE-419B-BC34-3BF3FC8F5260@vpnc.org> <EF244D9B-29E2-40E4-99FF-810A28091106@tzi.org> <CAChr6Sxwhdn8CshU92y6fcoovzzhcayg3MECP7Hg=UXX390z=w@mail.gmail.com> <8C87F4D2-CABE-4F26-A5B1-6BC9C759C7CD@tzi.org>
Date: Sat, 08 Jun 2013 21:48:04 -0700
Message-ID: <CAChr6SzTHkbfXgUxYWLijyoYz0ug2TMjoVzFgDEF+Mz+idZ1Yg@mail.gmail.com>
From: R S <sayrer@gmail.com>
To: Carsten Bormann <cabo@tzi.org>
Cc: "json@ietf.org" <json@ietf.org>

On Sat, Jun 8, 2013 at 7:29 PM, Carsten Bormann <cabo@tzi.org> wrote:

> Changing the JSON spec retroactively to put in a requirement for
> handling strings in UTF-16 code units so that unpaired surrogates
> might work more uniformly is something different.

I haven't proposed a change to the spec--have you? I'm fine with the
status quo: vaguely referring to Unicode characters with the full
knowledge that JSON is intended to produce identical results to
JavaScript's eval function for the subset of JavaScript syntax that JSON
supports.

> (BTW, your examples show that two JSON implementations handle Unicode
> non-characters nicely, which is great and probably something to be
> recommended, but doesn't have anything to do with switching to UTF-16
> code units. Now let's put in a couple of (paired!) surrogates to show
> how well the code units work:
>
> >>> print(json.loads('{"a": "\ud800\udd51" }')["a"])
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 0: surrogates not allowed
> >>> print(json.loads('{"a": "\\ud800\\udd51" }')["a"])
> 𐅑
>
> ... which demonstrates nicely what I have been saying: Don't put
> unescaped surrogates into your JSON texts, because there is no
> equivalence at the UTF-16 code unit level.)

That must be Python3. That shell session is a little misleading, because
it's the print function throwing the exception. The Python3 JSON parser
accepts both, and the Python3 JSON encoder produces identical output from
those inputs.

Although they do in Python 2.7, the two inputs don't compare as equal in
Python 3. However, both JSON messages seem unambiguous to Python's JSON
encoder.

    ~$ python
    Python 2.7.3 (default, Sep 26 2012, 21:53:58)
    [GCC 4.7.2] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import json
    >>> json.loads('{"a": "\\ud800\\udd51"}')["a"] == json.loads('{"a": "\ud800\udd51"}')["a"]
    True
    >>>

    ~$ python3
    Python 3.2.3 (default, Sep 30 2012, 16:43:30)
    [GCC 4.7.2] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import json
    >>> json.loads('{"a": "\\ud800\\udd51"}')["a"] == json.loads('{"a": "\ud800\udd51"}')["a"]
    False
    >>> json.dumps(json.loads('{"a": "\ud800\udd51"}')["a"])
    '"\\ud800\\udd51"'
    >>> json.dumps(json.loads('{"a": "\\ud800\\udd51"}')["a"])
    '"\\ud800\\udd51"'

Here's Node.js / V8 for comparison:

    ~$ node --version
    v0.8.9
    ~$ node
    > JSON.parse('{"a": "\ud800\udd51"}')["a"] == JSON.parse('{"a": "\\ud800\\udd51"}')["a"]
    true
    > JSON.stringify(JSON.parse('{"a": "\ud800\udd51"}')["a"])
    '"𐅑"'
    > JSON.stringify(JSON.parse('{"a": "\\ud800\\udd51"}')["a"])
    '"𐅑"'

> There is only a single place where the UTF-16 legacy of JavaScript
> shines through in JSON today:
>
>    To escape an extended character that is not in the Basic
>    Multilingual Plane, the character is represented as a
>    twelve-character sequence, encoding the UTF-16 surrogate pair. So,
>    for example, a string containing only the G clef character
>    (U+1D11E) may be represented as "\uD834\uDD1E".
>
> And that is OK because it is just a slightly weird representation of
> the character.
>
> > > (It is not aligned with JSON's main purpose.)
> >
> > I am not sure what the rationale for that statement is.
>
> The first sentence of RFC 4627:

I don't find that convincing at all. It is obvious that strings in JSON
are meant to encode all JavaScript strings, as if they were passed to
JavaScript's eval function (there is one unrelated bug here). RFC 4627
even refers to JSON as a JavaScript subset, and references the eval
function in the security considerations.

If we must "improve" the current text, I have a suggested addition which
borrows from your emails. I'm not sure where to add it, because it
doesn't fit well with the current structure of the document.

"At their most basic level, JSON strings represent a vector of
unconstrained 16-bit values which largely map to UCS-2. Implementations
MAY apply more stringent Unicode validation."

- Rob
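
For readers who want to reproduce the points above, here is a minimal sketch, assuming Python 3.3 or later (where str is indexed by code point). It computes the surrogate-pair escape that RFC 4627 describes for U+1D11E and then repeats the escaped-versus-literal surrogate comparison from the Python 3 session. The helper names to_surrogate_pair and json_escape are hypothetical illustrations, not part of the json module.

```python
import json


def to_surrogate_pair(code_point):
    """Split a code point outside the BMP into (high, low) UTF-16 surrogates.

    Hypothetical helper for illustration only.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the Basic Multilingual Plane")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the 20-bit offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the 20-bit offset
    return high, low


def json_escape(code_point):
    """Build the twelve-character JSON escape for a non-BMP code point."""
    high, low = to_surrogate_pair(code_point)
    return "\\u%04X\\u%04X" % (high, low)


# G clef, the example quoted from RFC 4627.
g_clef_escape = json_escape(0x1D11E)
decoded = json.loads('"' + g_clef_escape + '"')

# The two inputs from the thread: literal surrogates in the Python source
# string versus backslash escapes that the JSON parser itself decodes.
literal = json.loads('"\ud800\udd51"')
escaped = json.loads('"\\ud800\\udd51"')

print(g_clef_escape)
print(hex(ord(decoded)))
print(len(literal), len(escaped), literal == escaped)
# ensure_ascii is the default, so nothing printed here contains a raw surrogate.
print(json.dumps(literal), json.dumps(escaped))
```

The parser combines the escaped pair into one code point but leaves the two literal surrogates as separate code points, which is why the values compare unequal while the encoder output is the same for both. Printing the literal value directly would raise the UnicodeEncodeError shown in the quoted session, so the sketch only prints ASCII-safe forms.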