Re: [Json] A possible summary of the discussion so far on code points and characters

Tim Bray <tbray@textuality.com> Sun, 09 June 2013 07:08 UTC

To: R S <sayrer@gmail.com>
Cc: Carsten Bormann <cabo@tzi.org>, "json@ietf.org" <json@ietf.org>

It seems clear that the intent of JSON, judging by the language in RFC 4627
and the observed usage in a zillion RESTful protocols currently in
production, is that JSON strings be used to interchange Unicode character
sequences.

It seems clear that (at least partly as a side-effect of the JavaScript
“character” model) there is no normative requirement to avoid Unicode abuse
such as the use of non-character codepoints and naked surrogates, which
will predictably lead to consequences such as Carsten’s exploding-python
example.
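
To make the failure mode concrete, here is a minimal sketch, assuming
Python 3 and its standard json module. The helper name can_encode_utf8 is
made up for illustration. It shows that a parser may happily accept an
escaped naked surrogate while the resulting string cannot be written out as
UTF-8, whereas a properly paired surrogate escape is fine.

```python
import json

def can_encode_utf8(s):
    """Return True if the string can be serialized as UTF-8."""
    try:
        s.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False

# JSON text with an escaped lone (unpaired) high surrogate.
lone = json.loads('{"a": "\\ud800"}')["a"]

# JSON text with a properly paired surrogate escape (U+10151).
paired = json.loads('{"a": "\\ud800\\udd51"}')["a"]

# The parser accepts both; only the paired one survives UTF-8 encoding.
lone_ok = can_encode_utf8(lone)
paired_ok = can_encode_utf8(paired)
print(lone_ok, paired_ok)
```

Other parsers may reject the lone surrogate up front or replace it, which is
exactly the kind of divergence that makes such strings a bad idea to send.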

So maybe just leave the spec more or less the way it is. Say in the
introduction that strings are for interchanging Unicode characters, and
observe in the fine print that the specification does not forbid the use of
things that cannot be useful in the Unicode context and will quite likely
cause software breakage. And in the best-practices doc, say “Encode only
Unicode codepoints, and use only UTF-8 to do it.”
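
As a rough sketch of what a producer following that advice might do,
assuming Python 3: the function names is_interchange_safe and
encode_string_utf8 are hypothetical, and treating noncharacters as errors is
one possible policy, not something RFC 4627 requires.

```python
import json

def is_interchange_safe(s):
    """True if s has no surrogate code points and no noncharacters."""
    for ch in s:
        cp = ord(ch)
        if 0xD800 <= cp <= 0xDFFF:
            return False  # surrogate code point, not a character
        if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
            return False  # Unicode noncharacter
    return True

def encode_string_utf8(s):
    """Serialize s as a JSON string in UTF-8, refusing unsafe input."""
    if not is_interchange_safe(s):
        raise ValueError("string contains surrogates or noncharacters")
    # ensure_ascii=False keeps characters literal instead of \u escapes.
    return json.dumps(s, ensure_ascii=False).encode("utf-8")

encoded = encode_string_utf8("\U00010151")
```

A consumer could apply the same check after parsing if it wants to reject
such input rather than pass it along.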

Also, I think that any significant change to the sentence in the
Introduction: “A string is a sequence of zero or more Unicode characters
[UNICODE].” would represent a dramatic change in the specification of JSON,
and thus be out of scope for this WG.

-T


On Sat, Jun 8, 2013 at 9:48 PM, R S <sayrer@gmail.com> wrote:

>
>
>
> On Sat, Jun 8, 2013 at 7:29 PM, Carsten Bormann <cabo@tzi.org> wrote:
>>
>>
>> Changing the JSON spec retroactively to put in a requirement for handling
>> strings in UTF-16 code units so that unpaired surrogates might work more
>> uniformly is something different.
>>
>
>
> I haven't proposed a change to the spec--have you? I'm fine with the
> status quo: vaguely referring to Unicode characters with the full knowledge
> that JSON is intended to produce identical results to JavaScript's eval
> function for the subset of JavaScript syntax that JSON supports.
>
>
>
>>
>> (BTW, your examples show that two JSON implementations handle Unicode
>> non-characters nicely, which is great and probably something to be
>> recommended, but doesn't have anything to do with switching to UTF-16 code
>> units.  Now let's put in a couple of (paired!) surrogates to show how well
>> the code units work:
>>
>> >>> print(json.loads('{"a": "\ud800\udd51" }')["a"])
>> Traceback (most recent call last):
>>   File "<stdin>", line 1, in <module>
>> UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 0: surrogates not allowed
>> >>> print(json.loads('{"a": "\\ud800\\udd51" }')["a"])
>> 𐅑
>>
>> ... which demonstrates nicely what I have been saying: Don't put
>> unescaped surrogates into your JSON texts, because there is no equivalence
>> at the UTF-16 code unit level.)
>>
>
>
> That must be Python3. That shell session is a little misleading, because
> it's the print function throwing the exception. The Python3 JSON parser
> accepts both, and the Python3 JSON encoder produces identical output from
> those inputs. Although they do in Python 2.7, the two inputs don't compare
> as equal in Python 3. However, both JSON messages seem unambiguous to
> Python's JSON encoder.
>
> ~$ python
> Python 2.7.3 (default, Sep 26 2012, 21:53:58)
> [GCC 4.7.2] on linux2
>
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import json
> >>> json.loads('{"a": "\\ud800\\udd51"}')["a"] == json.loads('{"a": "\ud800\udd51"}')["a"]
> True
> >>>
>
> ~$ python3
> Python 3.2.3 (default, Sep 30 2012, 16:43:30)
> [GCC 4.7.2] on linux2
>
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import json
> >>> json.loads('{"a": "\\ud800\\udd51"}')["a"] == json.loads('{"a": "\ud800\udd51"}')["a"]
> False
> >>> json.dumps(json.loads('{"a": "\ud800\udd51"}')["a"])
> '"\\ud800\\udd51"'
> >>> json.dumps(json.loads('{"a": "\\ud800\\udd51"}')["a"])
> '"\\ud800\\udd51"'
>
> Here's Node.js / V8 for comparison:
>
> ~$ node --version
> v0.8.9
> ~$ node
> > JSON.parse('{"a": "\ud800\udd51"}')["a"] == JSON.parse('{"a": "\\ud800\\udd51"}')["a"]
> true
> > JSON.stringify(JSON.parse('{"a": "\ud800\udd51"}')["a"])
> '"𐅑"'
> > JSON.stringify(JSON.parse('{"a": "\\ud800\\udd51"}')["a"])
> '"𐅑"'
>
>
>
>>
>> There is only a single place where the UTF-16 legacy of JavaScript shines
>> through in JSON today:
>>
>>    To escape an extended character that is not in the Basic Multilingual
>>    Plane, the character is represented as a twelve-character sequence,
>>    encoding the UTF-16 surrogate pair.  So, for example, a string
>>    containing only the G clef character (U+1D11E) may be represented as
>>    "\uD834\uDD1E".
>>
>> And that is OK because it is just a slightly weird representation of the
>> character.
>>
>> > > (It is not aligned with JSON's main purpose.)
>> >
>> > I am not sure what the rationale for that statement is.
>>
>> The first sentence of RFC 4627:
>>
>
>
> I don't find that convincing at all. It is obvious that strings in JSON
> are meant to encode all JavaScript strings, as if they were passed to
> JavaScript's eval function (there is one unrelated bug here). RFC 4627 even
> refers to JSON as a JavaScript subset, and references the eval function in
> the security considerations.
>
> If we must "improve" the current text, I have a suggested addition which
> borrows from your emails. I'm not sure where to add it, because it doesn't
> fit well with the current structure of the document.
>
> "At their most basic level, JSON strings represent a vector of
> unconstrained 16-bit values which largely map to UCS-2. Implementations MAY
> apply more stringent Unicode validation."
>
> - Rob