Re: [Json] A possible summary of the discussion so far on code points and characters

Carsten Bormann <cabo@tzi.org> Sun, 09 June 2013 02:30 UTC

From: Carsten Bormann <cabo@tzi.org>
Date: Sun, 09 Jun 2013 04:29:58 +0200
To: R S <sayrer@gmail.com>
Cc: json@ietf.org

On Jun 9, 2013, at 03:24, R S <sayrer@gmail.com> wrote:

> We should document what currently works.

A survey of which implementations react to what input in which way would be nice, indeed, but is not the purpose of this spec update.

Changing the JSON spec retroactively to put in a requirement for handling strings in UTF-16 code units so that unpaired surrogates might work more uniformly is something different.

(BTW, your examples show that two JSON implementations handle Unicode non-characters nicely, which is great and probably something to be recommended, but doesn't have anything to do with switching to UTF-16 code units.  Now let's put in a couple of (paired!) surrogates to show how well the code units work:

>>> print(json.loads('{"a": "\ud800\udd51" }')["a"])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 0: surrogates not allowed
>>> print(json.loads('{"a": "\\ud800\\udd51" }')["a"])
𐅑

... which demonstrates nicely what I have been saying: Don't put unescaped surrogates into your JSON texts, because there is no equivalence at the UTF-16 code unit level.)
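
The same contrast can be checked without printing the strings, which avoids depending on the terminal encoding. This is a minimal sketch, assuming CPython 3 and its standard json module; the helper can_encode_utf8 is a hypothetical name introduced here for illustration. In the first text, Python itself expands the \ud800\udd51 in the string literal, so the JSON parser sees two unescaped surrogate code points; in the second, the doubled backslashes leave the JSON escapes for the parser to combine.

```python
import json

# Python expands "\ud800\udd51" in an ordinary string literal before
# json.loads sees it, so this JSON text carries two unescaped surrogates.
unescaped_text = '{"a": "\ud800\udd51"}'
# Doubling the backslashes keeps the JSON escape sequences in the text.
escaped_text = '{"a": "\\ud800\\udd51"}'

unescaped = json.loads(unescaped_text)["a"]
escaped = json.loads(escaped_text)["a"]


def can_encode_utf8(s):
    """Hypothetical helper: report whether s can be encoded as UTF-8."""
    try:
        s.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


# Unescaped surrogates stay as two separate, unencodable code points.
print(len(unescaped), can_encode_utf8(unescaped))  # 2 False
# The escaped pair is combined into the single character U+10151.
print(len(escaped), can_encode_utf8(escaped), hex(ord(escaped)))  # 1 True 0x10151
```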

There is only a single place where the UTF-16 legacy of JavaScript shines through in JSON today:

   To escape an extended character that is not in the Basic Multilingual
   Plane, the character is represented as a twelve-character sequence,
   encoding the UTF-16 surrogate pair.  So, for example, a string
   containing only the G clef character (U+1D11E) may be represented as
   "\uD834\uDD1E".

And that is OK because it is just a slightly weird representation of the character.
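
The arithmetic behind that twelve-character sequence can be sketched as follows. The function name escape_non_bmp is hypothetical, chosen for this illustration, and the round trip assumes Python's standard json module; it is not meant as the way any particular JSON library builds its escapes.

```python
import json


def escape_non_bmp(code_point):
    """Return the twelve-character JSON escape for a code point above U+FFFF."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("expected a code point outside the BMP")
    offset = code_point - 0x10000          # 20-bit value to split in two
    high = 0xD800 + (offset >> 10)         # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits -> low surrogate
    return "\\u{:04X}\\u{:04X}".format(high, low)


g_clef = escape_non_bmp(0x1D11E)
print(g_clef)  # \uD834\uDD1E

# Parsing the escaped form yields one character, not two code units.
decoded = json.loads('"' + g_clef + '"')
print(len(decoded), hex(ord(decoded)))  # 1 0x1d11e
```

The parser hands back the single character U+1D11E, which is the sense in which the escape is only a representation of the character.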

> > (It is not aligned with JSON's main purpose.)
> 
> I am not sure what the rationale for that statement is.

The first sentence of RFC 4627:

Abstract

   JavaScript Object Notation (JSON) is a lightweight, text-based,
   language-independent data interchange format.

Grüße, Carsten
