Re: [Json] A possible summary of the discussion so far on code points and characters

Carsten Bormann <cabo@tzi.org> Sun, 09 June 2013 02:30 UTC

From: Carsten Bormann <cabo@tzi.org>
Date: Sun, 09 Jun 2013 04:29:58 +0200
To: R S <sayrer@gmail.com>
Cc: json@ietf.org

On Jun 9, 2013, at 03:24, R S <sayrer@gmail.com> wrote:

> We should document what currently works.

A survey of which implementations react to what input in which way would be nice, indeed, but is not the purpose of this spec update.

Retroactively changing the JSON spec to require that strings be handled as UTF-16 code units, just so that unpaired surrogates might behave more uniformly, is a different matter.

(BTW, your examples show that two JSON implementations handle Unicode non-characters nicely, which is great and probably something to be recommended, but doesn't have anything to do with switching to UTF-16 code units.  Now let's put in a couple of (paired!) surrogates to show how well the code units work:

>>> import json
>>> print(json.loads('{"a": "\ud800\udd51" }')["a"])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 0: surrogates not allowed
>>> print(json.loads('{"a": "\\ud800\\udd51" }')["a"])
𐅑

... which demonstrates nicely what I have been saying: Don't put unescaped surrogates into your JSON texts, because there is no equivalence at the UTF-16 code unit level.)
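
The same comparison can be made without printing, so that the terminal encoding does not get in the way. This is only a sketch, assuming Python 3 and its standard json module; the variable names and the helper is_utf8_encodable are made up for illustration:

```python
import json

# Two surrogate code points placed directly in the JSON text. The Python
# string literal turns \ud800 and \udd51 into characters before json.loads
# ever sees them, so the parser receives them unescaped.
raw = json.loads('{"a": "\ud800\udd51" }')["a"]

# The same pair written as JSON \u escapes (note the doubled backslashes).
# Here the JSON parser sees the escapes and joins them into one character.
escaped = json.loads('{"a": "\\ud800\\udd51" }')["a"]


def is_utf8_encodable(s):
    """Return True if s can be encoded as UTF-8 (hypothetical helper)."""
    try:
        s.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


print(len(raw), len(escaped))      # number of code points in each result
print(raw == escaped)              # the two inputs are not equivalent
print(is_utf8_encodable(raw), is_utf8_encodable(escaped))
```

With CPython 3 the unescaped input should come back as two lone surrogates that cannot be encoded as UTF-8, while the escaped input comes back as the single character U+10151.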

There is only a single place where the UTF-16 legacy of JavaScript shines through in JSON today:

   To escape an extended character that is not in the Basic Multilingual
   Plane, the character is represented as a twelve-character sequence,
   encoding the UTF-16 surrogate pair.  So, for example, a string
   containing only the G clef character (U+1D11E) may be represented as
   "\uD834\uDD1E".

And that is OK because it is just a slightly weird representation of the character.
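
For reference, the arithmetic behind that twelve-character sequence is short. The following is a minimal sketch in Python 3; the function name escape_non_bmp is an assumption for illustration, not something taken from RFC 4627 or from any library:

```python
import json


def escape_non_bmp(code_point):
    """Build the \\uXXXX\\uXXXX escape for a code point outside the BMP."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000        # 20 significant bits remain
    high = 0xD800 + (offset >> 10)       # top 10 bits: high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits: low surrogate
    return "\\u{:04X}\\u{:04X}".format(high, low)


# G clef, U+1D11E, the example used in the RFC text quoted above.
g_clef_escape = escape_non_bmp(0x1D11E)
print(g_clef_escape)

# A JSON parser turns the escaped pair back into the single character.
decoded = json.loads('"' + g_clef_escape + '"')
print(decoded == "\U0001D11E")
```

Python's own json.dumps produces the same pair for this character when ensure_ascii is left at its default, only with lowercase hex digits.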

> > (It is not aligned with JSON's main purpose.)
> 
> I am not sure what the rationale for that statement is.

The first sentence of RFC 4627:

Abstract

   JavaScript Object Notation (JSON) is a lightweight, text-based,
   language-independent data interchange format.

Grüße, Carsten