Re: [Json] A possible summary of the discussion so far on code points and characters

Norbert Lindenberg <> Wed, 12 June 2013 16:51 UTC

On Jun 8, 2013, at 21:48 , R S wrote:

> If we must "improve" the current text, I have a suggested addition which borrows from your emails. I'm not sure where to add it, because it doesn't fit well with the current structure of the document.
> "At their most basic level, JSON strings represent a vector of unconstrained 16-bit values which largely map to UCS-2. Implementations MAY apply more stringent Unicode validation."

JSON is in no way constrained to the Basic Multilingual Plane, so if we discuss 16-bit values in JSON, they're UTF-16 code units, not UCS-2.
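To illustrate the distinction, here is a minimal sketch using Python's standard json module. The choice of U+1F600 as the sample character is an assumption made for illustration. It shows a character outside the Basic Multilingual Plane being escaped as two \uXXXX values that are exactly its UTF-16 code units, which UCS-2 could not represent.

```python
import json

# U+1F600 lies outside the Basic Multilingual Plane (sample character
# chosen for illustration only).
s = "\U0001F600"

# With the default ensure_ascii=True, json.dumps escapes non-ASCII text
# as \uXXXX; a supplementary character becomes two escapes.
escaped = json.dumps(s)
print(escaped)

# Compute the UTF-16 code units of the same string for comparison.
utf16 = s.encode("utf-16-be")
units = [int.from_bytes(utf16[i:i + 2], "big") for i in range(0, len(utf16), 2)]
print([hex(u) for u in units])

# Parsing combines the surrogate pair back into a single code point.
decoded = json.loads(escaped)
print(len(decoded), hex(ord(decoded)))
```

The two escapes only make sense when read as a UTF-16 surrogate pair; under a UCS-2 reading they would be two unrelated 16-bit values.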

On Jun 9, 2013, at 0:08 , Tim Bray wrote:

> It seems clear that the intent of JSON, judging by the language in 4627, and the observed usage in a zillion RESTful protocols currently in production, is that JSON strings be used to interchange Unicode character sequences.

I agree.

> It seems clear that (at least partly as a side-effect of the JavaScript “character” model) there is no normative requirement to avoid Unicode abuse such as the use of non-character codepoints and naked surrogates, which will predictably lead to consequences such as Carsten’s exploding-python example.
> So maybe just leave the spec more or less the way it is. Say in the introduction that strings are for interchanging Unicode characters, observe in the fine print that the specification does not forbid the use of things that cannot be useful in the Unicode context and will quite likely cause software breakage.

It might be better to say "Unicode code points" rather than "Unicode characters":

- This makes the spec independent of Unicode versions: the set of Unicode code points is fixed (U+0000 to U+10FFFF), while the set of assigned characters in Unicode keeps growing, and different systems communicating via JSON may not be based on the same Unicode version.

- It makes clear that noncharacters, unassigned code points, and surrogate code points are all allowed in JSON (although subject to the limitations imposed by parsers, communication channels, security systems, or the character encoding used).
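The two points above can be sketched roughly as follows. The helper name classify is hypothetical, and the parsing behavior shown is that of CPython's json module only; other parsers may reject the same input. The sketch separates the fixed categories (surrogates, noncharacters) from everything else, and shows a lone surrogate getting through the parser but then failing at the character encoding step.

```python
import json


def classify(cp: int) -> str:
    """Rough, version-independent classification (hypothetical helper)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if 0xD800 <= cp <= 0xDFFF:
        return "surrogate"
    # Noncharacters: U+FDD0..U+FDEF plus the last two code points of each plane.
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return "noncharacter"
    # Assigned or unassigned: this depends on the Unicode version in use.
    return "other"


# CPython's json module accepts both of these escapes.
lone = json.loads('"\\ud800"')
nonchar = json.loads('"\\uffff"')
print(classify(ord(lone)), classify(ord(nonchar)))

# The limitation shows up later, when the string is encoded as UTF-8.
try:
    lone.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    encodable = False
print("lone surrogate encodable as UTF-8:", encodable)
```

Note that the "other" branch deliberately does not consult a character database, since whether a code point is assigned is the part that changes between Unicode versions.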

> And in the best-practices doc, say “Encode only Unicode codepoints, and use only UTF-8 to do it.”

UTF-8 is the right encoding to use over the wire or in files, but at runtime many systems (including all that implement the ECMAScript or DOM specifications) have to use UTF-16 semantics.
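
As a small sketch of why the two views differ, the following counts the same string three ways. The sample string is an assumption chosen to include one-, two-, three- and four-byte UTF-8 sequences. Python is used here only for illustration; an ECMAScript implementation would report the UTF-16 code unit count as the string's length.

```python
# 'a', U+00E9, U+20AC, U+1F600 (sample text chosen for illustration)
s = "a\u00e9\u20ac\U0001F600"

code_points = len(s)                            # Python counts code points
utf8_bytes = len(s.encode("utf-8"))             # size on the wire or in a file
utf16_units = len(s.encode("utf-16-le")) // 2   # what ECMAScript's length counts

print(code_points, utf8_bytes, utf16_units)
```

Indexes and lengths exchanged between such systems therefore need to state which of these units they are counted in.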