Re: [Json] Proposal for strings/Unicode text

John Cowan <> Thu, 13 June 2013 12:16 UTC

Date: Thu, 13 Jun 2013 08:16:20 -0400
From: John Cowan <>
To: Bjoern Hoehrmann <>
List-Id: "JavaScript Object Notation \(JSON\) WG mailing list" <>

Bjoern Hoehrmann scripsit:

> >Officially, yes.  But surrogate code points cannot be inserted directly
> >if the representation is UTF-8 (otherwise it becomes CESU-8 instead)
> >or UTF-16 (otherwise it is broken UTF-16) or random non-Unicode encodings.
> >So UTF-32 is the only encoding into which a surrogate code point can be
> >inserted directly -- and nobody uses it.
>     • Because surrogate code points are not included in the set of 
>       Unicode scalar values, UTF-32 code units in the range
>       0000D800_16 .. 0000DFFF_16 are ill-formed.

Well, sure.  Note the fine distinction between "can" (in physical
fact) and "may" (in the RFC 2119 sense).  It's invalid to have unpaired
surrogates in *any* context.  But at least it is safe and possible to do
so in UTF-32.  In UTF-8, there is no representation at all, and in UTF-16
you can't tell the difference between two consecutive unpaired surrogates
of opposite polarities and a surrogate pair.  (Though come to think of
it, escaping doesn't allow two consecutive unpaired surrogates either,
so maybe we can fairly say that either UTF-16 or UTF-32 allows them.)
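
A minimal sketch of the three cases in Python, offered only as an illustration. It assumes CPython's codecs and its "surrogatepass" error handler, which is a Python convenience for forcing ill-formed output and not part of any Unicode encoding form; the variable names are made up for this example.

```python
# Two lone surrogate code points; Python strings can hold them as-is.
high, low = "\ud83d", "\ude00"

# UTF-8: a strict encoder refuses a surrogate code point.
try:
    high.encode("utf-8")
    utf8_accepts = True
except UnicodeEncodeError:
    utf8_accepts = False

# UTF-16: "surrogatepass" writes the raw code units, but the bytes are
# identical to the encoding of U+1F600, so a decoder sees a surrogate pair.
utf16 = (high + low).encode("utf-16-le", "surrogatepass")
same_bytes = utf16 == "\U0001F600".encode("utf-16-le")
utf16_decoded = utf16.decode("utf-16-le")

# UTF-32: still ill-formed, but each surrogate keeps its own code unit,
# so the two code points survive a round trip as two code points.
utf32 = (high + low).encode("utf-32-le", "surrogatepass")
utf32_decoded = utf32.decode("utf-32-le", "surrogatepass")
```

Here `utf16_decoded` comes back as a single character, while `utf32_decoded` still holds the two separate surrogates.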

The point is that if JSON is encoded in UTF-8, any surrogate code points
MUST be escaped, even though the grammar does not say so.
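
As a rough illustration of that point, the sketch below uses Python's standard `json` module. It assumes CPython's behaviour, where `json.dumps` escapes non-ASCII characters by default and `json.loads` accepts a lone surrogate escape; other JSON libraries may reject such input, and the variable names are hypothetical.

```python
import json

lone = "\ud83d"  # a lone (unpaired) high surrogate code point

# Escaped form: the surrogate is written as \ud83d, so the JSON text is
# plain ASCII and can be encoded as UTF-8 without trouble.
escaped = json.dumps(lone)  # ensure_ascii=True is the default
wire = escaped.encode("utf-8")
roundtrip = json.loads(wire.decode("utf-8"))

# Unescaped form: the JSON text contains the surrogate itself, and a
# strict UTF-8 encoder cannot serialize it.
raw = json.dumps(lone, ensure_ascii=False)
try:
    raw.encode("utf-8")
    raw_encodable = True
except UnicodeEncodeError:
    raw_encodable = False
```

Only the escaped form makes it onto a UTF-8 wire; the unescaped text fails at the encoding step.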

John Cowan  
Uneasy lies the head that wears the Editor's hat! --Eddie Foirbeis Climo