Re: [Json] Proposal for strings/Unicode text

John Cowan <cowan@mercury.ccil.org> Thu, 13 June 2013 12:16 UTC

Date: Thu, 13 Jun 2013 08:16:20 -0400
From: John Cowan <cowan@mercury.ccil.org>
To: Bjoern Hoehrmann <derhoermi@gmx.net>
Cc: "json@ietf.org" <json@ietf.org>

Bjoern Hoehrmann scripsit:

> >Officially, yes.  But surrogate code points cannot be inserted directly
> >if the representation is UTF-8 (otherwise it becomes CESU-8 instead)
> >or UTF-16 (otherwise it is broken UTF-16) or random non-Unicode encodings.
> >So UTF-32 is the only encoding into which a surrogate code point can be
> >inserted directly -- and nobody uses it.
> 
>     • Because surrogate code points are not included in the set of 
>       Unicode scalar values, UTF-32 code units in the range
>       0000D800_16 .. 0000DFFF_16 are ill-formed.

Well, sure.  Note the fine distinction between "can" (in physical
fact) and "may" (in the RFC 2119 sense).  It's invalid to have unpaired
surrogates in *any* context.  But at least it is safe and possible to do
so in UTF-32.  In UTF-8, there is no representation at all, and in UTF-16
you can't tell the difference between two consecutive unpaired surrogates
of opposite polarities and a surrogate pair.  (Though come to think of
it, escaping can't distinguish a high surrogate followed by a low one
from a surrogate pair either, so maybe we can fairly say that either
UTF-16 or UTF-32 allows them.)
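
The "can" versus "may" distinction above can be sketched in a few lines of Python. This is only an illustration, assuming CPython 3.4 or later and its `surrogatepass` error handler, which deliberately bypasses the well-formedness check; the variable and function names are made up for this example and do not come from the thread.

```python
# Illustrative values only; these names are assumptions, not from the thread.
lone_high = "\ud83d"  # an unpaired high (lead) surrogate code point
lone_low = "\ude00"   # an unpaired low (trail) surrogate code point


def strict_utf8_rejects(s: str) -> bool:
    """Return True if Python's strict UTF-8 encoder refuses to encode s."""
    try:
        s.encode("utf-8")
    except UnicodeEncodeError:
        return True
    return False


# UTF-8: a strict encoder has no well-formed output for a surrogate.
utf8_refused = strict_utf8_rejects(lone_high)

# UTF-16: a high surrogate followed by a low surrogate produces exactly the
# code units of the surrogate pair for U+1F600, so a decoder cannot tell
# "two unpaired surrogates" from "one supplementary character".
two_lone = (lone_high + lone_low).encode("utf-16-be", "surrogatepass")
real_pair = "\U0001F600".encode("utf-16-be")
indistinguishable = two_lone == real_pair

# UTF-32: a single code unit can physically hold the surrogate value,
# even though the result is ill-formed by the Unicode Standard.
utf32_units = lone_high.encode("utf-32-be", "surrogatepass")
```

Decoding `two_lone` as UTF-16 yields the single character U+1F600, which is the ambiguity described above. The UTF-32 bytes are physically storable but a strict UTF-32 decoder will still reject them.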

The point is that if JSON is encoded in UTF-8, any surrogate code points
MUST be escaped, even though the grammar does not say so.
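
As a hedged illustration of that point, here is a small Python sketch using the standard `json` module. It assumes CPython's behaviour, where `json.dumps` escapes non-ASCII characters by default (`ensure_ascii=True`) and `json.loads` accepts a lone surrogate escape; other JSON libraries may differ. The variable names are chosen for this example only.

```python
import json

text = "\ud83d"  # a string holding one unpaired surrogate code point

# With escaping (the default, ensure_ascii=True) the surrogate is written as
# a \uXXXX escape, so the JSON text is plain ASCII and encodes to UTF-8.
escaped = json.dumps(text)
wire = escaped.encode("utf-8")
roundtrip = json.loads(wire.decode("utf-8"))

# Without escaping, the JSON text contains the raw surrogate code point,
# and a strict UTF-8 encoder refuses to serialize it.
raw = json.dumps(text, ensure_ascii=False)
try:
    raw.encode("utf-8")
    raw_fails = False
except UnicodeEncodeError:
    raw_fails = True
```

The escaped form survives a UTF-8 round trip, while the unescaped form cannot be written as well-formed UTF-8 at all, which is why the escape is effectively mandatory in that encoding.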

-- 
John Cowan            http://www.ccil.org/~cowan     cowan@ccil.org
Uneasy lies the head that wears the Editor's hat! --Eddie Foirbeis Climo