Re: [Json] Proposal for strings/Unicode text

John Cowan <cowan@mercury.ccil.org> Thu, 13 June 2013 00:32 UTC

Date: Wed, 12 Jun 2013 20:32:13 -0400
From: John Cowan <cowan@mercury.ccil.org>
To: Norbert Lindenberg <ietf@lindenbergsoftware.com>
Message-ID: <20130613003213.GA26989@mercury.ccil.org>
References: <CAHBU6ivNjMUwN2Hsn-E8FKxjqXS6b4qz=_MeeaHahWBWqG_Hgg@mail.gmail.com> <ED62F638-C0C4-411D-BA5B-EB9BA71EDB75@lindenbergsoftware.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Disposition: inline
In-Reply-To: <ED62F638-C0C4-411D-BA5B-EB9BA71EDB75@lindenbergsoftware.com>
User-Agent: Mutt/1.5.20 (2009-06-14)
Sender: John Cowan <cowan@ccil.org>
Cc: Tim Bray <tbray@textuality.com>, "json@ietf.org" <json@ietf.org>
Subject: Re: [Json] Proposal for strings/Unicode text

Norbert Lindenberg scripsit:

> JSON syntax allows almost all Unicode code points (with the exceptions
> visible in the "unescaped" production) to be inserted into strings
> directly, so escape sequences are mostly a convenience for using JSON
> in environments that restrict the set of allowed characters (such as
> RFCs), don't have the necessary fonts and input methods installed,
> or benefit from making the code points visible (e.g., in test cases).

Officially, yes.  But surrogate code points cannot be inserted directly
if the representation is UTF-8 (otherwise it becomes CESU-8 instead)
or UTF-16 (otherwise it is broken UTF-16) or random non-Unicode encodings.
So UTF-32 is the only encoding into which a surrogate code point can be
inserted directly -- and nobody uses it.

In practice, therefore, code points in the range U+D800-U+DFFF must be
escaped (as \uXXXX) if they are to be used.
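
A minimal sketch of the point above, assuming Python 3 and its standard
json module (neither is mentioned in this thread; the helper name
can_encode is hypothetical). It shows a lone surrogate being rejected by
the strict UTF-8 and UTF-16 codecs, while the escaped form survives as
plain ASCII:

```python
import json

# A lone high surrogate code point (U+D800), not part of a pair.
lone = "\ud800"


def can_encode(text, encoding):
    """Return True if text can be encoded with the strict codec."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Python's strict codecs refuse to emit a lone surrogate directly.
utf8_ok = can_encode(lone, "utf-8")
utf16_ok = can_encode(lone, "utf-16-le")

# With the default ensure_ascii=True, json.dumps writes the surrogate
# as a \uXXXX escape, so the JSON text itself is plain ASCII.
escaped = json.dumps(lone)

# Python's parser accepts the escape and yields the same code point back.
round_trip = json.loads(escaped) == lone

print(utf8_ok, utf16_ok, escaped, round_trip)
```

Other JSON libraries may treat a lone surrogate escape differently
(rejecting it or substituting U+FFFD), so the round trip shown here is a
property of this particular parser rather than of JSON in general.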

-- 
Mark Twain on Cecil Rhodes:                    John Cowan
I admire him, I freely admit it,               http://www.ccil.org/~cowan
and when his time comes I shall                cowan@ccil.org
buy a piece of the rope for a keepsake.