Re: [Json] On characters and code points

John Cowan <cowan@mercury.ccil.org> Fri, 07 June 2013 17:20 UTC

Date: Fri, 07 Jun 2013 13:19:51 -0400
From: John Cowan <cowan@mercury.ccil.org>
To: Tim Bray <tbray@textuality.com>
Message-ID: <20130607171950.GD13569@mercury.ccil.org>
References: <A723FC6ECC552A4D8C8249D9E07425A70FC2E7E1@xmb-rcd-x10.cisco.com> <51B06F38.8050707@crockford.com> <CAHBU6iuFBuW-RfgBLQF5q4BnUOzs088QXW3uOQG1OjBFjZttkw@mail.gmail.com> <51B1B4E7.8090101@it.aoyama.ac.jp> <9ld3r8pc0tufif18dohb2fmi0ijna1vs4n@hive.bjoern.hoehrmann.de> <56A163E9-E7CD-46B3-9984-8F009EBFF500@vpnc.org> <CAHBU6ivG=ONc8roT7W=LdpKYNMqRH_d5BobZ=pHnk=mVaKZKaA@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Disposition: inline
In-Reply-To: <CAHBU6ivG=ONc8roT7W=LdpKYNMqRH_d5BobZ=pHnk=mVaKZKaA@mail.gmail.com>
User-Agent: Mutt/1.5.20 (2009-06-14)
Sender: John Cowan <cowan@ccil.org>
Cc: Paul Hoffman <paul.hoffman@vpnc.org>, "json@ietf.org" <json@ietf.org>
Subject: Re: [Json] On characters and code points

Tim Bray scripsit:

> >    { "End of data marker": "\uFFFF" }
> >
> 
> Yes, I *really* want to prohibit that. The one corner case it buys you is
> outweighed by a factor of a thousand or so in not being able to use
> general-purpose string processing software to deal with JSON payloads.

Most general-purpose string processing software is perfectly happy
with U+FFFF.  There are three different kinds of code points here,
and it doesn't help to conflate them:

1) Surrogate code points.  These will never be assigned to any characters,
and are reserved for use as UTF-16 code units.  There are exactly 2048 of
these, from U+D800 to U+DFFF.

2) Non-character code points.  These will never be assigned to any
characters, and are not meant to be interchanged, but internal software
is expected to handle them.  There are exactly 66 of these, and U+FFFF
is one.  See <http://www.unicode.org/faq/private_use.html#noncharacters>
for more about this group.

3) Unassigned code points.  These are not assigned to any characters
today, but may be assigned in future.  They may be interchanged.
Internal libraries should process them.
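
To make the three groups concrete, here is a rough Python sketch that
sorts a code point into one of them.  The function names are mine and
purely illustrative.  The "unassigned" test relies on whatever version
of the Unicode Character Database ships with the running Python, so
its answers will change as new characters are assigned.

```python
import unicodedata

def is_surrogate(cp: int) -> bool:
    # Group 1: the 2048 code points U+D800..U+DFFF.
    return 0xD800 <= cp <= 0xDFFF

def is_noncharacter(cp: int) -> bool:
    # Group 2: U+FDD0..U+FDEF (32 code points) plus the last two
    # code points of each of the 17 planes (34), 66 in total.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

def classify(cp: int) -> str:
    # Hypothetical helper: returns a label for the group a code point is in.
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if is_surrogate(cp):
        return "surrogate"
    if is_noncharacter(cp):
        return "noncharacter"
    # General category Cn also covers noncharacters, which is why
    # they are tested first.
    if unicodedata.category(chr(cp)) == "Cn":
        return "unassigned"
    return "other"

# Counting over the whole code space reproduces the figures above.
surrogates = sum(is_surrogate(cp) for cp in range(0x110000))
noncharacters = sum(is_noncharacter(cp) for cp in range(0x110000))
```

The order of the tests matters only between groups 2 and 3, since
the character database reports noncharacters as unassigned too.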

My view is that group 1 are and should be disallowed in JSON; others
disagree.  Group 2 should be avoided by JSON creators, but accepted by
JSON parsers, which may choose to change them to U+FFFD (replacement
character).  Group 3 are and should be valid in JSON.
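
For what it's worth, the policy I have in mind could be sketched in
Python roughly as follows.  This is one possible reading, not a
proposal for spec text: sanitize() is a made-up name, it runs on
strings after parsing, and replacing noncharacters with U+FFFD is the
optional behavior mentioned above, not a requirement.  Python's own
json module lets a lone escaped surrogate through, which is why the
check happens afterwards.

```python
import json

def sanitize(text: str) -> str:
    # Hypothetical post-parse check applied to each decoded JSON string.
    out = []
    for ch in text:
        cp = ord(ch)
        if 0xD800 <= cp <= 0xDFFF:
            # Group 1: a surrogate that did not pair up; reject it.
            raise ValueError("lone surrogate U+%04X" % cp)
        if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
            # Group 2: accept, but (optionally) map to U+FFFD.
            out.append("\uFFFD")
        else:
            # Everything else, including group 3, passes through.
            out.append(ch)
    return "".join(out)

doc = json.loads('{"End of data marker": "\\uFFFF"}')
cleaned = {key: sanitize(value) for key, value in doc.items()}
```

A properly paired escape such as "\ud83d\ude00" is combined into a
single code point by the parser before sanitize() sees it, so only
unpaired surrogates are rejected.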

-- 
Values of beeta will give rise to dom!          John Cowan
(5th/6th edition 'mv' said this if you tried    http://www.ccil.org/~cowan
to rename '.' or '..' entries; see              cowan@ccil.org
http://cm.bell-labs.com/cm/cs/who/dmr/odd.html)