Re: [Json] Unpaired surrogates in JSON strings

Douglas Crockford <douglas@crockford.com> Wed, 05 June 2013 18:20 UTC

Message-ID: <51AF8149.5090907@crockford.com>
Date: Wed, 05 Jun 2013 11:19:53 -0700
From: Douglas Crockford <douglas@crockford.com>
To: Paul Hoffman <paul.hoffman@vpnc.org>
References: <20130605162246.GG3680@mercury.ccil.org> <51AF7988.6040009@crockford.com> <61407E6F-4178-471A-931C-D98E6F0C9756@vpnc.org>
In-Reply-To: <61407E6F-4178-471A-931C-D98E6F0C9756@vpnc.org>
Cc: "json@ietf.org" <json@ietf.org>
Subject: Re: [Json] Unpaired surrogates in JSON strings

On 6/5/2013 11:01 AM, Paul Hoffman wrote:
> <definitely no hat>
>
> On Jun 5, 2013, at 10:46 AM, Douglas Crockford <douglas@crockford.com> wrote:
>
>> On 6/5/2013 9:22 AM, John Cowan wrote:
>>> RFC 4627 Section 1 says:
>>>
>>>      A string is a sequence of zero or more Unicode characters.
>>>
>>> However, the grammar of strings permits things like "foo\uDC00bar",
>>> which contains an escape sequence that does not correspond to any
>>> Unicode character.  This provides backward compatibility with JavaScript,
>>> where a string is not a sequence of characters but a sequence of UTF-16
>>> code units.
>>>
>>> If Section 1 is normative, then there is a contradiction with Section 4,
>>> which says:
>>>
>>>      A JSON parser MUST accept all texts that conform to the JSON
>>>      grammar.
>>>
>>> In my view, JSONbis processors should be REQUIRED to produce only strings
>>> that conform to Section 1.
>>>
>> Such a requirement will be breaking. Breaking changes are out of scope.
> How is that "breaking"? Section 1 has a definition of strings, and Section 4 says that the parser must accept all texts that conform to the grammar. Surrogate code points are not characters, according to the Unicode spec.
>
>> I like the suggestion that section 1 should be talking about code points instead of characters.
> That seems like a significant change that would cause parsers that currently follow Section 1 to fail.
>
It is not a change; it is a clarification. JavaScript, Java, and many
other languages were designed before Unicode grew surrogate pairs, so
strings in those languages are composed of code points rather than
characters. JSON is tolerant of that reality.
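
For readers who want to see the behavior under discussion, here is a minimal sketch using Python's standard json module. It is only an illustration of how one existing parser treats the "foo\uDC00bar" example from the quoted text, not a statement of what the specification requires. The helper names has_lone_surrogate and can_encode_utf8 are hypothetical and chosen for this example.

```python
import json

# JSON text from the example quoted above: a lone low surrogate escape.
text = '"foo\\uDC00bar"'

# Python's json module accepts the text and keeps the surrogate code point.
value = json.loads(text)


def has_lone_surrogate(s):
    """Return True if s contains a surrogate code point (U+D800..U+DFFF).

    Python strings are sequences of code points, and json.loads combines a
    valid escaped surrogate pair into one code point, so any surrogate still
    present in the result is unpaired.
    """
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in s)


def can_encode_utf8(s):
    """Return True if s can be encoded as UTF-8 (hypothetical helper)."""
    try:
        s.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


# A properly paired escape decodes to a single non-BMP code point.
paired = json.loads('"\\uD83D\\uDE00"')

print(len(value), hex(ord(value[3])))
print(has_lone_surrogate(value), can_encode_utf8(value))
print(has_lone_surrogate(paired), can_encode_utf8(paired))
```

The parsed value is a string of seven code points, one of which is U+DC00. It is accepted by the parser but cannot be encoded as UTF-8, which is the practical difference between a string of code points and a string of Unicode characters.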