Re: [Json] Unpaired surrogates in JSON strings

"Martin J. Dürst" <duerst@it.aoyama.ac.jp> Fri, 07 June 2013 10:25 UTC

Message-ID: <51B1B4E7.8090101@it.aoyama.ac.jp>
Date: Fri, 07 Jun 2013 19:24:39 +0900
From: "\"Martin J. Dürst\"" <duerst@it.aoyama.ac.jp>
Organization: Aoyama Gakuin University
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.9) Gecko/20100722 Eudora/3.0.4
MIME-Version: 1.0
To: Tim Bray <tbray@textuality.com>
References: <A723FC6ECC552A4D8C8249D9E07425A70FC2E7E1@xmb-rcd-x10.cisco.com> <51B06F38.8050707@crockford.com> <CAHBU6iuFBuW-RfgBLQF5q4BnUOzs088QXW3uOQG1OjBFjZttkw@mail.gmail.com>
In-Reply-To: <CAHBU6iuFBuW-RfgBLQF5q4BnUOzs088QXW3uOQG1OjBFjZttkw@mail.gmail.com>
Content-Type: text/plain; charset="UTF-8"; format="flowed"
Content-Transfer-Encoding: quoted-printable
Cc: "json@ietf.org" <json@ietf.org>, Paul Hoffman <paul.hoffman@vpnc.org>, Douglas Crockford <douglas@crockford.com>, "Joe Hildebrand (jhildebr)" <jhildebr@cisco.com>
Subject: Re: [Json] Unpaired surrogates in JSON strings

On 2013/06/06 23:57, Tim Bray wrote:
> F0, 90, 8D, 86
> On Thu, Jun 6, 2013 at 4:15 AM, Douglas Crockford <douglas@crockford.com> wrote:
>
>> What then is the standard name for a 16-bit element of text? When
>> JavaScript was created, that word was character. What is the word now?
>>
>
> The only somewhat-standardized term would be “UTF-16 codepoint”.  But
> that’s not really a “unit of text” any more than the 2nd byte of a
> character encoded in 3 bytes with UTF-8 is.
>
> I’m fairly shocked.  I have always believed that JSON encodes what its
> introduction (and section 2.5 "Strings") say it encodes, Unicode
> characters.
>
> If it is a requirement to accommodate the class of bug where languages that
> use UTF-16 (Java, JavaScript, C#) can emit unpaired UTF-16 surrogates, the
> spec needs to be clear that the INTENT is actually to support Unicode
> characters, and that unpaired surrogates are always evidence of a bug, and
> there can be no expectation that any software receiving such buggy data
> will be able to do anything useful with it, or even avoid crashing in a
> hard-to-debug way down in the bowels of a library routine.  -T

I fully agree with what Tim says above: We know (and to a certain extent 
have to accept) that there are implementations that, surely way more by 
accident than by any kind of intent, send unpaired surrogates. But we 
should try to do whatever we can in the spec to make it perfectly clear 
that there are no good reasons whatsoever to actually do that.
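
To make the failure mode concrete, here is a minimal sketch in Python. It is an illustration only, not proposed spec text, and it assumes the behaviour of CPython's standard json module, which passes a lone \uD800 escape through as a one-element string and combines a proper pair into a single character. The helper name encodes_cleanly is made up for this example.

```python
import json


def encodes_cleanly(text: str) -> bool:
    """Return True if text can be encoded as UTF-8.

    A Python str holding an unpaired surrogate cannot be encoded,
    so this doubles as a check a sender could run before emitting.
    (Hypothetical helper, named for this example only.)
    """
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


# A lone high-surrogate escape is accepted by the parser...
lone = json.loads('"\\ud800"')

# ...while a proper surrogate pair decodes to one character, U+10346.
paired = json.loads('"\\ud800\\udf46"')

print(encodes_cleanly(lone))    # the unpaired surrogate cannot be written as UTF-8
print(encodes_cleanly(paired))  # the paired form can
```

The parser raises no error for the lone surrogate; the problem only surfaces later, when some other routine tries to encode or store the string.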

Although we may not end up with exactly parallel or equivalent language, 
I think the situation is fairly similar to the one regarding duplicate 
keys: The current spec isn't totally clear, and in practice it happens, 
and for some implementations, it may be an unreasonable burden to 
require them to check everything on sending, but it's not something that one 
would or should do on purpose.
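
For comparison, the duplicate-key case can be sketched the same way. This is only an illustration of what a strict receiver might do, again assuming Python's standard json module; the names reject_duplicate_keys and loads_strict are invented for the example.

```python
import json


def reject_duplicate_keys(pairs):
    """Build a dict from (key, value) pairs, refusing repeated keys."""
    result = {}
    for key, value in pairs:
        if key in result:
            raise ValueError(f"duplicate key: {key!r}")
        result[key] = value
    return result


def loads_strict(text: str):
    """Parse JSON text, raising ValueError if any object repeats a key."""
    return json.loads(text, object_pairs_hook=reject_duplicate_keys)


# The default parser silently keeps the last value for a repeated key.
lenient = json.loads('{"a": 1, "a": 2}')

# The strict variant accepts well-formed input unchanged.
strict_ok = loads_strict('{"a": 1, "b": 2}')
```

Calling loads_strict('{"a": 1, "a": 2}') raises ValueError instead of quietly picking one of the values.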

A second point:

Just to get the correct definitions from the Unicode side, here is the 
easiest reference, with everything on one page:

http://www.unicode.org/glossary/

In a few of my mails today, I have also used "code point", but I didn't 
intend to include surrogates.

There's a better term, namely Unicode Scalar Value 
(http://www.unicode.org/glossary/#unicode_scalar_value):

Any Unicode code point except high-surrogate and low-surrogate code 
points. In other words, the ranges of integers 0 to D7FF₁₆ and E000₁₆ to 
10FFFF₁₆ inclusive. (See definition D76 in Section 3.9, Unicode Encoding 
Forms.)
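
The definition above translates directly into a range check. The following is a minimal sketch in Python, with function names invented for the example: one test for scalar values, and one scan over a sequence of 16-bit code units that reports the positions of unpaired surrogates.

```python
def is_unicode_scalar_value(cp: int) -> bool:
    """True for 0..D7FF and E000..10FFFF (hex), i.e. any code point
    that is not a high or low surrogate."""
    return 0 <= cp <= 0xD7FF or 0xE000 <= cp <= 0x10FFFF


def find_unpaired_surrogates(units):
    """Return the indexes of unpaired surrogates in a sequence of
    16-bit code units (given as integers)."""
    bad = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # High surrogate: valid only if a low surrogate follows.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2  # skip the whole pair
                continue
            bad.append(i)
        elif 0xDC00 <= unit <= 0xDFFF:
            # Low surrogate without a preceding high surrogate.
            bad.append(i)
        i += 1
    return bad
```

For instance, the pair D800 DF46 (which encodes U+10346) produces no reports, while a D800 sitting between two ordinary letters is reported at its index.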

Regards,    Martin.