Re: [Json] Unpaired surrogates in JSON strings

"Martin J. Dürst" <duerst@it.aoyama.ac.jp> Fri, 07 June 2013 10:25 UTC

Message-ID: <51B1B4E7.8090101@it.aoyama.ac.jp>
Date: Fri, 07 Jun 2013 19:24:39 +0900
From: "Martin J. Dürst" <duerst@it.aoyama.ac.jp>
Organization: Aoyama Gakuin University
To: Tim Bray <tbray@textuality.com>
Cc: "json@ietf.org" <json@ietf.org>, Paul Hoffman <paul.hoffman@vpnc.org>, Douglas Crockford <douglas@crockford.com>, "Joe Hildebrand (jhildebr)" <jhildebr@cisco.com>

On 2013/06/06 23:57, Tim Bray wrote:
> F0, 90, 8D, 86
> On Thu, Jun 6, 2013 at 4:15 AM, Douglas Crockford <douglas@crockford.com> wrote:
>
>> What then is the standard name for a 16-bit element of text? When
>> JavaScript was created, that word was character. What is the word now?
>>
>
> The only somewhat-standardized term would be “UTF-16 codepoint”.  But
> that’s not really a “unit of text” any more than the 2nd byte of a
> character encoded in 3 bytes with UTF-8 is.
>
> I’m fairly shocked.  I have always believed that JSON encodes what its
> introduction (and section 2.5 "Strings") say it encodes, Unicode
> characters.
>
> If it is a requirement to accommodate the class of bug where languages that
> use UTF-16 (Java, JavaScript, C#) can emit unpaired UTF-16 surrogates, the
> spec needs to be clear that the INTENT is actually to support Unicode
> characters, and that unpaired surrogates are always evidence of a bug, and
> there can be no expectation that any software receiving such buggy data
> will be able to do anything useful with it, or even avoid crashing in a
> hard-to-debug way down in the bowels of a library routine.  -T

I fully agree with what Tim says above: We know (and to a certain extent 
have to accept) that there are implementations that, surely way more by 
accident than by any kind of intent, send unpaired surrogates. But we 
should try to do whatever we can in the spec to make it perfectly clear 
that there are no good reasons whatsoever to actually do that.
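
To make the failure mode concrete, here is a minimal sketch in Python (my choice of language for illustration; nothing in this thread prescribes it). It assumes CPython's standard json module, which accepts an escaped unpaired surrogate, and uses a hypothetical helper, can_encode_utf8, to show that the resulting string cannot be written out as UTF-8.

```python
import json

# A JSON text whose string value is one escaped high surrogate
# with no low surrogate following it.
text = '"\\ud83d"'

# The standard json module accepts this and returns a str of length 1
# that holds the surrogate code point U+D83D.
value = json.loads(text)


def can_encode_utf8(s: str) -> bool:
    """Hypothetical helper: True if s can be encoded as UTF-8."""
    try:
        s.encode("utf-8")
        return True
    except UnicodeEncodeError:
        # Raised for surrogate code points, which UTF-8 cannot represent.
        return False


# The unpaired surrogate cannot be encoded; a proper pair can.
lone_ok = can_encode_utf8(value)
paired_ok = can_encode_utf8(json.loads('"\\ud83d\\ude00"'))
```

The receiver only discovers the problem when it tries to encode or otherwise process the string, which is the kind of late failure Tim describes.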

Although we may not end up with exactly parallel or equivalent language, 
I think the situation is fairly similar to the one regarding duplicate 
keys: The current spec isn't totally clear, and in practice it happens, 
and for some implementations it may be an unreasonable burden to 
require checking everything on sending, but it's not something that one 
would or should do on purpose.

A second point:

Just to get the correct definitions from the Unicode side, here is the 
easiest reference, with everything on one page:

http://www.unicode.org/glossary/

In a few of my mails today, I have also used "code point", but I didn't 
intend to include surrogates.

There's a better term, namely Unicode Scalar Value 
(http://www.unicode.org/glossary/#unicode_scalar_value):

Any Unicode code point except high-surrogate and low-surrogate code 
points. In other words, the ranges of integers 0 to 0xD7FF and 0xE000 to 
0x10FFFF inclusive. (See definition D76 in Section 3.9, Unicode Encoding 
Forms.)
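
The definition translates directly into a range check. The following is a sketch only, in Python, with function names (is_scalar_value, non_scalar_positions) that are my own and not taken from any spec or library. It relies on the fact that a Python str is a sequence of code points, so an unpaired surrogate appears as a single element.

```python
def is_scalar_value(cp: int) -> bool:
    """True if cp lies in 0..0xD7FF or 0xE000..0x10FFFF (inclusive)."""
    return 0 <= cp <= 0xD7FF or 0xE000 <= cp <= 0x10FFFF


def non_scalar_positions(s: str) -> list:
    """Indexes in s whose code point is not a Unicode scalar value.

    For a Python str these can only be surrogate code points
    (0xD800..0xDFFF), since str cannot hold values above 0x10FFFF.
    """
    return [i for i, ch in enumerate(s) if not is_scalar_value(ord(ch))]


# Example: the surrogate at index 1 is reported, the letters are not.
positions = non_scalar_positions("a\ud800b")
```

A sender could run such a check before serializing, although, as noted above, requiring that of every implementation may be too much.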

Regards,    Martin.
