Re: [Json] Proposed minimal change for strings

John Cowan <cowan@mercury.ccil.org> Wed, 03 July 2013 16:02 UTC

Date: Wed, 03 Jul 2013 12:02:28 -0400
From: John Cowan <cowan@mercury.ccil.org>
To: Paul Hoffman <paul.hoffman@vpnc.org>
Message-ID: <20130703160228.GA32044@mercury.ccil.org>
References: <9BACB3F2-F9BF-40C7-B4BA-C0C2F33E4278@vpnc.org> <CAK3OfOgN5SKOet5bvN1fpxj6UsvUdcOUxvETYxUmsWH_3sarcA@mail.gmail.com> <0194C74E-3866-48B1-A6F8-69802FA30609@vpnc.org>
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Disposition: inline
In-Reply-To: <0194C74E-3866-48B1-A6F8-69802FA30609@vpnc.org>
User-Agent: Mutt/1.5.20 (2009-06-14)
Sender: John Cowan <cowan@ccil.org>
Cc: Nico Williams <nico@cryptonector.com>, "json@ietf.org WG" <json@ietf.org>
Subject: Re: [Json] Proposed minimal change for strings
Precedence: list

Paul Hoffman scripsit:

> >>  Add "Some strings, notably those that have unescaped surrogate code units
> >>  (value 0xD800 to 0xDFFF), cannot be encoded in UTF-8."
> > 
> > Unescaped and *unpaired*.
> 
> No, any surrogate code point. RFC 3629, the IETF's definition of UTF-8, says:
>    The definition of UTF-8 prohibits encoding character numbers between
>    U+D800 and U+DFFF, which are reserved for use with the UTF-16
>    encoding form (as surrogate pairs) and do not directly represent
>    characters.

You are conflating code points (integers in the range 0-0x10FFFF,
Unicode Standard definition D10) with code units (bit-strings of a
specified length, definition D77).  The code *points* corresponding
to surrogates (0xD800-0xDFFF) cannot be encoded with any encoding.
The 16-bit code *units* corresponding to surrogates are used in pairs
to represent characters from U+10000 to U+10FFFF.  There are,
obviously, no such 8-bit code units, as the numbers involved will not
fit in 8 bits, and the equivalent 32-bit code units are not used.
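
As an illustration of the distinction above, here is a minimal Python
sketch.  The helper names utf16_code_units and can_encode_utf8 are
hypothetical, chosen for this example only, and U+1F600 is an arbitrary
supplementary character.  It shows one code point becoming two 16-bit
surrogate code units in UTF-16, while a lone surrogate code point is
rejected by the UTF-8 encoder.

```python
import struct
from typing import List


def utf16_code_units(text: str) -> List[int]:
    """Return the 16-bit code units of text in UTF-16 (big-endian, no BOM)."""
    raw = text.encode("utf-16-be")
    return [unit for (unit,) in struct.iter_unpack(">H", raw)]


def can_encode_utf8(text: str) -> bool:
    """Report whether Python's strict UTF-8 encoder accepts text."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


supplementary = "\U0001F600"   # one code point above U+FFFF
lone_surrogate = "\ud800"      # one surrogate code point, unpaired

# One code point, but two 16-bit code units, both in the surrogate range.
print([hex(u) for u in utf16_code_units(supplementary)])

# The same code point as 8-bit code units; none of them is a surrogate.
print([hex(b) for b in supplementary.encode("utf-8")])

# A surrogate code point has no UTF-8 encoding at all.
print(can_encode_utf8(supplementary), can_encode_utf8(lone_surrogate))
```

Python strings are sequences of code points rather than 16-bit code
units, which is why the UTF-16 code units have to be derived explicitly
here.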

Now you make me wonder if your proposal 1 was supposed to be about code
points rather than code units.

(from another email in this thread)

> That appears to be the case for the current document. If you have a
> preferred [code unit] size, by all means propose it to the list. If
> we can get rough consensus on that, it would help this discussion.

In the JSON context, the only code unit size that makes sense is 16-bit,
because JavaScript (from which JSON comes) deals in 16-bit code units,
sequences of which are called "strings".  But if you mean to talk of code
points rather than units, then of course there is no size to specify,
as integers don't have a size.
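
A small sketch of how this plays out for JSON text, using Python's
standard json module as one example parser; other parsers may behave
differently, and the helper utf8_or_none is hypothetical.  An escaped
surrogate pair decodes to a single character that UTF-8 can carry, while
an escaped unpaired surrogate decodes to a string that cannot be written
out as unescaped UTF-8.

```python
import json


def utf8_or_none(text):
    """Return the UTF-8 bytes of text, or None if strict encoding fails."""
    try:
        return text.encode("utf-8")
    except UnicodeEncodeError:
        return None


# Raw strings keep the backslashes, so these are JSON texts with \u escapes.
paired = json.loads(r'"\ud83d\ude00"')   # escaped surrogate pair
lone = json.loads(r'"\ud800"')           # escaped, unpaired surrogate

# The pair is combined into the single code point U+1F600.
print(len(paired), hex(ord(paired)))

# The unpaired surrogate survives parsing as a surrogate code point.
print(len(lone), hex(ord(lone)))

# Only the paired case can be serialized as unescaped UTF-8.
print(utf8_or_none(paired), utf8_or_none(lone))

# With the default ensure_ascii=True, the lone surrogate is re-escaped.
print(json.dumps(lone))
```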

> > As noted, that should have been "unpaired unescaped surrogate
> > code units".
> 
> That is only true for UTF-16. The definitions of UTF-8 and UTF-32 say
> that you cannot encode any surrogate code points, paired or not.

This is the same conflation.

-- 
John Cowan <cowan@ccil.org>             http://www.ccil.org/~cowan
Sir, I quite agree with you, but what are we two against so many?
    --George Bernard Shaw,
         to a man booing at the opening of _Arms and the Man_
