Re: [Json] Proposed minimal change for strings

John Cowan <cowan@mercury.ccil.org> Wed, 03 July 2013 16:02 UTC

Date: Wed, 03 Jul 2013 12:02:28 -0400
From: John Cowan <cowan@mercury.ccil.org>
To: Paul Hoffman <paul.hoffman@vpnc.org>
Cc: Nico Williams <nico@cryptonector.com>, "json@ietf.org WG" <json@ietf.org>

Paul Hoffman scripsit:

> >>  Add "Some strings, notably those that have unescaped surrogate code units
> >>  (value 0xD800 to 0xDFFF), cannot be encoded in UTF-8."
> > 
> > Unescaped and *unpaired*.
> 
> No, any surrogate code point. RFC 3629, the IETF's definition of UTF-8, says:
>    The definition of UTF-8 prohibits encoding character numbers between
>    U+D800 and U+DFFF, which are reserved for use with the UTF-16
>    encoding form (as surrogate pairs) and do not directly represent
>    characters.

You are conflating code points (integers in the range 0-0x10FFFF, Unicode
Standard definition D10) with code units (bit-strings of a specified
length, definition D77).  The code *points* corresponding to surrogates
(0xD800-0xDFFF) cannot be encoded with any encoding.  The 16-bit code
*units* corresponding to surrogates are used in pairs to represent
characters from U+10000 to U+10FFFF.  There are, obviously, no such
8-bit code units, as the numbers involved will not fit in 8 bits, and
the equivalent 32-bit code units are not used.
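
To make the distinction concrete, here is a minimal sketch in Python.
It assumes CPython's built-in codecs as a stand-in for the encoding forms;
the helper names (utf16_code_units, encodable) are illustrative only.
It takes one code point above U+FFFF and shows its code units at each
width, then checks that a surrogate code point is rejected by every
encoding form.

```python
import struct

def utf16_code_units(text):
    """Return the 16-bit code units of the UTF-16 encoding of text."""
    data = text.encode("utf-16-be")
    return list(struct.unpack(f">{len(data) // 2}H", data))

def encodable(text, encoding):
    """True if Python's codec for `encoding` accepts text."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

ch = "\U0001F600"                   # a single code point above U+FFFF

code_point = ord(ch)                # just an integer: 0x1F600
units16 = utf16_code_units(ch)      # two 16-bit code units, both surrogates
units8 = list(ch.encode("utf-8"))   # four 8-bit code units, none a surrogate
units32 = list(struct.unpack(">I", ch.encode("utf-32-be")))  # one 32-bit unit

# A surrogate code *point* on its own is rejected by all three codecs.
lone = "\ud800"
lone_results = {enc: encodable(lone, enc)
                for enc in ("utf-8", "utf-16", "utf-32")}

print(hex(code_point), [hex(u) for u in units16], units8, units32)
print(lone_results)
```

The surrogate values show up only in the 16-bit view; the 8-bit and
32-bit views of the same character contain nothing in 0xD800-0xDFFF.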

Now you make me wonder if your proposal 1 was supposed to be about code
points rather than code units.

(from another email in this thread)

> That appears to be the case for the current document. If you have a
> preferred [code unit] size, by all means propose it to the list. If
> we can get rough consensus on that, it would help this discussion.

In the JSON context, the only code unit size that makes sense is 16-bit,
because JavaScript (from which JSON comes) deals in 16-bit code units,
sequences of which are called "strings".  But if you mean to talk of code
points rather than units, then of course there is no size to specify,
as integers don't have a size.
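
As a rough illustration of the 16-bit view, the following Python sketch
decodes two JSON texts and lists the 16-bit code units a JavaScript
implementation would hold for each.  The helper code_units_16 is
hypothetical, and the acceptance of an unpaired \ud800 escape is the
behaviour of CPython's json module, not something the JSON document
under discussion guarantees.

```python
import json

def code_units_16(text):
    """View a Python str as the 16-bit code units JavaScript would see."""
    # "surrogatepass" lets unpaired surrogates through unchanged.
    data = text.encode("utf-16-be", "surrogatepass")
    return [int.from_bytes(data[i:i + 2], "big")
            for i in range(0, len(data), 2)]

paired = json.loads('"\\ud83d\\ude00"')  # escaped surrogate pair
lone = json.loads('"\\ud800"')           # escaped unpaired surrogate

paired_units = code_units_16(paired)     # two code units, one character
lone_units = code_units_16(lone)         # one code unit, no character

print(len(paired), [hex(u) for u in paired_units])
print(len(lone), [hex(u) for u in lone_units])
```

Python counts code points, so len(paired) is 1, while the 16-bit view
has two entries, which is the length JavaScript would report.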

> > As noted, that should have been "unpaired unescaped surrogate
> > code units".
> 
> That is only true for UTF-16. The definitions of UTF-8 and UTF-32 say
> that you cannot encode any surrogate code points, paired or not.

This is the same conflation.
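
For what "unpaired" means at the code unit level, a small Python sketch
follows.  The function name unpaired_surrogates is made up for this
example; it scans a sequence of 16-bit code units and reports the
positions of surrogates that do not form a high/low pair.

```python
def unpaired_surrogates(units):
    """Return indexes of surrogate code units that are not part of a pair."""
    bad = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:            # high (lead) surrogate
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2                       # well-formed pair, skip both
                continue
            bad.append(i)                    # lead with no trail after it
        elif 0xDC00 <= u <= 0xDFFF:          # low (trail) with no lead
            bad.append(i)
        i += 1
    return bad

print(unpaired_surrogates([0x61, 0xD83D, 0xDE00]))    # pair is fine
print(unpaired_surrogates([0xD800]))                  # lone lead
print(unpaired_surrogates([0xDE00, 0xD83D]))          # wrong order
```

Only the sequences with a non-empty result are the ones that have no
UTF-8 form; a properly paired sequence converts without trouble.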

-- 
John Cowan <cowan@ccil.org>             http://www.ccil.org/~cowan
Sir, I quite agree with you, but what are we two against so many?
    --George Bernard Shaw,
         to a man booing at the opening of _Arms and the Man_