Re: [Json] A possible summary of the discussion so far on code points and characters

Carsten Bormann <cabo@tzi.org> Sun, 09 June 2013 00:39 UTC

From: Carsten Bormann <cabo@tzi.org>
Date: Sun, 9 Jun 2013 02:39:19 +0200
To: Paul Hoffman <paul.hoffman@vpnc.org>
Cc: "json@ietf.org" <json@ietf.org>, R S <sayrer@gmail.com>

On Jun 8, 2013, at 23:09, Paul Hoffman <paul.hoffman@vpnc.org> wrote:

> On Jun 8, 2013, at 1:52 PM, R S <sayrer@gmail.com> wrote:
> 
>> A seventh point of view, which I happen to agree with: JSON strings are a sequence of code units.

That is not a very precise or useful statement if it refers to Unicode "code units", because code units can be bytes ("UTF-8 code units"), 16-bit values (UTF-16), or 32-bit values (UTF-32).
In actual JSON interchange, the code units are bytes (unless you are using the hypothetical UTF-16 or UTF-32 form of JSON).
Rob might have been trying to say "UTF-16 code units".
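
To illustrate why "code units" is ambiguous without naming the encoding form, here is a minimal Python sketch (my own illustration; the helper name code_unit_counts is hypothetical) that counts the code units one and the same string occupies in each form:

```python
def code_unit_counts(text):
    """Count the code units needed for text in each Unicode encoding form."""
    return {
        "utf-8": len(text.encode("utf-8")),            # 8-bit code units (bytes)
        "utf-16": len(text.encode("utf-16-be")) // 2,  # 16-bit code units
        "utf-32": len(text.encode("utf-32-be")) // 4,  # 32-bit code units
    }


# U+1D11E MUSICAL SYMBOL G CLEF lies outside the Basic Multilingual Plane,
# so the three encoding forms disagree on how many code units it takes.
print(code_unit_counts("\U0001D11E"))
```

The "-be" codec variants are used so that no byte order mark is counted as an extra code unit.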

This view is most natural to those who think JSON should be useful as an interchange format for the "use a JavaScript string as a vector of unconstrained 16-bit values" hack.
(It is not aligned with JSON's main purpose.)

> That aligns with (2).

Not at all: It is different, and it is at a different level.

>> 2) The ABNF is more liberal about what can be in a string than that statement:
>>      char = unescaped /
>>          escape ( ...
>>              %x75 4HEXDIG )  ; uXXXX                U+XXXX
>>      ...
>>      unescaped = %x20-21 / %x23-5B / %x5D-10FFFF

RFC 4234 is about characters, and this excerpt from RFC 4627 makes it clear that its usage of ABNF is based on Unicode characters/Unicode "code points", not on UTF-16 "code units" (whose values would stop at 0xFFFF).
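
As a rough sketch of that point, the "unescaped" rule quoted above can be transcribed directly into a range check over integers (the function name is_unescaped is my own, not from either RFC):

```python
def is_unescaped(code_point):
    """Literal transcription of: unescaped = %x20-21 / %x23-5B / %x5D-10FFFF"""
    return (
        0x20 <= code_point <= 0x21
        or 0x23 <= code_point <= 0x5B
        or 0x5D <= code_point <= 0x10FFFF
    )


# The upper bound 0x10FFFF is the top of the Unicode code point range.
# A grammar over UTF-16 code units would have no value above 0xFFFF.
print(is_unescaped(0x1D11E))  # a code point beyond the 16-bit range
print(is_unescaped(0x22))     # quotation mark, must be escaped
print(is_unescaped(0x5C))     # reverse solidus, must be escaped
```

Note that the range as written also covers 0xD800..0xDFFF; the check above only mirrors the ABNF text and says nothing about whether those values are encodable.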

Now, the ABNF is about the representation, not about the data model, so this has no bearing on whether, at the data model level, the "characters" in a string are Unicode characters, Unicode code points, or UTF-16 code units.

(In the latter case, a high surrogate (0xD800..0xDBFF) followed by a low surrogate (0xDC00..0xDFFF) looks like a single value in the range 0x10000 to 0x10FFFF in the Unicode character/"code point" view and can be interchanged natively.  Any other use of surrogate "code units" needs a \uXXXX escape, because a surrogate on its own can't be represented in Unicode.)
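
A small Python sketch of the surrogate case (an illustration only; combine_surrogates and can_interchange_natively are names I made up, and the behaviour shown is that of Python's standard json module, not something any RFC mandates):

```python
import json


def combine_surrogates(high, low):
    """Map a UTF-16 surrogate pair to the single code point it stands for."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low surrogate")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def can_interchange_natively(text):
    """True if text can be written out as UTF-8 without \\uXXXX escapes."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


# A well-formed pair collapses to one code point in the 0x10000..0x10FFFF range.
paired = json.loads('"\\ud834\\udd1e"')
# A lone high surrogate survives parsing here but has no UTF-8 encoding.
lone = json.loads('"\\ud834"')

print(hex(combine_surrogates(0xD834, 0xDD1E)))
print(can_interchange_natively(paired), can_interchange_natively(lone))
print(json.dumps(lone))  # falls back to a \uXXXX escape
```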

Grüße, Carsten