Re: [Json] Proposed minimal change for strings

Norbert Lindenberg <ietf@lindenbergsoftware.com> Thu, 04 July 2013 05:06 UTC

On Jul 2, 2013, at 16:27 , Paul Hoffman <paul.hoffman@vpnc.org> wrote:

> Proposal 1 (allow all code units in their unescaped form):
> 
> In section 1 (Introduction):
> Change the sentence about Unicode characters to:
>   A string is a sequence of zero or more Unicode code units [UNICODE].

This should be "code points", not "code units".

If a string is defined as a sequence of code units, it could contain UTF-8 code units in arbitrary order, or a mix of UTF-8 and UTF-16 code units, neither of which would make sense.

Also, JSON documents have to be understood at the level of code points, because their code unit sequences are often changed in transit - e.g., from UTF-16 code units in JavaScript to UTF-8 code units over the network back to UTF-16 code units in Java - with no intended change in content.

See also
http://www.unicode.org/glossary/#code_point
http://www.unicode.org/glossary/#code_unit
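
To illustrate the difference between code points and code units, here is a small Python sketch. The sample string and variable names are my own choices for illustration, not anything from the draft. It shows one string whose code point sequence stays fixed while its code unit sequence depends on the encoding form:

```python
# One string, three views: code points, UTF-8 code units, UTF-16 code units.
text = "a\u00e9\U0001F600"  # U+0061, U+00E9, U+1F600 (outside the BMP)

# Code points: independent of any encoding form.
code_points = [ord(ch) for ch in text]

# UTF-8 code units are 8-bit values.
utf8_units = list(text.encode("utf-8"))

# UTF-16 code units are 16-bit values; read them from big-endian bytes.
utf16_bytes = text.encode("utf-16-be")
utf16_units = [
    int.from_bytes(utf16_bytes[i:i + 2], "big")
    for i in range(0, len(utf16_bytes), 2)
]

print(len(code_points), len(utf8_units), len(utf16_units))  # 3 7 4

# Converting between encoding forms changes the code units, not the content.
round_trip = bytes(utf8_units).decode("utf-8").encode("utf-16-be").decode("utf-16-be")
assert [ord(ch) for ch in round_trip] == code_points
```

The three counts differ, but the code point list is identical before and after the UTF-8 to UTF-16 round trip.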

> In section 2.2 (Strings):
> Leave the production for "unescaped" as-is.
> In section 3 (Encoding):
> Add "Some strings, notably those that have unescaped surrogate code units
> (value 0xD800 to 0xDFFF), cannot be encoded in UTF-8."

As above, "surrogate code units" should be "surrogate code points", therefore U+D800 to U+DFFF. Also, it's exactly the strings containing unescaped surrogate code points that cannot be encoded in well-formed UTF-8, so we can get rid of "some" and "notably those".
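
As a rough check of that claim, the following Python sketch uses the standard json module and the built-in UTF-8 codec. This reflects how Python behaves, not anything the draft prescribes, and the variable names are made up for the example. An unpaired surrogate code point survives in escaped form but cannot be written unescaped in UTF-8:

```python
import json

lone = "\ud800"  # a string value holding one unpaired surrogate code point

# Escaped form: the JSON text is plain ASCII, so it encodes as UTF-8 fine.
escaped = json.dumps(lone)  # ensure_ascii=True is the default
escaped_bytes = escaped.encode("utf-8")

# Unescaped form: the surrogate appears literally in the JSON text,
# and Python's UTF-8 codec refuses to encode it.
unescaped = json.dumps(lone, ensure_ascii=False)
try:
    unescaped.encode("utf-8")
    unescaped_ok = True
except UnicodeEncodeError:
    unescaped_ok = False

# A properly paired escape denotes a single supplementary code point,
# which encodes as ordinary four-byte UTF-8.
paired = json.loads('"\\ud83d\\ude00"')
paired_bytes = paired.encode("utf-8")

print(escaped, unescaped_ok, hex(ord(paired)))
```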

There are a few additional references to "character" that would have to be changed to "Unicode code point" to make this proposal complete. See
http://www.ietf.org/mail-archive/web/json/current/msg00870.html

Norbert