Re: [Json] Proposal: Code points and surrogates

Norbert Lindenberg <ietf@lindenbergsoftware.com> Sat, 22 June 2013 16:29 UTC

On Jun 22, 2013, at 7:58 , Stephen Dolan wrote:

> On Tue, Jun 18, 2013 at 8:43 AM, Norbert Lindenberg
> <ietf@lindenbergsoftware.com> wrote:
>> Before:
>>    A string is a sequence of zero or more Unicode characters [UNICODE].
>> 
>> After:
>>    A string is a sequence of zero or more Unicode code points [UNICODE].
> 
> Unicode's three character encoding forms (UTF-8, UTF-16, and UTF-32)
> define means of encoding sequences of unicode scalar values. A unicode
> scalar value is a code point which is not a surrogate. Noncharacter
> codepoints like U+FFFF are unicode scalar values.
> 
> Since JSON is invariably represented in some unicode encoding, it
> seems odd that JSON would include things which are not representable
> in any standard unicode encoding.
> 
> Should we instead say "a string is a sequence of zero or more Unicode
> scalar values"? Even if we use "code point", it still seems odd to
> claim "this specification allows the inclusion of surrogate code
> points (U+D800 through U+DFFF) in JSON text, both directly and through
> escape sequences" - it's not clear that the inclusion of surrogate
> code points directly is actually possible.

Surrogate code points can be included directly in Unicode strings, as defined in the Unicode Standard, section 2.7 [1]. The strings used in JavaScript, Java, and some other languages are Unicode strings in this sense: sequences of code units, with no requirement of well-formedness.
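
To make the point about well-formedness concrete, here is a minimal sketch in TypeScript. It assumes the usual JavaScript string model of 16-bit code units; the helper name isWellFormedUtf16 is hypothetical and introduced only for illustration. It shows that a string can hold a lone surrogate code unit, and that such a string is not well-formed UTF-16.

```typescript
// A string holding a single unpaired high surrogate (U+D800).
const lone: string = "\uD800";

// Hypothetical helper: returns true only if every surrogate code unit
// in s belongs to a valid high+low surrogate pair.
function isWellFormedUtf16(s: string): boolean {
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {
      // High surrogate: the next code unit must be a low surrogate.
      const next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
      if (next < 0xdc00 || next > 0xdfff) {
        return false;
      }
      i++; // skip the low surrogate that completes the pair
    } else if (unit >= 0xdc00 && unit <= 0xdfff) {
      // Low surrogate with no preceding high surrogate.
      return false;
    }
  }
  return true;
}

// The language accepts the string as is; only the helper rejects it.
const loneLength: number = lone.length;
const loneIsWellFormed: boolean = isWellFormedUtf16(lone);
const pairIsWellFormed: boolean = isWellFormedUtf16("\uD83D\uDE00");
```

The language itself never performs such a check when a string is created, which is why these strings can reach parse() and stringify() in the first place.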

The specification of the JSON object in the ECMAScript Language Specification [2] requires that unpaired surrogates in strings be passed through by both parse() and stringify(), and that parse() unescape escape sequences for all surrogate values, paired or not. My own testing indicates that the implementations in the major browsers and in Node.js conform to this requirement.
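
The kind of check involved can be sketched as follows in TypeScript. This is an illustration, not the actual test code, and the variable names are assumptions. Note that later editions of ECMAScript (2019 onward, to my knowledge) have stringify() emit an escape sequence for an unpaired surrogate rather than the raw code unit, so the sketch checks the round trip rather than the exact stringify() output.

```typescript
// parse() unescapes an escape sequence for an unpaired surrogate.
const parsedEscaped: string = JSON.parse('"\\uD800"');
const parsedEscapedUnit: number = parsedEscaped.charCodeAt(0);

// parse() also accepts an unpaired surrogate included directly
// (unescaped) in the JSON text and passes it through.
const loneSurrogate: string = String.fromCharCode(0xd800);
const parsedRaw: string = JSON.parse('"' + loneSurrogate + '"');

// Escape sequences for a surrogate pair are unescaped to two code units.
const parsedPair: string = JSON.parse('"\\uD83D\\uDE00"');

// Round trip: stringify() followed by parse() preserves the unpaired
// surrogate, whether stringify() emits it raw or as an escape sequence.
const roundTripped: string = JSON.parse(JSON.stringify(loneSurrogate));
```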

[1] http://www.unicode.org/versions/Unicode6.2.0/ch02.pdf
[2] http://ecma-international.org/ecma-262/5.1/#sec-15.12

Norbert