[Json] Proposal: Code points and surrogates

Norbert Lindenberg <ietf@lindenbergsoftware.com> Tue, 18 June 2013 07:44 UTC

This proposal attempts to clarify that JSON allows all Unicode code points, and to describe the issues that exist with surrogate code points. It proposes no normative changes, so as to remain compatible with the ECMAScript specification, but recommends that clients carefully consider whether to allow surrogate code points in their uses of JSON.


Details:
- changed from "character" to "code point", unless specific characters are referenced,
- added paragraph with information about surrogate code points,
- rearranged paragraphs so that escape sequences for BMP and non-BMP code points are discussed side by side.

Compared to Tim's proposal [1], this proposal
- says what the specification actually allows (code points) rather than what may be intended (characters),
- gets rid of the term "16-bit quantities",
- provides more concrete information about surrogate code points than just "breakage",
- doesn't discuss the need, or lack thereof, to update the RFC in the future.


Section 1, Introduction:

Before:
    A string is a sequence of zero or more Unicode characters [UNICODE].

After:
    A string is a sequence of zero or more Unicode code points [UNICODE].


Section 2.5, Strings:

A string begins and ends with quotation marks. All Unicode code points may be placed within the quotation marks except for the characters that must be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F).

Any code point may be escaped. If the code point is in the Basic Multilingual Plane (U+0000 through U+FFFF), then it may be represented as a six-character sequence: a reverse solidus, followed by the lowercase letter u, followed by four hexadecimal digits that encode the code point. The hexadecimal letters A through F can be upper or lowercase. So, for example, a string containing only a single reverse solidus character may be represented as "\u005C".

To escape a code point that is not in the Basic Multilingual Plane, the code point is represented as a twelve-character sequence, encoding the UTF-16 surrogate pair. So, for example, a string containing only the G clef character (U+1D11E) may be represented as "\uD834\uDD1E".

Alternatively, there are two-character sequence escape representations of some popular characters. So, for example, a string containing only a single reverse solidus character may be represented more compactly as "\\".
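
For illustration only (not part of the proposed text), the escape forms described above can be sketched in Python. The function name escape_code_point is hypothetical, and the sketch assumes the standard surrogate pair arithmetic of UTF-16; the round trip through Python's json module is just one way to check the result.

```python
import json

def escape_code_point(cp: int) -> str:
    """Return the \\uXXXX escape form of a single Unicode code point (hypothetical helper)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if cp <= 0xFFFF:
        # Basic Multilingual Plane: one six-character sequence.
        return "\\u{:04X}".format(cp)
    # Outside the BMP: twelve characters encoding the UTF-16 surrogate pair.
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)    # high (lead) surrogate
    low = 0xDC00 + (offset & 0x3FF)   # low (trail) surrogate
    return "\\u{:04X}\\u{:04X}".format(high, low)

backslash = escape_code_point(0x5C)    # reverse solidus
g_clef = escape_code_point(0x1D11E)    # G clef

# Both the six-character escape and the two-character escape decode
# to the same single reverse solidus character.
decoded_long = json.loads('"' + backslash + '"')
decoded_short = json.loads('"\\\\"')

# The twelve-character escape decodes to the single non-BMP code point.
decoded_clef = json.loads('"' + g_clef + '"')
```

Here backslash and g_clef hold the two escape sequences used as examples in the text, and the decoded values show that a parser treats the alternative representations as equivalent.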

Note that this specification allows the inclusion of surrogate code points (U+D800 through U+DFFF) in JSON text, both directly and through escape sequences. However, Unicode code unit sequences containing surrogate code points are not well-formed, are prohibited by standards such as [RFC 3629], and may be rejected or modified by software such as character encoding converters. Developers and specification authors should carefully consider whether to allow surrogate code points in their uses of JSON.
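
As a rough illustration of the concern (again, not part of the proposed text), the following Python sketch shows one parser accepting an unpaired surrogate escape and a UTF-8 encoder then rejecting the resulting string. The helper names has_surrogate_code_points and can_encode_utf8 are hypothetical; other parsers and converters may behave differently.

```python
import json

def has_surrogate_code_points(s: str) -> bool:
    """True if s contains any code point in U+D800 through U+DFFF."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in s)

def can_encode_utf8(s: str) -> bool:
    """True if Python's strict UTF-8 encoder accepts s."""
    try:
        s.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False

# An unpaired high surrogate, written as an escape sequence.
# Python's json module accepts it and yields a one-code-point string.
lone = json.loads('"\\uD834"')

# A properly paired escape yields the G clef, with no surrogate code points left.
paired = json.loads('"\\uD834\\uDD1E"')

lone_has_surrogate = has_surrogate_code_points(lone)
paired_has_surrogate = has_surrogate_code_points(paired)
lone_encodable = can_encode_utf8(lone)
paired_encodable = can_encode_utf8(paired)
```

An application that wants to disallow surrogate code points could apply a check along the lines of has_surrogate_code_points to each parsed string, which is the kind of decision the paragraph above asks developers to consider.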

[continue with grammar]


Section 4, Parsers:

Before:
    An implementation may set limits on the length and character contents of strings.

After:
    An implementation may set limits on the length and code point contents of strings.

[1] http://www.ietf.org/mail-archive/web/json/current/msg00814.html

Norbert