[Json] Proposal: Code points and surrogates

Norbert Lindenberg <ietf@lindenbergsoftware.com> Tue, 18 June 2013 07:44 UTC

From: Norbert Lindenberg <ietf@lindenbergsoftware.com>
Date: Tue, 18 Jun 2013 00:43:53 -0700
Message-Id: <05A7D2E5-C119-4900-B52B-54B0F1206300@lindenbergsoftware.com>
To: json@ietf.org
Cc: Norbert Lindenberg <ietf@lindenbergsoftware.com>
Subject: [Json] Proposal: Code points and surrogates

This proposal attempts to clarify that JSON allows all Unicode code points, and to describe the issues that arise with surrogate code points. It proposes no normative changes, so as to remain compatible with the ECMAScript specification, but recommends that clients carefully consider whether to allow surrogate code points in their uses of JSON.

Details:
- changed from "character" to "code point", unless specific characters are referenced,
- added paragraph with information about surrogate code points,
- rearranged paragraphs so that escape sequences for BMP and non-BMP code points are discussed side by side.

Compared to Tim's proposal [1], this proposal
- says what the specification actually allows (code points) rather than what may be intended (characters),
- gets rid of the term "16-bit quantities",
- provides more concrete information about surrogate code points than just "breakage",
- doesn't discuss the need, or lack thereof, to update the RFC in the future.

Section 1, Introduction:

Before:
A string is a sequence of zero or more Unicode characters [UNICODE].

After:
A string is a sequence of zero or more Unicode code points [UNICODE].

Section 2.5, Strings:

A string begins and ends with quotation marks. All Unicode code points may be placed within the quotation marks except for the characters that must be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F).

Any code point may be escaped. If the code point is in the Basic Multilingual Plane (U+0000 through U+FFFF), then it may be represented as a six-character sequence: a reverse solidus, followed by the lowercase letter u, followed by four hexadecimal digits that encode the code point. The hexadecimal letters A through F may be uppercase or lowercase. So, for example, a string containing only a single reverse solidus character may be represented as "\u005C".
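
As an illustration only, the six-character form for a BMP code point might be produced as in the following Python sketch. The helper name escape_bmp is hypothetical and not part of the proposal; the round trip uses Python's standard json module.

```python
import json

def escape_bmp(code_point: int) -> str:
    """Return the six-character escape: reverse solidus, 'u', four hex digits."""
    if not 0 <= code_point <= 0xFFFF:
        raise ValueError("code point is not in the Basic Multilingual Plane")
    return "\\u{:04X}".format(code_point)

# Reverse solidus (U+005C), as in the example above.
escaped = escape_bmp(0x5C)

# Wrapping the escape in quotation marks yields a JSON string that
# decodes back to the single reverse solidus character.
decoded = json.loads('"' + escaped + '"')

# Lowercase hexadecimal digits denote the same code point.
lower = json.loads('"\\u005c"')
```

Here decoded and lower hold the same one-character string, since the case of the hexadecimal letters does not matter.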

To escape a code point that is not in the Basic Multilingual Plane, the code point is represented as a twelve-character sequence, encoding the UTF-16 surrogate pair. So, for example, a string containing only the G clef character (U+1D11E) may be represented as "\uD834\uDD1E".
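
The surrogate pair for a non-BMP code point follows from the usual UTF-16 arithmetic. A minimal Python sketch, assuming a hypothetical helper named escape_supplementary (not part of the proposal):

```python
import json

def escape_supplementary(code_point: int) -> str:
    """Return the twelve-character escape for a code point above U+FFFF."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the Basic Multilingual Plane")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # high (lead) surrogate: top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # low (trail) surrogate: bottom 10 bits
    return "\\u{:04X}\\u{:04X}".format(high, low)

# G clef (U+1D11E), as in the example above.
g_clef = escape_supplementary(0x1D11E)

# Python's json module combines the escaped pair into one code point.
decoded = json.loads('"' + g_clef + '"')
```

For U+1D11E the offset is 0xD11E, which splits into 0x34 and 0x11E and so gives the pair D834 DD1E used in the example.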

Alternatively, there are two-character sequence escape representations of some popular characters. So, for example, a string containing only a single reverse solidus character may be represented more compactly as "\\".
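
For reference, the two-character escapes defined by the RFC 4627 grammar can be checked against a parser as sketched below in Python. The table name TWO_CHAR_ESCAPES is chosen here for illustration only.

```python
import json

# Two-character escapes from the RFC 4627 string grammar, mapped to the
# character each one denotes.
TWO_CHAR_ESCAPES = {
    '\\"': '"',    # quotation mark
    '\\\\': '\\',  # reverse solidus
    '\\/': '/',    # solidus
    '\\b': '\b',   # backspace
    '\\f': '\f',   # form feed
    '\\n': '\n',   # line feed
    '\\r': '\r',   # carriage return
    '\\t': '\t',   # tab
}

# Decode each escape as the sole content of a JSON string.
decoded = {esc: json.loads('"' + esc + '"') for esc in TWO_CHAR_ESCAPES}

# The compact and the six-character forms denote the same string.
compact = json.loads('"\\\\"')
long_form = json.loads('"\\u005C"')
```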

Note that this specification allows the inclusion of surrogate code points (U+D800 through U+DFFF) in JSON text, both directly and through escape sequences. However, Unicode code unit sequences containing surrogate code points are not well-formed, are prohibited by standards such as [RFC 3629], and may be rejected or modified by software such as character encoding converters. Developers and specification authors should carefully consider whether to allow surrogate code points in their uses of JSON.
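
To make the issue concrete, the following Python sketch shows one implementation's behavior; other parsers and converters may behave differently. The helper contains_surrogate is hypothetical and is only one way an application might check strings it receives.

```python
import json

def contains_surrogate(text: str) -> bool:
    """Report whether any code point in text lies in U+D800 through U+DFFF."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in text)

# CPython's json module accepts an unpaired surrogate escape.
lone = json.loads('"\\uD834"')

# A properly paired escape decodes to a single non-surrogate code point.
paired = json.loads('"\\uD834\\uDD1E"')

# Converting the unpaired surrogate to UTF-8 fails, because UTF-8
# (RFC 3629) prohibits encoding surrogate code points.
try:
    lone.encode("utf-8")
    utf8_ok = True
except UnicodeEncodeError:
    utf8_ok = False
```

An application that has decided not to allow surrogate code points could apply a check of this kind after parsing and reject the offending strings.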

[continue with grammar]

Section 4, Parsers:

Before:
An implementation may set limits on the length and character contents of strings.

After:
An implementation may set limits on the length and code point contents of strings.

[1] http://www.ietf.org/mail-archive/web/json/current/msg00814.html

Norbert
