Re: [Json] Proposal for strings/Unicode text

Norbert Lindenberg <ietf@lindenbergsoftware.com> Wed, 12 June 2013 22:46 UTC

On Jun 12, 2013, at 11:46 , Tim Bray wrote:

> Rationale:
> - emphasize the important fact that Strings are *intended* for Unicode characters
> - document the important fact that the rules allow horrible Unicode practices
> - say “backslash” instead of “reverse solidus” :)

The JSON RFC seems to use Unicode character names, in this case "reverse solidus".

> In section 1, introduction
> 
> Before:
>    A string is a sequence of zero or more Unicode characters [UNICODE].
> After:
>    A string is intended to contain sequences of zero or more Unicode characters [UNICODE 6.2]
> 
> Rewrite section 2.5 as follows:
> 
> Strings begin and end with quotation marks.  They are intended to contain sequences of Unicode characters; note however that the ABNF in this section allows the inclusion of 16-bit quantities in ways which can never be useful for representing characters and are likely to cause breakage in software designed to process Unicode text.

This warning is too vague to be useful. Which specific risks do you think need to be discussed here? Also, the ABNF doesn't do anything specifically for 16-bit quantities, as far as I can see.

> The ABNF allows the use of many Unicode code points that could be used in future to represent Unicode characters, but have not yet been assigned. Therefore, this specification should not need revision as the Unicode character repertoire continues to grow.
> 
> 16-bit quantities (normally Unicode characters from the Basic Multilingual Plane, U+0000 through U+FFFF) may be “escaped”, or represented as a six-character sequence: a backslash (U+005C REVERSE SOLIDUS), followed by the lowercase letter u, followed by four hexadecimal digits that encode the character's code point.  The hexadecimal letters A through F can be upper or lower case.  So, for example, a string containing only a single backslash may be represented as "\u005C".

These escape sequences aren't about 16-bit quantities - they represent Unicode BMP code points. If the parser inserts a BMP code point into an output string in UTF-16 (in JavaScript, Java, and others), then the result will be a 16-bit quantity. If it inserts the code point into an output string in UTF-8 (in Python and others), then the result will be a sequence of 1-3 bytes.
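
To make the distinction concrete, here is a small Python sketch (my own illustration, not text for the RFC). It assumes Python's standard json module as the parser and measures the encoded forms explicitly, rather than relying on whatever the runtime uses internally. The variable names are made up for the example.

```python
import json

# "\u005C" is the escape from the proposal's example; it names code point U+005C.
decoded = json.loads('"\\u005C"')
assert decoded == "\\" and ord(decoded) == 0x5C

# The same code point takes a different form in each output encoding.
utf16_units = len(decoded.encode("utf-16-le")) // 2  # count of 16-bit code units
utf8_bytes = len(decoded.encode("utf-8"))            # count of bytes

# A BMP code point further up the range, U+7684.
cjk = json.loads('"\\u7684"')
cjk_utf16_units = len(cjk.encode("utf-16-le")) // 2
cjk_utf8_bytes = len(cjk.encode("utf-8"))

print(utf16_units, utf8_bytes, cjk_utf16_units, cjk_utf8_bytes)
```

Both escapes name exactly one code point; only the chosen encoding decides whether that becomes one 16-bit unit or one to three bytes.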

If we discuss these escape sequences in prose, then the sequences for BMP and supplementary characters need to be discussed together so that it's clear what's a surrogate pair and how unpaired surrogates are handled.
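
As a rough sketch of why the two cases belong together (again using Python's json module as a stand-in parser, which is my assumption and not something the RFC prescribes): two consecutive escapes that form a surrogate pair decode to one supplementary code point, while a lone surrogate escape is accepted and yields a string that cannot be encoded as UTF-8. The helper can_encode_utf8 is a hypothetical name for this example.

```python
import json

# U+1F600 lies outside the BMP, so its escaped form is a surrogate pair.
paired = json.loads('"\\uD83D\\uDE00"')
paired_code_points = [hex(ord(c)) for c in paired]

# A lone high surrogate matches the ABNF, and Python's json module accepts it.
unpaired = json.loads('"\\uD83D"')


def can_encode_utf8(s: str) -> bool:
    """Return True if s contains only code points that UTF-8 can represent."""
    try:
        s.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


print(paired_code_points, can_encode_utf8(paired), can_encode_utf8(unpaired))
```

Other parsers may reject the unpaired case or replace it with U+FFFD; that variation is exactly what the prose would need to address.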

JSON syntax allows almost all Unicode code points (with the exceptions visible in the "unescaped" production) to be inserted into strings directly, so escape sequences are mostly a convenience for using JSON in environments that restrict the set of allowed characters (such as RFCs), don't have the necessary fonts and input methods installed, or benefit from making the code points visible (e.g., in test cases).
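
For illustration, a short Python sketch (assuming the standard json module; the ensure_ascii flag is that module's own option, not anything defined by JSON) showing that the direct and escaped forms are interchangeable, and that escaping is a choice made for the benefit of the transport or the reader:

```python
import json

# The character written directly and its escaped form decode to the same string.
direct = json.loads('"的"')
escaped = json.loads('"\\u7684"')
assert direct == escaped

# The serializer can emit either form; both are valid JSON texts.
ascii_only = json.dumps("的")                  # escapes non-ASCII by default
as_is = json.dumps("的", ensure_ascii=False)   # writes the character directly

print(ascii_only, as_is)
```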

> Alternatively, there are two-character sequence escape representations of some popular characters.  So, for example, a string containing only a single backslash may be represented more compactly as "\\".

I'd assume that this is not about popularity, but about the need to represent control characters and characters that are also used within the JSON syntax. I don't see two-character escapes for "e" or "的".
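
A quick way to see this is to list what the two-character escapes in the grammar actually stand for. The sketch below uses Python's json module (my choice of parser for illustration); the table two_char is written out by hand from the "escape" production in the JSON RFC.

```python
import json

# Each two-character escape, mapped to the code point it represents.
two_char = {
    '\\"': 0x22,   # quotation mark, delimits strings
    '\\\\': 0x5C,  # reverse solidus, introduces escapes
    '\\/': 0x2F,   # solidus
    '\\b': 0x08,   # backspace
    '\\f': 0x0C,   # form feed
    '\\n': 0x0A,   # line feed
    '\\r': 0x0D,   # carriage return
    '\\t': 0x09,   # tab
}

# Decode each escape as a one-character JSON string and record its code point.
decoded = {esc: ord(json.loads('"' + esc + '"')) for esc in two_char}

# Control characters without a short form fall back to the six-character form.
other_control = json.dumps("\u0001")

print(decoded == two_char, other_control)
```

Every entry is either a control character or a character with a role in JSON syntax (solidus being the odd one out), which is the point: none of them is there because it is frequent in text.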

Norbert