Re: [Json] Proposal for strings/Unicode text

Carsten Bormann <cabo@tzi.org> Wed, 12 June 2013 22:30 UTC

From: Carsten Bormann <cabo@tzi.org>
Date: Thu, 13 Jun 2013 00:30:10 +0200
To: Tim Bray <tbray@textuality.com>
Cc: "json@ietf.org" <json@ietf.org>

Hmm.

Somehow I think the JSON specification should focus on describing what the intended usage is.

I strongly prefer adding an appendix M for things that you can do with the ABNF that are almost, but not entirely unlike JSON.

Grüße, Carsten

PS.: I support jettisoning liturgical language in standards, and I applaud Douglas for slipping with respect to the liturgical term "octet" only twice (both in the same paragraph).  But a document must also speak the language of its editor, and if Douglas thinks "reverse solidus" is the best way to speak about what some Germans (but never me) would call "Rückschräger", that's fine, as long as it is consistent.

PPS.: On the specific wording:
> Before:
>    A string is a sequence of zero or more Unicode characters [UNICODE].
> After:
>    A string is intended to contain sequences of zero or more Unicode characters [UNICODE 6.2]

A string is a sequence of characters.  [not sequences of them]
Add something like: "To reduce the burden on implementations, JSON is less selective in what it accepts as a character than Unicode itself is.  See also Appendix M."

> Strings begin and end with quotation marks. 

The representation does, the string rarely does.
RFC 4627 got this right consistently, but in language so tight that extracting a single sentence may lose the necessary context.

> They are intended to contain sequences of Unicode characters; note, however, that the ABNF in this section allows the inclusion of 16-bit quantities in ways which can never be useful for representing characters and is likely to cause breakage in software designed to process Unicode text.

This is where I would simply point to Appendix M.
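
A minimal sketch of the kind of breakage the quoted sentence warns about, assuming Python's standard json module as a stand-in for "software designed to process Unicode text" (other parsers may behave differently, and some reject such input outright):

```python
import json

# A lone high surrogate written as a \u escape. The string ABNF permits this,
# although it can never represent a character on its own.
text = '"\\ud800"'

# Python's json module (used here only as an example parser) accepts it...
value = json.loads(text)

# ...but the result is not valid Unicode text, so encoding it as UTF-8 fails.
try:
    value.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    encodable = False
```

The parse succeeds and the failure only shows up later, when the string is written out.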

> The ABNF allows the use of many Unicode code points that could be used in future to represent Unicode characters, but have not yet been assigned. Therefore, this specification should not need revision as the Unicode character repertoire continues to grow.

This is something that could even be said in the introduction.
Or in a section about stability and protocol evolution (the same section that is needed to say that [past and future] changes in JavaScript don't change JSON).

> 16-bit quantities

These are Unicode code points.

> (normally Unicode characters from the Basic Multilingual Plane (U+0000 through U+FFFF)) may be “escaped”, or represented as a six-character sequence: a backslash (U+005C REVERSE SOLIDUS), followed by the lowercase letter u, followed by four hexadecimal digits that encode the character's code point.  The hexadecimal letters A through F can be upper or lower case.  So, for example, a string containing only a single backslash may be represented as "\u005C".
> 
> Alternatively, there are two-character sequence escape representations of some popular characters.  So, for example, a string containing only a single backslash may be represented more compactly as "\\".
> 
>  To escape an extended character

Non-BMP characters aren't "extended".  They have rights, too!
You MUST rid your mind of discriminating against them.
(I know, this was just copied over...)
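
For reference, the two escape forms quoted above can be checked against each other. This is only a sketch, assuming Python's standard json module as the parser; the variable names are illustrative:

```python
import json

# Six-character form, with upper-case and lower-case hexadecimal digits.
long_form = json.loads(r'"\u005C"')
lower_hex = json.loads(r'"\u005c"')

# Two-character form for the same character.
short_form = json.loads(r'"\\"')

# All three representations denote the same one-character string.
same = long_form == lower_hex == short_form
```

Here `same` is true and each value is a single backslash, which is the distinction between the representation and the string made earlier in this message.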

> that is not in the Basic Multilingual Plane, the character is represented as a twelve-character sequence, encoding the UTF-16 surrogate pair.  So, for example, a string containing only U+1D11E G CLEF may be represented as
>    "\uD834\uDD1E".

(This could add a small apologetic clause pointing out the UTF-16 roots of the weird notation.  Or not.)
This needs another pointer to appendix M.
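
To make the UTF-16 roots of that notation concrete, here is a rough sketch of the surrogate-pair arithmetic behind the quoted example. The helper name `to_surrogate_pair` is made up for illustration, and Python's json module is assumed only as a convenient parser to check against:

```python
import json


def to_surrogate_pair(code_point):
    """Split a non-BMP code point into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a non-BMP code point")
    offset = code_point - 0x10000          # 20 significant bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1D11E)

# Build the twelve-character escape sequence and parse it back.
escaped = '"\\u{:04X}\\u{:04X}"'.format(high, low)
decoded = json.loads(escaped)
```

With this input, `escaped` is the twelve-character sequence from the quoted text wrapped in quotation marks, and `decoded` is a one-character string holding U+1D11E.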