Re: [Json] Proposal for strings/Unicode text

Carsten Bormann <cabo@tzi.org> Wed, 12 June 2013 23:41 UTC


On Jun 13, 2013, at 00:49, Tim Bray <tbray@textuality.com> wrote:

> Strings are delimited with quotation marks (U+0022 QUOTATION MARK).  They are intended to contain sequences of Unicode characters.  Note however that the normative ABNF in this section allows the inclusion of 16-bit quantities, for example unpaired surrogate-block code points, in ways which can never be useful for representing characters and is likely to cause breakage in software designed to process Unicode text.

A kitchen knife is intended to cut food.  Note however that the normative cutting edge of the knife also allows the use against intruding burglars, for example by slashing open their guts, which can never be useful for food and is likely to cause you trouble in a number of ways, including bloodstains on your shirt.

I don't want to read this sentence in the introduction of the manual for my kitchen knife, please.


Again, can we focus on the correct usage and banish the avenues for misuse to an appendix?
It could even be called "the JSON character model" to obscure its grisly purpose.

You are not going to get both "correct" and "understandable" if you combine the intended usage with all the arcane legacy issues into one piece of text.

(If I'm the only one who cares about this editorial issue, this is going to be my last comment on this... if this WG does manage to turn one of the nicest-written RFCs into goulash, I can always point my students to read the original.)

Grüße, Carsten


PS.: "16-bit quantities" is highly confusing.  Outside of mathematics and physics, "quantity" gives the amount of something.  There is absolutely no reason not to use the term "code point" from Unicode.

»In the Unicode Standard, the codespace consists of the integers from 0 to 10FFFF[16], comprising 1,114,112 code points available for assigning the repertoire of abstract characters.«

See also Table 2-3.

Yes, all of them can be used in a JSON document...  
Some of them must be escaped, some of them are non-characters but otherwise innocuous, some of them (surrogates) actively try to slash your guts, but they are all code points.
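
To make the categories above concrete, here is a minimal sketch in Python. The function name `classify` and its labels are made up for illustration. The "must-escape" set assumes the usual JSON string rule (quotation mark, reverse solidus, and the control characters U+0000 through U+001F); the noncharacter test assumes the Unicode definition (U+FDD0..U+FDEF plus the last two code points of each plane).

```python
# Hypothetical helper, for illustration only: sorts a code point into
# the categories named in the paragraph above.

MAX_CODE_POINT = 0x10FFFF  # the codespace is 0..10FFFF (hex)


def classify(cp: int) -> str:
    """Return a rough category label for a Unicode code point."""
    if not 0 <= cp <= MAX_CODE_POINT:
        raise ValueError("not a Unicode code point")
    # Surrogates: U+D800..U+DFFF.
    if 0xD800 <= cp <= 0xDFFF:
        return "surrogate"
    # Assumed JSON rule: controls below U+0020, the quotation mark and
    # the reverse solidus cannot appear unescaped inside a string.
    if cp < 0x20 or cp in (0x22, 0x5C):
        return "must-escape"
    # Noncharacters: U+FDD0..U+FDEF, and any code point whose low
    # 16 bits are FFFE or FFFF.
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return "noncharacter"
    return "other"


# Size of the codespace, matching the figure quoted from the standard.
CODESPACE_SIZE = MAX_CODE_POINT + 1
```

Whatever label comes back, the input is still a code point, which is the only point being made here.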

»Surrogate code points cannot be conformantly interchanged using Unicode encoding forms. They do not correspond to Unicode scalar values and thus do not have well-formed representations in any Unicode encoding form. (See Section 3.8, Surrogates.)«

So they must be escaped (unless you are using them for their intended purpose in a UTF-16-encoded JSON document, a rare species).  There also is no way to notate a sequence of a high surrogate and a low surrogate using escape sequences in JSON, because that always stands for the non-BMP code point that results from the values of the two surrogates.  All that should be mentioned somewhere, because it takes a couple of hours of analysis to ascertain that.  (Maybe also mention that that doesn't hurt the "JavaScript string as vector of 16-bit numbers" use case, because the non-BMP characters can be (will be!) taken apart again by the recipient.  So you don't even have to escape them...  More stuff for the appendix, or maybe for a special "worst practice" document.)
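
The escape-sequence behaviour described above can be observed with a short sketch. It uses Python's standard `json` module purely as one example of a parser; other implementations may treat lone surrogates differently, and the helper names `decode_json_string` and `is_interchangeable` are invented for this illustration.

```python
import json


def decode_json_string(text: str) -> str:
    """Parse a JSON text that consists of a single string."""
    value = json.loads(text)
    if not isinstance(value, str):
        raise TypeError("expected a JSON string")
    return value


def is_interchangeable(s: str) -> bool:
    """True if s can be encoded as UTF-8, i.e. has no lone surrogates."""
    try:
        s.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


# An escaped high surrogate followed by an escaped low surrogate does
# not yield two surrogate code points: this parser reads the pair as
# the single non-BMP code point U+1F600.
pair = decode_json_string('"\\ud83d\\ude00"')

# An escaped lone surrogate is accepted by this parser, but the result
# has no well-formed UTF-8 representation.
lone = decode_json_string('"\\ud800"')

# Going the other way: the non-BMP character can be written as a pair
# of escapes, or left unescaped in the output.
escaped = json.dumps("\U0001F600")                        # ASCII-only output
unescaped = json.dumps("\U0001F600", ensure_ascii=False)  # literal character
```

In this sketch `pair` has length 1, while `lone` has length 1 but fails `is_interchangeable`; both `escaped` and `unescaped` parse back to the same one-character string.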