Re: [Json] On characters and code points

Nico Williams <nico@cryptonector.com> Sat, 08 June 2013 20:41 UTC

In-Reply-To: <3A9644F9-A0E2-46FA-B4BD-9A834C2F442B@vpnc.org>
References: <A723FC6ECC552A4D8C8249D9E07425A70FC2E7E1@xmb-rcd-x10.cisco.com> <51B06F38.8050707@crockford.com> <CAHBU6iuFBuW-RfgBLQF5q4BnUOzs088QXW3uOQG1OjBFjZttkw@mail.gmail.com> <51B1B4E7.8090101@it.aoyama.ac.jp> <9ld3r8pc0tufif18dohb2fmi0ijna1vs4n@hive.bjoern.hoehrmann.de> <56A163E9-E7CD-46B3-9984-8F009EBFF500@vpnc.org> <CDFC7751-98EE-466C-98D9-A53D278B2113@tzi.org> <3A9644F9-A0E2-46FA-B4BD-9A834C2F442B@vpnc.org>
Date: Sat, 08 Jun 2013 15:41:43 -0500
Message-ID: <CAK3OfOhHVz_yae02EacgpHC7xbqU3A_JWuuENgafQ8LF=iyVRg@mail.gmail.com>
From: Nico Williams <nico@cryptonector.com>
To: Paul Hoffman <paul.hoffman@vpnc.org>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Cc: Carsten Bormann <cabo@tzi.org>, "json@ietf.org" <json@ietf.org>
Subject: Re: [Json] On characters and code points

On Sat, Jun 8, 2013 at 10:25 AM, Paul Hoffman <paul.hoffman@vpnc.org> wrote:
> On Jun 8, 2013, at 2:38 AM, Carsten Bormann <cabo@tzi.org> wrote:
>> (4) The remaining question appears to be what we do with unpaired escaped surrogates.
>> The answer will have to be a bit wishy-washy, because anything strong will invalidate half of the implementations.  It is probably a good idea not to "break" those applications that compensate for JavaScript's lack of a binary string type by using UTF-16 as a vector of unconstrained 16-bit values, but we also cannot mandate that everyone adopt this 1990s-style hack.

I'm starting to conclude that "anything goes..."

"...with the understanding that in some cases some, er, code points
get dropped".  E.g., a surrogate encoded in UTF-8 or UTF-32 makes no
sense, and though some parsers can leave it be, a parser that converts
to UTF-16 probably can't leave it there.  Note that there's a bit of a
difference between an escaped and an unescaped such code point: the
former might be an attempt to pass binary data, while the latter could
be either an error or the result of unescaping the former.
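
To make the escaped case concrete, here is a rough sketch using
Python's standard json module.  It only illustrates how one
implementation behaves; it is not a proposal for what the spec should
say.  The helper names (can_encode_utf8, drop_lone_surrogates) are made
up for this example.

```python
import json

# An escaped lone surrogate is accepted by Python's json module and
# becomes a str holding the single code point U+D800.
escaped = json.loads('"\\ud800"')

# A properly paired escape is combined into one supplementary code point.
paired = json.loads('"\\ud83d\\ude00"')


def can_encode_utf8(s):
    """Return True if s can be encoded as strict UTF-8."""
    try:
        s.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


def drop_lone_surrogates(s):
    """One possible 'code points get dropped' policy (an assumption):
    remove every code point in the surrogate range U+D800..U+DFFF."""
    return "".join(ch for ch in s if not 0xD800 <= ord(ch) <= 0xDFFF)


print(can_encode_utf8(escaped))   # strict UTF-8 refuses the lone surrogate
print(can_encode_utf8(paired))    # the paired form encodes fine
print(drop_lone_surrogates("a\ud800b"))
```

The parser here "leaves it be", and the trouble only shows up later
when something tries to re-encode the string as UTF-8.  Python's
"surrogatepass" error handler would let the value through as the bytes
ED A0 80, which is the sort of thing a strict UTF-8 consumer rejects.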

> Ummm, how is that "much better"? "Code points minus THEONESWEHATE" seems a lot simpler than "characters plus ADDITIONAL1 plus ADDITIONAL2 plus ADDITIONAL3".

The latter seems better in that we defer the definition of "character"
to the Unicode Consortium, but the former seems better in that it's much more specific
about THEONESWEHATE.  The latter seems likely to allow more
interoperability with any implementations that allow just about
anything.

Nico
--