Re: [Json] Proposal for strings/Unicode text

Nico Williams <nico@cryptonector.com> Thu, 13 June 2013 18:16 UTC

In-Reply-To: <CAHBU6ismp6HZqUQOgDnjBRYtC5jFCzhTB3RFG8Ms7qohz+w1eg@mail.gmail.com>
References: <CAHBU6ivNjMUwN2Hsn-E8FKxjqXS6b4qz=_MeeaHahWBWqG_Hgg@mail.gmail.com> <ED62F638-C0C4-411D-BA5B-EB9BA71EDB75@lindenbergsoftware.com> <20130613003213.GA26989@mercury.ccil.org> <jr5jr85h6pig2cr9id5hf1eh586g0u09i7@hive.bjoern.hoehrmann.de> <20130613121620.GB11739@mercury.ccil.org> <CAHBU6ismp6HZqUQOgDnjBRYtC5jFCzhTB3RFG8Ms7qohz+w1eg@mail.gmail.com>
Date: Thu, 13 Jun 2013 13:16:08 -0500
Message-ID: <CAK3OfOhvP_0gUOfkumTGZDa+QM1G9W1q0NmB2KdPxoK_uX+kBQ@mail.gmail.com>
From: Nico Williams <nico@cryptonector.com>
To: Tim Bray <tbray@textuality.com>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Cc: Bjoern Hoehrmann <derhoermi@gmx.net>, John Cowan <cowan@mercury.ccil.org>, "json@ietf.org" <json@ietf.org>
Subject: Re: [Json] Proposal for strings/Unicode text

On Thu, Jun 13, 2013 at 1:00 PM, Tim Bray <tbray@textuality.com> wrote:
> On Thu, Jun 13, 2013 at 5:16 AM, John Cowan <cowan@mercury.ccil.org> wrote:
>> The point is that if JSON is encoded in UTF-8, any surrogate code points
>> MUST be escaped, even though the grammar does not say so.
>
> Why?  UTF-8 is perfectly capable of representing those integers.  Yes, the
> spec says that You Shouldn’t Do That, but it says the same thing about
> unpaired surrogates in UTF-16.  For historical reasons JSON allows the
> encoding of stuff that is strictly nonconforming to Unicode.  This will
> break lots of things, not just UTF-8 decoders (most of which, I bet, will
> never actually notice).  -T

I was thinking the same thing.  We're way into "you shouldn't do that
but actually, look, you can" territory.

I think it's important that some \uXXXX sequences not be unescaped by
intermediaries, or, rather, that certain code points always be escaped
by encoders.  E.g., \u0000, \u2028, \uD800, and so on.  The reason is
that a parser may not be able to handle such code points unescaped.
We should stop at that and otherwise allow just about any code point.
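
To make the "always escaped by encoders" idea concrete, here is a minimal
Python sketch of a string encoder that escapes a small set of code points
and emits everything else literally.  The names must_escape and
encode_json_string are hypothetical, and the exact set is an assumption:
the examples above are only U+0000, U+2028 and a surrogate, and I have
guessed at all C0 controls, U+2029, and the whole surrogate range for
"and so on".

```python
import json


def must_escape(cp: int) -> bool:
    """Assumed set of code points an encoder should never emit raw."""
    return (
        cp < 0x20                    # C0 controls, including U+0000
        or cp in (0x2028, 0x2029)    # line/paragraph separators (U+2029 is my guess)
        or 0xD800 <= cp <= 0xDFFF    # surrogate code points, paired or not
    )


def encode_json_string(s: str) -> str:
    """Encode s as a JSON string literal, escaping only what must be escaped."""
    out = ['"']
    for ch in s:
        cp = ord(ch)
        if ch in ('"', '\\'):
            out.append('\\' + ch)          # mandatory JSON escapes
        elif must_escape(cp):
            out.append('\\u%04X' % cp)     # keep risky code points as \uXXXX
        else:
            out.append(ch)                 # any other code point goes out as-is
    out.append('"')
    return ''.join(out)


if __name__ == "__main__":
    sample = "a\u0000b\u2028c\ud800\u00e9"
    encoded = encode_json_string(sample)
    print(encoded)
    # The escaped form survives a round trip through a stock parser.
    assert json.loads(encoded) == sample
```

Since the surrogate stays in \uXXXX form, the output can be serialized as
UTF-8 without ever producing an encoded surrogate, and an intermediary
that leaves these escapes alone never has to cope with the raw code
points.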

Nico
--