Re: [Json] Unpaired surrogates in JSON strings

Tatu Saloranta <tsaloranta@gmail.com> Fri, 07 June 2013 18:15 UTC

In-Reply-To: <CAK3OfOgw7-hwiYVESNkVe8xCux+JQBY6_-D5L4nthhHjMzXnGQ@mail.gmail.com>
References: <A723FC6ECC552A4D8C8249D9E07425A70FC2E7E1@xmb-rcd-x10.cisco.com> <51B06F38.8050707@crockford.com> <CAHBU6iuFBuW-RfgBLQF5q4BnUOzs088QXW3uOQG1OjBFjZttkw@mail.gmail.com> <51B1B4E7.8090101@it.aoyama.ac.jp> <9ld3r8pc0tufif18dohb2fmi0ijna1vs4n@hive.bjoern.hoehrmann.de> <CAK3OfOgw7-hwiYVESNkVe8xCux+JQBY6_-D5L4nthhHjMzXnGQ@mail.gmail.com>
Date: Fri, 07 Jun 2013 11:15:47 -0700
Message-ID: <CAGrxA25ZCxAXoEzd0=Xtt4DWW4--Wa_4GpqpXV_mfWOD-+DkwA@mail.gmail.com>
From: Tatu Saloranta <tsaloranta@gmail.com>
To: Nico Williams <nico@cryptonector.com>
Content-Type: multipart/alternative; boundary="e89a8f3ba28b83fc3404de946ba4"
Cc: "Martin J. Dürst" <duerst@it.aoyama.ac.jp>, Bjoern Hoehrmann <derhoermi@gmx.net>, "json@ietf.org" <json@ietf.org>
Subject: Re: [Json] Unpaired surrogates in JSON strings

On Fri, Jun 7, 2013 at 11:09 AM, Nico Williams <nico@cryptonector.com> wrote:

> On Fri, Jun 7, 2013 at 5:42 AM, Bjoern Hoehrmann <derhoermi@gmx.net>
> wrote:
> > Actually there are many good reasons for having unpaired surrogates in
> > JSON documents. A simple example would be a test suite for string APIs.
>
> Or what the heck, if ECMAScript allows any 16-bit values, 0x0000..0xFFFF
> to be used (escaped as \uXXXX if necessary) then one very useful use
> of that is encoding binary data: when parsing you know if you have
> binary data when you see any 16-bit code units that don't make any
> sense in Unicode text.  Not that I'm advocating this... but if we did
> allow this then it wouldn't preclude us from saying that a string of
> text must not include unpaired surrogates.


Not really. Since most content is exchanged as UTF-8, it gives you a 3-to-2
conversion (most 16-bit code units take three bytes in UTF-8), which is no
better than Base64 (4-to-3), nor necessarily any faster. It would only help
when using UTF-16.
APIs would also expose this as text, requiring conversion or coercion
(depending on the platform), so it would not even be a shortcut.
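
To put rough numbers on the size argument, here is a minimal sketch in Python. The helper names (json_utf8_size, base64_size) and the packing scheme (two payload bytes per big-endian 16-bit code unit) are my own assumptions for illustration, not anything proposed in this thread. It also treats every surrogate as needing a \uXXXX escape, which is a simplification.

```python
import base64


def json_utf8_size(payload: bytes) -> int:
    """Bytes needed to carry payload as 16-bit code units inside a
    UTF-8 encoded JSON string (hypothetical packing: 2 bytes per unit)."""
    if len(payload) % 2:
        payload += b"\x00"  # pad to a whole code unit (illustrative choice)
    total = 0
    for i in range(0, len(payload), 2):
        unit = (payload[i] << 8) | payload[i + 1]
        if unit < 0x20 or 0xD800 <= unit <= 0xDFFF:
            total += 6  # control characters and surrogates: \uXXXX escape
        elif unit in (0x22, 0x5C):
            total += 2  # quote and backslash: two-character escape
        elif unit < 0x80:
            total += 1  # one UTF-8 byte
        elif unit < 0x800:
            total += 2  # two UTF-8 bytes
        else:
            total += 3  # three UTF-8 bytes
    return total


def base64_size(payload: bytes) -> int:
    """Bytes needed to carry payload as padded Base64 text."""
    return len(base64.b64encode(payload))


# Every 16-bit value exactly once, as a stand-in for uniformly
# distributed binary data.
sample = b"".join(u.to_bytes(2, "big") for u in range(0x10000))

print("code units in UTF-8:", json_utf8_size(sample) / len(sample))
print("Base64:             ", base64_size(sample) / len(sample))
```

Under these assumptions the code-unit encoding comes out somewhat larger than Base64 for evenly distributed data, since only values below 0x800 take fewer than three bytes and the escaped ones take six.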

-+ Tatu +-