Re: [Json] Unpaired surrogates in JSON strings

Tim Bray <tbray@textuality.com> Thu, 06 June 2013 14:57 UTC

F0, 90, 8D, 86
On Thu, Jun 6, 2013 at 4:15 AM, Douglas Crockford <douglas@crockford.com> wrote:

> What then is the standard name for a 16-bit element of text? When
> JavaScript was created, that word was character. What is the word now?
>

The only somewhat-standardized term would be “UTF-16 code unit”.  But
that’s not really a “unit of text” any more than the 2nd byte of a
character encoded in 3 bytes with UTF-8 is.
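
To make the comparison concrete, here is a small Python sketch. The two sample characters (U+1D11E and U+20AC) and the helper name decodes_alone are my own choices for illustration, not anything from the thread. It shows that one character outside the BMP becomes two 16-bit units in UTF-16, and that a single one of those units is no more decodable on its own than the middle byte of a 3-byte UTF-8 sequence.

```python
# Illustration only: the sample characters and helper name are assumptions.
clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, outside the Basic Multilingual Plane
utf16 = clef.encode("utf-16-be")

# Split the UTF-16 bytes into 16-bit units: one character, two units.
units = [int.from_bytes(utf16[i:i + 2], "big") for i in range(0, len(utf16), 2)]

euro = "\u20ac"  # EURO SIGN, three bytes in UTF-8
utf8 = euro.encode("utf-8")


def decodes_alone(raw, encoding):
    """Return True if the byte fragment is valid text by itself."""
    try:
        raw.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


print([hex(u) for u in units])             # the surrogate pair
print(decodes_alone(utf8[1:2], "utf-8"))   # 2nd byte of the 3-byte sequence
print(decodes_alone(utf16[:2], "utf-16-be"))  # high surrogate by itself
```

Both fragment checks fail to decode, while the complete sequences decode to exactly one character each.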

I’m fairly shocked.  I have always believed that JSON encodes what its
introduction (and section 2.5 "Strings") say it encodes: Unicode
characters.

If it is a requirement to accommodate the class of bug where languages that
use UTF-16 (Java, JavaScript, C#) can emit unpaired UTF-16 surrogates, the
spec needs to be clear on three points: the INTENT is to support Unicode
characters; unpaired surrogates are always evidence of a bug; and software
receiving such buggy data cannot be expected to do anything useful with it,
or even to avoid crashing in a hard-to-debug way down in the bowels of a
library routine.  -T
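
As a rough illustration of how such data fails downstream, the sketch below uses Python's standard json module. The function name unpaired_surrogate_positions is hypothetical, and the check relies on an assumption about that module: it combines a correctly escaped surrogate pair into one character while decoding, so any surrogate code point left in the result is unpaired. Other parsers may behave differently.

```python
import json


def unpaired_surrogate_positions(text):
    """Indexes of surrogate code points (U+D800..U+DFFF) in a decoded string.

    Assumes the string came from json.loads, which has already merged
    well-formed escaped pairs into single characters.
    """
    return [i for i, ch in enumerate(text) if 0xD800 <= ord(ch) <= 0xDFFF]


good = json.loads('"\\ud834\\udd1e"')  # properly paired escapes
bad = json.loads('"a\\ud800b"')        # lone high surrogate is accepted

print(unpaired_surrogate_positions(good))
print(unpaired_surrogate_positions(bad))

# The parse succeeded, but the failure surfaces later, for example when
# the string is written out as UTF-8.
try:
    bad.encode("utf-8")
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True
print(encode_failed)
```

The parser accepts the lone surrogate without complaint, and the error only appears at the later encoding step, which is the kind of hard-to-trace failure described above.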