Re: [Json] Unpaired surrogates in JSON strings

Paul Hoffman <paul.hoffman@vpnc.org> Thu, 06 June 2013 00:58 UTC

From: Paul Hoffman <paul.hoffman@vpnc.org>
In-Reply-To: <A723FC6ECC552A4D8C8249D9E07425A70FC2C12D@xmb-rcd-x10.cisco.com>
Date: Wed, 05 Jun 2013 17:58:33 -0700
Content-Transfer-Encoding: quoted-printable
Message-Id: <83728898-9A2D-4758-9C06-1157E2954CCB@vpnc.org>
References: <A723FC6ECC552A4D8C8249D9E07425A70FC2C12D@xmb-rcd-x10.cisco.com>
To: "json@ietf.org" <json@ietf.org>

<no hat>

The text says "character" everywhere. It says "code point" exactly once, when it is talking about a character's code point.

The ABNF seems to allow code points, not characters. This is pretty clearly a mistake.

This could be resolved by a warning similar to what Joe proposed, but it should be stronger. Also, Joe's wording assumes that these code points are always represented as escape sequences, but the ABNF allows them to be in a string as raw, unescaped code points.
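To illustrate the escaped-versus-raw distinction, here is a minimal sketch using Python's standard json module. The variable names are made up for this example, and the behavior shown is that of one particular parser, not something the spec requires.

```python
import json

# JSON text carrying the code point as a six-character escape sequence.
escaped_text = '"foo\\uD834bar"'

# JSON text carrying the same code point raw, with no escape at all.
# (A Python str can hold a lone surrogate; valid UTF-8 bytes cannot.)
raw_text = '"foo\ud834bar"'

# This parser accepts both forms and yields the same decoded string.
same_result = json.loads(escaped_text) == json.loads(raw_text)

# The fourth element of the decoded string is the lone code point U+D834.
lone_code_point = ord(json.loads(raw_text)[3])
```

A warning worded only in terms of escape sequences would cover escaped_text but say nothing about raw_text.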

On Jun 5, 2013, at 4:35 PM, Joe Hildebrand (jhildebr) <jhildebr@cisco.com> wrote:

> I don't mind allowing surrogate pairs in \u notation.  But I think we
> should specify what happens when you send just half of the pair.  For
> example, adding this after section 2.5, graph 4:
> 
> Escape sequences between \uD800 and \uDFFF SHOULD be generated only as
> valid UTF16 surrogate pairs (this SHOULD is only to allow backward
> compatibility).  When encountering an invalid surrogate pair (such as
> "foo\uD834bar" or "\uDD1E\uD834"), parsers MAY either throw an error
> (taking the risk of some backward incompatibility with old generators) or
> MAY ignore the sequence.

Alternate proposal:

Code points between U+D800 and U+DFFF SHOULD be generated only as
valid UTF-16 surrogate pairs; this SHOULD is only to allow backward
compatibility with applications that ignored the restriction that
strings consist of Unicode characters. A parser that encounters
an invalid surrogate pair (such as "foo\uD834bar" or "\uDD1E\uD834")
SHOULD throw an error because the string does not consist of characters;
it might ignore the errant code points, but at the risk of allowing
strings that other parsers would find illegal.
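As a rough sketch of the check a parser could run before deciding whether to throw, the following Python function reports where unpaired surrogates occur in an already-decoded string. The function name is hypothetical, and using Python's json module as the decoder is an assumption made only to keep the example short.

```python
import json

def unpaired_surrogates(text):
    """Return the indexes of surrogate code points in text that are not
    part of a high-then-low surrogate pair."""
    bad = []
    i = 0
    while i < len(text):
        cp = ord(text[i])
        if 0xD800 <= cp <= 0xDBFF:
            # High surrogate: valid only if a low surrogate follows.
            if i + 1 < len(text) and 0xDC00 <= ord(text[i + 1]) <= 0xDFFF:
                i += 2
                continue
            bad.append(i)
        elif 0xDC00 <= cp <= 0xDFFF:
            # Low surrogate with no preceding high surrogate.
            bad.append(i)
        i += 1
    return bad

# The two invalid examples from the proposed text, plus one valid pair.
lone_high = unpaired_surrogates(json.loads('"foo\\uD834bar"'))
reversed_pair = unpaired_surrogates(json.loads('"\\uDD1E\\uD834"'))
valid_pair = unpaired_surrogates(json.loads('"\\uD834\\uDD1E"'))
```

A parser following the SHOULD would raise an error whenever the returned list is non-empty; one that ignores the errant code points would drop or replace the characters at those indexes instead.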

--Paul Hoffman