Re: [Json] Proposed minimal change for strings

Nico Williams <> Wed, 03 July 2013 16:41 UTC

Date: Wed, 3 Jul 2013 11:41:45 -0500
Message-ID: <>
From: Nico Williams <>
To: Paul Hoffman <>
Content-Type: text/plain; charset=UTF-8
Cc: " WG" <>
Subject: Re: [Json] Proposed minimal change for strings
List-Id: "JavaScript Object Notation \(JSON\) WG mailing list" <>

On Wed, Jul 3, 2013 at 10:28 AM, Paul Hoffman <> wrote:
> On Jul 2, 2013, at 8:44 PM, Nico Williams <> wrote:
>> Huh?  Do you mean that any code unit may be allowed if escaped?
> That is exactly what the current document says, I believe. Do you see anything in the grammar that says differently?

No, but I couldn't understand what you wrote, which was:

| Proposal 1 (allow all code units in their unescaped form):

Surely you did not mean that all 16-bit code units could be sent
unescaped?  But that is how I read it.  I figured you'd made a typo.

>>> In section 2.2 (Strings):
>>>  Leave the production for "unescaped" as-is.
>>> In section 3 (Encoding):
>>>  Add "Some strings, notably those that have unescaped surrogate code units
>>>  (value 0xD800 to 0xDFFF), cannot be encoded in UTF-8."
>> Unescaped and *unpaired*.
> No, any surrogate code point. RFC 3629, the IETF's definition of UTF-8, says:
>    The definition of UTF-8 prohibits encoding character numbers between
>    U+D800 and U+DFFF, which are reserved for use with the UTF-16
>    encoding form (as surrogate pairs) and do not directly represent
>    characters.
> Similar language is used in The Unicode Standard's definition of UTF-8 (D92), and of UTF-32 (D90).

Yet there are ECMAScript applications that send these.  In their
escaped form there's no conflict with UTF-8 (nor with UTF-32, nor even
with UTF-16 if unpaired).  In their unescaped form there is most
definitely a conflict.
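
A minimal sketch of that distinction, using Python's standard json
module purely as an illustration (the variable names are made up for
this example): the escaped form of a lone surrogate is plain ASCII and
so encodes as UTF-8, while the unescaped form cannot be encoded at all.

```python
import json

# An unpaired high surrogate, such as an ECMAScript string may contain.
lone = "\ud800"

# Escaped form: the JSON text is pure ASCII, so encoding it as UTF-8 is fine.
escaped = json.dumps(lone)  # ensure_ascii=True is the default
escaped_bytes = escaped.encode("utf-8")

# Unescaped form: the surrogate itself would have to be encoded,
# which the definition of UTF-8 prohibits.
unescaped = json.dumps(lone, ensure_ascii=False)
try:
    unescaped.encode("utf-8")
    unescaped_ok = True
except UnicodeEncodeError:
    unescaped_ok = False
```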

Which brings up: what should a parser do when it sees escaped code
units?  Should it attempt to unescape them in the parsed result?  What
if the escaped code unit cannot be unescaped because it would result
in invalid UTF-8/16/32?  RFC 4627 says nothing about this.
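
To make the question concrete, here is what one existing parser
happens to do, sketched with Python's standard json module.  This is
only one possible behaviour, not something RFC 4627 requires, and
has_lone_surrogate is a hypothetical helper written for this example.

```python
import json

def has_lone_surrogate(s: str) -> bool:
    """Return True if s holds a surrogate code point and so cannot be UTF-8."""
    try:
        s.encode("utf-8")
        return False
    except UnicodeEncodeError:
        return True

# An escaped surrogate pair is unescaped and combined into one code point.
paired = json.loads('"\\ud83d\\ude00"')

# An escaped lone surrogate is also unescaped, leaving a string that
# has no valid UTF-8 (or UTF-16/32) encoding.
lone = json.loads('"\\ud800"')

# A stricter parser could use a check like this to reject the value,
# or to keep it in escaped form instead.
reject = has_lone_surrogate(lone)
```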

My proposal is to allow any code units, subject to two conditions:
unpaired surrogate code units must be escaped, and everything else may
appear unescaped as long as it is properly represented in the UTF of
the JSON document (i.e., if a pair of surrogates appears in UTF-16,
then re-encoding to UTF-8 must produce UTF-8, not CESU-8: the pair
must be decoded into a code point and then re-encoded in UTF-8).
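
The re-encoding rule can be sketched as follows, in Python and with
U+1F600 as an arbitrary example; combine_surrogates is a hypothetical
helper name.  Decoding the pair first yields four bytes of UTF-8,
whereas encoding each surrogate separately yields six bytes of CESU-8.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Decode a UTF-16 surrogate pair into a single code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

high, low = 0xD83D, 0xDE00  # UTF-16 code units for U+1F600

# Correct: decode the pair into a code point, then encode that as UTF-8.
code_point = combine_surrogates(high, low)
utf8 = chr(code_point).encode("utf-8")

# CESU-8, shown only for contrast: each surrogate is encoded on its own
# as a three-byte sequence.  "surrogatepass" makes Python permit this.
cesu8 = b"".join(chr(u).encode("utf-8", "surrogatepass") for u in (high, low))
```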