Re: [Json] Complete section 3 proposal

Nico Williams, Tue, 18 June 2013 22:17 UTC

Date: Tue, 18 Jun 2013 17:17:29 -0500
Message-ID: <>
From: Nico Williams <>
To: "Joe Hildebrand (jhildebr)" <>
Content-Type: text/plain; charset=UTF-8
Cc: "" <>
Subject: Re: [Json] Complete section 3 proposal

On Tue, Jun 18, 2013 at 5:05 PM, Joe Hildebrand (jhildebr) wrote:
> On 6/18/13 2:52 PM, "Nico Williams" wrote:
>>Note that if a JSON string in JSON data contains unescaped naked
>>surrogates then the encoding of that data will not be valid UTF-8,
>>UTF-16, nor, for that matter, CESU-8.  And some implementations
>>probably produce CESU-8-encoded data.
> I think the spirit of 4627 was that it literally be UTF-8, and that all of
> those other odd encodings are already non-conformant.  We could always add
> a note that says that there has been a history of encodings not being
> quite adequately specified, so old software may produce octet streams that
> this document doesn't describe.

We've been over this.  The spirit was that strings were of Unicode
characters, but really they're of code points.  And so on.  And
there's no consensus to break existing encoders.  And so I don't think
we can get consensus for that MUST without that note.
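
To illustrate the characters-vs-code-points distinction, here is a
minimal sketch in Python, chosen only because its strings are
sequences of code points.  The variable names are made up for this
example, and it says nothing about what any particular JSON encoder
actually does.

```python
# A string holding one unpaired (naked) high surrogate.  It is a legal
# sequence of code points, but it is not a Unicode character string.
lone = "\ud800"

# A strict UTF-8 encoder refuses it.
try:
    lone.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Forcing the surrogate through produces three octets that look like
# UTF-8 but are not valid UTF-8.
raw = lone.encode("utf-8", "surrogatepass")

# A strict UTF-8 decoder rejects those octets.
try:
    raw.decode("utf-8")
    decodes = True
except UnicodeDecodeError:
    decodes = False
```

An encoder that takes the "surrogatepass" route is the kind that ends
up emitting not-quite-UTF-8 octets.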

>>I'm not sure whether that's
>>worth stating here or elsewhere, but the fact that there's
>>not-quite-UTF-8 JSON out there means this SHALL is either
>>interop-breaking or the matter must be mentioned nearby.
> I agree it might be interop-breaking, but I don't think that's necessarily
> the spec's fault.  People will write bad software, particularly when they
> don't have test vectors easily at hand for them to probe what they
> originally thought were edge cases.

Well, if you can get consensus for it...

It's not that people write bad code.  It's that bad *data* exists, and
why should encoders look for it (naked surrogates in particular)?  I
think in practice not much can be done about this except to note the
problem and encourage encoders not to add to it (i.e., encoders should
escape naked surrogates).
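
As a rough sketch of what "escape naked surrogates" could look like,
assuming Python and a hypothetical helper named encode_json_string
(not taken from any existing library): every surrogate code point is
written as a \uXXXX escape, so the output octets stay valid UTF-8.
In a Python str any surrogate code point is by definition unpaired,
which is why the check is a simple range test.

```python
import json


def is_surrogate(ch: str) -> bool:
    """True if ch is a code point in the surrogate range U+D800..U+DFFF."""
    return 0xD800 <= ord(ch) <= 0xDFFF


def encode_json_string(text: str) -> bytes:
    """Encode text as a JSON string in UTF-8, escaping surrogates.

    Hypothetical helper for illustration only.
    """
    parts = []
    for ch in text:
        if is_surrogate(ch):
            # Write the surrogate as a JSON \uXXXX escape instead of
            # trying to emit it as raw octets.
            parts.append("\\u%04x" % ord(ch))
        else:
            # Let the standard library handle quotes, backslashes and
            # control characters; strip the surrounding quotes it adds.
            parts.append(json.dumps(ch, ensure_ascii=False)[1:-1])
    return ('"' + "".join(parts) + '"').encode("utf-8")
```

A decoder still sees the same code points after unescaping, so this
doesn't make the bad data go away; it only keeps the octet stream
itself valid UTF-8.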