Re: [Json] Encoding Schemes

Carsten Bormann, Tue, 18 June 2013 19:57 UTC

On Jun 18, 2013, at 21:34, Paul Hoffman wrote:

>> JSON the media type (application/json) is specifically limited to UTF-8 (and theoretically the two or possibly four other character encoding schemes listed in RFC 4627; the RFC isn't quite consistent here).
> Can you point to the text in the draft that supports that statement? I see the opposite:
>   Encoding considerations: 8bit if UTF-8; binary if UTF-16 or UTF-32
>     JSON may be represented using UTF-8, UTF-16, or UTF-32.  When JSON
>     is written in UTF-8, JSON is 8bit compatible.  When JSON is
>     written in UTF-16 or UTF-32, the binary content-transfer-encoding
>     must be used.

You listed two of the three places; the third is in section 3, which doesn't really list UTF-16 but UTF-16 "BE or LE" (same for UTF-32).  Note that UTF-16, UTF-16BE, and UTF-16LE are three different character encoding schemes (CESs), so it is not clear which ones are actually meant.

   Since the first two characters of a JSON text will always be ASCII
   characters [RFC0020], it is possible to determine whether an octet
   stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking
   at the pattern of nulls in the first four octets.

           00 00 00 xx  UTF-32BE
           00 xx 00 xx  UTF-16BE
           xx 00 00 00  UTF-32LE
           xx 00 xx 00  UTF-16LE
           xx xx xx xx  UTF-8
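
As an illustration only, the null-pattern table above can be sketched in Python roughly as follows. The function name detect_json_encoding, the "unknown" fallback, and the handling of inputs shorter than four octets are assumptions made for this example; RFC 4627 does not spell them out.

```python
def detect_json_encoding(octets: bytes) -> str:
    """Guess the encoding of a JSON text from the nulls in its first four octets.

    Hypothetical helper: the returned names are Python codec names.
    """
    if len(octets) < 4:
        # Assumption: two ASCII characters need at least four octets in
        # UTF-16 or UTF-32, so anything shorter is treated as UTF-8.
        return "utf-8"

    # True where the octet is 0x00, False where it is non-zero ("xx").
    nulls = tuple(b == 0 for b in octets[:4])

    patterns = {
        (True, True, True, False): "utf-32-be",    # 00 00 00 xx
        (True, False, True, False): "utf-16-be",   # 00 xx 00 xx
        (False, True, True, True): "utf-32-le",    # xx 00 00 00
        (False, True, False, True): "utf-16-le",   # xx 00 xx 00
        (False, False, False, False): "utf-8",     # xx xx xx xx
    }
    # Any other pattern is not covered by the table; report it as unknown.
    return patterns.get(nulls, "unknown")
```

For example, detect_json_encoding('{"a":1}'.encode("utf-16-le")) yields "utf-16-le", because the first four octets are 7B 00 22 00.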

[Obligatory Unicode bashing: Giving one of the three encoding schemes (CESs) for the encoding form (CEF) UTF-16 the same name, i.e., "UTF-16", must have been decided by a very cruel person.]

(Strictly speaking, the other mentions of UTF-16/-32 might be the encoding form, not the scheme, but the RFC simply isn't completely specified here.  I think the only reasonable way to read this has the same result as what Joe is proposing, but some text interpretation is required.  Clearly, the text in section 3 does not work with the BOM-based CESs.  But you might also read the text in section 6 as asking for the UTF-16 CES and the text in section 3 as then excluding BOMs, so the UTF-16 CES is implicitly big-endian.  So much effort for something so theoretical.)
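
To make the BOM point concrete, here is a small Python sketch (the helper name null_pattern is invented for this example) that prints the section 3 null pattern for the same JSON text with and without a leading byte order mark:

```python
import codecs


def null_pattern(octets: bytes) -> str:
    """Render the first four octets as '00' (null) or 'xx' (non-null)."""
    return " ".join("00" if b == 0 else "xx" for b in octets[:4])


# "[]" in big-endian UTF-16, without and with a BOM (FE FF) in front.
without_bom = "[]".encode("utf-16-be")
with_bom = codecs.BOM_UTF16_BE + without_bom

print(null_pattern(without_bom))  # 00 xx 00 xx
print(null_pattern(with_bom))     # xx xx 00 xx
```

The second pattern does not appear in the section 3 table at all, which is the sense in which that text cannot be applied to a BOM-prefixed stream.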

Grüße, Carsten