Re: [Json] Encoding Schemes

Carsten Bormann <cabo@tzi.org> Tue, 18 June 2013 19:57 UTC

On Jun 18, 2013, at 21:34, Paul Hoffman <paul.hoffman@vpnc.org> wrote:

>> JSON the media type (application/json) is specifically limited to UTF-8 (and theoretically the two or possibly four other character encoding schemes listed in RFC 4627; the RFC isn't quite consistent here).
> 
> Can you point to the text in the draft that supports that statement? I see the opposite:
>   Encoding considerations: 8bit if UTF-8; binary if UTF-16 or UTF-32
> 
>     JSON may be represented using UTF-8, UTF-16, or UTF-32.  When JSON
>     is written in UTF-8, JSON is 8bit compatible.  When JSON is
>     written in UTF-16 or UTF-32, the binary content-transfer-encoding
>     must be used.

You listed two of the three places; the third is in section 3, which doesn't really list UTF-16 but UTF-16 "BE or LE" (same for UTF-32).  Note that UTF-16, UTF-16BE, and UTF-16LE are three different character encoding schemes (CESs), so it is not clear which ones are actually meant.


   Since the first two characters of a JSON text will always be ASCII
   characters [RFC0020], it is possible to determine whether an octet
   stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking
   at the pattern of nulls in the first four octets.

           00 00 00 xx  UTF-32BE
           00 xx 00 xx  UTF-16BE
           xx 00 00 00  UTF-32LE
           xx 00 xx 00  UTF-16LE
           xx xx xx xx  UTF-8
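
For illustration only, the table above can be sketched in Python roughly as follows. The function name `detect_json_encoding` is made up for this example, and two readings are my assumptions rather than RFC text: "xx" is taken to mean "any non-null octet", and anything that matches none of the four null patterns (including input shorter than four octets) is treated as UTF-8.

```python
def detect_json_encoding(octets: bytes) -> str:
    """Guess the encoding from the null pattern of the first four octets.

    Hypothetical helper, not taken from RFC 4627 itself.
    """
    if len(octets) < 4:
        # Assumption: too short for the table to apply, so fall back to UTF-8.
        return "utf-8"

    # True where the octet is 00, False where it is "xx" (read as non-null).
    nulls = tuple(b == 0 for b in octets[:4])

    table = {
        (True, True, True, False): "utf-32-be",   # 00 00 00 xx
        (True, False, True, False): "utf-16-be",  # 00 xx 00 xx
        (False, True, True, True): "utf-32-le",   # xx 00 00 00
        (False, True, False, True): "utf-16-le",  # xx 00 xx 00
    }
    # Assumption: every other pattern is treated as UTF-8 (xx xx xx xx).
    return table.get(nulls, "utf-8")


if __name__ == "__main__":
    sample = '{"a":1}'
    for name in ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le"):
        print(name, "->", detect_json_encoding(sample.encode(name)))
```

The returned strings are Python codec names, so the result can be passed straight to `bytes.decode()`. Note that all four null patterns correspond to the BOM-less BE/LE schemes only.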


[Obligatory Unicode bashing: Giving one of the three encoding schemes (CESs) for the encoding form (CEF) UTF-16 the same name, i.e., "UTF-16", must have been decided by a very cruel person.]

(Strictly speaking, the other mentions of UTF-16/-32 might refer to the encoding form, not the scheme, but the RFC simply isn't completely specified here.  I think the only reasonable way to read this has the same result as what Joe is proposing, but some text interpretation is required.  Clearly, the text in section 3 does not work with the BOM-based CESs.  But you might also read the text in section 6 as asking for the UTF-16 CES and the text in section 3 as then excluding BOMs, so that the UTF-16 CES is implicitly big-endian.  So much effort for something so theoretical.)
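
To make the BOM point concrete, here is a small sketch using Python's standard codecs. It relies on the fact that Python's "utf-16" codec prepends a BOM in the platform's native byte order when encoding, while "utf-16-be" emits none. The helper `null_pattern` and the set `SECTION_3_PATTERNS` are names invented for this example, and "xx" is again read as "non-null octet", which is my interpretation.

```python
import codecs

# The four null patterns from the section 3 table (everything else is "xx xx xx xx").
SECTION_3_PATTERNS = {
    "00 00 00 xx",  # UTF-32BE
    "00 xx 00 xx",  # UTF-16BE
    "xx 00 00 00",  # UTF-32LE
    "xx 00 xx 00",  # UTF-16LE
}


def null_pattern(octets: bytes) -> str:
    """Render the first four octets as 00/xx, as in the section 3 table."""
    return " ".join("00" if b == 0 else "xx" for b in octets[:4])


text = '{"a":1}'

# UTF-16 CES as produced by Python: BOM first, native byte order.
with_bom = text.encode("utf-16")
# UTF-16BE CES: no BOM, big-endian.
big_endian = text.encode("utf-16-be")

if __name__ == "__main__":
    print("utf-16   :", null_pattern(with_bom))
    print("utf-16-be:", null_pattern(big_endian))
```

With the BOM in front, the first four octets start with two non-null octets, so they match none of the four null patterns; a detector following the table literally would end up treating such input as UTF-8.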

Grüße, Carsten