[Json] Encoding Schemes

"Joe Hildebrand (jhildebr)" <jhildebr@cisco.com> Tue, 18 June 2013 18:27 UTC

Background: the word "encoding" is somewhat imprecise.  From the Unicode
Glossary (http://www.unicode.org/glossary/):

Character Encoding Scheme. A character encoding form plus byte
serialization. There are seven character encoding schemes in Unicode:
UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, and UTF-32LE.

Unicode Encoding Form. A character encoding form that assigns each Unicode
scalar value to a unique code unit sequence. The Unicode Standard defines
three Unicode encoding forms: UTF-8, UTF-16, and UTF-32.

Unicode Encoding Scheme. A specified byte serialization for a Unicode
encoding form, including the specification of the handling of a byte order
mark (BOM), if allowed.
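To make the form/scheme distinction concrete, here is a minimal Python sketch that serializes the same two code points under the five byte-order-explicit schemes. The helper name `serialize` and the sample string are made up for illustration; the codec names are Python's standard ones. The code units are the same within each encoding form, and only the byte serialization differs between the BE and LE schemes.

```python
def serialize(text: str) -> dict:
    """Return the octets for `text` under each BOM-free encoding scheme."""
    # These Python codecs fix the byte order and never emit a BOM.
    schemes = ["utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le"]
    return {name: text.encode(name) for name in schemes}


if __name__ == "__main__":
    sample = "A\u00e9"  # hypothetical sample: U+0041, U+00E9
    for name, octets in serialize(sample).items():
        print(f"{name:10} {octets.hex(' ')}")
```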

In UTF-16 and UTF-32 (note: *not* BE or LE), you SHOULD include a Byte
Order Mark, but if it's not there, you assume BE.  From Unicode 6.2,
section 3.10, D98:

The BOM is not considered part of the content of the text.

The UTF-16 encoding scheme may or may not begin with a BOM. However,
when there is no BOM, and in the absence of a higher-level protocol, the
order of the UTF-16 encoding scheme is big-endian.

and from D101:

The BOM is not considered part of the content of the text.

The UTF-32 encoding scheme may or may not begin with a BOM. However,
when there is no BOM, and in the absence of a higher-level protocol, the
order of the UTF-32 encoding scheme is big-endian.
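As a rough sketch of the D98 rule for the plain UTF-16 scheme (the function name `decode_utf16_scheme` is hypothetical, and this assumes no higher-level protocol is in play): a leading BOM selects the byte order and is dropped from the content, and without one the octets are read as big-endian. The byte order is chosen explicitly here rather than left to Python's plain "utf-16" codec, which I would not assume defaults to big-endian when the BOM is missing.

```python
import codecs


def decode_utf16_scheme(data: bytes) -> str:
    """Decode octets labeled as the UTF-16 encoding scheme (not BE or LE)."""
    if data.startswith(codecs.BOM_UTF16_BE):
        # BOM present: big-endian; the BOM is not part of the content.
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):
        # BOM present: little-endian; the BOM is not part of the content.
        return data[2:].decode("utf-16-le")
    # No BOM and no higher-level protocol: big-endian, per D98.
    return data.decode("utf-16-be")
```

The UTF-32 case (D101) would follow the same shape with the four-octet BOMs.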

One could argue that the current table in section 3 of 4627 constitutes
such a higher-level protocol, but since we haven't specified either UTF-16
or UTF-32 in the table, nor have we specified what to do with BOMs, I
don't think that 4627 currently acts as that higher-level protocol.

As such, I believe that the phrase from section 3 of 4627 that says:

"JSON text SHALL be encoded in Unicode.  The default encoding is UTF-8."

needs a little help.  We probably want to include only the 5 encoding
schemes currently mentioned in the draft, not all 7.  Some suggested text:

"When serialized to an octet stream, JSON text SHALL be encoded in one of
the following Unicode encoding schemes: UTF-8, UTF-16BE, UTF-16LE,
UTF-32BE, and UTF-32LE.  The default and RECOMMENDED encoding is UTF-8.
Byte Order Marks MUST NOT be used to signal byte order in any encoding."

followed by the changed version of the table that we've discussed.
This text intentionally allows U+FEFF in strings (where it has been
allowed to date as zero width no-break space), but NOT as the first
code point in the stream, where it can't be used because U+FEFF isn't
in the ws production.
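A minimal sketch of how a parser might apply the suggested text. Assumptions: the detection uses the null-octet pattern from the existing table in section 3 of RFC 4627 (not the changed table, which isn't reproduced in this message), the names `detect_scheme` and `load_json_octets` are made up, and a leading UTF-8 BOM is rejected along with the UTF-16 and UTF-32 ones.

```python
import codecs
import json

_BOMS = (
    codecs.BOM_UTF8,
    codecs.BOM_UTF16_BE,
    codecs.BOM_UTF16_LE,
    codecs.BOM_UTF32_BE,
    codecs.BOM_UTF32_LE,
)

# Null-octet patterns of the first four octets (True means 0x00),
# following the table in RFC 4627 section 3.
_PATTERNS = {
    (True, True, True, False): "utf-32-be",
    (True, False, True, False): "utf-16-be",
    (False, True, True, True): "utf-32-le",
    (False, True, False, True): "utf-16-le",
}


def detect_scheme(octets: bytes) -> str:
    """Pick one of the five schemes; refuse a BOM at the start of the stream."""
    if any(octets.startswith(bom) for bom in _BOMS):
        raise ValueError("BOM not allowed at the start of JSON text")
    head = octets[:4]
    if len(head) == 4:
        pattern = tuple(b == 0 for b in head)
        return _PATTERNS.get(pattern, "utf-8")
    return "utf-8"  # too short for the table; assume the default


def load_json_octets(octets: bytes):
    """Decode with the detected scheme, then parse as JSON."""
    return json.loads(octets.decode(detect_scheme(octets)))
```

U+FEFF inside a string still round-trips, e.g. `load_json_octets('["\ufeff"]'.encode("utf-16-le"))`, while the same octets with a BOM prepended are refused.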

As well, if Unicode adds another encoding scheme one day, I don't want
4627bis to automatically require support for it unintentionally.

Joe Hildebrand