Re: [Json] BOMs

"Martin J. Dürst" <duerst@it.aoyama.ac.jp> Mon, 18 November 2013 11:27 UTC

Message-ID: <5289F974.9020709@it.aoyama.ac.jp>
Date: Mon, 18 Nov 2013 20:26:44 +0900
From: "Martin J. Dürst" <duerst@it.aoyama.ac.jp>
Organization: Aoyama Gakuin University
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.9) Gecko/20100722 Eudora/3.0.4
MIME-Version: 1.0
To: "Henry S. Thompson" <ht@inf.ed.ac.uk>
References: <AA45B3C6-1DC5-4B1E-8045-C9FE76022584@vpnc.org> <CEA92854.2CC53%jhildebr@cisco.com> <20131113224737.GI31823@mercury.ccil.org> <f5bob5n71y7.fsf@troutbeck.inf.ed.ac.uk> <5284B095.4070004@it.aoyama.ac.jp> <C37B2FE59C164DBCA982AC81A56A09AA@codalogic> <f5bk3g6ufqy.fsf@troutbeck.inf.ed.ac.uk>
In-Reply-To: <f5bk3g6ufqy.fsf@troutbeck.inf.ed.ac.uk>
Content-Type: text/plain; charset="UTF-8"; format="flowed"
Content-Transfer-Encoding: 7bit
Cc: John Cowan <cowan@mercury.ccil.org>, IETF Discussion <ietf@ietf.org>, Pete Cordell <petejson@codalogic.com>, JSON WG <json@ietf.org>, Anne van Kesteren <annevk@annevk.nl>, www-tag@w3.org, es-discuss <es-discuss@mozilla.org>
Subject: Re: [Json] BOMs

On 2013/11/18 20:11, Henry S. Thompson wrote:
> Pete Cordell writes:
>
>> Given the history below, would it be sensible to accept BOMs for UTF-8
>> encoding, but not for UTF-16 and UTF-32?  In other words, are BOMs needed
>> and/or used in the wild for UTF-16 and UTF-32?
>>
>> Maybe the text can say something like "SHOULD accept BOMs for UTF-8,
>> and MAY accept BOMs for UTF-16 and / or UTF-32"?
>
> My sense is that you'll see more UTF-16 BOMs than anything else.

Yes indeed. BOM means Byte Order Mark. It's crucial for over-the-wire
UTF-16. (It's irrelevant for in-memory UTF-16, but that's not what we
are discussing.) To bring up the XML example again, XML strictly
requires a BOM for UTF-16. The IETF definition of UTF-16, by contrast,
does not require one. See http://tools.ietf.org/html/rfc2781, in
particular http://tools.ietf.org/html/rfc2781#section-3.2,
http://tools.ietf.org/html/rfc2781#section-3.3, and
http://tools.ietf.org/html/rfc2781#section-4.

For UTF-8, the BOM is not a Byte Order Mark at all, because UTF-8 has no
byte order to mark. It may serve as an encoding signature, but it is not
required, and in some circumstances it is counterproductive.
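
To make the distinction concrete, here is a minimal sketch in Python of how
a receiver might sniff a leading BOM before decoding. The names sniff_bom
and decode_with_bom are made up for illustration, and falling back to UTF-8
when no BOM is present is an assumption of this sketch, not something any
spec or parser mentioned in this thread is known to do.

```python
import codecs

# Longest marks first: the UTF-32 LE BOM (FF FE 00 00) starts with the
# UTF-16 LE BOM (FF FE), so UTF-32 must be tested before UTF-16.
# Caveat: UTF-16 LE data whose first character is U+0000 would be
# misread as UTF-32 LE; a JSON text cannot start with U+0000.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),        # a signature only, no byte order
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if there is no BOM."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return encoding, len(bom)
    return None, 0


def decode_with_bom(data: bytes, default: str = "utf-8") -> str:
    """Strip a leading BOM, if any, and decode the rest.

    Without a BOM the encoding is assumed to be `default`
    (an assumption of this sketch).
    """
    encoding, bom_length = sniff_bom(data)
    return data[bom_length:].decode(encoding or default)
```

For UTF-16 the sniffed mark actually selects the byte order used for
decoding; for UTF-8 it is simply skipped.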

As for what to say about accepting BOMs, I'd really want to know first
what the various existing parsers do. If they accept BOMs, then we can
say that parsers should accept them. If they don't, then we should say
that they don't.
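
Surveying parsers could be done with a small probe along the following
lines. This is only a sketch in Python: the helper names accepts and survey
are hypothetical, the sample document is arbitrary, and the probe assumes
that the parser under test signals rejection by raising ValueError (which
covers json.JSONDecodeError and UnicodeDecodeError in Python's standard
library). A parser that reports errors differently would need its own
wrapper.

```python
import codecs
import json


def accepts(parse, payload) -> bool:
    """Return True if parse(payload) succeeds, False if it raises ValueError."""
    try:
        parse(payload)
        return True
    except ValueError:
        return False


def survey(parse, text: str = '{"a": 1}') -> dict:
    """Feed one JSON text to `parse` with and without BOMs; report acceptance."""
    cases = {
        "no BOM, decoded str": text,
        "decoded str, leading U+FEFF": "\ufeff" + text,
        "utf-8 bytes, BOM": codecs.BOM_UTF8 + text.encode("utf-8"),
        "utf-16-le bytes, BOM": codecs.BOM_UTF16_LE + text.encode("utf-16-le"),
        "utf-16-be bytes, BOM": codecs.BOM_UTF16_BE + text.encode("utf-16-be"),
    }
    return {name: accepts(parse, payload) for name, payload in cases.items()}


if __name__ == "__main__":
    # Probe the standard-library parser of whatever Python runs this.
    for name, ok in survey(json.loads).items():
        print(f"{name}: {'accepted' if ok else 'rejected'}")
```

The outcome depends on the parser and its version, so no results are
listed here; the point is only that the question can be answered by
measurement rather than by guessing.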

Regards,   Martin.

> UTF-32 support seems to be waning (at least in the browsers), but
> UTF-16 is in pretty widespread use.  John, do you think you can fool
> google into counting BOMs for us?