Re: [apps-discuss] draft-ietf-appsawg-xml-mediatypes vs. JSON and BOM and UTF-8

"Martin J. Dürst" <duerst@it.aoyama.ac.jp> Thu, 09 January 2014 01:55 UTC

Message-ID: <52CE018D.4020303@it.aoyama.ac.jp>
Date: Thu, 09 Jan 2014 10:55:25 +0900
From: "Martin J. Dürst" <duerst@it.aoyama.ac.jp>
Organization: Aoyama Gakuin University
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.9) Gecko/20100722 Eudora/3.0.4
MIME-Version: 1.0
To: Larry Masinter <masinter@adobe.com>
References: <dc29826a2bbf48088abe51bb5de22e0d@BL2PR02MB307.namprd02.prod.outlook.com> <f5b38kyhjyt.fsf@troutbeck.inf.ed.ac.uk> <9711cfa14ec04c29bc5eadc2bca83c15@BL2PR02MB307.namprd02.prod.outlook.com>
In-Reply-To: <9711cfa14ec04c29bc5eadc2bca83c15@BL2PR02MB307.namprd02.prod.outlook.com>
Content-Type: text/plain; charset="UTF-8"; format="flowed"
Content-Transfer-Encoding: 7bit
Cc: "apps-discuss@ietf.org" <apps-discuss@ietf.org>
Subject: Re: [apps-discuss] draft-ietf-appsawg-xml-mediatypes vs. JSON and BOM and UTF-8

On 2014/01/09 10:44, Larry Masinter wrote:
> I said
>> I don't really understand what a "XML-unaware MIME producer" is. If
>> they're "unaware", why are they reading this spec?
>
> But I think I'd missed this the first pass, I think I just don't like the -aware terminology here.
> Please don't clarify.
>
> There are three ways content can advertise its encoding:
>    Charset parameter to media type
>    Internal charset declaration
>    Initial BOM (indicates utf16be, utf16le, utf-32, utf-8)
>
> I think things get simpler if you treat the non-aware case as just one of the exceptions to the "SHOULD" for transmitters.
> That is, rather than dividing the world into "aware" and "not-aware", just make two categories:
>
> * Those that can send UTF-8 without a BOM and with an internal charset declaration of UTF-8.
> * Those that can't do so.
>
> "producers of XML SHOULD produce UTF-8, with no BOM, with no charset parameter, with an internal UTF-8 charset declaration", we make this the default; it is a new requirement. Those that can't are in some situation of deprecated legacy implementation or limited capabilities. They can't transcode into UTF-8, can't add or remove a BOM, or can't add or modify an internal charset declaration. Do the best you can, but try to make sure there's a charset parameter and that it matches the actual encoding or an internal charset declaration (because then processors will fail).

Why would the default here be *with* an internal UTF-8 declaration? 
Documents without such a declaration are also well-formed/valid UTF-8 
XML documents (assuming that they are well-formed/valid otherwise).

Moving to a UTF-8-only world, not having to declare UTF-8 anywhere 
should be part of that world. It is not possible for HTML, unfortunately, but it 
is perfectly possible for XML.
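
As a small illustration of the point about declaration-free documents, here is a minimal sketch in Python. It assumes the standard library's xml.etree.ElementTree parser as one example of an XML processor; the variable names and the sample element are made up for illustration. It parses the same UTF-8 bytes once with no XML declaration and once with an internal UTF-8 declaration, and both should yield the same text.

```python
import xml.etree.ElementTree as ET

# UTF-8 bytes with no BOM and no XML declaration at all (hypothetical sample).
no_decl = "<name>D\u00fcrst</name>".encode("utf-8")

# The same content with an explicit internal UTF-8 encoding declaration.
with_decl = b'<?xml version="1.0" encoding="UTF-8"?>' + no_decl

# Parse both byte strings; with no declaration and no external charset
# information, an XML processor treats the input as UTF-8.
texts = [ET.fromstring(doc).text for doc in (no_decl, with_decl)]
print(texts)
```

This only shows the behaviour of one parser on one small input; it says nothing about what a charset parameter on the media type would do.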

Regards,   Martin.

> Of course, there are more constraints on processors, because they need to deal with legacy, but this will give clear advice to senders.