Re: [apps-discuss] draft-ietf-appsawg-xml-mediatypes vs. JSON and BOM and UTF-8

Larry Masinter <masinter@adobe.com> Thu, 09 January 2014 01:44 UTC

From: Larry Masinter <masinter@adobe.com>
To: "Henry S. Thompson" <ht@inf.ed.ac.uk>
Date: Thu, 09 Jan 2014 01:44:08 +0000
Message-ID: <9711cfa14ec04c29bc5eadc2bca83c15@BL2PR02MB307.namprd02.prod.outlook.com>
References: <dc29826a2bbf48088abe51bb5de22e0d@BL2PR02MB307.namprd02.prod.outlook.com> <f5b38kyhjyt.fsf@troutbeck.inf.ed.ac.uk>
In-Reply-To: <f5b38kyhjyt.fsf@troutbeck.inf.ed.ac.uk>
Cc: "apps-discuss@ietf.org" <apps-discuss@ietf.org>
Subject: Re: [apps-discuss] draft-ietf-appsawg-xml-mediatypes vs. JSON and BOM and UTF-8

I said
> I don't really understand what a "XML-unaware MIME producer" is. If
> they're "unaware", why are they reading this spec?

But I think I'd missed this on the first pass; I think I just don't like the "-aware" terminology here.
Please don't clarify.

There are three ways content can advertise its encoding:
* Charset parameter to media type
* Internal charset declaration
* Initial BOM (indicates UTF-16BE, UTF-16LE, UTF-32, or UTF-8)
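For illustration only, here is a rough Python sketch of reading those three signals from a message. The function name encoding_signals and the returned dict keys are made up for this example, and the internal-declaration check is a simplification that only works when the bytes after any BOM are ASCII-compatible (so it will not find a declaration in UTF-16 or UTF-32 content).

```python
import codecs
import re

# BOM byte sequences; UTF-32LE is listed before UTF-16LE because the
# UTF-32LE BOM begins with the same two bytes as the UTF-16LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32be"),
    (codecs.BOM_UTF32_LE, "utf-32le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16be"),
    (codecs.BOM_UTF16_LE, "utf-16le"),
]

# Simplified match for encoding="..." inside a leading XML declaration.
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def encoding_signals(content_type, body):
    """Return what each of the three mechanisms advertises (None if absent)."""
    # 1. charset parameter on the media type
    charset = None
    for part in content_type.split(";")[1:]:
        name, _, value = part.strip().partition("=")
        if name.strip().lower() == "charset":
            charset = value.strip().strip('"').lower()

    # 2. initial BOM
    bom = None
    rest = body
    for mark, label in _BOMS:
        if body.startswith(mark):
            bom, rest = label, body[len(mark):]
            break

    # 3. internal charset declaration (ASCII-compatible content only)
    match = _DECL.match(rest)
    declared = match.group(1).decode("ascii").lower() if match else None

    return {"charset_param": charset, "internal_decl": declared, "bom": bom}
```

A sender or a checker could compare the three values and flag any disagreement among those that are present.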

I think things get simpler if you treat the non-aware case as just one of the exceptions to the "SHOULD" for transmitters.
That is, rather than dividing the world into "aware" and "not-aware", just make two categories:

* Those that can send UTF-8 without a BOM and with an internal charset declaration of UTF-8.
* Those that can't do so.

We make "producers of XML SHOULD produce UTF-8, with no BOM, with no charset parameter, with an internal UTF-8 charset declaration" the default; it is a new requirement. Those that can't are in some situation of deprecated legacy implementation or limited capabilities: they can't transcode into UTF-8, can't add or remove a BOM, or can't add or modify an internal charset declaration. For them the advice is: do the best you can, but try to make sure there's a charset parameter and that it matches the actual encoding or an internal charset declaration (because otherwise processors will fail).
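As a rough sketch of what the proposed default could look like for a sender that is able to comply, the Python below takes already-decoded XML text and emits it as UTF-8 with no BOM, an internal UTF-8 declaration, and a media type with no charset parameter. The function name to_recommended_form is hypothetical, "application/xml" is just a placeholder media type, and the code assumes XML 1.0 and discards any other pseudo-attributes (such as standalone) from an existing declaration.

```python
import re


def to_recommended_form(text):
    """Return (content_type, body_bytes) in the proposed default form."""
    # Drop a leading U+FEFF so that no BOM ends up in the output.
    text = text.lstrip("\ufeff")
    # Remove any existing XML declaration; a fresh one is written below.
    # Simplification: version and standalone from the old one are not kept.
    text = re.sub(r'^<\?xml\s[^>]*\?>\s*', '', text, count=1)
    body = '<?xml version="1.0" encoding="UTF-8"?>\n' + text
    # No charset parameter on the media type; the encoding is carried
    # by the internal declaration alone.
    return "application/xml", body.encode("utf-8")
```

A sender that cannot do this (cannot transcode, or cannot touch the BOM or the declaration) falls into the second category and would rely on a correct charset parameter instead.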

Of course, there are more constraints on processors, because they need to deal with legacy, but this will give clear advice to senders.