[apps-discuss] Thoughts on text/* encoding defaults

Julian Reschke <julian.reschke@gmx.de> Mon, 06 June 2011 12:42 UTC

Hi there.

In Prague, we had a few hallway conversations with respect to the 
default encoding of text/* media types.

Below are my notes (references to the relevant spec sections, 
information about a recent change in HTTPbis, and a rough proposal about 
how to proceed).

Best regards, Julian

-- snip --


1) RFC 2046 says that the default is US-ASCII

"Note that the character set used, if anything other than US-ASCII, 
must always be explicitly specified in the Content-Type field." -- 
<http://greenbytes.de/tech/webdav/rfc2046.html#rfc.section.4.1.2.p.18>

2) RFC 2616 says it's ISO-8859-1

"The "charset" parameter is used with some media types to define the 
character set (Section 3.4) of the data. When no explicit charset 
parameter is provided by the sender, media subtypes of the "text" type 
are defined to have a default charset value of "ISO-8859-1" when 
received via HTTP. Data in character sets other than "ISO-8859-1" or its 
subsets MUST be labeled with an appropriate charset value. See Section 
3.4.1 for compatibility problems." -- 
<http://greenbytes.de/tech/webdav/rfc2616.html#rfc.section.3.7.1.p.4>

3) For text/xml, RFC 3023 says it's US-ASCII, no matter what 2616 says :-)

"Conformant with [RFC2046], if a text/xml entity is received with the 
charset parameter omitted, MIME processors and XML processors MUST use 
the default charset value of "us-ascii"[ASCII].  In cases where the XML 
MIME entity is transmitted via HTTP, the default charset value is still 
"us-ascii".  (Note: There is an inconsistency between this specification 
and HTTP/1.1, which uses ISO-8859-1[ISO8859] as the default for a 
historical reason.  Since XML is a new format, a new default should be 
chosen for better I18N.  US-ASCII was chosen, since it is the 
intersection of UTF-8 and ISO-8859-1 and since it is already used by 
MIME.)" -- <http://tools.ietf.org/html/rfc3023#section-3.1>
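
To make the conflict between the three texts concrete, here is a rough 
sketch of the lookup a strictly conforming recipient would have to 
perform when the charset parameter is absent. The function name and its 
two arguments are made up for illustration; only the returned defaults 
come from the passages quoted above.

```python
from typing import Optional


def spec_default_charset(media_type: str, via_http: bool) -> Optional[str]:
    """Default charset for a text/* type sent without a charset parameter.

    Hypothetical helper: it encodes only the three passages quoted above,
    not what deployed recipients actually do.
    """
    mt = media_type.strip().lower()
    if not mt.startswith("text/"):
        return None  # the quoted rules only cover text/* types
    if mt == "text/xml":
        # RFC 3023, section 3.1: us-ascii, even when transmitted via HTTP
        return "us-ascii"
    if via_http:
        # RFC 2616, section 3.7.1: ISO-8859-1 when received via HTTP
        return "iso-8859-1"
    # RFC 2046, section 4.1.2: US-ASCII unless labeled otherwise
    return "us-ascii"
```

With this reading, text/plain gets a different default depending on the 
transport, while text/xml ignores the transport entirely.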

The problem

Recipients do not implement these defaults; they take the absence of 
encoding information as an indicator to inspect the payload. This is at 
least true for text/xml and text/html (see 
<http://www.w3.org/TR/REC-xml/#sec-guessing> and 
<http://www.w3.org/TR/2011/WD-html5-20110405/parsing.html#determining-the-character-encoding>).
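
As a rough illustration of what "inspect the payload" means in practice, 
the sketch below looks for a byte order mark and then for the encoding 
pseudo-attribute of an XML declaration. It is a heavily simplified 
stand-in, not the algorithm from either of the two documents linked 
above (it ignores non-ASCII-compatible encodings without a BOM, HTML 
meta elements, and any heuristics); the function name is hypothetical.

```python
import codecs
import re
from typing import Optional

# Matches an XML declaration at the very start of the payload and captures
# the value of its encoding pseudo-attribute (ASCII-compatible input only).
_XML_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)

# UTF-32 BOMs are listed before UTF-16 because the UTF-32 LE BOM begins
# with the same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_encoding(payload: bytes) -> Optional[str]:
    """Guess the encoding from inline information, or return None."""
    for bom, name in _BOMS:
        if payload.startswith(bom):
            return name
    match = _XML_DECL.match(payload)
    if match:
        return match.group(1).decode("ascii").lower()
    return None  # nothing inline; the caller has to fall back to a default
```

A recipient that behaves this way never reaches the US-ASCII or 
ISO-8859-1 default whenever the payload labels itself.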

Current development: HTTPbis P3 has dropped the default and now 
delegates to the relevant media type definitions (see 
<http://trac.tools.ietf.org/wg/httpbis/trac/ticket/20>, 
<http://greenbytes.de/tech/webdav/draft-ietf-httpbis-p3-payload-14.html>).

Left to do:

a) Revise RFC 2046: allow text/* types that carry encoding information 
inline to do the expected thing (overriding the US-ASCII default), and 
warn against doing so in new registrations (recommend supporting only 
UTF-8 and requiring that the charset parameter always be included 
explicitly, as text/vcard is going to do?)

b) Revise RFC 3023 to delegate the text/xml charset defaults to the 
revision of RFC 2046?

Best regards, Julian