Character encodings in headers [i74][was: Straw-man charter for http-bis]

Mark Nottingham <mnot@mnot.net> Mon, 20 August 2007 03:40 UTC

In-Reply-To: <6.0.0.20.2.20070610165356.0a69cec0@localhost>
References: <BA772834-227A-4C1B-9534-070C50DF05B3@mnot.net> <392C98BA-E7B8-44ED-964B-82FC48162924@mnot.net> <p06240843c2833f4d7f2f@[10.20.30.108]> <465D9142.9050506@gmx.de> <6.0.0.20.2.20070610165356.0a69cec0@localhost>
Mime-Version: 1.0 (Apple Message framework v752.3)
Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed
Message-Id: <088FB13E-F12F-4BE7-94FB-78B21C51512E@mnot.net>
Content-Transfer-Encoding: 7bit
From: Mark Nottingham <mnot@mnot.net>
Subject: Character encodings in headers [i74][was: Straw-man charter for http-bis]
Date: Mon, 20 Aug 2007 13:40:36 +1000
To: Martin Duerst <duerst@it.aoyama.ac.jp>
Cc: Richard Ishida <ishida@w3.org>, Apps Discuss <discuss@apps.ietf.org>, Felix Sasaki <fsasaki@w3.org>, "ietf-http-wg@w3.org Group" <ietf-http-wg@w3.org>, Paul Hoffman <phoffman@imc.org>

On 10/06/2007, at 6:05 PM, Martin Duerst wrote:
> - RFC 2616 prescribes that headers containing non-ASCII have to use
>   either iso-8859-1 or RFC 2047. This is unnecessarily complex and
>   not necessarily followed. At the least, new extensions should be
>   allowed to specify that UTF-8 is used.

My .02:

I'm concerned about allowing UTF-8; it may break existing  
implementations.

I'd like to see the text just require that the actual character set
be ISO-8859-1, but allow individual extensions to nominate encodings
*like* RFC 2047, without being restricted to it. For example, the
encoding specified in RFC 3987 is appropriate for URIs. However, it
*has* to be explicit; I've heard of people reading this requirement
and thinking that they need to check *every* header for 2047 encoding.
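
To make the candidate encodings concrete, here is a minimal Python
sketch of how one non-ASCII value looks as raw ISO-8859-1, as an RFC
2047 encoded-word, and as UTF-8 percent-encoding in the style of the
RFC 3987 IRI-to-URI mapping. The helper names are made up for
illustration, and the percent-encoding helper only approximates the
mapping for a path; it is not a full IRI implementation.

```python
from email.header import Header, decode_header
from urllib.parse import quote


def encode_rfc2047(text: str) -> str:
    """Wrap text in an RFC 2047 encoded-word (UTF-8 chosen as an assumption)."""
    return Header(text, charset="utf-8").encode()


def decode_rfc2047(field_value: str) -> str:
    """Decode any RFC 2047 encoded-words found in a field value."""
    parts = []
    for chunk, charset in decode_header(field_value):
        if isinstance(chunk, bytes):
            parts.append(chunk.decode(charset or "ascii"))
        else:
            # decode_header returns str when no encoded-word is present
            parts.append(chunk)
    return "".join(parts)


def iri_path_to_uri_path(path: str) -> str:
    """Percent-encode non-ASCII characters as UTF-8 octets, keeping '/'."""
    return quote(path, safe="/")


value = "caf\u00e9"
print(value.encode("iso-8859-1"))          # one octet per character
print(encode_rfc2047(value))               # ASCII-only encoded-word
print(iri_path_to_uri_path("/" + value))   # ASCII-only percent-encoding
```

Only the first form puts non-ASCII octets on the wire; the other two
stay within ASCII, which is why a header has to say which one it uses.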

So, I think this means:

1) Change
   "Words of *TEXT MAY contain characters from character sets other  
than ISO-8859-1 [22] only when encoded according to the rules of RFC  
2047 [14]."
to
   "Words of *TEXT MUST NOT contain characters from character sets
other than ISO-8859-1 [22]."
and,

2) Identify headers that may have non-8859-1 content and explicitly
say how to encode them (IRI, RFC 2047, whatever; the existing ones
will have to be RFC 2047, I believe), modifying their BNF to suit.

3) When we document extensibility, require new headers to nominate  
any encoding explicitly.
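
From a recipient's point of view, the three steps above could look
roughly like the following Python sketch. It assumes a per-header
table of explicitly nominated encodings; the header names
"X-Example-Title" and "X-Example-Link" and all function names are
hypothetical and do not come from any specification. Anything not in
the table is treated as plain ISO-8859-1 and is never scanned for
encoded-words.

```python
from email.header import decode_header
from urllib.parse import unquote


def _latin1(raw: bytes) -> str:
    # Default rule: field content is ISO-8859-1, nothing more.
    return raw.decode("iso-8859-1")


def _rfc2047(raw: bytes) -> str:
    # Used only for headers that explicitly nominate RFC 2047.
    out = []
    for chunk, charset in decode_header(raw.decode("ascii")):
        if isinstance(chunk, bytes):
            out.append(chunk.decode(charset or "ascii"))
        else:
            out.append(chunk)
    return "".join(out)


def _percent_utf8(raw: bytes) -> str:
    # Used only for headers that nominate UTF-8 percent-encoding.
    return unquote(raw.decode("ascii"), encoding="utf-8")


# Hypothetical registry: each header names its encoding explicitly.
DECODERS = {
    "x-example-title": _rfc2047,
    "x-example-link": _percent_utf8,
}


def decode_field(name: str, raw: bytes) -> str:
    """Decode a field value using the encoding its header nominated."""
    return DECODERS.get(name.lower(), _latin1)(raw)


print(decode_field("X-Example-Title", b"=?utf-8?b?Y2Fmw6k=?="))
print(decode_field("X-Unlisted", b"=?utf-8?b?Y2Fmw6k=?="))
```

The second call leaves the encoded-word untouched, since an unlisted
header has nominated nothing; that is the "explicit" part.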

--
Mark Nottingham     http://www.mnot.net/