Re: Character encodings in headers [i74][was: Straw-man charter for http-bis]

John C Klensin <john-ietf@jck.com> Mon, 20 August 2007 07:22 UTC

Date: Mon, 20 Aug 2007 03:22:06 -0400
From: John C Klensin <john-ietf@jck.com>
To: Mark Nottingham <mnot@mnot.net>, Martin Duerst <duerst@it.aoyama.ac.jp>
Subject: Re: Character encodings in headers [i74][was: Straw-man charter for http-bis]
Message-ID: <157F4F253535B9C73F8EDC75@p3.JCK.COM>
In-Reply-To: <088FB13E-F12F-4BE7-94FB-78B21C51512E@mnot.net>
References: <BA772834-227A-4C1B-9534-070C50DF05B3@mnot.net> <392C98BA-E7B8-44ED-964B-82FC48162924@mnot.net> <p06240843c2833f4d7f2f@[10.20.30.108]> <465D9142.9050506@gmx.de> <6.0.0.20.2.20070610165356.0a69cec0@localhost> <088FB13E-F12F-4BE7-94FB-78B21C51512E@mnot.net>
Cc: Paul Hoffman <phoffman@imc.org>, Apps Discuss <discuss@apps.ietf.org>, Felix Sasaki <fsasaki@w3.org>, "ietf-http-wg@w3.org Group" <ietf-http-wg@w3.org>, Richard Ishida <ishida@w3.org>


--On Monday, 20 August, 2007 13:40 +1000 Mark Nottingham
<mnot@mnot.net> wrote:

> On 10/06/2007, at 6:05 PM, Martin Duerst wrote:
>> - RFC 2616 prescribes that headers containing non-ASCII have
>>   to use either iso-8859-1 or RFC 2047. This is unnecessarily
>>   complex and not necessarily followed. At the least, new
>>   extensions should be allowed to specify that UTF-8 is used.
> 
> My .02;
> 
> I'm concerned about allowing UTF-8; it may break existing
> implementations.

And whatever is done about it should be consistent with the EAI
work.  Otherwise, we are likely to find ourselves in big trouble
going down the line.

> I'd like to see the text just require that the actual
> character set be 8859-1, but to allow individual extensions to
> nominate encodings *like* 2047, without being restricted to it.
> For example, the encoding specified in 3987 is appropriate for
> URIs. However, it *has* to be explicit; I've heard some people
> read this requirement and think that they need to check
> *every* header for 2047 encoding.

Sigh.  My own sense is that, going forward, we need to lose
8859-N, not make it the default (or only) character set for more
protocols.  It is, to put it mildly, a little Euro-centric (and
not even completely suitable for Europe).  Much of the advantage
of Unicode is that one does not need to designate/nominate a
particular CCS or encoding and then maintain state for it... and
that is a fairly large advantage.  See also
draft-klensin-unicode-escapes-03.txt (probably expired, but you
should be able to find a copy somewhere -- I'll get back to it
sometime soon) for a discussion of issues in ASCII encoding of
multioctet character sets.  The IRI spec may constrain things
to encoding of octets, but that doesn't make it a good idea.
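
A minimal sketch of the coverage point, assuming Python's standard
codecs.  The sample strings and the helper name fits_latin1 are made
up for illustration and come from no real header.  It also shows the
breakage risk raised above: UTF-8 octets handed to a receiver that
assumes 8859-1 are misread silently rather than rejected.

```python
# Hypothetical sample values, chosen only to exercise the codecs.
samples = ["café", "Ελλάδα", "日本語"]


def fits_latin1(text: str) -> bool:
    """Return True if every character of text exists in ISO-8859-1."""
    try:
        text.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False


# Which samples can be carried at all if 8859-1 is the only charset?
coverage = {s: fits_latin1(s) for s in samples}

# UTF-8 can carry all three, but a receiver that assumes ISO-8859-1
# decodes the octets without error and produces the wrong characters.
utf8_bytes = "café".encode("utf-8")
misread = utf8_bytes.decode("iso-8859-1")

for s, ok in coverage.items():
    print(f"{s!r}: representable in ISO-8859-1 = {ok}")
print(f"UTF-8 octets read as ISO-8859-1: {misread!r}")
```

Only the first sample survives the 8859-1 restriction; the Greek and
Japanese ones cannot be encoded at all.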

If we are going to consider changes in this area, let's make
them improvements.  Locking in 8859-1 is not an improvement: it
would, IMO, be better to deprecate its use and require explicit
charset designation always if that is the only choice.
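
One existing form of explicit charset designation is the RFC 2047
encoded-word mentioned in the quoted text.  The sketch below uses
Python's email.header module, which is written for mail rather than
HTTP, purely to show the shape of such a value; the header value is a
made-up example.

```python
from email.header import Header, decode_header

# Hypothetical header value that ISO-8859-1 cannot represent.
value = "日本語"

# An encoded-word has the form =?charset?encoding?encoded-text?=,
# so the charset travels with the value instead of being assumed.
encoded = Header(value, charset="utf-8").encode()

# The receiver recovers both the raw octets and the declared charset.
raw, charset = decode_header(encoded)[0]
decoded = raw.decode(charset)

print(encoded)
print(charset, decoded)
```

The receiver needs no per-connection state here: the charset label is
part of the value itself.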

     john