Re: Character encodings in headers [i74][was: Straw-man charter for http-bis]

Keith Moore <moore@cs.utk.edu> Mon, 20 August 2007 19:30 UTC

Message-ID: <46C9EBAA.8030608@cs.utk.edu>
Date: Mon, 20 Aug 2007 15:29:46 -0400
From: Keith Moore <moore@cs.utk.edu>
User-Agent: Thunderbird 2.0.0.6 (Macintosh/20070728)
MIME-Version: 1.0
To: der Mouse <mouse@Rodents.Montreal.QC.CA>
Subject: Re: Character encodings in headers [i74][was: Straw-man charter for http-bis]
References: <BA772834-227A-4C1B-9534-070C50DF05B3@mnot.net> <392C98BA-E7B8-44ED-964B-82FC48162924@mnot.net> <p06240843c2833f4d7f2f@[10.20.30.108]> <465D9142.9050506@gmx.de> <6.0.0.20.2.20070610165356.0a69cec0@localhost> <088FB13E-F12F-4BE7-94FB-78B21C51512E@mnot.net> <157F4F253535B9C73F8EDC75@p3.JCK.COM> <6B8E3D7A-71B8-4B8D-9625-2AB3C74A9072@mnot.net> <6.0.0.20.2.20070820181338.07260770@localhost> <200708201436.KAA27449@Sparkle.Rodents.Montreal.QC.CA>
In-Reply-To: <200708201436.KAA27449@Sparkle.Rodents.Montreal.QC.CA>
X-Enigmail-Version: 0.95.2
OpenPGP: id=E1473978
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Cc: discuss@apps.ietf.org, Felix Sasaki <fsasaki@w3.org>, ietf-http-wg@w3.org, Richard Ishida <ishida@w3.org>

der Mouse wrote:
>> I think you present a valid scenario.  However, storing headers as
>> iso-8859-1 essentially means storing (and resending) them as bytes.
>>     
>
> Depends on how much checking is done.  The C0 and C1 ranges are not
> valid 8859-x text (except for a few codes in C0, like HT), but, as
> Clive points out, C1 does, in general, occur in UTF-8-encoded text.
>
> I recognize there's a "who would bother to check" tendency.  While I
> share it, I also believe the number of distinct implementations out
> there is large enough that anything permitted by the spec has probably
> been done (and, of course, a great many things not permitted by the
> spec, but I see no reason to care about compatibility with them).  In
> particular, any implementation whose native text encoding is not 8859-1
> may be recoding headers into its native encoding for storage and back
> again on output, and that is almost certain to corrupt C1 octets.

I suspect the problem is not so much transparency as presentation.  The
larger set of things broken by allowing UTF-8 in existing header fields
(and, to a lesser extent, in new fields) will not be things that forbid
C1 octet values, but rather things that try to display those fields as
if they were ISO 8859-1.  Translating the presumed 8859-1 into other
charsets is another version of the same problem.
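
To make the distinction concrete, here is a rough Python sketch.  The
header value is made up for illustration, and ASCII with replacement is
only a stand-in for "some native charset that lacks these characters";
neither comes from any real implementation.

```python
# Hypothetical header value (an assumption, not taken from the thread).
value = "Zo\u00eb \u20ac5"          # "Zoë €5"
wire = value.encode("utf-8")        # octets as they would appear on the wire

# UTF-8 octets that land in the C1 range (0x80-0x9F) of ISO 8859-1.
c1_octets = [b for b in wire if 0x80 <= b <= 0x9F]

# Presentation: a reader that assumes ISO 8859-1 displays mojibake.
shown = wire.decode("iso-8859-1")

# Transparency: storing as 8859-1 and writing it back preserves the octets,
# as long as nothing rejects or rewrites the C1 values.
round_trip = shown.encode("iso-8859-1")

# Translation: recoding the presumed 8859-1 text into a native charset that
# cannot represent it (ASCII here, as a stand-in) and back loses the octets.
native = shown.encode("ascii", errors="replace")
recoded = native.decode("ascii").encode("iso-8859-1")

print(c1_octets, round_trip == wire, recoded == wire)
```

The round trip through 8859-1 is lossless, so transparency survives;
what breaks is anything that displays or translates the value on the
assumption that it really is 8859-1 text.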