Re: Character encodings in headers [i74][was: Straw-man charter for http-bis]

der Mouse <mouse@Rodents.Montreal.QC.CA> Mon, 20 August 2007 14:37 UTC

From: der Mouse <mouse@Rodents.Montreal.QC.CA>
Message-Id: <200708201436.KAA27449@Sparkle.Rodents.Montreal.QC.CA>
Mime-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: 8bit
X-Erik-Conspiracy: There is no Conspiracy - and if there were I wouldn't be part of it anyway.
X-Message-Flag: Microsoft: the company who gave us the botnet zombies.
Date: Mon, 20 Aug 2007 10:18:13 -0400 (EDT)
To: discuss@apps.ietf.org, Felix Sasaki <fsasaki@w3.org>, ietf-http-wg@w3.org, Richard Ishida <ishida@w3.org>
Subject: Re: Character encodings in headers [i74][was: Straw-man charter for http-bis]
In-Reply-To: <6.0.0.20.2.20070820181338.07260770@localhost>
References: <BA772834-227A-4C1B-9534-070C50DF05B3@mnot.net> <392C98BA-E7B8-44ED-964B-82FC48162924@mnot.net> <p06240843c2833f4d7f2f@[10.20.30.108]> <465D9142.9050506@gmx.de> <6.0.0.20.2.20070610165356.0a69cec0@localhost> <088FB13E-F12F-4BE7-94FB-78B21C51512E@mnot.net> <157F4F253535B9C73F8EDC75@p3.JCK.COM> <6B8E3D7A-71B8-4B8D-9625-2AB3C74A9072@mnot.net> <6.0.0.20.2.20070820181338.07260770@localhost>

> I think you present a valid scenario.  However, storing headers as
> iso-8859-1 essentially means storing (and resending) them as bytes.

Depends on how much checking is done.  The C0 and C1 ranges are not
valid 8859-x text (except for a few codes in C0, like HT), but, as
Clive points out, octets in the C1 range (0x80-0x9F) do, in general,
occur in UTF-8-encoded text.
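
To make the C1 point concrete, here is a small Python sketch.  The
helper name c1_octets is made up for this illustration; it just lists
the octets of a string's UTF-8 encoding that land in 0x80-0x9F.

```python
def c1_octets(text):
    """Return the octets of text's UTF-8 encoding that fall in the
    C1 range (0x80-0x9F).  Hypothetical helper, for illustration only."""
    return [b for b in text.encode("utf-8") if 0x80 <= b <= 0x9F]


# U+00C0 encodes as C3 80 and U+20AC as E2 82 AC, so both put an octet
# in the C1 range; U+00E9 encodes as C3 A9 and happens not to.
for ch in ("\u00c0", "\u20ac", "\u00e9"):
    print(hex(ord(ch)), [hex(b) for b in c1_octets(ch)])
```

Whether a given character trips this depends only on the low six bits
of each continuation octet, so it is common but not universal.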

I recognize there's a "who would bother to check" tendency.  While I
share it, I also believe the number of distinct implementations out
there is large enough that anything permitted by the spec has probably
been done (and, of course, a great many things not permitted by the
spec, but I see no reason to care about compatibility with them).  In
particular, any implementation whose native text encoding is not 8859-1
may be recoding headers into its native encoding for storage and back
again on output, and that is almost certain to corrupt C1 octets.
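
A rough sketch of that failure mode, in Python.  Everything specific
here is an assumption made for the example: the function name, the
choice of cp1252 as the "native" encoding, and the policy of
substituting "?" for anything that cannot be recoded.  A real
implementation might differ on all three.

```python
def store_and_resend(header_octets, native="cp1252"):
    """Simulate an implementation that reads header octets as
    ISO-8859-1, stores them in its native encoding, and converts
    back on output.  Hypothetical; cp1252 is only an example."""
    text = header_octets.decode("iso-8859-1")
    # Characters with no native equivalent become "?" (assumed policy).
    stored = text.encode(native, errors="replace")
    return stored.decode(native).encode("iso-8859-1", errors="replace")


original = "\u00c0".encode("utf-8")    # octets C3 80
print(original, store_and_resend(original))
```

In this setup the 0x80 octet has no counterpart in the native
encoding, so it does not survive the round trip, while a header whose
UTF-8 octets all fall in 0xA0-0xFF comes back unchanged.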

The only fix I can see for that is to do something like UTF-8, but
tweaked to keep all octets in the ISO-8859-x printable space.  I've been
unable to come up with a way of doing this by just changing the fixed
bits in UTF-8; it seems to me to require putting only five (rather than
six) bits of data in the second and later octets.  (I suspect this
wouldn't fly, simply because UTF-8 is too entrenched, but it's the only
way I can see to be strictly compatible.  It also has the disadvantage
that part of the BMP needs four octets rather than the three that UTF-8
needs.)
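
For illustration only, one way such an encoding could be laid out, as
a Python sketch.  The bit layout is an assumption, not a worked-out
proposal: continuation octets are 101xxxxx (0xA0-0xBF, five data bits
each) and lead octets are 110xxxxx, 1110xxxx, 11110xxx and 111110xx,
so every non-ASCII octet lands in 0xA0-0xFB.  The names FORMS,
encode and decode are likewise invented here.

```python
# (total octets, lead marker, data bits in the lead octet) -- assumed layout
FORMS = [(2, 0xC0, 5), (3, 0xE0, 4), (4, 0xF0, 3), (5, 0xF8, 2)]


def encode(text):
    out = bytearray()
    for cp in map(ord, text):
        if cp < 0x80:                      # ASCII passes through unchanged
            out.append(cp)
            continue
        for length, marker, lead_bits in FORMS:
            if cp < 1 << (lead_bits + 5 * (length - 1)):
                tail = []
                for _ in range(length - 1):
                    tail.append(0xA0 | (cp & 0x1F))   # five bits per octet
                    cp >>= 5
                out.append(marker | cp)
                out.extend(reversed(tail))
                break
        else:
            raise ValueError("code point out of range")
    return bytes(out)


def decode(data):
    chars, i = [], 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            chars.append(chr(b))
            i += 1
            continue
        for length, marker, lead_bits in FORMS:
            if b >> lead_bits == marker >> lead_bits:
                break
        else:
            raise ValueError("invalid lead octet")
        if i + length > len(data):
            raise ValueError("truncated sequence")
        cp = b & ((1 << lead_bits) - 1)
        for c in data[i + 1:i + length]:
            if not 0xA0 <= c <= 0xBF:
                raise ValueError("invalid continuation octet")
            cp = (cp << 5) | (c & 0x1F)
        chars.append(chr(cp))
        i += length
    return "".join(chars)
```

With this layout three octets carry 14 bits, so U+4000 and up in the
BMP take four octets where UTF-8 takes three, which is the cost noted
above.  The sketch does not reject overlong forms or surrogates; a
real definition would have to.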

/~\ The ASCII				der Mouse
\ / Ribbon Campaign
 X  Against HTML	       mouse@rodents.montreal.qc.ca
/ \ Email!	     7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B