Re: Character encodings in headers [i74][was: Straw-man charter for http-bis]

Stefanos Harhalakis <v13@priest.com> Tue, 21 August 2007 16:45 UTC

From: Stefanos Harhalakis <v13@priest.com>
To: Martin Duerst <duerst@it.aoyama.ac.jp>
Subject: Re: Character encodings in headers [i74][was: Straw-man charter for http-bis]
Date: Mon, 20 Aug 2007 16:52:26 +0300
User-Agent: KMail/1.9.7
References: <BA772834-227A-4C1B-9534-070C50DF05B3@mnot.net> <6B8E3D7A-71B8-4B8D-9625-2AB3C74A9072@mnot.net> <6.0.0.20.2.20070820181338.07260770@localhost>
In-Reply-To: <6.0.0.20.2.20070820181338.07260770@localhost>
MIME-Version: 1.0
Content-Type: text/plain; charset="iso-8859-7"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
Message-Id: <200708201652.26863.v13@priest.com>
Cc: Paul Hoffman <phoffman@imc.org>, Felix Sasaki <fsasaki@w3.org>, Richard Ishida <ishida@w3.org>, Apps Discuss <discuss@apps.ietf.org>, Mark Nottingham <mnot@mnot.net>, "ietf-http-wg@w3.org Group" <ietf-http-wg@w3.org>

On Monday 20 August 2007, Martin Duerst wrote:
> At 17:55 07/08/20, Mark Nottingham wrote:
> >The (potential) problem is that an intermediary (for example) needs
> >to be able to handle headers that it doesn't understand. If it's been
> >built to store headers as iso-8859-1 strings as they pass through (a
> >reasonable assumption, considering 2616), an unknown header with
> >another encoding -- no matter how specified or flagged -- may break it.
>
> I think you present a valid scenario. However, storing headers as
> iso-8859-1 essentially means storing (and resending) them as bytes.
> If such an implementation gets UTF-8, it will just store and
> resend that as iso-8859-1, which means store and resend as bytes,
> which, from the viewpoint of that implementation, will be GIGO,
> but overall, will not cause any damage.

My 2c:

  UTF-8 introduces a requirement that the ISO-8859-X encodings don't have:
a UTF-8 string may be invalid (a malformed byte sequence), in which case some
action has to be taken (drop it?). Thus, all UTF-8 strings need to be
validated.
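
To illustrate the difference, here is a minimal Python sketch. It is not taken
from any particular implementation, and the function names are made up for
this example. It relies on Python's strict UTF-8 decoder rejecting malformed
input, while its ISO-8859-1 codec maps every byte value to a code point, so a
byte-transparent round trip never fails:

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw is well-formed UTF-8 (hypothetical helper)."""
    try:
        raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def latin1_roundtrip(raw: bytes) -> bytes:
    """Store and resend a header value as ISO-8859-1, i.e. as plain bytes."""
    return raw.decode("iso-8859-1").encode("iso-8859-1")


if __name__ == "__main__":
    good = "\u03ba\u03b1\u03bb\u03b7".encode("utf-8")  # Greek text as UTF-8
    bad = b"\xc3\x28"  # lead byte followed by a non-continuation byte
    print(is_valid_utf8(good), is_valid_utf8(bad))
    print(latin1_roundtrip(bad) == bad)
```

What to do with a value that fails the check (drop it, reject the message,
pass it through untouched) is a policy decision that the sketch leaves open.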

  Apart from that, implementations may do various things with header values,
such as logging, where:
a) strlen() is used, which is not Unicode-aware (it counts bytes, not
characters)
b) iconv() is used to convert ISO-8859-1 to UTF-8, either for presentation or
for internal storage (Python or Java perhaps?)
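
Both pitfalls can be shown in a few lines. This is only a sketch in Python
with hypothetical helper names. It assumes the value really is UTF-8 and that
the implementation treats it as ISO-8859-1. C code calling strlen() and
iconv() operates on the same raw bytes, so it should show the same effects:

```python
def byte_length(text: str) -> int:
    """Length as a byte-oriented strlen() would report it for UTF-8 data."""
    return len(text.encode("utf-8"))


def char_length(text: str) -> int:
    """Length in Unicode code points."""
    return len(text)


def reencode_as_latin1(raw: bytes) -> bytes:
    """Convert bytes assumed to be ISO-8859-1 into UTF-8.

    If raw was already UTF-8, the result is double-encoded.
    """
    return raw.decode("iso-8859-1").encode("utf-8")


if __name__ == "__main__":
    value = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE
    utf8_bytes = value.encode("utf-8")  # two bytes: 0xC3 0xA9
    print(byte_length(value), char_length(value))
    # Each of the two original bytes becomes its own two-byte sequence.
    print(reencode_as_latin1(utf8_bytes))
```

After the conversion the single character has become four bytes, which decode
as two unrelated characters, so logs or stored values no longer match what
the sender put on the wire.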