Re: Character encodings in headers [i74][was: Straw-man charter for http-bis]

"Clive D.W. Feather" <clive@demon.net> Mon, 20 August 2007 10:28 UTC

Date: Mon, 20 Aug 2007 11:24:06 +0100
From: "Clive D.W. Feather" <clive@demon.net>
To: Mark Nottingham <mnot@mnot.net>
Subject: Re: Character encodings in headers [i74][was: Straw-man charter for http-bis]
Cc: Richard Ishida <ishida@w3.org>, Apps Discuss <discuss@apps.ietf.org>, Felix Sasaki <fsasaki@w3.org>, "ietf-http-wg@w3.org Group" <ietf-http-wg@w3.org>, Paul Hoffman <phoffman@imc.org>

Mark Nottingham said:
>> UTF-8 has virtually
>> the same footprint in terms of bytes as ISO-8859-1: All bytes
>> above 0x7F may be used. Implementations that have to deal with
>> ISO-8859-1 usually do this by just being 8-bit-transparent;
>> that works for UTF-8, too.
> If utf-8 is a subset of iso-8859-1, it would work; but I don't think  
> that's the case (not that I'm an expert in this area, by any means).

It's not.

Printable text in ISO-8859-n (for all n) consists of a sequence of
characters, each of which is either:

    one octet in the range 20 to 7E
    one octet in the range A0 to FF

Printable text in UTF-8 consists of a sequence of characters, each of which
is either:

    one octet in the range 20 to 7E
    one octet in the range C2 to F4 followed by between 1 and 3 octets
              in the range 80 to BF (the first octet tells you how many [*])

In both cases, 20 to 7E are the ASCII characters. In both cases, codes like
09 (HTAB) and 0A (LF) have the same meaning. In ISO-8859-n the meaning of
codes A0 to FF depends on the value of n. In UTF-8 each sequence has a
unique meaning that never changes.
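A small Python sketch of that last point, using only the standard codecs. The helper name decode_in_parts and the choice of octet E9 are illustrative assumptions, not anything from the thread:

```python
def decode_in_parts(octet: int, parts):
    """Decode one octet under several ISO-8859 parts (hypothetical helper)."""
    raw = bytes([octet])
    return {part: raw.decode(part) for part in parts}

# The same octet E9 read as ISO-8859-1, -5 and -7.
readings = decode_in_parts(0xE9, ["iso-8859-1", "iso-8859-5", "iso-8859-7"])
for part, char in readings.items():
    print(part, "U+%04X" % ord(char))

# In UTF-8 the sequence C3 A9 means U+00E9 regardless of context.
utf8_reading = b"\xc3\xa9".decode("utf-8")
print("utf-8", "U+%04X" % ord(utf8_reading))
```

The three ISO-8859 readings come out as three different characters, while the UTF-8 sequence has only one reading.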

The syntax in RFC 2616 allows any octet in the range 20 to FF except 7F;
printable text in either encoding is a subset of that.
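As a rough check of that claim, the sketch below tests octet strings against the 20 to FF (except 7F) range quoted above, and also shows that neither encoding is a subset of the other. The function names and the sample string are assumptions made for illustration:

```python
def fits_rfc2616_text(data: bytes) -> bool:
    """True if every octet is in 20 to FF, excluding 7F (range as quoted above)."""
    return all(0x20 <= b <= 0xFF and b != 0x7F for b in data)

def can_decode(data: bytes, encoding: str) -> bool:
    """True if data is well-formed in the given encoding."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True

sample = "caf\u00e9"                    # "cafe" with an acute accent on the e
latin1 = sample.encode("iso-8859-1")    # octets 63 61 66 E9
utf8 = sample.encode("utf-8")           # octets 63 61 66 C3 A9

# Both octet strings fall inside the permitted range.
print(fits_rfc2616_text(latin1), fits_rfc2616_text(utf8))

# The ISO-8859-1 octets are not well-formed UTF-8 ...
print(can_decode(latin1, "utf-8"))

# ... and the UTF-8 octets, read as ISO-8859-1, give different text.
print(utf8.decode("iso-8859-1") == sample)
```

An 8-bit-transparent implementation will pass both forms through unchanged; what differs is how the receiver must interpret the octets.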

[*] To be precise:
     one octet C2 to DF followed by one   octet  in the range 80 to BF, or
     one octet E0 to EF followed by two   octets in the range 80 to BF, or
     one octet F0 to F4 followed by three octets in the range 80 to BF.
    (RFC 3629 further narrows the second octet after E0, ED, F0 and F4.)
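
A minimal sketch of how the first octet determines the sequence length, assuming the RFC 3629 lead-octet ranges (C2 to DF, E0 to EF, F0 to F4) and treating only 20 to 7E as single-octet printable characters. The function names are hypothetical, and the sketch does not apply the extra second-octet restrictions RFC 3629 places after E0, ED, F0 and F4:

```python
from typing import List

def utf8_sequence_length(lead: int) -> int:
    """Total octets in the sequence that starts with this lead octet."""
    if 0x20 <= lead <= 0x7E:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    raise ValueError("not a valid lead octet for printable text: %02X" % lead)

def split_utf8(data: bytes) -> List[bytes]:
    """Split printable UTF-8 into per-character octet sequences.

    Structural check only: continuation octets must be in 80 to BF.
    """
    chunks = []
    i = 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        chunk = data[i:i + n]
        if len(chunk) != n or any(not 0x80 <= c <= 0xBF for c in chunk[1:]):
            raise ValueError("truncated or malformed sequence at offset %d" % i)
        chunks.append(chunk)
        i += n
    return chunks

# One character of each length: U+0061, U+00E9, U+20AC, U+1F600.
for chunk in split_utf8("a\u00e9\u20ac\U0001F600".encode("utf-8")):
    print(chunk.hex())
```

Feeding it a lone ISO-8859-1 octet such as E9 raises ValueError, because E9 announces a three-octet sequence that never arrives.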

-- 
Clive D.W. Feather  | Work:  <clive@demon.net>   | Tel:    +44 20 8495 6138
Internet Expert     | Home:  <clive@davros.org>  | Fax:    +44 870 051 9937
Demon Internet      | WWW: http://www.davros.org | Mobile: +44 7973 377646
THUS plc            |                            |