Re: URL internationalization!

"Martin J. Duerst" <mduerst@ifi.unizh.ch> Mon, 24 February 1997 16:32 UTC

Date: Mon, 24 Feb 1997 17:09:06 +0100
From: "Martin J. Duerst" <mduerst@ifi.unizh.ch>
To: Alain LaBont/e'/ <alb@sct.gouv.qc.ca>
Cc: Francois Yergeau <yergeau@alis.com>, "Roy T. Fielding" <fielding@kiwi.ics.uci.edu>, URI mailing list <uri@bunyip.com>
Subject: Re: URL internationalization!
In-Reply-To: <9702211454.AA12501@socrate.riq.qc.ca>
Message-Id: <Pine.SUN.3.95q.970224164714.245O-100000@enoshima>
Mime-Version: 1.0
Content-Type: TEXT/PLAIN; charset="US-ASCII"
Sender: owner-uri@bunyip.com
Precedence: bulk

On Fri, 21 Feb 1997, Alain LaBont/e'/ wrote:

> @ 23:11 97-02-20 -0500, Francois Yergeau écrit :
> 
> [given 8 bit per byte encoding]
> 
> >Right.  In fact, not only the system MUST NOT crash, but it SHOULD behave
> >the same as if it had received the corresponding %XX.
> 
> Génial!

Sorry, but it's not exactly as genial as it looks. As an example,
let's take a resource name with a G with breve (U+011E). Let's
assume that on the server, resource names are encoded in iso-8859-3.
Then the G with breve appears as %AB in a well-formed
URL. Now suppose somebody put that URL into an HTML document
that is encoded in iso-8859-3, in 8-bit form (i.e. the URL contains
the octet 0xAB for the G with breve character), and that that
document is correctly tagged as iso-8859-3.

Now assume a browser sends a request with
	Accept-Charset: iso-8859-9
The server (or a proxy) translates the whole document from
iso-8859-3 to iso-8859-9 to honor the request of the browser.
The G with breve gets changed to 0xD0. The client receives
the 0xD0. If it "behaves the same as if it had received the
corresponding %XX", i.e. %D0, the URL will not work at all.
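
As a rough illustration of the mismatch, here is a minimal Python
sketch. It relies on Python's built-in codecs and urllib.parse.quote;
the variable names are made up for this example. It uses iso-8859-9
(Latin-5) as the target charset, since that is the ISO 8859 part in
which G with breve is encoded as 0xD0.

```python
from urllib.parse import quote

g_breve = "\u011e"  # LATIN CAPITAL LETTER G WITH BREVE

# The octet the server stores for this character (iso-8859-3).
octet_on_server = g_breve.encode("iso-8859-3")

# The octet the client sees after the document was transcoded.
octet_after_transcoding = g_breve.encode("iso-8859-9")

# Percent-encode each octet directly, with no character interpretation.
url_that_works = quote(octet_on_server, safe="")
url_client_builds = quote(octet_after_transcoding, safe="")

print(url_that_works)     # %AB
print(url_client_builds)  # %D0
```

The character is the same in both documents, but the escaped forms
differ, so the request built from the transcoded page names a
different octet than the one the server knows.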

This is difficult to fix in the short term, but in the long
term, once the convention that URLs use UTF-8 becomes popular,
the client shouldn't "behave the same", but should take the
character (namely the G with breve), encode it as UTF-8
and then with %HH, and then send it to the server. If we make
recommendations as to what to do with an 8-bit encoded
URL, we should definitely mention both possibilities,
namely:

- Interpret as octet directly and convert it to %HH
- Interpret as character and convert to UTF-8 and then to %HH
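
The two possibilities could be sketched in Python roughly as follows.
The function names as_octets and as_characters are hypothetical, and
the second one assumes the client knows the charset of the document
in which it found the URL.

```python
from urllib.parse import quote


def as_octets(raw: bytes) -> str:
    """Possibility 1: take each 8-bit octet as-is and write it as %HH."""
    return quote(raw, safe="")


def as_characters(raw: bytes, document_charset: str) -> str:
    """Possibility 2: decode to characters, re-encode as UTF-8, then %HH."""
    text = raw.decode(document_charset)
    return quote(text.encode("utf-8"), safe="")


# G with breve as received in a document transcoded to iso-8859-9.
print(as_octets(b"\xd0"))                    # %D0
print(as_characters(b"\xd0", "iso-8859-9"))  # %C4%9E
```

Note that the second form gives the same result whatever charset the
document arrived in, as long as the document is correctly tagged.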

With this, we cover two cases:

- The URL wasn't transcoded (not guaranteed, but quite frequent)
- The server uses UTF-8 to encode characters (will become
	more and more frequent)

The third case, namely that the URL gets transcoded, but the
server doesn't support UTF-8, would be very difficult to
cover, and is unrelated to the proposal of introducing
UTF-8 as a recommended character encoding for URLs.
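
The three cases can be checked with a small simulation. This is only
a sketch under simplifying assumptions: the "server" is modelled as
unescaping the request and decoding it in a single charset, and all
function names are invented for the example.

```python
from urllib.parse import quote, unquote_to_bytes

NAME = "\u011e"  # resource name: G with breve


def candidates(raw: bytes, document_charset: str) -> dict:
    """Both escaped forms a client could build from the received octets."""
    as_text = raw.decode(document_charset)
    return {
        "octet": quote(raw, safe=""),
        "utf-8": quote(as_text.encode("utf-8"), safe=""),
    }


def server_accepts(escaped: str, server_charset: str) -> bool:
    """Toy server: unescape, decode in its own charset, compare names."""
    try:
        return unquote_to_bytes(escaped).decode(server_charset) == NAME
    except UnicodeDecodeError:
        return False


def working_strategies(document_charset: str, server_charset: str) -> list:
    """Which of the two interpretations reaches the resource."""
    raw = NAME.encode(document_charset)
    forms = candidates(raw, document_charset)
    return [label for label, escaped in forms.items()
            if server_accepts(escaped, server_charset)]


# Case 1: URL not transcoded, server uses iso-8859-3.
not_transcoded = working_strategies("iso-8859-3", "iso-8859-3")
# Case 2: URL transcoded, server uses UTF-8.
utf8_server = working_strategies("iso-8859-9", "utf-8")
# Case 3: URL transcoded, server still uses iso-8859-3.
transcoded_legacy = working_strategies("iso-8859-9", "iso-8859-3")

print(not_transcoded, utf8_server, transcoded_legacy)
```

In this model the first case is reached only by the octet
interpretation, the second only by the UTF-8 interpretation, and the
third by neither.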

Regards,	Martin.