Re: URL internationalization!

"Martin J. Duerst" <mduerst@ifi.unizh.ch> Tue, 25 February 1997 14:50 UTC

Date: Tue, 25 Feb 1997 15:02:57 +0100
From: "Martin J. Duerst" <mduerst@ifi.unizh.ch>
To: Dan Oscarsson <Dan.Oscarsson@trab.se>
Cc: alb@sct.gouv.qc.ca, yergeau@alis.com, fielding@kiwi.ics.uci.edu, uri@bunyip.com
Subject: Re: URL internationalization!
In-Reply-To: <199702251309.OAA06854@valinor.malmo.trab.se>
Message-Id: <Pine.SUN.3.95q.970225143447.245G-100000@enoshima>
Mime-Version: 1.0
Content-Type: TEXT/PLAIN; charset="US-ASCII"
Sender: owner-uri@bunyip.com
Precedence: bulk

On Tue, 25 Feb 1997, Dan Oscarsson wrote:

> There are a few more things to think about regarding 8 bits versus %XX.
> 
> > > [given 8 bit per byte encoding]
> > > 
> > > >Right.  In fact, not only the system MUST NOT crash, but it SHOULD behave
> > > >the same as if it had received the corresponding %XX.
> > As an example,
> > let's take a resource name with a G with breve (U+011E). Let's
> > assume that on the server, resource names are encoded in iso-8859-3.
> > Then the G with breve appears as %AB in a well-formed
> > URL. Now suppose somebody put that URL into an HTML document
> > that is encoded in iso-8859-3, in 8-bit form (i.e. the URL contains
> > the octet 0xAB for the G with breve character), and that that
> > document is correctly tagged as iso-8859-3.
> > 
> > Now assume a browser sends a request with
> > 	Accept-Charset: iso-8859-5
> > The server (or a proxy) translates the whole document from
> > iso-8859-3 to iso-8859-5 to honor the request of the browser.
> > The G with breve gets changed to 0xD0. The client receives
> > the 0xD0. If it "behaves the same as if it had received the
> > corresponding %XX", i.e. %D0, the URL will not work at all.
> 
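
The failure described in the example above can be reproduced in a few lines. This is only a minimal sketch, not anybody's actual server or browser code. One assumption needs to be stated loudly: the octet 0xD0 for G with breve matches ISO-8859-9 (Latin-5), whereas ISO-8859-5 is the Cyrillic set and contains no G with breve, so the sketch uses "iso-8859-9" as the client charset. The names SERVER_CHARSET, CLIENT_CHARSET and server_lookup are hypothetical.

```python
from urllib.parse import quote, unquote_to_bytes

SERVER_CHARSET = "iso-8859-3"  # charset of resource names on the server
CLIENT_CHARSET = "iso-8859-9"  # assumption: Latin-5, where G with breve is 0xD0

g_breve = "\u011e"  # LATIN CAPITAL LETTER G WITH BREVE


def server_lookup(escaped):
    """Hypothetical server: unescape %XX and read the octets in its own charset."""
    try:
        return unquote_to_bytes(escaped).decode(SERVER_CHARSET)
    except UnicodeDecodeError:
        return None  # the octet has no meaning in the server's charset


# The well-formed URL escapes the octet as the server stores it.
server_octets = g_breve.encode(SERVER_CHARSET)
well_formed = quote(server_octets)  # '%AB'

# A transcoding server or proxy rewrites the whole document,
# including the raw URL octet embedded in it.
transcoded_octets = server_octets.decode(SERVER_CHARSET).encode(CLIENT_CHARSET)

# A client that treats the received octet "as the corresponding %XX".
naive_request = quote(transcoded_octets)  # '%D0'

print(server_lookup(well_formed) == g_breve)    # the original URL resolves
print(server_lookup(naive_request) == g_breve)  # the transcoded one does not
```

The character survives the transcoding, but the octet does not, and the octet is all that the "corresponding %XX" rule looks at.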
> As Martin points out, there are a few problems; they exist mostly
> because the URL as used today does not define how to encode characters
> outside ASCII.
> 
> If we define that a URL is sent in a transport format in which
> all characters are encoded using UTF-8, there is no problem.
> In an HTML document the URL can be represented using 8-bit octets
> encoded in iso 8859-3 as above. When the document is transcoded
> (are there any servers that do that?)

Yes, there are! Gavin or Francois can surely give examples.

> the URL is changed into the
> same URL, but encoded in iso 8859-5 (the G with breve is still a
> G with breve). When the browser that requested the iso 8859-5 format
> of the html document follows a link and sends the URL to a
> web server, it will encode the URL using the transport format
> (that is using UTF-8). The server will decode the UTF-8 and convert
> it into iso 8859-3, if that is the set used on the server.
> No problem here with 8-bit bytes.

Exactly. That's the core of "stage two" of my proposal. Everything
will work as expected, exactly as it already does for ASCII and EBCDIC
at the moment. 
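
The round trip Dan describes can be sketched as follows. This is an illustration under stated assumptions, not a specification: the function names browser_request and server_resource_name are hypothetical, and "iso-8859-9" again stands in for the charset in which G with breve is the octet 0xD0.

```python
from urllib.parse import quote, unquote


def browser_request(url_octets, document_charset):
    """Hypothetical browser step: raw URL octets from a document -> UTF-8 %HH."""
    # Interpret the octets as characters, using the document's declared charset.
    characters = url_octets.decode(document_charset)
    # Send the characters in the transport format: UTF-8, %HH-escaped.
    return quote(characters, encoding="utf-8")


def server_resource_name(escaped, server_charset):
    """Hypothetical server step: UTF-8 %HH -> octets in the server's charset."""
    characters = unquote(escaped, encoding="utf-8", errors="strict")
    return characters.encode(server_charset)


# The same link, seen in the original and in the transcoded document.
from_original = browser_request(b"\xab", "iso-8859-3")
from_transcoded = browser_request(b"\xd0", "iso-8859-9")

print(from_original, from_transcoded)
print(server_resource_name(from_transcoded, "iso-8859-3"))
```

Both documents yield the same request, because the browser converts characters rather than octets, and the server can map the result back to whatever charset it uses locally.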

I very much understand that Dan would like to go to that stage
immediately. However, I decided to separate my proposal into two
stages, and am currently asking for "stage one" only, for
the following reasons:

- It is important that the convention to use UTF-8 (with %HH) gets
	sufficiently deployed before we seriously start to put
	URLs into HTML and such in native encoding.
- By mandating %HH, we are exactly parallel (except for the
	backwards compatibility issues) with URNs.
- The URN discussion has shown that many people are still sceptical
	about the correct treatment of non-ASCII characters upon
	transcoding, cut-and-paste, and so on. I didn't want to
	repeat this discussion; I think time will tell. Of course,
	if the people that have opposed native non-ASCII encoding
	in the URN discussion are already convinced to the contrary,
	I wouldn't have any problems moving ahead
	(Keith, any comments :-?).
- Because sending around natively encoded URLs is already established
	practice (the syntax draft rightly contains a warning
	in this direction), even formally specifying that only the
	"canonical form" is allowed (i.e. %HH-escaping is mandated)
	will not be enforceable.
- "stage one" is completely independent and separate from "stage two".
	There is no technical need to move to "stage two" if we don't
	agree to do so.
- On the other hand, "stage two" is the natural consequence of
	"stage one", in the sense that (at least that's my
	prediction) once UTF-8 is seriously established for URLs,
	their native encoding, without %HH, will get deployed
	quickly. As a browser maker, I would definitely want
	to provide that feature to my users!
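
As a rough illustration of "stage one", here is a minimal sketch of bringing a URL that contains native characters into the canonical form, i.e. UTF-8 with %HH-escaping. The helper names is_canonical and to_canonical, the set of characters left unescaped, and the example.org URL are assumptions made for the sketch, not part of the proposal.

```python
from urllib.parse import quote

# Assumption: ASCII punctuation that may appear unescaped in a URL, plus '%'
# so that escapes already present are not escaped a second time.
SAFE = "/:?#[]@!$&'()*+,;=%-._~"


def is_canonical(url):
    """True if the URL contains ASCII characters only."""
    return url.isascii()


def to_canonical(url):
    """Escape every non-ASCII character as the %HH form of its UTF-8 octets."""
    return quote(url, safe=SAFE, encoding="utf-8")


native = "http://example.org/\u011e"  # hypothetical URL with a G with breve
canonical = to_canonical(native)

print(is_canonical(native), is_canonical(canonical))
print(canonical)
```

Applying to_canonical to a URL that is already canonical leaves it unchanged, so software can accept both forms during the transition and always send the canonical one.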


Regards,	Martin.