Re: URL internationalization!
Dan Oscarsson <Dan.Oscarsson@trab.se> Tue, 25 February 1997 13:38 UTC
Date: Tue, 25 Feb 1997 14:09:38 +0100
From: Dan Oscarsson <Dan.Oscarsson@trab.se>
Message-Id: <199702251309.OAA06854@valinor.malmo.trab.se>
To: alb@sct.gouv.qc.ca, mduerst@ifi.unizh.ch
Subject: Re: URL internationalization!
Cc: yergeau@alis.com, fielding@kiwi.ics.uci.edu, uri@bunyip.com

> There are a few more things to think about 8 bits versus %XX.
>
> > [given 8 bit per byte encoding]
> >
> > >Right. In fact, not only the system MUST NOT crash, but it SHOULD behave
> > >the same as if it had received the corresponding %XX.
>
> As an example,
> let's take a resource name with a G with breve (U+011E). Let's
> assume that on the server, resource names are encoded in iso-8859-3.
> Then the G with breve appears as %AB in a well-formed
> URL. Now suppose somebody put that URL into an HTML document
> that is encoded in iso-8859-3, in 8-bit form (i.e. the URL contains
> the octet 0xAB for the G with breve character), and that that
> document is correctly tagged as iso-8859-3.
>
> Now assume a browser sends a request with
> Accept-Charset: iso-8859-5
> The server (or a proxy) translates the whole document from
> iso-8859-3 to iso-8859-5 to honor the request of the browser.
> The G with breve gets changed to 0xD0. The client receives
> the 0xD0. If it "behaves the same as if it had received the
> corresponding %XX", i.e. %D0, the URL will not work at all.

As Martin points out, there are a few problems. They exist mostly because URLs as used today do not define how to encode characters outside ASCII.

If we define that a URL is sent in a transport format in which all characters are encoded using UTF-8, there is no problem. In an HTML document the URL can be represented using 8-bit octets encoded in iso-8859-3, as above. When the document is transcoded (are there any servers that do that?) the URL is changed into the same URL, but encoded in iso-8859-5 (the G with breve is still a G with breve). When the browser that requested the iso-8859-5 form of the HTML document follows a link and sends the URL to a web server, it will encode the URL using the transport format (that is, using UTF-8). The server will decode the UTF-8 and convert it into iso-8859-3, if that is the character set used on the server. No problem here with 8-bit bytes.

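To make the round trip concrete, here is a minimal sketch of the browser-to-server leg only. It is an illustration, not a proposal for any particular implementation: the function names `to_transport` and `to_server_octets` and the path `/Ğ.html` are made up for the example, and the server is assumed to store resource names in iso-8859-3 as in Martin's scenario.

```python
from urllib.parse import quote, unquote_to_bytes


def to_transport(path: str) -> str:
    """Browser side (hypothetical helper): encode the characters as UTF-8,
    then write every non-ASCII octet in %XX form."""
    return quote(path, safe="/", encoding="utf-8")


def to_server_octets(transport_path: str, server_charset: str) -> bytes:
    """Server side (hypothetical helper): undo the %XX escapes, decode the
    UTF-8, and re-encode in the character set used for local names."""
    characters = unquote_to_bytes(transport_path).decode("utf-8")
    return characters.encode(server_charset)


# The link as it sits in the iso-8859-3 document: octet 0xAB is G with breve.
link = b"/\xab.html".decode("iso-8859-3")

transport = to_transport(link)
print(transport)                                  # /%C4%9E.html
print(to_server_octets(transport, "iso-8859-3"))  # b'/\xab.html'
```

The browser works from the character (U+011E), not from whatever octet represented it in the document it received, so the transport form is the same regardless of the document's encoding.
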
The difficulty lies in the situation today, where no defined handling of non-ASCII characters in URLs exists. Either they must be in %XX form, or transcoding must not occur and octets in URLs must not be changed.

It is this mess we want to remove by defining UTF-8 as the transport format for URLs. By doing that, there is a defined way to use most characters in the world in a URL. The UTF-8 encoded URL can, by %XX encoding, be both transported over 7-bit media and printed on paper so that many people in the world can enter it on a keyboard. And it can be presented in the local character set to make it user friendly.

The quicker we can change to a standard way to represent non-ASCII characters in the transport format of a URL, the quicker the current problems will go away.

I know Masataka prefers ISO 2022, but from what I have seen, most are planning to support ISO 10646/Unicode and not ISO 2022. As time goes on, added information in documents will probably fix the problems Masataka sees in UCS.

Regards,

Dan