Re: Using UTF-8 for non-ASCII Characters in URLs

Dan Oscarsson <Dan.Oscarsson@trab.se> Wed, 30 April 1997 07:02 UTC

Date: Wed, 30 Apr 1997 08:52:17 +0200
From: Dan Oscarsson <Dan.Oscarsson@trab.se>
Message-Id: <199704300652.IAA09984@valinor.malmo.trab.se>
To: uri@bunyip.com, masinter@parc.xerox.com
Subject: Re: Using UTF-8 for non-ASCII Characters in URLs

> Since no one else has, here's a rough draft of a UTF-8 URL
> internet-draft, which I intend to submit in a few days time,
> after taking another pass on it.
> 
> 
> -----
> INTERNET-DRAFT			    Larry Masinter, Xerox Corporation
> draft-masinter-url-i18n-00xx	                       April 27, 1997
> Expires: October 27, 1997

> 3.2 Requirements for URL generation and interpretation
>    
>    Systems that are offering resources through the internet
>    where those resources have logical names sometimes offer
>    the ability to generate URLs for the resources they offer.
>    For example, some HTTP servers offer the ability to
>    generate a 'directory listing' for file directories
>    under their purvue, and then to respond to the generated
>    URLs with the files. If the names of the files consist
>    solely of US-ASCII characters, the transcription is
>    simple, but other file systems offer a wider variety
>    of characters. It is recommended that the generation
>    of directories result in hex-encoded UTF-8 for non-USASCII
>    characters in the listing, and that the interpretation
>    of URLs accept both the raw UTF-8 or the hex-encoded version.
> 

This is not right. A directory listing service generates an HTML document
that is sent back to the web browser. All URLs within an HTML document
should use the same character set as the document itself. That is,
if the document uses ISO 8859-1, the URLs will be in ISO 8859-1, and
if the document is in UTF-8, the URLs will be in UTF-8.

If the browser knows how to handle the character set of the HTML document,
it should also know how to translate the embedded URLs into UTF-8 when
the user follows a link.
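
As a rough sketch of that translation step (not taken from the draft; the
helper name link_to_utf8_url and the sample path are made up for
illustration, and Python is only one possible choice), a client could decode
the link using the document's character set and then emit hex-encoded UTF-8:

```python
from urllib.parse import quote


def link_to_utf8_url(raw_link: bytes, document_charset: str) -> str:
    """Hypothetical helper: re-encode a link from a document as hex-encoded UTF-8."""
    # Interpret the link bytes in the character set of the enclosing document.
    text = raw_link.decode(document_charset)
    # quote() encodes non-ASCII characters as UTF-8 and %XX-escapes each byte;
    # "/" is kept as-is so the path structure survives.
    return quote(text, safe="/", encoding="utf-8")


# The same logical name, as it would appear in two differently encoded documents.
name = "/files/sm\u00f6rg\u00e5sbord.html"
latin1_link = name.encode("iso-8859-1")
utf8_link = name.encode("utf-8")

print(link_to_utf8_url(latin1_link, "iso-8859-1"))
print(link_to_utf8_url(utf8_link, "utf-8"))
```

Both calls should yield the same request URL, which is the point: the
document may carry the link in its own character set, while what goes out
on the wire is always UTF-8.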

In general, URLs used without a context that defines their character set
should be encoded using UTF-8. URLs used within a context where the
meaning of the characters is defined should use the character encoding
of that context.

    Dan