Re: Using UTF-8 for non-ASCII Characters in URLs

Larry Masinter <masinter@parc.xerox.com> Wed, 30 April 1997 08:23 UTC

Message-ID: <3366FC1B.EA8@parc.xerox.com>
Date: Wed, 30 Apr 1997 01:00:27 -0700
From: Larry Masinter <masinter@parc.xerox.com>
Organization: Xerox PARC
X-Mailer: Mozilla 3.01Gold (Win95; I)
MIME-Version: 1.0
To: Dan Oscarsson <Dan.Oscarsson@trab.se>
CC: uri@bunyip.com
Subject: Re: Using UTF-8 for non-ASCII Characters in URLs
References: <199704300652.IAA09984@valinor.malmo.trab.se>
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Sender: owner-uri@bunyip.com
Precedence: bulk

Dan,

> This is not right. A directory listing service generates a html document
> that is sent back to the web browser. All URLs within a html document
> should use the same character set as the document uses. That is, 
> if the document uses iso 8859-1, the URLs will be in iso 8859-1, and
> if the document is in UTF-8, the URLs will be in UTF-8.

Dan, for each item in a directory listing, there are two entries.

<A HREF="this-is-the-URL">this-is-what-the-user-sees</A>

The URL in the 'this-is-the-URL' part should use hex-encoded-UTF8,
no matter what the user sees.
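
A minimal sketch of that split, in Python. The helper name
directory_entry and the sample file name are made up for illustration;
only urllib.parse.quote is a real library call. The HREF is always
hex-encoded UTF-8, while the visible text follows the page's charset.

```python
from urllib.parse import quote

def directory_entry(filename: str, page_charset: str) -> bytes:
    """Build one listing entry (hypothetical helper, for illustration only)."""
    # The URL part: UTF-8 bytes of the name, each non-ASCII byte written as %XX.
    href = quote(filename, safe="", encoding="utf-8")
    # The part the user sees: the plain name, in whatever charset the page uses.
    html = f'<A HREF="{href}">{filename}</A>'
    return html.encode(page_charset)

# Same HREF in both pages; only the visible text differs in its bytes.
latin1_page = directory_entry("Malm\u00f6.html", "iso-8859-1")
utf8_page = directory_entry("Malm\u00f6.html", "utf-8")
```

In both results the HREF reads Malm%C3%B6.html; the o-umlaut in the
visible text is one byte in the ISO 8859-1 page and two in the UTF-8 page.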

I'll try to make clear that the recommendation for how URLs should
be processed applies only to the URLs themselves, not to anything
else in the document.

> If the browser knows how to handle the character set of the html document,
> it also should know how to translate the embedded URLs into UTF-8 when
> the user follows a link.

I think you've missed the whole point. A browser that knows
only ISO-8859-1 and KOI-8 can still process directory
listings from servers that have files whose file names
are in Japanese, because the URLs themselves are plain ASCII.
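
One way to read this point, sketched in Python under stated assumptions:
the helper names href_for and survives_charset are hypothetical, and the
codec names "iso-8859-1" and "koi8_r" stand in for the two charsets the
browser is said to know. The sketch only checks that a hex-encoded UTF-8
URL is pure ASCII, so it passes unchanged through either charset and the
server can still recover the Japanese name.

```python
from urllib.parse import quote, unquote

def href_for(filename: str) -> str:
    """Hex-encode the UTF-8 bytes of a file name (hypothetical helper)."""
    return quote(filename, safe="", encoding="utf-8")

def survives_charset(href: str, charset: str) -> bool:
    """True if the URL has the same bytes in `charset` as in ASCII."""
    return href.encode(charset) == href.encode("ascii")

# A Japanese file name, written with escapes so this text stays 7-bit.
name = "\u65e5\u672c\u8a9e.txt"
href = href_for(name)  # "%E6%97%A5%E6%9C%AC%E8%AA%9E.txt"

# The browser needs no Japanese charset to carry the URL around.
ok_latin1 = survives_charset(href, "iso-8859-1")
ok_koi8 = survives_charset(href, "koi8_r")

# The server, which does know its own file names, undoes the encoding.
recovered = unquote(href, encoding="utf-8")
```

The browser may not be able to display the visible name, but it can
still send the URL back byte for byte when the user follows the link.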

> In general, URLs used without a context that defines the characters used,
> should be encoded using UTF-8. URLs used within a context where the
> meaning of the characters is defined should use the character encoding
> of the context.

I suppose you're entitled to the opinion that that's how they "should"
be encoded, but this is a different recommendation from those being
promoted by others on this mailing list.

If you want to make a counter-proposal, you're free to do so, but
I don't think you have described anything that is actually workable.

Larry