Re: Using UTF-8 for non-ASCII Characters in URLs

Dan Oscarsson <Dan.Oscarsson@trab.se> Wed, 30 April 1997 09:20 UTC

Date: Wed, 30 Apr 1997 10:45:20 +0200 (MET DST)
From: Dan Oscarsson <Dan.Oscarsson@trab.se>
Message-Id: <199704300845.KAA10131@valinor.malmo.trab.se>
To: masinter@parc.xerox.com
Subject: Re: Using UTF-8 for non-ASCII Characters in URLs
Cc: uri@bunyip.com

> > This is not right. A directory listing service generates a html document
> > that is sent back to the web browser. All URLs within a html document
> > should use the same character set as the document uses. That is, 
> > if the document uses iso 8859-1, the URLs will be in iso 8859-1, and
> > if the document is in UTF-8, the URLs will be in UTF-8.
> 
> Dan, for each item in a directory listing, there are two entries.
> 
> <A HREF="this-is-the-URL">this-is-what-the-user-sees</A>
> 
> The URL in the 'this-is-the-URL' part should use hex-encoded-UTF8,
> no matter what the user sees.
> 

If you use hex-encoding, yes. But NOT if you use the native character set
of the document. In that case, the 'this-is-the-URL' part must
use the same character set as the rest of the HTML document. Raw UTF-8
may only be used in a UTF-8 encoded HTML document, not in an ISO 8859-1
encoded document.
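
To illustrate the distinction, here is a minimal Python sketch. The file
name is a hypothetical example, not one taken from this thread. It shows
the same name as raw ISO 8859-1 bytes, as raw UTF-8 bytes, and as
hex-encoded UTF-8, and what happens when raw UTF-8 bytes are read as
ISO 8859-1.

```python
from urllib.parse import quote

# Hypothetical file name containing o-umlaut (U+00F6) and a-ring (U+00E5).
name = "sm\u00f6rg\u00e5s.html"

# Raw bytes as they would appear in a document of each encoding.
latin1_bytes = name.encode("iso-8859-1")   # one byte per character
utf8_bytes = name.encode("utf-8")          # two bytes for each non-ASCII character

# Hex-encoded UTF-8 is pure ASCII, so it survives in a document of any encoding.
hex_utf8 = quote(name, encoding="utf-8")

# Raw UTF-8 bytes placed in an ISO 8859-1 document are misread by the reader.
misread = utf8_bytes.decode("iso-8859-1")

print(latin1_bytes)
print(utf8_bytes)
print(hex_utf8)
print(misread)
```

The hex-encoded form is the only one of the three that reads the same
regardless of the character set declared for the surrounding document.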

A large number of HTML documents are written by hand in a text editor.
A user cannot be expected to switch to a different encoding when typing
the URLs in a document.

But I agree that if hex-encoded characters are found in a URL, they
should be UTF-8; otherwise it would be unclear what encoding is used
for hex-encoded URLs in an ASCII-only HTML document. But an ASCII-only
document may not contain any 8-bit characters in a URL, as there is no
defined character set for them.
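
The rule for ASCII-only documents could be sketched as follows. This is
only an illustration under stated assumptions: the function name
decode_href is hypothetical, and rejecting invalid input with an
exception is one possible policy, not something proposed on the list.

```python
from urllib.parse import unquote


def decode_href(href: str) -> str:
    """Decode a URL taken from an ASCII-only document (hypothetical helper)."""
    # An ASCII-only document defines no character set for 8-bit characters,
    # so a raw non-ASCII character in the URL cannot be interpreted.
    if not href.isascii():
        raise ValueError("8-bit character in URL of an ASCII-only document")
    # Hex-encoded octets are interpreted as UTF-8; invalid sequences raise
    # UnicodeDecodeError instead of being silently replaced.
    return unquote(href, encoding="utf-8", errors="strict")


print(decode_href("sm%C3%B6rg%C3%A5s.html"))
```

With this policy, a URL such as "sm%F6rg%E5s.html" (hex-encoded
ISO 8859-1) is rejected, because the octets F6 and E5 do not form valid
UTF-8.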


As I understand it, others on the list want the same thing: native
encoding for URLs in a known context, and hex-encoded UTF-8 everywhere
else (and, if you prefer, in a known context as well). If we cannot use
the native encoding when typing URLs into our HTML documents, very
little is gained.

    Dan