Re: Using UTF-8 for non-ASCII Characters in URLs

Gary Adams - Sun Microsystems Labs BOS <> Wed, 30 April 1997 13:10 UTC

Date: Wed, 30 Apr 1997 08:37:40 -0400
From: Gary Adams - Sun Microsystems Labs BOS <>
Message-Id: <199704301237.IAA25725@zeppo.East.Sun.COM>
Subject: Re: Using UTF-8 for non-ASCII Characters in URLs
Precedence: bulk

> From: Dan Oscarsson <>
> > 
> > Dan, for each item in a directory listing, there are two entries.
> > 
> > <A HREF="this-is-the-URL">this-is-what-the-user-sees</A>
> > 
> > The URL in the 'this-is-the-URL' part should use hex-encoded-UTF8,
> > no matter what the user sees.
> > 
> If you use hex-encoding, yes. But NOT if you use the native character set
> of the document. In that case, the 'this-is-the-URL' part must
> use the same character set as the rest of the html document. Raw UTF-8
> may only be used in a UTF-8 encoded html document, not in an iso 8859-1
> encoded document.
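
As a rough illustration of the "hex-encoded-UTF8" form discussed above,
here is a minimal Python sketch. The file name is a made-up example, and
urllib.parse.quote is just one convenient way to produce the escapes; it
is not something the original posters referred to.

```python
from urllib.parse import quote, unquote

# Hypothetical file name containing non-ASCII characters ("resume" with
# two e-acute letters), written with escapes so the sketch itself stays ASCII.
name = "r\u00e9sum\u00e9.html"

# quote() encodes the string as UTF-8 by default, then writes each
# non-ASCII byte as %XX.
href = quote(name)  # 'r%C3%A9sum%C3%A9.html'

# The visible link text keeps the native characters; only the HREF is escaped.
anchor = '<A HREF="{}">{}</A>'.format(href, name)
print(anchor)
```

The HREF is now pure ASCII, so it reads the same whatever character set
the surrounding document is stored in.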

The document character set for HTML 2.0 and 3.2 was ISO 8859-1.
The document character set for HTML 4.0 and XML will be ISO 10646.
From what little I know about SGML, the document must be converted
to a single document character set before the SGML parser is
allowed to operate on the markup.

If I use a multilingual text editor to create my *ML documents
and "paste" a raw UTF-8 URL into the HREF field, the editor
either 'negotiates for the encoding information' from the
desktop clipboard service or assumes the sending application
is using the same encoding that it needs. So when I cut an
EUC-JP URL from my browser's "location" window and paste it
into my editor, it may just assume the bits are ISO 8859-1 characters.
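
The cut-and-paste failure described above can be sketched in a few lines
of Python. This is only an illustration under assumptions: the sample
text (the Japanese word for "Japanese language") and the variable names
are invented, and a real clipboard service is more involved than a pair
of encode/decode calls.

```python
# Hypothetical URL text: three kanji, written as escapes to keep this ASCII.
text = "\u65e5\u672c\u8a9e"

# What the sending application hands over: EUC-JP bytes, two per character.
euc_bytes = text.encode("euc_jp")

# What an editor sees if it assumes the bytes are ISO 8859-1 characters.
# Every byte value is a valid ISO 8859-1 character, so no error is raised.
guessed = euc_bytes.decode("iso-8859-1")

print(len(text), len(euc_bytes), len(guessed))
print(guessed == text)

# The damage is reversible only if the original encoding is known.
recovered = guessed.encode("iso-8859-1").decode("euc_jp")
print(recovered == text)
```

The editor gets six plausible-looking Latin characters instead of three
kanji, and nothing in the bytes themselves says that a mistake was made.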

For experimenting with combined document authoring/browsing
functions, the "w3 for emacs" browser and the "psgml-mode"
editor in XEmacs 20.0 (with MULE support) provide a good
platform.

> A large amount of html documents are hand written in a text editor. A user
> can not be expected to use a different encoding when typing the URLs
> in a document.

But they might have to use a different encoding when saving the file
to disk. And the document itself might be converted as it is saved
to disk. These are common functions in a multibyte plain text editor,
just as intelligent cut and paste functions are needed in a shared 
desktop environment.
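
To make the save-time conversion concrete, here is a small Python sketch
comparing a raw URL with a hex-encoded one when the same document text is
written out in two encodings. The file name is hypothetical, and the two
encodings are just examples of what an editor might offer.

```python
from urllib.parse import quote

# Hypothetical link target with one non-ASCII character (e-acute).
raw_href = "caf\u00e9.html"
escaped_href = quote(raw_href)  # 'caf%C3%A9.html'

# Bytes that land on disk for each choice of "save as" encoding.
raw_latin1 = raw_href.encode("iso-8859-1")
raw_utf8 = raw_href.encode("utf-8")
esc_latin1 = escaped_href.encode("iso-8859-1")
esc_utf8 = escaped_href.encode("utf-8")

print(raw_latin1 == raw_utf8)   # raw URL bytes depend on the save encoding
print(esc_latin1 == esc_utf8)   # escaped URL bytes do not
```

The raw form changes with the encoding chosen at save time, while the
escaped form is plain ASCII and comes out byte-for-byte identical.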

I think your point about "authoring URLs" within HTML documents with
a "plain text editor" is that the user will have a local input
method for entering native characters (e.g., compose key sequences,
virtual keyboard, radical composition) which will be operating
in the same manner for document text and for URL characters. Since the
authoring tools did not offer a means of recording the character encoding
information, a web server cannot tell which encoding was used
when the document is transmitted on the wire.
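
A short Python sketch of why the server cannot make that distinction:
the same unlabeled bytes decode without error under more than one
character set. The byte values below are an invented example, not taken
from any real document.

```python
# Two bytes as they might appear in an unlabeled document or URL.
wire_bytes = b"\xc3\xa9"

# Read as UTF-8: one character, e-acute.
as_utf8 = wire_bytes.decode("utf-8")

# Read as ISO 8859-1: two characters, A-tilde followed by a copyright sign.
as_latin1 = wire_bytes.decode("iso-8859-1")

# Both decodings succeed, so the bytes alone cannot settle which was meant.
print(len(as_utf8), len(as_latin1))
print(as_utf8 == as_latin1)
```

Without a recorded encoding label, either reading is legitimate, and the
receiver has to guess.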