Re: UTF-8 and URLs

"Martin J. Duerst" <> Sun, 27 April 1997 16:31 UTC

Date: Sun, 27 Apr 1997 18:16:02 +0200
From: "Martin J. Duerst" <>
To: Francois Yergeau <>
Subject: Re: UTF-8 and URLs
In-Reply-To: <>
Message-Id: <Pine.SUN.3.96.970427175721.245P-100000@enoshima>

On Fri, 25 Apr 1997, Francois Yergeau wrote:

> At 00:25 25-04-97 -0500, Dan Connolly wrote:
> >> Let's see: we would have an i18n RFC that would allow URLs to contain most
> >> any characters, and a (possibly Draft) standard that would say "All URLs
> >> consist of a restricted set of characters..." (we know which): clear
> >> contradiction.
> >
> >Please don't cite out of context or paraphrase wildly. The _existing_
> >RFC limits the characters in URLs. In fact, the UTF-8-in-%XX encoding
> >proposal doesn't even change that: it just adds semantics to the syntax.
> I'm sorry, but I see it differently: the UTF-8-in-%XX proposal doesn't add
> octet values on-the-wire, but it adds, and correctly maps, thousands of
> characters.

It can be seen in different ways. For some of the issues discussed
in the syntax draft, in particular everything about relative URL processing,
it is indeed just semantics and doesn't interfere. On the other hand,
the current draft contains many explanations about the relation between
represented characters, octets, and URL characters. Somebody studying
it will greatly benefit from being told about the limitations of the
model that the current draft assumes, and from being told the direction
that is being taken to change the model and eliminate the deficiencies.

Also, the UTF-8-in-%XX proposal, which strictly requires %XX, is indeed
just an addition of semantics. However, once it is clear how these
semantics are added, the next step is obvious: remove the %XX requirement
and extend the URL character set to most of the Universal Character
Set (excluding compatibility characters and the like).
If URLs were closely similar to MIME headers, we could say that
this is a transparent user-interface issue. But URLs are also written
on paper, and we agree that transcribing long %XX sequences is a great
pain for those who know the actual characters, so the situation is
different.
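
To make the mapping concrete, here is a minimal Python sketch of
UTF-8-in-%XX as discussed above: characters are encoded as UTF-8 octets,
and each octet outside a small ASCII set is written as %XX. It assumes
the standard library's urllib.parse; the helper names and the sample
strings are my own illustrations, not taken from either draft.

```python
from urllib.parse import quote, unquote


def utf8_percent_encode(text: str) -> str:
    """Hypothetical helper: encode text as UTF-8, then escape the octets as %XX.

    With safe="" only ASCII letters, digits and a few marks stay literal;
    every other octet becomes a %XX triplet.
    """
    return quote(text, safe="", encoding="utf-8")


def utf8_percent_decode(escaped: str) -> str:
    """Hypothetical helper: turn %XX triplets back into octets, read as UTF-8."""
    return unquote(escaped, encoding="utf-8", errors="strict")


# Illustrative inputs only (assumed, not from the drafts).
for sample in ("Dürst", "日本語"):
    escaped = utf8_percent_encode(sample)
    # The escaped form stays within ASCII, so no new octet values
    # appear on the wire.
    assert escaped.isascii()
    # The mapping is reversible when both ends agree on UTF-8.
    assert utf8_percent_decode(escaped) == sample
    print(f"{sample!r} ({len(sample)} chars) -> {escaped} ({len(escaped)} chars)")
```

The second sample shows the transcription problem: three characters
become 27 characters of %XX once escaped.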

I originally proposed the addition of UTF-8-in-%XX to the current
draft as an important first step towards fully international URLs,
based on experience with the URN compromise. But UTF-8-in-%XX is
only the first step, and because we already know the next steps,
we definitely should tell this to the reader of the syntax draft,
whether in the form of a fully reworked draft or (probably
preferable) in the form of a note discussing future developments.

Regards,	Martin.