Re: Using UTF-8 for non-ASCII Characters in URLs

"Martin J. Duerst" <> Fri, 02 May 1997 16:41 UTC

Date: Fri, 02 May 1997 18:19:43 +0200
From: "Martin J. Duerst" <>
To: Larry Masinter <>
cc: "Michael Kung <MKUNG.US.ORACLE.COM>" <>,
Subject: Re: Using UTF-8 for non-ASCII Characters in URLs
In-Reply-To: <>
Message-ID: <Pine.SUN.3.96.970502180918.245k-100000@enoshima>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset="US-ASCII"
Precedence: bulk

On Tue, 29 Apr 1997, Larry Masinter wrote:

> This isn't just a "small point", it's essential:
> The only way to guarantee "round trip" is to stick to the smallest
> repertoire of characters.

Yes. But it has to be qualified. It is the smallest set of
characters that you think your target audience is safely
able to distinguish and handle.

> Clearly you shouldn't enter "http" as
> wide characters,

That goes without saying, doesn't it? Or a browser could
convert it to half-width characters (as a courtesy to the user,
not as part of any spec).
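
A minimal sketch of such a courtesy conversion, assuming the input
is a Unicode string and that only the full-width ASCII variants
(U+FF01 to U+FF5E) are to be folded. The function name
fold_fullwidth is hypothetical, and nothing here is part of any
spec:

```python
def fold_fullwidth(text: str) -> str:
    """Map full-width ASCII variants (U+FF01..U+FF5E) to plain ASCII.

    Hypothetical helper: every other character, including the
    ideographic space U+3000, is left untouched.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if 0xFF01 <= cp <= 0xFF5E:
            # The full-width block mirrors ASCII 0x21..0x7E at a fixed offset.
            out.append(chr(cp - 0xFEE0))
        else:
            out.append(ch)
    return "".join(out)


# Full-width "http://" written with escapes so this listing stays ASCII.
wide_scheme = "\uff48\uff54\uff54\uff50\uff1a\uff0f\uff0f"
print(fold_fullwidth(wide_scheme))
```

Folding by a fixed offset touches only the full-width ASCII range,
so other non-ASCII characters in the URL stay as they were typed.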

> and if you have 'wide characters' that need
> to be distinguished from ascii characters, you should encode them
> in hex-encoded-UTF8 always.

I think we have to distinguish two cases:

The case that the URL is just used as a carrier for transporting
information from point to point (FORM/QUERY): In this case,
both hex-encoded and 8-bit UTF-8 will work, as the binary
world is never left (but we know there are other problems with
queries; I am working towards a draft about them).
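
To illustrate why both forms work in this case, here is a minimal
sketch using Python's standard urllib.parse module. The sample
value and the variable names are assumptions chosen for
illustration only:

```python
from urllib.parse import quote, unquote_to_bytes

# Sample query value with one non-ASCII character (u with diaeresis).
value = "D\u00fcrst"

# 8-bit form: the raw UTF-8 octets of the value.
raw_utf8 = value.encode("utf-8")

# Hex-encoded form: each non-ASCII UTF-8 octet becomes %HH.
hex_encoded = quote(value, safe="", encoding="utf-8")
print(hex_encoded)

# Undoing the %HH escapes yields the same octets as the 8-bit form,
# so a receiver that stays in the binary world sees identical data.
assert unquote_to_bytes(hex_encoded) == raw_utf8
assert unquote_to_bytes(raw_utf8) == raw_utf8
```

Either way the receiver ends up with the same octet sequence, which
it can then decode as UTF-8.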

The case that URLs are passed around, on paper and so on: In this
case, using %HH as a backup mechanism works, but it is no fun.
As there may be target audiences that can very well (actually
too well :-) distinguish between half-width and full-width
variants (e.g. East Asian programmers), it may very well be
possible to issue such URLs for such audiences. That's why,
for such cases, I specify neither equivalence nor normalization
between the variants, but I strongly discourage the use of
full-width variants in URLs because they cannot be safely
distinguished by a wider audience.

Regards,	Martin.