Re: Using UTF-8 for non-ASCII Characters in URLs

"Martin J. Duerst" <mduerst@ifi.unizh.ch> Fri, 02 May 1997 16:41 UTC

Received: from cnri by ietf.org id aa27271; 2 May 97 12:41 EDT
Received: from services.Bunyip.Com by CNRI.Reston.VA.US id aa15113; 2 May 97 12:41 EDT
Received: (from daemon@localhost) by services.bunyip.com (8.8.5/8.8.5) id MAA06064 for uri-out; Fri, 2 May 1997 12:20:23 -0400 (EDT)
Received: from mocha.bunyip.com (mocha.Bunyip.Com [192.197.208.1]) by services.bunyip.com (8.8.5/8.8.5) with ESMTP id MAA06059 for <uri@services.bunyip.com>; Fri, 2 May 1997 12:20:20 -0400 (EDT)
Received: from josef.ifi.unizh.ch (josef.ifi.unizh.ch [130.60.48.10]) by mocha.bunyip.com (8.8.5/8.8.5) with SMTP id MAA21383 for <uri@bunyip.com>; Fri, 2 May 1997 12:20:15 -0400 (EDT)
Received: from enoshima.ifi.unizh.ch by josef.ifi.unizh.ch with SMTP (PP) id <15574-0@josef.ifi.unizh.ch>; Fri, 2 May 1997 18:19:44 +0200
Date: Fri, 02 May 1997 18:19:43 +0200
From: "Martin J. Duerst" <mduerst@ifi.unizh.ch>
To: Larry Masinter <masinter@parc.xerox.com>
cc: "Michael Kung <MKUNG.US.ORACLE.COM>" <MKUNG@us.oracle.com>, uri@bunyip.com
Subject: Re: Using UTF-8 for non-ASCII Characters in URLs
In-Reply-To: <3366C606.786A@parc.xerox.com>
Message-ID: <Pine.SUN.3.96.970502180918.245k-100000@enoshima>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset="US-ASCII"
Sender: owner-uri@bunyip.com
Precedence: bulk

On Tue, 29 Apr 1997, Larry Masinter wrote:

> This isn't just a "small point", it's essential:
> 
> The only way to guarantee "round trip" is to stick to the smallest
> repertoire of characters.

Yes. But it has to be qualified. It is the smallest set of
characters that you think your target audience is safely
able to distinguish and handle.


> Clearly you shouldn't enter "http" as
> wide characters,

That goes without saying, doesn't it? Otherwise, a browser could
convert it to half-width characters (as a courtesy to the user,
not as part of any spec).
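
A minimal sketch of such a courtesy conversion, in Python. It assumes
that Unicode NFKC compatibility normalization is an acceptable way to
fold full-width letters to their ASCII counterparts, and it touches only
the scheme. The function name fold_scheme is hypothetical, and nothing
here is mandated by any URL specification.

```python
import unicodedata

def fold_scheme(url: str) -> str:
    """Fold a full-width scheme (e.g. full-width "http") to ASCII.

    Only the part before the first colon is changed; the rest of the
    URL is returned untouched. Illustrative helper, not a spec rule.
    """
    for i, ch in enumerate(url):
        # Accept ASCII ":" or FULLWIDTH COLON (U+FF1A) as the separator.
        if ch in ":\uff1a":
            scheme = unicodedata.normalize("NFKC", url[:i]).lower()
            return scheme + ":" + url[i + 1:]
    return url  # no scheme separator found; leave the input unchanged

# Full-width "http" written with escapes: U+FF48 U+FF54 U+FF54 U+FF50
folded = fold_scheme("\uff48\uff54\uff54\uff50://example.com/")
```

With this input, folded is the plain ASCII string
"http://example.com/".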


> and if you have 'wide characters' that need
> to be distinguished from ascii characters, you should encode them
> in hex-encoded-UTF8 always.

I think we have to distinguish two cases:

The case that the URL is just used as a carrier for transporting
information from point to point (FORM/QUERY): In this case,
both hex-encoded and 8-bit UTF-8 will work, as the binary
world is never left (but we know there are other problems with
queries; I am working towards a draft about them).
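
To illustrate why both forms work in this case, here is a small Python
sketch showing that the %HH form and the raw 8-bit form carry the same
UTF-8 bytes. The sample string is an assumption chosen for illustration;
quote() and unquote_to_bytes() are from the standard urllib.parse
module, and quote() encodes as UTF-8 by default.

```python
from urllib.parse import quote, unquote_to_bytes

text = "caf\u00e9"                 # sample value with one non-ASCII letter

raw = text.encode("utf-8")         # 8-bit UTF-8 form: b"caf\xc3\xa9"
hexed = quote(text, safe="")       # hex-encoded form: "caf%C3%A9"

# Undoing the %HH escapes yields exactly the raw 8-bit bytes again,
# so the receiver sees the same data either way.
recovered = unquote_to_bytes(hexed)
same_bytes = recovered == raw
```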

The case that URLs are passed around, on paper and so on: In this
case, using %HH as a backup mechanism works, but it is no fun.
As there may be target audiences that can very well (actually
too well :-) distinguish between half-width and full-width
variants (e.g. East Asian programmers), it may very well be
possible to issue URLs containing such variants for such audiences.
That's why, for such cases, I specify neither equivalence nor
normalization, but I strongly discourage the use of these variants
because they cannot be safely distinguished by a wider audience.
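
A short Python sketch of what "no equivalence, no normalization" means
in practice: the half-width and full-width forms of a letter are
different characters with different UTF-8 bytes, so URLs that differ
only in width stay distinct. The example path is an assumption for
illustration.

```python
from urllib.parse import quote

half = "/A"          # U+0041 LATIN CAPITAL LETTER A
full = "/\uff21"     # U+FF21 FULLWIDTH LATIN CAPITAL LETTER A

half_hex = quote(half)   # ASCII letters are left as they are
full_hex = quote(full)   # the full-width letter becomes three %HH escapes

# Without a normalization step, the two paths never compare equal,
# although on paper they may look the same to many readers.
distinct = half_hex != full_hex
```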


Regards,	Martin.