UTF-8 and URLs

Larry Masinter <masinter@parc.xerox.com> Thu, 24 April 1997 17:43 UTC

Received: from cnri by ietf.org id aa07641; 24 Apr 97 13:43 EDT
Received: from services.Bunyip.Com by CNRI.Reston.VA.US id aa16524; 24 Apr 97 13:43 EDT
Received: (from daemon@localhost) by services.bunyip.com (8.8.5/8.8.5) id MAA14405 for uri-out; Thu, 24 Apr 1997 12:58:49 -0400 (EDT)
Received: from mocha.bunyip.com (mocha.Bunyip.Com [192.197.208.1]) by services.bunyip.com (8.8.5/8.8.5) with SMTP id MAA14400 for <uri@services.bunyip.com>; Thu, 24 Apr 1997 12:58:36 -0400 (EDT)
Received: from alpha.Xerox.COM by mocha.bunyip.com with SMTP (5.65a/IDA-1.4.2b/CC-Guru-2b) id AA21477 (mail destined for uri@services.bunyip.com); Thu, 24 Apr 97 12:58:34 -0400
Received: from casablanca.parc.xerox.com ([13.2.16.111]) by alpha.xerox.com with SMTP id <18017(3)>; Thu, 24 Apr 1997 09:57:26 PDT
Received: from bronze.parc.xerox.com ([13.1.100.114]) by casablanca.parc.xerox.com with SMTP id <72455>; Thu, 24 Apr 1997 09:57:01 PDT
Message-Id: <335F90D8.6EDB@parc.xerox.com>
Date: Thu, 24 Apr 1997 09:56:56 -0700
From: Larry Masinter <masinter@parc.xerox.com>
Organization: Xerox PARC
X-Mailer: Mozilla 3.01Gold (Win95; I)
Mime-Version: 1.0
To: John C Klensin <klensin@mci.net>
Cc: uri@bunyip.com
Subject: UTF-8 and URLs
References: <SIMEON.9704240851.W@tp7.Jck.com>
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Sender: owner-uri@bunyip.com
Precedence: bulk

John,

Your clarification didn't help me. And the sticking point
for me is that "as a sequence of glyphs" is an important 
part of the transport of URLs, whether those glyphs are
on paper or on the screen, and that the octet->glyph
and glyph->octet route is really error-prone.

I think to actually solve the problem of Internationalization
of URLs we need three recommendations:

a) If you're writing software that displays URLs to users,
   then
    1) any 'forbidden' octets should be displayed as if
      they were UTF-8 encoded characters. That is, those
      octets are currently disallowed in URLs, but if you
      see them, display them in a standard way.
    2) Any sequences of %HH-encoded octets should be displayed
       EITHER as <%><H><H>, e.g., just show the encoding
       in ASCII, OR by assuming that they're hex-encoded
       UTF-8. The latter assumption is likely to be wrong
       for now, but might change later.

b) If you're writing software that lets users type in URLs,
   then if the user types in any character that isn't legal
   in a URL, encode the character as hex-encoded UTF-8. For
   Japanese, avoid using double-wide characters. For RTL
   scripts such as Hebrew or Arabic, leave out any direction
   changes and encode the characters in logical, not presentation
   order.

   Since there haven't been any standards for non-ASCII character
   representations, this is as good a choice as any.
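
Recommendation (b) could be sketched in Python roughly as follows.
The names encode_typed_url, URL_LEGAL and BIDI_CONTROLS are
hypothetical. Two assumptions are built in: the set of characters
treated as already legal in a URL is only one plausible choice, and
"direction changes" is read here as the Unicode directional
formatting characters. The sketch does nothing about double-wide
characters, and it relies on the input already being in logical
order, which is how Python strings normally hold typed text.

```python
import string

# Assumption: characters treated as already legal in a typed URL.
URL_LEGAL = set(string.ascii_letters + string.digits + "-_.~!*'();:@&=+$,/?#[]%")

# Assumption: "direction changes" means these directional formatting
# characters (LRM, RLM, LRE, RLE, PDF, LRO, RLO).
BIDI_CONTROLS = {"\u200e", "\u200f", "\u202a", "\u202b", "\u202c", "\u202d", "\u202e"}


def encode_typed_url(typed: str) -> str:
    """Encode characters that are not legal in a URL as hex-encoded UTF-8."""
    out = []
    for ch in typed:
        if ch in BIDI_CONTROLS:
            continue  # leave out direction changes entirely
        if ch in URL_LEGAL:
            out.append(ch)
        else:
            # One %HH escape per octet of the character's UTF-8 encoding.
            out.extend(f"%{octet:02X}" for octet in ch.encode("utf-8"))
    return "".join(out)
```

For example, encode_typed_url("http://example.com/café") gives
"http://example.com/caf%C3%A9".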

c) If you're writing software that generates URLs to be
   interpreted later, then use hex-encoded UTF-8 for the
   encoding to generate, and accept either the raw UTF-8
   or the hex-encoded version as identifying the same resource.
   This is a recommendation for HTTP servers and FTP servers
   and a variety of other implementations.
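
For the "accept either" half of (c), one possible approach is for a
server to reduce both forms to the same key before looking up the
resource. The Python sketch below does this by hex-encoding any raw
octet of 0x80 or above and uppercasing the hex digits of existing
escapes. The names canonical_url_octets and same_resource are
hypothetical, and the sketch assumes the URL arrives as raw octets; it
does not check that those octets are valid UTF-8.

```python
import re

_HEX_ESCAPE = re.compile(rb"%([0-9A-Fa-f]{2})")


def canonical_url_octets(url: bytes) -> bytes:
    """Return `url` with raw non-ASCII octets hex-encoded."""
    # Uppercase the hex digits of escapes that are already present,
    # so %c3 and %C3 compare equal.
    url = _HEX_ESCAPE.sub(lambda m: b"%" + m.group(1).upper(), url)
    out = bytearray()
    for octet in url:
        if octet >= 0x80:
            out += ("%%%02X" % octet).encode("ascii")
        else:
            out.append(octet)
    return bytes(out)


def same_resource(a: bytes, b: bytes) -> bool:
    """True if `a` and `b` differ only in raw versus hex-encoded octets."""
    return canonical_url_octets(a) == canonical_url_octets(b)
```

Under these assumptions, the raw UTF-8 octets of "/café" and the
ASCII string "/caf%c3%a9" both reduce to b"/caf%C3%A9".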

These three recommendations affect software from a large number
of different producers. To make progress in the community,
those software implementors will need to agree that this is
the best solution to interoperability of URLs internationally.

I think that, given their likely controversial nature, we should
make these recommendations in a separate RFC, and perhaps with
a new working group.

I'm willing to put this all down in a separate internet draft,
if it will help focus the process on actually making progress.
Some of the examples that have been sent out to the mailing list
will be useful to guide the recommendations in the RFC.

Regards,

Larry
--
http://www.parc.xerox.com/masinter
