UTF-8 and URLs

Larry Masinter <masinter@parc.xerox.com> Thu, 24 April 1997 17:43 UTC

Message-Id: <335F90D8.6EDB@parc.xerox.com>
Date: Thu, 24 Apr 1997 09:56:56 -0700
From: Larry Masinter <masinter@parc.xerox.com>
Organization: Xerox PARC
Mime-Version: 1.0
To: John C Klensin <klensin@mci.net>
Cc: uri@bunyip.com
Subject: UTF-8 and URLs
References: <SIMEON.9704240851.W@tp7.Jck.com>
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Sender: owner-uri@bunyip.com
Precedence: bulk

John,

Your clarification didn't help me. The sticking point
for me is that "as a sequence of glyphs" is an important
part of the transport of URLs, whether those glyphs are
on paper or on the screen, and that the octet->glyph
and glyph->octet route is really error-prone.

I think that to actually solve the problem of internationalization
of URLs we need three recommendations:

a) If you're writing software that displays URLs to users,
   then
    1) any 'forbidden' octets should be displayed as if
      they were UTF-8 encoded characters. That is, those
      octets are currently disallowed in URLs, but if you
      see them, display them in a standard way.
    2) Any sequences of %HH-encoded octets should be displayed
       EITHER as <%><H><H>, i.e., just show the encoding
       in ASCII, OR by assuming that they're hex-encoded
       UTF-8. The latter assumption is likely to be wrong
       for now, but might change later.

b) If you're writing software that lets users type in URLs,
   then if the user types in any character that isn't legal
   in a URL, encode the character as hex-encoded UTF-8. For
   Japanese, avoid using double-wide characters. For RTL
   scripts such as Hebrew or Arabic, leave out any direction
   changes and encode the characters in logical, not presentation
   order.

   Since there haven't been any standards for non-ASCII character
   representations, this is as good a choice as any.

c) If you're writing software that generates URLs to be
   interpreted later, then use hex-encoded UTF-8 for the
   encoding to generate, and accept either the raw UTF-8
   or the hex-encoded version as identifying the same resource.
   This is a recommendation for HTTP servers and FTP servers
   and a variety of other implementations.
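
A minimal sketch of recommendations a.2, b and c, in Python, may help
make them concrete. It is only an illustration under stated
assumptions: the function names (encode_typed_url, display_url,
same_resource) and the SAFE character set are hypothetical and do not
come from any draft, and recommendation a.1 and the bidirectional-text
rules in b are not covered.

```python
import re
from urllib.parse import quote, unquote_to_bytes

# Assumption: characters left as typed. "%" is included so that
# escapes the user already typed are not escaped a second time.
SAFE = "/:?#[]@!$&'()*+,;=%"

# Runs of %HH escapes whose octets are >= 0x80 (candidate UTF-8).
_HIGH_ESCAPES = re.compile(r"(?:%[89A-Fa-f][0-9A-Fa-f])+")
_ANY_ESCAPE = re.compile(r"%[0-9A-Fa-f]{2}")


def encode_typed_url(text: str) -> str:
    """Recommendation b: hex-encode non-URL characters as UTF-8."""
    return quote(text, safe=SAFE, encoding="utf-8")


def display_url(url: str) -> str:
    """Recommendation a.2: show escapes as UTF-8 when they decode,
    otherwise leave them as plain <%><H><H> in ASCII."""
    def show(match):
        run = match.group(0)
        try:
            return unquote_to_bytes(run).decode("utf-8")
        except UnicodeDecodeError:
            return run  # not valid UTF-8: keep the ASCII escapes
    return _HIGH_ESCAPES.sub(show, url)


def same_resource(a: str, b: str) -> bool:
    """Recommendation c: raw UTF-8 and hex-encoded UTF-8 forms are
    treated as naming the same resource."""
    def canonical(url):
        encoded = encode_typed_url(url)
        # Hex digits are case-insensitive, so compare in upper case.
        return _ANY_ESCAPE.sub(lambda m: m.group(0).upper(), encoded)
    return canonical(a) == canonical(b)
```

With these definitions, a typed "caf" followed by e-acute becomes
caf%C3%A9, display_url turns caf%C3%A9 back into the accented form,
and an escape such as %E9 (Latin-1, not valid UTF-8 on its own) is
left displayed as %E9.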

These three recommendations affect software from a large number
of different producers. To make progress in the community,
those software implementors will need to agree that this is
the best solution to interoperability of URLs internationally.

Given their likely controversial nature, I think we should clearly
make these recommendations in a separate RFC, and perhaps with
a new working group.

I'm willing to put this all down in a separate internet draft,
if it will help focus the process on actually making progress.
Some of the examples that have been sent out to the mailing list
will be useful to guide the recommendations in the RFC.

Regards,

Larry
--
http://www.parc.xerox.com/masinter

    
