Re: revised "generic syntax" internet draft

Gary Adams - Sun Microsystems Labs BOS <> Wed, 16 April 1997 23:55 UTC

Date: Wed, 16 Apr 1997 10:44:21 -0400
From: Gary Adams - Sun Microsystems Labs BOS <>
Message-Id: <199704161444.KAA03396@zeppo.East.Sun.COM>
Subject: Re: revised "generic syntax" internet draft
Precedence: bulk

> From: "Roy T. Fielding" <fielding@kiwi.ICS.UCI.EDU>
> >If the encoding is labeled (or known to be UTF8), then the magazine
> >could publish either native character representation or a %HH escaped
> >URL. Similarly the browser could support input of native characters
> >or a %HH escaped URL. Finally, the %HH escaped UTF8 URL is transmitted
> >to the server and converted for use in accessing the local resource.
> The magazine could also just publish the native character representation
> and assume that the reader's browser is set up to use the same charset
> encoding as the server.  OTOH, the standard could say that when a URL
> is entered from a source that has no charset, use UTF-8.  The question is
> really about what is the most likely charset used by the server.
> This is the crux of the problem.

The problem with native character representations is that they are
often platform-specific: for example, EUC-JP on the Unix http server, SJIS
on the PC clients, JIS through the mail system, and soon UTF8
on all the Java components and NFS v4 servers (wishful thinking).
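
To make the platform dependence concrete, here is a minimal sketch in Python. The sample name and the mapping of "SJIS" and "JIS" onto Python's shift_jis and iso2022_jp codecs are assumptions for illustration; nothing here comes from any particular server.

```python
from urllib.parse import quote

# Hypothetical sample path segment (the Japanese word for "Japanese language").
name = "日本語"

# Assumption: Python codec names standing in for the encodings named above.
charsets = ["euc_jp", "shift_jis", "iso2022_jp", "utf-8"]

def escaped_forms(text, encodings):
    """Return {charset: %HH-escaped form} for the same text."""
    return {enc: quote(text.encode(enc), safe="") for enc in encodings}

forms = escaped_forms(name, charsets)
for enc, url in forms.items():
    print(f"{enc:12} /{url}")
```

The same three characters yield four different %HH strings, so a URL typed on one platform need not match the octets the server stored.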

The only places where a safe exchange is taking place today are
within homogeneous networks (all Unix, all Windows, or all Mac),
or in places where a single national character encoding
has been prescribed by law.

I do agree with you that the crux of the problem today has to do
with what the server can grok and what it expects its underlying
services to grok. Since URLs are opaque, they are safe to pass
around, and only the URL generator can be certain about what
the contents really mean.
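
A small sketch of that opacity, assuming a hypothetical escaped path that some UTF8 aware generator produced. The octets survive transport unchanged, but a receiver that guesses the wrong charset either reads different characters or rejects the bytes outright.

```python
from urllib.parse import unquote_to_bytes

# Hypothetical escaped path, as a UTF-8 aware generator would emit "日本語".
escaped = "%E6%97%A5%E6%9C%AC%E8%AA%9E"
octets = unquote_to_bytes(escaped)

def interpret(raw, charset):
    """Decode raw octets under a guessed charset; None if invalid in it."""
    try:
        return raw.decode(charset)
    except UnicodeDecodeError:
        return None

# Only the first guess matches what the generator meant.
for guess in ("utf-8", "euc_jp", "shift_jis", "latin-1"):
    print(guess, repr(interpret(octets, guess)))
```

Passing the escaped string around is always safe; interpreting it is safe only when the reader shares the generator's charset.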

> If a browser assumes that the server is using UTF-8 and transcodes the
> non-ASCII octets before submission to the server, then bad things happen
> if the server is not using UTF-8.  The nature of the "bad things" range
> from disallowed access to invalid form data entry.  Since it is not
> possible for us to require all servers to be upgraded, it is not safe
> for browsers to perform transcoding of URLs, and therefore it is impossible
> to deploy a solution that requires UTF-8 transcoding UNLESS that decision
> is based on the URL scheme.
> Likewise, a server often acts as a gateway for some parts of its namespace,
> as is the case for CGI scripts and API modules like mod_php, and other
> parts of its namespace are derived from filesystem names.  On a server
> like Apache, the filesystem-based URLs are generated by url-encoding all
> non-urlc bytes without concern for the filesystem charset.  While it is
> theoretically possible for the server to edit all served content such
> that URLs are identified and transcoded to UTF-8, that would assume that
> the server knows what charset is used to generate those URLs in the
> first place.  It can't use a single configuration table for all transcoding,
> since the URLs may be generated from sources with varying charsets.
> The bottom line is that a server cannot enforce UTF-8 encoding unless
> it knows that all of its URLs and gateways use a common charset, and if
> that were the case we wouldn't need a UTF-8 solution.
> I listed out the solution space in the hope that people would see the
> trade-offs.  We know that all-ASCII URLs *interoperate* well on the
> Internet, but we also know that they can be ugly.  We know that existing
> systems will accept non-ASCII URLs if the charset matches that used by
> the URL generator/interpreter on the server.  We also know that most
> existing, deployed servers are not restricted to generating UTF-8
> encoded URLs.

So, since I'm looking for a solution to the end-to-end problem, here's
a proposal that I think you might see as viable.
Without changing the definition of URLs, we simply define the next version
of a particular URL scheme (or a new scheme) which includes the constraint
or feature that the %HH escaped characters were generated by a UTF8 aware
service. Clients could then take advantage of this updated information in
determining how to present the URL or how to accept URL input:

  GET   /%HH%HH HTTP/1.2

In other words, an NFS v4 filesystem would commit to Unicode for externally
visible character strings. A Java based web server would also support an httpu
scheme URL or an http version 1.2 transaction for Unicode based pathnames. The
syntax is the same, but the semantics are more clearly specified.
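
A minimal sketch of how a server might act on this, in Python. HTTP/1.2 and the httpu scheme exist only in the proposal above, and the function names here are made up for illustration. The rule is simply: decode the %HH octets as UTF8 when the version or scheme promises it, and otherwise leave them opaque.

```python
from urllib.parse import unquote_to_bytes

# Assumption: per the proposal, HTTP/1.2 and the "httpu" scheme (both
# hypothetical) promise that %HH escapes carry UTF-8 octets.
UTF8_SCHEMES = {"httpu"}
UTF8_MIN_VERSION = (1, 2)

def parse_request_line(line):
    """Split 'GET /path HTTP/x.y' into (method, target, (x, y))."""
    method, target, proto = line.split()
    major, minor = proto.split("/")[1].split(".")
    return method, target, (int(major), int(minor))

def interpret_path(target, version, scheme="http"):
    """Return str when a UTF-8 contract applies, else the opaque bytes."""
    raw = unquote_to_bytes(target)
    if scheme in UTF8_SCHEMES or version >= UTF8_MIN_VERSION:
        # Raises UnicodeDecodeError if the sender broke the contract.
        return raw.decode("utf-8")
    return raw

_, target, version = parse_request_line(
    "GET /%E6%97%A5%E6%9C%AC%E8%AA%9E HTTP/1.2")
print(interpret_path(target, version))
```

The request syntax is unchanged; only the branch on version or scheme adds meaning to the escaped octets.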

> In a perfect world, requiring UTF-8 would be a valid solution.  But this
> is not a perfect world!  The purpose of an Internet standard is to define
> the requirements for interoperability between implementations of the
> applicable protocol.  A solution that requires UTF-8 will fail to interoperate
> with systems that do not require UTF-8, and the latter is the case for
> most URL-based systems on the Internet today.

As far as the versioning problem is concerned, a server can always speak
a lower version protocol and a client can always rely on proxy services
to perform non-local protocol requests.

   Client	Server
   http 1.1	http 1.1	(status quo)
   http 1.1	http 1.2	(server provides %hh utf8 URLs, but the client
				 doesn't know how to exploit that information)
   http 1.2	http 1.1	(client knows utf-8 url input methods,
				 but must deliver raw %hh urls)
   http 1.2	http 1.2 	(client and server have a contract about the 
				 utf8 url contents)

Or alternatively,

   Client	Proxy	Server
   httpu_proxy	httpu	httpu 1.0 (a unicode http url scheme, with client
			           designated proxy agent)
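
The client/server version table above can be sketched as a small Python function. The version numbers come from the proposal, not from any deployed protocol, and the labels and the "lower of the two versions" fallback are my own shorthand for the four rows.

```python
# Hypothetical first version whose %HH escapes are promised to be UTF-8.
UTF8_AWARE = (1, 2)

def url_contract(client, server):
    """Classify a client/server version pair into one of the four rows."""
    client_aware = client >= UTF8_AWARE
    server_aware = server >= UTF8_AWARE
    if client_aware and server_aware:
        return "utf8-contract"              # both sides agree on UTF-8 contents
    if server_aware:
        return "server-utf8-unused"         # client cannot exploit the information
    if client_aware:
        return "client-sends-raw-escapes"   # client must deliver raw %HH URLs
    return "status-quo"

def negotiated_version(client, server):
    """Assumption: both sides fall back to the lower of the two versions."""
    return min(client, server)

for pair in [((1, 1), (1, 1)), ((1, 1), (1, 2)),
             ((1, 2), (1, 1)), ((1, 2), (1, 2))]:
    print(pair, url_contract(*pair), negotiated_version(*pair))
```

Only the last row gives both ends a shared understanding of the escaped octets; the other three degrade to today's behaviour.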

>  ...Roy T. Fielding
>     Department of Information & Computer Science
>     University of California, Irvine, CA 92697-3425    fax:+1(714)824-1715

(Sorry if this message is a bit cryptic; one eye on the screen and
 one eye on my 2yr old. Scotch tape and cats really don't go together :-).