Re: revised "generic syntax" internet draft

Gary Adams - Sun Microsystems Labs BOS <> Tue, 15 April 1997 19:39 UTC

Date: Tue, 15 Apr 1997 15:05:53 -0400
From: Gary Adams - Sun Microsystems Labs BOS <>
Reply-To: Gary Adams - Sun Microsystems Labs BOS <>
Subject: Re: revised "generic syntax" internet draft

> Date: Tue, 15 Apr 1997 10:30:53 PDT
> From: Larry Masinter <>
> To: Gary Adams - Sun Microsystems Labs BOS <Gary.Adams@East>
> CC:,,
> Subject: Re: revised "generic syntax" internet draft
> > Are there any "facts" still in need of investigation 
> > or are the only unresolved issues questions of "opinion"? (My opinion
> > is that the current system is already broken, if this could be 
> > substantiated would that invalidate the "status quo" as a viable
> > alternative?)
> At this point, I think we need not just "facts" but some
> actual "design". Exactly how does this all work in a way that
> actually solves the problem?

This is a fair challenge and it also addresses the point I was trying to make
about whether or not the current system is indeed broken, i.e. how would the
status quo address these difficult transitions in URL transcribability?

> Let's suppose someone wants to publish information
> about their product and put up a URL in a magazine.

Let's also consider two modes of operation. One targeted at a common language
centric group (monolingual) and another targeted to a wider multilingual audience.

e.g. (iso8859-1 text with images of Chinese text)
     (The link "Français" points to the URL "/french/"
      and "Español" points to the URL "/spanish/")

> a) what URLs do they support in their server?

Today the only option is to support a native encoding of the platform 
exposed via a URL which is opaque or ambiguous to the client application.
The proposal for UTF-8 %HH escaped URLs would provide a canonical external
representation of the URL.
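
A minimal sketch of what that canonical external form could look like, using
Python's standard urllib.parse module. The path "/français/résumé" and the
helper name to_utf8_escaped are made up for illustration and are not part of
the proposal text: the native characters are encoded as UTF-8 octets and each
octet that is not plain URL-safe ASCII is written as %HH.

```python
from urllib.parse import quote

def to_utf8_escaped(path: str) -> str:
    """Encode a path as UTF-8 and %HH-escape the octets that are not URL-safe ASCII."""
    # safe="/" leaves the hierarchy separator unescaped
    return quote(path, safe="/", encoding="utf-8")

# Hypothetical path, written with escapes so this source stays 7-bit clean
escaped = to_utf8_escaped("/fran\u00e7ais/r\u00e9sum\u00e9")
print(escaped)  # /fran%C3%A7ais/r%C3%A9sum%C3%A9
```

The result is 7-bit clean, so it survives print and mail, and any client that
knows the convention can turn it back into the same characters.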

> b) what gets printed in the magazine?

Assuming I can read the characters in the magazine in an unambiguous
fashion, I should also be able to enter the characters of the URL on
my local computer. If the name of the document and the content of the document
are in different languages, then choosing a name that reaches the largest
audience makes the most sense. e.g. phone numbers are more recognizable
in a language portability sense than advertisement style 1-800-CALL-NOW
mixed messages.

Without opening the metadata can of worms, the ideal printed form of a URL
would be completely unambiguous about the contents of what it promises
to deliver. e.g. language, encoding, time to live. Basically it is an
incomplete contract for a disconnected resource when it is placed in print.

	( "Content-Encoding" "ISO8859-5",
	  "Content-Language" "ru-lt" "Lithuanian Russian",
	  "Expires" "...",
	  "<URL:http://.../>" )

> c) what does the user type into the browser?

Since it was a Braille magazine, they probably type in the same Braille
characters. I'm not sure why the input and output questions are asked 
separately here? If the fonts on the screen or in the printed media are
different from the labels on the input device, then some tools will be
required to transcribe the information reliably. e.g. laser printers
should have corresponding OCR scanners if the information is not human
readable.

> d) what does the browser do with what the user typed
>    in order to turn it into the URL that was generated in (a).

Today the only alternative is for the server's platform specific encoding
to be %HH escaped as raw octets and published in the magazine. The user
enters the URL as a raw ASCII string, and it is transmitted to the server,
where it is %HH decoded and handed to the local data store. i.e., it is
only meaningful to the local server and is opaque to the magazine, the
end user and the browser.
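
To make the opacity concrete, here is a small sketch (Python, standard library
only). The URL "/fran%E7ais/" is a hypothetical example of raw octets printed
without any charset label; the two candidate charsets are my assumption about
what a reader's platform might pick.

```python
from urllib.parse import unquote_to_bytes

# Hypothetical URL as printed: %HH escaped raw octets, charset not stated
published = "/fran%E7ais/"
octets = unquote_to_bytes(published)

# The same octets read under two different platform encodings
as_latin1 = octets.decode("iso-8859-1")
as_cyrillic = octets.decode("iso-8859-5")

print(as_latin1)
print(as_cyrillic)
print(as_latin1 == as_cyrillic)
```

Only the server knows which reading was intended; the magazine, the user and
the browser can only pass the octets along.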

If the encoding is labeled (or known to be UTF8), then the magazine 
could publish either native character representation or a %HH escaped
URL. Similarly the browser could support input of native characters
or a %HH escaped URL. Finally, the %HH escaped UTF8 URL is transmitted 
to the server and converted for use in accessing the local resource.
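
A rough sketch of that second scenario, again in Python. The function names,
the example path and the choice of ISO-8859-1 as the server's local encoding
are all assumptions for illustration; a real browser and server would need
more care (for instance with a literal "%" in a name).

```python
from urllib.parse import quote, unquote

LOCAL_ENCODING = "iso-8859-1"  # assumption: the server's native file name encoding

def browser_to_wire(typed: str) -> str:
    """Accept native characters or an already %HH escaped URL; emit UTF-8 %HH."""
    # "%" is kept safe so input that is already escaped passes through unchanged.
    # Caveat: a literal "%" in a name would need to be typed as %25.
    return quote(typed, safe="/%", encoding="utf-8")

def wire_to_local(wire: str) -> bytes:
    """Server side: undo the %HH escapes as UTF-8, then convert to the local encoding."""
    text = unquote(wire, encoding="utf-8", errors="strict")
    return text.encode(LOCAL_ENCODING)

from_native = browser_to_wire("/fran\u00e7ais/")     # user typed native characters
from_escaped = browser_to_wire("/fran%C3%A7ais/")    # user typed the escaped form
local_name = wire_to_local(from_native)
```

Both input styles produce the same string on the wire, and the server is the
only party that needs to know about its local encoding.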

I could be wrong, but this second scenario seems more transparent to users:
meaningful names can be presented to a wider audience, and there are more
ways for users to correct a URL when it is understandable to the user
community.

> how does this work for 
>   1) Japanese (16-bit characters)

I'm not sure if the 16-bit character question is directed at eventually 
using binary URLs to eliminate the expansion problems of UTF8? e.g. Java
uses an internal 16-bit character. For now only 7-bit clean UTF8 %HH escaped
URLs are "on the table" for discussion in the generic URL syntax document.
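
For a sense of the expansion involved, a quick sketch in Python. The three
character name is an arbitrary example, written with \u escapes so the snippet
itself stays 7-bit clean.

```python
from urllib.parse import quote

# Arbitrary example: three Japanese characters
name = "\u65e5\u672c\u8a9e"

utf16_octets = len(name.encode("utf-16-be"))  # 16-bit internal form
utf8_octets = len(name.encode("utf-8"))       # UTF-8 form
escaped = quote(name, encoding="utf-8")       # UTF-8 %HH escaped form

print(utf16_octets, utf8_octets, len(escaped))
print(escaped)
```

Each of these characters takes two octets as a 16-bit value, three octets in
UTF-8, and nine ASCII characters once %HH escaped. That is the cost of staying
7-bit clean.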

>   2) Hebrew (right to left)

If I understand the issue correctly, the moment a bidirectional character 
is permitted in a URL, the rendering could be ambiguous for "http://system/ABC"
vs. "http://system/CBA". I don't have a good answer for this situation.
Perhaps someone with "file manager" experience on a Hebrew platform could
shed some light on what conventions are typically used for navigating
hierarchical file systems?
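
This does not answer the rendering question, but as a sketch of where the
ambiguity would arise: in the %HH escaped form the octets have a single
(logical) order, and a tool could at least detect which path segments contain
right-to-left characters and so might be displayed in a different order. The
function name and the example URL are hypothetical; the Unicode bidirectional
classes come from Python's unicodedata module.

```python
import unicodedata
from urllib.parse import unquote

RTL_CLASSES = {"R", "AL"}  # bidirectional classes of right-to-left letters

def rtl_segments(escaped_path: str) -> list:
    """For each "/"-separated segment, report whether it contains right-to-left characters."""
    flags = []
    for segment in escaped_path.split("/"):
        text = unquote(segment, encoding="utf-8")
        flags.append(any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in text))
    return flags

# Hypothetical path: an ASCII segment followed by three Hebrew letters
flags = rtl_segments("system/%D7%90%D7%91%D7%92")
print(flags)  # [False, True]
```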

> What happens with "/" and the path components? How does
> directionality get represented? What are the considerations
> for ambiguity beyond the familiar 0O0O0O1l1l1l for ASCII?

In the NFS server specification (RFC 2055, section 6.1) they had to address the
"/" issue for "canonical vs native path" considerations. The simple answer is
that "/" is the separator for multi-component lookup, whether the native
file system uses "/" or ":" or whatever internally. The escaped form "%2f"
could be used when the name actually contains a "/".
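
The order of operations matters here: split on "/" first, then undo the %HH
escapes in each component. A small Python sketch (the function name and the
sample path are mine, not from the RFC):

```python
from urllib.parse import unquote

def path_components(escaped_path: str) -> list:
    """Split on the "/" separator first, then unescape each component."""
    return [unquote(part, encoding="utf-8")
            for part in escaped_path.split("/") if part]

# Hypothetical path whose middle component is literally named "a/b"
components = path_components("/docs/a%2Fb/readme")
print(components)  # ['docs', 'a/b', 'readme']

# Unescaping before splitting loses the distinction
wrong = [p for p in unquote("/docs/a%2Fb/readme").split("/") if p]
print(wrong)       # ['docs', 'a', 'b', 'readme']
```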

> When the details of this are worked out, and we actually
> have something that works to allow non-ASCII URLs, then
> we can look and see if %xx-hex encoded UTF-8 encoded Unicode
> actually forms part of the solution. But it doesn't seem
> "trivial" to me, or at all certain that the current proposal
> is actually part of the solution.

If the current proposal doesn't solve the problem, where is the next place 
where a solution could be considered?  e.g., to address the security issue
a new URL scheme https was used to introduce SSL communication. Another 
approach could be to pursue an "executable content" solution to a portion
of the problem space. 

I think the I18N named resource problem is a real need in the market today.
It could be that I18N URLs are not the right way to meet that need. I'm open
to other ways to meet that need, but so far the UTF8 %HH escaped proposal has 
appeared to be the most open and understandable approach.

	%HH escaped unspecified character set (current spec)
	%HH escaped ISO-8859-1 character set (common European practice)
	%HH escaped ISO-2022 character set (common Asian practice ?)
	%HH escaped UTF-8 Unicode 2.0 character set (proposed)

> Regards,
> Larry
> --

Thanks for still listening.