Re: revised "generic syntax" internet draft

"Martin J. Duerst" <mduerst@ifi.unizh.ch> Mon, 21 April 1997 14:26 UTC

Received: from cnri by ietf.org id aa08446; 21 Apr 97 10:26 EDT
Received: from services.Bunyip.Com by CNRI.Reston.VA.US id aa20099; 21 Apr 97 10:26 EDT
Received: (from daemon@localhost) by services.bunyip.com (8.8.5/8.8.5) id JAA25149 for uri-out; Mon, 21 Apr 1997 09:34:28 -0400 (EDT)
Received: from mocha.bunyip.com (mocha.Bunyip.Com [192.197.208.1]) by services.bunyip.com (8.8.5/8.8.5) with SMTP id JAA25138 for <uri@services.bunyip.com>; Mon, 21 Apr 1997 09:34:24 -0400 (EDT)
Received: from josef.ifi.unizh.ch by mocha.bunyip.com with SMTP (5.65a/IDA-1.4.2b/CC-Guru-2b) id AA21201 (mail destined for uri@services.bunyip.com); Mon, 21 Apr 97 09:33:50 -0400
Received: from enoshima.ifi.unizh.ch by josef.ifi.unizh.ch with SMTP (PP) id <25428-0@josef.ifi.unizh.ch>; Mon, 21 Apr 1997 15:33:01 +0200
Date: Mon, 21 Apr 1997 15:33:00 +0200
From: "Martin J. Duerst" <mduerst@ifi.unizh.ch>
To: Larry Masinter <masinter@parc.xerox.com>
Cc: Gary Adams - Sun Microsystems Labs BOS <Gary.Adams@east.sun.com>, uri@bunyip.com, fielding@kiwi.ics.uci.edu, Harald.T.Alvestrand@uninett.no
Subject: Re: revised "generic syntax" internet draft
In-Reply-To: <3353BB4D.4B7@parc.xerox.com>
Message-Id: <Pine.SUN.3.96.970421145531.245J-100000@enoshima>
Mime-Version: 1.0
Content-Type: TEXT/PLAIN; charset="US-ASCII"
Sender: owner-uri@bunyip.com
Precedence: bulk

On Tue, 15 Apr 1997, Larry Masinter wrote:

> > Are there any "facts" still in need of investigation 
> > or are the only unresolved issues questions of "opinion"? (My opinion
> > is that the current system is already broken, if this could be 
> > substantiated would that invalidate the "status quo" as a viable 
> > alternative?)
> 
> 
> At this point, I think we need not just "facts" but some
> actual "design". Exactly how does this all work in a way that
> actually solves the problem?
> 
> Let's suppose someone wants to publish information
> about their product and put up a URL in a magazine.
> 
> a) what URLs do they support in their server?

The UTF-8 form of the URL they want to give to their product.
No separate %HH handling is needed, because the server core
already decodes %HH escapes.


> b) what gets printed in the magazine?

The actual natural-language characters of the URL for their product.


> c) what does the user type into the browser?

Those same characters.


> d) what does the browser do with what the user typed
>    in order to turn it into the URL that was generated in (a).

Interpret the characters typed in, encode them as UTF-8,
apply %HH escaping to be strictly conforming, and send the
result to the server.
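
As a rough illustration of step (d), here is a minimal sketch in
Python. It assumes the standard urllib.parse module; the function
names and the sample path are made up for this example and are not
part of any draft.

```python
from urllib.parse import quote, unquote


def typed_to_wire(path: str) -> str:
    """Encode the typed characters as UTF-8, then %HH-escape the bytes.

    "/" is kept unescaped so that the path structure survives.
    """
    return quote(path, safe="/", encoding="utf-8")


def wire_to_stored(path: str) -> str:
    """Reverse step, roughly what the server core does with %HH."""
    return unquote(path, encoding="utf-8")


typed = "/\u65e5\u672c"          # hypothetical path: two kanji
wire = typed_to_wire(typed)
print(wire)                      # /%E6%97%A5%E6%9C%AC
print(wire_to_stored(wire) == typed)
```

The round trip shows that what the user types and what the server
stores are the same characters; only the wire form differs.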


> how does this work for 
>   1) Japanese (16-bit characters)

Well, the "16-bit" characters use up three bytes when in
UTF-8, or 9 bytes with additional %HH, but otherwise,
everything works smoothly.
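
The byte counts can be checked with a few lines of Python (a
sketch only; the sample character is an arbitrary choice):

```python
from urllib.parse import quote

ch = "\u65e5"                  # one Japanese character, U+65E5
raw = ch.encode("utf-8")       # UTF-8 form
escaped = quote(ch)            # UTF-8 form with %HH applied

print(len(raw))                # 3
print(len(escaped))            # 9
print(escaped)                 # %E6%97%A5
```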


>   2) Hebrew (right to left)

This has been discussed to some extent. The URL should be
stored in logical order. For display/paper, we have to
agree on a uniform way of logical->graphical conversion.
A proposal is available on Francois' web page. The main
idea is to display the URL with overall LTR directionality
and with all syntactically relevant characters with strong
LTR directionality (this will differ from the basic, text-
oriented BIDI algorithm). Bidirectional controls might be
allowed or not, but if they are allowed, they will be
strictly restricted to individual path elements and similar
stuff.


> What happens with "/" and the path components?

The "/" will be strong LTR. The path components will always
be displayed LTR globally. If an individual path component
is Hebrew, it will be displayed RTL.

> How does
> directionality get represented?

See above. Probably only implicit directionality is needed,
but with a change for "/" and similar (which are neutral
in the general text algorithm). Note that this does not
mean that there is a need to change the general algorithm,
just that some marks have to be inserted before display.
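
To make "some marks have to be inserted before display" a bit
more concrete, here is one possible reading as a Python sketch.
It is not the proposal on Francois' page. It assumes that the
mark used is U+200E LEFT-TO-RIGHT MARK and that the delimiter
set is "/", "?", "#" and ":"; both are assumptions made for this
example, as are the function names.

```python
LRM = "\u200e"  # LEFT-TO-RIGHT MARK (assumed choice of mark)


def mark_for_display(url: str, delimiters: str = "/?#:") -> str:
    """Surround each delimiter with LRM so that the normally neutral
    character resolves to LTR in the standard bidi algorithm."""
    out = []
    for ch in url:
        out.append(LRM + ch + LRM if ch in delimiters else ch)
    return "".join(out)


def strip_marks(displayed: str) -> str:
    """Remove the display-only marks to recover the logical-order URL."""
    return displayed.replace(LRM, "")


# Hypothetical URL with one Hebrew path component, in logical order.
url = "http://example.org/\u05e9\u05dc\u05d5\u05dd/page"
shown = mark_for_display(url)
print(strip_marks(shown) == url)
```

The stored URL stays untouched; the marks exist only in the
string handed to the display layer, which must also use overall
LTR direction.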


> What are the considerations
> for ambiguity beyond the familiar 0O0O0O1l1l1l for ASCII?

First a small clarification: The 0O0O0O1l1l1l is explicitly
familiar to this list because it has been mentioned a few
times. It is *implicitly* familiar to most of the other
ASCII users.

There are many considerations, most of them script-specific
and (implicitly) well known to the users of that particular
script. Some ambiguities come from characters that were
included in Unicode for backwards compatibility. Others
come from the treatment of accented characters in
Unicode/ISO 10646.
Some of these ambiguities, in particular the last one, have
to be solved by defining a normalization procedure/algorithm.
The data for this is already given by the equivalence definitions
found in Unicode; what is needed is a few core decisions.
The main one of them will probably be "Use precomposed for
everything in Unicode 2.0, use decomposed for everything
after version 2.0." This is most practical (because the
Unicode 2.0 book is widely available and because it agrees
with current practice) and leads to fewer problems when upgrading
to newer versions with new characters.
If it stays at that, then the final normalization spec
will not contain much more than is used in practice
already today. It is true that we don't know yet how
a complex linguistic notation, a Latin base letter with
e.g. five different diacritics, would get normalized,
but because such beasts are not used much for identifiers
at the moment, I think we will have no trouble ensuring
that our spec stays ahead of actual practice.
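
To show why the accented-character case needs normalization,
here is a small Python sketch. It uses unicodedata.normalize
with the "NFC" form as a stand-in for the "use precomposed"
rule; NFC is a later formalization and is only assumed here to
approximate that rule. The function name is made up.

```python
import unicodedata
from urllib.parse import quote

precomposed = "\u00e9"       # e with acute as one character
decomposed = "e\u0301"       # "e" followed by combining acute

# Without normalization the two spellings give different URLs.
print(quote(precomposed))    # %C3%A9
print(quote(decomposed))     # e%CC%81


def normalize_then_escape(text: str) -> str:
    """Fold to precomposed form (NFC assumed), then UTF-8 + %HH."""
    return quote(unicodedata.normalize("NFC", text))


# After normalization both spellings give the same URL.
print(normalize_then_escape(precomposed) == normalize_then_escape(decomposed))
```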


> When the details of this are worked out, and we actually
> have something that works to allow non-ASCII URLs, then
> we can look and see if %xx-hex encoded UTF-8 encoded Unicode
> actually forms part of the solution. But it doesn't seem
> "trivial" to me, or at all certain that the current proposal
> is actually part of the solution.

The overall design is definitely clear. It is very clear
that it includes UTF-8. Nobody has seriously challenged
this or brought up any kind of workable alternative.
And that's what we should make clear in the current
draft.

It is not clear whether the final solution will include
%HH-encoding. My guess is that %HH encoding will go away
on the HTTP wire except for reserved (ASCII) characters.
My guess is also that it will go away in HTML and similar
documents. It of course will go away on paper. It will
stay as a fallback, e.g. if I find a terrific Japanese
web page of which I only know the Japanese URL, and
I want to help somebody not familiar with Japanese
to have a look at it.

Regards,	Martin.