Re: Transcribing non-ascii URLs [was: revised "generic syntax" internet draft]

"Martin J. Duerst" <mduerst@ifi.unizh.ch> Mon, 14 April 1997 22:05 UTC

Date: Mon, 14 Apr 1997 22:35:03 +0200
From: "Martin J. Duerst" <mduerst@ifi.unizh.ch>
To: Dan Connolly <connolly@w3.org>
Cc: Francois Yergeau <yergeau@alis.com>, uri@bunyip.com, bert@w3.org
Subject: Re: Transcribing non-ascii URLs [was: revised "generic syntax" internet draft]
In-Reply-To: <33526F61.622A4B12@w3.org>
Message-Id: <Pine.SUN.3.96.970414220330.245K-100000@enoshima>
Mime-Version: 1.0
Content-Type: TEXT/PLAIN; charset="US-ASCII"
Sender: owner-uri@bunyip.com
Precedence: bulk

On Mon, 14 Apr 1997, Dan Connolly wrote:

> I have been shooting from the hip on this I18N/URL stuff for a while,
> but some folks at WWW6 wanted the full weight of W3C behind it, so
> I've been trying to think more carefully.

We have all tried to do that, and we are glad for all the
help we can get.

> And this issue of transcribing non-ascii URLs particularly concerns me.
> 
> On the one hand, it makes a lot of sense that if a user creates
> a file and gives it a hebrew or arabic or CJK name, and then exports
> the file via an HTTP server, that the Address: field in a web
> browser should show the hebrew or arabic or ... characters faithfully.

Yes, definitely.

> On the other hand, suppose that address is to be printed and put
> in an advertisement or a magazine article. Should it print the
> hebrew/arabic/CJK characters using those glyphs?

If it's a hebrew/arabic/CJK magazine, then definitely yes.

> Or should it print ASCII glyphs corresponding to the characters
> of the %xx encoding of the original characters?

That would be the fallback for an English (or otherwise
scriptwise unrelated) publication.
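
To make that fallback concrete, here is a minimal sketch in Python.
It assumes UTF-8 as the character-to-octet mapping, which is only
one of the options under discussion, and the file name is made up
for illustration:

```python
from urllib.parse import quote, unquote

# Hypothetical file name: the Hebrew letters shin, lamed, vav, final mem.
name = "\u05e9\u05dc\u05d5\u05dd.html"

# quote() maps characters to UTF-8 octets by default (an assumption here),
# then writes each non-ASCII octet as %XX.
encoded = quote(name, safe="")
print(encoded)  # %D7%A9%D7%9C%D7%95%D7%9D.html

# Decoding reverses the mapping, again assuming UTF-8.
decoded = unquote(encoded)
assert decoded == name
```

Four Hebrew letters become 24 ASCII characters, which shows why
a long %xx sequence is so hard to transcribe by hand.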

> If the former, then reliability suffers: the odds that a random
> person on the globe can faithfully key in a hebrew/arabic/CJK
> name seem considerably lower than the odds that they can key
> in an ASCII name. (though the odds of correctly transcribing
> a long sequence of %xx codes is vanishingly small too...)

Quite true. But the chances that a random person keys in
a hebrew/arabic/CJK URL are rather small. I don't know if
you speak Japanese, but assuming you don't, how often have
you typed in a URL you found in a Japanese publication,
for a Japanese page? If you take the average weighted by the
actual users of each kind of URL, the reliability increases
dramatically, because there is a very strong correlation
between the script of a URL and the script its users know.

> (I'm not saying that everybody knows english, but rather
> that a person using a computer connected to the internet
> has a fairly high probability of being able to match
> the 'a' character on a piece of paper to the 'a' character
> on the keyboard.)

The average non-Latin-native user definitely has a higher
probability of matching a character from the Latin alphabet
than of matching a character from a randomly chosen
foreign alphabet. But matching a character in the
native alphabet should always be easier.

> If the latter, then the system is very much biased to
> the *American* Standard Code of Information Interchange.

It's not so much the fact that the Americans made that
standard that is the problem. Unicode also has a very
strong American influence.

> It seems to me that the minimally constraining
> thing to do is to specify both
> and allow folks to choose:

Exactly.

> specify how Unicode strings
> fit into URLs, and then advise folks to use a small
> subset of Unicode if their audience is international
> (and at the same time, add a few more notes: perhaps advise folks that
> mixing upper and lowercase increases the risk of
> transcription errors).

There are more things you can add to the notes. For example
that you shouldn't use URLs like 0O0O0o0o.html. Some of
these things can get a little tricky if you have lots of
characters, but for ASCII, nobody up to now cared to
actually write such notes. It's probably more a problem
of computer literacy than of standards specs.
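
As a rough sketch of what such notes could check, assuming only
the two risks mentioned here (mixed case and look-alike characters);
the function name and the look-alike groups are my own choice for
illustration, not part of any spec:

```python
# Hypothetical groups of ASCII characters that are easily confused in print.
LOOKALIKE_GROUPS = [set("0Oo"), set("1lI")]

def transcription_risks(name: str) -> list:
    """Return a list of reasons why `name` may be hard to transcribe."""
    risks = []
    # Mixed upper and lower case raises the chance of a transcription error.
    if any(c.isupper() for c in name) and any(c.islower() for c in name):
        risks.append("mixed case")
    # Flag a group only if at least two of its members appear together.
    for group in LOOKALIKE_GROUPS:
        present = sorted(group & set(name))
        if len(present) >= 2:
            risks.append("confusable characters: " + "".join(present))
    return risks

print(transcription_risks("0O0O0o0o.html"))
print(transcription_risks("index.html"))
```

The first name is flagged for both risks, the second for none.
A real list for non-ASCII scripts would need far larger tables.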

> What's the conventional wisdom among the DNS folks? Surely
> they face the same issue.

No. Or let's say not yet. DNS is strictly case-folded ASCII.
But see draft-duerst-dns-i18n-00.txt for an idea for a way out.

> Regarding process, it seems clear (based on Larry M and John K's
> input) that specifying how Unicode
> strings fit into URLs is not the sort of thing one adds to
> a proposed standard to make it a draft standard.
> 
> But I'm not terribly interested in a draft standard that doesn't
> address this issue -- even if only to say "we thought about encoding
> Unicode in URLs, but decided against it for the following reasons... ."
> 
> In either case, a separate internet draft on the subject seems
> like a perfectly good idea. I don't think the risk of "incompatible
> standards" is unmanageable.

I don't see that problem either. But having two documents that
are not connected is also not a good idea. What I thought would
be the best way to go, and what I think Dan Connolly has a lot
of experience with and can certainly advise us on, would be
a solution similar to the one we had with RFC1866 (HTML 2.0)
and RFC2070 (HTML I18N) and the issue of ISO 10646 as the
document character set. That was about two years ago, but it
has turned out to be a very nice idea.

This would mean adding a clear hint to the current draft
about the basic issue of character<->octet mapping, showing
where we are going, without having to rewrite a now very
well-done document. Roy's text can serve as a base,
but we can change it to fit your needs.

The more extended aspects of beyond-ASCII URLs would then
be discussed in a separate draft. We already have a lot
of text from this discussion, and from Francois' web
page. I already volunteered as an author/editor.

> Larry has asked for implementation experience. Such experience
> seems to be growing. None of the implementors has reported
> any problems (as far as I can see).
> 
> Regarding Jigsaw and Amaya... Support in Jigsaw should be easy.
> I'll look into it. Anybody want to do it for me? Should
> be a quick hack.

If you need some help, please tell me (I don't volunteer
for the full job, though).

> Support in Amaya would be more work. I don't think we've
> crossed the hurdle of getting non-western fonts working
> in Amaya, not to mention internationalized input.

Amaya, as far as I remember, is based on Motif and X11.
If that's the case, I would definitely advise you not
to spend your time on it. Many browser vendors are
doing this work anyway, as we have seen.

Regards,	Martin.
