Re: URL internationalization!

"Martin J. Duerst" <mduerst@ifi.unizh.ch> Fri, 21 February 1997 13:07 UTC

Received: from cnri by ietf.org id aa00916; 21 Feb 97 8:07 EST
Received: from services.Bunyip.Com by CNRI.Reston.VA.US id aa09446; 21 Feb 97 7:58 EST
Received: (from daemon@localhost) by services.bunyip.com (8.8.5/8.8.5) id HAA26268 for uri-out; Fri, 21 Feb 1997 07:32:12 -0500 (EST)
Received: from mocha.bunyip.com (mocha.Bunyip.Com [192.197.208.1]) by services.bunyip.com (8.8.5/8.8.5) with SMTP id HAA26263 for <uri@services.bunyip.com>; Fri, 21 Feb 1997 07:32:08 -0500 (EST)
Received: from josef.ifi.unizh.ch by mocha.bunyip.com with SMTP (5.65a/IDA-1.4.2b/CC-Guru-2b) id AA00441 (mail destined for uri@services.bunyip.com); Fri, 21 Feb 97 07:32:03 -0500
Received: from enoshima.ifi.unizh.ch by josef.ifi.unizh.ch with SMTP (PP) id <21484-0@josef.ifi.unizh.ch>; Fri, 21 Feb 1997 13:32:19 +0100
Date: Fri, 21 Feb 1997 13:32:18 +0100
From: "Martin J. Duerst" <mduerst@ifi.unizh.ch>
To: "Roy T. Fielding" <fielding@kiwi.ics.uci.edu>
Cc: URI mailing list <uri@bunyip.com>
Subject: Re: URL internationalization!
In-Reply-To: <9702201154.aa16860@paris.ics.uci.edu>
Message-Id: <Pine.SUN.3.95q.970221113854.245F-100000@enoshima>
Mime-Version: 1.0
Content-Type: TEXT/PLAIN; charset="US-ASCII"
Sender: owner-uri@bunyip.com
Precedence: bulk

Hello Roy,

On Thu, 20 Feb 1997, you wrote:

> First, I want to get some terminology straight.  The issue at hand is
> not internationalization, since the only international character
> set at the current time is US-ASCII (i.e., ISO-647).  No, I don't mean
> that US-ASCII is capable of representing all languages -- it isn't.
> What I mean is that it is the only character set that is displayable
> and typeable on almost all, if not all, computers in use on the Internet.

Thanks for the "almost all". I met an aquaintance recently and he
told me that on a trip to Japan, in the two different (multinational)
companies he visited, he and the local staff were unable to find
equipment having Latin letters on their keyboards or otherwise
allowing to type in Latin letters. With my more than three years
experience on computers in Japan, this was very hard for me to
belive, but such cases indeed seem to exist. Whether these computers
was connected to the Internet I don't know, but because of the
heavy reliance of the Internet on ASCII on the application layers,
ASCII support easily can be only a consequence of connectivity.


> It would help a great deal if advocates of localization did not use
> the term internationalization; you are just creating unnecessary heat
> instead of solving the problem at hand.

See Francois' note. The term is widely used, and it would be nice
if that could be accepted by people less familiar with the topic.
The same applies to "universal". In some discussions, ASCII has
been called the "universal" character set. There is some sense of
universality in a lowest common denominator, but it is a very
poor one. The term Universal Character Set is clearly established
in ISO 10646.


> What Martin (and others) have suggested is that the existing requirements
> on internationalization are too severe.  In essence, he wants to make it
> legitimate for URLs to be localized (or lingua-centric), based on the
> conjecture that it is more important for locals to be able to use the
> most meaningful of names within a URL than it is that non-locals be able
> to use those same URLs at all.

If you remove the "at all" (we have %HH as a fallback in all cases),
then you definitely get my point. And I have good arguments for it.
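
To make the fallback concrete, here is a small illustrative Python
sketch (the helper name is mine, purely for illustration) of what
%HH escaping amounts to: any octet whatsoever can be written with
plain ASCII characters, so a transcribable form always exists.

    # Purely illustrative: escape every octet outside a small safe
    # set as '%' followed by two hex digits.
    def percent_encode(octets):
        safe = set(b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                   b"abcdefghijklmnopqrstuvwxyz"
                   b"0123456789-_.~")
        return "".join(chr(o) if o in safe else "%%%02X" % o
                       for o in octets)

    # "Zuerich" written with u-umlaut (U+00FC), encoded as UTF-8
    # octets and then escaped:
    print(percent_encode("Z\u00fcrich".encode("utf-8")))
    # prints: Z%C3%BCrich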

Just for a moment, consider how the web deals with blind and
visually impaired users. Much of the web relies on images which
are quite useless to these people; for them, many pages that could
be of great interest are completely unusable. At least as long as
there are fallbacks (HTML provides for them, but they are rarely
used), we all seem to accept this. Nobody claims that everybody
should stop using nice images just because a small percentage of
the population can't access them. And this is okay, because sight,
for those who have it, is a very useful and efficient sense.

Now let's turn back to URLs. There are a lot of documents in
languages that need something other than, or more than, ASCII.
These documents are accessed mostly by people familiar with those
languages and scripts. Of course, sometimes they are accessed by
others (more by chance than on purpose, I guess). But is it
necessary to make that access much more difficult and clumsy for
the great majority of those who actually do access and make use
of these resources, just to make it easier for the very small
percentage who might glance at them by chance and out of
curiosity, the ones we might call, in this context, scriptwise
challenged?

The answer is pretty obvious. If it isn't to you, then either you
are assuming that you, English, the US, the Latin alphabet, or
whatever, have special rights and merits over other languages,
countries, or scripts, or you have been brought up with English
and the Latin alphabet and never really used anything else, so
that it is difficult for you to understand (perhaps even to
imagine, or to get the idea that this could be an issue) that
other people, brought up in other cultures and with other scripts,
may have about as much difficulty understanding why anyone could
be comfortable in English and the Latin script.



> It is my opinion that URLs are, first and foremost, a uniform method of
> describing resource addresses such that they are usable by anyone in
> the world.  In my opinion, an address which has been localized at the
> expense of international usage is not a URL, or at least should be
> strongly discouraged.  This is, I think, one of the basic philosophies
> behind the URI design, and what I tried to describe in the URL syntax
> document.  It is one of the key reasons why URIs succeeded where all
> other attempts at a uniform address syntax have failed.

Nice and good. Memorability, guessability, understandability,
and shortness (in terms of user characters) have admittedly had
their fair share in the URI success story. To this, add the
availability of implementations at an early stage.


> It is therefore my opinion that any attempt to increase the scope of
> the URL character set to include non-ASCII characters is a bad idea.
> This does not in any way restrict the nature of resources that can
> be addressed by a URL; it just means that the URL chosen should be an
> ASCII mapping, either one chosen by the user or one chosen automatically
> using the %xx encoding.  Yes, this is an inconvenience for non-English-
> based filesystems and resources, but that is the price to pay for true
> internationalization of resource access.

The %HH encoding is the only thing I currently propose, and even
in my long-term proposal it will always be available. The
inconvenience for non-English *USERS* (ever heard of a file system
complaining about inconvenience?), when weighed against the
relative frequency of use, is too high a price to pay compared
with the inconvenience that the use of %HH creates for
English-speaking users. Also, once UTF-8 is introduced and
established, we can speak about "chosen automatically". Currently,
except for the computer (not the user) at the point of origin, we
can only say "chosen chaotically".


> Nevertheless, I am not one to believe in forcing, by way of standard,
> a technological solution to a social problem.  If people want to create
> locals-only URLs, I am not the kind of person to stand in their way.
> However, I am the kind of person who would tell them they are being
> shortsighted, and I believe that kind of guidance should remain in
> the specification.

If a Japanese company creates fancy Japanese URLs to appeal to
its Japanese customers, and gets a better response because of
this, would you call this shortsighted? It is no more shortsighted
than what many US companies are doing all the time, with great
success.


> In regards to the changes proposed by Martin J. Duerst:

> >> 1.3. URL Transcribability
> >> 
> >>    The URL syntax has been designed to promote transcribability as one
> >>    of its main concerns. A URL is a sequence of characters, i.e., letters,
> >>    digits, and special characters.  A URL may be represented in a
> >
> >change one sentence:
> >
> >A URL is a sequence of characters from a very limited set, i.e. the
> >letters of the basic Latin alphabet, digits, and some special characters.
> >
> >[Justification: "character" is used in different circumstances and
> >senses later. It is important to make things clear up front.]
> 
> That seems like a good idea.

Thanks.

> >>    These design concerns are not always in alignment.  For example, it
> >>    is often the case that the most meaningful name for a URL component
> >>    would require characters which cannot be typed on most keyboards.
> >>    The ability to transcribe the resource
> >>    location from one medium to another was considered more
> >>    important than having its URL consist of the most meaningful of
> >>    components.
> >
> >Add:
> >In local and regional contexts and with improving technology, users
> >may greatly benefit from being able to use a wider range of characters.
> >However, at the current point of time, such use is not guaranteed to
> >work, and should therefore be avoided.
> 
> I would strike the word "greatly", but otherwise this is true.

Agreed.


> >Add (this is CRUCIAL!):
> >
> >+ In current practice, all kinds of arbitrary and unspecified character
> >+ encoding schemes are used to represent the characters of the world.
> >+ This means that only the originator of the URL can determine which
> >+ character is represented by which octets.
> 
> Replace "all kinds of arbitrary and" with "multiple" and its okay.

Okay with me.


> However, the wording that existed
> in earlier drafts was considerably better, since it didn't preclude an
> application from showing what it did know about the character encoding.

The above describes current practice. To my knowledge, there is currently
no application that shows what it does know about the character encoding.


> >+ To improve this, UTF-8 [RFC 2044] should be used to encode characters
> >+ represented by URLs wherever possible. UTF-8 is fully compatible with
> >+ US-ASCII, can encode all characters of the Universal Character Set,
> >+ and is in most cases easily distinguishable from legacy encodings
> >+ or random octet sequences.
> >+
> >+ Schemes and mechanisms and the underlying protocols are suggested
> >+ to start using UTF-8 directly (for new schemes, similar to URNs),
> >+ to make a gradual transition to UTF-8 (see draft-ietf-ftpext-intl-ftp-00.txt
> >+ for an example), or to define a mapping from their representation
> >+ of characters to UTF-8 if UTF-8 cannot be used directly
> >+ (see draft-duerst-dns-i18n-00.txt for an example).
> >
> >[Comment: the references can be removed from the final text.]
> >
> >+ Note: RFC 2044 specifies UTF-8 in terms of Unicode Version 1.1,
> >+ corresponding to ISO 10646 without ammendments. It is widespread
> >+ consensus that this should indeed be Unicode Version 2.0,
> >+ corresponding to ISO 10646 including ammendment 5.
> 
> None of the above belongs in this document.  That is the purpose of
> the "defining new URL schemes" document, which was previously removed
> from the discussion of the generic syntax.

The process document describes what should be done with new URL
schemes. But this is, as I hope you might have noticed, only
one part (and probably the smaller part) of my proposal.

UTF-8 can, and should, be used for existing URLs too. As I have
shown, when done correctly, this can be achieved without breaking
a single URL, and without having to change software that doesn't
want to take advantage of it.
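
As a sketch of what I mean (my own illustration, not normative
text): a client that wants the new behaviour unescapes the %HH
sequences and shows the result as characters only when the octets
happen to be well-formed UTF-8; anything else is left exactly as
it is today, so nothing breaks.

    # Purely illustrative: display %HH-escaped data natively only
    # if the underlying octets are well-formed UTF-8; otherwise
    # leave the URL exactly as it was.
    import binascii, re

    def display_form(url_path):
        octets = re.sub(rb"%([0-9A-Fa-f]{2})",
                        lambda m: binascii.unhexlify(m.group(1)),
                        url_path.encode("ascii"))
        try:
            return octets.decode("utf-8")  # valid UTF-8: show characters
        except UnicodeDecodeError:
            return url_path                # anything else: unchanged

    print(display_form("/Z%C3%BCrich"))    # shown natively (valid UTF-8)
    print(display_form("/%93%FA%96%7B"))   # left as is (Shift_JIS octets)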

There are protocols and mechanisms that are currently not
internationalized at all (officially this is true for FTP; in
practice, News names and domain names are examples), for which the
process draft, in the opinion of some people, is not applicable,
but which can nevertheless benefit greatly from using UTF-8 in
their URL representation if and when they decide to
internationalize. (FTP already did, and for good reasons chose
UTF-8 directly in the protocol, without needing a special
translation to URLs, so we already have a good example.)


> >> 2.3.3. Excluded Characters
> >
> >Change to "Characters and Octets"
> >
> >>    Although they are not used within the URL syntax, we include here a
> >>    description of those US-ASCII characters which have been excluded
> >>    and the reasons for their exclusion.
> >
> >Change "US-ASCII characters" to "US-ASCII characters and other octets"
> 
> I'll leave that to Larry's judgement, since the reemphasis of octets
> over characters may have left some confusion in the document.
> 
> >>       excluded    = control | space | delims | unwise | national
> >
> >Change "national" to "others". There is nothing particularly
> >national about octet values above 0x7F. There is also nothing
> >particularly national about a character such as A-grave. It is
> >used in many languages, by many nations.
> 
> Okay -- it was just a leftover from the old BNF.
> 
> >>    All characters corresponding to the control characters in the
> >
> >Change "characters" to "octets".
> 
> The first ocurrence, yes.

Agreed, the second occurrence can stay as it is; it's indeed
"control characters" and not "control octets".
But if we follow the sentence a little further, we discover
another problem:


#   All characters corresponding to the control characters in the
#   US-ASCII coded character set are unsafe to use within a URL, both

The (US) ASCII coded character set only contains 94 characters.
SPACE and DEL, and the control characters, are officially not
part of it. Please see ECMA registration 006. This is a detail,
and I don't think it will cause confusion if it remains, but
while we are at it we might as well clean it up.


> Larry, please write these changes such
> that they still make sense when the URL is pasted on a billboard sign
> instead of in a protocol stream.
> 
> >Up to here, it's easier to speak about characters. But from here
> >on, it's definitely easier and clearer to speak about octets.
> >
> >>    Finally, all other characters besides those mentioned in the above
> >>    sections are excluded because they are often difficult or impossible
> >>    to transcribe using traditional computer keyboards and software.
> >
> >Change to:
> >
> >Finally, octet values above 0x7F are excluded because with the
> >current lack of a common convention for encoding the characters
> >they represent, they can neither be transcribed nor transcoded
> >reliably.
> 
> No, we are still talking about characters here -- octets are not
> relevant to whether or not A-grave is excluded.  The existing paragraph
> is better than the proposed change.

Section 2.3 speaks about escaping. Now escaping happens when
mapping from octets to URL characters. Whether these octets
represent some characters, or something else, is irrelevant
at that stage. Whether these octets, escaped as URL characters,
are posted to a billboard sign (see above) or put into a protocol
stream is also irrelevant at this stage.


> >>       national    = <Any character not in the reserved, unreserved,
> >>                      control, space, delims, or unwise sets>
> >
> >Change to:
> >
> >	others	= <any octets with values above 0x7F>
> 
> No -- "others" is fine, but the BNF definition must remain as is in
> order to correctly define URLs that have no representation in bytes.

Disagree. Same argument as above.


> >>    Data corresponding to excluded characters must be escaped in order
> >>    to be properly represented within a URL.  However, there do exist
> >>    some systems that allow characters from the "unwise" and "national"
> >>    sets to be used in URL references (section 3); a robust
> >>    implementation should be prepared to handle those characters when
> >>    it is possible to do so.
> >
> >It is not "possible to do so", so the above does not make sense.
> 
> That doesn't make any sense -- it is done every day.  Francois had a
> personal URL with a c-cedilla, and it makes sense to admonish
> implementers that such things do occur and should not result in a
> system crash if such is avoidable.

I guess we can find some compromise here. I agree that it makes
sense to admonish implementers that such things do occur and that
they shouldn't cause a crash. On the other hand, it makes sense to
admonish (users and) tool implementers that this practice is
currently not guaranteed to work. Transcoding a document on the
server in response to a request with a different "Accept-Charset"
value is a well-established practice that will increase in the
future, and it hopelessly breaks such URLs. This is one of the
reasons we need UTF-8!
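
To spell the breakage out with a small sketch (the file name and
the scenario are mine, purely for illustration): the same visible
name corresponds to different octets before and after transcoding,
so a reference carrying the raw character, or an escape made from
the old octets, simply stops matching.

    # Purely illustrative: one name, two encodings, two different
    # octet sequences and therefore two different %HH forms for
    # what looks like the "same" URL.
    name = "fran\u00e7ois.html"            # c-cedilla in the name
    for enc in ("iso-8859-1", "utf-8"):
        escaped = "".join("%%%02X" % o if o > 0x7F else chr(o)
                          for o in name.encode(enc))
        print(enc, escaped)
    # iso-8859-1 fran%E7ois.html
    # utf-8      fran%C3%A7ois.html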


> Hmmm, I used to have a section/paragraph on why clients can't convert
> %xx encodings to characters for the purpose of display unless they
> have some knowledge of the character set of the underlying URL-creation
> process, as is the case for all filesystem URLs which are local to
> the client.

Hope you can find it. Would be a nice addition.


> It is unfortunate that it was deleted, since I was going
> to suggest that if the scheme defines that only a single character
> encoding can be used for creating the %xx encoding, then the client does
> have sufficient knowledge to display that data in its natural form.

This is one possible improvement, but it is far from the best
solution. For other issues, in particular relative URLs, it is
very well established that URIs (the I here is on purpose) should
have one generic solution, and that the URL scheme should be
defined so that the scheme-specific implementation makes the
necessary translations. You yourself have stressed this explicitly
in a recent mail, and you know why. I see no reason why this
well-tested paradigm should not be used, with great benefit, for
the case at hand.


Regards,	Martin.