Re: "Difficult Characters" draft

"Martin J. Duerst" <mduerst@ifi.unizh.ch> Wed, 07 May 1997 09:56 UTC

Date: Wed, 07 May 1997 11:23:12 +0200
From: "Martin J. Duerst" <mduerst@ifi.unizh.ch>
To: Alain LaBont/e'/ <alb@sct.gouv.qc.ca>
cc: URI mailing list <uri@bunyip.com>
Subject: Re: "Difficult Characters" draft
In-Reply-To: <3.0.1.16.19970421154814.093f3814@riq.qc.ca>
Message-ID: <Pine.SUN.3.96.970507104936.245Y-100000@enoshima>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset="US-ASCII"
Sender: owner-uri@bunyip.com
Precedence: bulk

On Mon, 21 Apr 1997, Alain LaBont/e'/ wrote:

> Don't forget that French French don't have all uppercase letters on their
> PC keyboards... even if there are Canadian standards (CAN/CSA Z243.200) and
> ISO (ISO/IEC 9995-3) standards for doing so. So capitalization remains a
> problem in practice for the French people on upper case letters. Some
> French keyboards have this, not all.

I think we pretty much agree that we should discourage URLs with
accented uppercase letters.


> Not so, I demonstrated this in my earlier note about my insurance agent web
> page. People care (of course), servers care, or browsers care and whoever
> or whichever does the correction, the net result is that equivalences are
> done today and end-users got used to this... at least some... and likely a
> big lot.

End users who believe that URLs ignore case will be unpleasantly
surprised and will have to correct their opinion some day.
And the main reason that we have case equivalence in DNS is the
time at which DNS was created, when case distinction was not
something you could assume a computer could do (human beings
have always been able to do it :-).


> >> Fortunately, it's possible that equivalence-based matching
> >> could be deployed for URLs;
> >
> >That's interesting. But it would be a lot more work than the
> >conversions from and to UTF-8 that I have suggested for backwards
> >compatibility and that have raised great concerns from Roy.
> 
> There exist methods for this in actual practice and it is about to be
> standardized in ISO/IEC 14651 which defines an API for character string
> comparisons at different levels of precision.

It works if the expectations of the user are available
when the equivalence is computed. But it doesn't work otherwise.
This is easily shown by example. Assume somebody in Turkey
puts up a server, and installs this server so that equivalences
are done on the various variants of I according to Turkish
expectations (matching uppercase and lowercase dotted i and
uppercase and lowercase dotless I). Now assume that there is
a URL http://www.xxx.com/izmir. If a Western European user
accesses it by typing
	HTTP://WWW.XXX.COM/IZMIR
the URL won't match, because under Turkish rules the undotted
uppercase "I" does not match the dotted lowercase "i". This
will be a rare case, but it will be all the more surprising.
It will be impossible for an average user to learn the lesson
"Always care about case to be on the safe side", because there
are not enough examples to reinforce it. But it will nevertheless
still be true; it will still be the only thing that guarantees
a response.
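
To make the mismatch concrete, here is a minimal sketch in Python. The
two helper functions are hypothetical and written only for this
illustration; they are not taken from any server or standard. The
sketch assumes the server folds case by Turkish rules while the user
expects the usual Western European folding. The dotless and dotted
letters are written as Unicode escapes to keep the text in ASCII.

```python
# Hypothetical helpers for illustration only.

DOTLESS_I = "\u0131"      # LATIN SMALL LETTER DOTLESS I
DOTTED_CAP_I = "\u0130"   # LATIN CAPITAL LETTER I WITH DOT ABOVE


def lower_default(s: str) -> str:
    """Locale-independent lowercasing: 'I' becomes dotted 'i'."""
    return s.lower()


def lower_turkish(s: str) -> str:
    """Lowercasing by Turkish expectations (assumed server behaviour).

    Undotted capital 'I' maps to dotless small i, and dotted capital I
    maps to the ordinary dotted small 'i'.
    """
    s = s.replace("I", DOTLESS_I).replace(DOTTED_CAP_I, "i")
    return s.lower()


stored = "izmir"   # path as published on the server
typed = "IZMIR"    # path as typed by the Western European user

# What the user expects: the two spellings are equivalent.
matches_default = lower_default(typed) == lower_default(stored)

# What a Turkish-tailored server computes: they are different names.
matches_turkish = lower_turkish(typed) == lower_turkish(stored)

print(matches_default, matches_turkish)
```

The same pair of strings is equivalent under one tailoring and distinct
under the other, and nothing in the URL itself tells the server which
tailoring the user had in mind.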


> >We don't want to ask the French user more than the US user,
> >when compared to his/her language abilities. And up to now,
> >we don't.
> 
> You do. If equivalences are not processed adequately, given that
> equivalence processing exists today. You ask either exact match or match
> independent of case but dependent on accents... that's not good enough...

Well, I don't actually propose that. I just say that it wouldn't
be too strange to consider case equivalences but not accent equivalences.
In sorting, accents also have higher distinctive power than case, don't they?
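
As a rough illustration of the two kinds of equivalence, here is a
small Python sketch. The function names and the sample strings are
made up for this example, and dropping combining marks is only a crude
stand-in for a real accent-insensitive comparison, not the method of
any of the standards mentioned in this thread.

```python
import unicodedata


def fold_case(s: str) -> str:
    """Ignore case but keep accents."""
    return unicodedata.normalize("NFC", s).casefold()


def fold_case_and_accents(s: str) -> str:
    """Ignore both case and accents by dropping combining marks."""
    decomposed = unicodedata.normalize("NFD", s)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()


a = "Qu\u00e9bec"   # mixed case, with e-acute
b = "QU\u00c9BEC"   # all uppercase, with E-acute
c = "quebec"        # lowercase, no accent

# Case equivalence only: a and b match, c stays distinct.
case_only = (fold_case(a) == fold_case(b), fold_case(a) == fold_case(c))

# Case and accent equivalence: all three collapse to one key.
case_and_accent = fold_case_and_accents(a) == fold_case_and_accents(c)

print(case_only, case_and_accent)
```

With case equivalence alone the accented and unaccented spellings
remain different names, which is the intermediate position described
above.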


> See ISO/IEC CD 14651 or CAN/CSA Z243.4.1 (published in 1992, revised this
> year -- characters have been added but the logic is the same) and CAN/CSA
> Z243.230 (this one to be published this year)...

The above standards are sorting standards that can be used for matching
on various levels and for searching, and can be tailored to various
user expectations.

But they don't work for URLs, because they would need tailoring options
to be transmitted with the URL from the client to the server (and these
tailoring options can necessitate a rather large data volume in some cases).
Also, they are unsuitable and lead to surprises for the users if they
are not applied on all servers and services (which is impossible).

I agree that it would be great to have a lot of user-friendliness, with
servers correcting all kinds of mistakes, from case to accents to spelling
and whatnot. But I think it is wrong to create expectations that can only
partially be fulfilled and will confuse the user.

The rule "Copy it exactly, with case and everything."
is about 5% less user friendly than a highly sophisticated and
user-tailored equivalence engine, in particular if irregular casings
(or uppercase in general where it is not part of the grammar as in
German) and unusual case-accent combinations (such as French uppercase
accented characters) are avoided. At least 99% of the users of
bicameral scripts can easily distinguish case and so on if they are
told to do so. [The remaining 1% or less are the people who might
have problems distinguishing similar-looking letters such as
'd', 'b', 'q', and 'p' and so on.]
So in the end, the strategy
	"Copy it exactly, with case and everything."
is much more user friendly, because it is the only one that
works consistently.

Regards,	Martin.