Re: UTF-8 URL for testing

"Martin J. Duerst" <mduerst@ifi.unizh.ch> Mon, 14 April 1997 19:51 UTC

Date: Mon, 14 Apr 1997 21:09:16 +0200
From: "Martin J. Duerst" <mduerst@ifi.unizh.ch>
To: John C Klensin <klensin@mci.net>
Cc: masinter@parc.xerox.com, Francois Yergeau <yergeau@alis.com>, uri@bunyip.com, Dan Connolly <connolly@w3.org>
Subject: Re: UTF-8 URL for testing
In-Reply-To: <SIMEON.9704121139.H@tp7.Jck.com>
Message-Id: <Pine.SUN.3.96.970414204206.245I-100000@enoshima>
Mime-Version: 1.0
Content-Type: TEXT/PLAIN; charset="US-ASCII"
Sender: owner-uri@bunyip.com
Precedence: bulk

On Sat, 12 Apr 1997, John C Klensin wrote:

> 
> On Fri, 11 Apr 1997 16:29:57 -0700 (PDT) Larry Masinter 
> <masinter@parc.xerox.com> wrote:
> 
> > Just because a problem is important doesn't
> > mean that we should recommend something that has not yet
> > been demonstrated to actually solve the problem.
> >...
> 
> Dan and Francois,
> 
> While I'm very anxious to see a real solution that 
> addresses the underlying issues here, I'm forced to agree 
> with Larry.  We don't "make" things happen by standardizing 
> untested ideas, and arguments, however logical, that 
> things are easy to do don't move the discussion forward 
> much.

Thanks for admitting that there is some logic behind what
we have been proposing.


> I don't think that timing of standards is much of 
> the issue here.  It is just that we have a large installed 
> base and I'd prefer to see a demonstration that it works 
> well, that it won't cause significant problems with 
> existing (unmodified) clients, servers, or users, etc.

I very much appreciate your concern. However, I have
great difficulty imagining what might actually go
wrong. For example, as long as we stay with %HH, nothing
can possibly go wrong, can it? And if it did, it would
not be UTF-8 that was to blame, but the implementation
that didn't handle %HH correctly.
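A minimal sketch of the %HH point, assuming Python's standard
urllib.parse module. The sample string and the helper name
utf8_percent_encode are made up for illustration and are not taken
from the test URLs mentioned in this thread.

```python
from urllib.parse import quote, unquote_to_bytes


def utf8_percent_encode(segment: str) -> str:
    """Encode one URL path segment as UTF-8 octets written as %HH."""
    # safe="" escapes every octet that is not an unreserved ASCII character.
    return quote(segment, safe="", encoding="utf-8")


encoded = utf8_percent_encode("Zürich")   # 'Z%C3%BCrich'

# The result contains only ASCII, so software that handles %HH
# correctly sees nothing it has not seen before.
is_plain_ascii = encoded.isascii()

# Undoing the escapes gives back the original UTF-8 octets.
round_trip = unquote_to_bytes(encoded).decode("utf-8")
```

As long as the escaped form is passed around unchanged, only the
party that interprets the octets needs to know they are UTF-8.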

If we start to remove %HHs and replace them with 8-bit
octets, more things can go wrong. But they are exactly
the same things that can happen now when this is done
with a legacy encoding. They are mainly related to
the fact that transcoding conserves character identity,
whereas URLs assume octet identity. The recommendation
for UTF-8 will eventually remove these problems, but
during a transition period they will show up more strongly.
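To illustrate character identity versus octet identity, here is a
small sketch, again assuming Python's urllib.parse. ISO-8859-1 stands
in for "a legacy encoding" and the name is an invented example.

```python
from urllib.parse import quote

name = "Zürich"

# The same characters, stored in two different encodings.
latin1_octets = name.encode("iso-8859-1")   # b'Z\xfcrich'
utf8_octets = name.encode("utf-8")          # b'Z\xc3\xbcrich'

# Transcoding keeps the characters but changes the octets.
transcoded = latin1_octets.decode("iso-8859-1").encode("utf-8")
same_characters = transcoded.decode("utf-8") == name
same_octets = latin1_octets == utf8_octets

# A URL is compared octet by octet, so the two escaped forms differ
# even though a reader would call them the same name.
url_latin1 = quote(latin1_octets, safe="")  # 'Z%FCrich'
url_utf8 = quote(utf8_octets, safe="")      # 'Z%C3%BCrich'
```

A gateway that transcodes raw 8-bit octets in transit therefore
changes which resource the URL names, whatever the encodings involved.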

The above applies as long as we don't look at the exact
characters encoded. If we do, we get problems similar to
the 0O0O0O problem (digit zero versus capital letter O)
with ASCII. Again, nothing really new.
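A short sketch of the look-alike problem, assuming Python's
unicodedata and urllib.parse modules. The Latin/Cyrillic pair is my
own choice of example; the ASCII pair is the 0/O case named above.

```python
import unicodedata
from urllib.parse import quote

digit_zero, capital_o = "0", "O"      # the familiar ASCII pair
latin_o, cyrillic_o = "o", "\u043e"   # a similar pair outside ASCII

# Both pairs look alike on paper but are different characters.
ascii_pair = (unicodedata.name(digit_zero), unicodedata.name(capital_o))
wider_pair = (unicodedata.name(latin_o), unicodedata.name(cyrillic_o))

# In the escaped UTF-8 form the difference is plainly visible.
escaped = (quote(latin_o, safe=""), quote(cyrillic_o, safe=""))
# escaped == ('o', '%D0%BE')
```

The confusion arises when a person copies the URL from paper or
screen, which is the same situation as with 0 and O today.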


When asked for implementations, I immediately made two
URLs with UTF-8 encoded characters. Francois made a few
more and included them in a web page. They are available
for anybody to test. We have tested them with the browsers
we have at hand. When asked to write some software to
convert URLs to UTF-8, Francois also wrote such software.
Everybody can use it and test it.
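For readers who want to experiment, the conversion step can be
sketched as follows. This is not Francois's software, only a minimal
illustration assuming Python's urllib.parse; the function name
recode_path_to_utf8, the sample path, and the assumption that the
legacy charset is known in advance are all hypothetical.

```python
from urllib.parse import quote, unquote_to_bytes


def recode_path_to_utf8(path: str, legacy_charset: str) -> str:
    """Re-escape a %HH path from a known legacy charset to UTF-8.

    Simplification: every character outside the unreserved ASCII set
    is escaped, including reserved ones such as ';' within a segment.
    """
    recoded = []
    # Split first, so an escaped slash (%2F) inside a segment is not
    # mistaken for a path separator after unescaping.
    for segment in path.split("/"):
        octets = unquote_to_bytes(segment)    # undo %HH
        text = octets.decode(legacy_charset)  # octets -> characters
        recoded.append(quote(text, safe="", encoding="utf-8"))
    return "/".join(recoded)


converted = recode_path_to_utf8("/st%E4dte/Z%FCrich.html", "iso-8859-1")
# converted == '/st%C3%A4dte/Z%C3%BCrich.html'
```

The hard part in practice is knowing the legacy charset at all, since
the URL itself does not carry that information.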

If you have any ideas of what else would have to be
tested, and how, please tell the list. Everybody
knows that it is hard to test one's own software or
ideas. It's much easier for other people to spot
problems.


Many thanks for your help,	Martin.