Re: revised "generic syntax" internet draft

"Martin J. Duerst" <> Mon, 21 April 1997 10:37 UTC

Date: Mon, 21 Apr 1997 12:02:53 +0200
From: "Martin J. Duerst" <>
To: "Roy T. Fielding" <>
Subject: Re: revised "generic syntax" internet draft
In-Reply-To: <>
Message-Id: <Pine.SUN.3.96.970421113701.245E-100000@enoshima>
Mime-Version: 1.0
Content-Type: TEXT/PLAIN; charset="US-ASCII"
Precedence: bulk

Hello Roy,

On Fri, 18 Apr 1997, you wrote:

> Martin, I haven't forgotten about your very detailed problem statement
> at <>.  My question was
> whether all the other people advocating non-ASCII URLs agree to that
> problem statement,

I guess that even though many people have expressed the problems (and
solutions) in different words, there is wide agreement on them. The
summary you have written also expresses the same problems, but the
solutions you give are not satisfactory to many of us.

> and in particular to the course of action for the
> current draft revision.

There have been various opinions, from "leave it as it is, deal with
internationalization separately" to "take the chance to recycle and
deal with i18n completely". I personally tend towards the latter, but
you and Larry have worked hard on the current draft, and many aspects
of i18n URLs (such as BIDI) need long and detailed specs, so I think
there should be some middle solution. That would be to include, in
the current draft, a clear indication of where we are heading (UTF-8),
so that people stay tuned and can take the necessary steps. For
example, someone who has to decide whether to use some legacy encoding
or UTF-8 for filenames when setting up a server site can choose UTF-8,
knowing that this will make things easier in the long run. Other
documents can then describe the more advanced things.

> >and looks into the way configuration information can be
> >setup for Apache to inform it about special needs of scripts
> >and stuff, before he again claims things to be impossible.
> It is impossible for Apache to correctly transcode incoming URLs for the
> same reason that it is impossible for current browsers to decode and display
> the encoded octets of received URLs -- a program cannot transcode bytes to
> a different charset unless it knows how the bytes are currently encoded.
> There is nothing you can do in the Apache configuration to change that
> fact, since it is a property of how the URL is generated (either by some
> other part of the server or some part of the user agent or some author
> of any page in the Web).

My comment about the Apache configuration was about letting the server
know which encoding the target of the URL (filename, CGI parameter, ...)
is in. To convert correctly, you need to know the "charset" of both the
source and the target.

As for the source (I have explained this already), if we expand the
current heuristic "same as target" to "same as target or UTF-8", and
not to "whatever it might be", then in sparse namespaces, we have
something like a 99.999% hit rate, and because of the properties
of UTF-8, we only occasionally need a second file system access.
For dense namespaces, we need some information from the browser
to distinguish "same as up to now" from "UTF-8", and I have already
described the "FORM-UTF8: Yes" that does this job.
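
To make the "same as target or UTF-8" heuristic concrete, here is a
minimal sketch in Python. It is only an illustration, not how any server
actually does it: the function name candidate_names, the order of the
two attempts, and the way the target charset is passed in (standing in
for the configuration information) are all assumptions.

```python
from urllib.parse import unquote_to_bytes


def candidate_names(url_path, target_charset):
    """Return the byte strings to look up on the target side, in order.

    Hypothetical helper: the name, the argument order, and the lookup
    order are assumptions made for this sketch.
    """
    # Undo the %HH escaping; the result is raw octets of unknown charset.
    raw = unquote_to_bytes(url_path)

    # First guess, "same as target": use the octets unchanged.
    candidates = [raw]

    # Second guess, "UTF-8": only possible if the octets are valid UTF-8
    # and the characters exist in the target charset.
    try:
        text = raw.decode("utf-8")  # strict decoding rejects most legacy data
        transcoded = text.encode(target_charset)
    except (UnicodeDecodeError, UnicodeEncodeError):
        return candidates

    # Pure ASCII transcodes to itself, so no second lookup is needed.
    if transcoded != raw:
        candidates.append(transcoded)
    return candidates
```

With a Latin-1 file system, "/caf%E9" gives a single candidate because
the octet E9 alone is not valid UTF-8, while "/caf%C3%A9" gives two, and
only the second one costs an extra file system access.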

> I think there is a way to define UTF-8 preference for URL encoding
> such that it won't break existing services, by forbidding transcoding
> of already-encoded octets.

By "already-encoded", do you mean already encoded with %HH?
Of course, things that are encoded in %HH should be treated as
binary and not messed around with. Once UTF-8 is firmly established,
there might be instances that have a look at the %HH-sequences,
find out that they look like UTF-8 (very rare for arbitrary sequences,
unless they are only ASCII), and convert them to real characters.
On converting back from real characters, UTF-8 would also be used,
and so the same octets would be reproduced even if they were not
originally UTF-8. However, apart from user-interface-only cases,
this won't be frequent.
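
The round trip described above can be sketched as follows, again in
Python and again only under assumptions: display_form and
back_to_escaped are hypothetical names, and the sketch ignores the
complication that a decoded reserved character such as "/" changes the
meaning of a URL.

```python
from urllib.parse import quote, unquote_to_bytes


def display_form(escaped):
    """Show %HH sequences as characters only if they look like UTF-8.

    Hypothetical helper for this sketch.
    """
    raw = unquote_to_bytes(escaped)
    try:
        return raw.decode("utf-8")  # strict: fails on non-UTF-8 octets
    except UnicodeDecodeError:
        return escaped  # treat as binary and leave the %HH untouched


def back_to_escaped(text):
    """Convert characters back to %HH, always by way of UTF-8."""
    return quote(text.encode("utf-8"), safe="/")
```

If the octets C3 A9 were really meant as two Latin-1 characters, they
still pass the UTF-8 check and are shown as one character, but
back_to_escaped reproduces exactly the same "%C3%A9", so nothing is
lost on the wire.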

> However, I won't bother to explain that
> until there is broad agreement on what needs to be solved.

Please go on and explain your ideas! Maybe they are even closer
to mine than you think :-).

Regards,	Martin.