Re: revised "generic syntax" internet draft

"Martin J. Duerst" <> Mon, 21 April 1997 13:12 UTC

Date: Mon, 21 Apr 1997 14:22:11 +0200
From: "Martin J. Duerst" <>
To: "Roy T. Fielding" <>
Subject: Re: revised "generic syntax" internet draft
In-Reply-To: <>
Message-Id: <Pine.SUN.3.96.970421135245.245G-100000@enoshima>
Mime-Version: 1.0
Content-Type: TEXT/PLAIN; charset="US-ASCII"
Precedence: bulk

On Wed, 16 Apr 1997, Roy T. Fielding wrote:

>                   Can the people who are advocating change(s) to the
> existing draft please communicate with each other and develop
> at least one problem statement that covers what it is you want the
> editors to fix, how you want it fixed, and an estimation of what
> it will take to deploy the change? 

This may be somewhat repetitive, but just to be clear:

> In the past 24 hours I have been
> told four different and conflicting goals:
>    1. natural language URLs (i.e., non-ASCII, non-encoded URL strings)

This is the final aim. In order to progress towards that aim, we first
have to establish a single widely-used character<->octet conversion.

>    2. URLs that must always be UTF-8 in order to pass form data around
>       in a deprecated manner that is somehow different than the charset
>       of the form.

Because there is little overlap between "entry point URLs" and
"form URLs", this is only marginally relevant to fully achieving 1.
On the other hand, using the charset of the form is not guaranteed
to work under transcoding (I will explain that in another mail),
and so converting to UTF-8 here is also desirable.

>    3. URLs that are not natural language and always represented as
>       ASCII, but are restricted to UTF-8 in order to avoid future
>       transcoding problems [which itself is odd, since there are
>       no transcoding problems if the URL is always represented as ASCII
>       and never converted by the recipient].

No. We definitely don't need UTF-8 just for telephone-number-like URLs.

>    4. URLs that are always transmitted as ASCII, but must be encoded
>       as UTF-8 so that browsers can display the URL in decoded form
>       [this is the URN syntax compromise].

This is an intermediate solution, if you want. If people are not
allowed to write native characters on paper, there won't be any
benefit. But once UTF-8 is established for URLs, adding the
natural display/input facilities and so on will come naturally.
The "must" of course should be a "should".

>    5. URLs that are recommended to be UTF-8 but only if that's okay
>       with the server, apparently so that more people will use UTF-8
>       instead of some other charset.  [this is Martin's proposal]

You can add pretty much anything after "only if that's okay with".
But on the other hand, you also should add "servers (or whatever)
should work on showing/accepting their URLs with UTF-8".

> I thought the goal was to solve 1 and 2 (and that 2 is already solved).

The main goal is 1, and 2 is also an important goal. 2 is solved
in some cases, but not in all.

> Number 3 appears to be a solution without a goal.

Exactly. Not worth discussing anymore.

> Number 4 is a solution
> if you don't mind invalidating existing use of %xx encoding.

This is an intermediate step towards achieving 1. Because it's a
recommendation, existing %xx encodings are not invalidated.
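
To make the "recommendation, not requirement" point concrete, here is a
minimal sketch of how a recipient might behave. It assumes Python's
standard urllib.parse module, and decode_path is a hypothetical helper
name, not anything from the draft: UTF-8 is tried first, and escapes that
are not valid UTF-8 are simply kept as the octets they always were.

```python
from typing import Union
from urllib.parse import unquote_to_bytes


def decode_path(escaped: str) -> Union[str, bytes]:
    """Hypothetical helper: prefer UTF-8, otherwise keep the raw octets."""
    octets = unquote_to_bytes(escaped)  # turn each %xx back into one octet
    try:
        # Recommended case: the octets are a UTF-8 encoding of characters.
        return octets.decode("utf-8")
    except UnicodeDecodeError:
        # Legacy case: charset unknown, so the octets stay opaque but valid.
        return octets


new_style = decode_path("D%C3%BCrst")  # UTF-8 escapes, decodes to characters
old_style = decode_path("D%FCrst")     # Latin-1 style escape, stays as octets
```

A real server would of course still have to map the opaque octets to a
resource the way it does today; the sketch only shows that nothing
existing stops working.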

> Number 5 appears to be a political statement, since it doesn't solve any
> problem (at least with existing systems).

In some sense, it is indeed a statement of intended standard development.
The message we want to be passed along, in full length, could read
about as follows:

Currently, there is a discrepancy in the handling of characters by
URLs in that ASCII characters are (apparently) handled as characters,
while characters beyond ASCII are handled by their octet values in
an arbitrary character set, which makes it impossible in most situations
to use such characters.
As a first step towards remedying this situation, we need to establish
a single character<->octet mapping for characters beyond ASCII, and
we have decided to use UTF-8 for this purpose.
Wherever possible, mechanisms for generating and accepting URLs
are requested (but not required) to use UTF-8 for their character<->
octet mapping.
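
As a rough illustration of the discrepancy described above (a sketch only,
assuming Python's standard urllib.parse module; the example name and the
choice of ISO-8859-1 as the "other" charset are mine): the same character
beyond ASCII yields different %xx sequences depending on the charset, and
a recipient assuming a different charset recovers different characters.

```python
from urllib.parse import quote, unquote

name = "D\u00fcrst"  # "Duerst" written with u-umlaut; illustrative value

# The same character maps to different octets, hence different %xx escapes.
latin1_url = quote(name, encoding="iso-8859-1")
utf8_url = quote(name, encoding="utf-8")

# A recipient that guesses the wrong charset gets the wrong characters.
misread = unquote(utf8_url, encoding="iso-8859-1")

# With one agreed character<->octet mapping the round trip is unambiguous.
roundtrip = unquote(utf8_url, encoding="utf-8")

print(latin1_url)  # D%FCrst
print(utf8_url)    # D%C3%BCrst
print(roundtrip == name, misread == name)  # True False
```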

> It is pointless to argue about what will or will not solve the problem
> when we are all talking about different problems.

We may be talking about different steps, but always towards the
same basic problem.

> I don't want to see eight different problem statements vaguely related
> to UTF-8, just one that defines an actual protocol problem that has
> to be fixed right now.

"right now" is a little dangerous. Ideally, it should have been
fixed 5 years ago or so :-). But that shows that the earlier we
start to change, and the more clearly we say what should
happen, the better.

Regards,	Martin.