Re: UTF-8 and URLs

"Martin J. Duerst" <> Fri, 25 April 1997 15:57 UTC

Date: Fri, 25 Apr 1997 15:27:02 +0200
From: "Martin J. Duerst" <>
Reply-To: "Martin J. Duerst" <>
To: Larry Masinter <>
Cc: John C Klensin <>,
Subject: Re: UTF-8 and URLs
In-Reply-To: <>
Message-Id: <Pine.SUN.3.96.970425100102.245p-100000@enoshima>
Mime-Version: 1.0
Content-Type: TEXT/PLAIN; charset="US-ASCII"
Precedence: bulk

Hello Larry,

Many thanks for your recent message, which is extremely encouraging
and is focussing the discussion in the right direction. I will
mainly address the general issues in this mail, and send
a separate mail with a few technical comments.

Please read to the end of the mail, because I have to make
an important precondition.

On Thu, 24 Apr 1997, you wrote:

> I think to actually solve the problem of Internationalization
> of URLs we need two recommendations:

[comprehensive summary deleted]

> These three recommendations affect software from a large number
> of different producers. To make progress in the community,
> those software implementors will need to agree that this is
> the best solution to interoperability of URLs internationally.
> I think given its likely controversial nature, we should clearly
> make these recommendations in a separate RFC, and perhaps with
> a new working group.

I definitely think that a working group is a good thing to have,
so that we have an agenda, a chair who can cool down things when
we get heated up, and so on.

I also think that a separate RFC, or indeed several separate
RFCs, are needed, mainly because some of them may come to contain
much text/data, and they will address various issues that
touch various areas outside URLs proper.

Currently, I see the following possible RFCs:

- Rationale for Internationalized Identifiers
	Explaining where and why internationalized identifiers
	are useful/necessary/possible, answering some of the
	most frequently raised objections, and stating where
	and when internationalized identifiers should not
	be used. This should not be seen as a rationale
	document that needs to precede everything else, but
	rather as an explanatory (informal) document that
	we can refer to when people ask basic questions.

- Internationalized URL Architecture
	Central document explaining the basic workings, what
	to interpret/convert to what in what case, including
	some details about upgrading strategy

- Normalization for Internationalized Identifiers
	Listing character ranges and characters that should
	be mapped to others to avoid problems, such as
	compatibility characters, combining sequences,...;
	this would be written to be useful for other things
	than just URLs (things that might end up in URLs
	eventually anyway); we would expect input from
	the character sets standards groups here

- Bidirectionality for Internationalized Identifiers
	Giving exact specifications for handling BIDI
	in internationalized identifiers. This might
	be merged with the previous one, as it should
	be quite short.

- Handling Internationalized Forms and Query Parts
	This would define the conventions and additions to
	HTML/HTTP along the lines we have been discussing. 

> I'm willing to put this all down in a separate internet draft,
> if it will help focus the process on actually making progress.
> Some of the examples that have been sent out to the mailing list
> will be useful to guide the recommendations in the RFC.

I have repeatedly volunteered to author such drafts, and I would
be very happy to work together with Larry and others on these.
Given the amount of mail I have written on these issues in the
past few weeks/months and the discussions and presentations
I have enjoyed with many other people concerned about these
topics, and given my current workload, I think it should be
possible to produce first drafts of the above by mid May or
end of May.

Because the general theme of the above is internationalization
of identifiers, I propose to name the group iii, standing for
Internationalization of Internet Identifiers. I know that the
IETF likes very focussed groups with clear goals, and I think
we can very well focus on one goal and a few documents at a
time. On the other hand, it seems sensible to have a group
with a general name so that we can retarget it if necessary
after having completed our first goals.

After all these proposals, I have to add an important
RESERVATION. I think it makes sense to have separate drafts,
because of the things discussed above and because the current
draft is rather advanced, and should proceed quickly now.
However, having the two things *completely* unrelated, and
leaving readers of the new RFC-to-be ignorant about what is
going on, is in my opinion a very dangerous and bad idea.

Therefore, I think we need SOME language in the current
draft that makes its readers aware of what is going on,
so that incompatibilities and bad surprises can be avoided,
or at least so that we are not blamed for them if people
don't read the draft.

For formal reasons, these comments should probably take the
form of a note. They should say that:

- This document does not define how to handle or to map
	characters outside the US-ASCII repertoire.
	[this point doesn't need to be a note]
- An extension of this standard to make it possible for URLs
	to contain characters beyond US-ASCII where this
	is feasible is under discussion.
- This extension will be based on using UTF-8 for the
	character<->octet mapping.
- The use of characters outside US-ASCII to write down URLs might
	currently seem to work in some cases and configurations,
	but is not guaranteed widely enough and is strongly
	discouraged until the extension is available.
- URLs where non-US-ASCII octets are correctly escaped with %HH
	will not be affected by the extension and will continue
	to work correctly.
- When making available new URLs which represent characters outside
	US-ASCII, where feasible these should be made available
	by using UTF-8 as a character->octet mapping.
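
To illustrate the last three points, here is a minimal sketch of the
character->octet->%HH path, assuming UTF-8 as the mapping. It is not
proposed text for the note. The function name escape_utf8 and the
set of characters left unescaped are assumptions made for this
example only; the draft itself defines which characters are safe.

```python
from urllib.parse import unquote

# Assumed for illustration: ASCII letters, digits, and these marks stay as-is.
SAFE_MARKS = "-_.~/"

def escape_utf8(text):
    """Map characters to octets with UTF-8, then escape unsafe octets as %HH."""
    out = []
    for octet in text.encode("utf-8"):
        ch = chr(octet)
        if octet < 0x80 and (ch.isalnum() or ch in SAFE_MARKS):
            out.append(ch)                 # safe US-ASCII octet, kept literally
        else:
            out.append("%%%02X" % octet)   # every other octet becomes %HH
    return "".join(out)

# U+00FC (u with diaeresis) is the two octets C3 BC in UTF-8.
escaped = escape_utf8("/D\u00fcrst")
print(escaped)  # /D%C3%BCrst

# The escaped form is pure US-ASCII and decodes back to the same characters.
print(unquote(escaped, encoding="utf-8") == "/D\u00fcrst")  # True
```

The escaped result contains only US-ASCII, so it is a legal URL
today and would keep working unchanged once the extension exists.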

The above points cover, I hope, what we can say currently, in a
way that does not restrict us too much or have us promise too
much, while avoiding bad surprises for the readers and for us.
Also, it should be feasible as a note even in a draft standard.
Of course, I am open to any kind of suggestions for better wording.

I hope that we can proceed along these lines, which I think
form a compromise acceptable to the widest part of our group.

With kind regards,	Martin.