Re: revised "generic syntax" internet draft

"Roy T. Fielding" <> Tue, 15 April 1997 23:33 UTC

To: Chris Newman <>
Cc: IETF URI list <>
Subject: Re: revised "generic syntax" internet draft
In-Reply-To: Your message of "Tue, 15 Apr 1997 13:07:23 PDT." <>
Date: Tue, 15 Apr 1997 16:12:43 -0700
From: "Roy T. Fielding" <>
Precedence: bulk

>Here's the approaches to i18n I've seen:
>(1) US-ASCII only
>(2) ISO-8859-1 only
>(3) whatever localized character set is in use
>(4) Explicit labelling of character set
>(5) Unicode derivative.
>(1) Never works because it doesn't satisfy demand.
>(2) Never works and is even worse than (1) because not only does it fail
>to satisfy demand, but it uses up the "undefined" codepoints in such a way
>that an interoperable solution *can't* be deployed.
>(3) Never works, because it doesn't interoperate.  It results in a bunch
>of islands which can't communicate, except via US-ASCII.

But that is what Martin said he wanted -- the ability of an author to
decide what readership is most important.  Why is it that it is okay
to localize the address, but not to localize the charset?

Will it lead to interoperability problems?  Yes, at least until the world
accepts a common charset of its own accord.

>(4) Works fine, but is very hard to support for ideographic characters.
>Dealing with mapping tables between ISO-2022, Unicode and whatever
>character set is supported by the display system is very hard.
>(5) Works fine, and has potential to be easier to support than (4).

Excuse me, but it doesn't work at all unless all systems use the same
charset for encoding URLs.  Since that is not the case today, we would
have to scrap all existing servers and browsers in order for (5) to work.
In other words, it is not an acceptable solution to those of us who
have to implement the specified protocol.

>The status quo in URLs is a mixture of (1), (2), and (3).  This is
>completely unacceptable for an interoperable solution.  We *MUST* move
>towards (4) or (5).  Given that I've heard no proposals along the lines of
>MIME header encoded words, the only solution on the table is (5).

(3) does move toward (5).  It even becomes (5) when people are using UTF-8.
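
To make the point concrete, here is a minimal sketch in Python of how the
%-encoded octets depend on the charset used to produce them. The helper
names `encode_local` and `encode_utf8` and the sample name "café" are
made up for illustration; they are not part of any draft or server.

```python
from urllib.parse import quote, unquote


def encode_local(text, charset):
    """Approach (3): %-encode the octets of whatever charset is in use."""
    return quote(text, encoding=charset)


def encode_utf8(text):
    """Approach (5): always %-encode the UTF-8 octets."""
    return quote(text, encoding="utf-8")


name = "caf\u00e9"  # hypothetical resource name containing one non-ASCII letter

latin1_url = encode_local(name, "iso-8859-1")
utf8_url = encode_local(name, "utf-8")

# Same characters, different octets on the wire.
print(latin1_url)
print(utf8_url)

# When the local charset happens to be UTF-8, (3) and (5) coincide.
print(utf8_url == encode_utf8(name))

# A reader who assumes UTF-8 cannot recover the Latin-1 form: the lone
# 0xE9 octet is not valid UTF-8 and becomes a replacement character.
print(unquote(latin1_url, encoding="utf-8", errors="replace"))
```

Nothing in the encoded URL says which charset was used, so a recipient
can only guess unless everyone has converged on the same one.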

>I will also point out that when a URL contains unencoded 8-bit characters
>and is embedded in a properly charset-labelled document, there are no
>problems as the interpretation is clear.   We do need to deal with the
>interpretation of %-encoded 8-bit characters.  If we're ambitious, we can
>also address the issue of unlabelled unencoded 8-bit characters, but I'd
>be tempted to avoid that rathole.
>The biggest failure of HTTP/HTML was choosing (2) above when MIME already
>had a perfectly functional solution (4).

This is totally unrelated, but you seem to be confused.  HTTP has always
defined (4).
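
For anyone unfamiliar with the mechanism: the label travels as the
charset parameter of the Content-Type header field. A minimal sketch in
Python of reading it follows. `charset_of` is a hypothetical helper, and
the ISO-8859-1 fallback is my reading of the default HTTP gives text
types when no parameter is present, so treat it as an assumption.

```python
from email.message import Message


def charset_of(content_type, default="iso-8859-1"):
    """Return the charset label from a Content-Type value, lowercased.

    The fallback is an assumption: the default applied to text types
    when the sender includes no charset parameter.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset(default)


# Explicitly labelled body: the label says how to interpret the octets.
print(charset_of("text/html; charset=EUC-JP"))

# Unlabelled body: only the default applies.
print(charset_of("text/html"))
```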