Re: revised "generic syntax" internet draft

Chris Newman <> Wed, 16 April 1997 00:23 UTC

Date: Tue, 15 Apr 1997 17:12:09 -0700
From: Chris Newman <>
Subject: Re: revised "generic syntax" internet draft
In-Reply-To: <>
To: "Roy T. Fielding" <>
Cc: IETF URI list <>
Message-Id: <>
Mime-Version: 1.0
Content-Type: TEXT/PLAIN; charset="US-ASCII"
Precedence: bulk

On Tue, 15 Apr 1997, Roy T. Fielding wrote:
> >(3) whatever localized character set is in use
> >
> >(3) Never works, because it doesn't interoperate.  It results in a bunch
> >of islands which can't communicate, except via US-ASCII.
> But that is what Martin said he wanted -- the ability of an author to
> decide what readership is most important.  Why is it that it is okay
> to localize the address, but not to localize the charset?

I can't speak for Martin.  But if I understand what you're
saying, my response is that people want to use their own language in URLs
and will do so whatever the standard says.  If we define a standard way
for them to include their national characters in such a way that those
characters won't be misinterpreted by the recipient, then we've achieved
interoperability.  That's the goal of protocol design.

> >(5) Works fine, and has potential to be easier to support than (4).
> Excuse me, but it doesn't work at all unless all systems use the same
> charset for encoding URLs.  Since that is not the case today, we would
> have to scrap all existing servers and browsers in order for (5) to work.
> In other words, it is not an acceptable solution to those of us who
> have to implement the specified protocol.

I don't think any of the programs which display URLs try to interpret hex
encoded %80 - %FF.  So no URL display programs will break.  Now if there's
a URL entry program which permits non-ASCII characters and maps them to
%80 - %FF using local conventions, that program will break.  But that
program is also already in violation of the current specification (which
restricts URLs to US-ASCII).  Therefore the only software which is forced
to upgrade by this change is software which already violates the standard.
If anything, that's an argument to make this change.

So the transition plan is simple:

(A) URL entry programs (which currently are restricted to US-ASCII by the
specification) are upgraded so they map non-ASCII characters to hex
encoded UTF-8.

(B) URL display programs are upgraded so they map hex encoded UTF-8 to the
correct display characters.

(C) URL display programs which aren't upgraded just show hex encoded
UTF-8, as they do today.
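
As a rough illustration of steps (A) through (C), here is a minimal Python
sketch. The function names are hypothetical and not taken from any spec or
existing program. It leans on the standard library's urllib.parse, and for
simplicity the display side decodes every %XX escape, not only %80 - %FF.

```python
from urllib.parse import quote, unquote

def entry_to_url(text: str) -> str:
    """(A) Entry side: map non-ASCII characters to hex-encoded UTF-8."""
    # quote() leaves unreserved ASCII alone and emits %XX for each UTF-8 byte.
    return quote(text, safe="/", encoding="utf-8")

def display_upgraded(url: str) -> str:
    """(B) Upgraded display: map hex-encoded UTF-8 back to characters."""
    try:
        return unquote(url, encoding="utf-8", errors="strict")
    except UnicodeDecodeError:
        # The escapes are not valid UTF-8 (e.g. a legacy local charset),
        # so show the URL untouched rather than guess.
        return url

def display_legacy(url: str) -> str:
    """(C) Display that was never upgraded: shows the hex escapes as-is."""
    return url

url = entry_to_url("café")
print(url)                      # caf%C3%A9
print(display_upgraded(url))    # café
print(display_legacy(url))      # caf%C3%A9
```

The URL on the wire stays pure US-ASCII in all three cases, so a program
that is never upgraded behaves exactly as it does now.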

> (3) does move toward (5).  It even becomes (5) when people are using UTF-8.

(4) can move towards (5), but (3) can't.   With unlabelled character sets
you just get interoperability problems.  Look at it this way: if fred and
sam are using localized character set thingbats, and fred tries to
transition to UTF-8, all of a sudden fred and sam are completely unable to
communicate and see garbage at the other end.  A transition is only
achievable if the character set is labelled.
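
To make the fred and sam case concrete, here is a small Python sketch. It
assumes ISO-8859-1 as a stand-in for the local character set "thingbats",
and the helper names are made up for the example.

```python
def receive_unlabelled(data: bytes, local_charset: str) -> str:
    """Recipient has no label, so it assumes its own local charset."""
    return data.decode(local_charset, errors="replace")

def receive_labelled(data: bytes, label: str) -> str:
    """Recipient decodes with the charset the sender declared."""
    return data.decode(label)

# Fred has moved to UTF-8; Sam still uses the local charset.
from_fred = "é".encode("utf-8")        # b'\xc3\xa9'
from_sam = "é".encode("iso-8859-1")    # b'\xe9'

# Unlabelled: each side guesses, and each side sees garbage.
print(receive_unlabelled(from_fred, "iso-8859-1"))   # Ã©
print(receive_unlabelled(from_sam, "utf-8"))         # U+FFFD replacement char

# Labelled: the same bytes decode correctly on either side.
print(receive_labelled(from_fred, "utf-8"))          # é
print(receive_labelled(from_sam, "iso-8859-1"))      # é
```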

Any time a spec either implicitly or explicitly says X is implementation
defined, it is promoting a non-interoperable solution.  The URL spec
currently leaves the interpretation of %80 - %FF as implementation