Re: revised "generic syntax" internet draft

Chris Newman <> Tue, 15 April 1997 20:51 UTC

Date: Tue, 15 Apr 1997 13:33:35 -0700
From: Chris Newman <>
Subject: Re: revised "generic syntax" internet draft
In-Reply-To: <>
To: John C Klensin <>
Cc: IETF URI list <>
Message-Id: <>
Mime-Version: 1.0
Content-Type: TEXT/PLAIN; charset="US-ASCII"
Precedence: bulk

On Tue, 15 Apr 1997, John C Klensin wrote:
> It would have been better had URLs been carefully and 
> thoughtfully internationalized from the very beginning.  
> For whatever reasons, they weren't.  A conversion now is 
> going to be painful.  But, if the pain is worth it, and I 
> suspect it might be, then let's look to a balanced, 
> equitable, *international* solution, not using UTF-8 
> encoding in the hope that no one who uses ideographic 
> characters will be bothered about what happens to them.

UTF-8 requires 2 octets to encode the non-ASCII characters from the 8859-1
set, which normally take 1 octet.  UTF-8 requires 3 octets to encode
ideographic characters from UCS-2, which normally require 2 octets.  So,
proportionally, western Europeans take a worse storage hit from UTF-8 (2x)
than users of ideographic languages do (1.5x).
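
To make the arithmetic concrete, here is a minimal Python sketch that
counts octets per character.  The function name octet_counts is made up
for illustration, and it assumes the "normal" encoding is ISO 8859-1
where the character fits and UCS-2 otherwise (Python's "utf-16-be" codec
gives the same 2 octets as UCS-2 for any character in the Basic
Multilingual Plane).

```python
def octet_counts(ch):
    """Return (legacy_octets, utf8_octets) for a single character.

    Assumption for illustration: the legacy encoding is ISO 8859-1 when
    the character fits in it, and UCS-2 otherwise.
    """
    utf8 = len(ch.encode("utf-8"))
    try:
        # ISO 8859-1 always uses exactly 1 octet per character.
        legacy = len(ch.encode("latin-1"))
    except UnicodeEncodeError:
        # For BMP characters, UTF-16BE is identical to UCS-2: 2 octets.
        legacy = len(ch.encode("utf-16-be"))
    return legacy, utf8


if __name__ == "__main__":
    # An ASCII letter, an 8859-1 accented letter, and a CJK ideograph.
    for ch in ("e", "\u00e9", "\u65e5"):
        legacy, utf8 = octet_counts(ch)
        print(f"U+{ord(ch):04X}: {legacy} -> {utf8} octets")
```

Plain ASCII stays at 1 octet either way, so the 2x penalty applies only
to the accented and other upper-half 8859-1 characters in a string.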

I'd be willing to consider an alternative proposal to hex-encoded UTF-8
in URLs, but I can't think of one that's viable in practice other than
MIME encoded words (which are too disgusting to consider).

I will say that it took me about 10 minutes to write a converter from
hex-encoded UTF-8 to UCS-2 which looked up the character descriptions in
the publicly available Unicode tables.
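
For anyone who wants to try something similar, here is a rough Python
sketch of such a converter.  It is not the program I wrote; the function
name hex_utf8_to_ucs2 is hypothetical, and it leans on Python's bundled
unicodedata module in place of reading the Unicode tables directly.  It
assumes the input uses %XX hex escapes as in URLs, and it rejects
anything outside the 16-bit UCS-2 range.

```python
import unicodedata
from urllib.parse import unquote_to_bytes


def hex_utf8_to_ucs2(escaped):
    """Decode %XX-escaped UTF-8 into (code point, description) pairs.

    Illustrative sketch only: descriptions come from Python's
    unicodedata module rather than from the raw Unicode data files.
    """
    raw = unquote_to_bytes(escaped)   # "%C3%A9" -> b"\xc3\xa9"
    text = raw.decode("utf-8")        # raises UnicodeDecodeError if malformed
    result = []
    for ch in text:
        code_point = ord(ch)
        if code_point > 0xFFFF:
            # UCS-2 can only represent the Basic Multilingual Plane.
            raise ValueError(f"U+{code_point:X} does not fit in UCS-2")
        result.append((code_point, unicodedata.name(ch, "<unnamed>")))
    return result


if __name__ == "__main__":
    for code_point, name in hex_utf8_to_ucs2("caf%C3%A9%E6%97%A5"):
        print(f"{code_point:04X}  {name}")
```

Unescaped ASCII passes through unchanged, so a mixed string such as
"caf%C3%A9" decodes to four UCS-2 values.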