Re: revised "generic syntax" internet draft

"Martin J. Duerst" <> Sat, 19 April 1997 18:21 UTC

Date: Sat, 19 Apr 1997 19:06:27 +0200
From: "Martin J. Duerst" <>
To: John C Klensin <>
Cc:,,, Dan Oscarsson <>
Subject: Re: revised "generic syntax" internet draft
In-Reply-To: <>
Message-Id: <Pine.SUN.3.96.970419185030.708e-100000@enoshima>
Mime-Version: 1.0
Content-Type: TEXT/PLAIN; charset="US-ASCII"
Precedence: bulk

On Wed, 16 Apr 1997, John C Klensin wrote:

> wrote:
> > Factoid:
> > 
> > UTF-8 is not user-friendly in 8859-1; the standard coding octets for
> > putting the 8859-1 charset into UTF-8 insert one character in front of
> > each character, and also change the last character for the 4 uppermost
> > columns of the 8859-1 character table.
> My apologies.  I should have said something more like "more 
> user-friendly for Latin-1 than it is for upper-end 
> ideographic characters, where it deteriorates even more 
> severely :-(

You may end up viewing UTF-8 in a terminal emulator or editor
that is not set up for it, and then the effects above do occur,
but this should be rare in practice. Nor would ideographic
characters look any better if you viewed them with an 8859-1
editor or the like.
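
As a rough illustration of the effect discussed above, the following
Python sketch encodes text as UTF-8 and then deliberately misreads the
octets as ISO 8859-1, which is what a viewer set to the wrong charset
does. The function name view_as_latin1 is made up for this example.

```python
def view_as_latin1(text: str) -> str:
    """Encode text as UTF-8, then misread the octets as ISO 8859-1.

    This simulates a terminal or editor that has not been set to UTF-8.
    """
    return text.encode("utf-8").decode("latin-1")


# Upper two columns of 8859-1 (0xC0-0xFF): an extra lead character
# appears and the original character itself is changed.
print(view_as_latin1("\u00e9"))       # e-acute shows up as: Ã©

# Columns 0xA0-0xBF: an extra lead character appears, but the
# original character is left as it was.
print(view_as_latin1("\u00a9"))       # copyright sign shows up as: Â©

# Plain ASCII is unaffected.
print(view_as_latin1("abc"))          # abc

# One ideographic character turns into three unrelated 8859-1 characters.
print(len(view_as_latin1("\u65e5")))  # 3
```

Note that the damage only appears when the viewer's charset setting is
wrong; the UTF-8 octets themselves are intact.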

First, we don't want UTF-8 and 8859-1 (or any other legacy
encoding) mixed in the same document. Once everything works as
envisioned, a Western European URL carried in an 8859-1 document
is transported as characters, encoded in 8859-1. Only when the
URL is converted to %HH escapes, or to raw 8-bit URL octets that
carry no information about their character encoding, do you
switch to UTF-8.
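
A minimal sketch of that conversion step, assuming the URL arrives as
octets in a known legacy charset: decode to characters first, then
produce %HH escapes from the UTF-8 octets. The helper name
legacy_bytes_to_escaped and the sample path are hypothetical; quote and
unquote are from Python's standard urllib.parse module.

```python
from urllib.parse import quote, unquote


def legacy_bytes_to_escaped(raw: bytes, source_charset: str) -> str:
    """Turn a URL path in a legacy charset into a %HH-escaped ASCII string.

    The octets are first decoded to characters using the charset of the
    surrounding document, and only then escaped via UTF-8.
    """
    text = raw.decode(source_charset)
    return quote(text, safe="/", encoding="utf-8")


# "/café" as it would be stored in an ISO 8859-1 document (illustrative).
latin1_path = b"/caf\xe9"

escaped = legacy_bytes_to_escaped(latin1_path, "iso-8859-1")
print(escaped)                             # /caf%C3%A9

# The receiver can recover the characters without knowing the
# charset of the original document.
print(unquote(escaped, encoding="utf-8"))  # /café
```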

So you would edit a list of 8-bit URLs with a UTF-8 editor,
and you would edit a Japanese HTML document containing some URLs
with, e.g., an EUC editor (the two editors may be the same program
using autodetection). If you cut and paste between the two
editors (or the two windows), the characters should stay the
same while the underlying representation changes. This matches
what is expected in all other kinds of text processing.
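
The cut-and-paste behaviour described above can be sketched as a
transcoding step: the clipboard content is interpreted as characters in
the source window's encoding and re-encoded for the target window. The
function name paste and the sample text are assumptions made for this
illustration.

```python
def paste(clipboard: bytes, source_charset: str, target_charset: str) -> bytes:
    """Move text between two windows that use different encodings.

    The characters are preserved; only the octet representation changes.
    """
    return clipboard.decode(source_charset).encode(target_charset)


text = "\u65e5\u672c"                   # two ideographic characters
in_euc_window = text.encode("euc_jp")   # octets as the EUC editor holds them

in_utf8_window = paste(in_euc_window, "euc_jp", "utf-8")

# Same characters on both sides...
print(in_utf8_window.decode("utf-8") == in_euc_window.decode("euc_jp"))  # True
# ...but different underlying octets.
print(in_utf8_window == in_euc_window)                                   # False
```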

> Given the bad behavior *even* for 8859-1, could someone 
> please remind me why we are pushing the thing rather than a 
> straight 16 or 32-bit encoding with compression if needed?  

Given that pure ASCII is still the best choice for URLs intended
for global exchangeability, that enormous amounts of energy can
be saved if we don't reinvent everything from scratch, that the
bad behaviour described above can happen by accident but is not
part of normal operation, and that designing a compression
scheme for short strings such as URLs is not exactly easy, I
think using UTF-8, which is supported by a lot of software and
used in many other places, is not the worst thing to do.
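
One way to see the trade-off for ASCII-only URLs is to compare encoded
sizes. This is only a sketch: the sample URL is a placeholder, and the
"straight 16 or 32-bit" encodings are approximated here by Python's
utf-16-le and utf-32-le codecs (no byte order mark, no compression).

```python
def encoded_sizes(url: str) -> dict:
    """Return the number of octets needed for url in each encoding."""
    return {
        name: len(url.encode(name))
        for name in ("utf-8", "utf-16-le", "utf-32-le")
    }


url = "http://example.com/"   # placeholder, pure ASCII
sizes = encoded_sizes(url)

# For pure ASCII, UTF-8 needs one octet per character and the octets
# are identical to the ASCII ones; the wider encodings need two and
# four octets per character respectively.
for name, size in sizes.items():
    print(name, size, size // len(url))

print(url.encode("utf-8") == url.encode("ascii"))   # True
```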

Regards,	Martin.