Re: [URN] Re: I18N does not belong in URNs

Martin J Duerst <mduerst@ifi.unizh.ch> Fri, 15 November 1996 18:42 UTC

Received: (from daemon@localhost) by services.bunyip.com (8.6.10/8.6.9) id NAA15446 for urn-ietf-out; Fri, 15 Nov 1996 13:42:31 -0500
Received: from mocha.bunyip.com (mocha.Bunyip.Com [192.197.208.1]) by services.bunyip.com (8.6.10/8.6.9) with SMTP id NAA15437 for <urn-ietf@services.bunyip.com>; Fri, 15 Nov 1996 13:42:28 -0500
Received: from josef.ifi.unizh.ch by mocha.bunyip.com with SMTP (5.65a/IDA-1.4.2b/CC-Guru-2b) id AA29572 (mail destined for urn-ietf@services.bunyip.com); Fri, 15 Nov 96 13:41:39 -0500
Received: from ifi.unizh.ch by josef.ifi.unizh.ch id <00924-0@josef.ifi.unizh.ch>; Fri, 15 Nov 1996 19:40:39 +0100
Subject: Re: [URN] Re: I18N does not belong in URNs
To: yergeau@alis.com
Date: Fri, 15 Nov 1996 19:40:38 +0100
Cc: dgd@cs.bu.edu, urn-ietf@bunyip.com
In-Reply-To: <2.2.32.19961115155024.007169c0@genstar.alis.ca> from "Francois Yergeau" at Nov 15, 96 10:50:24 am
Mime-Version: 1.0
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit
Content-Length: 4755
From: Martin J Duerst <mduerst@ifi.unizh.ch>
Message-Id: <"josef.ifi..890:15.10.96.18.40.40"@ifi.unizh.ch>
Sender: owner-urn-ietf@services.bunyip.com
Precedence: bulk
Reply-To: Martin J Duerst <mduerst@ifi.unizh.ch>
Errors-To: owner-urn-ietf@bunyip.com

Francois Yergeau wrote:

>I fail to see why the %-encoded URN should be the reference.  This is a
>fallback to the bad old 7-bit days, and results in a needless waste of
>bandwidth and storage resources. Reading the recent report of the IAB
>Character Set Workshop (draft-weider-iab-char-wrkshop-00.txt)

It's nice to have this available finally.

>, I find in
>section 8.2 (Recommendations for new Internet protocols):
>
>  "New protocols do not suffer from the need to be compatible
>   with old 7-bit pipes. New protocol specifications SHOULD
>   use ISO 10646 as the base charset unless there is an
>   overriding need to use a different base charset."

That's indeed what we are doing. Pipe width and base charset are
not directly related.

>Elsewhere (3.4.3), UTF-8 is recommended as the encoding and use of escape
>mechanisms is warned against ("...must be weighed very carefully").

This warns against techniques such as SGML &#nnn;. %HH is not on
the character level, it is on the octet level. And it is already
well established for URLs.
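
To make the octet-level point concrete, here is a minimal sketch in Python. It uses the standard urllib.parse module; the helper name percent_encode_utf8 and the sample character are only illustrative assumptions, not part of any URN proposal. It shows that %HH escapes the UTF-8 octets one by one, so a 3-octet CJK character becomes 9 ASCII characters.

```python
from urllib.parse import quote, unquote_to_bytes

def percent_encode_utf8(text: str) -> str:
    """Escape every UTF-8 octet of `text` that is not an unreserved ASCII character.

    Hypothetical helper for illustration only.
    """
    return quote(text, safe="", encoding="utf-8")

sample = "\u65e5"                      # one CJK character (U+65E5)
octets = sample.encode("utf-8")        # three octets: E6 97 A5
escaped = percent_encode_utf8(sample)  # one %HH triple per octet

print(octets.hex(), escaped, len(octets), len(escaped))

# The escaping is reversible at the octet level, without knowing the charset.
assert unquote_to_bytes(escaped) == octets
```

The round trip back to octets needs no character set information, which is what distinguishes %HH from character-level escapes such as SGML &#nnn;.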

>>   We can define the standard as %-encoded UTF-8, and if people implement
>>this other ways, they are implementing convenience features in the
>>interface: the software will always have the %-encoded URN available.

Much software will probably do so anyway, despite what the standard
says, and without creating a conflict, because storing and comparing
is more efficient on the 8-bit form.

>As if 8-bit octets on-the-wire were something evil!  It is much wiser, IMHO,
>to have the real UTF-8 as the reference value, and have the %-encoding as
>the convenience feature (it must be there anyway for reserved and unsafe
>characters, so there is no risk that an application will not support it).
>If a user needs ASCII-only, let *his* software do the %-encoding for him,
>but let's not force a 9-byte encoding on CJK characters when 3 are enough.
>
>There should be a good reason to burden the whole world forever with
>%-encoding of all 8-bit octets, and I see none at all, except for a visceral
>and unwarranted fear of 8-bit octets.

I think we have to be careful, because there are at least two
ways in which URNs can be transferred/stored:

- In "dedicated" protocols and databases. An example is the header
	of an HTTP request.

- In text. An example is HTML.

For the former, raw 8-bit (i.e. UTF-8) can be used. According to
the standards, officially HTTP headers are limited to ASCII, but
in practice, they will pass 8 bits without problems. (If not,
please don't make a long discussion out of this. It only serves
as an example of (a part of) a protocol that up to now has
transmitted "raw" data without consideration of character set
issues.)

For the latter, as Francois probably knows even better than I do
from his work on URL internationalization, putting a URN with
10646 characters into an HTML document written in iso-8859-1
in raw 8-bit form will produce bad results. Without extremely
clever tool support, it will neither be possible to input
such a URN, nor will the URN show up with the characters it
represents. Transcoding, as well as other operations such as
cut-and-paste, will also not do what everybody would hope for.

Just saying "use 8 bits, use 8 bits" could however give the
impression to some implementors that the UTF-8 8-bit octets
should appear as such in an HTML document in iso-8859-1.
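
As a small illustration of what goes wrong, the following Python sketch simulates a tool that reads raw UTF-8 octets under the assumption that the document is iso-8859-1. The function name misread_as_latin1 and the sample string are assumptions made up for this example.

```python
from urllib.parse import quote

def misread_as_latin1(text: str) -> str:
    """Simulate raw UTF-8 octets being interpreted as iso-8859-1 text.

    Hypothetical helper for illustration only.
    """
    return text.encode("utf-8").decode("iso-8859-1")

original = "caf\u00e9"                 # ends in e-acute (U+00E9)
garbled = misread_as_latin1(original)  # the two UTF-8 octets show up as two characters

# The fully %HH-encoded form is pure ASCII, so the same misreading leaves it intact.
escaped = quote(original, safe="")
survives = escaped.encode("ascii").decode("iso-8859-1")

print(garbled, escaped, survives == escaped)
```

The raw form silently changes into different characters, while the %HH form is unaffected because it never leaves the ASCII range.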

Whatever we make the "standard" or "base" form, or whether
we have such a form or not, we should therefore clearly say that
URNs

- Can be transmitted/stored in 8-bit form in protocols/databases
	that accommodate URNs as such, and not as part of text
	and/or associated with character encoding information.
- Have to be interpreted and treated as characters when transmitted
	as part of an encoded text with (explicitly or implicitly)
	associated character encoding information. Those characters
	that cannot be represented in the chosen encoding, as
	well as %HH sequences that do not form valid UTF-8 sequences
	(and of course reserved characters) have to stay in %HH form.
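
The second rule can be sketched roughly as follows in Python. This is only one possible reading, under several assumptions of mine: the function name render_for_charset is made up, every escaped ASCII character is conservatively treated as reserved and left escaped, and a run of %HH that is not valid UTF-8 as a whole is left untouched rather than partially decoded.

```python
import re

def render_for_charset(urn: str, charset: str) -> str:
    """Show %HH-encoded UTF-8 as characters where `charset` can represent them.

    Hypothetical helper; a sketch under the assumptions stated above.
    """
    def convert_run(match: re.Match) -> str:
        run = match.group(0)
        octets = bytes.fromhex(run.replace("%", ""))
        try:
            text = octets.decode("utf-8")
        except UnicodeDecodeError:
            return run  # not valid UTF-8: the run stays in %HH form
        out = []
        for ch in text:
            try:
                ch.encode(charset)
                representable = True
            except UnicodeEncodeError:
                representable = False
            # Only printable non-ASCII characters are shown as characters;
            # escaped ASCII is assumed to have been escaped for a reason.
            if representable and ord(ch) >= 0x80 and ch.isprintable():
                out.append(ch)
            else:
                out.append("".join(f"%{b:02X}" for b in ch.encode("utf-8")))
        return "".join(out)

    # Process each maximal run of consecutive %HH triples.
    return re.sub(r"(?:%[0-9A-Fa-f]{2})+", convert_run, urn)

print(render_for_charset("urn:x-test:caf%C3%A9%E6%97%A5", "iso-8859-1"))
```

In an iso-8859-1 context the e-acute appears as a character while the CJK character stays as %E6%97%A5; in a context whose encoding covers all of 10646, both would appear as characters.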

I know that the last point may again frighten some of you. It seems
to introduce a new representation. But if you think about URLs in
EBCDIC, you will see that it is nothing new.

Personally, I think that the second paragraph above could be amended
with a sentence saying that, to avoid possible misinterpretations due
to a lack of appropriate information about character encoding, and to
make the URN transcribable by the widest audience, full %HH encoding
can/should be chosen. We may have to discuss how strong
this wording should be. But we definitely have to include something
that avoids misunderstandings, so that raw 8-bit UTF-8 will never
turn up as such in e.g. iso-8859-1 documents.


Regards,	Martin.