Re: [URN] Re: I18N does not belong in URNs

Francois Yergeau <yergeau@alis.com> Fri, 15 November 1996 15:55 UTC

Message-Id: <2.2.32.19961115155024.007169c0@genstar.alis.ca>
X-Sender: yergeau@genstar.alis.ca
X-Mailer: Windows Eudora Pro Version 2.2 (32)
Mime-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
Date: Fri, 15 Nov 1996 10:50:24 -0500
To: dgd@cs.bu.edu
From: Francois Yergeau <yergeau@alis.com>
Subject: Re: [URN] Re: I18N does not belong in URNs
Cc: urn-ietf@bunyip.com
Content-Transfer-Encoding: quoted-printable
Sender: owner-urn-ietf@services.bunyip.com
Precedence: bulk
Reply-To: Francois Yergeau <yergeau@alis.com>
Errors-To: owner-urn-ietf@bunyip.com

At 12:13 14-11-96 -0500, David G. Durand wrote:
>On the other hand, if the reference
>string is the %-encoded UTF-8 value, then we should be OK for
>transcribability. The issue of user-friendly software that hides %-encoding
>is not part of the protocol, so its _possibility_ shouldn't unduly
>influence us.
>   We can define the standard as %-encoded UTF-8, and if people implement
>this other ways, they are implementing convenience features in the
>interface: the software will always have the %-encoded URN available.

I fail to see why the %-encoded URN should be the reference.  This is a
fallback to the bad old 7-bit days, and results in a needless waste of
bandwidth and storage resources. Reading the recent report of the IAB
Character Set Workshop (draft-weider-iab-char-wrkshop-00.txt), I find in
section 8.2 (Recommendations for new Internet protocols):

  "New protocols do not suffer from the need to be compatible
   with old 7-bit pipes. New protocol specifications SHOULD 
   use ISO 10646 as the base charset unless there is an 
   overriding need to use a different base charset."

Elsewhere (section 3.4.3), the report recommends UTF-8 as the encoding and
warns against the use of escape mechanisms ("...must be weighed very
carefully").

>   We can define the standard as %-encoded UTF-8, and if people implement
>this other ways, they are implementing convenience features in the
>interface: the software will always have the %-encoded URN available.

As if 8-bit octets on-the-wire were something evil!  It is much wiser, IMHO,
to have the real UTF-8 as the reference value, and have the %-encoding as
the convenience feature (it must be there anyway for reserved and unsafe
characters, so there is no risk that an application will not support it).
If a user needs ASCII-only, let *his* software do the %-encoding for him,
but let's not force a 9-byte encoding on each CJK character when 3 bytes
are enough.
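
To make the 9-versus-3 figure concrete, here is a minimal sketch in Python.
It assumes the standard library's urllib.parse.quote as a stand-in for the
%-encoding under discussion; the helper name encoded_sizes and the sample
character U+65E5 are illustrative choices, not anything taken from a URN
draft.

```python
from urllib.parse import quote, unquote


def encoded_sizes(text):
    """Return (raw UTF-8 length, %-encoded length) for text.

    Hypothetical helper: every octet is escaped (safe=""), which is the
    worst case and the one that applies to all non-ASCII octets.
    """
    raw = text.encode("utf-8")      # the "real UTF-8" octets
    escaped = quote(raw, safe="")   # each escaped octet becomes %XX
    return len(raw), len(escaped)


sample = "\u65e5"  # a single CJK character
raw_len, escaped_len = encoded_sizes(sample)
print(quote(sample, safe=""), raw_len, escaped_len)

# The %-encoded form still round-trips to the same character, so software
# that needs ASCII-only can convert in either direction.
assert unquote(quote(sample, safe="")) == sample
```

Any octet with the high bit set triples in size under this escaping, which
is where the factor of three for CJK text comes from.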

There should be a good reason to burden the whole world forever with
%-encoding of all 8-bit octets, and I see none at all, except for a visceral
and unwarranted fear of 8-bit octets.

-- 
François Yergeau <yergeau@alis.com>
Alis Technologies Inc., Montréal
Tel: +1 (514) 747-2547
Fax: +1 (514) 747-2561