Re: revised "generic syntax" internet draft

Chris Newman <Chris.Newman@innosoft.com> Tue, 15 April 1997 20:29 UTC

Date: Tue, 15 Apr 1997 13:07:23 -0700
From: Chris Newman <Chris.Newman@innosoft.com>
Subject: Re: revised "generic syntax" internet draft
In-Reply-To: <9704141932.aa24523@paris.ics.uci.edu>
To: IETF URI list <uri@bunyip.com>
Message-Id: <Pine.SOL.3.95.970415124833.22015J-100000@eleanor.innosoft.com>
Mime-Version: 1.0
Content-Type: TEXT/PLAIN; charset="US-ASCII"
Sender: owner-uri@bunyip.com
Precedence: bulk

Here are the approaches to i18n I've seen:

(1) US-ASCII only

(2) ISO-8859-1 only

(3) whatever localized character set is in use

(4) Explicit labelling of character set

(5) Unicode derivative
----
(1) Never works because it doesn't satisfy demand.

(2) Never works and is even worse than (1) because not only does it fail
to satisfy demand, but it uses up the "undefined" codepoints in such a way
that an interoperable solution *can't* be deployed.

(3) Never works, because it doesn't interoperate.  It results in a bunch
of islands which can't communicate, except via US-ASCII.

(4) Works fine, but is very hard to support for ideographic characters.
Dealing with mapping tables between ISO-2022, Unicode and whatever
character set is supported by the display system is very hard.

(5) Works fine, and has potential to be easier to support than (4).
----

The status quo in URLs is a mixture of (1), (2), and (3).  This is
completely unacceptable for an interoperable solution.  We *MUST* move
towards (4) or (5).  Given that I've heard no proposals along the lines of
MIME header encoded words, the only solution on the table is (5).
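
For readers unfamiliar with approach (4), here is a minimal sketch of
explicit charset labelling in the style of MIME header encoded words,
using Python's standard email.header module.  The helper names
label_text and read_label are hypothetical, chosen only for this
illustration; nothing here is a proposal for URL syntax.

```python
from email.header import Header, decode_header


def label_text(text, charset):
    """Encode text as a MIME encoded word that carries its charset label.

    Hypothetical helper for illustration only.
    """
    return Header(text, charset).encode()


def read_label(encoded_word):
    """Return (decoded text, charset label) for a single encoded word."""
    raw_bytes, charset = decode_header(encoded_word)[0]
    return raw_bytes.decode(charset), charset


if __name__ == "__main__":
    for charset in ("iso-8859-1", "utf-8"):
        word = label_text("caf\u00e9", charset)
        # The label travels with the data, so the receiver never guesses.
        print(word, "->", read_label(word))
```

The point of the sketch is that the charset name is part of the encoded
form itself, which is what makes (4) interoperable and also what makes
it costly: the receiver must support every charset a sender may label.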

I will also point out that when a URL contains unencoded 8-bit characters
and is embedded in a properly charset-labelled document, there are no
problems, as the interpretation is clear.  We do need to deal with the
interpretation of %-encoded 8-bit characters.  If we're ambitious, we can
also address the issue of unlabelled unencoded 8-bit characters, but I'd
be tempted to avoid that rathole.
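
To make the %-encoding problem concrete, here is a small sketch using
Python's urllib.parse.  The function name interpret and the sample
strings are assumptions made for illustration.  It shows that the same
%-encoded octets yield different text depending on which charset the
receiver assumes, since the URL itself carries no label.

```python
from urllib.parse import quote, unquote_to_bytes


def interpret(percent_encoded, charset):
    """Decode %-escapes to octets, then read those octets in the given charset.

    Hypothetical helper; undecodable octets become U+FFFD.
    """
    return unquote_to_bytes(percent_encoded).decode(charset, errors="replace")


if __name__ == "__main__":
    latin1_form = quote("caf\u00e9", encoding="iso-8859-1")  # one octet for e-acute
    utf8_form = quote("caf\u00e9", encoding="utf-8")         # two octets for e-acute
    for form in (latin1_form, utf8_form):
        for charset in ("iso-8859-1", "utf-8"):
            # Only the matching charset recovers the original text.
            print(form, "read as", charset, "->", repr(interpret(form, charset)))
```

Note that every octet sequence is decodable as ISO-8859-1, so a
receiver that assumes (2) never sees an error, only wrong text.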

The biggest failure of HTTP/HTML was choosing (2) above when MIME already
had a perfectly functional solution (4).