Re: UTF-8 URL for testing
Francois Yergeau <yergeau@alis.com> Mon, 14 April 1997 04:19 UTC
Received: from cnri by ietf.org id aa20039; 14 Apr 97 0:19 EDT
Received: from services.Bunyip.Com by CNRI.Reston.VA.US id aa01536; 14 Apr 97 0:19 EDT
Received: (from daemon@localhost) by services.bunyip.com (8.8.5/8.8.5) id XAA12786 for uri-out; Sun, 13 Apr 1997 23:56:18 -0400 (EDT)
Received: from mocha.bunyip.com (mocha.Bunyip.Com [192.197.208.1]) by services.bunyip.com (8.8.5/8.8.5) with SMTP id XAA12781 for <uri@services.bunyip.com>; Sun, 13 Apr 1997 23:56:15 -0400 (EDT)
Received: from ns.alis.com by mocha.bunyip.com with SMTP (5.65a/IDA-1.4.2b/CC-Guru-2b) id AA00638 (mail destined for uri@services.bunyip.com); Sun, 13 Apr 97 23:56:13 -0400
Received: from fyergeau.alis.com ([207.81.28.104]) by genstar.alis.ca (8.7.5/8.7.3) with SMTP id XAA13010; Sun, 13 Apr 1997 23:55:30 -0400 (EDT)
Message-Id: <3.0.1.32.19970413225937.007819fc@genstar.alis.com>
X-Sender: yergeau@genstar.alis.com
X-Mailer: Windows Eudora Pro Version 3.0.1 (32)
Date: Sun, 13 Apr 1997 22:59:37 -0400
To: John C Klensin <klensin@mci.net>
From: Francois Yergeau <yergeau@alis.com>
Subject: Re: UTF-8 URL for testing
Cc: uri@bunyip.com
In-Reply-To: <SIMEON.9704121139.H@tp7.Jck.com>
References: <334EC975.586@parc.xerox.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
X-MIME-Autoconverted: from quoted-printable to 8bit by services.bunyip.com id XAA12782
Sender: owner-uri@bunyip.com
Precedence: bulk
Content-Transfer-Encoding: quoted-printable
X-MIME-Autoconverted: from 8bit to quoted-printable by services.bunyip.com id XAA12786
At 11:41 12-04-97 -0400, John C Klensin wrote:
>While I'm very anxious to see a real solution that
>addresses the underlying issues here, I'm forced to agree
>with Larry. We don't "make" things happen by standardize
>untested ideas and arguments, however logical, that
>things are easy to do don't move the discussion forward
>much.

Yet this is exactly how HTTP/1.1 was made to happen. Untested things were discussed and put into drafts. Some testing took place along the way, but at some point the spec was declared a Proposed Standard, before there was a single full implementation that embodied what you want here:

> ... a demonstration that it works
>well, that it won't cause significant problems with
>existing (unmodified) clients, servers, or users, etc.

By contrast, what we have now is a refusal to even take the first step: to put things into the draft so that the issue can be addressed.

> I don't think that timing of standards are much of
>the issue here.

Indeed, it doesn't matter much whether URL syntax becomes Draft Standard now or 6 months later. But it does matter that an unsound spec doesn't make it to DS.

URLs are written on paper (characters) and transmitted over the wire (bytes). Thus an unambiguous mapping between characters and bytes is *required*. This mapping currently exists for only a tiny fraction of possible characters, namely ASCII. Since Web forms are submitted using URLs, and can contain almost any text, it is neither desirable nor possible to restrict the repertoire of characters.

The current spec does not recognize this and pretends that (section 2):

   "All URLs consist of a restricted set of characters, primarily chosen
   to aid transcribability and usability both in computer systems and in
   non-computer communications."

In other words, it places a purported transcribability requirement ahead of the simple fact that current practice uses other characters all the time.

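The missing character-to-byte mapping can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the sample word and the two charsets (ISO-8859-1 and UTF-8) are chosen for the example and are not taken from the URL draft. It uses the standard urllib.parse functions quote and unquote_to_bytes.

```python
from urllib.parse import quote, unquote_to_bytes

# Illustrative sample; any text containing a non-ASCII character would do.
word = "Montr\u00e9al"  # "Montréal", with a precomposed e-acute (U+00E9)

# The same characters yield different %-escaped URLs depending on which
# charset the sender happened to use before escaping.
latin1_url = quote(word, encoding="iso-8859-1")
utf8_url = quote(word, encoding="utf-8")
print(latin1_url)  # Montr%E9al
print(utf8_url)    # Montr%C3%A9al

# The receiver only sees bytes. Without an agreed mapping it has to guess
# the charset, and a wrong guess garbles the displayed text.
raw = unquote_to_bytes(latin1_url)
right_guess = raw.decode("iso-8859-1")
wrong_guess = raw.decode("utf-8", errors="replace")
print(right_guess)
print(wrong_guess)
```

Both escaped forms are plain ASCII on the wire, yet nothing in either URL says which charset produced it, so the receiver cannot reliably turn the bytes back into characters.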
Oh, of course, these non-ASCII characters are escaped to ASCII using %-encoding, but there is still no defined mapping from characters to bytes. And there is no defined mapping from bytes to characters for half the possible byte values, precluding any sensible display of URLs representing non-ASCII characters.

In short, the current spec is technically unsound and broken, and needs fixing not to extend it to new capabilities, but to bring it in line with widespread current practice.

This discussion has been going on for months in various circles, lists and conferences, with no resolution. The reason, it seems to me, is the continued failure to fully recognize that mapping only ASCII characters is not a solution. While it may be acceptable to restrict bytes over the wire to 7 bits (but why?), it is not acceptable to limit the character repertoire to a subset of ASCII. URLs are widely put to uses where there is no such limit.

>And, as I have said many times before, while I recognize
>and accept the enthusiasm for UTF-8, especially among users
>of languages with Latin-based alphabetic systems, I would
>prefer that, when we make protocol decisions that are
>expected to have very long lifetimes, we use systems that
>don't penalize non-Roman language groups as severely as
>UTF-8 tends to do.

This has also been discussed at length. The trade-off is compatibility with all of current practice (ASCII-based) vs this undeniable byte-count penalty for non-Latin scripts. For short strings such as URLs, I'm afraid the technical choice is clear.

--
François Yergeau <yergeau@alis.com>
Alis Technologies Inc., Montréal
Tel: +1 (514) 747-2547
Fax: +1 (514) 747-2561
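The trade-off described in the message above (ASCII compatibility versus a byte-count penalty for non-Latin scripts) can be made concrete with a small Python sketch. The sample strings and the helper name utf8_cost are hypothetical, chosen only for illustration.

```python
def utf8_cost(text: str) -> tuple[int, int]:
    """Return (number of characters, number of UTF-8 bytes) for text."""
    return len(text), len(text.encode("utf-8"))

# Hypothetical samples: the word "test" in three scripts.
samples = {
    "ASCII": "test",
    "Cyrillic": "\u0442\u0435\u0441\u0442",  # "тест"
    "Japanese": "\u30c6\u30b9\u30c8",        # "テスト"
}

for label, text in samples.items():
    chars, nbytes = utf8_cost(text)
    print(f"{label}: {chars} characters, {nbytes} UTF-8 bytes")

# ASCII text is byte-for-byte identical in UTF-8, which is the
# compatibility side of the trade-off.
ascii_unchanged = "test".encode("utf-8") == "test".encode("ascii")
print(ascii_unchanged)
```

In this sample the ASCII string costs one byte per character, the Cyrillic one two, and the Japanese one three. For strings as short as URLs the absolute overhead stays small.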