Re: Using UTF-8 for non-ASCII Characters in URLs

Edward Cherlin <cherlin@newbie.net> Fri, 02 May 1997 18:02 UTC

X-Sender: cherlin@snowcrest.net
Message-Id: <v0300783faf8f314b10e6@[206.245.192.60]>
In-Reply-To: <Pine.SUN.3.96.970501211303.245P-100000@enoshima>
References: <199705010017.RAA27111@mailsun3-fddi.us.oracle.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Date: Thu, 01 May 1997 23:32:37 -0700
To: uri@bunyip.com
From: Edward Cherlin <cherlin@newbie.net>
Subject: Re: Using UTF-8 for non-ASCII Characters in URLs
Sender: owner-uri@bunyip.com
Precedence: bulk

"Martin J. Duerst" <mduerst@ifi.unizh.ch> wrote:

[snip]
>
>Internet Draft                                               M. Duerst
><draft-duerst-i18n-norm-00?.txt>                  University of Zurich
>Expires in six months                                         May 1997
>
>
[snip]
>
>1.? Notation
>
>
>   Codepoints from the UCS are denoted as U+XXXX, where XXXX is their
>   hexadecimal representation, according to [Unicode, p.???].

The Unicode Standard, Version 2.0, p. 1-5.
>
>   Stretches of characters?

"A range of Unicode values is expressed as U+xxxx-->U+yyyy or
U+xxxx--U+yyyy..." p. 1-5.
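To make the two notations concrete, here is a minimal Python sketch that formats a codepoint as U+XXXX and a range as U+xxxx-->U+yyyy. The helper names u_plus and u_range are mine, chosen for illustration; they come from neither the draft nor the Standard.

```python
def u_plus(ch: str) -> str:
    """Format one character as U+XXXX (at least four uppercase hex digits)."""
    return f"U+{ord(ch):04X}"


def u_range(first: str, last: str) -> str:
    """Format a range in the U+xxxx-->U+yyyy style quoted above."""
    return f"{u_plus(first)}-->{u_plus(last)}"


print(u_plus("A"))                  # U+0041
print(u_plus("\u00fc"))             # U+00FC
print(u_range("\u0391", "\u03a9"))  # U+0391-->U+03A9
```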

>   Official character names and components all
>   upper case.

"...uppercase Latin letters A through Z, space, and hyphen-minus;..." p. 1-5
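As a rough check of that convention, the Python sketch below looks up a character name and tests it against only the characters named in the fragment I quoted. The quotation is elided, so this is not the full rule: names in the character database can also contain digits, which this test deliberately rejects. The function name uses_quoted_repertoire is hypothetical.

```python
import re
import unicodedata

# Only the characters named in the quoted fragment: A-Z, space, hyphen-minus.
QUOTED_NAME_CHARS = re.compile(r"[A-Z \-]+")


def uses_quoted_repertoire(name: str) -> bool:
    """True if the name consists solely of A-Z, space, and hyphen-minus."""
    return QUOTED_NAME_CHARS.fullmatch(name) is not None


name = unicodedata.name("\u00fc")
print(name)                                    # LATIN SMALL LETTER U WITH DIAERESIS
print(uses_quoted_repertoire(name))            # True
print(uses_quoted_repertoire("hyphen-minus"))  # False
```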
>
>
>
>
>                          Expires in six months         [Page 3]
>
>Internet Draft   Normalization of Internationalized Identifiers   May 1997
>
>
>2. Categories of Ambiguity and Problems
>
>
>   Comparing two sequences of codepoints from the UCS, various degrees
>   of ambiguity can arise:
>
>   Category A: The two sequences are expected to be rendered exactly the
>   same, considered identical by the user, and cannot be disambiguated
>   by context.
>
>   Category B: The two sequences are "semantically" different but diffi-
>   cult or impossible to distinguish in rendering.
>
>   Category C: ?????
>
>   ????
>
>   There are also a number of codepoints in the UCS that should not be
>   used for various reasons, mainly that they are not available on usual
>   keyboards. These go into Category X.

That could be taken to apply to math and APL characters, which would be
unfortunate. There are strong reasons for allowing math and APL expressions
in identifiers for math and APL pages. I published a book, "The
Encyclopedia of APL", which was indexed in APL as well as by the English
names of APL symbols, functions, and operators. It would have made a useful
Web site.
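For illustration, here is how a single APL symbol would travel in a URL, assuming the scheme under discussion in this thread: encode the character as UTF-8, then escape each byte as %HH. This is a Python sketch only; escape_path_segment is a name I made up, and the choice of APL iota (U+2373) is just an example.

```python
from urllib.parse import quote, unquote


def escape_path_segment(segment: str) -> str:
    """Encode as UTF-8, then %HH-escape every byte outside the unreserved set."""
    return quote(segment, safe="", encoding="utf-8")


apl_iota = "\u2373"  # APL FUNCTIONAL SYMBOL IOTA
escaped = escape_path_segment(apl_iota)

print(escaped)                                         # %E2%8D%B3
print(unquote(escaped, encoding="utf-8") == apl_iota)  # True
```

The escaped form is pure ASCII, so it survives transports that cannot carry the symbol itself, and it decodes back to the original character.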

Any codepoint can be entered from a standard keyboard. Keyboards and other
entry methods for almost all Unicode characters are already implemented in
some software, and every character can be placed in a keyboard layout of
standard form. We must expect keyboard layout utilities to appear in future
multilingual software, so availability on today's usual keyboards is a poor
test. I think we need some other distinction. When we start listing the
Category X characters, we can discuss their characteristics more
meaningfully.
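Going back to Category A in the quoted draft: I assume a precomposed letter and its base-plus-combining-mark spelling is the kind of pair meant there. The Python sketch below shows two such spellings comparing unequal as codepoint sequences and equal after composition. It uses the NFC form from Python's unicodedata module as a stand-in; that is not the normalization the draft defines.

```python
import unicodedata

precomposed = "D\u00fcrst"  # u-umlaut as the single codepoint U+00FC
decomposed = "Du\u0308rst"  # "u" followed by U+0308 COMBINING DIAERESIS

# The raw codepoint sequences differ, although both render as the same name.
print(precomposed == decomposed)  # False

# After canonical composition (NFC) the two spellings compare equal.
composed = unicodedata.normalize("NFC", decomposed)
print(composed == precomposed)    # True
```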

[snip]
>Bibliography
>
>   [HTML]         T. Berners-Lee and D. Connolly, "Hypertext Markup Lan-
>                  guage - 2.0" (RFC1866), MIT/W3C, November 1995.
>
>   [Unicode2]     Unicode????, Version 2, Addisson-Wesley, Reading, MA,
>                  1996.

The Unicode Standard, Version 2.0, Addison-Wesley...

>   [HTML-I18N]    F. Yergeau, G. Nicol, G. Adams, and M. Duerst, "Inter-
>                  nationalization of the Hypertext Markup Language",
>                  Work in progress (draft-ietf-html-i18n-05.txt), August
>                  1996.
>
>
>Author's Address
>
>   Martin J. Duerst
>   Multimedia-Laboratory
>   Department of Computer Science
>   University of Zurich
>   Winterthurerstrasse 190
>   CH-8057 Zurich
>   Switzerland
>
>   Tel: +41 1 257 43 16
>   Fax: +41 1 363 00 35
>   E-mail: mduerst@ifi.unizh.ch
>
>
>     NOTE -- Please write the author's name with u-Umlaut wherever
>     possible, e.g. in HTML as D&uuml;rst.
>


--
Edward Cherlin     cherlin@newbie.net     Everything should be made
Vice President     Ask. Someone knows.       as simple as possible,
NewbieNet, Inc.                                 __but no simpler__.
http://www.newbie.net/                Attributed to Albert Einstein