Re: Using UTF-8 for non-ASCII Characters in URLs

"Martin J. Duerst" <mduerst@ifi.unizh.ch> Thu, 01 May 1997 20:01 UTC

Received: from cnri by ietf.org id aa27930; 1 May 97 16:01 EDT
Received: from services.Bunyip.Com by CNRI.Reston.VA.US id aa12459; 1 May 97 16:01 EDT
Received: (from daemon@localhost) by services.bunyip.com (8.8.5/8.8.5) id PAA06390 for uri-out; Thu, 1 May 1997 15:25:26 -0400 (EDT)
Received: from mocha.bunyip.com (mocha.Bunyip.Com [192.197.208.1]) by services.bunyip.com (8.8.5/8.8.5) with ESMTP id PAA06383 for <uri@services.bunyip.com>; Thu, 1 May 1997 15:25:20 -0400 (EDT)
Received: from josef.ifi.unizh.ch (josef.ifi.unizh.ch [130.60.48.10]) by mocha.bunyip.com (8.8.5/8.8.5) with SMTP id PAA13086 for <uri@bunyip.com>; Thu, 1 May 1997 15:25:12 -0400 (EDT)
Received: from enoshima.ifi.unizh.ch by josef.ifi.unizh.ch with SMTP (PP) id <26985-0@josef.ifi.unizh.ch>; Thu, 1 May 1997 21:24:53 +0200
Date: Thu, 01 May 1997 21:24:52 +0200
From: "Martin J. Duerst" <mduerst@ifi.unizh.ch>
To: "Michael Kung <MKUNG.US.ORACLE.COM>" <MKUNG@us.oracle.com>
cc: URI mailing list <uri@bunyip.com>
Subject: Re: Using UTF-8 for non-ASCII Characters in URLs
In-Reply-To: <199705010017.RAA27111@mailsun3-fddi.us.oracle.com>
Message-ID: <Pine.SUN.3.96.970501211303.245P-100000@enoshima>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset="US-ASCII"
Sender: owner-uri@bunyip.com
Precedence: bulk

On 30 Apr 1997, Michael Kung <MKUNG.US.ORACLE.COM> wrote:

> Agree on the 'key words'.  But this rule also implies that I cannot put any
> double byte English Alphabet in my company name (or I have to change my
> company name for URL).

Well, I think a company would not be well-advised (and would probably
even be challenged in court) if it chose a name identical to another
one except that it uses full-width instead of half-width characters.

Also, it should probably be in the company's own interest to use
a URL that doesn't cause problems for its users. Most users, especially
new ones, are not very familiar with the artificial separation
of character codes brought about by some confusion between characters
and glyphs on current computers. Most companies with
English-letter names will already have their ASCII URL long
before full-width characters work.

That said, I know, like many others on this list, that this is only
one case of a whole can of worms that we have to address in some
way or another. As promised, I have therefore started to write
a draft, which is currently at about the same stage as Larry's
recent draft, so that I can bother you with it below. Any
comments are welcome. You don't need to have a solution for
a problem; just helping me list the problems is extremely
valuable.

Regards,	Martin.








Internet Draft                                               M. Duerst
<draft-duerst-i18n-norm-00?.txt>                  University of Zurich
Expires in six months                                         May 1997


             Normalization of Internationalized Identifiers


Status of this Memo

   This document is an Internet-Draft.  Internet-Drafts are working doc-
   uments of the Internet Engineering Task Force (IETF), its areas, and
   its working groups. Note that other groups may also distribute work-
   ing documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months. Internet-Drafts may be updated, replaced, or obsoleted by
   other documents at any time.  It is not appropriate to use Internet-
   Drafts as reference material or to cite them other than as a "working
   draft" or "work in progress".

   To learn the current status of any Internet-Draft, please check the
   1id-abstracts.txt listing contained in the Internet-Drafts Shadow
   Directories on ds.internic.net (US East Coast), nic.nordu.net
   (Europe), ftp.isi.edu (US West Coast), or munnari.oz.au (Pacific
   Rim).

   Distribution of this document is unlimited.  Please send comments to
   the author at <mduerst@ifi.unizh.ch> or to the uri mailing list at
   uri@bunyip.com. This document is currently a pre-draft, for
   restricted discussion only. It is intended to become part of a suite
   of documents related to the internationalization of URLs.


Abstract

   The Universal Character Set (UCS) makes it possible to extend the
   repertoire of characters used in non-local identifiers beyond US-
   ASCII. The UCS contains a large overall number of characters, many
   codepoints for backwards compatibility, and various mechanisms to
   cope with the features of the writing systems of the world.  These
   features can lead to ambiguities in representation.  Such ambiguities
   are not a problem when representing running text, and therefore
   existing standards have only defined equivalences.  For use in
   identifiers, which are compared using their binary representation by
   most software, this is not sufficient.  This document defines a nor-
   malization algorithm and gives usage guidelines to avoid such ambigu-
   ities.



                          Expires in six months         [Page 1]

Internet Draft   Normalization of Internationalized Identifiers   May 1997


Table of contents

   1. Introduction ................................................... ?
     1.1 General ......................................................?
   To be completed
   Bibliography .......................................................?
   Author's Address ...................................................?



1. Introduction



1.1 General


   For the identification of resources in networks, many kinds of iden-
   tifiers are in use. Locally, identifiers can contain characters from
   all kinds of languages and scripts, but because these characters
   were encoded differently, network identifiers had to be limited to a
   very restricted character repertoire, usually a subset of US-ASCII
   [US-ASCII].

   With the definition of the Universal Character Set (UCS) [ISO 10646]
   [Unicode2], it becomes possible to extend the character repertoire of
   such identifiers. In some cases, this has already been done
   [Java][URN-Syntax]; other cases are under study.  While identifiers
   for resources of full worldwide interest should continue to be lim-
   ited to a very restricted set of widely known characters, names for
   resources mainly used in a language-local or script-local context may
   provide significant additional user convenience if they can make use
   of a wider character repertoire [iURL rationale].

   The UCS contains a large overall number of characters, many code-
   points for backwards compatibility, and various mechanisms to cope
   with the features of the writing systems of the world. These all lead
   to ambiguities that in some cases can be resolved by careful display,
   printing, and examination by the reader, but which in other cases are
   intended to be unnoticeable by the reader.

   Such ambiguities can be dealt with in systems handling running text
   by using various kinds of equivalences and normalizations, which may
   differ by implementation.  However, software processing identifiers
   usually compares their binary representation to establish that two
   identifiers are identical. In some cases, some additional processing
   is also done to account for the specifics of identifier syntax vari-
   ation. Upgrading all such software to take into account



                          Expires in six months         [Page 2]

Internet Draft   Normalization of Internationalized Identifiers   May 1997


   the equivalences and ambiguities in the UCS would be extremely
   tedious. For some classes of identifiers, it would be impossible
   because their binary representation is transparent in the sense that
   it may allow legacy character encodings besides a character encoding
   based on UCS to be used and/or it may allow for arbitrary binary data
   to be contained in identifiers.

   In order to facilitate widespread use of identifiers containing char-
   acters from UCS, this document therefore develops clear specifica-
   tions for a normalization algorithm removing basic ambiguities and
   guidelines for the use of characters with potential ambiguity.


1.? Guidelines


   The specifications and guidelines in this document have been devel-
   oped with the following goals in mind:

   -  Avoid bad surprises for users who cannot understand why two
      identifiers that look exactly the same don't match.  The user in
      this case is an average user without any specific knowledge of
      character encoding, but with a basic dose of "computer literacy"
      (e.g. knowing that 0 and O have distinct keys on a keyboard).

   -  Restrict normalization to cases where it is really necessary;
      cover remaining ambiguities by guidelines.

   -  Define normalization so that it can be implemented using widely
      accessible documentation.

   -  Define normalization so that most identifiers currently existing
      locally are not affected.

   -  Take measures for best possible compatibility with future addi-
      tions to the UCS.



1.? Notation


   Codepoints from the UCS are denoted as U+XXXX, where XXXX is their
   hexadecimal representation, according to [Unicode, p.???].

   Stretches of characters? Official character names and components all
   upper case.




                          Expires in six months         [Page 3]

Internet Draft   Normalization of Internationalized Identifiers   May 1997


2. Categories of Ambiguity and Problems


   Comparing two sequences of codepoints from the UCS, various degrees
   of ambiguity can arise:

   Category A: The two sequences are expected to be rendered exactly the
   same, considered identical by the user, and cannot be disambiguated
   by context.

   Category B: The two sequences are "semantically" different but diffi-
   cult or impossible to distinguish in rendering.

   Category C: ?????

   ????

   There are also a number of codepoints in the UCS that should not be
   used for various reasons, mainly that they are not available on usual
   keyboards. These go into Category X.


?. Normalization of Combining Sequences


   One of the main reasons for Category A ambiguities is the fact that
   the UCS contains a general mechanism for encoding diacritic combina-
   tions from base letters and modifying diacritics, but that many com-
   binations also exist as precomposed codepoints.

   The following algorithm normalizes such combinations:

   Step 1: Starting from the beginning of the identifier, find a maximal
   sequence of a base character (possibly decomposable) followed by mod-
   ifying letters.

   Step 2: Fully decompose the sequence found in step 1, using all
   canonical decompositions defined in [Unicode2] and all canonical
   decompositions defined for future additions to the UCS.

   Step 3: Sort the sequence of modifying letters found in Step 2
   according to the canonical ordering algorithm of Section 3.9 of [Uni-
   code2].

   Step 4: Try to recombine as much as possible of the sequence result-
   ing from Step 3 into a precomposed character by finding the longest
   initial match with any canonical decomposition sequence defined in
   [Unicode2], ignoring decomposition sequences of length 1.



                          Expires in six months         [Page 4]

Internet Draft   Normalization of Internationalized Identifiers   May 1997


   Step 5: Use the result of Step 4 as output and continue with Step 1.

   Note: In Step 4, the decomposition sequences in [Unicode2] have to be
   recursively expanded for each character (except for decomposition
   sequences of length 1) before application. Otherwise, a character
   such as U+1E1C, LATIN CAPITAL LETTER E WITH CEDILLA AND BREVE, will
   not be recomposed correctly.

   Note: In Step 4, canonical decompositions defined for future addi-
   tions to the UCS are explicitly not considered to ease forwards com-
   patibility. It is assumed that systems knowing about newly defined
   precompositions will be able to decompose them correctly in Step 2,
   but that it is hard to change identifiers on older systems using a
   decomposed representation.

   Note: A different definition of Step 4 may lead to shorter normaliza-
   tions for some identifiers. The current definition was chosen for
   simplicity and implementation speed.  (this may be subject to discus-
   sion, in particular if somebody has an implementation and is ready to
   share the code).

   Note: The above algorithm can be sped up by shortcuts, in particular
   by noting that precomposed characters [with the important exception
   of those that have a decomposition sequence of length 1] which are
   not followed by modifying letters, are already normalized.

   Note: A completely different algorithm that results in the same
   observed input-output behaviour is also acceptable.

   Note: The exception for "precomposed letters that have a decomposi-
   tion sequence of length 1" in Step 4 is necessary to avoid e.g. the
   letter "K" being "aggregated" to "KELVIN SIGN" U+212A.


?. Hangul Jamo Normalization


   Hangul Jamo (U+1100-U+11FF) provide ample possibilities for ambiguous
   notations and therefore must be carefully normalized.  The following
   algorithm or its equivalents in terms of input-output behaviour
   should be used:

   Step 1: A sequence of Hangul jamo is split up into syllables according
   to the definition of syllable boundaries on page 3-12 of [Unicode2].
   Each of these syllables is processed according to Steps 2-4.

   Step 2: Fillers are inserted as necessary to form a canonical sylla-
   ble as defined on page 3-12 of [Unicode2].



                          Expires in six months         [Page 5]

Internet Draft   Normalization of Internationalized Identifiers   May 1997


   Step 3: Sequences of choseong, jungseong, and jongseong (leading con-
   sonants, vowels, and trailing consonants) are replaced by a single
   choseong, jungseong, and jongseong respectively according to the com-
   patibility decompositions given in [Unicode2]. If this is not possi-
   ble, the sequence is malformed and the user should be warned.

   Step 4: The sequence is replaced by a Hangul Syllable (U+AC00-U+D7AF)
   if this is possible according to the algorithm given on pp. 3-12/3 of
   [Unicode2].

   Note: We need some way of dealing with compatibility Jamo (U+3130...).
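
   As an illustration outside this draft (again in modern Python; the
   helper function name is made up for this sketch), Step 4's arithmetic
   mapping of a canonical choseong + jungseong (+ jongseong) sequence
   onto a precomposed Hangul Syllable can be written as:

      import unicodedata

      SBASE, LBASE, VBASE, TBASE = 0xAC00, 0x1100, 0x1161, 0x11A7
      VCOUNT, TCOUNT = 21, 28

      def compose_syllable(l, v, t=None):
          # Map one canonical L+V(+T) jamo sequence to a precomposed
          # Hangul Syllable codepoint, as described in Step 4.
          index = ((ord(l) - LBASE) * VCOUNT + (ord(v) - VBASE)) * TCOUNT
          if t is not None:
              index += ord(t) - TBASE
          return chr(SBASE + index)

      # U+1100 CHOSEONG KIYEOK + U+1161 JUNGSEONG A -> U+AC00 (GA)
      jamo = "\u1100\u1161"
      print(compose_syllable("\u1100", "\u1161") == "\uAC00")   # True
      print(unicodedata.normalize("NFC", jamo) == "\uAC00")     # True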


?. Other Cases of Ambiguities


   General considerations about case.

   Similar letters in different alphabets (e.g. Latin/Greek/Cyrillic A):
   The letter from the correct alphabet should be used in context with
   other letters from that alphabet. Mixed-alphabet identifiers have to
   be avoided. In the case of single letters mixed with numbers and
   such, which should be avoided in the first place, it should be
   assumed that such letters are Latin if possible, and Cyrillic other-
   wise.  Lower-case identifiers should be preferred because lower-case
   has fewer such problems.  (Should heuristics based on wider context
   (e.g. domain names) be mentioned?)
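
   As a small illustration outside this draft (modern Python): Latin,
   Greek and Cyrillic capital A are distinct codepoints that typically
   render identically, and neither canonical nor compatibility
   normalization merges them, which is why this case is handled by
   guidelines rather than by the normalization algorithm:

      import unicodedata

      latin_a    = "\u0041"   # LATIN CAPITAL LETTER A
      greek_a    = "\u0391"   # GREEK CAPITAL LETTER ALPHA
      cyrillic_a = "\u0410"   # CYRILLIC CAPITAL LETTER A

      for ch in (latin_a, greek_a, cyrillic_a):
          print(hex(ord(ch)), unicodedata.name(ch))

      # Even compatibility normalization keeps them distinct:
      print(unicodedata.normalize("NFKC", greek_a) == latin_a)   # False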

   Half-width and full-width compatibility characters (U+FF00...): The
   version not in the compatibility section (i.e. half-width for Latin
   and symbols, full-width for Katakana, Hangul, "LIGHT VERTICAL",
   arrows, black square, and white circle) should be used wherever pos-
   sible. Because half-width Latin characters may be needed in certain
   parts of certain identifiers anyway, keyboard settings in places
   where identifiers are input may be set to produce half-width Latin
   characters by default, making the input of full-width characters more
   tedious. Also, while the difference between half-width and full-width
   characters is clearly visible on computers in contexts that use
   fixed-pitch displays, it is not well preserved on paper or in high-
   quality printing.  Identifiers should never differ by a half-
   width/full-width difference only.
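
   The reason normalization alone does not remove this ambiguity can be
   shown as follows (modern Python, outside this draft): full-width
   forms are compatibility characters, so only compatibility (not
   canonical) decomposition maps them back to their half-width/ASCII
   counterparts:

      import unicodedata

      ascii_a     = "A"        # U+0041
      fullwidth_a = "\uFF21"   # U+FF21 FULLWIDTH LATIN CAPITAL LETTER A

      print(ascii_a == fullwidth_a)                                # False
      # Canonical normalization leaves the full-width form alone ...
      print(unicodedata.normalize("NFC", fullwidth_a) == ascii_a)  # False
      # ... only compatibility normalization folds it to ASCII:
      print(unicodedata.normalize("NFKC", fullwidth_a) == ascii_a) # True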

   Vertical variants (U+FE30...): Should not be used, in particular
   because they are variants of characters that are already discouraged
   :-).

   Small form variants (U+FE50...): Strongly discouraged (where do they
   come from?).



                          Expires in six months         [Page 6]

Internet Draft   Normalization of Internationalized Identifiers   May 1997


   Ligatures (Latin and Arabic). Not covered by canonical decomposition.
   Need to write some normalization specs for them!

   Other script-specific stuff.

   Signs and symbols.

   Punctuation.


?. Ideographic Ambiguities

   Compatibility Ideographs: How to handle the Korean case?  How to han-
   dle the other stuff?

   Warning about JIS 75/83 (97!) problems (~20 pairs).

   Warning about backwards-compatibility non-unifications (about 100
   pairs and some triples of differing seriousness; affecting inter-
   typographic-context work but not intra-TC).

   Explanation about general differences due to simplifications.


Acknowledgements

   I am grateful in particular to the following persons for contributing
   ideas, advice, criticism and help: Mark Davis, Larry Masinter, (to be
   completed).




Bibliography

   [HTML]         T. Berners-Lee and D. Connolly, "Hypertext Markup Lan-
                  guage - 2.0" (RFC1866), MIT/W3C, November 1995.

   [Unicode2]     The Unicode Consortium, "The Unicode Standard, Ver-
                  sion 2.0", Addison-Wesley, Reading, MA, 1996.

   [HTML-I18N]    F. Yergeau, G. Nicol, G. Adams, and M. Duerst, "Inter-
                  nationalization of the Hypertext Markup Language",
                  Work in progress (draft-ietf-html-i18n-05.txt), August
                  1996.






                          Expires in six months         [Page 7]

Internet Draft   Normalization of Internationalized Identifiers   May 1997


Author's Address

   Martin J. Duerst
   Multimedia-Laboratory
   Department of Computer Science
   University of Zurich
   Winterthurerstrasse 190
   CH-8057 Zurich
   Switzerland

   Tel: +41 1 257 43 16
   Fax: +41 1 363 00 35
   E-mail: mduerst@ifi.unizh.ch


     NOTE -- Please write the author's name with u-Umlaut wherever
     possible, e.g. in HTML as D&uuml;rst.


































                          Expires in six months         [Page 8]