Supporting national characters in the Directory

Geir Pedersen <geirp@usit.uio.no> Mon, 09 March 1992 01:57 UTC

From: Geir Pedersen <geirp@usit.uio.no>
Date: Mon, 09 Mar 1992 02:34:08 +0100
Message-Id: <9203090134.AAgandalf16611@gandalf.uio.no>
To: osi-ds@cs.ucl.ac.uk
Subject: Supporting national characters in the Directory

Below is a document that discusses how to support national characters in
the X.500 directory. This is a topic that has been discussed several
times within RARE WG3, and we would like to bring this discussion into
a broader international forum. I hope that the enclosed document can
form the basis for a discussion both on the net and at the San Diego
meeting of the IETF OSI-DS group next week. (I will be at the
meeting.)

Geir.

====================

Geir Pedersen
University of Oslo, Norway

eMail: Geir.Pedersen@usit.uio.no

March 8th, 1992


		    Supporting national characters
	      in the Internet and COSINE X.500 Directory
	  ==================================================

				DRAFT


Introduction 
------------ 

We are in the process of building a global X.500 based directory. In
many countries it is seen as essential for the success of the
directory that it is able to recognise, store and present textual
information, like personal and organisational names, represented in
the character sets used by those concerned. This means the directory
must be able to handle national characters not found in the English
alphabet. As we are still in an early phase in establishing the
directory as a widely available network service, now is a good time to
tackle the issues related to supporting usage of national characters
in the directory.

This document discusses the requirements with respect to support for
national characters from the directory as well as ways to meet the
identified requirements. Finally, a recommendation on how to support
national characters in the Internet and COSINE X.500 Directory is
given.

This document is heavily influenced by discussions within RARE WG3 and
the RARE/CPMU Character Set Group.

Some definitions:

 o "National characters" - alphabetic characters unique to character
   sets used only by a few countries or languages (i.e. not found in
   IS 646).

 o "Textual information" - a text that may contain national
   characters.


Character set support in X.500
------------------------------

Most X.500 attributes meant to hold names to be used by human users
are based on the Case Ignore String or Case Exact String X.500
syntaxes.  Both these syntaxes are again based on the T61String and
PrintableString data types as defined in X.208, i.e. an attribute
value for an attribute based on the two mentioned X.500 syntaxes can
be based on either the T61String or PrintableString data types.

The character set defined by PrintableString is a proper subset of the
characters of the T61String data type. It contains these characters:
the letters A-Z and a-z, the digits 0 through 9, space and the
following characters: '()+,-./:=?. The characters of the
PrintableString data type are a subset of ISO 646.  The T61String data
type is based on the T.61 character set. According to X.208 the
T61String data type may hold characters as defined in the registered
character sets 87, 102, 103, 106 and 107 (as registered by ECMA
according to ISO 2375). Of these only the sets 102 and 103 are still
registered. (The definition of these character sets may be found in
Keld Simonsen's Internet Draft: "Character Mnemonics & Character
Sets".)
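
As an illustration, the PrintableString repertoire listed above can be
checked mechanically. The following is only a sketch: the function
names are invented for this example, and it does not verify that a
value which fails the check is actually representable in T61String.

```python
import string

# PrintableString repertoire as listed in the text above:
# A-Z, a-z, 0-9, space and the characters '()+,-./:=?
PRINTABLE_STRING_CHARS = frozenset(
    string.ascii_letters + string.digits + " '()+,-./:=?"
)


def fits_printable_string(text: str) -> bool:
    """Return True if every character is in the PrintableString repertoire."""
    return all(ch in PRINTABLE_STRING_CHARS for ch in text)


def choose_string_type(text: str) -> str:
    """Hypothetical helper: use PrintableString when the value fits,
    otherwise fall back to T61String (T.61 repertoire NOT checked here)."""
    return "PrintableString" if fits_printable_string(text) else "T61String"
```

With this sketch "Universitetet i Oslo" would be classified as a
PrintableString value, while a name such as "Bjørn" would not.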

Even though the directory is capable of storing a large number of
different characters, it is easy to see that it is currently not able
to hold all characters in use by its user community. (As an example:
The Greek character set is not supported.)


What do the users require?
--------------------------

There is a general pressure from users in countries that have national
characters in their alphabet to receive support for these characters
from the computer systems they use. 

It is less clear if there is a desire from users to have access to and
use all national characters, i.e. including national characters not
used in the user's country. It is our feeling that there may in fact
be a reluctance among users to relate to letters that are not part of
the alphabet they use for written communication in their native
language. Conversely there seems to be a tendency, in particular for
organisations that perform a major part of their business in an
international environment, to internationalise their names. Thus it
may not be a requirement in a country to support more than the
national characters actually in use in that country, perhaps including
any additional national characters used in neighboring countries. It
might even be a requirement for the directory to enable the user to
interact with it using an internationalised version of textual
information containing national characters not used in the country of
the user.

If we presume that internationalised versions of names are found on
business cards, on attendance lists, in bibliographies, etc, we see
that it might in fact be a requirement for the directory to support
locating information based on internationalised versions of names. 

To support national characters means that it must be possible to input
text with such characters for storage. Output of information
containing national characters must be done using national characters
or, if desired by the user, in an internationalised form.


Input and output of textual information including national characters
---------------------------------------------------------------------

Traditionally, terminals often employ a character set that is a
variant of ISO 646. Modern terminal equipment (in a wide sense of the
word) is usually based on a newer character set containing a much
larger repertoire, e.g. several parts of ISO 8859. Although these
newer terminals have a very large character repertoire, they will
usually be configured for the needs in one particular country. This
will reflect on how easy it is to enter the various characters through
the keyboard. We currently have a varied situation with respect to how
well terminals support national characters. It is the author's opinion
that it is unlikely that terminals in the foreseeable future will
support all national characters equally well.

Few or no terminals support the coded character set defined by
T61String. This puts a requirement on the user's Directory User Agent
to convert properly between the T61String character set and the
character set being used by the terminal.

Often the problem of lack of support for some characters in a terminal
is handled through usage of some form of transcription of the
unsupported characters into a format that can be communicated through
the terminal. Two transcription methods will be mentioned here:
transcription to unambiguous character mnemonics and ambiguous
transcription to one or more characters that are part of the supported
repertoire.

There is a current Internet Draft by Keld Simonsen titled "Character
Mnemonics & Character Sets".  This ID defines a mechanism for
representing all characters recognised in ISO DIS 10646 using any of a
large number of character sets. Characters not represented directly in
a given character set that is part of this scheme may be represented
using a unique mnemonic. The mnemonics are constructed from the
repertoire of ISO 646, which is a subset of a majority of the coded
character sets in use. This Internet Draft also defines how to convert
text represented in any of these character sets to any other character
set. During conversion, characters not representable in the
destination character set are transcribed to a mnemonic name.
T61String and PrintableString are both covered by the method described
in Keld Simonsen's Internet Draft.
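
The idea of such a conversion can be sketched as follows. This is not
the algorithm of the Internet Draft: the mnemonic table below is a
small illustrative one whose entries should be checked against the
draft, the sketch assumes that every mnemonic is exactly two
characters long and is introduced by an ampersand, and it simply
rejects input that already contains the intro character.

```python
import string

INTRO = "&"  # intro character assumed for this sketch

# Illustrative mnemonic table only; entries are assumptions and must be
# verified against the Internet Draft before any real use.
MNEMONICS = {"ø": "o/", "å": "aa", "æ": "ae", "Ø": "O/", "Å": "AA", "Æ": "AE"}
REVERSE = {mnemonic: ch for ch, mnemonic in MNEMONICS.items()}


def to_mnemonics(text: str, repertoire: frozenset) -> str:
    """Transcribe characters outside `repertoire` to intro + mnemonic."""
    out = []
    for ch in text:
        if ch == INTRO:
            raise ValueError("literal intro character not handled in this sketch")
        if ch in repertoire:
            out.append(ch)
        elif ch in MNEMONICS:
            out.append(INTRO + MNEMONICS[ch])
        else:
            raise ValueError(f"no mnemonic known for {ch!r}")
    return "".join(out)


def from_mnemonics(text: str) -> str:
    """Inverse of to_mnemonics, assuming two-character mnemonics."""
    out, i = [], 0
    while i < len(text):
        if text[i] == INTRO:
            code = text[i + 1:i + 3]
            if code not in REVERSE:
                raise ValueError(f"unknown mnemonic {code!r}")
            out.append(REVERSE[code])
            i += 3
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


# Stand-in for a terminal limited to unaccented letters and space.
BASIC = frozenset(string.ascii_letters + " ")
```

Because each mnemonic maps back to exactly one character, a value
transcribed this way can be converted back without loss.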

According to the Internet Draft the mnemonics have been designed "so
the graphical appearance ...  resembles as much as possible ...  the
graphical appearance of the character." In spite of this the author
tends to believe that users who only occasionally need to enter, as a
mnemonic, a national character they are not familiar with will need
some form of assistance to enter the right mnemonic.

The uniqueness of the mnemonics specified by the method described
above is assured by the use of an intro character for the mnemonics -
by default the ampersand character.

Another transcription method that is much used is to establish a
conversion table for "down-conversion" of each national character to
one or more characters that are part of ISO 646. Such a conversion is
not reversible. Many countries and networks have established such
conversion rules, prompted by restrictions on which characters may be
present in an RFC822 address specification. Below we will refer to
such a down-conversion of national characters into an international
version as "internationalisation". Internationalising national
characters may only be applicable to languages/alphabets based on the
Latin alphabet.
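
A minimal sketch of such a down-conversion is given here. The rules in
the table are hypothetical and chosen only for illustration (real
rules would be those agreed in each country or network), and code
points below 128 are used as a stand-in for the ISO 646 repertoire.

```python
# Hypothetical down-conversion rules, for illustration only.
DOWN_CONVERSION = {
    "æ": "ae", "ø": "o", "å": "aa",
    "Æ": "Ae", "Ø": "O", "Å": "Aa",
}


def internationalise(text: str, rules: dict = DOWN_CONVERSION) -> str:
    """Replace each national character by its down-converted form."""
    out = []
    for ch in text:
        if ord(ch) < 128:        # stand-in for "part of ISO 646"
            out.append(ch)
        elif ch in rules:
            out.append(rules[ch])
        else:
            raise ValueError(f"no down-conversion rule for {ch!r}")
    return "".join(out)
```

Under these example rules both "Bjørn" and "Bjorn" give "Bjorn", so
the original spelling cannot be recovered from the result.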

It is worth noting that users of the directory at times will need to
search the directory based on textual information that has earlier
been output from it. Usage of User Friendly Names is an example of
this. They may also want to read information from the directory based
on a Distinguished Name. This may limit the usefulness of the second
method, since internationalised output cannot be converted back to
the stored value.


Issues that need to be addressed
--------------------------------

In summary we see the following areas that will have to be addressed
when discussing how to support national characters in X.500:

a) How to represent textual information when input as keys to be
   searched for in the directory.  

b) How to represent textual information that is input into the
   directory (for storage) at the user-interface.

c) How to represent textual information that is output from the
   directory at the user-interface. 

d) Storage of textual information that may contain national characters
   in the directory.


Discussion
----------

It can probably safely be assumed that users inputting information
containing national characters into the directory will have terminal
equipment capable of communicating the relevant national characters.
There is still the problem of converting text input from the character
set used by the terminal to the T61String character set. Keld
Simonsen's Internet Draft mentioned above could be taken as a basis
for an implementation of a mechanism to support conversion between
character sets used by terminal equipment and the T61String character
set. (In a situation where the terminal equipment does not support
the required national characters, mnemonics could be used.)


To support searching the directory for values containing national
characters there are several approaches that could be considered. As
described above the problems that need to be addressed in this
situation are the possibility that the user is unable to input the
required character, and that the user may not know that the name
contains certain national characters, or exactly which national
characters have been used.

We will first consider the situation where the user only knows an
internationalised version of the name sought.

It is presumed that the internationalisation of names takes place
based on consistent rules, so that the internationalisation performed
in different situations (e.g.  for a business card, for an electronic
mail address and in the directory) will give the same result. For the
directory, storing the conversion rules in the directory could be a
solution.

Two options are seen for this situation:

o Attribute values containing national characters (i.e. stored using
  T61String) are also stored in an internationalised version in a
  PrintableString.

o The searching algorithms employed by DSAs are modified so that
  values stored in the directory are internationalised before being
  compared with the value the user is searching for.

A benefit of the first method is that it would be applicable for all
the relational operators. For the second method, it is unclear if the
mechanism should be enabled for all relational operators in a search
operation, or restricted to approximate match.

If the first method is chosen one should pay attention to how entries
are shown to users, with the goal of avoiding showing both the
PrintableString and T61String versions of what is essentially the
same information.


In the case where the user's terminal is unable to support entering
all needed national characters when entering keys for searching we see
two possible solutions, both based on transliteration:

o Usage of Mnemonics to represent characters not representable in the
  terminal's character set.

o The user entering an internationalised version of the value that is
  being sought.

There are problems with both of these methods. As noted above,
entering text using mnemonics may not be trivial and may be prone to
errors, as the mnemonics may not be as intuitive as one could wish. If
such an approach is used one should make sure that users have access
to some form of support making it easy to find the right mnemonics to
use. For the second approach it may be a problem if the user is
unfamiliar with the rules used for internationalisation of national
characters. This is not unlikely, at least for national characters
that cannot be intuitively internationalised.


We will now consider presentation of information stored in the
directory. In the situation where the user's terminal is capable of
presenting all the relevant national characters, the problem is
reduced to converting the T61String based representation to the
terminal's character set. This could be solved in a similar way as
discussed above with respect to entering information into the
directory. If the directory contains PrintableString versions of
T61String attribute values it will be hard to avoid displaying these
as well as the T61String version, at least without knowledge of the
rules for internationalisation of text with national characters
applicable to the entry being shown.
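
If the applicable rules are known to the DUA, suppressing the
duplicate values could look roughly like the sketch below. The rule
table, the representation of attribute values as (type, text) pairs
and the function names are all assumptions made for this example, and
comparison is done ignoring case as for a Case Ignore String.

```python
# Hypothetical rules for the entry being shown (lower case only, since
# values are lower-cased before comparison in this sketch).
RULES = {"æ": "ae", "ø": "o", "å": "aa"}


def internationalise(text: str, rules: dict = RULES) -> str:
    """Down-convert national characters; other characters pass through."""
    return "".join(rules.get(ch, ch) for ch in text)


def values_to_display(values, rules: dict = RULES):
    """Drop PrintableString values that are merely the internationalised
    form of a T61String value of the same attribute.

    `values` is a list of (asn1_type, text) pairs for one attribute.
    """
    derived = {
        internationalise(text.lower(), rules)
        for asn1_type, text in values
        if asn1_type == "T61String"
    }
    return [
        text
        for asn1_type, text in values
        if not (asn1_type == "PrintableString" and text.lower() in derived)
    ]
```

A PrintableString value that is not derived from any T61String value,
such as an abbreviated form of the name, would still be shown.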

If the terminal is unable to display all the required characters we
see two options:

o Usage of Mnemonics to represent characters not representable in the
  terminal's character set.

o Presentation of an internationalised version of the attribute values
  containing national characters. We see two options with respect to
  the origin of such an internationalised version of the attribute
  values: 

   - The values could be stored in the directory when the data was
     entered into the directory.

   - The values could be generated from the corresponding T61String
     based values on the fly during presentation. 

  The last approach would require knowledge of the rules used to
  internationalise text containing national characters. These rules
  may very well be different between countries (and perhaps also
  between regions within a country). Such rules could either be hard-
  coded into the relevant programs, or perhaps be available in certain
  widely replicated entries in the directory.

  When using the first approach one would not display any of the
  T61String based values, presuming that they all have corresponding
  PrintableString versions.


A (suggestion for a) recommendation
-----------------------------------

The author currently has the following view on how national characters
should be supported by the directory. 

Textual information containing national characters should be stored in
the directory using the T61String ASN.1 data type. No internationalised
version of attribute values containing national characters should be
stored. 

For searching the directory DSAs should have searching algorithms that
internationalise stored values and compare them with the key being
searched on, in addition to comparing the T61String value. This
supports users who are unable to enter keys containing national
characters. Experimentation is needed to determine which relational
operators should support this function.

Internationalisation of text containing national characters should
take place according to nationally agreed rules. These rules should be
available in the directory in a widely replicated entry.
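
The matching suggested above might be sketched as follows for the
equality case only. The per-country rule tables are hypothetical and
merely stand in for rules that would be read from a replicated
directory entry; the function names and the use of a country code to
select the rules are likewise assumptions of this sketch.

```python
# Hypothetical nationally agreed rules, keyed by country code. In the
# scheme suggested above these would be fetched from the directory.
RULES_BY_COUNTRY = {
    "NO": {"æ": "ae", "ø": "o", "å": "aa"},
    "DE": {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"},
}


def internationalise(text: str, rules: dict) -> str:
    """Down-convert national characters; other characters pass through."""
    return "".join(rules.get(ch, ch) for ch in text)


def matches(stored: str, key: str, country: str) -> bool:
    """Equality match, ignoring case, against both the stored T61String
    value and its internationalised form."""
    rules = RULES_BY_COUNTRY.get(country, {})
    stored_l, key_l = stored.lower(), key.lower()
    return key_l == stored_l or key_l == internationalise(stored_l, rules)


def search(entries, key: str):
    """`entries` is a list of (country, stored_value) pairs."""
    return [value for country, value in entries if matches(value, key, country)]
```

Note that a key is only found if it follows the rules of the country
of the stored value: with the example tables, "Mueller" finds "Müller"
but "Muller" does not.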

If needed, DUAs should be prepared to internationalise textual
information containing national characters before it is communicated
to the user.

DUAs should support conversion between the terminal equipment's
character set and T61String. It is up to the implementor to decide
how to handle characters not representable on the user's terminal.
Using mnemonics may be an option.