Supporting national characters in the Directory
Geir Pedersen <geirp@usit.uio.no> Mon, 09 March 1992 01:57 UTC
Received: from nri.nri.reston.va.us by ietf.NRI.Reston.VA.US id aa01235; 8 Mar 92 20:57 EST
Received: from nri.reston.va.us by NRI.Reston.VA.US id aa14615; 8 Mar 92 20:58 EST
Received: from bells.cs.ucl.ac.uk by NRI.Reston.VA.US id aa14609; 8 Mar 92 20:57 EST
Received: from gollum.uio.no by bells.cs.ucl.ac.uk with Internet SMTP id <g.15262-0@bells.cs.ucl.ac.uk>; Mon, 9 Mar 1992 01:34:06 +0000
Received: from gandalf.uio.no by gollum.uio.no with SMTP id <AAgollum25832>; Mon, 9 Mar 1992 02:34:54 +0100
From: Geir Pedersen <geirp@usit.uio.no>
Date: Mon, 09 Mar 1992 02:34:08 +0100
Message-Id: <9203090134.AAgandalf16611@gandalf.uio.no>
To: osi-ds@cs.ucl.ac.uk
Subject: Supporting national characters in the Directory
Below is a document that discusses how to support national characters in the X.500 directory. This is a topic that has been discussed several times within RARE WG3, and we would like to bring this discussion into a broader international forum. I hope that the enclosed document can form the basis for a discussion both on the net and at the San Diego meeting of the IETF OSI-DS group next week. (I will be at the meeting.)

Geir.

====================

Geir Pedersen
University of Oslo, Norway
eMail: Geir.Pedersen@usit.uio.no
March 8th, 1992

Supporting national characters in the Internet and COSINE X.500 Directory
==================================================

DRAFT

Introduction
------------

We are in the process of building a global X.500 based directory. In many countries it is seen as essential for the success of the directory that it is able to recognise, store and present textual information, like personal and organisational names, represented in the character sets used by those concerned. This means the directory must be able to handle national characters not found in the English alphabet.

As we are still in an early phase in establishing the directory as a widely available network service, now is a good time to tackle the issues related to supporting usage of national characters in the directory.

This document discusses the requirements with respect to support for national characters from the directory as well as ways to meet the identified requirements. Finally, a recommendation on how to support national characters in the Internet and COSINE X.500 Directory is given.

This document is heavily influenced by discussions within RARE WG3 and the RARE/CPMU Character Set Group.

Some definitions:

o "National characters" - alphabetic characters unique to character sets used only by a few countries or languages (i.e. not found in ISO 646).

o "Textual information" - a text that may contain national characters.

Character set support in X.500
------------------------------

Most X.500 attributes meant to hold names to be used by human users are based on the Case Ignore String or Case Exact String X.500 syntaxes. Both these syntaxes are in turn based on the T61String and PrintableString data types as defined in X.208, i.e. an attribute value for an attribute based on either of these two X.500 syntaxes can use either the T61String or the PrintableString data type.

The character set defined by PrintableString is a proper subset of the characters that are part of the T61String data type. It contains these characters: the letters A-Z and a-z, the digits 0 through 9, space and the following characters: '()+,-./:=?. The characters that are part of the PrintableString data type are a subset of ISO 646.

The T61String data type is based on the T.61 character set. According to X.208 the T61String data type may hold characters as defined in the registered character sets 87, 102, 103, 106 and 107. (As registered by ECMA according to ISO 2375.) Of these only the sets 102 and 103 are still registered. (The definition of these character sets may be found in Keld Simonsen's Internet Draft: "Character Mnemonics & Character Sets".)

Even though the directory is capable of storing a large number of different characters, it is easy to see that it is currently not able to hold all characters in use by its user community. (As an example: the Greek character set is not supported.)

What do the users require?
--------------------------

There is a general pressure from users in countries that have national characters in their alphabet to receive support for these characters from the computer systems they use. It is less clear if there is a desire from users to have access to and use all national characters, i.e. including national characters not used in the user's country.
It is our feeling that there may in fact be a reluctance among users to relate to letters that are not part of the alphabet they use for written communication in their native language. Conversely, there seems to be a tendency, in particular for organisations that perform a major part of their business in an international environment, to internationalise their names.

Thus it may not be a requirement in a country to support more than the national characters actually in use in that country, perhaps including any additional national characters used in neighboring countries. It might even be a requirement for the directory to enable the user to interact with it using an internationalised version of textual information containing national characters not used in the country of the user. If we presume that internationalised versions of names are found on business cards, on attendance lists, in bibliographies, etc., we see that it might in fact be a requirement for the directory to support locating information based on internationalised versions of names.

To support national characters means that it must be possible to input text with such characters for storage. Output of information containing national characters must be done using national characters or, if desired by the user, in an internationalised form.

Input and output of textual information including national characters
---------------------------------------------------------------------

Traditionally, terminals often employ a character set variant of ISO 646. Modern terminal equipment (in a wide sense of the word) is usually based on a newer character set containing a much larger repertoire, e.g. several parts of ISO 8859. Although these newer terminals have a very large character repertoire, they will usually be configured for the needs of one particular country. This affects how easy it is to enter the various characters through the keyboard.
We currently have a varied situation with respect to how well terminals support national characters. It is the author's opinion that it is unlikely that terminals in the foreseeable future will all support national characters equally well.

Few or no terminals support the coded character set defined by T61String. This puts a requirement on the user's Directory User Agent to convert properly between the T61String character set and the character set being used by the terminal.

Often the problem of lack of support for some characters in a terminal is handled through usage of some form of transcription of the unsupported characters into a format that can be communicated through the terminal. Two transcription methods will be mentioned here: transcription to unambiguous character mnemonics, and ambiguous transcription to one or more characters that are part of the supported repertoire.

There is a current Internet Draft by Keld Simonsen titled "Character Mnemonics & Character Sets". This ID defines a mechanism for representing all characters recognised in ISO DIS 10646 using any of a large number of character sets. Characters not represented directly in any given character set that is part of this scheme may be represented using a unique mnemonic. The mnemonics are constructed from the repertoire of ISO 646, which is a subset of a majority of the coded character sets in use. This Internet Draft also defines how to convert text represented in any of these character sets to any other character set. During conversion, characters not representable in the destination character set are transcribed to a mnemonic name.

T61String and PrintableString are both covered by the method described in Keld Simonsen's Internet Draft. According to the Internet Draft the mnemonics have been designed "so the graphical appearance ... resembles as much as possible ... the graphical appearance of the character."
In spite of this the author tends to believe that users who only occasionally need to enter, as a mnemonic, a national character they are not familiar with will need some form of assistance to enter the right mnemonic.

The uniqueness of the mnemonics specified by the method described above is assured by the use of an intro character for the mnemonics - by default the ampersand character.

Another transcription method that is much used is to establish a conversion table for "down-conversion" of each national character to one or more characters that are part of ISO 646. Employing such a conversion gives a result that cannot be reversed. Many countries and networks have established such conversion rules, prompted by restrictions on which characters may be present in an RFC822 address specification. Below we will refer to such a down-conversion of national characters into an international version as "internationalisation". Internationalising national characters may only be a method applicable to languages/alphabets based on the Latin alphabet.

It is worth noting that users of the directory will at times need to search the directory based on textual information that has earlier been output from it. Usage of User Friendly Names is an example of this. They may also want to read information from the directory based on a Distinguished Name. This may limit the usefulness of the second method.

Issues that need to be addressed
--------------------------------

In summary we see the following areas that will have to be addressed when discussing how to support national characters in X.500:

a) How to represent textual information when input as keys to be searched for in the directory.

b) How to represent textual information that is input into the directory (for storage) at the user-interface.

c) How to represent textual information that is output from the directory at the user-interface.

d) Storage of textual information that may contain national characters in the directory.
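
As an illustration of the "down-conversion" transcription described earlier in this section, here is a minimal Python sketch. The conversion table is hypothetical and covers only a few Scandinavian letters; real tables would be agreed per country or network, and the function name `internationalise` is chosen here purely for illustration. The example shows why the result cannot be reversed: two different stored names can collapse to the same ISO 646 string.

```python
# Hypothetical down-conversion table (an assumption for illustration only);
# actual rules would be defined nationally or per network.
DOWN_CONVERSION = {
    "æ": "ae", "ø": "oe", "å": "aa",
    "Æ": "AE", "Ø": "OE", "Å": "AA",
}


def internationalise(text, table=DOWN_CONVERSION):
    """Replace each national character with its ISO 646 transcription.

    Characters without an entry in the table are passed through unchanged.
    """
    return "".join(table.get(ch, ch) for ch in text)


if __name__ == "__main__":
    # Two distinct spellings map to the same internationalised form,
    # so the original cannot be recovered from the result.
    print(internationalise("Pål"))
    print(internationalise("Paal"))
```

Because "Pål" and "Paal" both yield the same output, a reader of the internationalised form cannot tell which spelling is stored in the directory.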

Discussion
----------

It can probably safely be assumed that users inputting information containing national characters into the directory will have terminal equipment capable of communicating the relevant national characters. There is still the problem of converting text input from the character set used by the terminal to the T61String character set. Keld Simonsen's Internet Draft mentioned above could be taken as a basis for an implementation of a mechanism to support conversion between character sets used by terminal equipment and the T61String character set. (In a situation where the terminal equipment does not support the required national characters, mnemonics could be used.)

To support searching the directory for values containing national characters there are several approaches that could be considered. As described above, the problems that need to be addressed in this situation are the possibility that the user is unable to input the required character, and that the user may not know that the name contains certain national characters, or exactly which national characters have been used.

We will first consider the situation where the user only knows an internationalised version of the name sought. It is presumed that the internationalisation of names takes place based on consistent rules, so that the internationalisation performed in different situations (e.g. for a business card, for an electronic mail address and in the directory) will give the same result. For the directory, storing the conversion rules in the directory could be a solution. Two options are seen for this situation:

o Attribute values containing national characters (i.e. stored using T61String) are also stored in an internationalised version in a PrintableString.

o The searching algorithms employed by DSAs are modified so that values stored in the directory are internationalised before being compared with the value the user is searching for.
A benefit of the first method is that it would be applicable for all the relational operators. For the second method, it is unclear if the mechanism should be enabled for all relational operators in a search operation, or restricted to approximate match. If the first method is chosen one should pay attention to how entries are shown to users, with the goal of avoiding showing both the PrintableString and T61String versions of what is essentially the same information.

In the case where the user's terminal is unable to support entering all needed national characters when entering keys for searching, we see two possible solutions, both based on transliteration:

o Usage of mnemonics to represent characters not representable in the terminal's character set.

o The user entering an internationalised version of the value that is being sought.

There are problems with both of these methods. As noted above, entering text using mnemonics may not be trivial and may be prone to errors, because the mnemonics may not be as intuitive as one could wish. If such an approach is used one should make sure that users have access to some form of support making it easy to spot the right mnemonics to use. For the second approach it may be a problem if the user is unfamiliar with the rules used for internationalisation of national characters. This is not unlikely, at least for national characters that cannot be intuitively internationalised.

We will now consider presentation of information stored in the directory. In the situation where the user's terminal is capable of presenting all the relevant national characters, the problem is reduced to converting the T61String based representation to the terminal's character set. This could be solved in a similar way as discussed above with respect to entering information into the directory.
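
The conversion from a stored value to a terminal's character set, with a mnemonic fallback for characters the terminal cannot represent, might be sketched as follows in Python. This is only a sketch under several assumptions: the stored value is taken to be already decoded into a Python string (no T.61 decoding is shown), the terminal's character set is named by a Python codec such as "ascii" or "latin-1", and the small mnemonic table is a placeholder for illustration; the real mnemonics should be taken from the Internet Draft itself. The function name `to_terminal` is hypothetical.

```python
# Placeholder mnemonic table (an assumption for illustration); the actual
# mnemonics are defined in "Character Mnemonics & Character Sets".
MNEMONICS = {"ø": "o/", "å": "aa", "æ": "ae"}

# The intro character that marks the start of a mnemonic (ampersand by default).
INTRO = "&"


def to_terminal(text, terminal_charset="ascii", mnemonics=MNEMONICS, intro=INTRO):
    """Return text using only characters the terminal character set can encode.

    Characters the terminal cannot represent are replaced by the intro
    character followed by a mnemonic; characters with no known mnemonic
    are replaced by "?". Escaping of a literal intro character in the
    text is not handled in this sketch.
    """
    out = []
    for ch in text:
        try:
            ch.encode(terminal_charset)  # raises if not representable
            out.append(ch)
        except UnicodeEncodeError:
            name = mnemonics.get(ch)
            out.append(intro + name if name is not None else "?")
    return "".join(out)


if __name__ == "__main__":
    print(to_terminal("Bjørn", "ascii"))
    print(to_terminal("Bjørn", "latin-1"))
```

A terminal configured for ISO 8859-1 receives the name unchanged, while an ISO 646 terminal receives the mnemonic form.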
If the directory contains PrintableString versions of T61String attribute values it will be hard to avoid displaying these as well as the T61String version, at least without knowledge of the rules for internationalisation of text with national characters applicable to the entry being shown.

If the terminal is unable to display all the required characters we see two options:

o Usage of mnemonics to represent characters not representable in the terminal's character set.

o Presentation of an internationalised version of the attribute values containing national characters. We see two options with respect to the origin of such an internationalised version of the attribute values:

  - The values could be stored in the directory when the data was entered into the directory.

  - The values could be generated from the corresponding T61String based values on the fly during presentation.

  The last approach would require knowledge of the rules used to internationalise text containing national characters. These rules may very well be different between countries (and perhaps also between regions within a country). Such rules could either be hard-coded into the relevant programs, or perhaps be available in certain widely replicated entries in the directory.

  When using the first approach one would not display any of the T61String based values, presuming that they all have corresponding PrintableString versions.

A (suggestion for a) recommendation
-----------------------------------

The author currently has the following view on how national characters should be supported by the directory.

Textual information containing national characters should be stored in the directory using the T61String ASN.1 data type. No internationalised version of attribute values containing national characters should be stored.
For searching the directory, DSAs should have searching algorithms that internationalise stored values and compare them with the key being searched on, in addition to comparing the T61String value, thus supporting users unable to enter keys containing national characters. Experimentation is needed to determine which relational operators should support this function.

Internationalisation of text containing national characters should take place according to nationally agreed rules. These rules should be available in the directory in a widely replicated entry.

If needed, DUAs should be prepared to internationalise textual information containing national characters before it is communicated to the user.

DUAs should support conversion between the terminal equipment's character set and T61String. It is up to the implementor to decide how to handle characters not representable on the user's terminal. Using mnemonics may be an option.
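
The recommended search behaviour can be illustrated with a small Python sketch for the equality case only. It is not a DSA implementation: the rule table is hypothetical (in the recommendation it would come from a nationally agreed, widely replicated directory entry), stored values are assumed to be already decoded into Python strings, case is ignored in the manner of the Case Ignore String syntax, and the names `internationalise` and `matches` are chosen for illustration. Which other relational operators should behave this way is left open, as noted above.

```python
# Hypothetical nationally agreed rules (an assumption for illustration).
# Only lower-case entries are needed because values are case-folded first.
RULES = {"æ": "ae", "ø": "oe", "å": "aa"}


def internationalise(text, rules=RULES):
    """Down-convert national characters according to the given rules."""
    return "".join(rules.get(ch, ch) for ch in text)


def matches(stored, key, rules=RULES):
    """Case-ignoring equality match of a search key against a stored value.

    The key matches if it equals the stored value directly, or if it
    equals the internationalised form of the stored value.
    """
    s = stored.casefold()
    k = key.casefold()
    return s == k or internationalise(s, rules) == k


if __name__ == "__main__":
    print(matches("Bjørn Ås", "Bjørn Ås"))     # key with national characters
    print(matches("Bjørn Ås", "bjoern aas"))   # internationalised key
    print(matches("Bjørn Ås", "Bjorn As"))     # key not following the rules
```

Only the stored value is internationalised, so no second copy of the attribute value needs to be kept in the directory, in line with the recommendation on storage.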