PP11: Internationalized Identifiers -- Taking Another Look
Lisa Dusseault <lisa@osafoundation.org> Fri, 18 January 2008 19:54 UTC
From: Lisa Dusseault <lisa@osafoundation.org>
Subject: PP11: Internationalized Identifiers -- Taking Another Look
Date: Fri, 18 Jan 2008 11:54:06 -0800
To: Apps Discuss <discuss@apps.ietf.org>
Internationalized Identifiers -- Taking Another Look
February 2008 Applications Area Architecture Workshop
John C Klensin -- Working Draft, 20080107
------------------------

I've been wrestling with questions of how we should really be looking at internationalization of applications-level identifiers for some years. Examining the questions in the context of internationalization and localization efforts seems to bring many of the issues into clearer focus. It has become clear to me, for reasons discussed below, that IRIs are not the answer to the set of problems that I, and most of the people who are worried about culturally-appropriate localization for end users, are concerned about. They may be the solution to some other set of problems but, if so, part of our job is to make that role clear. This note tries to summarize the issues and some of the requirements for a way forward.

_Problem Description_

People want, or believe they want, identifiers that can be expressed fully in their own language and script and in a way that is consistent with the associated culture. This is not just a matter of being able to use extended (i.e., beyond ASCII or Latin-1) character codings. If the culture is one in which any identifier that uses hierarchy (or nested categories) runs from largest category to smallest, then it is a reasonable expectation, and possibly a requirement, that localized identifiers run in the same direction.
In the middle of the IDNA work, James Seng pointed out that, in most of Asia, a construction such as

   username at local-domain.bigger-domain.biggest-domain

was quite foreign and that, independent of the character set used, the culturally-appropriate ordering would be (using a fairly arbitrary choice of delimiters):

   biggest-domain>bigger-domain>local-domain>>username

or at least, if fully-qualified domain names could be construed, by habit, as atomic, as

   local-domain.bigger-domain.biggest-domain>>username

Similar, and worse, issues arise with scripts whose natural directionality is from right to left. More on that below, in the more complicated case.

In many quarters, expectations for IDNs are very high. When IDNA was finished in late 2002 and early 2003, many people assumed that widespread adoption of IDNs would follow quickly and would make it possible for people to use identifiers, pass them to others, and generally communicate with the Internet in their own languages. It didn't happen. Some of the reasons involved relatively slow deployment of supporting software in the web context and the need to do further work in the context of other protocols (such as the email i18n work now going on in the EAI WG). But a more fundamental reason (or excuse) was that domain names went from being

   ASCII-name.ASCII-name.ASCII-TLD-name

to being

   Local-script-name.Local-script-name.ASCII-TLD-name

The mixture didn't solve anyone's problem, at least as understood after some observation of user behavior. So, today, there is a lot of activity around ICANN (and elsewhere) about top-level IDNs. That discussion is mostly about politics and, where it is not, the technical issues are no longer considered applications matters, so this paper doesn't describe it further. However, it is fairly clear that even

   Local-script-name.Local-script-name.Local-script-name

or, as often written, IDN.IDN.IDN, will not solve the underlying problems either.
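The mixed Local-script-name.Local-script-name.ASCII-TLD-name form is easy to see concretely. As a minimal sketch (the domain below is invented for illustration), Python's built-in "idna" codec implements the IDNA 2003 ToASCII conversion that was current at the time this note was written:

```python
# Sketch: IDNA 2003 ToASCII conversion via Python's built-in "idna"
# codec. "bücher.beispiel.example" is a made-up domain: one local-script
# (U-label) component sitting under plain-ASCII labels, i.e. exactly
# the Local-script-name...ASCII-TLD-name mixture described above.
domain = "bücher.beispiel.example"
wire = domain.encode("idna").decode("ascii")
print(wire)   # xn--bcher-kva.beispiel.example
```

Note that only the non-ASCII label is rewritten into the ACE ("xn--") form; the ASCII labels, including the TLD, pass through unchanged, which is why the user-visible result remains a mixture.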
There are two major reasons for this which bear directly on the problem at hand.

(1) Users really don't use domain names. They use email addresses, URIs, or other pieces of syntax of which domain names are a part. There is a common assumption that one can just type

   IDN1.IDN2.IDN3

and have the application software (browser or whatever) magically turn it into

   http://IDN1.IDN2.IDN3/

Of course, that plan doesn't work for more than one protocol, or even for https versus http. Those who think it will are a little naive, and it is very much in our interest, and that of the Internet, to figure out solutions before they notice.

(2) The problem, however, gets worse because, at least in areas where the web and similar references are well-developed, the current trends are consistent with what almost everyone with real experience predicted starting 15 or more years ago: that we would move from attempts to make domain names identify resources to URLs, and then URIs, with longer and more complex tails. Put differently, we would move backward to using domain names to identify resources on the network (again) and forward to using the rest of the URI to identify the kinds of objects end users cared about, with the network resources providing context.

Digression: Of course, there is another set of approaches in which the user-visible identifiers point to user-objects in some way that doesn't involve a reference to a host or equivalent resource, but those sorts of identifiers -- often known as "above DNS" -- are mostly outside the scope of this discussion except insofar as they use URI syntax. "Mostly" because such "above DNS" identifiers might be a complete alternate solution to the problems discussed here. But, so far, the marketplace hasn't been ready to take them seriously: we are still investing heavily (intellectually as well as financially) in domain names and URLs.
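The scheme-dependence in point (1) can be made concrete with a generic URI parser. In the illustrative sketch below (not from the note itself), the same string "example.com" lands in completely different components for http and mailto, so a scheme-ignorant tool cannot even locate the domain name reliably, let alone localize it:

```python
# Illustrative sketch: a generic RFC 3986-style parser assigns
# "example.com" to different components depending on the scheme,
# so no scheme-ignorant localizer can find the domain name.
from urllib.parse import urlsplit

web = urlsplit("http://example.com/page")
print(web.netloc, web.path)        # example.com /page

mail = urlsplit("mailto:user@example.com")
print(repr(mail.netloc), mail.path)  # '' user@example.com
```

For http the domain is the authority (netloc); for mailto the entire mailbox, domain included, is just opaque path text. Only knowledge of the mailto scheme tells you that the part after "@" is a domain name at all.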
So now, in many cases, we are looking at a URI that consists of a protocol identifier, some syntax, a domain name, some syntax, and a tail that contains a good deal of syntax as well. We can internationalize the domain names and present them, more or less, in local characters. We can internationalize the variable substrings of the tail and present them, again more or less, in local characters. When we have gotten through doing those things and applying a few very subtle rules (especially for right-to-left scripts), we have IRIs.

Unfortunately, relative to the desire to see things in a linguistically and culturally-appropriate way, this just doesn't do it. We've still got ASCII protocol identifiers, and required ASCII pieces of syntax that differ from one protocol to another. As part of that syntax, the overall structure of the URI still runs from left to right. Even if one localizes the IRI by translating or transliterating the protocol identifier (already the common practice in many parts of the world) and transcoding the syntax bits into local characters (done sometimes, but much less common today), converting to and from URI format as things go out onto, and come in from, the wire, we are still in trouble once one leaves the web environment and, to some extent, while still in it.

The problem is that there are enough variations on URI syntax -- variations that depend on knowledge of the protocol identifier and the properties of, and syntax required by, that protocol -- that one cannot tell, in the general case, what is a syntax element and what might be part of an identifier. That, in turn, means that one cannot write a general-purpose localization engine that does the right things for a particular language and culture. One has to, somehow, build an engine that knows about each protocol identifier and keep updating it as new protocol identifiers are added...
a process that is sufficiently burdensome and error-prone as to be impractical and probably implausible.

_A Pointer to a Solution_

We need to rethink the IRI as our primary internationalization tool in the light of the problem and target discussed above. Our goal should be a collection of data elements that are identified in a sufficiently clear way that it becomes possible to do a large part of the localization job algorithmically, without having to rely on either knowledge of the particular syntax associated with a given protocol identifier or on parsing heuristics that will mostly work.

Ideally, we need to be able to identify an internationalized identifier in running text so that it can be localized. That, in turn, probably means that, for internationalized forms, we need to revisit the decision that URIs (or equivalent) can appear without unambiguous opening and ending delimiters.

I'm not prepared to propose a syntax but, to demonstrate that it is not impossible, note that it is clear that one could start with the general URI spec in RFC 3986, use it to create a list of data element and specific delimiter types, and then use an XML-based syntax to express a URI in tagged data element form (with the standard delimiters expressed as attribute values). From that form, one could perform transformations to a localized version of the XML form (e.g., using internationalized identifiers and local delimiters) and, from that, to a localized (no tag) identifier presentation form if desired (including those local delimiters and local ordering). Again, that may not be the only solution.
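To make the tagged-data-element idea slightly more tangible, here is a minimal sketch. The element names, the delimiter-as-attribute convention, and the ">" local delimiter are all hypothetical choices for illustration, not a proposed syntax; the point is only that once the parts are labeled, a localizer can reorder and re-delimit them without scheme-specific heuristics:

```python
# Hypothetical sketch of the XML tagged-data-element form: decompose a
# URI into labeled parts (standard delimiters carried as attributes),
# then render a localized presentation form from the tagged tree.
from urllib.parse import urlsplit
import xml.etree.ElementTree as ET

def uri_to_tagged(uri: str) -> ET.Element:
    """Build a (made-up) tagged form of a URI."""
    parts = urlsplit(uri)
    root = ET.Element("uri")
    ET.SubElement(root, "scheme", delim="://").text = parts.scheme
    host = ET.SubElement(root, "host", delim=".")
    for label in parts.netloc.split("."):
        ET.SubElement(host, "label").text = label
    ET.SubElement(root, "path", delim="/").text = parts.path
    return root

def localized_host(root: ET.Element) -> str:
    """Example localization: biggest-domain-first ordering, as in the
    Asian-ordering example above, with ">" as an arbitrary local
    delimiter -- done purely from the tags, with no scheme knowledge."""
    labels = [label.text for label in root.find("host")]
    return ">".join(reversed(labels))

tagged = uri_to_tagged("http://local.bigger.biggest/page")
print(localized_host(tagged))   # biggest>bigger>local
```

The reverse transformation (tagged form back to a standard URI) would just reassemble the elements using the delimiters stored in the attributes, which is what would keep the wire form globally interoperable.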
But if we are going to preserve both

 * compatibility across the global Internet, with URIs in standard form, with standard protocol identifier names and standard delimiters, and

 * the ability to have presentation and information-entry forms that are localized to local cultures and needs, including the names and delimiters used, the ways in which data elements are presented and ordered, and so on,

then we need to move beyond the IRI and its simple character-by-character, syntax-preserving mapping to and from URIs. If we don't do it, we will certainly see local solutions that will make global interoperability and global references a thing of the past.