PP11: Internationalized Identifiers -- Taking Another Look

Lisa Dusseault <lisa@osafoundation.org> Fri, 18 January 2008 19:54 UTC

Internationalized Identifiers -- Taking Another Look
February 2008 Applications Area Architecture Workshop
John C Klensin  -- Working Draft, 20080107

------------------------

I've been wrestling with questions of how we should really be
looking at internationalization of applications-level
identifiers for some years.  Examining the questions in the
context of internationalization and localization efforts seems
to bring many of the issues into clearer focus.  It has become
clear to me, for reasons discussed below, that IRIs are not the
answer to the set of problems that I, and most of the people
who are worried about culturally-appropriate localization for
end users, are concerned about.  They may be the solution to
some other set of problems but, if so, part of our job is to
make that role clear.  This note tries to summarize the issues
and some of the requirements for a way forward.


_Problem Description_

People want, or believe they want, identifiers that can be
expressed fully in their own language and script and in a way
that is consistent with the associated culture.  This is not
just a matter of being able to use extended (i.e., beyond ASCII
or Latin-1) character codings.  If the culture is one in which
any identifier that uses hierarchy (or nested categories) runs
from largest category to smallest, then it is a reasonable
expectation and possibly a requirement that localized
identifiers run in the same direction.  In the middle of the
IDNA work, James Seng pointed out that, in most of Asia, a
construction such as

     username@local-domain.bigger-domain.biggest-domain

was quite foreign and that, independent of the character set
used, the culturally-appropriate ordering would be (using a
fairly arbitrary choice of delimiters):

     biggest-domain>bigger-domain>local-domain>>username

or at least, if fully-qualified domain names could be
construed, by habit, as atomic, as

      local-domain.bigger-domain.biggest-domain>>username
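
As an illustration only, the largest-to-smallest reordering shown above can be
sketched in a few lines of Python. The function names and the ">" and ">>"
delimiters are assumptions taken from the arbitrary example in the text, not
part of any specification, and the sketch ignores quoting, escaping, and
right-to-left scripts.

```python
def localize_order(address, level_sep=">", user_sep=">>"):
    """Rewrite user@a.b.c as c>b>a>>user (largest category first)."""
    user, at, domain = address.rpartition("@")
    if not at or not user or not domain:
        raise ValueError("expected an address of the form user@domain")
    labels = domain.split(".")
    return level_sep.join(reversed(labels)) + user_sep + user


def canonical_order(localized, level_sep=">", user_sep=">>"):
    """Inverse of localize_order: c>b>a>>user back to user@a.b.c."""
    domain_part, sep, user = localized.rpartition(user_sep)
    if not sep or not user or not domain_part:
        raise ValueError("expected biggest>...>local>>user")
    labels = domain_part.split(level_sep)
    return user + "@" + ".".join(reversed(labels))


print(localize_order("username@local-domain.bigger-domain.biggest-domain"))
```

The reordering is only mechanical here because the structure of the address
(one "@", dot-separated labels) is known in advance.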

Similar, and worse, issues arise with scripts whose natural
directionality is from right to left.  More on that below, in
the more complicated case.

In many quarters, expectations for IDNs are very high.  When
IDNA was finished in late 2002 and early 2003, many people
assumed that widespread adoption of IDNs would follow quickly
and would make it possible for people to use identifiers, pass
them to others, and generally communicate with the Internet in
their own languages.   It didn't happen.  Some of the reasons
involved relatively slow deployment of supporting software in
the web context and the need to do further work in the context
of other protocols (such as the email i18n work now going on in
the EAI WG).  But a more fundamental reason (or excuse) was
that domain names went from being

     ASCII-name.ASCII-name.ASCII-TLD-name

to being

     Local-script-name.Local-script-name.ASCII-TLD-name

The mixture didn't solve anyone's problem, at least as
understood after some observation of user behavior.  So, today,
there is a lot of activity around ICANN (and elsewhere) about
top-level IDNs.  That discussion is mostly about politics and,
where it is not, the technical issues are no longer considered
applications matters, so this paper doesn't describe it
further.

However, it is fairly clear that even

     Local-script-name.Local-script-name.Local-script-name

or, as often written,

     IDN.IDN.IDN

will not solve the underlying problems either.  There are two
major reasons for this which bear directly on the problem at
hand.

(1) Users really don't use domain names.  They use email
addresses, URIs, or other pieces of syntax of which domain
names are a part.   There is a common assumption that one can
just type

    IDN1.IDN2.IDN3

and have the application software (browser or whatever)
magically turn it into

    http://IDN1.IDN2.IDN3/

Of course, that plan doesn't work for more than one protocol,
or even for https versus http.  Those who think it will are a
little naive and it is very much in our interest, and that of
the Internet, to figure out solutions before they notice.

(2) The problem, however, gets worse because, at least in
areas where the web and similar references are well-developed,
the current trends are consistent with what almost everyone
with real experience predicted starting 15 or more years ago:
that we would move from attempts to make domain names identify
resources to URLs, and then URIs, with longer and more complex
tails.  Put differently, we would move backward to using domain
names to identify resources on the network (again) and forward
to using the rest of the URI to identify the kinds of objects
end users cared about, with the network resources providing
context.

	Digression: Of course, there is another set of approaches
	in which the user-visible identifiers point to user-objects
	in some way that doesn't involve a reference to a host or
	equivalent resource, but those sorts of identifiers --
	often known as "above DNS" -- are mostly outside the scope
	of this discussion except insofar as they use URI syntax.
	"Mostly" because such "above DNS" identifiers might be a
	complete alternate solution to the problems discussed here.
	But, so far, the marketplace hasn't been ready to take them
	seriously: we are still investing heavily (intellectually
	as well as financially) in domain names and URLs.

So now, in many cases, we are looking at a URI that consists of
a protocol identifier, some syntax, a domain name, some syntax,
and a tail that contains a good deal of syntax as well.  We can
internationalize the domain names and present them, more or
less, in local characters.  We can internationalize the
variable substrings of the tail and present them, again more or
less, in local characters.   When we have gotten through doing
those things and applying a few very subtle rules (especially
for right-to-left scripts), we have IRIs.
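
To make the mapping concrete, here is a deliberately simplified Python sketch
of turning an IRI-like string into a URI: the host goes through IDNA and the
tail is percent-encoded as UTF-8. The function name iri_to_uri is an
assumption for illustration, Python's built-in "idna" codec implements the
2003 version of IDNA, and the sketch drops userinfo and skips the subtle
rules (bidirectional text in particular), so it is not the full IRI-to-URI
procedure.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def iri_to_uri(iri):
    """Simplified IRI -> URI mapping; userinfo and bidi rules are ignored."""
    parts = urlsplit(iri)
    # Domain name: convert each label to its ASCII ("xn--") form.
    host = (parts.hostname or "").encode("idna").decode("ascii")
    if parts.port is not None:
        host += ":" + str(parts.port)
    # Tail: percent-encode non-ASCII characters as UTF-8 octets, leaving
    # the ASCII delimiters and any existing percent-escapes untouched.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    fragment = quote(parts.fragment, safe="%")
    return urlunsplit((parts.scheme, host, path, query, fragment))


print(iri_to_uri("http://b\u00fccher.example/stra\u00dfe"))
```

Note what does not change: the scheme name, "://", "/", "?", and "#" all stay
in ASCII and in the same left-to-right positions.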

Unfortunately, relative to the desire to see things in a
linguistically and culturally-appropriate way, this just
doesn't do it.  We've still got ASCII protocol identifiers, and
required ASCII pieces of syntax that differ from one protocol
to another.  As part of that syntax, the overall structure of
the URI still runs from left to right.  Even if one decides to
localize the IRI by translating or transliterating the protocol
identifier (the common practice in many parts of the world
already) and transcoding the syntax bits into local characters
(done sometimes, but much less common today), converting to and
from URI format as things go out onto, and come in from, the
wire, we are still in trouble once one leaves the web
environment and, to some extent, while still in it.  The
problem is that there are enough variations on URI syntax --
variations that depend on knowledge of the protocol identifier
and the properties of, and syntax required by, that protocol --
that one cannot tell, in the general case, what is a syntax
element and what might be part of an identifier.
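
A small Python sketch of the ambiguity, using only the standard library's
generic splitter. The three sample URIs and the generic_view helper are
made up for illustration. The same characters ("@", ":") are delimiters
under one scheme and ordinary content, as far as a generic parser can tell,
under another.

```python
from urllib.parse import urlsplit

examples = [
    "http://user@example.com/a:b",
    "mailto:user@example.com",
    "urn:isbn:0451450523",
]


def generic_view(uri):
    """Return what a scheme-agnostic parser can see of a URI."""
    parts = urlsplit(uri)
    return {
        "scheme": parts.scheme,
        "authority": parts.netloc,
        "path": parts.path,
    }


for uri in examples:
    print(uri, "->", generic_view(uri))
```

For the http example the "@" separates userinfo from host inside the
authority; for mailto the whole "user@example.com" is an opaque path, and
only knowledge of the mailto scheme says that its "@" is syntax. Likewise
the second ":" in the urn example is a delimiter only to software that knows
URN syntax.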

That, in turn, means that one cannot write a general-purpose
localization engine that does the right things for a particular
language and culture.  One has to, somehow, build an engine
that knows about each protocol identifier and keep updating it
as new protocol identifiers are added... a process that is
sufficiently burdensome and error-prone as to be impractical
and probably implausible.


_A Pointer to a Solution_

We need to rethink the IRI as our primary internationalization
tool in the light of the problem and target discussed above.
Our goal should be a collection of data elements that are
identified in a sufficiently clear way that it becomes possible
to do a large part of the localization job algorithmically,
without having to rely on either knowledge of the particular
syntax associated with a given protocol identifier or on
parsing heuristics that will mostly work.  Ideally, we need to
be able to identify an internationalized identifier in running
text so that it can be localized.  That, in turn, probably
means that, for internationalized forms, we need to revisit the
decision that URIs (or equivalent) can appear without
unambiguous opening and ending delimiters.

I'm not prepared to propose a syntax, but, to demonstrate that
it is not impossible, note that it is clear that one could
start with the general URI spec in RFC 3986, use it to create a
list of data element types and specific delimiter types, and then use
an XML-based syntax to express a URI in tagged data element
form (with the standard delimiters expressed as attribute
values).  From that form, one could perform transformations to
a localized version of the XML form (e.g., using
internationalized identifiers and local delimiters) and, from
that, to a localized (no tag) identifier presentation form if
desired (including those local delimiters and local ordering).
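
The following Python sketch shows one way such a tagged form could look. It
is not a proposed syntax: the element names, the "before"/"after" attribute
names, and the localization tables are all hypothetical. The splitting uses
the generic regular expression from Appendix B of RFC 3986, and the path is
left as a single element rather than broken into segments.

```python
import re
import xml.etree.ElementTree as ET

# Generic URI-splitting expression from RFC 3986, Appendix B.
URI_RE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def uri_to_xml(uri):
    """Express a URI as tagged data elements; delimiters become attributes."""
    scheme, authority, path, query, fragment = URI_RE.match(uri).group(
        2, 4, 5, 7, 9
    )
    root = ET.Element("uri")

    def add(tag, text, **delims):
        if text is not None:
            ET.SubElement(root, tag, delims).text = text

    add("scheme", scheme, after=":")
    add("authority", authority, before="//")
    add("path", path)
    add("query", query, before="?")
    add("fragment", fragment, before="#")
    return root


def xml_to_text(root):
    """Flatten the tagged form back into an untagged presentation string."""
    return "".join(
        el.get("before", "") + (el.text or "") + el.get("after", "")
        for el in root
    )


def localize(root, scheme_names, delimiters):
    """Return a localized copy, using hypothetical lookup tables."""
    out = ET.Element("uri")
    for el in root:
        attrs = {k: delimiters.get(v, v) for k, v in el.attrib.items()}
        new = ET.SubElement(out, el.tag, attrs)
        if el.tag == "scheme":
            new.text = scheme_names.get(el.text, el.text)
        else:
            new.text = el.text
    return out


tagged = uri_to_xml("http://example.com/a/b?x=1#top")
print(ET.tostring(tagged, encoding="unicode"))
print(xml_to_text(localize(tagged, {"http": "WEB"}, {"//": "||"})))
```

Because every delimiter is labeled, the localization step needs no knowledge
of the scheme. Reordering elements for a given culture would be a further
transformation on the same tagged form.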

Again, that may not be the only solution.  But if we are going
to preserve both

	* compatibility across the global Internet, with URIs in
	  standard form, with standard protocol identifier names
	  and standard delimiters, and

	* the ability to have presentation and information-entry
	  forms that are localized to local cultures and needs,
	  including the names and delimiters used, the ways in
	  which data elements are presented and ordered, and so on,

then we need to move beyond the IRI and its simple
character-by-character, syntax-preserving, mapping to and
from URIs.   If we don't do it, we will certainly see local
solutions that will make global interoperability and global
references a thing of the past.