Re: internationalization of URIs

Ted Hardie <> Tue, 16 October 2007 05:01 UTC

Mime-Version: 1.0
Message-Id: <p06240601c339e99bc2e9@[]>
In-Reply-To: <200710151939.l9FJdIkM003350@localhost.localdomain>
References: <200710151939.l9FJdIkM003350@localhost.localdomain>
Date: Mon, 15 Oct 2007 22:01:02 -0700
To: Thomas Narten <>,
From: Ted Hardie <>
Subject: Re: internationalization of URIs
Content-Type: text/plain; charset="us-ascii"
List-Id: general discussion of application-layer protocols <>

At 3:39 PM -0400 10/15/07, Thomas Narten wrote:
>As some of you may know, as part of testing the readiness of IDNs,
>ICANN has inserted a set of internationalized versions of ".test" into
>the root zone of the DNS. See
> for
>One of the questions that this has prompted (again) is what about that
>pesky "http:", that still needs to be typed in ASCII. And what about the
>rest of the URL for that matter.

So, I've read Martin's answer, but I'd like to take a shot at this from
a slightly different angle.  Inside the IETF, we commonly treat IRIs
as a presentation layer for URIs.  There is a URI form for any IRI
(and all URIs are also IRIs), so it is always possible to "stick to" the
URI as the protocol element and use IRIs as presentation elements.
(The big exception to this is inside XML, where the "anyURI" datatype
got deployed with a syntax that didn't really match URIs at all;
the result is that those strings (which appear to be IRIs to the casual
observer) are really protocol elements using different rules than
those normally used by URIs.)
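
To make the "URI form for any IRI" point concrete, here is a minimal
sketch of the mapping, loosely following RFC 3987 section 3.1.  The
function name iri_to_uri is made up for illustration.  It assumes the
host is converted with Python's built-in "idna" codec (which implements
the older IDNA 2003 rules) and that everything else is percent-encoded
as UTF-8; it ignores userinfo and does no Unicode normalization, so
treat it as an illustration rather than a conforming implementation.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left alone when percent-encoding. "%" is included so that
# escapes already present in the input are not encoded a second time.
_SAFE = "/:@!$&'()*+,;=-._~%"


def iri_to_uri(iri: str) -> str:
    """Hypothetical helper: map an IRI (presentation) to a URI (protocol)."""
    parts = urlsplit(iri)

    # Host: convert each label to its ASCII-compatible (Punycode) form.
    host = parts.hostname or ""
    netloc = host.encode("idna").decode("ascii") if host else ""
    if parts.port is not None:
        netloc += f":{parts.port}"

    # Path, query, fragment: percent-encode non-ASCII characters as
    # UTF-8 octets (quote() uses UTF-8 by default for str input).
    path = quote(parts.path, safe=_SAFE)
    query = quote(parts.query, safe=_SAFE + "?")
    fragment = quote(parts.fragment, safe=_SAFE + "?")

    return urlunsplit((parts.scheme, netloc, path, query, fragment))


if __name__ == "__main__":
    print(iri_to_uri("http://b\u00fccher.example/stra\u00dfe?q=\u00e9"))
```

A string that is already a URI passes through unchanged, which matches
the observation that all URIs are also IRIs.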

When I read Martin's comments about drop-downs, elided scheme
names, and similar tricks, my protocol-geek hat tightened on my head
and gave me a pretty severe headache.  Taking it off for a moment,
though, showed me things are still okay.  As presentation elements,
things like drop-downs, inference of scheme by an initial www, and
similar tricks are more reasonable.
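
As a rough illustration of that presentation-side inference, a user
agent might do something like the following before handing the string
to the protocol machinery.  The function name and the default of
"http" are assumptions made for this sketch, not anything standardized.

```python
def to_protocol_element(typed: str, default_scheme: str = "http") -> str:
    """Hypothetical helper: turn what the user typed into a full URI string.

    If the input already carries a scheme ("something://"), keep it;
    otherwise assume the default scheme, as browsers commonly do for
    input such as "www.example.com".
    """
    text = typed.strip()
    if "://" in text:
        return text
    return f"{default_scheme}://{text}"


if __name__ == "__main__":
    print(to_protocol_element("www.example.com"))
    print(to_protocol_element("https://example.com/"))
```

The guess lives entirely in the presentation layer: the protocol only
ever sees the full form with an explicit scheme.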

A big question, then, is whether we have all the bits
we need to map between a presentation element and a protocol element,
and whether all of those mechanisms need to be standardized.
The answer to the first is almost certainly no.  There are some contexts
where the UI aspects of a decent presentation element are just beyond
the IETF's expertise.  Taking even a simple protocol element like
the scheme portion of an HTTP URI and determining how best to represent
that in, say, modern Mongolian as used by the Oirat is no easy task.  The monk
who developed Clear Script didn't have URIs in mind at the time.  Should we
recommend they use the Latin letters in consequence?  Or the Cyrillic
alphabet (as many other Mongolian speakers do)?  Is either really the right
choice?  Especially, is it the right choice for the IETF to take on?

In case it is not clear, I think the answer to the question of whether all
presentation elements need to be standardized is "not in the IETF, anyway".
I think the IETF does need to make sure that presentation elements can use
the UCS in useful and reasonable ways.  We have worked on that, and work on
it continues, at this point largely through the efforts of dedicated
individuals rather than working groups.

We have also agreed, as a community, to take on some work
that does not rely on a presentation layer separation from the protocol.
We have agreed to work on email addresses, as one example,
and that working group decided not to use a pure presentation layer:

This working group will address one basic approach to email
internationalization. That approach is based on the use of an SMTP
extension to enable both the use of UTF-8 in envelope address local-
parts and optionally in domain-parts and the use of UTF-8 in mail
headers -- both in address contexts and wherever encoded-words are
permitted today. Its initial target will be a set of experimental
RFCs that specify the details of this approach and provide the basis
for generating and testing interoperable implementations. Its work
will include examining whether "downgrading" -- transforming an
internationalized message to one that is compatible with unextended
SMTP clients and servers and unextended MUAs -- is feasible and
appropriate and, if it is, specifying a way to do so. If it is not,
the WG will evaluate whether the effort is worth taking forward.
Other approaches may be considered by the formation of other
working groups.

 (see for
the full context).   There will be consequences for lots of
other protocol slots if this experiment succeeds, as there
are lots of places for which there is a tacit assumption that
the identifier can "look like" an email identifier (think SIP
AoRs and certs, to take two examples).  But the changes needed
to those slots and the changes needed to the URIs which refer
to those (or the IRI representation of those URIs) may not be quite
the same. 

I doubt this has helped much, honestly, but hopefully the urge
to correct my mistakes will prompt others to step in and say
something more useful.