Re: internationalization of URIs

Martin Duerst <> Tue, 23 October 2007 08:04 UTC

Date: Tue, 23 Oct 2007 16:58:39 +0900
To: Ted Hardie <>, Thomas Narten <>,
From: Martin Duerst <>
Subject: Re: internationalization of URIs

Hello Ted,

Many thanks for your contribution. I'm cc'ing it (with my
comments interspersed) to the mailing list.

At 14:01 07/10/16, Ted Hardie wrote:
>At 3:39 PM -0400 10/15/07, Thomas Narten wrote:
>>As some of you may know, as part of testing the readiness of IDNs,
>>ICANN has inserted a set of internationalized versions of ".test" into
>>the root zone of the DNS. See
>> for
>>One of the questions that this has prompted (again) is what about that
>>pesky "http:", that still needs to be typed in ASCII. And what about the
>>rest of the URL for that matter.
>So, I've read Martin's answer, but I'd like to take a shot at this from
>a slightly different angle.    Inside the IETF, we commonly treat IRIs
>as a presentation layer for URIs.  There is a URI form for any IRI
>(and all URIs are also IRIs), so it is always possible to "stick to" the
>URI as the protocol element and use IRIs as presentation elements.
>(The big exception to this is inside XML, where the "anyURI" element
>got deployed with a syntax that didn't really match URIs at all;
>the result is that those strings (which appear to be IRIs to the casual
>observer) are really protocol elements using different rules than
>those normally used by URIs.)

Some additional comments:
- Atom, as an IETF standard, is an example where the IETF uses
  IRIs as protocol elements. Atom is also XML-based, so this doesn't
  contradict the above, but I just wanted to mention this for people
  who might think, from the above, that the IETF doesn't use IRIs.
- RDF, whether written in XML or not, also uses IRIs.
  Because of the way RDF compares resource identifiers, converting
  from IRIs to URIs isn't a good idea in the case of RDF.
- There is in principle nothing that would prevent the IETF from
  using IRIs as protocol elements in a new protocol. In my opinion,
  this would actually be the right thing. The conversion to URIs
  as protocol elements is there first and foremost for existing
  protocols that are based on URIs.
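
To make the IRI-to-URI conversion mentioned above concrete, here is a
minimal sketch in Python. It assumes the generic mapping of RFC 3987,
section 3.1: every non-ASCII character is replaced by the percent-encoded
octets of its UTF-8 form, and ASCII characters are left alone. The function
name iri_to_uri and the example.org IRI are made up for illustration. The
sketch ignores normalization, and it does not give the host part any special
treatment (in practice IDNA is often applied there instead).

```python
def iri_to_uri(iri: str) -> str:
    """Map an IRI to a URI by percent-encoding non-ASCII characters.

    Hypothetical helper for illustration only: no normalization and no
    IDNA handling of the host part.
    """
    out = []
    for ch in iri:
        if ord(ch) < 128:
            # ASCII characters, including existing %XX escapes, pass through.
            out.append(ch)
        else:
            # Each UTF-8 octet of a non-ASCII character becomes one %XX triplet.
            out.extend(f"%{octet:02X}" for octet in ch.encode("utf-8"))
    return "".join(out)


iri = "http://example.org/r\u00e9sum\u00e9"  # illustrative IRI, not from the thread
uri = iri_to_uri(iri)
print(uri)
# An implementation that compares identifiers as plain strings sees two
# different identifiers here.
print(iri == uri)
```

The last comparison shows why converting is risky for RDF: if identifiers
are compared as plain strings, the IRI and its URI form do not match, even
though they are meant to identify the same resource.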

>When I read Martin's comments about drop-downs, elided scheme
>names, and similar tricks, my protocol-geek hat tightened on my head
>and gave me a pretty severe headache.  Taking it off for a moment,
>though, showed me things are still okay.  As presentation elements,
>things like drop-downs, inference of scheme by an initial www, and
>similar tricks are more reasonable.

Detail: the scheme isn't inferred from an initial www. A very quick
test on one single browser showed that a leading 'ftp' label leads to
ftp:// being inferred, but no initial www is needed to infer http://.
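
The behaviour seen in that one quick test can be sketched in Python as
follows. This is a guess at a presentation-layer heuristic, not any
browser's actual algorithm. The function name guess_scheme and the
example.org inputs are assumptions made for illustration.

```python
def guess_scheme(typed: str) -> str:
    """Add a scheme to address-bar input that lacks one.

    Hypothetical heuristic: a leading 'ftp' label selects ftp://,
    anything else (with or without 'www') falls back to http://.
    """
    if "://" in typed:
        # A scheme was typed explicitly, so nothing is inferred.
        return typed
    host = typed.split("/", 1)[0]
    first_label = host.split(".", 1)[0].lower()
    scheme = "ftp" if first_label == "ftp" else "http"
    return f"{scheme}://{typed}"


print(guess_scheme("ftp.example.org"))
print(guess_scheme("example.org/path"))
print(guess_scheme("www.example.org"))
```

The result of such a step is an IRI or URI. The heuristic itself stays in
the presentation layer and never appears in the protocol.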

In general, yes, all these tricks are very much presentation issues.
Overall, we can think about this in three layers:

Final presentation: May include tricks as above,...

IRI: sometimes protocol, sometimes presentation

URI: protocol

>A big question, then, is whether we have all the bits
>we need to map between a presentation element and a protocol element,
>and whether all of those mechanisms need to be standardized.
>The answer to the first is almost certainly no.

I fully agree.

>There are some contexts
>where the UI aspects of a decent presentation element are just beyond
>the IETF's expertise.  Taking even a simple protocol element like
>the scheme portion of an HTTP URI and determining how best to represent
>that in, say, modern Mongolian as used by the Oirat is no easy task.   The monk
>who developed it didn't have URIs in mind when Clear Script was being
>developed.  Should we recommend they use the Latin letters in consequence?  or
>the Cyrillic alphabet (as many  other Mongolian speakers do)?  Is either
>really the right choice?   Especially, is it the right choice for the IETF 
>to take on?
>If it is not clear, I think the answer to the question of whether all 
>elements need to be standardized is "not in the IETF, anyway".  I think the IETF
>does need to make sure that presentation elements can use the UCS in useful
>and reasonable ways.  We have worked on that, and there continues to be work on
>that, largely through the efforts of dedicated individuals at this point, 
>rather than working groups. 
>We also have agreed, as a community, to take on some work
>that does not rely on a presentation layer separation from the protocol.
>We have agreed to work on email addresses, as one example,
>and that working group decided not to use a pure presentation layer

Yes. I expect this way of designing protocols to become more frequent.
For new protocols, for me, it would be a no-brainer. For existing
protocols, the decision is of course much more difficult, and
will sometimes go one way and sometimes the other.

>This working group will address one basic approach to email
>internationalization. That approach is based on the use of an SMTP
>extension to enable both the use of UTF-8 in envelope address local-
>parts and optionally in domain-parts and the use of UTF-8 in mail
>headers -- both in address contexts and wherever encoded-words are
>permitted today. Its initial target will be a set of experimental
>RFCs that specify the details of this approach and provide the basis
>for generating and testing interoperable implementations. Its work
>will include examining whether "downgrading" -- transforming an
>internationalized message to one that is compatible with unextended
>SMTP clients and servers and unextended MUAs -- is feasible and
>appropriate and, if it is, specifying a way to do so. If it is not,
>the WG will evaluate whether the effort is worth taking forward.
>Other approaches may be considered by the formation of other
>working groups.
> (see for
>the full context).   There will be consequences for lots of
>other protocol slots if this experiment succeeds, as there
>are lots of places for which there is a tacit assumption that
>the identifier can "look like" an email identifier (think SIP
>AoRs and certs, to take two examples).  But the changes needed
>to those slots and the changes needed to the URIs  which refer
>to those (or the IRI representation of those URIs) may not be quite
>the same. 
>I doubt this has helped much, honestly, but hopefully the urge
>to correct my mistakes will prompt others to step in and say
>something more useful. 
>                       regards,
>                               Ted

I think this was very useful background, thanks a lot.

Regards,    Martin.

#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University