Comments on IDNA (was: Re: [idn] Reality Check)

Martin Duerst <duerst@w3.org> Wed, 11 July 2001 08:40 UTC

Date: Wed, 11 Jul 2001 17:12:54 +0900
To: Patrik Faltstrom <paf@cisco.com>, Paul Hoffman / IMC <phoffman@imc.org>, Keith Moore <moore@cs.utk.edu>, Edmon <edmon@neteka.com>
From: Martin Duerst <duerst@w3.org>
Subject: Comments on IDNA (was: Re: [idn] Reality Check)
Cc: idn@ops.ietf.org, "Adam M. Costello" <amc@cs.berkeley.edu>

At 20:55 01/07/10 -0400, Keith Moore wrote:
> > Currently, IDNA does not provide the "option" for UTF-8 that you are
> > suggesting.
>
>sure it does.  the application can use whatever encoding it wants to.
>the protocol used by DNS on-the-wire is irrelevant.

It would be extremely helpful if IDNA made it clear that
applications can and should use something other than ACE.

The current IDNA draft (02), in the unnumbered diagram,
says 'Application-specific protocol: predefined by the
protocol or defaults to nameprepped ACE'. This should be
changed.

Some other comments on the IDNA draft:

The abstract and intro give the impression that only updating
applications results in a significant advantage. Readers of
the doc may get the impression that servers/resolvers are
50% and applications 50%. As far as I know, it's more like
10%/90%. The draft should be changed so that it does not give
this wrong impression.

2.1.1: "Applications MAY allow ACE input and output, but are
not encouraged to do so except as an interface for advanced
users, possibly for debugging.": Please remove the 'advanced
user'. Giving people who want to look at the names the way they
are written the impression that they are retarded is not a
good idea.

2.1.1: "Since ACE can be rendered either as the encoded ASCII
glyphs or the proper decoded characters, the rendering engine
for an application SHOULD have an option for the user to select
the preferred display."

- The word 'ACE' is used here completely differently from above
   (including input, output, and process, rather than only one
    form).
- 'Rendering engine' is usually in the OS or window system/
   toolkit, not in the application. It would be strange to require
   such an engine to be able to decode ACE.
- The 'SHOULD' has to be conditioned on the MAY above.

My rewording would look like this:

If applications allow ACE input and output, they SHOULD
provide an option for the user to select ACE, which SHOULD
be off by default.


2.1.1:
In protocols and document formats that define how to handle
specification or negotiation of charsets, IDN host name parts can be
given in any charset allowed by the protocol or document format. If a
protocol or document format only allows one charset, IDN host name parts
must be given in that charset.

Preferably change 'charset' to 'character encoding' throughout
the draft. 'charset' is the name of a parameter, not a concept.
In addition to character encodings, escapes are also possible
(I'm thinking of HTML &eacute;,...).

Also, 'given in any' is much too general, and negotiation is irrelevant.
A single encoding is just a special case, so it should be
either left out or called out ("In particular,..."). On the other
hand, protocols often use different character encodings
in different places. So I would write:

In any place where a protocol or document format allows the characters
in IDN host name parts to be transmitted, IDN host name parts MUST/SHOULD
be transmitted using whatever character encoding and escape mechanism
the protocol or document format uses at that place.
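
As a rough illustration of "whatever character encoding and escape
mechanism the protocol uses at that place", the sketch below decodes the
same host name from three carriers: UTF-8 bytes, ISO-8859-1 bytes, and
ASCII bytes with an HTML character reference. The function name and the
"example" host are made up for illustration; only Python's standard
library is used.

```python
import html


def host_from_html_attribute(raw: bytes, encoding: str) -> str:
    """Recover host name characters from an HTML attribute value.

    Hypothetical helper: first undo the document's character encoding,
    then undo HTML escapes such as &eacute;.
    """
    text = raw.decode(encoding)
    return html.unescape(text)


# The same characters, carried three different ways.
from_utf8 = host_from_html_attribute("caf\u00e9.example".encode("utf-8"), "utf-8")
from_latin1 = host_from_html_attribute("caf\u00e9.example".encode("iso-8859-1"), "iso-8859-1")
from_escape = host_from_html_attribute(b"caf&eacute;.example", "us-ascii")

print(from_utf8 == from_latin1 == from_escape)
```

In all three cases the application ends up with the same characters,
which it would then hand to nameprep and the ACE conversion.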


2.1.2:
"that is resolving a non-internationalized host name parts MUST NOT do"
singular-plural mismatch.


2.1.4:
----
If an application decodes an ACE name but cannot show all of the
characters in the decoded name, such as if the name contains characters
that the output system cannot display, the application SHOULD show the
name in ACE format instead of displaying the name with the replacement
character (U+FFFD). This is to make it easier for the user to transfer
the name correctly to other programs using copy-and-paste techniques.
Programs that by default show the ACE form when they cannot show all the
characters in a name part SHOULD also have a mechanism to show the name
with as many characters as possible and replacement characters in the
positions where characters cannot be displayed.
----

Copy-and-paste often works even if display doesn't, and on the other
hand display may work but copy-and-paste may fail. So I would suggest
just writing:

This is to make it easier for the user to transfer
the name correctly to other programs.

Also, I would note that there are other ways to deal with this,
such as showing a hex code or using a popup tooltip.
In many cases, moreover, the application doesn't know exactly what
the underlying rendering engine can or cannot display.
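
A minimal sketch of the fallback rule in the quoted 2.1.4 text, under
two loud assumptions: the ACE is Punycode with an "xn--" prefix (the
draft discussed here had not fixed an ACE; this is the one standardized
later), and the application has some predicate `can_display` telling it
which characters the output system handles. As noted above, real
applications often cannot know that reliably, so the predicate is
hypothetical.

```python
ACE_PREFIX = "xn--"  # assumption: the prefix that was standardized later


def decode_label(label: str) -> str:
    """Decode one ACE label; labels without the prefix pass through."""
    if label.lower().startswith(ACE_PREFIX):
        return label[len(ACE_PREFIX):].encode("ascii").decode("punycode")
    return label


def display_form(label: str, can_display) -> str:
    """Show decoded characters if all are displayable, else the ACE form.

    `can_display` is a hypothetical per-character predicate supplied by
    the application.
    """
    decoded = decode_label(label)
    if all(can_display(ch) for ch in decoded):
        return decoded
    return label


# An ASCII-only output system falls back to the ACE form.
print(display_form("xn--mnchen-3ya", lambda ch: ord(ch) < 128))
# An output system that can show everything gets the decoded label.
print(display_form("xn--mnchen-3ya", lambda ch: True))
```

The alternatives mentioned above (hex codes, tooltips) would replace the
final `return label` with a different rendering.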


2.1.5

----
An application which receives a host name SHOULD verify whether or not
the host name is in ACE. This is possible by verifying the prefix in
each of the labels, and seeing whether or not the label is in ACE. This
MUST be done regardless of whether or not the communication channel used
(such as keyboard input, cut and paste, application protocol,
application payload, and so on) has negotiated ACE.
----

'negotiating ACE' sounds quite strange, in particular for keyboards.
It is not completely impossible to, e.g., go to the X Consortium and
register something like "ACE_STRING" in addition to "COMPOUND_TEXT"
and "UTF8_STRING", and then have application A fill in all of
these and application B pick ACE_STRING if it doesn't understand
the others. It is also in theory conceivable to create an input
method that directly produces ACE from some keyboard layout.
But there are considerable problems with this, I'm sure it's not
done for similar things such as MIME mail headers, and I think
it's a bad idea for the draft to mention such things because it
may lead implementers into the woods.

Another problem is that the MUST seems to overrule the SHOULD.
Which one do we take?
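
For reference, the per-label prefix check that the quoted 2.1.5 text
describes can be sketched as follows. The "xn--" prefix is an
assumption (the draft had not settled on an ACE or its prefix), and the
function names are invented for this example.

```python
ACE_PREFIX = "xn--"  # assumption: prefix of the ACE standardized later


def ace_labels(host: str):
    """Return, for each label of the host name, whether it carries the ACE prefix."""
    return [label.lower().startswith(ACE_PREFIX) for label in host.split(".")]


def is_ace_host(host: str) -> bool:
    """A host name counts as 'in ACE' here if at least one label is ACE."""
    return any(ace_labels(host))


print(ace_labels("www.xn--mnchen-3ya.example"))  # [False, True, False]
print(is_ace_host("www.example"))                # False
```

The check itself needs nothing from the channel the name arrived on,
which is why talk of the channel having "negotiated ACE" adds little.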


----
The reason for this requirement is that many applications are not
ACE-aware. Applications that are not ACE-aware will send host names in
ACE but mark the charset as being US-ASCII or some other charset which
has the characters that are valid in [STD13] as a subset.
----

I thought that the general idea was to register a 'charset' parameter
value for the finally chosen ACE so that it could be used by transcoding
libraries, but to strongly discourage the use of this label in any
other context. Also, in a large number of cases, the 'charset' parameter
cannot be changed for every single bit of text (think email bodies).


2.1.6 Bidirectional text
----
In IDNA, bidirectional text is entered and displayed exactly as it is
specified in ISO/IEC 10646. Both ISO/IEC 10646 and the Unicode standard
have extensive discussion of how to deal with bidirectional text. Any
input mechanism and display mechanism that handles characters from
bidirectional scripts should already conform to those specifications.
Note that the formatting characters that manually change the direction
of display are prohibited by nameprep, thus making the task for input
and display mechanisms easier.
----

Sorry, but this is completely bogus. Some examples, even though
logically clear, would look totally mutilated.
Paul told me that there is an alternative proposal that requires
bidirectional text to be entered and stored in visual order.
This would have several very severe problems, namely:
- People would have to type things in backwards (i.e. typing
   FTEI to get IETF)
- Cut/paste from general text would not work at all.
- Sorting of domain names in Arabic or Hebrew would work the
   wrong way (first all words ending in 'a', then all ending in
   'b', and so on)
- Text-to-speech applications will read the text the wrong way
   round (ftei for IETF again).
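
The sorting problem in the list above can be made concrete with Latin
stand-ins, in the same spirit as the FTEI/IETF example. The sketch
models "visual order" storage as plain string reversal, which is a
simplification of what a right-to-left script would really involve; the
function names are made up.

```python
def to_visual(logical: str) -> str:
    """Stand-in for visual-order storage of a right-to-left word: reverse it."""
    return logical[::-1]


def sort_as_stored_visually(names):
    """Sort names by their visually stored form, then report them in logical form."""
    stored = sorted(to_visual(name) for name in names)
    return [to_visual(s) for s in stored]


names = ["abc", "bca", "cab"]
print(sorted(names))                   # ['abc', 'bca', 'cab']
print(sort_as_stored_visually(names))  # ['bca', 'cab', 'abc']
```

With visual-order storage the result is ordered by the last letter of
each word ('a', then 'b', then 'c'), not by the first.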

Another proposal has been around for a very long time; please
see http://www.ietf.org/internet-drafts/draft-masinter-url-i18n-07.txt
section 3.2. This proposal, originally by Francois Yergeau
(main contributor to RFC 2070 and to Tango, first web browser
that worked with Arabic as far as I know) works by taking advantage
of the Unicode bidirectional algorithm (thus not requiring special
code in particular for display) and can easily be integrated into
the nameprep algorithm (mainly removing specific bidi control
characters under specific conditions).

It may look more complicated than other (non-)proposals, but
bidirectionality IS a very thorny issue without easy ways out.


3. Name Server Considerations

----
The host name data in zone files (as specified by section 5 of RFC 1035)
MUST be both nameprepped and ACE encoded.
----

I thought that the zone files described here refer to some format
as exchanged over the network. But reading section 5 of RFC 1035, it
speaks about 'loading a zone'. Obviously, there may be different
formats and ways to load a zone, so I have no idea why we would
have to require that ACE is used. I can very well imagine enhanced
DNS software that is able to read in data in UTF-8 (or some local
encoding) and do nameprep and ACE internally. Actually, it would
seem that this would be a lot more convenient for a lot of users.

So either you have to explain the requirement, and restrict it
to the cases where it's really needed, or you remove it
(or even better, say that DNS software should provide facilities
for input from something readable).
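
A sketch of what such a facility could look like: a loader that accepts
owner names in UTF-8 and does nameprep and the ACE conversion
internally. It leans on Python's built-in "idna" codec, which implements
the nameprep and Punycode/"xn--" scheme standardized after this message
was written, so treat the concrete output format as an assumption
relative to the draft. The function name is hypothetical.

```python
def load_owner_name(raw: bytes) -> bytes:
    """Turn a UTF-8 owner name from a readable zone source into its ACE form.

    Hypothetical loader step: decode UTF-8, then let the "idna" codec
    apply nameprep and the ACE encoding label by label.
    """
    name = raw.decode("utf-8")
    return name.encode("idna")


print(load_owner_name("M\u00fcnchen.example".encode("utf-8")))  # b'xn--mnchen-3ya.example'
print(load_owner_name(b"www.example"))                          # b'www.example'
```

The zone maintainer edits readable text, and only the server's internal
data is nameprepped ACE.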


Regards,    Martin.