Comments on IDNA (was: Re: [idn] Reality Check)
Martin Duerst <duerst@w3.org> Wed, 11 July 2001 08:40 UTC
Message-Id: <4.2.0.58.J.20010711105253.033c16f0@sh.w3.mag.keio.ac.jp>
X-Sender: duerst@sh.w3.mag.keio.ac.jp
X-Mailer: QUALCOMM Windows Eudora Pro Version 4.2.0.58.J
Date: Wed, 11 Jul 2001 17:12:54 +0900
To: Patrik Faltstrom <paf@cisco.com>, Paul Hoffman / IMC <phoffman@imc.org>, Keith Moore <moore@cs.utk.edu>, Edmon <edmon@neteka.com>
From: Martin Duerst <duerst@w3.org>
Subject: Comments on IDNA (was: Re: [idn] Reality Check)
Cc: idn@ops.ietf.org, "Adam M. Costello" <amc@cs.berkeley.edu>
In-Reply-To: <200107110055.UAA29310@astro.cs.utk.edu>
References: <Your message of "Tue, 10 Jul 2001 16:14:07 EDT." <009c01c1097c$e0a096e0$1001a8c0@neteka.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"; format="flowed"
Sender: owner-idn@ops.ietf.org
Precedence: bulk
At 20:55 01/07/10 -0400, Keith Moore wrote:
> > Currently, IDNA does not provide the "option" for UTF-8 that you are
> > suggesting.
>
>sure it does.  the application can use whatever encoding it wants to.
>the protocol used by DNS on-the-wire is irrelevant.

It would be extremely helpful if IDNA would make it clear that
applications can and should do something other than ACE. The current
IDNA draft (02), in the unnumbered diagram, says
'Application-specific protocol: predefined by the protocol or defaults
to nameprepped ACE'. This should be changed.

Some other comments on the IDNA draft:

The abstract and intro give the impression that only updating
applications results in a significant advantage. Readers of the doc
may get the impression that servers/resolvers are 50% and applications
50%. As far as I know, it's more like 10%/90%. The draft should be
changed to not give such wrong impressions.

2.1.1: "Applications MAY allow ACE input and output, but are not
encouraged to do so except as an interface for advanced users,
possibly for debugging.":

Please remove the 'advanced user'. Giving people who want to look at
the names the way they are written the impression that they are
retarded is not a good idea.

2.1.1: "Since ACE can be rendered either as the encoded ASCII glyphs
or the proper decoded characters, the rendering engine for an
application SHOULD have an option for the user to select the preferred
display."

- The word 'ACE' is used here completely differently from above
  (including input, output, and process, rather than only one form).
- 'Rendering engine' is usually in the OS or window system/toolkit,
  not in the application. It would be strange to require such an
  engine to be able to decode ACE.
- The 'SHOULD' has to be conditioned on the MAY above.

My rewording would look like this:

   In case Applications allow ACE input and output, they SHOULD
   provide an option for the user to select ACE, which SHOULD be off
   by default.
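
A minimal sketch of what that rewording could mean in an application,
under some loud assumptions: it uses Python's built-in "idna" codec,
which implements the later RFC 3490 rules and the "xn--" prefix (the
02 draft had not fixed a prefix), and the names display_name and
show_ace are hypothetical, not taken from any draft.

```python
# Assumption: "xn--" is the ACE prefix later standardized in RFC 3490;
# the drafts under discussion had not settled on one.
ACE_PREFIX = "xn--"


def display_name(host: str, show_ace: bool = False) -> str:
    """Return the form of a host name to show to the user.

    show_ace is the user-selectable option; it is off by default, so
    ACE labels are decoded unless the user asks to see them raw.
    """
    if show_ace:
        return host
    shown = []
    for label in host.split("."):
        if label.lower().startswith(ACE_PREFIX):
            try:
                # Decode a single ACE label back to its characters.
                label = label.encode("ascii").decode("idna")
            except UnicodeError:
                # Not valid ACE after all: leave the label untouched.
                pass
        shown.append(label)
    return ".".join(shown)


print(display_name("xn--bcher-kva.example"))
print(display_name("xn--bcher-kva.example", show_ace=True))
```

The point of the design is only that the decoding decision sits in the
application, behind one option, and not in the rendering engine.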

2.1.1:

   In protocols and document formats that define how to handle
   specification or negotiation of charsets, IDN host name parts can
   be given in any charset allowed by the protocol or document format.
   If a protocol or document format only allows one charset, IDN host
   name parts must be given in that charset.

Preferably change 'charset' to 'character encoding' throughout the
draft. 'charset' is the name of a parameter, not a concept. Also, in
addition to character encodings, escapes are also possible (I'm
thinking of HTML &eacute;,...). Also 'given in any' is much too
general. And negotiation is irrelevant. A single encoding is also just
a special case, and so it should be either left out, or called out
("In particular,..."). On the other hand, protocols often use
different character encodings in different places. So I would write:

   In any place where a protocol or document format allows to transmit
   the characters in IDN host name parts, IDN host name parts
   MUST/SHOULD be transmitted using whatever character encoding and
   escape mechanism that the protocol or document format uses at that
   place.

2.1.2: "that is resolving a non-internationalized host name parts MUST
NOT do"

singular-plural mismatch.

2.1.4:
----
If an application decodes an ACE name but cannot show all of the
characters in the decoded name, such as if the name contains
characters that the output system cannot display, the application
SHOULD show the name in ACE format instead of displaying the name with
the replacement character (U+FFFD). This is to make it easier for the
user to transfer the name correctly to other programs using
copy-and-paste techniques. Programs that by default show the ACE form
when they cannot show all the characters in a name part SHOULD also
have a mechanism to show the name with as many characters as possible
and replacement characters in the positions where characters cannot be
displayed.
----

Copy-and-paste often works even if display doesn't, and on the other
hand display may work but copy-and-paste may fail. So I would suggest
to just write:

   This is to make it easier for the user to transfer the name
   correctly to other programs.

Also, I would note that there are other ways to deal with this, namely
to show some hex code or use some popup tooltip,... And also in many
cases, the application doesn't know exactly what the underlying
rendering engine can display or not.

2.1.5
----
An application which receives a host name SHOULD verify whether or not
the host name is in ACE. This is possible by verifying the prefix in
each of the labels, and seeing whether or not the label is in ACE.
This MUST be done regardless of whether or not the communication
channel used (such as keyboard input, cut and paste, application
protocol, application payload, and so on) has negotiated ACE.
----

'negotiating ACE' sounds quite strange, in particular for keyboards.
It is not completely impossible to e.g. go to the X Consortium and
register something like "ACE_STRING" in addition to "COMPOUND_TEXT"
and "UTF8_STRING",... and then have application A fill in all of these
and application B pick ACE_STRING if it doesn't understand the
others,... It is also in theory conceivable to create an input method
that directly produces ACE from some keyboard layout. But there are
considerable problems with this, I'm sure it's not done for similar
things such as mail MIME headers, and I think it's a bad idea that the
draft mention such things because it may lead implementers into the
woods.

Another problem is that the MUST seems to overrule the SHOULD. Which
one do we take?

----
The reason for this requirement is that many applications are not
ACE-aware. Applications that are not ACE-aware will send host names in
ACE but mark the charset as being US-ASCII or some other charset which
has the characters that are valid in [STD13] as a subset.
----

I thought that the general idea was to register a 'charset' parameter
value for the finally chosen ACE so that it could be used by
transcoding libraries, but to strongly discourage the use of this
label in any other context. Also, in a large number of cases, the
'charset' parameter cannot be changed for every single bit of text
(think email bodies).

2.1.6 Bidirectional text
----
In IDNA, bidirectional text is entered and displayed exactly as it is
specified in ISO/IEC 10646. Both ISO/IEC 10646 and the Unicode
standard have extensive discussion of how to deal with bidirectional
text. Any input mechanism and display mechanism that handles
characters from bidirectional scripts should already conform to those
specifications. Note that the formatting characters that manually
change the direction of display are prohibited by nameprep, thus
making the task for input and display mechanisms easier.
----

Sorry, but this is completely bogus. Some examples, even though
logically clear, would look totally mutilated.

Paul told me that there is an alternative proposal that requires
bidirectional text to be entered and stored in visual order. This
would have several very severe problems, namely:

- People would have to type things in backwards (i.e. typing FTEI to
  get IETF)
- Cut/paste from general text would not work at all.
- Sorting of domain names in Arabic or Hebrew would work the wrong way
  (first all words ending in 'a', then all ending in 'b' and so,...)
- Text-to-speech applications will read the text the wrong way round
  (ftei for IETF again).

An alternate proposal has been around for a very long time, please see
http://www.ietf.org/internet-drafts/draft-masinter-url-i18n-07.txt
section 3.2.
This proposal, originally by Francois Yergeau (main contributor to
RFC 2070 and to Tango, first web browser that worked with Arabic as
far as I know) works by taking advantage of the Unicode bidirectional
algorithm (thus not requiring special code in particular for display)
and can easily be integrated into the nameprep algorithm (mainly
removing specific bidi control characters under specific conditions).
It may look more complicated than other (non-)proposals, but
bidirectionality IS a very thorny issue without easy ways out.

3. Name Server Considerations
----
The host name data in zone files (as specified by section 5 of RFC
1035) MUST be both nameprepped and ACE encoded.
----

I thought that the zone files described here refer to some format as
exchanged over the network. But reading sec. 5 of 1035, it speaks
about 'loading a zone'. Obviously, there may be different formats and
ways to load a zone, so I have no idea why we would have to require
that ACE is used. I can very well imagine enhanced dns software that
is able to read in data in UTF-8 (or some local encoding) and do
nameprep and ACE internally. Actually, it would seem that this would
be a lot more convenient for a lot of users. So either you have to
explain the requirement, and restrict it to the cases where it's
really needed, or you remove it. (or even better, say that dns
software should provide facilities for input from something readable).

Regards,    Martin.
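
The zone-loading idea in the comment on section 3 (read names in
UTF-8, do nameprep and ACE internally) could be sketched roughly as
below. This is only an illustration under stated assumptions: it
relies on Python's built-in "idna" codec, which applies nameprep and
the ACE encoding of the later RFC 3490, not whatever ACE the working
group had in mind in 2001, and owner_to_ace is a hypothetical helper
name, not part of any DNS software.

```python
def owner_to_ace(name: str) -> str:
    """Nameprep and ACE-encode a host name read from a UTF-8 source.

    Assumption: the RFC 3490 rules as implemented by Python's "idna"
    codec stand in for the nameprep/ACE steps of the draft.
    """
    # The codec processes each dot-separated label on its own and
    # leaves plain ASCII labels as they are.
    return name.encode("idna").decode("ascii")


# Names as a zone maintainer might type them in a readable source file.
for readable in ["bücher.example", "www.example.com"]:
    print(readable, "->", owner_to_ace(readable))
```

With something like this inside the loader, the readable form stays in
the file the user edits and ACE only appears internally.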