[Ltru] Solving the UTF-8 problem
"Doug Ewell" <dewell@roadrunner.com> Sun, 01 July 2007 22:58 UTC
Message-ID: <006501c7bc33$637b08b0$6401a8c0@DGBP7M81>
From: Doug Ewell <dewell@roadrunner.com>
To: LTRU Working Group <ltru@ietf.org>, ietf-languages@iana.org
Date: Sun, 01 Jul 2007 15:58:48 -0700
Subject: [Ltru] Solving the UTF-8 problem

This is intentionally cross-posted to LTRU and ietf-languages, since it deals with both implementation policy and proposed changes to RFC 4646bis and 4645bis.

CE Whitehead <cewcathar at hotmail dot com> wrote on ietf-languages:

> I want to update the 1694acad comments field to include a transliteration
> into Basic Latin (also--perhaps???--to fix the inconsistency as 4eme is
> missing the accent grave on the e!! :
>
> Comments: 17th century French, as catalogued in the "Dictionnaire de
> l'académie françoise" ("l'academie francoise"), 4ème (4eme)
> ed. 1694; frequently includes elements of Middle French, as this is a
> transitional period.

I really, really don't like the direction this is headed. Ultimately we will find ourselves having to provide duplicate Description and Comments content for every non-ASCII character in the Language Subtag Registry, removing most of the advantage of being able to represent non-ASCII in the first place.

What are we going to do when the ISO 639-3 code list is finalized and we have to deal with adding the following pairs of languages, whose names differ only by diacritical marks?

aru  Arua        arx  Aruá
bfa  Bari        mot  Barí
kgm  Karipúna    kuq  Karipuná
sbe  Saliba      slc  Sáliba
wbf  Wara        tci  Wára

Are we going to include an ASCII version of every name that contains an accented letter? There are several hundred in ISO 639-3.

As CE has shown above, we are already strongly considering adding duplicative Comments information to get around our own technical limitations. Will we also have to include elaborate Comments fields distinguishing the "real" Arua from the one we invented by lopping off the accent from Aruá?

Section 3.1 mentions transcription of non-Latin Description fields into the Latin script. It does not talk about providing a pure-ASCII equivalent for every non-ASCII French- or Spanish-language string, and I don't believe that was the WG's intention.
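
To make the collision concrete, here is a minimal Python sketch, not part of the original message. It folds the ten names listed above to ASCII by dropping combining marks and reports which codes end up sharing a name. The helper names `ascii_fold` and `collisions` are hypothetical, and dropping combining marks after NFD decomposition is only one assumed way of producing an "ASCII version"; other transliteration schemes would behave differently.

```python
import unicodedata

# ISO 639-3 code/name pairs quoted in the message above.
PAIRS = [
    ("aru", "Arua"), ("arx", "Aruá"),
    ("bfa", "Bari"), ("mot", "Barí"),
    ("kgm", "Karipúna"), ("kuq", "Karipuná"),
    ("sbe", "Saliba"), ("slc", "Sáliba"),
    ("wbf", "Wara"), ("tci", "Wára"),
]


def ascii_fold(name: str) -> str:
    """Hypothetical helper: decompose to NFD, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def collisions(pairs):
    """Group codes whose names become identical once folded to ASCII."""
    groups = {}
    for code, name in pairs:
        groups.setdefault(ascii_fold(name), []).append(code)
    # Keep only the folded names shared by more than one code.
    return {folded: codes for folded, codes in groups.items() if len(codes) > 1}


if __name__ == "__main__":
    for folded, codes in collisions(PAIRS).items():
        print(folded, codes)
```

Under this folding every one of the five pairs collapses to a single name, which is the ambiguity the paragraph about the "real" Arua describes.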
Transcriptions are useful when the content is in Arabic or Cyrillic or Han, to make the material available to Latin-script-only readers. Providing "transcriptions" like "4&#xE8;me (4eme)" merely announces to the world that we can't solve our own technical character-encoding problems without resorting to unwieldy kludges.

We really need to take another hard, serious look at maintaining the Registry in UTF-8. The current scheme is one of the biggest sources of misunderstanding that newcomers have about the Registry, and one of the biggest bones of contention among regular list participants.

We need to consider this soon, before the release of the "final" ISO 639-3 data moves RFC 4645bis onto the LTRU critical path, because the way it is decided will have huge implications for RFC 4645bis. And we need to consider it WITHOUT dragging in side discussions like XML.

As far as I can tell, the objections to moving the Registry to UTF-8 are as follows:

1. UTF-8 doesn't play well with e-mail, which is invaluable for discussing changes on the ietf-languages list and sending the changes to IANA (stated by several).

Clearly there are problems with the diversity of character encodings as e-mail travels from a sender's machine across the Internet and onto each recipient's client, or gets processed into Web archive pages. Even on the Unicode mailing list and mail archive pages, where you might think the problem would be solved, you can see numerous examples of this. This isn't limited to UTF-8; it is not uncommon to see Latin-1 or (especially) Windows-1252 get corrupted or displayed incorrectly.

As Ira McDonald suggested, it would make sense to conduct all preliminary discussion using escape sequences or some other mechanism:

"(Note: the name 'Aruá' is actually spelled 'A, r, u, a-with-acute'.)"

but then send the final change to IANA in UTF-8 so they could simply drop it into the Registry with little or no editing, as they do now.
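
As an illustration only, and not something proposed in the original message, the recoding between the escaped form used on the list and a UTF-8 form could be as mechanical as the Python sketch below. The function names `decode_ncrs` and `encode_ncrs` are hypothetical, and the sketch assumes that the only escapes present are hexadecimal numeric character references of the form `&#xHHHH;`.

```python
import re

# Assumption: only hex NCRs of the form &#xHHHH; appear in the text.
_HEX_NCR = re.compile(r"&#x([0-9A-Fa-f]+);")


def decode_ncrs(text: str) -> str:
    """Replace each hex NCR with the character it denotes."""
    return _HEX_NCR.sub(lambda m: chr(int(m.group(1), 16)), text)


def encode_ncrs(text: str) -> str:
    """Replace each non-ASCII character with an uppercase hex NCR."""
    return "".join(
        ch if ord(ch) < 0x80 else f"&#x{ord(ch):X};" for ch in text
    )


# The form discussed on the list (pure ASCII) ...
list_version = "Aru&#xE1;"
# ... and the bytes that would be sent on in UTF-8.
utf8_bytes = decode_ncrs(list_version).encode("utf-8")

# Round trip: re-escaping the UTF-8 text reproduces the list version,
# so nothing changed apart from the recoding.
assert encode_ncrs(utf8_bytes.decode("utf-8")) == list_version
```

A round-trip comparison of this kind is one conceivable way to check that the text agreed on the list and the text submitted in UTF-8 differ only in encoding.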

We would have to figure out some way for the list to confirm that the changes seen on the list are those that are sent to IANA, with no alteration other than the recoding to UTF-8. This should not be difficult. I am willing to do whatever is deemed necessary on my part to make this work. We need to figure out how to make it work, instead of using it as a reason not to adopt UTF-8.

2. Converting the Registry to UTF-8 would break existing implementations that expect hex NCRs (Addison Phillips).

Addison is correct; any structural change to the Registry will break RFC 4646-conformant processors. This is true not only for UTF-8, but also for new fields such as "Macrolanguage" or "Modified." (Section 3.1 says the Type "MUST" be one of the seven currently defined values.)

If there are implementations that read and interpret the Registry and will choke on non-ASCII input, whose authors are not on one of these mailing lists, then we need to get the word out that the format may change, just as we would if we decide to add new field values. Personally I doubt there are large numbers of such implementations.

3. UTF-8 can't be read on some, especially older, computer systems (Frank Ellermann, months ago, and CE Whitehead).

With the continuing adoption of Unicode by OS and software vendors, I really can't get behind this argument. It simply isn't appropriate to "dumb down" all computerized text to match the least capable systems that might be running somewhere. This is especially true considering the language names listed above. We don't restrict text to uppercase to maintain compatibility with BCDIC and Sinclair ZX81 systems.

A Windows system running Internet Explorer 4.0 or above can display a local text file as UTF-8. According to Wikipedia, of the 78% of Windows machines that run IE, fewer than 1% are running a version lower than 6.0.
Support for UTF-8 is probably the same, if not better, for non-Microsoft browsers; see http://www.alanwood.net/unicode/browsers.html for more information.

In thinking about "display," we should step back and remember why we have a Language Subtag Registry at all. It exists to support BCP 47 language tagging by providing a complete list of all subtags that can be used to form a language tag, plus all grandfathered tags that can be used on their own. It provides some additional information, such as comments, to help tag producers and consumers make tagging decisions, but it is not intended to be a general compendium of language information, meant for casual browsing.

Another possibility is to have IANA post an official version of the Registry in one encoding, such as UTF-8, and additional, unofficial versions in other encodings, such as Latin-1 or hex NCRs. This is the approach chosen by the ISO 639-3 Registration Authority. Potential problems with this approach are unintentional mismatches between the versions (I caught one of these problems for the ISO 639-3 people recently) and a perception that the "simplified" version is actually the official one.

My suggestions, very simply, are:

* For LTRU, to amend RFC 4646bis to change the format of the Registry to UTF-8, and to work out the details such as compatibility with existing RFC 4646 processors and avoidance of UTF-8 in e-mails.

* For ietf-languages, to impose a moratorium on changes to Description and Comments fields whose only purpose is to transcribe hex NCRs to ASCII, until the matter is resolved within LTRU.

--
Doug Ewell * Fullerton, California, USA * RFC 4645 * UTN #14
http://users.adelphia.net/~dewell/
http://www1.ietf.org/html.charters/ltru-charter.html
http://www.alvestrand.no/mailman/listinfo/ietf-languages