Re: [I18ndir] [art] Modern Network Unicode
John C Klensin <john-ietf@jck.com> Tue, 09 July 2019 18:16 UTC
Date: Tue, 09 Jul 2019 14:16:05 -0400
From: John C Klensin <john-ietf@jck.com>
To: Carsten Bormann <cabo@tzi.org>, art@ietf.org
cc: i18ndir@ietf.org
Message-ID: <0A5251342D480BA6437F7549@PSB>
Archived-At: <https://mailarchive.ietf.org/arch/msg/i18ndir/sruNL5UyZcC8X7ff48JvX255Oh4>
Subject: Re: [I18ndir] [art] Modern Network Unicode
Carsten,

I will study this more when I get time, but a few quick observations. I have glanced at your revised versions, but started this response before either -01 or (obviously) -02 was published, so will just finish it and get it off.

If it makes a difference, I am writing as co-author of RFC 5198. I regret that my late, much-lamented, and far more articulate co-author is not available to do so. Those who knew Mike might develop a smile in his memory by contemplating how he would have expressed himself about some of the issues below.

Some things to think about that the drafts do not appear to reflect...

(1) The network acquired the CRLF convention because the earliest versions of ASCII defined LF strictly as a movement to the same character position on the next line. For anyone old enough to remember typewriters, that is the function of rolling the carriage up one line. From the same period and interpretation, CR alone means "back to first character position on the current line". In that interpretation, if one wanted the first character position on the next line (sometimes known as the "new line" function), one needed CR LF.

To make things a bit more complicated, some systems at the time used LF alone as new line and others used CR alone. If a few lines were entered on one such system and transferred to another, CR LF would always produce the expected result, but bare CR, bare LF, or LF CR might cause surprises. There are symmetric relationships among SP and HT, LF and VT, and CR and BS; considering them might help explain how we got here (and, for those who don't know, part of the reason why BS (U+0008) is defined as non-destructive on many systems).

Later versions of ASCII changed several of the definitions to be, essentially, "interpret it however you like". Of course, from an interchange standpoint, that just made things worse.

ISO/IEC 6429 tried to fix the problem by defining new control codes with specific meanings (see U+0084 and U+0085), but they never took off and hence added to the confusion. Unicode, having learned from that experience, solved it by adding yet more codes, notably U+2028, which, as you note, didn't go anywhere either.

Telnet (and FTP TYPE A and perhaps email) may be history, although I think we lost quite a bit in going from the network-standard forms specified at the core of those protocols to the "everything is like Unix" or "everything is like me" models that seem to be pervasive in many more recent developments. It may be a bit ironic that your solution to some other problems is to devise yet another network-standard form, especially one that has options that basically encourage (or require) profiles (or, if you prefer, combinations of variances).

As long as one only intends to store or display the results of whatever is transmitted, and the display device follows more or less the same conventions as the transmitting one, none of that makes any difference, at least in the overwhelming majority of cases. If, however, there is any requirement to compare a set of strings, things get more complicated in a hurry (see below).

(2) Especially given recent knowledge and discoveries, requiring NFC is probably a mistake and might have been a mistake when 5198 was written. NFKC is worse. If you read the Unicode Standard, you will find that the goal of normalization is generally to permit applications to convert two strings to a common form before they are compared, so that the comparisons closely approximate what people would expect. That pre-comparison conversion is probably why Unicode uses the term "canonical form", although it doesn't include some of the things the rest of us have historically meant by that term in specific applications.

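As a rough illustration of normalization used only at comparison time, here is a minimal Python sketch. It relies on the standard-library unicodedata module; the helper name same_text is hypothetical and does not come from RFC 5198 or the draft. The inputs are left untouched and only temporary copies are normalized. The last three checks show that normalization rewrites code points: NFC replaces ANGSTROM SIGN with a different letter code point, and NFKC folds a mathematical symbol and a half-width character into ordinary ones.

```python
import unicodedata


def same_text(a: str, b: str) -> bool:
    """Compare NFC-normalized copies of two strings (hypothetical helper).

    The arguments themselves are not modified.
    """
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE as one code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

print(precomposed == decomposed)            # False: different code points
print(same_text(precomposed, decomposed))   # True: equal after NFC

# NFC maps ANGSTROM SIGN (U+212B) to LATIN CAPITAL LETTER A WITH RING ABOVE.
print(unicodedata.normalize("NFC", "\u212b") == "\u00c5")        # True

# NFKC maps MATHEMATICAL BOLD CAPITAL A (U+1D400) to plain "A" ...
print(unicodedata.normalize("NFKC", "\U0001d400") == "A")        # True

# ... and HALFWIDTH KATAKANA LETTER KA (U+FF76) to KATAKANA LETTER KA (U+30AB).
print(unicodedata.normalize("NFKC", "\uff76") == "\u30ab")       # True
```

Normalizing at the point of comparison, rather than before storage or transmission, is one possible design; whether it is appropriate depends on the protocol.
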
However, converting strings to normalized form before transmission may lose information that was intended by the user to affect how characters are displayed. Because of some characteristics of the DNS and IDNA, there was a special need for stored NFC strings. That might generalize to some of the PRECIS work and possibly to identifiers in general, but certainly does not extend to transmission of general Unicode strings.

For the NFC case, it is easy to make the wrong generalization from Latin script, where NFC is generally harmless at worst; that is not the case more generally. NFKC is even more problematic: eliminating half-width characters in favor of their normal-width equivalents may be harmless (or may not), but dropping, e.g., mathematical symbols or rarely-used CJK characters in favor of Latin or Greek characters or more commonly used CJK ones with similar interpretations can be a disaster. If you want a variation for "loses information but maybe no one cares" then you might put NFKC in there along with several other things :-(

(3) The comparison problem is, as mentioned above, far more difficult. At the most superficial level, the web and some other modern applications allow either LF or CRLF. That might have been a mistake -- or maybe it is ok. It does mean that strings (or whole files) have to be converted to some canonical form for comparison, that sizes change, etc. The difficulties pervade the many issues we've had with IDNA over the years, including the question of whether or not NFC is adequate to deal comprehensively with the problems (it isn't) and what should be done about the cases where it is not. For a discussion of some of the issues, see https://github.com/w3c/charmod-norm

You might want to try to encourage the i18n directorate (copied on this note) to review this before you get much further along.

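To make the line-ending part of point (3) concrete, here is a small Python sketch. The function names are hypothetical, and treating LF as the comparison form is an assumption made only for the example. It rewrites CRLF to LF in temporary copies before comparing, and shows that the encoded size differs between the two forms. Bare CR is deliberately left alone, since the text above only mentions LF and CRLF as the accepted forms.

```python
def canonical_lines(text: str) -> str:
    """Return a copy with every CRLF rewritten to LF (assumed comparison form)."""
    return text.replace("\r\n", "\n")


def same_lines(a: str, b: str) -> bool:
    """Compare two texts while ignoring the CRLF versus LF difference."""
    return canonical_lines(a) == canonical_lines(b)


crlf_text = "alpha\r\nbeta\r\n"
lf_text = "alpha\nbeta\n"

print(crlf_text == lf_text)            # False: the raw strings differ
print(same_lines(crlf_text, lf_text))  # True: equal once line ends are unified

# The size changes as well: one octet per line end.
print(len(crlf_text.encode("ascii")), len(lf_text.encode("ascii")))  # 13 11
```
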
best,
   john

--On Sunday, 07 July, 2019 22:14 +0200 Carsten Bormann <cabo@tzi.org> wrote:

> This week, I have been asked twice (for independent drafts)
> what these drafts should say about their usage of Unicode. It
> turns out, we have an RFC that is almost, but not entirely
> unlike what is needed to reference for a new protocol that
> uses text: RFC 5198. I have tried to build on that in the very
> short draft:
>
>
>
> I would expect that such a specification, once completed,
> would be a useful point of reference for any new protocols
> that need to use text without any specific requirements
> imposed by their applications. (Protocols that already have
> been around probably benefit less, because there always are
> legacy considerations.)
>
> I'm not yet asking for dispatching this document; right now I
> would just like to receive feedback from this esteemed
> community to further improve the document.