FWD: Re: Comments on Unicode Format for Network Interchange
John C Klensin <john-ietf@jck.com> Wed, 09 May 2007 13:20 UTC
Date: Wed, 09 May 2007 09:20:37 -0400
From: John C Klensin <john-ietf@jck.com>
To: discuss@apps.ietf.org
Subject: FWD: Re: Comments on Unicode Format for Network
Interchange
Message-ID: <398A6C120C8B166FCBD3BDAF@p3.JCK.COM>
List-Id: general discussion of application-layer protocols
<discuss.apps.ietf.org>

FYI, in the hope of _not_ having two separate discussions going on.

    john

---------- Forwarded Message ----------
Date: Wednesday, 09 May, 2007 08:06 -0400
From: John C Klensin <klensin+unicore@jck.com>
To: Doug Ewell <dewell@adelphia.net>
Cc: UnicoRe Mailing List <unicore@unicode.org>,
    "Magda Danish (Unicode)" <v-magdad@microsoft.com>,
    Mike Padlipsky <the.map@alum.mit.edu>,
    Chris Newman <chris.newman@sun.com>,
    Lisa Dusseault <lisa@osafoundation.org>
Subject: Re: Comments on Unicode Format for Network Interchange

Doug, Benson, Markus, and L2 and UTC members,

My thanks for your input (and for that of Frank Ellermann, which I
don't believe was copied to the Unicore/L2 list -- see below). I
appreciate informed and thoughtful comments from any source and,
while I cannot speak in any formal way for the IETF, the IETF
generally does too.

I have two observations on this thread, the second of which is a
procedural issue but one that may be of some importance. I apologize
in advance for the length of this note, but it seems useful to put
both sets of issues to rest.

Doug's comments appear to me to exactly reflect the considerations
and IETF experience that went into draft-klensin-net-utf8-03c. We
tried to review and explain those considerations in the fairly
extensive historical material in that document but, in spite of
several suggestions that the material was too long and not needed,
it apparently was not sufficient to establish the appropriate
context.

The long experience in the IETF and its predecessor organizations
has been that options, and optional and variant behaviors, are a bad
idea. If systems sending material on the wire have only a single
format to send and systems receiving material need only to verify
that format to the degree needed to protect themselves from attack,
then we end up with good interoperability and predictable behavior.
Optional "do either this or that" behavior creates a situation in
which the receiver has to be prepared for every possible combination
of options that might be chosen by the sender and, if carried very
far, takes one down a path toward n-squared behavior, i.e., the need
for every system to account for the characteristics of every other
possible system.

That way of thinking is completely consistent with the oft-cited
"robustness principle": "systems are expected to be conservative
about what they send and liberal about what they accept". If the
designer of a potential receiving system decides to accept variant
forms of input --as long as their intended interpretation is clear--
that is usually considered a laudable alternative to generating
error messages over such variations. But a sender should never be
able to deviate from a standard on the expectation that the receiver
will compensate.

Even relatively small deviations from this principle have been known
to cause problems, often in unexpected places. When an implementer
notices (correctly or not) that LF-only, or CR-only, line endings
seem to be generally accepted, adjusts for that, and then uses and
stores whatever comes off the wire, the problems usually do not show
up in that implementation. Instead, they show up with other
applications on the same system that can't handle that format in
files they are expected to read, files that cannot be properly
displayed or printed on the local system, etc. One could argue that
the application is correct and everything else is wrong, and we've
seen that argument made. However, the better approach seems to be
that only a single form comes from (or goes onto) the wire and the
receiving application is responsible for getting the "wire" form
converted to local norms -- whether those local norms are as trivial
as a line-ending convention or conversion to a completely different
CCS or coding.
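
As an illustration of the "single wire form, converted at the edges"
approach described above, here is a minimal Python sketch. The
function names wire_to_local and local_to_wire are hypothetical and
are not taken from the draft. The sketch also assumes that a bare CR
in local text is a line ending (as on some older systems), which is a
local-convention choice and not something the draft specifies.

```python
import os


def wire_to_local(wire_text, local_eol=os.linesep):
    """Convert CRLF-terminated wire text to the local line ending."""
    # Only the one standard wire form (CRLF) is translated here.
    return wire_text.replace("\r\n", local_eol)


def local_to_wire(local_text):
    """Convert local text to the single CRLF wire form before sending."""
    # Assumption: CRLF, bare LF and bare CR all mean "end of line"
    # in local text; collapse them to LF first, then emit CRLF.
    unified = local_text.replace("\r\n", "\n").replace("\r", "\n")
    return unified.replace("\n", "\r\n")
```

With this split, applications on the local system only ever see the
local convention, and only CRLF goes onto the wire.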

So, from our standpoint, the step from a single, on-the-wire, format
to variations that can be used at the discretion of the sender is a
big one. It is not justifiable on the grounds that it "can" be easily
done; it requires considerable justification and a showing of
necessity. That justification has not appeared here and the arguments
for a single on-the-wire form seem to prevail.

While the issues appear to be more complicated, the same basic
reasoning applies to the other restrictions in the draft. Given that
Unicode permits, in many cases, alternate ways to represent the same
character, some normalization is important, again to prevent
N-squared situations. While the choice of NFC follows
recommendations we have gotten from UTC leadership, the form could,
in principle, have been an IETF-invented NFZZ form: our requirement
is that there be only one. And, contrary to Markus's conclusions,
there is reason for a MUST: if the sender is _required_ to send NFC,
then the receiver merely needs to verify that format to the degree
needed to prevent attacks. If it is a SHOULD, then the receiver must
be prepared to convert to NFC (or whatever form it needs) from
whatever optional form might appear. Similar observations would
apply to non-minimal forms of UTF-8 and so on.

For CR NUL, I agree that it is an historical wart and that, in a
more perfect world, we would ban it entirely. I'll see if that can
be reflected better in the text (the text already discourages the
use of CR as an overstrike mechanism). However, the strongest
argument that is usually made for the use of UTF-8 (rather than
other coding forms) is that it is identical to ASCII if only ASCII
characters are used.
If one is going to accept that coding and compatibility without
incurring additional risks, then an ASCII form that is accepted and
given an interpretation in NVT (net-ASCII) must have the same
interpretation in net-Unicode: if the character sequence can appear
on the wire, the alternative is "everyone makes up their own
interpretation". Of course, if NFC simply removed the NUL character
from a data stream, there would be no problem. But it doesn't
(perhaps that could be turned into an argument for an IETF-produced
normalization form, but I certainly won't make it).

The restriction to Unicode 3.2 and later is related to a different
aspect of the above situation and, ironically, an action taken by
(then) X3L2 almost 40 years ago. Incompatibilities were introduced
into Unicode, and into NFC, between 2.0 and 3.2. Implementations
that conform to 2.0 only may be incompatible with the IETF
definition of UTF-8 and other protocols and may not normalize in a
completely compatible way. So, absent both an argument compelling
enough to require that receivers make allowances for different
versions and a clear and non-heuristic method for telling which
version is in use, it appears rational to draw the line at 3.2 (or,
were it not for a standard or two that depends on 3.2, even later).

The irony, which seems worth mentioning because it might be
instructive, is that X3L2 changed the definition of "new line" after
X3.4-1968 was published, permitting alternate interpretations. The
reaction of ASCII-dependent systems in what later became the ARPANET
community was to make very specific rules about the interpretation
and use of CR and LF and eventually to carry a standardized form of
those rules into the ARPANET and Internet. The Internet's definition
of ASCII was also frozen with the 1968 version so as to avoid
further surprises of this type.
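
The receiver-side "verify, don't convert" idea discussed above (NFC
as a MUST, no non-minimal UTF-8, CRLF line endings with CR NUL still
given its NVT meaning) might be sketched in Python roughly as
follows. The function name check_net_unicode and the wording of its
messages are hypothetical, and the checks are only the ones named in
this note, not the full rule set of the draft. Note also that
Python's unicodedata module follows a later Unicode version than
3.2, and that unicodedata.is_normalized needs Python 3.8 or later.

```python
import unicodedata


def check_net_unicode(payload):
    """Return a list of problems found in a bytes payload (empty if none)."""
    try:
        # The strict UTF-8 decoder rejects non-minimal (overlong) forms.
        text = payload.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        return ["not valid UTF-8: " + exc.reason]

    problems = []
    # Verify NFC only; a conforming sender is required to have produced it.
    if not unicodedata.is_normalized("NFC", text):
        problems.append("text is not in NFC")

    for i, ch in enumerate(text):
        nxt = text[i + 1] if i + 1 < len(text) else ""
        prev = text[i - 1] if i > 0 else ""
        # CR must be followed by LF (line ending) or NUL (bare return).
        if ch == "\r" and nxt not in ("\n", "\x00"):
            problems.append("bare CR at offset %d" % i)
        if ch == "\n" and prev != "\r":
            problems.append("bare LF at offset %d" % i)
    return problems
```

A receiver that wanted to be "liberal in what it accepts" could still
choose to repair some of these problems, but that would be its own
decision and not something a sender could rely on.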

Finally, I want to come back to the procedural issue while stressing
that this is a personal opinion that may or may not be consistent
with whatever position the IETF would take if asked. While I know of
nothing that can prevent L2, or the broader Unicode community, from
discussing whatever they like, I note that:

* there appears to be nothing in the L2 charter or project list that
  provides for formally commenting on this, or any other, proposal
  about how one of its standards should be applied.

* the IETF has not asked L2, or UTC, for an opinion on this subject,
  nor does there appear to be an issue requiring clarification or
  other maintenance of any standard under L2's scope.

* My recollection is that INCITS procedures and ANSI guidelines
  generally discourage such out-of-charter discussions in a formal
  TC context, placement on the agenda, issuance of formal documents,
  and, especially, issuance of TC opinions.

* There is no liaison between L2 and the IETF that would justify
  formal L2 comments on an IETF work item. The most recent L2 annual
  report (INCITS/in050401 and almost two years out of date) doesn't
  even mention the IETF or IETF work -- any more than it mentions
  any other set of applications of Unicode or other L2 work.

* The draft specifically identifies a mailing list on which
  discussions of the draft are invited. Holding a semi-closed
  discussion on another list, presumably with the intention of
  producing a formal statement (I can't tell exactly what is
  intended because the L2 agenda is password-protected, which I
  believe violates INCITS rules), is not helpful and is the sort of
  thing that has led to misunderstandings between IETF-related
  bodies and the Unicode Consortium in the past.

I would suggest that further discussion take place on the requested
mailing list.

regards and, again, thanks for the input,
    john

--On Wednesday, 09 May, 2007 00:23 -0700 Doug Ewell
<dewell@adelphia.net> wrote:

> Please add this to the agenda for UTC #111, for discussion in
> response to agenda item B.5.2.
>
> L2/07-126, written by Markus Scherer, proposes some changes to
> the Internet-Draft "draft-klensin-net-utf8-03", by John
> Klensin and Michael Padlipsky. This Internet-Draft proposes
> to replace the venerable "network ASCII" standard with a
> tightly defined profile of UTF-8, including NFC normalization
> and CRLF line endings.
>
> In particular, L2/07-126 proposes that the requirement to end
> lines of text with CRLF be relaxed to permit "bare CR" and
> "bare LF" as well. The document states, "We believe that
> single CR and LF are common because of implementation practice
> on a variety of platforms, and that it is both unrealistic and
> unnecessary to try to legislate them away."
>
> In fact, the concept of "network ASCII," and the purpose of
> the present Internet-Draft, is to do just that: "legislate" a
> specific profile of ASCII and UTF-8, respectively, that can be
> expected to maximize interoperability. There is indeed a
> large number of text files on the Internet that use "bare CR"
> or "bare LF," just as there is a large number of files encoded
> in ISO 8859-1 or other character sets, but this potentially
> troublesome diversity is exactly why a standard format is
> being proposed.
>
> I recommend that the proposal to "legislate" the use of CRLF
> be retained in the Internet-Draft. Furthermore, I suggest
> that the authors of the I-D withdraw their continued support
> in "network Unicode" for the obscure "CR NUL" sequence, which
> means "carriage return without line feed" and was formerly
> used as a hack to create overstruck sequences. This mechanism
> is completely unnecessary in a text-transport mechanism that
> supports Unicode, with its rich support for "proper" combining
> characters, and is unlikely to be handled correctly by most
> text engines in any case.
>
> L2/07-126 also recommends that the control code HT (horizontal
> tabulation, U+0009) be permitted under Section 2.2 of the
> "network Unicode" definition.
>
> I recommend that the control code FF (form feed, U+000C) be
> likewise permitted. The form feed function is well known and
> well defined in almost all printing functions, and all RFCs
> issued in modern times use FF to separate pages. It would be
> ironic indeed for the Internet-Draft to retain the requirement
> that form feeds "SHOULD NOT be used unless required by
> exceptional circumstances" while advancing toward publication
> as an RFC, complete with form feeds!
>
> I have no objection to the other proposals set forth in
> L2/07-126.
>
> --
> Doug Ewell * Fullerton, California, USA * RFC 4645 * UTN #14
> http://users.adelphia.net/~dewell/
> http://www1.ietf.org/html.charters/ltru-charter.html
> http://www.alvestrand.no/mailman/listinfo/ietf-languages

---------- End Forwarded Message ----------