FWD: Re: Comments on Unicode Format for Network Interchange

John C Klensin <john-ietf@jck.com> Wed, 09 May 2007 13:20 UTC

Date: Wed, 09 May 2007 09:20:37 -0400
From: John C Klensin <john-ietf@jck.com>
To: discuss@apps.ietf.org
Subject: FWD: Re: Comments on Unicode Format for Network Interchange
Message-ID: <398A6C120C8B166FCBD3BDAF@p3.JCK.COM>

FYI, in the hope of _not_ having two separate discussions going
on.
     john


---------- Forwarded Message ----------
Date: Wednesday, 09 May, 2007 08:06 -0400
From: John C Klensin <klensin+unicore@jck.com>
To: Doug Ewell <dewell@adelphia.net>
Cc: UnicoRe Mailing List <unicore@unicode.org>, "Magda Danish
(Unicode)" <v-magdad@microsoft.com>, Mike Padlipsky
<the.map@alum.mit.edu>, Chris Newman <chris.newman@sun.com>,
Lisa Dusseault <lisa@osafoundation.org>
Subject: Re: Comments on Unicode Format for Network Interchange

Doug, Benson, Markus, and L2 and UTC members,

My thanks for your input (and for that of Frank Ellermann, which
I don't believe was copied to the Unicore/L2 list -- see
below).  I appreciate informed and thoughtful comments from any
source and, while I cannot speak in any formal way for the IETF,
the IETF generally does too.  I have two observations on this
thread, the second of which is a procedural issue but one that
may be of some importance.  I apologize in advance for the
length of this note, but it seems useful to put both sets of
issues to rest.

Doug's comments appear to me to exactly reflect the
considerations and IETF experience that went into
draft-klensin-net-utf8-03c.  We tried to review and explain
those considerations in the fairly extensive historical material
in that document but, in spite of several suggestions that the
material was too long and not needed, it apparently was not
sufficient to establish the appropriate context.

The long experience in the IETF and its predecessor
organizations has been that options, and optional and variant
behaviors, are a bad idea.  If systems sending material on the
wire have only a single format to send and systems receiving
material need only to verify that format to the degree needed to
protect themselves from attack, then we end up with good
interoperability and predictable behavior.   Optional "do either
this or that" behavior creates a situation in which the receiver
has to be prepared for every possible combination of options
that might be chosen by the sender and, if carried very far,
takes one down a path toward n-squared behavior, i.e., the
need for every system to account for the characteristics of
every other possible system.

That way of thinking is completely consistent with the oft-cited
"robustness principle": systems are expected to be conservative
about what they send and liberal about what they accept.  If
the designer of a potential receiving system decides to accept
variant forms of input --as long as their intended
interpretation is clear-- that is usually considered a laudable
alternative to generating error messages over such variations.
But a sender should never be able to deviate from a standard on
the expectation that the receiver will compensate.

Even relatively small deviations from this principle have been
known to cause problems, often in unexpected places.  When an
implementer notices (correctly or not) that LF-only, or CR-only,
line endings seem to be generally accepted, adjusts for that,
and then uses and stores whatever comes off the wire, the
problems usually do not show up in that implementation.
Instead, they show up with other applications on the same system
that can't handle that format in files they are expected to
read, files that cannot be properly displayed or printed on the
local system, etc.  One could argue that the application is
correct and everything else is wrong, and we've seen that
argument made.  However, the better approach seems to be that
only a single form comes from (or goes onto) the wire and the
receiving application is responsible for getting the "wire" form
converted to local norms -- whether those local norms are as
trivial as a line-ending convention or conversion to a
completely different CCS or coding.
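
To make the line-ending case concrete, here is a minimal sketch in
Python.  It is an illustration only, not text from the draft: the
function names are hypothetical, and it assumes that CRLF is the only
line ending on the wire and that local text uses LF, CR, or CRLF with
no CR-based overstriking.

```python
import os


def local_to_wire(local_text: str) -> str:
    """Sender side: emit exactly one line-ending form (CRLF)."""
    # Collapse every local convention to LF first, then expand to CRLF.
    # Assumption: a CR in local text is always a line ending.
    unified = local_text.replace("\r\n", "\n").replace("\r", "\n")
    return unified.replace("\n", "\r\n")


def wire_to_local(wire_text: str, local_newline: str = os.linesep) -> str:
    """Receiver side: convert the wire form to the local norm before storing."""
    # Only CRLF is treated as a line ending; nothing else is rewritten.
    return wire_text.replace("\r\n", local_newline)
```

With this split, the file written to local storage always follows the
local convention, so other applications on the same system never see
the wire form.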

So, from our standpoint, the step from a single, on-the-wire,
format to variations that can be used at the discretion of the
sender, justified on the grounds that it "can" be easily done,
is a big one, one that requires considerable justification
and showing of necessity.   That justification has not appeared
here and the arguments for a single on-the-wire form seem to
prevail.

While the issues appear to be more complicated, the same basic
reasoning applies to the other restrictions in the draft.  

Given that Unicode permits, in many cases, alternate ways to
represent the same character, some normalization is important,
again to prevent N-squared situations.  While the choice of NFC
follows recommendations we have gotten from UTC leadership, the
form could, in principle, have been an IETF-invented NFZZ form:
our requirement is that there be only one.   And, contrary to
Markus's conclusions, there is reason for a MUST: if the sender
is _required_ to send NFC, then the receiver merely needs to
verify that format to the degree needed to prevent attacks.  If
it is a SHOULD, then the receiver must be prepared to convert to
NFC (or whatever form it needs) from whatever optional form
might appear.  Similar observations would apply to non-minimal
forms of UTF-8 and so on.
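
A rough sketch of what "merely verify" could look like on the
receiving side, in Python.  This is an illustration under stated
assumptions, not the draft's procedure: the function name is
hypothetical, and the NFC check uses whatever Unicode version the
local Python `unicodedata` module was built with.

```python
import unicodedata


def verify_wire_text(data: bytes) -> str:
    """Return the decoded text, or raise ValueError if it is not acceptable."""
    # Strict decoding rejects malformed input, including non-minimal
    # (overlong) UTF-8 sequences.  UnicodeDecodeError is a ValueError.
    text = data.decode("utf-8", errors="strict")
    # Verification only: compare against the NFC form, never repair.
    if unicodedata.normalize("NFC", text) != text:
        raise ValueError("text is not in NFC")
    return text
```

Under a MUST the receiver can stop at rejection, as above.  Under a
SHOULD it would instead need a conversion path for every form a
sender might choose.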

For CR NUL, I agree that it is an historical wart and that, in a
more perfect world, we would ban it entirely.  I'll see if that
can be reflected better in the text (the text already
discourages the use of CR as an overstrike mechanism).  However,
the strongest argument that is usually made for the use of UTF-8
(rather than other coding forms) is that it is identical to
ASCII if only ASCII characters are used.  If one is going to
accept that coding and compatibility without incurring
additional risks, then an ASCII form that is accepted and given
an interpretation in NVT (net-ASCII) must have the same
interpretation in net-Unicode: if the character sequence can
appear on the wire, the alternative is a situation in which
"everyone makes up their own interpretation".   Of course, if
NFC simply removed the NUL character from a data stream, there
would be no problem.
But it doesn't (perhaps that could be turned into an argument
for an IETF-produced normalization form, but I certainly won't
make it).   
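
For illustration, a small Python sketch of the two facts this
paragraph relies on.  The function name is hypothetical, and the rule
it checks (every CR is followed by LF or NUL) is the NVT convention
as I understand it, not wording taken from the draft.

```python
import unicodedata


def cr_usage_is_nvt_compatible(text: str) -> bool:
    """True if every CR is followed by LF (new line) or NUL (bare return)."""
    for i, ch in enumerate(text):
        if ch == "\r":
            # A CR at the very end of the text has no follower at all.
            follower = text[i + 1] if i + 1 < len(text) else ""
            if follower not in ("\n", "\x00"):
                return False
    return True


# NFC leaves NUL in place, so CR NUL survives normalization unchanged.
sample = "over\r\x00struck\r\n"
nul_survives_nfc = unicodedata.normalize("NFC", sample) == sample
```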

The restriction to Unicode 3.2 and later is related to a
different aspect of the above situation and, ironically, an
action taken by (then) X3L2 almost 40 years ago.
Incompatibilities were introduced into Unicode, and into NFC,
between 2.0 and 3.2.  Implementations that conform to 2.0 only
may be incompatible with the IETF definition of UTF-8 and other
protocols and may not normalize in a completely compatible way.
So, absent both an argument compelling enough to require that
receivers make allowances for different versions and a clear and
non-heuristic method for telling which version is in use, it
appears rational to draw the line at 3.2 (or, were it not for a
standard or two that depends on 3.2, even later). The irony,
which seems worth mentioning because it might be instructive, is
that X3L2 changed the definition of "new line" after X3.4-1968
was published, permitting alternate interpretations.  The
reaction of ASCII-dependent systems in what later became the
ARPANET community was to make very specific rules about the
interpretation and use of CR and LF and eventually to carry a
standardized form of those rules into the ARPANET and Internet.
The Internet's definition of ASCII was also frozen with the 1968
version so as to avoid further surprises of this type.

Finally, I want to come back to the procedural issue while
stressing that this is a personal opinion that may or may not be
consistent with whatever position the IETF would take if asked.
While I know of nothing that can prevent L2, or the broader
Unicode community, from discussing whatever they like, I note
that:

	* there appears to be nothing in the L2 charter or
	project list that provides for formally commenting on
	this, or any other, proposal about how one of its
	standards should be applied.
	
	* the IETF has not asked L2, or UTC, for an opinion on
	this subject, nor does there appear to be an issue
	requiring clarification or other maintenance of any
	standard under L2's scope.
	
	* My recollection is that INCITS procedures and ANSI
	guidelines generally discourage such out-of-charter
	discussions in a formal TC context, placement on the
	agenda, issuance of formal documents, and, especially,
	issuance of TC opinions.
	
	* There is no liaison between L2 and the IETF that would
	justify formal L2 comments on an IETF work item.  The
	most recent L2 annual report (INCITS/in050401 and almost
	two years out of date) doesn't even mention the IETF or
	IETF work -- any more than it mentions any other set of
	applications of Unicode or other L2 work.
	
	* The draft specifically identifies a mailing list on
	which discussions of the draft are invited.   Holding a
	semi-closed discussion on another list, presumably with
	the intention of producing a formal statement (I can't
	tell exactly what is intended because the L2 agenda is
	password-protected, which I believe violates INCITS
	rules), is not helpful and is the sort of thing that has
	led to misunderstandings between IETF-related bodies and
	the Unicode Consortium in the past.   I would suggest
	that further discussion take place on the requested
	mailing list.

regards and, again, thanks for the input,
     john

	






--On Wednesday, 09 May, 2007 00:23 -0700 Doug Ewell
<dewell@adelphia.net> wrote:

> Please add this to the agenda for UTC #111, for discussion in
> response to agenda item B.5.2.
> 
> L2/07-126, written by Markus Scherer, proposes some changes to
> the Internet-Draft "draft-klensin-net-utf8-03", by John
> Klensin and Michael Padlipsky.  This Internet-Draft proposes
> to replace the venerable "network ASCII" standard with a
> tightly defined profile of UTF-8, including NFC normalization
> and CRLF line endings.
> 
> In particular, L2/07-126 proposes that the requirement to end
> lines of text with CRLF be relaxed to permit "bare CR" and
> "bare LF" as well. The document states, "We believe that
> single CR and LF are common because of implementation practice
> on a variety of platforms, and that it is both unrealistic and
> unnecessary to try to legislate them away."
> 
> In fact, the concept of "network ASCII," and the purpose of
> the present Internet-Draft, is to do just that: "legislate" a
> specific profile of ASCII and UTF-8, respectively, that can be
> expected to maximize interoperability.  There is indeed a
> large number of text files on the Internet that use "bare CR"
> or "bare LF," just as there is a large number of files encoded
> in ISO 8859-1 or other character sets, but this potentially
> troublesome diversity is exactly why a standard format is
> being proposed.
> 
> I recommend that the proposal to "legislate" the use of CRLF
> be retained in the Internet-Draft.  Furthermore, I suggest
> that the authors of the I-D withdraw their continued support
> in "network Unicode" for the obscure "CR NUL" sequence, which
> means "carriage return without line feed" and was formerly
> used as a hack to create overstruck sequences. This mechanism
> is completely unnecessary in a text-transport mechanism that
> supports Unicode, with its rich support for "proper" combining
> characters, and is unlikely to be handled correctly by most
> text engines in any case.
> 
> L2/07-126 also recommends that the control code HT (horizontal
> tabulation, U+0009) be permitted under Section 2.2 of the
> "network Unicode" definition.
> 
> I recommend that the control code FF (form feed, U+000C) be
> likewise permitted.  The form feed function is well known and
> well defined in almost all printing functions, and all RFCs
> issued in modern times use FF to separate pages.  It would
> ironic indeed for the Internet-Draft to retain the requirement
> that form feeds "SHOULD NOT be used unless required by
> exceptional circumstances" while advancing toward publication
> as an RFC, complete with form feeds!
> 
> I have no objection to the other proposals set forth in
> L2/07-126.
> 
> --
> Doug Ewell  *  Fullerton, California, USA  *  RFC 4645  *  UTN
> # 14
> http://users.adelphia.net/~dewell/
> http://www1.ietf.org/html.charters/ltru-charter.html
> http://www.alvestrand.no/mailman/listinfo/ietf-languages
> 






---------- End Forwarded Message ----------