Re: [art] Modern Network Unicode

John C Klensin <john-ietf@jck.com> Tue, 09 July 2019 18:16 UTC

Date: Tue, 09 Jul 2019 14:16:05 -0400
From: John C Klensin <john-ietf@jck.com>
To: Carsten Bormann <cabo@tzi.org>, art@ietf.org
cc: i18ndir@ietf.org
Message-ID: <0A5251342D480BA6437F7549@PSB>
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
Archived-At: <https://mailarchive.ietf.org/arch/msg/art/-xRLsRcnvKqvduuB_oixgq0MkWw>
Subject: Re: [art] Modern Network Unicode
Precedence: list

Carsten,

I will study this more when I get time, but a few quick
observations.  I have glanced at your revised versions, but
started this response before either -01 or (obviously) -02 were
published so will just finish it and get it off.

If it makes a difference, I am writing as co-author of RFC 5198.
I regret that my late, much-lamented, and far more articulate
co-author is not available to do so.  Those of you who knew Mike
might develop a smile in his memory by contemplating how he
would have expressed himself about some of the issues below.

Some things to think about that the drafts do not appear to
reflect...

(1) The network acquired the CRLF convention because the
earliest versions of ASCII defined LF strictly as a movement to
the same character position on the next line.  For anyone old
enough to remember typewriters, that is the function of rolling
the carriage up one line.   From the same period and
interpretation, CR alone means "back to first character position
on the current line".   In that interpretation, if one wanted
the first character position on the next line (sometimes known
as the "new line" function), one needed CR LF.  To make things a
bit more complicated, some systems at the time used LF alone as
new line and others used CR alone.  If a few lines were entered
on one such system and transferred to another, CR LF would
always produce the expected result, but bare CR, bare LF, or LF
CR might cause surprises.
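
As a rough illustration of the point above (not something taken
from the drafts or from RFC 5198), the following Python sketch
rewrites every line ending it finds as CR LF.  The name to_crlf
and the decision to treat a bare CR as a new line are assumptions
made for the example; a system that relies on bare CR for
overstriking would need different handling.

```python
import re

# Match CR LF first so that it is not split into a bare CR plus a bare LF.
_LINE_ENDING = re.compile(r"\r\n|\r|\n")


def to_crlf(text: str) -> str:
    """Rewrite CR LF, bare CR, and bare LF as CR LF (assumed policy)."""
    return _LINE_ENDING.sub("\r\n", text)


if __name__ == "__main__":
    # Mixed conventions collapse to a single form.
    print(repr(to_crlf("a\nb\rc\r\nd")))
    # LF CR is read as two separate line endings.
    print(repr(to_crlf("a\n\rb")))
```

Note that LF CR comes out as two line breaks here, which is one
form of the surprise described above.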

There are symmetric relationships among SP and HT, LF and VT,
and CR and BS; considering them might help explain how we got
here (and, for those who don't know, part of the reason why BS
(U+0008) is defined as non-destructive on many systems).

Later versions of ASCII changed several of the definitions to
be, essentially, "interpret it however you like".  Of course,
from an interchange standpoint, that just made things worse.
ISO/IEC 6429 tried to fix the problem by defining new control
codes with specific meanings (see U+0084 and U+0085), but they
never took off and hence added to the confusion.   Unicode,
having learned from that experience, solved it by adding yet
more codes, notably U+2028, which, as you note, didn't go
anywhere either.

Telnet (and FTP TYPE A and perhaps email) may be history
although I think we lost quite a bit in going from the
network-standard forms specified at the core of those protocols
to the "everything is like Unix" or "everything is like me"
models that seem to be pervasive in many more recent
developments.  It may be a bit ironic that your solution to some
other problems is to devise yet another network-standard form,
especially one that has options that basically encourage (or
require) profiles (or, if you prefer, combinations of variances).

As long as one only intends to store or display the results of
whatever is transmitted and the display device follows more or
less the same conventions as the transmitting one, none of that
makes any difference, at least in the overwhelming majority of
cases.
If, however, there is any requirement to compare a set of
strings, things get more complicated in a hurry (see below).

(2) Especially given recent knowledge and discoveries, requiring
NFC is probably a mistake and might have been a mistake when
5198 was written.  NFKC is worse.   If you read the Unicode
Standard, you will find that the goal of normalization is
generally to permit applications to convert two strings to a
common form before they are compared, so that comparisons
closely approximate what people would expect.  That
pre-comparison conversion is probably why Unicode uses the term
"canonical form" although it doesn't include some of the things
the rest of us have historically meant by that term in specific
applications.   However, converting strings to normalized form
before transmission may lose information that was intended by
the user to affect how characters are displayed.  Because of
some characteristics of the DNS and IDNA, there was a special
need for stored NFC-strings.  That might generalize to some of
the PRECIS work and possibly to identifiers in general, but
certainly does not extend to transmission of general Unicode
strings.  For the NFC case, it is easy to make the wrong
generalization from Latin script, where NFC is generally harmless
at worst, but that is not the case more generally.  NFKC is even
more problematic: eliminating half-width characters in favor of
their normal-width equivalents may be harmless (or may not), but
dropping, e.g., mathematical symbols or rarely-used CJK
characters in favor of Latin or Greek characters or more commonly
used CJK ones with similar interpretations can be a disaster.
If you want a variation for "loses information but maybe no one
cares" then you might put NFKC in there along with several other
things :-(
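
To make the information-loss argument concrete, here is a small
Python sketch using the standard unicodedata module.  The sample
characters and the helper names code_points and
normalization_report are choices made for this illustration only;
they do not come from the drafts.

```python
import unicodedata

# Sample strings chosen for illustration only.
SAMPLES = [
    "e\u0301",     # "e" followed by COMBINING ACUTE ACCENT
    "\u2126",      # OHM SIGN
    "\uff76",      # HALFWIDTH KATAKANA LETTER KA
    "\U0001d400",  # MATHEMATICAL BOLD CAPITAL A
]


def code_points(text: str) -> str:
    """Render a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


def normalization_report(text: str) -> dict:
    """Return the code points of the input and of its NFC and NFKC forms."""
    return {
        "input": code_points(text),
        "NFC": code_points(unicodedata.normalize("NFC", text)),
        "NFKC": code_points(unicodedata.normalize("NFKC", text)),
    }


if __name__ == "__main__":
    for sample in SAMPLES:
        print(normalization_report(sample))
```

NFC already replaces OHM SIGN with GREEK CAPITAL LETTER OMEGA, and
NFKC additionally folds the half-width and mathematical letters
into their ordinary counterparts, so a receiver cannot recover
the sender's original choice once the string has been normalized
before transmission.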

(3) The comparison problem is, as mentioned above, far more
difficult.  At the most superficial level, the web and some
other modern applications allow either LF or CRLF.  That might
have been a mistake -- or maybe it is ok.  It does mean that
strings (or whole files) have to be converted to some canonical
form for comparison, that sizes change, etc.  The difficulties
pervade the many issues we've had with IDNA over the years,
including the question of whether or not NFC is adequate to deal
comprehensively with the problems (it isn't) and what should be
done about the cases where it is not.  For a discussion of some
of the issues, see https://github.com/w3c/charmod-norm
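
A minimal sketch of the "convert for comparison only" approach,
in Python, might look like the following.  The names
comparison_key and equivalent are hypothetical, and the choice of
LF plus NFC as the comparison form is an assumption for the
example; as noted above, NFC does not cover every case, and the
sketch makes no attempt to handle the cases it misses.

```python
import unicodedata


def comparison_key(text: str) -> str:
    """Build a key for comparison only; the original string is untouched."""
    # Unify line endings: CR LF first, then any remaining bare CR.
    unified = text.replace("\r\n", "\n").replace("\r", "\n")
    # Apply canonical normalization (NFC is an assumed choice here).
    return unicodedata.normalize("NFC", unified)


def equivalent(a: str, b: str) -> bool:
    """Compare two strings by their comparison keys."""
    return comparison_key(a) == comparison_key(b)


if __name__ == "__main__":
    crlf_text = "one\r\ntwo\r\n"
    lf_text = "one\ntwo\n"
    # The two forms compare equal although their sizes differ.
    print(equivalent(crlf_text, lf_text), len(crlf_text), len(lf_text))
    # Precomposed and decomposed spellings also compare equal.
    print(equivalent("caf\u00e9", "cafe\u0301"))
```

Keeping the key separate from the stored or transmitted string
avoids the information loss discussed under (2), at the cost of
recomputing (or caching) the key whenever a comparison is needed.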

You might want to try to encourage the i18n directorate (copied
on this note) to review this before you get much further along.  

best,
   john

--On Sunday, 07 July, 2019 22:14 +0200 Carsten Bormann
<cabo@tzi.org> wrote:

> This week, I have been asked twice (for independent drafts)
> what these drafts should say about their usage of Unicode.  It
> turns out, we have an RFC that is almost, but not entirely
> unlike what is needed to reference for a new protocol that
> uses text: RFC 5198. I have tried to build on that in the very
> short draft:
> 
> 
> 
> I would expect that such a specification, once completed,
> would be a useful point of reference for any new protocols
> that need to use text without any specific requirements
> imposed by their applications. (Protocols that already have
> been around probably benefit less, because there always are
> legacy considerations.)
> 
> I'm not yet asking for dispatching this document; right now I
> would just like to receive feedback from this esteemed
> community to further improve the document.
