[I18ndir] One more try

John C Klensin <john-ietf@jck.com> Thu, 21 February 2019 08:31 UTC

Date: Thu, 21 Feb 2019 03:31:42 -0500
From: John C Klensin <john-ietf@jck.com>
To: i18ndir@ietf.org
Message-ID: <2B91B60DE56B36DD5D667679@PSB>
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
Archived-At: <https://mailarchive.ietf.org/arch/msg/i18ndir/r7y8ez4RRJioAFYsxu5Ooiw5dVY>
Subject: [I18ndir] One more try
Precedence: list

Ok.  I've been accused of just trying to hold things up,
possibly by making them so complicated, or bundling so many
things together, to make progress impossible.  That is not my
intent.

So let me see if I can boil this down further at the risk of
oversimplifying a bit and leaving some possibly-important
nuances out (indented paragraphs, which are also set off by "((
... ))", are examples or speculative comments -- feel free to
skip over them if you are confident you understand what I'm
talking about and don't have time):

(1)  There are two ways to view the IDNA2008 specs.  One is that
we have made our choices, are stuck with them, and that anything
we discover or that Unicode does to us, even things that with
ordinary IETF specifications would be considered known technical
defects, just have to be accepted and dealt with by giving
advice to registries and hoping they follow it.

	((The questions of who gives that advice, how it is
	maintained, in what ways different sorts of registries
	are differentiated, etc., are a separate issue from the
	current I-D and can be handled whenever (or if) we get
	around to it.  So can any clarifications to IDNA2008 to
	better identify the role of registries and the
	importance of their being careful, conservative, and
	responsible as long as those clarifications don't
	actually change the specifications or their intent.))

draft-faltstrom-unicode11-07 is aligned with that view.  It
doesn't say so in so many words (and I think it should if that
is what we are going to do), but it does say that some things
that have been identified as problems are best just left with
whatever categories are determined by the RFC 5891 and 5892
rules and then sorted out by registries, citing
draft-freytag-troublesome-characters non-normatively as a
specific example of such advice, and that, whatever we do, we
just follow along with Unicode's properties and categories.

The other view is that these are IETF Proposed Standards like
all other IETF Proposed Standards.  If we discover technical
defects, we address them in a serious way.  Perhaps we fix them.
IDNA2008 (IMO) anticipated that possibility: code points can be
added to the exception list, new contextual rules can be added
or old ones modified, or new rule sets can be added to section 2
of RFC 5892.   One can even imagine Unicode making a set of
additions that would justify adding an entry to the block
collection in Section 2.4 of RFC 5892.   Or possibly we can
explain the issue in prose (possibly in 5891 rather than 5892)
and move on.  In some cases, that explanation might be that
after careful examination we concluded that the pain and
suffering that would be caused by the incompatible changes
needed to get things right so far exceeded the problems that
were likely to be caused by the defect that the latter is better
just left alone but, if so, we should document that, not
hand-wave around it.  However, if some of the things we discover
(or changes or new code points in a new version of Unicode)
suggest global restrictions at a level of importance equivalent
to global restrictions we now impose, the "just like any other
standard" principle and the structure of IDNA2008 strongly
suggest that we should consider modifying the standard and its
rules, not ending up in a situation in which the real
distinction between what is handled as a global rule and what is
left to registries is when the issue happened to be noticed
(remembering that IDNA2008 has expectations about global rules
being enforced at lookup time, something that cannot possibly be
done with suggestions that are applied by individual registries
at all levels of the DNS hierarchy).

	((Let me illustrate with a real example that Patrik's
	search did not turn up this time, nor did our search
	when RFC 5892 was finalized.  The example is chosen
	partially because I'm just getting tired of talking
	about non-decomposing code points and associated
	combining sequences.  Section 2.4 of 5892 lists (and
	disallows) two types of symbols for musical notation,
	both essentially European.  Handling them by blocks is
	necessary because both notation systems consist of a
	combination of basic musical symbols and combining
	symbols.   The former have General Category So and would
	hence be DISALLOWED without needing a special rule.  But
	the others have General Category Mn which, absent other
	action, makes them PVALID.  Well, sometime around Unicode
	5.x, coding was introduced for Balinese, including
	Balinese musical notation.  It has the same property
	relationships as the more traditional (in Unicode)
	musical notations -- the base notational symbols have
	General Category So and the combining marks are in Mn.
	Other than "slipped through the cracks" there is no
	rationale that makes sense in an IDNA2008 context for
	DISALLOWing the Balinese base musical symbols (along
	with the traditional western European and Greek ones)
	while treating the Balinese musical combining symbols as
	PVALID when the others are not.  There is, however, a
	Unicode reason: while, in Blocks.txt, the musical
	symbols listed in Section 2.4 have their own blocks,
	the Balinese ones are folded into a Balinese block with
	the letters of that writing system.  If we had noticed
	this in 2010, we would have had an interesting
	discussion about blocks that should be DISALLOWED for
	consistency but that are not named as blocks in
	Blocks.txt.   So, maybe we just live with this as an
	error on the theory that a registry would be insane to
	use those combining marks with anything but Balinese
	musical symbols and those are DISALLOWED.  Maybe it
	deserves a note somewhere; maybe not.   But now suppose
	a proposal for another musical notation comes along that
	uses a combination of base symbols and combining marks
	and Unicode accepts a proposal to include it.   Suppose
	too that the block is big enough that the musical code
	points can't be folded into some associated script block
	so we end up with a separately-named block for it.   Are
	we going to follow the precedent of Section 2.4 and
	DISALLOW the block or are we going to follow the
	precedent of Balinese and let it go (and hope that
	registries get the right advice and follow it)?  Would
	it make a difference what typical graphemes for those
	code points look like and how they combine?   If we accept
	the "just follow Unicode" instructions of
	draft-faltstrom-unicode11 the answer is clear but it is
	less clear that it is the right one.))
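
A rough sketch of the asymmetry described in the example above,
assuming only two of the RFC 5892 rules: the LetterDigits
categories of Section 2.1 and the block list of Section 2.4.  The
helper name rough_status is hypothetical, and the sketch
deliberately ignores the exception, stability, contextual, and
unassigned rules, so it is not the full derivation algorithm and
should not be used to classify arbitrary code points.

```python
import unicodedata

# General Categories that RFC 5892 Section 2.1 (LetterDigits) treats as
# candidates for PVALID.
LETTER_DIGIT_CATEGORIES = {"Ll", "Lu", "Lo", "Nd", "Lm", "Mn", "Mc"}

# Blocks disallowed wholesale by RFC 5892 Section 2.4 (IgnorableBlocks).
IGNORABLE_BLOCKS = [
    (0x20D0, 0x20FF),    # Combining Diacritical Marks for Symbols
    (0x1D100, 0x1D1FF),  # Musical Symbols
    (0x1D200, 0x1D24F),  # Ancient Greek Musical Notation
]


def rough_status(cp: int) -> str:
    """Hypothetical helper: applies only the two rules listed above."""
    # The block rule is checked first, so it overrides the category rule.
    if any(lo <= cp <= hi for lo, hi in IGNORABLE_BLOCKS):
        return "DISALLOWED"
    if unicodedata.category(chr(cp)) in LETTER_DIGIT_CATEGORIES:
        return "PVALID"
    return "DISALLOWED"


# One base symbol and one combining mark from each notation:
# western (Musical Symbols block) and Balinese (inside the Balinese block).
for cp in (0x1D11E, 0x1D167, 0x1B61, 0x1B6B):
    ch = chr(cp)
    print(f"U+{cp:04X} {unicodedata.category(ch)} {rough_status(cp)} "
          f"{unicodedata.name(ch, '<unnamed>')}")
```

Under these assumptions the western combining mark is DISALLOWED
only because its block is listed in Section 2.4, while the
Balinese combining mark, with the same General Category, falls
through to the category rule and comes out PVALID.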

There is another kind of mistake we might have made -- using the
example above, if the answer for Balinese is going to be the
answer for all musical notations we encounter in the future
(i.e., musical combining characters are PVALID even though the
base ones are DISALLOWED), maybe we should be figuring out
whether including two musical notations but not the others in
Section 2.4 was a mistake and we should take those two out for
consistency, thereby shifting several code points from
DISALLOWED to PVALID (and then presumably advising registries to
not use them).  

	((I can't imagine that this would be a big issue one way
	or the other.  But think about what happens if the next
	round of emoji excesses introduces some code points
	classified as non-spacing marks.   Are we willing to
	have them be PVALID and rely on registries, some
	emoji-happy, to do the right thing?   That is a real
	question -- I don't know the answer but think we should
	be in a position to think about it should the situation
	arise, and that makes "just follow Unicode" a bad idea.))

(2) The second choice above -- treating IDNA2008 as an ordinary
set of Proposed Standards that may have technical defects that
we should be looking to straighten out rather than deciding that
anything introduced by changes in Unicode or discovered as our
knowledge increases just has to be accepted -- does not require
delaying draft-faltstrom-unicode11 until all of the issues that
have been raised are resolved (or until the end of time if that
comes first).    It does require that we go through the text to remove
or modify the "just accept whatever Unicode throws over the
wall" language to make it clear that careful reviews are in
order.  It also requires that we have some sort of plan about
such reviews (not easy given that, even with this directorate,
the number of people who have participated substantively is
very small, but lots easier than solving the problems) and that
we can describe the issues that have been outstanding for a few
years and either identify them as suitable for leaving to
registries or as ones that should cause some choices in the I-D
to be marked as tentative.   I'd imagine that, if more of us can
really engage rather than sitting things out, those things can
be done fairly quickly, possibly without holding the document up
at all, because...

(3) All of these issues aside, I don't know how many people in
this directorate who are fluent native speakers of English
(either major variation) have read the document closely, but it
appears to me to need work.   There are some typos that we can
probably rely on the RFC Editor to fix (e.g., "recomments" in
the Abstract), but there are also some technical errors (e.g.,
IDNA2008 was not "largely completed" in 2008; that is more or
less the year that revision work got underway with "largely
completed" belonging to 2010) and more complex editorial ones
(e.g., changes have not been made to Unicode "related to the
   algorithm IDNA2008 specifies" which would imply that those
changes were made because of IDNA2008 or with IDNA2008 in mind.
I don't believe either of those is true and believe that Mark
would deny it even if they were.  It would be much better and
more accurate to say that some Unicode changes have consequences
for IDNA2008 (or for its algorithm)).    

I also don't believe some of the more substantive statements are
accurate.  For example, the paragraph starting "Historically"
implies that Unicode 6.0 raised substantive issues as serious as
those identified in the 7.0 timeframe and later and that we
decided to just accept them.   I don't recall there being any
such issues.  From that standpoint, the key statement in RFC
6452 appears under Security Considerations: "The three code
points are unlikely to occur in internationalized domain
names,..."   That is a very specific rationale for accepting the
change for those particular code points.  It is quite different
from "The primary reason for that choice is that staying with
the Unicode Standard has been viewed as important because of the
diversity of implementations already existing in the wild.", a
statement for which I don't think there is much evidence of
informed IETF consensus.

There are also several places where the draft omits material
that would considerably strengthen it.   For example, under 3.2,
there are "interpretations" of UTS#46 that go well beyond "a mix
between IDNA2003 and IDNA2008".   As one important group of
examples,
https://www.unicode.org/Public/idna/11.0.0/IdnaMappingTable.txt
shows many or most symbols, including all of the emoji, as
valid.  Neither IDNA2008 nor any plausible interpretation of
IDNA2003 allows them, they raise issues that go beyond "registrar
beware", and they do not appear to be on the "troublesome" list.
As an aside, perhaps it is too paranoid, but every time I see a
piece describing emoji as a new language, I wonder whether some
future version of Unicode will reclassify them from So and Sk to
Lo and Mn.  If we follow the doctrine of just accepting whatever
Unicode throws over the wall, that would take us straight to
vomiting cowboys and worse.  However, nothing to be fixed here
that should take much time; just a strong argument that we fix
the text to avoid promising we will accept whatever Unicode
throws at us.

I see these as editorial issues -- topics on which the I-D can
be changed to say a bit less and be a good deal less
controversial about the claims it makes and how those claims
might constrain us in the future without making any significant
substantive changes.  I think many of those changes should still
be made if we conclude that IDNA2008 is frozen and it is all up
to Unicode in the future, no matter what that brings.   I think
they can be done fairly quickly if people pitch in; somewhat
less quickly if some very small number of us need to deal with
both these editorial issues and smoothing things over if we want
to preserve the option of making technical adjustments to deal
with defects in IDNA2003 or to better explain some of the
issues.

best,
   john
