Re: [I18ndir] Getting restarted and triage

John C Klensin <john-ietf@jck.com> Wed, 26 June 2019 16:24 UTC

Date: Wed, 26 Jun 2019 12:24:48 -0400
From: John C Klensin <john-ietf@jck.com>
To: "Asmus Freytag (c)" <asmusf@ix.netcom.com>, i18ndir@ietf.org
Message-ID: <B7A0B60B2E3DE13D5FAA651F@PSB>
In-Reply-To: <16a479d0-8a10-192f-a00c-11b1eae4abb1@ix.netcom.com>
References: <F2B84580-7E5A-4B86-BF9C-0205D4E6121D@episteme.net> <843EAB4535391A494DA216CC@PSB> <13212579-9AEA-45F8-A205-18B4AD1B0BF1@viagenie.ca> <EC8189E3EA3488B8924DBBEB@PSB> <77e8acfd-811a-c5e9-6940-3b8ed2669a75@ix.netcom.com> <E596E8F5E430FAFAC84B17CF@PSB> <16a479d0-8a10-192f-a00c-11b1eae4abb1@ix.netcom.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
Content-Disposition: inline
Archived-At: <https://mailarchive.ietf.org/arch/msg/i18ndir/RXx_ZniL5GmsO4BBBmTVWOvM57k>
Subject: Re: [I18ndir] Getting restarted and triage
Precedence: list

Asmus (and everyone else),

Let's try to restate your (IMO very helpful) explanation in
possibly-actionable terms...

--On Friday, June 14, 2019 14:13 -0700 "Asmus Freytag (c)"
<asmusf@ix.netcom.com> wrote:

>...
> I was just making the point that Unicode's own list of
> intentionally identical characters contains several
> same-script pairs.

>  They also document a large number of sequences that would
> render identically, except that one of each pair is considered
> "do not use" by Unicode (and vendors obligingly tend to insert
> dotted circles into the display - although that is not
> mandated).

> These cases are edge cases not covered by normalization for
> various good reasons, but we might need to work with them to
> make that data machine readable.

First question cluster: Would you advocate modifying RFC 5892 to
say something substantive about one or both members of that pair
of characters?  If so, do we need to ask Unicode, not just for
machine-readability but for a property or something similar that
IDNA can use?  Do we need to make a plan about what to do if we
ask and they say "no"?   Also see "Final, meta, questions" below.
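
As a rough illustration of why normalization alone does not
settle such cases, the sketch below compares a canonically
equivalent pair with a same-script pair of look-alike letters.
The look-alike pair (U+01DD and U+0259) is my own choice for
illustration; I am not asserting that it appears on any
particular Unicode list.

```python
import unicodedata

# Canonically equivalent spellings of "e with acute": normalization
# unifies them.
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# Two distinct Latin letters that many fonts draw alike (illustrative
# pair only, chosen as an assumption for this sketch).
turned_e = "\u01dd"      # LATIN SMALL LETTER TURNED E
schwa = "\u0259"         # LATIN SMALL LETTER SCHWA


def same_after(form, a, b):
    """True if a and b become the same string under the given form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


for form in ("NFC", "NFD", "NFKC", "NFKD"):
    print(form,
          "accent pair unified:", same_after(form, precomposed, decomposed),
          "look-alike pair unified:", same_after(form, turned_e, schwa))
```

No normalization form folds the second pair together, so anything
IDNA does about such pairs would need data beyond the
normalization tables.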

> Finally, there are some sequences that we discovered in the
> Root Zone development process that are not normalizable (they
> are distinct spellings, even though they look alike - the
> correct choice depends on the word in question). Because they
> are sequences, they are not yet covered by Unicode data.

Second question cluster: Is there something that could usefully
be done about these in IDNA and, if so, what would it be?  How
do such sequences differ from the "rn" and "m" pair that, with
the right choice of type styles and sizes, can look nearly
identical (or at least "alike")?  If there is a correct choice
and it depends on spelling of actual words, how does that
interact with the principle that DNS labels are not required to
be words in any language (and that many historically and
currently important ones are not)?  Is the Root Zone development
process imposing a "word" criterion for non-ASCII labels even
though the rest of the DNS does not have that requirement
(noting that, given "COM", "ORG", "BIZ", "MIL", most of the ISO
3166-1-based ccTLDs, etc., there is enough precedent for
non-words in ASCII that proposing such a change would probably
be a lost cause)?   Considering your work with and comments
about complex scripts, is it possible to construct and impose an
"orthographically plausible in a given script" rule that would
stop short of a "word" requirement and should we be exploring
that for IDNA?  Again see "Final, meta, questions" below.

> And then the larger issue is that nobody knows enough about
> ancient, obsolete or archaic scripts (or rare code points in
> general) to be able to map the problem space *within* that
> repertoire; which is why it's best to not support any of them
> in "public" zones.

While I think there is general agreement about that (some
fussing about the meaning of "public zone" aside), if it is a
public zone question alone, does it impact IDNA in any way?

> We should talk about the problem this way: "In the general
> case, if you want a reasonably secure zone, you need to
> develop a set of Label Generation Rules for your zone, that is
> restricted to modern-use repertoire (to facilitate
> recognition) and that uses context and other rules to restrict
> labels further to stay within the structure of the script and
> prevent duplicate or non-recognizable labels; or, that uses
> variants to limit the number of labels that users may consider
> substitutes for one another; or both. In addition, it may
> require the use of a separate process to deal with certain
> more subtle forms of confusion that fall short of full
> exchangeability."
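
To make the repertoire-plus-variants part of that description
concrete, here is a minimal sketch of how a zone might evaluate
a candidate label.  The repertoire, the variant table, and the
function names are all hypothetical and far smaller than any
real set of Label Generation Rules; the separate process for
subtler confusion is not modeled at all.

```python
# Hypothetical toy repertoire: ASCII letters, digits, hyphen, plus two
# extra Latin letters.  Real repertoires are script-specific.
REPERTOIRE = set("abcdefghijklmnopqrstuvwxyz0123456789-") | {"\u01dd", "\u0259"}

# Hypothetical variant table: treat TURNED E as a variant of SCHWA.
VARIANTS = {"\u01dd": "\u0259"}


def index_label(label):
    """Replace each code point by its variant representative."""
    return "".join(VARIANTS.get(ch, ch) for ch in label)


def evaluate(label, registered):
    """Return 'invalid', 'blocked', or 'ok' for a candidate label."""
    # Repertoire check: every code point must be explicitly allowed.
    if not label or any(ch not in REPERTOIRE for ch in label):
        return "invalid"
    # Variant check: block labels whose index form is already taken.
    taken = {index_label(r) for r in registered}
    if index_label(label) in taken:
        return "blocked"
    return "ok"


registered = ["caf\u0259"]
for candidate in ("caf\u01dd", "cafe", "caf\u00e9"):
    print(repr(candidate), evaluate(candidate, registered))
```

With these toy tables the first candidate is blocked as a variant
of the registered label, the second is acceptable, and the third
is rejected because U+00E9 is outside the repertoire.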

At one level, that seems entirely reasonable. I think --but am
not sure and would learn from a correction and explanation --
that it is actually less stringent as a requirement than the
repeated variations on "as a registry, you should register only
strings you understand in scripts that you understand
thoroughly" in the existing IDNA specs and reinforced in
draft-klensin-idna-rfc5891bis.  Would it work any better as
advice or in practice?   I think each of us --or at least many
of us-- have hypotheses about why the existing rule has not been
observed very much in zones that are operated for a profit or
even by third party registries on a cost-recovery basis (not the
same as "public domain", but with a good deal of overlap).
Most of those hypotheses seem to amount to "not profitable",
either because it would reject potentially-paying customers who
want to buy particular names or because the costs of checking
each string would be too high. 

Third question cluster:  Keeping in mind the above and that
we've, so far, avoided (or at least tried to avoid) getting into
issues of what humans can confuse, especially if others are
deliberately trying to confuse them in IDNA, what would you
suggest doing in this area?  Or is it just about reinforcing the
mandate for registries being responsible about what they do and
providing additional guidance and tutorial materials to help if
they decide, for business reasons, to substitute "have a
general, if vague, knowledge of and some helpful guidelines" for
"understand thoroughly"?

> The issue here is that not all scripts require all types of
> mitigation to the same extent (or for the same reasons). Hence
> the, "in the general case".

Clearly.  And I think we understood that in the IDNA2008 design
even though I now believe that we botched some of the details.

> But I find that if we think of only a sub-problem, like
> non-normalizable sequences with identical (or possibly
> identical) display, we fall short of seeing the full issue and
> will try to pursue remedies that, while well intentioned, will
> fall short. Or worse, will make the life of those more
> difficult who are doing the right thing.

> For example, you could define some kind of inverse "CONTEXT"
> rule for IDNA that invalidates certain code points and feed
> that process all the "do not use" sequences from Unicode.
> Turns out, for the Indic scripts, where these are an issue,
> a rather simple set of positive context rules, generally based
> on character classes, not individual code points, covers the
> 90+% of all cases including 100% of those listed by Unicode.

> But you don't want to bake these into the protocol, because
> between 1 and 10% of the cases can be language dependent, or
> can depend on whether you allow certain other code points (or
> not), that is the details (and needed exceptions) can be
> repertoire dependent.
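
For readers who have not seen one, a positive context rule keyed
on character classes rather than individual code points might
look roughly like the sketch below.  The rule itself (a
combining mark must follow a letter or another mark) is a
deliberately crude assumption of mine, using Unicode general
categories as the classes; it is not one of the actual Indic
rules from the Root Zone work, which are stricter and
repertoire dependent.

```python
import unicodedata

MARK_CLASSES = ("Mn", "Mc")  # nonspacing and spacing combining marks


def follows_class_rule(label):
    """Toy positive context rule: each combining mark must directly
    follow a letter (general category L*) or another combining mark."""
    prev = None
    for ch in label:
        cat = unicodedata.category(ch)
        if cat in MARK_CLASSES:
            ok_context = prev is not None and (
                prev.startswith("L") or prev in MARK_CLASSES
            )
            if not ok_context:
                return False
        prev = cat
    return True


# U+0915 DEVANAGARI LETTER KA, U+093E DEVANAGARI VOWEL SIGN AA
print(follows_class_rule("\u0915\u093e"))  # mark after a letter
print(follows_class_rule("\u093e\u0915"))  # mark at start of label
print(follows_class_rule("1\u093e"))       # mark after a digit
```

The first label satisfies the rule and the other two do not.
Because the rule names classes, it does not have to enumerate
the individual sequences it excludes.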

Ok.  Whether we agree or not on the details (and you and I, at
least, mostly agree) I think our understanding of what we should
not do is improving, with your efforts and explanations being a
significant component of that improvement.  What I'm not sure
about is whether we are getting any closer to an understanding
and agreement on what we should do.

> [We really should be having this discussion based on your
> updated draft.]

Probably.  See "Final, meta, questions" below. 

>>>> Without interpreting the months it took to get it off the
>>>> ground, the lag time between the discussions of the Unicode
>>>> 11.0 and 12.0 tables and drafts and Pete's note and the
>>>> month between Pete's note and my note as indicating anything
>>>> (although it probably does), (1) - (6) above make an
>>>> extremely strong case that getting critical mass together to
>>>> initiate and sustain a WG, at least a conventional one that
>>>> does not bend various rules, is implausible.

>>> Seems like a reasonable reading of the evidence to me.

>> Sadly, we agree.

> Which is why a different beast (or some tweaking of the rules)
> may be needed to obviate any "bending".

Ok.  But what do you suggest given the constraints to which the
above points?  In a way, this is at the core of the "Final,
meta, questions" discussion below, but it states the problem at
least a bit differently.

>...
>> IMO that is a fairly serious problem and the reason I took
>> exception to "requires a WG" without a plan that addresses it.
>> 
>>> i18n is special in the way it intersects technologies. It
>>> isn't a standalone technology, despite the fact that some
>>> technologies are i18n-specific.
>> Yes.  I hope we all know and agree about that.  Certainly I
>> do.
> 
> May need its own niche in terms of process.

Yes.  However, when the directorate was under discussion, we got
back what I thought were fairly clear signals from the IESG that
our role was going to be scoped even more narrowly to advising
the ART ADs than some directorates have been in the past.  That
does not make me optimistic about trying to define a special
niche and get it accepted, but maybe things have changed.  As we
wrote to each other earlier...

>>> In principle, the directorate model should cover the other
>>> aspects well, except that IETF has too few people who can (or
>>> want to) understand and review meaningfully those "generic"
>>> technologies that nevertheless have i18n exposure. The W3C
>>> make that model work, but only because their core
>>> participants are funded directly for that work.

>> Actually, there was a bit of a fuss when the directorate was
>> created about confining its role to that of traditional
>> directorates.  If those positions are accepted (some of which
>> came from comments by people who were then ADs), our sole
>> role is to advise the ART ADs on strategic issues.  Even
>> reviewing out-of-Area documents is a little marginal and some
>> Areas have, historically, had both directorates (for strategy
>> and technical advice to the Area) and what are now called
>> Review Teams (for out of Area document reviews) with
>> different memberships.

> Again, the way i18n cuts across technologies seems poorly
> understood by IETF in general - excepting this group.

I agree completely.  But I'm not seeing the mechanism or
suggestions that get us unstuck from however one describes the
current situation.  See "Final, meta, questions" below.

>> From the perspective of someone who is part of W3C's core
>> i18n WG and who has been on most of the weekly calls for the last
>> few years --probably a higher percentage of calls over that
>> time than anyone other than the assigned staff member and the
>> chair (and who, by the way, is not funded either directly or
>> indirectly for that work), there are at least two other
>> reasons why that effort works.

> I've watched the process from a bit further remote, but I do
> monitor their work and tend to put my oar in when issues
> intersect my particular expertise.

Yes.  But the claim, IIRC, was that the W3C effort was succeeding and
we weren't because the core activity there was fully funded and
staffed.  Staffed, yes, but, other than Richard, not much better
funded than the IETF effort.  Certainly I'm not funded for that
work and I assume you are not either.  I also note that the
level of activity of the other people on this list who are
supposedly part of the core group (and who may not be funded for
it either) has been rather close to zero and that has not
prevented work getting done either.

>>   One is that the core group, and the W3C
>> generally, have been at liberty to say "not a web problem" or
>> equivalent and walk away (and have done that, repeatedly).   I
>> cannot imagine them spending much time on, e.g., non-ASCII
>> identifiers in X.509 certificates or physical device
>> identifiers.

> But I'm sure IETF also has a scope. Wouldn't expect this
> directorate to get involved in defining issues for HTML for
> example ?

Of course.  But that isn't the point.   The point is that
analysis of, e.g., implications of non-ASCII identifiers in
X.509 certificates or storage identifiers is fairly far from the
expertise of most of us on this list.  But we don't get to say
"out of scope" and, instead, appear to be expected to prioritize
them.

> But some of the discussions here make me think that, for
> example, IETF isn't clear about keeping character encoding
> issues on the outside.

More explanation about that would be helpful.  I think we are
fairly clear about it and that, when issues have arisen, they
have been about whether other groups are following their own
rules (or what they told the IDNA2008 development effort their
rules are) or about areas in which they have done work or made
decisions that are not, historically, about character encoding.

>>   The second is that they are actually treated as a
>> group of experts that is required, or even expected, to
>> justify every internal decision to a pack of people with
>> strong opinions, loud voices, and no expertise (in our case,
>> whether within the IETF or to various ICANN and other
>> industry groups). Even the public review process is different
>> in that respect.
> 
> I can't parse that first sentence. Is there a "not" missing?

Yes.  Should have been "is not required..."  Sorry.

>...

Final, meta, question(s):

We are now six weeks past Pete's "get back to work" note and two
weeks past my response to it.
draft-klensin-idna-unicode-review, the direct result of
discussions in this group, will have been posted for two weeks
on Friday.

Since then, we've had Marc's note suggesting the issues ought to
have a WG and several comments that I believe add up to a
conclusion that the only way we are going to get a WG is to
redefine this list as one ... and that probably won't work
either.  We've had a few exchanges between Asmus and myself that
I'm learning from but that I don't think are moving us forward.
We've had a review by John Levine that is encouraging because it
means that someone other than Patrik and myself has looked at
the document, but that doesn't show a path forward either (not
his job).  And we've had a few notes from Pete about procedures
and comment styles but, AFAICT, no direction that will move us
forward.  My (I thought) fairly simple question as to whether
I/we should expect the directorate to review that draft and
advise the ADs on what to do with it or whether I should try to
find an AD to sponsor it has not been addressed by Pete, Peter,
or any of the ADs.

I hope my being unable to find this encouraging is no surprise.
I am having a good deal of trouble prioritizing this effort,
particularly working on the various documents, ahead of work
that actually generates interest and, btw, income to pay the
bills.  I can push back, as I have above, to try to find things
that are within the IETF's scope and actionable, but if we can't
figure out how to process even the two most obviously actionable
documents we have on the table
(draft-klensin-idna-unicode-review and
draft-klensin-idna-rfc5891bis), then there doesn't appear to be
a lot of point. Unlike most of the rest of what should be in the
directorate's queue, those two documents make no substantive
changes to IDNA2008 that, e.g., change the derived properties of
code points, so they should not be very controversial.   I could
make a fast pass through the "registry responsibility" document
and encourage Asmus to join me to be sure that we haven't
changed our minds about anything significant and then post a
current (non-expired) version but, right now and given the
underwhelming response to the draft posted a bit under two weeks
ago, it is questionable whether even that is worth the trouble.

So, is there a path forward or was this directorate idea a
well-intentioned failure that leaves us back where we were a
year ago?   Is there going to be a report on the directorate on
the morning of 22 July and, if so, what do the five of you
expect to say?  

best,
   john
