Re: [I18n-discuss] FW: New Version Notification for draft-iab-identifier-comparison-01.txt

John C Klensin <john-ietf@jck.com> Mon, 09 April 2012 14:08 UTC

Date: Mon, 09 Apr 2012 10:08:52 -0400
From: John C Klensin <john-ietf@jck.com>
To: Dave Thaler <dthaler@microsoft.com>, i18n-discuss@iab.org
Message-ID: <6BE45A5D16245553F661236A@PST.JCK.COM>

--On Monday, March 12, 2012 20:48 +0000 Dave Thaler
<dthaler@microsoft.com> wrote:

> Updates since -00 include:
> 
> 1) Updated references, including quoted text to be from the
> most recent relevant RFCs
> 
> 2) To try to avoid a portion of the (non-technical) process
> pushback we got at Appsarea and the plenary, no longer
>...

Dave,

Sorry for the delay on this detailed review.  I'll let the IANA
and ITU Programs make my apologies for me.

I like the general thrust of the document, but I have a lot of
quibbles.  I hope not too many of them run counter to the
watering-down you did as a consequence of the earlier external
comments.

With two major exceptions (which I identify), I haven't tried to
separate comments below that are editorial and semi-trivial from
those that are more substantive.  Anything for which there is
disagreement will need discussion either way.

Warning: The suggestions below push one set of examples that I
think we need to address, if only because a new theory of
"sometimes they are and sometimes they are not" name
equivalence is being introduced into the community, one with
much nastier implications than the discussion that now appears
in the document.  In particular, it essentially reintroduces
the old notion that two URIs are "the same" if they access the
same content when the relevant Method is applied.  Comparison
systems that follow the URI spec are extremely likely to get
different results than systems that use the model that
content-identity equals identifier equivalence.  That issue is
similar to, but different from, the address-equivalence model
addressed in Section 3.1.4.

(1) Section 1(3), text reading:
	"in the email arriving at the holder's email server which
	has the repository of all email accounts on that server."

Small distinction: mail stores ("repositories for incoming
mail") are not necessarily on the same host as the email server
and may be distributed across several machines.  And the
repository for mail isn't necessarily where the matching is
done -- matching does, necessarily, occur on the mail server.

It occurs to me that, as this mess evolves and given the very
strong rule that only the delivery MTA can know how to interpret
an address, it may be desirable to have an SMTP extension that works
a little bit like VRFY but takes two addresses and returns
information as to whether they match --possibly differentiating
between matching as a coding issue (can't change) and matching
as an alias one (could change if aliases change).  In a better
world, one could think about reopening the subaddress can of
worms, but that would probably doom the proposal.  Such a
command, like VRFY, could return "I'm not going to tell you" if
the server considered that necessary, leaving whatever wants to
know no worse off than they are today.  I'd be happy to write
such a command up if others feel it would be worth the effort.

(2) Major issue -- perhaps more textual/presentation than
substantive, but certainly affecting substance:

In reading this draft starting with the comparison typology
and picture at the end of Section 1, I realized that there are
many places where it isn't actually talking about comparison at
all.  What we do when we are trying to compare two identifiers
is to canonicalize them into a single form that would be
bit-string equal if the two identifiers matched.  That form
might be either a lexical canonicalization of the original or
some standardized surrogate for it.  But the canonicalization
is the important operation and is what "can be complicated"
(next-to-last paragraph, Section 1); actual comparison is
largely trivial.  You essentially introduce that concept in
Section 2.3, but don't take as much advantage of it as I think
would be helpful.

I'm going to assume that model (and some changes of text) in
several of my comments below.  In particular, it means that the
more complex operations that this document (and others in the
IETF) assume are carried out on a pair of strings instead only
require a clear definition of the canonical comparison form for
each string type (and the function that gets to it) -- the
two-string operation is trivial.   This is, among other things,
consistent with the way our comparator registries and hash
functions are defined.
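
A minimal sketch of that model, offered only as an illustration: each
identifier type supplies one function that produces its canonical
comparison form, and the two-string operation is a plain equality test.
The type names and the registry below are hypothetical and are not
taken from the draft or from any existing comparator registry.

```python
from typing import Callable, Dict

def canon_ascii_caseless(s: str) -> str:
    """Canonical form for an ASCII-only, case-insensitive identifier type."""
    if not s.isascii():
        raise ValueError("identifier type is restricted to ASCII")
    return s.lower()

def canon_octet_exact(s: str) -> str:
    """Canonical form for a type compared exactly as written."""
    return s

# Hypothetical registry: identifier type -> canonicalization function.
CANONICALIZERS: Dict[str, Callable[[str], str]] = {
    "ascii-caseless": canon_ascii_caseless,
    "octet-exact": canon_octet_exact,
}

def identifiers_match(id_type: str, a: str, b: str) -> bool:
    """All of the complexity lives in the canonicalizer; this part is trivial."""
    canon = CANONICALIZERS[id_type]
    return canon(a) == canon(b)
```

Under this arrangement, adding a new identifier type means defining one
canonicalization function rather than a new pairwise comparison rule.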

(3) Section 1, last sentence.  This sentence starts talking
about hierarchies of caches generally, uses web pages as a
parenthetical example, then makes a statement about
"authoritative web server[s]".  Either make the whole thing a
web example or make it about hierarchical caches with the web
as an example.  Incidentally, HTTP 1.1 [RFC2616, particularly
Section 10.2.4] doesn't define an "authoritative web server",
only "non-authoritative information" and an "origin server".
IIRC, that was done deliberately because an intermediate server
is not required to return a 203 code if it knows that the
information it is returning is identical to what the origin
server would have produced. 

(4) Section 2, bullet 1: You may need to define "security
identifier" for the benefit of some readers of this text.  It
is another place where the "canonical comparison form" concept
may be important.

(5) Section 2.1, bullet 2, text reads: 
	"URI scheme names are defined to be a case-insensitive
	match". 

It sort of slips the ASCII requirement in later in the
paragraph, perhaps leading the reader to believe that
case-insensitive comparisons for non-ASCII characters are well
defined (i.e., without the use of controversial operations or
magic).  Suggestion:

	"URI scheme names are required to be ASCII and are defined
	to match in a case-insensitive way".

Or words to that effect (and then clean up the rest of the
paragraph).
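
To illustrate why the ASCII restriction belongs up front, here is a
small sketch.  The helper name scheme_equal is an assumption made for
this example.  The two checks after it use Python's default Unicode
case mappings to show that "case-insensitive" gives different answers
for non-ASCII text depending on which direction one folds.

```python
def scheme_equal(a: str, b: str) -> bool:
    """Compare URI scheme names: ASCII only, case-insensitive."""
    if not (a.isascii() and b.isascii()):
        raise ValueError("URI scheme names must be ASCII")
    return a.lower() == b.lower()

# Within ASCII, folding either way gives the same answer.
ascii_match = scheme_equal("HTTP", "http")

# Outside ASCII, the answer depends on the operation chosen:
# upper-casing maps the sharp s to "SS", lower-casing leaves it alone.
match_by_upper = "straße".upper() == "STRASSE".upper()
match_by_lower = "straße".lower() == "STRASSE".lower()
```

Here ascii_match and match_by_upper are true while match_by_lower is
false, which is the kind of ambiguity the suggested wording avoids.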

(6) Section 2.2 paragraph 1.  Text reads:

	'(Often the term "normalization" is used synonymously with
	"canonicalization", but in internationalization the term
	normalization has a precise meaning, and so we use the
	generic term canonicalization here instead.)'

Unfortunately, Unicode supplies a precise meaning to
"canonicalization" as well.  The meaning is not necessarily
consistent with the one used in the draft (and is definitely
not consistent with the usage I'm advocating in note (2)
above).  John Tukey would certainly have suggested that we should
be inventing a new word to be absolutely clear.   I suggest
that we should use the construction outlined in note (2) and
make clear in the Introduction that variations on "canonical"
in this document mean what we say they mean --in terms of a
per-identifier-type form and function-- despite the efforts
of TUS to preempt the term.

In any event, canonicalization is not an algorithm for
comparison.  It is a transformation applied to both identifiers
to make it possible to determine equivalence of the identifiers
by comparing the canonical (including surrogate) forms.

(7) In Section 2.3, it might be helpful to deal with current
events by adding another example.  Suppose that Foo Corp
actually writes its name as "Fô Corp" where that is feasible,
using "FooCorp.example" and "FôCorp.example" as variant domain
names.  It would be quite natural for Fô Corp employees and
algorithms to treat

	http://example.com/stuff/FooCorp/alice
and
	http://example.com/stuff/FôCorp/alice

as matching, although example.com might have no clue that
particular matching was intended.  The match would also be a
violation of a strict reading of the URI spec, but, in the land
in which a slightly different Alice travels, Humpty Dumpty
presumably invents comparison procedures as well as word
definitions and that is not an impediment.

It is not clear whether the prohibition on non-ASCII characters
in URIs provides protection or not.  It is certainly intended
to not be protective in the case of ICANN variants.  And, of
course, if example.com's customer was actually ColourCorp
rather than FooCorp, there would be no protection at all.
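
A rough sketch of the two views, with assumptions flagged: the variant
table and the function fo_corp_equal are hypothetical stand-ins for
whatever Fô Corp's own software might do, and strict_equal is only a
simple string comparison of the percent-encoded forms, not a full
implementation of the URI spec's comparison ladder.

```python
from urllib.parse import quote, unquote

BASE = "http://example.com"

def to_uri(path: str) -> str:
    """Percent-encode non-ASCII path characters as UTF-8 octets."""
    return BASE + quote(path, safe="/")

uri_a = to_uri("/stuff/FooCorp/alice")
uri_b = to_uri("/stuff/FôCorp/alice")   # becomes .../F%C3%B4Corp/alice

def strict_equal(a: str, b: str) -> bool:
    """Simple string comparison of the URI forms."""
    return a == b

# Hypothetical variant table, known to Fô Corp but not to example.com.
VARIANTS = {"FôCorp": "FooCorp"}

def fo_corp_equal(a: str, b: str) -> bool:
    """Local policy: decode, then fold known variant path segments."""
    def fold(uri: str) -> str:
        segments = unquote(uri).split("/")
        return "/".join(VARIANTS.get(seg, seg) for seg in segments)
    return fold(a) == fold(b)
```

The two functions disagree on uri_a and uri_b, and nothing in the URIs
themselves tells a third party which answer was intended.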

(8) In Section 3.1, please do not use "hostname" as a synonym
for what we've called "LDH label" or "NR-LDH label" in RFC 5890
and elsewhere.  "Hostname" has other --very confused and
contradictory-- definitions and semantics, including questions
about whether the term refers to an FQDN or just a label and
whether it can refer to a non-terminal DNS node or be
associated with an alias-type RR.  If
I-D.ietf-pkix-rfc5280-clarifications uses "hostname" that way,
it should be fixed.   I thought we had gotten that confusion
out of RFC 6055 but, if we did not, it is errata time.

As always, I look forward to this bit of confusion being
completely cleared up with the publication of
draft-ietf-dnsext-dns-authoritative-terminology and its
achieving broad consensus.  :-(

(9) Another issue with Section 3.1 is illustrated by the
following example.  Suppose we have
    example.com.  IN A 10.0.0.6
    example.net.  CNAME example.com.
    example.org.  IN A 10.0.0.6

Now, as an exercise, figure out which pairs of those names are
"equal".  As an even more interesting exercise, construct a
similar example using DNAME or perhaps a hypothetical "VARIANT"
RR type. 
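
One way to work the exercise is to model the three records in memory
and try three candidate definitions of "equal".  This is a sketch only:
the table and function names are made up for illustration, no DNS
queries are performed, and which definition is the right one is
exactly the open question.

```python
from itertools import combinations

# In-memory stand-in for the three records above.
RECORDS = {
    "example.com.": ("A", "10.0.0.6"),
    "example.net.": ("CNAME", "example.com."),
    "example.org.": ("A", "10.0.0.6"),
}

def canonical_name(name: str) -> str:
    """Follow CNAME aliases until a non-alias record is reached."""
    seen = set()
    while RECORDS[name][0] == "CNAME":
        if name in seen:
            raise ValueError("CNAME loop")
        seen.add(name)
        name = RECORDS[name][1]
    return name

def address(name: str) -> str:
    return RECORDS[canonical_name(name)][1]

def equal_as_names(a: str, b: str) -> bool:
    return a.lower() == b.lower()

def equal_after_alias(a: str, b: str) -> bool:
    return canonical_name(a) == canonical_name(b)

def equal_by_address(a: str, b: str) -> bool:
    return address(a) == address(b)

def matching_pairs(rule):
    """Return the set of name pairs that a given rule treats as equal."""
    return {pair for pair in combinations(sorted(RECORDS), 2) if rule(*pair)}
```

With these definitions, comparing the names themselves matches no
pairs, following aliases matches only example.com and example.net, and
comparing addresses matches all three pairs.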

(10) I find parts of 3.1.1 very troubling.  Separate note when
I figure out how to explain what is bothering me, but at least
some of it may involve the relationship between
standard-but-loose as presented in this document and "user
interface matter" as described in RFC 1123.

(11) The discussion in 3.1.6 almost certainly needs to include
some comments on the effect of mapping, especially UTR 46
mapping, on comparison issues.  The existing text comes close
to assuming that all "Unicode" (native character) forms of IDN
labels are U-labels and hence duals of A-labels.  UTR 46
effectively rejects that assumption.
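
The effect of mapping on the duality assumption can be seen with a
small sketch.  An important caveat: Python's built-in "idna" codec
implements the older IDNA2003 rules (RFC 3490 with nameprep), not
UTR 46 or IDNA2008, so it is used here only as a readily available
example of a mapping step.  The helper names are assumptions made for
this illustration.

```python
def to_ascii(label: str) -> str:
    """Map and encode a native-character label (IDNA2003 rules)."""
    return label.encode("idna").decode("ascii")

def to_unicode(ascii_label: str) -> str:
    """Decode an ACE label back to native characters (IDNA2003 rules)."""
    return ascii_label.encode("ascii").decode("idna")

typed = "Bücher"                  # what a user might type
ace = to_ascii(typed)             # mapping lower-cases before encoding
round_trip = to_unicode(ace)

# The typed form is not recovered, so it was not the dual of the ACE form.
is_dual = (round_trip == typed)

# Two different native-character forms collapse to one ACE form.
collapses = (to_ascii("Bücher") == to_ascii("bücher"))
```

Here is_dual is false and collapses is true: a comparison done on the
native-character strings and one done on the mapped ASCII forms reach
different conclusions about the same pair of labels.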

(12) In 3.1.4, it might also be worth noting that certain DNS
tricks that are widely popular in practice (e.g., the Akamai
one) may have the consequence that the same name resolves to
different IP addresses depending on when and from where the
question is asked, the network load when it is asked, etc.

(13) Section 3.3, Paragraph 3, editorial:  
Old:
	each of which has their own rules
New:
	each of which has its own rules

(14) Section 3.3 is titled "URIs and IRIs" but does not touch
on IRIs --especially the messy problem of whether IRIs can be
compared at all without conversion to URIs and whether that
conversion is sufficient to produce comparable forms-- at all.
The term "IRI" doesn't even appear in the section.  I suggest
changing the title to "URIs", but don't see how an i18n
identifier comparison document can avoid them entirely, as this
draft seems to.

(15) Section 4: See comments above about variants and note that
there is no DNS mechanism for identifying whether a pair of
FQDNs are "Chinese" variants (i.e., separately-delegated domain
subtrees that are expected to have some relationship based on
character associations) or "Saudi numeric" variants (i.e., two
forms of the same labels in which digit types are substituted,
typically in the user interface).  Either can produce what the
user would see as a false negative.  Attempts to produce the
alternate forms algorithmically for comparison purposes could
produce false positives.

    --john
