Re: [I18n-discuss] FW: New Version Notification for draft-iab-identifier-comparison-01.txt

Dave Thaler <dthaler@microsoft.com> Wed, 09 May 2012 01:10 UTC

From: Dave Thaler <dthaler@microsoft.com>
To: John C Klensin <klensin@jck.com>, "i18n-discuss@iab.org" <i18n-discuss@iab.org>
Date: Wed, 09 May 2012 01:09:57 +0000

Version -02 has now been posted.  This incorporates changes
based on Marc Blanchet's feedback and John Klensin's feedback.
I posted the changes based on Marc's feedback on 4/24.

Responses to John's feedback are below.

> I like the general thrust of the document, but I have a lot of
> quibbles.  I hope not too many of them run counter to the
> watering-down you did as a consequence of the earlier external
> comments.
> 
> With two major exceptions (which I identify), I haven't tried to
> separate comments below that are editorial and semi-trivial from
> those that are more substantive.  Anything for which there is
> disagreement will need discussion either way.
> 
> Warning: The suggestions below push one set of examples that I
> think we need to address, if only because a new theory of
> "sometimes they are and sometimes they are not" name
> equivalence is being introduced into the community, one with
> much nastier implications than the discussion that now appears
> in the document.   In particular, it essentially reintroduces
> the old notion that two URIs are "the same" if they access the
> same content when the relevant Method is applied.  Comparison
> systems that follow the URI spec are extremely likely to get
> different results than systems that use the model that
> content-identity equals identifier equivalence.  That issue is
> similar to, but different from, the address-equivalence model
> addressed in Section 3.1.4.

Added Section 3.3.6, for parity with 3.1.4:
    As with Section 3.1.4 for hostnames, it may be tempting
    to define a URI comparison algorithm based on whether they resolve
    to the same content.  Similar problems exist, however, including
    content that dynamically changes over time or based on factors such
    as the requester's location, potential lack of external connectivity
    at the time/place comparison is done, potentially undesirable delay
    introduced, etc.

    In addition, as noted in Section 3.1.4, resolution
    leaks information about security decisions to outsiders if the
    queries are publicly observable.

> (1) Section 1(3), text reading:
> 	"in the email arriving at the holder's email server which
> 	has the repository of all email accounts on that server."
> 
> Small distinction: mail stores ("repositories for incoming
> mail") are not necessarily on the same host as the email server
> and may be distributed across several machines.  And the
> repository for mail isn't necessarily where the matching is
> done -- matching does, necessarily, occur on the mail server.

The term "repository" was originally intended to encompass the
distributed system of servers including the mail server and the 
mail stores.  However, I don't want to go into detail in the document,
as the point is supposed to be a high level one.
Changed to
    in the email arriving at the holder's email server which 
    has access to the mail stores.

> It occurs to me that, as this mess evolves, given the very strong
> rule that only the delivery MTA can know how to interpret an
> address, it may be desirable to have an SMTP extension that works
> a little bit like VRFY but takes two addresses and returns
> information as to whether they match --possibly differentiating
> between matching as a coding issue (can't change) and matching
> as an alias one (could change if aliases change).  In a better
> world, one could think about reopening the subaddress can of
> worms, but that would probably doom the proposal.  Such a
> command, like VRFY, could return "I'm not going to tell you" if
> the server considered that necessary, leaving whatever wants to
> know no worse off than they are today.  I'd be happy to write
> such a command up if others feel it would be worth the effort.
> 
> 
> (2) Major issue -- perhaps more textual/presentation than
> substantive, but certainly affecting substance:   
> 
> In reading this draft starting with the comparison typology
> and picture at the end of Section 1, I realized that there are
> many places where it isn't actually talking about comparison at
> all.  What we do when we are trying to compare two identifiers
> is to canonicalize them into a single form that would be
> bit-string equal if the two identifiers matched.  That form
> might be either a lexical canonicalization of the original or
> some standardized surrogate for it.  But the canonicalization
> is the important operation and is what "can be complicated"
> (next-to-last paragraph, Section 1); actual comparison is
> largely trivial.  You essentially introduce that concept in
> Section 2.3, but don't take as much advantage of it as I think
> would be helpful.

Actually it was introduced in the first paragraph of 2.2.
I took your feedback as suggesting it be more prominent, and
so I've moved that text to the end of the introduction section
and given it its own subsection.

However, that text noted that canonicalization + bitwise equality
is the "most common" comparison algorithm.  That's because it's
not the only one.  Comparison algorithms can of course also be
defined and/or implemented via equivalence tables where no
single value is canonical.   Of course canonicalization is the
most common because often you want to use the canonical form
for something, such as output.

I figured it was worth a short note to the effect that defining
comparison doesn't necessarily require defining canonicalization,
so added this text:

   While the most common method of comparison includes canonicalization,
   comparison can also be done by defining an equivalence algorithm,
   where no single form is canonical.  However in most cases, a
   canonical form is useful for other purposes, such as output,
   and so in such cases defining a canonical form suffices to
   define a comparison method.
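
As a rough illustration of that distinction (this sketch is not from the
draft; the table contents and the name `equivalent` are assumptions made up
for the example), a comparison defined by an equivalence table, with no
member singled out as canonical, might look like this in Python:

```python
# Hypothetical equivalence table: each set lists identifiers that are
# defined to match one another. No member is "the" canonical form.
EQUIVALENCE_CLASSES = [
    frozenset({"color", "colour"}),
    frozenset({"gray", "grey"}),
]


def equivalent(a: str, b: str) -> bool:
    """Return True if a and b are identical or share an equivalence class."""
    if a == b:
        return True
    return any(a in cls and b in cls for cls in EQUIVALENCE_CLASSES)
```

Note that nothing here yields a form suitable for output; picking one member
of each set for that purpose would turn the table into a canonicalization.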

> I'm going to assume that model (and some changes of text) in
> several of my comments below.  In particular, it means that the
> more complex operations that this document (and others in the
> IETF) assume are carried out on a pair of strings instead only
> require a clear definition of the canonical comparison form for
> each string type (and the function that gets to it) -- the
> two-string operation is trivial.   This is, among other things,
> consistent with the way our comparator registries and hash
> functions are defined.
> 
> 
> (3) Section 1, last sentence.  This sentence starts talking
> about hierarchies of caches generally, uses web pages as a
> parenthetical example, then makes a statement about
> "authoritative web server[s]".  Either make the whole thing a
> web example or make it about hierarchical caches with the web
> as an example.  Incidentally, HTTP 1.1 [RFC2616, particularly
> Section 10.2.4] doesn't define an "authoritative web server"
> only "non-authoritative information" and an "origin server".
> IIRC, that was done deliberately because an intermediate server
> is not required to return a 203 code if it knows that the
> information it is returning is identical to what the origin
> server would have produced. 

Made the whole thing a web example.  Now reads:
    For example, when a hierarchy of web caches exists, each cache
    is itself a repository of a sort, and the match process is 
    usually intended to be the same as on the origin server.

> (4) Section 2, bullet 1: You may need to define "security
> identifier" for the benefit of some readers of this text.  It
> is another place where the "canonical comparison form" concept
> may be important.

The term "security identifier" is not presently used anywhere in
the document.   Since you only used "may", I assumed it was
merely a suggestion not a request, so no change made to this section.
If anyone does feel strongly, please suggest text :)

> (5) Section 2.1, bullet 2, text reads: 
> 	"URI scheme names are defined to be a case-insensitive
> 	match". 
> 
> It sort of slips the ASCII requirement in later in the
> paragraph, perhaps leading the reader to believe that
> case-insensitive comparisons for non-ASCII characters are well
> defined (i.e., without the use of controversial operations or
> magic).  Suggestion:
> 
> 	"URI scheme names are required to be ASCII and are defined
> 	to match in a case-insensitive way".
> 
> Or words to that effect (and then clean up the rest of the
> paragraph).

Done.  Now reads:
    URI scheme names are required to be ASCII and are defined
    to match in a case-insensitive way; the comparison is
    thus definite since all parties agree on how to
    do a case-insensitive match among ASCII strings.
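
A minimal sketch of such a match, for illustration only (the function name
`scheme_equal` and the choice to raise on non-ASCII input are assumptions,
not anything the draft or the URI spec prescribes):

```python
import string

# Fold only A-Z to a-z, so the result does not depend on locale or on
# Unicode case-mapping rules.
_ASCII_FOLD = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def scheme_equal(a: str, b: str) -> bool:
    """Compare two URI scheme names using ASCII-only case folding."""
    if not (a.isascii() and b.isascii()):
        raise ValueError("URI scheme names are required to be ASCII")
    return a.translate(_ASCII_FOLD) == b.translate(_ASCII_FOLD)
```

Restricting the fold to the 26 ASCII letters is what makes the result the
same for every party; a general-purpose Unicode lowercase would bring back
the ambiguity the new wording is meant to rule out.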

> (6) Section 2.2 paragraph 1.  Text reads:
> 
> 	'(Often the term "normalization" is used synonymously with
> 	"canonicalization", but in internationalization the term
> 	normalization has a precise meaning, and so we use the
> 	generic term canonicalization here instead.)'
> 
> Unfortunately, Unicode supplies a precise meaning to
> "canonicalization" as well.  The meaning is not necessarily
> consistent with the one used in the draft (and is definitely
> not consistent with the usage I'm advocating in note (2)
> above).  John Tukey would certainly have suggested that we should
> be inventing a new word to be absolutely clear.   I suggest
> that we should use the construction outlined in note (2) and
> make clear in the Introduction that variations on "canonical"
> in this document mean what we say it means --in terms of a
> per-identifier-type form and function-- despite the efforts
> of TUS to preempt the term.

Text now reads:
    Perhaps the most common algorithm for comparison involves first
    converting each identifier to a canonical form (a process known
    as "canonicalization" or "normalization"), and then testing
    the resulting canonical representations for bitwise equality.
    In so doing, it is thus critical that all entities involved agree
    on the same canonical form and use the same canonicalization
    algorithm so that the overall comparison process is also the same.

    Note that in some contexts, such as in internationalization, the
    terms "canonicalization" and "normalization" have a precise
    meaning.  In this document, however, we use these terms
    synonymously in their more generic form, to mean conversion
    to some standard form.
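
To make the "all entities must agree" point concrete, here is a small
Python sketch. The two canonicalization functions and the identifier
strings are assumptions chosen for the example, not forms defined by the
draft:

```python
from typing import Callable

Canonicalizer = Callable[[str], bytes]


def ascii_lower(identifier: str) -> bytes:
    """Canonical form: ASCII bytes with A-Z folded to a-z."""
    return identifier.encode("ascii").lower()


def as_is(identifier: str) -> bytes:
    """Canonical form: the ASCII bytes unchanged (case is significant)."""
    return identifier.encode("ascii")


def identifiers_match(a: str, b: str, canonicalize: Canonicalizer) -> bool:
    """Canonicalize both identifiers, then test for bitwise equality."""
    return canonicalize(a) == canonicalize(b)


# Two parties that picked different canonical forms disagree on one pair.
token_service_says = identifiers_match("FooCorp", "foocorp", ascii_lower)
resource_holder_says = identifiers_match("FooCorp", "foocorp", as_is)
```

Here `token_service_says` is True while `resource_holder_says` is False,
which is the kind of mismatch that the security discussion is concerned
with.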

> In any event, canonicalization is not an algorithm for
> comparison.  It is a transformation applied to both identifiers
> to make it possible to determine equivalence of the identifiers
> by comparing the canonical (including surrogate) forms.
> 
> 
> (7) In Section 2.3, it might be helpful to deal with current
> events by adding another example.  Suppose that Foo Corp
> actually writes its name as "Fô Corp" where that is feasible,
> using "FooCorp.example" and "FôCorp.example" as variant domain
> names.  It would be quite natural for Fô Corp employees and
> algorithms to treat
> 
> 	http://example.com/stuff/FooCorp/alice
> and
> 	http://example.com/stuff/FôCorp/alice
> 
> as matching, although example.com might have no clue that
> particular matching was intended.  The match would also be a
> violation of a strict reading of the URI spec, but, in the land
> in which a slightly different Alice travels, Humpty Dumpty
> presumably invents comparison procedures as well as word
> definitions and that is not an impediment.
> 
> It is not clear whether the prohibition on non-ASCII characters
> in URIs provides protection or not.  It is certainly intended
> to not be protective in the case of ICANN variants.  And, of
> course, if example.com's customer was actually ColourCorp
> rather than FooCorp, there would be no protection at all.

Added this text:
    Furthermore, consider an attacker using a similar corporation
    such as "foocorp" (or any variation containing a non-ASCII character
    that some humans might expect to represent the same corporation).
    If the resource holder treats them as different, but the
    security token service treats them as the same, then again
    elevation of privilege can occur.

> (8) In Section 3.1, please do not use "hostname" as a synonym
> for what we've called "LDH label" or "NR-LDH label" in RFC 5890
> and elsewhere.  "Hostname" has other --very confused and
> contradictory-- definitions and semantics, including questions
> about whether the term refers to an FQDN or just a label and
> whether it can refer to a non-terminal DNS node or be
> associated with an alias-type RR.  If
> I-D.ietf-pkix-rfc5280-clarifications uses "hostname" that way,
> it should be fixed.   I thought we had gotten that confusion
> out of RFC 6055 but, if we did not, it is errata time.

Section 3.1 itself was actually consistent in always using label (never
hostname) where it meant label.  However, that wasn't clear to the 
reader.  Hence made this explicit, starting with:

    Hostnames (composed of dot-separated labels) are commonly used ...
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Also changed two uses of "string" in section 3.1.3 to "label"
again to be more explicit.  Resulting text shown under (11) below.

> As always, I look forward to this bit of confusion being
> completely cleared up with the publication of
> draft-ietf-dnsext-dns-authoritative-terminology and its
> achieving broad consensus.  :-(
> 
> 
> (9) Another issue with Section 3.1 is illustrated by the
> following example.  Suppose we have
>     example.com.  IN A 10.0.0.6
>     example.net.  CNAME example.com.
>     example.org.  IN A 10.0.0.6
> 
> now, as an exercise, figure out which pairs of those names are
> "equal".  As an even more interesting exercise, construct a
> similar example using DNAME or perhaps a hypothetical "VARIANT"
> RR type. 

Added to section 3.1.4.  
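
For anyone who wants to work the exercise, the following Python sketch
runs it against a toy in-memory table holding the three records from the
example. It is not a real resolver, and the three "sameness" definitions
and their function names are assumptions made for illustration:

```python
# Toy zone data copied from the example; no real DNS queries are made.
ZONE = {
    "example.com.": ("A", "10.0.0.6"),
    "example.net.": ("CNAME", "example.com."),
    "example.org.": ("A", "10.0.0.6"),
}


def resolve(name: str, max_hops: int = 8) -> tuple[str, str]:
    """Follow CNAMEs in ZONE; return (final owner name, address)."""
    for _ in range(max_hops):
        rtype, value = ZONE[name]
        if rtype == "A":
            return name, value
        name = value  # CNAME: continue at the target name
    raise RuntimeError("CNAME chain too long")


def same_by_name(a: str, b: str) -> bool:
    """Equal only if the names themselves match (ASCII case-insensitive)."""
    return a.lower() == b.lower()


def same_by_alias(a: str, b: str) -> bool:
    """Equal if both names end at the same owner name after CNAMEs."""
    return resolve(a)[0] == resolve(b)[0]


def same_by_address(a: str, b: str) -> bool:
    """Equal if both names resolve to the same address."""
    return resolve(a)[1] == resolve(b)[1]
```

Under these definitions no pair is equal by name, only example.com and
example.net are equal by alias, and all three pairs are equal by address,
so the answer depends entirely on which definition the comparator picks.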

> (10) I find parts of 3.1.1 very troubling.  Separate note when
> I figure out how to explain what is bothering me, but at least
> some of it may involve the relationship between
> standard-but-loose as presented in this document and "user
> interface matter" as described in RFC 1123.

No change.

> (11) The discussion in 3.1.6 almost certainly needs to include
> some comments of the effect of mapping, especially UTR 46
> mapping, on comparison issues.  The existing text comes close
> to assuming that all "Unicode" (native character) forms of IDN
> labels are U-labels and hence duals of A-labels.  UTR 46
> effectively rejects that assumption.

Above was actually about 3.1.3.  I assume the existing text
you refer to was where it referred to the "equivalent Unicode
string ("U-label")."

The original intent of this text wasn't to get into UTR 46 issues.
(If folks think that's a good idea, text welcome.  Seems like a 
quagmire to me though :).
Rephrased to:
    A hostname comparator thus needs to decide whether a Punycode-encoded
    label should or should not be considered a valid hostname label, 
    and if so, then whether it should match a label encoded in some 
    other form such as a percent-encoded Unicode label (U-label).
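
The decision described there can be sketched as a flag on the comparator.
This Python example uses the standard library's "punycode" codec; the
function names, the `decode_punycode` flag, and the sample label are
assumptions for illustration, and the code deliberately performs no IDNA
validation or UTR 46 mapping:

```python
ACE_PREFIX = "xn--"


def label_to_unicode(label: str) -> str:
    """Decode a Punycode-encoded label; return other labels lowercased.

    Simplification: str.lower() is used for case folding, and no IDNA
    validity checks are applied to the decoded result.
    """
    lowered = label.lower()
    if lowered.startswith(ACE_PREFIX):
        return lowered[len(ACE_PREFIX):].encode("ascii").decode("punycode")
    return lowered


def labels_match(a: str, b: str, decode_punycode: bool) -> bool:
    """Compare two labels, optionally treating Punycode as an encoding."""
    if decode_punycode:
        return label_to_unicode(a) == label_to_unicode(b)
    return a.lower() == b.lower()
```

With `decode_punycode=True` the labels "xn--bcher-kva" and "bücher" match;
with `decode_punycode=False` they do not, so two comparators that made
different choices would disagree about the same pair.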

> (12) In 3.1.4, it might also be worth noting that certain DNS
> tricks that are widely popular in practice (e.g., the Akamai
> one) may have the consequence that the same name resolves to
> different IP addresses depending on when and from where the
> question is asked, the network load when it is asked, etc.

Done. Also added to the list of problems
the fact that it requires both connectivity (which might not be
present at the time) and a willingness to wait for the answer.

> (13) Section 3.3, Paragraph 3, editorial:  
> Old:
> 	each of which has their own rules
> New:
> 	each of which has its own rules

Done.

> (14) Section 3.3 is titled "URIs and IRIs" but does not touch
> on IRIs --especially the messy problem of whether IRIs can be
> compared at all without conversion to URIs and whether that
> conversion is sufficient to produce comparable forms-- at all.
> The term "IRI" doesn't even appear in the section.  I suggest
> changing the title to "URIs", but don't see how an i18n
> identifier comparison document can avoid them entirely, as this
> draft seems to.

Done.

> (15) Section 4: See comments above about variants and note that
> there is no DNS mechanism for identifying whether a pair of
> FQDNs are "Chinese" variants (i.e., separately-delegated domain
> subtrees that are expected to have some relationship based on
> character associations) or "Saudi numeric" variants (i.e., two
> forms of the same labels in which digit types are substituted,
> typically in the user interface).  Either can produce what the
> user would see as a false negative.  Attempts to produce the
> alternate forms algorithmically for comparison purposes could
> produce false positives.

How's this text:
    First, there is no DNS mechanism for identifying whether two
    strings (such as "color" and "colour", although many non-English
    cases occur such as Saudi numeric strings, different forms of
    Chinese strings, etc.) would be seen by a human as being equivalent.
    Attempts to produce such alternate forms algorithmically could
    produce false positives and hence have an adverse effect on security.
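
A toy Python sketch of how an algorithmic rule produces such false
positives. The folding rule ("ou" treated as "o") and the function names
are invented for this example and are not proposed anywhere in the draft:

```python
def naive_fold(label: str) -> str:
    """Hypothetical rule: lowercase, then treat "ou" as equivalent to "o"."""
    return label.lower().replace("ou", "o")


def naive_match(a: str, b: str) -> bool:
    """Compare two labels after applying the hypothetical folding rule."""
    return naive_fold(a) == naive_fold(b)
```

The rule gives the intended result for "color" and "colour", but it also
reports "four" and "for" as matching, which no human reader would consider
equivalent.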

-Dave
