Re: [I18n-discuss] FW: New Version Notification for draft-iab-identifier-comparison-01.txt
John C Klensin <john-ietf@jck.com> Mon, 09 April 2012 14:08 UTC
Date: Mon, 09 Apr 2012 10:08:52 -0400
From: John C Klensin <john-ietf@jck.com>
To: Dave Thaler <dthaler@microsoft.com>, i18n-discuss@iab.org
Message-ID: <6BE45A5D16245553F661236A@PST.JCK.COM>

--On Monday, March 12, 2012 20:48 +0000 Dave Thaler <dthaler@microsoft.com> wrote:

> Updates since -00 include:
>
> 1) Updated references, including quoted text to be from the
> most recent relevant RFCs
>
> 2) To try to avoid a portion of the (non-technical) process
> pushback we got at Appsarea and the plenary, no longer
>...

Dave,

Sorry for the delay on this detailed review. I'll let the IANA and ITU Programs make my apologies for me.

I like the general thrust of the document, but I have a lot of quibbles. I hope not too many of them run counter to the watering-down you did as a consequence of the earlier external comments. With two major exceptions (which I identify), I haven't tried to separate comments below that are editorial and semi-trivial from those that are more substantive. Anything for which there is disagreement will need discussion either way.

Warning: The suggestions below push one set of examples that I think we need to address, if only because a new theory of "sometimes they are and sometimes they are not" name equivalence is being introduced into the community, one with much nastier implications than the discussion that now appears in the document. In particular, it essentially reintroduces the old notion that two URIs are "the same" if they access the same content when the relevant Method is applied. Comparison systems that follow the URI spec are extremely likely to get different results than systems that use the model that content-identity equals identifier equivalence. That issue is similar to, but different from, the address-equivalence model addressed in Section 3.1.4.

(1) Section 1(3), text reading:

   "in the email arriving at the holder's email server which has the repository of all email accounts on that server."

Small distinction: mail stores ("repositories for incoming mail") are not necessarily on the same host as the email server and may be distributed across several machines.
And the repository for mail isn't necessarily where the matching is done -- matching does, necessarily, occur on the mail server.

It occurs to me that, as this mess evolves, and given the very strong rule that only the delivery MTA can know how to interpret an address, it may be desirable to have an SMTP extension that works a little bit like VRFY but takes two addresses and returns information as to whether they match -- possibly differentiating between matching as a coding issue (can't change) and matching as an alias one (could change if aliases change). In a better world, one could think about reopening the subaddress can of worms, but that would probably doom the proposal. Such a command, like VRFY, could return "I'm not going to tell you" if the server considered that necessary, leaving whatever wants to know no worse off than they are today. I'd be happy to write such a command up if others feel it would be worth the effort.

(2) Major issue -- perhaps more textual/presentation than substantive, but certainly affecting substance: In reading this draft starting with the comparison typology and picture at the end of Section 1, I realized that there are many places where it isn't actually talking about comparison at all. What we do when we are trying to compare two identifiers is to canonicalize them into a single form that would be bit-string equal if the two identifiers matched. That form might be either a lexical canonicalization of the original or some standardized surrogate for it. But the canonicalization is the important operation and is what "can be complicated" (next-to-last paragraph, Section 1); actual comparison is largely trivial. You essentially introduce that concept in Section 2.3, but don't take as much advantage of it as I think would be helpful. I'm going to assume that model (and some changes of text) in several of my comments below.
In particular, it means that the more complex operations that this document (and others in the IETF) assume are carried out on a pair of strings instead only require a clear definition of the canonical comparison form for each string type (and the function that gets to it) -- the two-string operation is trivial. This is, among other things, consistent with the way our comparator registries and hash functions are defined.

(3) Section 1, last sentence. This sentence starts talking about hierarchies of caches generally, uses web pages as a parenthetical example, then makes a statement about "authoritative web server[s]". Either make the whole thing a web example or make it about hierarchical caches with the web as an example. Incidentally, HTTP 1.1 [RFC2616, particularly Section 10.2.4] doesn't define an "authoritative web server", only "non-authoritative information" and an "origin server". IIRC, that was done deliberately because an intermediate server is not required to return a 203 code if it knows that the information it is returning is identical to what the origin server would have produced.

(4) Section 2, bullet 1: You may need to define "security identifier" for the benefit of some readers of this text. It is another place where the "canonical comparison form" concept may be important.

(5) Section 2.1, bullet 2, text reads: "URI scheme names are defined to be a case-insensitive match". It sort of slips the ASCII requirement in later in the paragraph, perhaps leading the reader to believe that case-insensitive comparisons for non-ASCII characters are well defined (i.e., without the use of controversial operations or magic). Suggestion: "URI scheme names are required to be ASCII and are defined to match in a case-insensitive way". Or words to that effect (and then clean up the rest of the paragraph).

(6) Section 2.2 paragraph 1.
Text reads:

   '(Often the term "normalization" is used synonymously with "canonicalization", but in internationalization the term normalization has a precise meaning, and so we use the generic term canonicalization here instead.)'

Unfortunately, Unicode supplies a precise meaning to "canonicalization" as well. The meaning is not necessarily consistent with the one used in the draft (and is definitely not consistent with the usage I'm advocating in note (2) above). John Tukey would certainly have suggested that we should be inventing a new word to be absolutely clear. I suggest that we should use the construction outlined in note (2) and make clear in the Introduction that variations on "canonical" in this document mean what we say they mean -- in terms of a per-identifier-type form and function -- despite the efforts of TUS to preempt the term.

In any event, canonicalization is not an algorithm for comparison. It is a transformation applied to both identifiers to make it possible to determine equivalence of the identifiers by comparing the canonical (including surrogate) forms.

(7) In Section 2.3, it might be helpful to deal with current events by adding another example. Suppose that Foo Corp actually writes its name as "Fô Corp" where that is feasible, using "FooCorp.example" and "FôCorp.example" as variant domain names. It would be quite natural for Fô Corp employees and algorithms to treat http://example.com/stuff/FooCorp/alice and http://example.com/stuff/FôCorp/alice as matching, although example.com might have no clue that particular matching was intended. The match would also be a violation of a strict reading of the URI spec, but, in the land in which a slightly different Alice travels, Humpty Dumpty presumably invents comparison procedures as well as word definitions and that is not an impediment. It is not clear whether the prohibition on non-ASCII characters in URIs provides protection or not.
It is certainly intended to not be protective in the case of ICANN variants. And, of course, if example.com's customer was actually ColourCorp rather than FooCorp, there would be no protection at all.

(8) In Section 3.1, please do not use "hostname" as a synonym for what we've called "LDH label" or "NR-LDH label" in RFC 5890 and elsewhere. "Hostname" has other -- very confused and contradictory -- definitions and semantics, including questions about whether the term refers to an FQDN or just a label and whether it can refer to a non-terminal DNS node or be associated with an alias-type RR. If I-D.ietf-pkix-rfc5280-clarifications uses "hostname" that way, it should be fixed. I thought we had gotten that confusion out of RFC 6055 but, if we did not, it is errata time. As always, I look forward to this bit of confusion being completely cleared up with the publication of draft-ietf-dnsext-dns-authoritative-terminology and its achieving broad consensus. :-(

(9) Another issue with Section 3.1 is illustrated by the following example. Suppose we have

   example.com.  IN A   10.0.0.6
   example.net.  CNAME  example.com.
   example.org.  IN A   10.0.0.6

Now, as an exercise, figure out which pairs of those names are "equal". As an even more interesting exercise, construct a similar example using DNAME or perhaps a hypothetical "VARIANT" RR type.

(10) I find parts of 3.1.1 very troubling. Separate note when I figure out how to explain what is bothering me, but at least some of it may involve the relationship between standard-but-loose as presented in this document and "user interface matter" as described in RFC 1123.

(11) The discussion in 3.1.6 almost certainly needs to include some comments on the effect of mapping, especially UTR 46 mapping, on comparison issues. The existing text comes close to assuming that all "Unicode" (native character) forms of IDN labels are U-labels and hence duals of A-labels. UTR 46 effectively rejects that assumption.
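
A minimal sketch of the point in (11), under loudly stated assumptions: it uses Python's built-in "idna" codec, which implements the older IDNA2003 mapping (nameprep) and is only a stand-in for UTR 46 mapping, and the helper names to_a_label and is_u_label_form are hypothetical, not anything from the draft. It illustrates that once mapping is in play, a native-character form can compare equal to another name after conversion without itself being the dual of the resulting A-label.

```python
def to_a_label(name: str) -> bytes:
    """Map and encode a domain name with Python's built-in IDNA2003 codec.

    Assumption: IDNA2003 (nameprep) mapping stands in for UTR 46 mapping here.
    """
    return name.encode("idna")


def is_u_label_form(name: str) -> bool:
    """Return True if the name survives an A-label round trip unchanged.

    A form that fails this check was altered by mapping, so it is not the
    dual of the A-label it produces.
    """
    return to_a_label(name).decode("idna") == name


typed = "Bücher.example"    # hypothetical form as a user might type it
mapped = "bücher.example"   # the form the mapping step produces

# Both forms convert to the same ASCII-compatible name...
print(to_a_label(typed) == to_a_label(mapped))

# ...but only one of them round-trips, i.e. behaves like a U-label.
print(is_u_label_form(typed))
print(is_u_label_form(mapped))
```

Whether the two inputs above "match" therefore depends on whether the comparison is defined on the native forms or on the mapped and converted forms, which seems to be the distinction the text in 3.1.6 would need to draw.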

(12) In 3.1.4, it might also be worth noting that certain DNS tricks that are widely popular in practice (e.g., the Akamai one) may have the consequence that the same name resolves to different IP addresses depending on when and from where the question is asked, the network load when it is asked, etc.

(13) Section 3.3, Paragraph 3, editorial:

   Old: each of which has their own rules
   New: each of which has its own rules

(14) Section 3.3 is titled "URIs and IRIs" but does not touch on IRIs -- especially the messy problem of whether IRIs can be compared at all without conversion to URIs and whether that conversion is sufficient to produce comparable forms -- at all. The term "IRI" doesn't even appear in the section. I suggest changing the title to "URIs", but don't see how an i18n identifier comparison document can avoid IRIs entirely, as this draft seems to.

(15) Section 4: See comments above about variants and note that there is no DNS mechanism for identifying whether a pair of FQDNs are "Chinese" variants (i.e., separately-delegated domain subtrees that are expected to have some relationship based on character associations) or "Saudi numeric" variants (i.e., two forms of the same labels in which digit types are substituted, typically in the user interface). Either can produce what the user would see as a false negative. Attempts to produce the alternate forms algorithmically for comparison purposes could produce false positives.

   --john