Re: [urn] Talking about fragments in URNs

Sean Leonard <dev+ietf@seantek.com> Wed, 29 June 2016 15:38 UTC

To: "Hakala, Juha E" <juha.hakala@helsinki.fi>, "urn@ietf.org" <urn@ietf.org>
References: <CALaySJ+NpCY9dMhuz+yyP4N0x6cO7D+iHVvaeAhoV2QxNvDPXQ@mail.gmail.com> <aae7f6bf-b42f-3063-4422-ba139b6eb974@seantek.com> <d6528125-0f16-70c6-7048-5d29e53b427e@stpeter.im> <4839DC906BA288B105BFE29F@JcK-HP8200> <VI1PR07MB1727935B40F4047972F7A163FA230@VI1PR07MB1727.eurprd07.prod.outlook.com> <efc6b6cc-fd83-0acf-15a3-2598e41a90df@seantek.com> <VI1PR07MB1727D9CC015A1C1A34AD6A35FA230@VI1PR07MB1727.eurprd07.prod.outlook.com>
From: Sean Leonard <dev+ietf@seantek.com>
Message-ID: <4fc21c1e-f127-8b2f-18a5-f10fffd8f761@seantek.com>
Date: Wed, 29 Jun 2016 08:37:38 -0700
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:45.0) Gecko/20100101 Thunderbird/45.1.1
MIME-Version: 1.0
In-Reply-To: <VI1PR07MB1727D9CC015A1C1A34AD6A35FA230@VI1PR07MB1727.eurprd07.prod.outlook.com>
Content-Type: text/plain; charset="windows-1252"; format="flowed"
Content-Transfer-Encoding: quoted-printable
Archived-At: <https://mailarchive.ietf.org/arch/msg/urn/m-kdGlVxjx8jS5b2sdQf-5vhaxQ>
Subject: Re: [urn] Talking about fragments in URNs

On 6/29/2016 12:56 AM, Hakala, Juha E wrote:
> Hello Sean; all,
>
> some comments below.
>
>> -----Original Message-----
>> From: urn [mailto:urn-bounces@ietf.org] On Behalf Of Sean Leonard
>> Sent: 29 June 2016 9:54
>> To: urn@ietf.org
>> Subject: Re: [urn] Talking about fragments in URNs
>>
>> On 6/28/2016 10:39 PM, Hakala, Juha E wrote:
>>> Hello,
>>>
>>> there are at least four different aspects of equivalence or "naming the same thing".
>>> [...]
>> Thank you, Juha, for that.
>>
>> Upon reading the responses, I revise my proposal of "Name the Same
>> Thing". It appears that "NtST" means or can be implied to mean "semantic
>> equivalence", which is not what we are going for.
> Yes. People who actually create URNs must pay attention to semantic equivalence and use the namespace specific rules when they do that. But it is beyond the URNBIS scope to try to tell namespaces what they should be doing.
>> I read draft-ietf-urnbis-rfc2141bis-urn-17 and remain staunchly opposed to
>> the term "URN equivalence". The term is confusing and adds little value.
>> Even very intelligent people can't tell if the q-, r-, and f-components are
>> supposed to be included in the comparison or not.
> They are not, since then many namespaces would not be able to use them. Including f-component in the comparison would mean for instance an extension to the ISBN scope so that it could be used to identify component parts of books, which is not acceptable to the ISBN community. The problem can be bypassed simply by saying that fragment does not identify anything, which is not compliant with what at least some people think RFC 3986 requires. This dilemma has been solved by making URN palatable to the identifier systems.
>
>> Honestly, I don't see why we got away from the term used by Section 5 of
>> RFC 2141, "lexical equivalence". Folks, that is what we have been talking
>> about all along.
> Lexical equivalence might be a better term than URN equivalence, since URN equivalence can be understood to be too broad a concept. For instance, some people may think that URNs based on ISBN-10 and ISBN-13 of the same book are equivalent in the RFC2141bis sense of the word since they identify the same resource. However, these URNs are lexically equivalent only if additional namespace specific rules are applied to detect the equivalence. RFC 2141bis guidelines alone would not be sufficient to detect the equivalence.
>   
>> Lexical def.:
>> http://www.dictionary.com/browse/lexical?s=t
>> 1. of or relating to the words or vocabulary of a language, especially as
>> distinguished from its grammatical and syntactical aspects.
>> 2. of, relating to, or of the nature of a lexicon.
>> => Lexicon def.:
>> 1. a wordbook or dictionary, especially of Greek, Latin, or Hebrew.
>> [...]
>> 3. inventory or record.
>>
>> ***
>> That is what we are talking about. A namespace is an "inventory" (or
>> "dictionary") of names. Each name is different. But some names can be
>> written in a plurality of ways but still be the same name. Tomatoes and
>> tomatos are the same fruits (tomato is a fruit, in case you were wondering).
> OK, but in lexical analysis we would only be interested in one language at a time. We shall not try to determine anything across languages, for instance to determine if tomato and tomaatti mean the same thing.

Yes. Basically the comparison has to be computable offline, via the 
application of an algorithm specified in the namespace registration. And 
you can't get around this by registering the entire English and 
Italian dictionaries as part of your algorithm and expecting people to 
implement it.

>
>> It appears that the term "lexical" got dropped between draft-05 and
>> draft-06:
>> https://tools.ietf.org/rfcdiff?difftype=--hwdiff&url2=draft-ietf-urnbis-rfc2141bis-urn-06.txt
>>
>> for reasons unbeknownst to me...but as Peter Saint-Andre and Ryan Moats
>> were the editor/author at the time, I would like to ask (respectfully and
>> politely), "why?"
>>
>> Let's bring back "lexical equivalence", and call it a kind of scheme-specific
>> comparison (Section 6.2 of RFC 3986).
> OK, but in order to avoid misunderstandings we should say that lexical analysis is not namespace (identifier) specific but applies to all namespaces. From the RFC 3986 point of view it is scheme specific at the level of the URN scheme, not at the level of individual identifier schemes.
>
> We must also keep the last paragraph of 3.1 which says that namespace definitions may include additional rules for URN equivalence. These rules can be lexical or something else. For instance, the ISBN standard says that hyphens are not part of the ISBN and are only used to improve readability (a simple lexical rule), but there is also a complex lexical rule concerning generation of ISBN-13 from ISBN-10, and semantic rules concerning identification of resources.

Ok.

It sounds like we now have two votes/opinions in favor of "lexical 
equivalence".

In view of the computability property, it would be desirable for 
namespace registrants to provide a string comparison specification that 
can be compiled directly into a variety of implementations. What are 
standard notations for string comparison specifications? For example, 
rather than just saying in prose: "To compare ISBNs for lexical 
equivalence, you eliminate hyphens and then compare in a case-sensitive 
manner", you should say {elim: "-", compare: "C"}, and then 
copy-and-paste this (or compile this) into your implementation.
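
A minimal sketch of how such a declarative spec could be applied, in Python. 
The spec keys "elim" and "compare" mirror the notation above and are not any 
existing standard. The meaning of "C" (case-sensitive, code point by code 
point) and the extra "i" mode (case-insensitive) are assumptions made for 
illustration, as are the function names.

```python
# Hypothetical spec interpreter. The keys "elim" and "compare" follow the
# {elim: "-", compare: "C"} notation in the text; nothing here is standardized.

def normalize(name: str, spec: dict) -> str:
    """Apply the normalization steps described by the spec to one name."""
    # "elim": every character listed is removed from the name.
    for ch in spec.get("elim", ""):
        name = name.replace(ch, "")
    mode = spec.get("compare", "C")
    if mode == "C":
        # Assumed meaning: case-sensitive, so nothing more to do.
        return name
    if mode == "i":
        # Assumed extra mode: case-insensitive, implemented by lowercasing.
        return name.lower()
    raise ValueError(f"unknown compare mode: {mode!r}")


def lexically_equivalent(a: str, b: str, spec: dict) -> bool:
    """Two names are equivalent when their normalized forms are identical."""
    return normalize(a, spec) == normalize(b, spec)


# The ISBN rule from the prose: eliminate hyphens, compare case-sensitively.
ISBN_SPEC = {"elim": "-", "compare": "C"}

if __name__ == "__main__":
    print(lexically_equivalent("978-0-306-40615-7", "9780306406157", ISBN_SPEC))
```

The comparison runs entirely offline: it needs only the two strings and the 
spec taken from the registration.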

Regular expressions sort of fall in this category, but not really. A 
regex can test a single arbitrary string against a pattern, but we want 
to take two strings and compare them in a particular way. This is either 
called normalization, or something like "comparison operation". You can 
write regular expression-based patterns to eliminate characters, e.g., 
s/-//gi, but that is not a true comparison spec, that is a pattern 
substitution spec. I am looking for something like LDAP/X.500 Attribute 
syntax matching rules. A quick Google search did not turn up what I 
was looking for.

Juha brought up this ISBN-10 -> ISBN-13 thing. If the registrant of the 
ISBN namespace wants to provide a complex but still purely "lexical" 
algorithm to generate and compare the two, and that algorithm is 
computable, I say it is fair game. That is up to the namespace registrant.
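
For illustration, a sketch of what such a purely lexical ISBN-10 -> ISBN-13 
comparison could look like in Python. It assumes the usual conversion (drop 
the ISBN-10 check digit, prefix "978", recompute the check digit with 
alternating weights 1 and 3), works on the bare ISBN string rather than the 
full URN, and does not validate the incoming check digits. The function 
names are hypothetical and not taken from any registration.

```python
def isbn10_to_isbn13(isbn10: str) -> str:
    """Convert an ISBN-10 (hyphens allowed) to an unhyphenated ISBN-13."""
    core = isbn10.replace("-", "")
    if len(core) != 10:
        raise ValueError("expected 10 characters after removing hyphens")
    # Drop the old check digit and add the 978 prefix.
    body = "978" + core[:9]
    if not (body.isascii() and body.isdigit()):
        raise ValueError("first nine characters must be ASCII digits")
    # ISBN-13 check digit: weights alternate 1, 3, 1, 3, ... from the left.
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(body))
    return body + str((10 - total % 10) % 10)


def canonical_isbn(isbn: str) -> str:
    """Map either form to one unhyphenated 13-character string."""
    core = isbn.replace("-", "")
    if len(core) == 10:
        return isbn10_to_isbn13(core)
    if len(core) == 13:
        return core
    raise ValueError("not an ISBN-10 or ISBN-13")


def isbn_equivalent(a: str, b: str) -> bool:
    """Purely lexical comparison: no lookup, no network, no catalogue."""
    return canonical_isbn(a) == canonical_isbn(b)


if __name__ == "__main__":
    print(isbn_equivalent("0-306-40615-2", "978-0-306-40615-7"))
```

Everything here is string manipulation and arithmetic on the two inputs, 
which is what keeps it on the "lexical" side of the line.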

Sean