Re: [urn] Tuning the "URNs are not URIs" spec

Juha Hakala <juha.hakala@helsinki.fi> Wed, 18 June 2014 07:34 UTC

Message-ID: <53A14107.5020300@helsinki.fi>
Date: Wed, 18 Jun 2014 10:34:31 +0300
From: Juha Hakala <juha.hakala@helsinki.fi>
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:24.0) Gecko/20100101 Thunderbird/24.6.0
MIME-Version: 1.0
To: urn@ietf.org, John C Klensin <john-ietf@jck.com>
References: <964DC8688FC7E02C49E21E2D@JcK-HP8200.jck.com> <201406172015.s5HKF0Xx012074@hobgoblin.ariadne.com>
In-Reply-To: <201406172015.s5HKF0Xx012074@hobgoblin.ariadne.com>
Content-Type: text/plain; charset="ISO-8859-1"; format="flowed"
Content-Transfer-Encoding: 7bit
Archived-At: http://mailarchive.ietf.org/arch/msg/urn/wOGECW1XpwVrDMrLeABEUPZSD8Q
Subject: Re: [urn] Tuning the "URNs are not URIs" spec

Hello,

On 17.6.2014 23:15, Dale R. Worley wrote:
>> From: John C Klensin <john-ietf@jck.com>
>>     Instead, the question is whether the IETF is willing to
>>     evolve and adapt the URN definition to accommodate those
>>     perceived needs or whether it prefers to have that work
>>     done elsewhere, either by adoption in the broader
>>     community and marketplace of a different approach or,
>>     potentially, even a competing URN standard.
> As far as I can tell, the key question is "Is the syntax of RFC 2141
> ("URN syntax") inadequate for the needs of important groups of users?"
The urnbis charter requires (among other things) the WG to make

> an update of the formal syntax specification in the light of the
> URI Standard (STD 66, RFC 3986) using the ABNF from STD 68
> (RFC 5234)

In practice, two major changes are needed: the revised URN syntax
should allow the use of both fragment and query.

Using a URI fragment to indicate locations within identified
resources enables the creation of URN-based citations instead of
URL-based ones. But in order to avoid breaking the assignment rules of
traditional standard identifiers, RFC2141bis must make it clear that the
fragment is not part of the namespace-specific string (that is, from the
URN point of view, it does not identify anything).
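As a rough illustration of that separation (a sketch only, not text from
any draft), a parser could split off the query and fragment before it
looks at the namespace-specific string. The layout
urn:<NID>:<NSS>[?query][#fragment] and the fragment value "page=12" are
assumptions made for this example.

```python
import re
from typing import NamedTuple, Optional


class ParsedURN(NamedTuple):
    nid: str                 # namespace identifier, lowercased
    nss: str                 # namespace-specific string, kept as written
    query: Optional[str]     # resolver parameters, not part of the name
    fragment: Optional[str]  # location within the resource, not part of the name


# Assumed layout: urn:<NID>:<NSS>[?<query>][#<fragment>]
# "?" and "#" are not allowed inside an RFC 2141 NSS, so the split is unambiguous.
_URN_RE = re.compile(
    r"^urn:(?P<nid>[A-Za-z0-9][A-Za-z0-9-]{0,31}):"
    r"(?P<nss>[^?#]+)"
    r"(?:\?(?P<query>[^#]*))?"
    r"(?:#(?P<fragment>.*))?$",
    re.IGNORECASE,
)


def parse_urn(text: str) -> ParsedURN:
    """Split a URN into NID, NSS, optional query and optional fragment."""
    m = _URN_RE.match(text)
    if m is None:
        raise ValueError(f"not a URN: {text!r}")
    return ParsedURN(m["nid"].lower(), m["nss"], m["query"], m["fragment"])


def same_resource(a: str, b: str) -> bool:
    """Compare only NID and NSS; query and fragment are ignored.

    Simplification: the NSS is compared character by character, without
    normalising %-escapes.
    """
    pa, pb = parse_urn(a), parse_urn(b)
    return (pa.nid, pa.nss) == (pb.nid, pb.nss)
```

With this split, two citations that point to different places in the same
book still carry the same identifier, so the ISBN assignment rules are
untouched.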

There is definitely also a need to use the URI query for passing
resolution-related parameters to URN resolvers. URN resolvers currently
in use are dumb; of all the resolution services specified in RFC 2483
they usually support by default only URN - URL mapping (to a single URL
only). This is insufficient, since users may not want the document
itself. Instead, they may be interested in descriptive metadata about
the resource, or other versions of the resource (work), etc.

DDDS (Dynamic Delegation Discovery System) has been available since
2002 as a means for passing service-related information to resolvers,
but it has not become popular. The URI query provides a more lightweight
and technically sufficient means for making URN resolvers smarter.
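To make the idea concrete, here is a minimal sketch of a resolver that
picks the service from the query. The parameter name "s", the record
table and the example URN are all made up for illustration; only the
service names (I2L, I2C) come from RFC 2483.

```python
from urllib.parse import parse_qs

# Hypothetical resolver data: one record per URN, one entry per RFC 2483 service.
RECORDS = {
    "urn:nbn:fi-fe20140001": {
        "I2L": "https://repository.example.org/doc/1",        # URN to URL
        "I2C": {"title": "Example thesis", "format": "dc"},   # URN to metadata
    }
}


def resolve(urn_with_query: str, default_service: str = "I2L"):
    """Resolve a URN, using the (assumed) query parameter 's' to pick a service."""
    base, _, query = urn_with_query.partition("?")
    params = parse_qs(query)  # an empty query yields an empty dict
    service = params.get("s", [default_service])[0]
    record = RECORDS.get(base)
    if record is None:
        raise LookupError(f"unknown URN: {base}")
    if service not in record:
        raise LookupError(f"service {service} not supported for {base}")
    return record[service]
```

A request without a query behaves like today's resolvers (a single URL),
while "?s=I2C" would return the metadata record instead.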
> If the syntax is adequate, then various groups of users can define
> namespaces that have the properties that they desire without causing
> any upset in running code; at most, revisions would be needed to RFC
> 3406 ("URN namespace definition mechanisms").  (Unlike IPv4 addresses,
> we aren't in danger of running out of namespace identifiers.)
Services available in namespaces depend to a large extent on the
applications used to manage the identified resources. That is, the
resolution services supported by a namespace will change over time.

For instance, with their current systems (such as integrated library
systems and open repositories) libraries could in principle support all
the URN resolution services specified in RFC 2483. In fact, there is
already a problem, since metadata about a resource can be supplied in
many metadata formats, and RFC 2483 does not allow the libraries to
specify the format (it is supposed to be URC or Uniform Resource
Characteristics, which does not exist).

In the future, when e.g. national libraries establish digital
preservation systems, a lot of new services will be available. For
instance, a user will be able to check whether there are earlier / later
versions of the identified resource in the digital archive, compare
the intellectual content of successive versions of the work, which may
differ due to migrations needed to maintain usability, and check
which applications are needed to render the preferred version readable.

>
> (And for example, this flexibility allows multiple sets of identifiers
> which handle name/locator differences in distinctly different ways.)
>
> What seems to be constantly hinted at in this discussion -- but never
> explicitly stated -- is that important groups of users have needs that
> CANNOT be satisfied within the syntax of RFC 2141.

I have tried to make it clear that, for instance, the library community
needs additional functionality which can be provided by query and
fragment.

> Not just that the
> particular definitions of "fragment" and "query" in RFC 3986 are
> inadequate for their needs, but that those needs cannot be satisfied
> by *any alternative means* that can be represented within the syntax
> of RFC 2141.
I can't see an acceptable alternative to the use of fragments.
Everyone must be able to make citations and references, but identifier
assignment - when done properly - is a managed process.

DDDS could be used instead of query, but the slow / non-existent
adoption of DDDS during the last decade makes it clear, at least to me,
that an alternative method is needed.

>
> If that actually is so, it is a fairly concrete fact, it shouldn't be
> very difficult to make the case in a convincing way.  And once the
> case is made, it would be a sound reason to move ahead with
> significant changes to the status quo.

The problem is that what looks convincing to e.g. the library community
- and we are not the only significant group out here which is using
persistent identifiers - may not convince other groups which have
different requirements and applications. If there is no need to preserve
millions of documents for hundreds of years, even cool URIs may look
like a sufficient solution.
>
> One objection I can see to RFC 2141 is how restrictive it is:  The
> namespace-specific string can contain only letters (two cases),
> digits, and 15 of the 32 ASCII special characters.  (That's not
> counting %, since it has a special purpose.)  I can see how a group
> may want to have much more syntactic flexibility in designing an
> identifier, and they wouldn't enjoy %-encoding a large fraction of the
> characters in their identifiers.
In practice, many communities which use URNs do not see this as a
significant limitation. Standard identifiers often allow only a very
limited set of characters. And even if Unicode is allowed in principle
(as in the Handle System), it is not used in practice.
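For readers who want to see what the restriction means in practice, the
following sketch %-encodes everything outside the RFC 2141 character set
(letters, digits and the 15 "other" characters), using UTF-8 as RFC 2141
section 2.2 describes. The function name is made up for this example.

```python
import string

# The 15 "other" characters RFC 2141 allows unescaped in the NSS.
RFC2141_OTHER = "()+,-.:=@;$_!*'"
ALLOWED = set(string.ascii_letters + string.digits + RFC2141_OTHER)


def encode_nss(text: str) -> str:
    """%-encode every character that RFC 2141 does not allow as-is.

    Each disallowed character is converted to UTF-8 and every byte is
    written as %xx. The "%" sign itself is escaped as well.
    """
    out = []
    for ch in text:
        if ch in ALLOWED:
            out.append(ch)
        else:
            out.extend(f"%{b:02x}" for b in ch.encode("utf-8"))
    return "".join(out)
```

An ISBN or ISSN passes through unchanged, which is why the limitation is
rarely felt in these namespaces.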

The problem some traditional identifier systems such as ISBN and ISSN
are facing when they are applied to digital documents has to do with the
syntax. For instance, ISBN has already been revised once (from ISBN-10
to ISBN-13) to extend the scope of the system, and the next revision is
already under way.
>
> OTOH, there's no reason that a user group would have to *see* what
> their URNs look like very often.  Most user interfaces that would
> refer to sophisticated identifiers don't have to reveal the URNs that
> are used on the wire, any more than we have to see the bits that
> encode the characters of our e-mails.

Correct. If a resolution service (or services) exists, then making an
ISBN actionable by turning it into a URN can be an automated process.
And this actionability is all the ISBN users want from URNs. URN
implementation should not change the scope of the ISBN system in any way.
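A minimal sketch of such an automated conversion is shown below. It only
checks the ISBN check digit and prepends the "urn:isbn:" prefix; the ISBN
itself is passed through as written. The function name is made up, and a
real implementation would follow the URN:ISBN namespace registration
(RFC 3187) in detail.

```python
def isbn_to_urn(isbn: str) -> str:
    """Validate an ISBN-10 or ISBN-13 and return it as a URN.

    The ISBN is not altered (hyphens are kept); only the prefix is added.
    """
    written = isbn.strip()
    compact = written.replace("-", "").upper()

    if len(compact) == 13 and compact.isdigit():
        # ISBN-13: weights alternate 1, 3; the sum must be divisible by 10.
        total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(compact))
        valid = total % 10 == 0
    elif (
        len(compact) == 10
        and compact[:9].isdigit()
        and (compact[9].isdigit() or compact[9] == "X")
    ):
        # ISBN-10: weights 10 down to 1, "X" counts as 10; sum divisible by 11.
        values = [int(d) for d in compact[:9]]
        values.append(10 if compact[9] == "X" else int(compact[9]))
        total = sum(v * w for v, w in zip(values, range(10, 0, -1)))
        valid = total % 11 == 0
    else:
        valid = False

    if not valid:
        raise ValueError(f"not a valid ISBN: {isbn!r}")
    return "urn:isbn:" + written
```

Nothing in this step assigns or modifies an identifier; it only makes an
existing ISBN usable in a URN resolver.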
>
> For instance, there's no reason that DOIs can't be encoded much the
> same way that EIDRs are encoded (draft-pal-eidr-urn-03):
>
>      The DOI "10.1051/0004-6361:20054201"
>      becomes the URN "urn:doi:10.1051:0004-6361%3a20054201".
>
> (Any sequence of Unicode characters can be represented as a sequence
> of %-escapes per RFC 2141 section 2.2.)
>
> The great benefit of this approach is that generic operations on these
> URNs can be done by the great bulk of currently running code.

Technically, DOIs (or Handles or ARKs) can be expressed as URNs, and 
vice versa. But there is no added value whatsoever if the URN is 
resolved in the same way as the DOI.
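For reference, the mapping in Dale's DOI example can be sketched in a few
lines. This is only a reading of that single example: the "urn:doi"
prefix, the function name and the choice of which characters to escape
are assumptions, not a registered namespace or an agreed encoding.

```python
import string

# Characters left unescaped inside each part. ":" is deliberately excluded
# because it is used here as the separator between DOI prefix and suffix.
_SAFE = set(string.ascii_letters + string.digits + "()+,-.=@;$_!*'")


def _escape(part: str) -> str:
    """%-encode (lowercase hex, UTF-8) every character outside _SAFE."""
    return "".join(
        ch if ch in _SAFE else "".join(f"%{b:02x}" for b in ch.encode("utf-8"))
        for ch in part
    )


def doi_to_urn(doi: str) -> str:
    """Map "prefix/suffix" to "urn:doi:prefix:suffix" with %-escapes."""
    prefix, sep, suffix = doi.partition("/")
    if not sep or not prefix or not suffix:
        raise ValueError(f"not a DOI: {doi!r}")
    return f"urn:doi:{_escape(prefix)}:{_escape(suffix)}"
```

The conversion is mechanical in both directions, which is the point: it
changes the spelling of the identifier, not the way it is resolved.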

There is often just one traditional identifier (such as ISBN), but it
has to be made actionable in different ways. Let's assume that we have a
doctoral dissertation published in PDF format with an ISBN. A DOI using
this ISBN as its suffix may resolve to the publisher's repository, which
is behind a paywall. The same ISBN as a Handle may provide a link to the
DSpace-based open repository maintained by the author's alma mater, where
the thesis is available for free. And finally the national library may
have a legal deposit copy of the book with URN:ISBN, available for free
but only on dedicated workstations in legal deposit libraries (so a user
not using one of these workstations will only get metadata about the book).
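The scenario above can be sketched as a lookup keyed by identifier
system. All host names, the DOI prefix and the Handle prefix below are
invented for the example; the only point is that one ISBN leads to three
different targets depending on the identifier system used.

```python
# Hypothetical targets: same ISBN, different identifier system, different copy.
TARGETS = {
    "doi": "https://publisher.example.com/paywalled/{isbn}",  # publisher, paywall
    "hdl": "https://repository.example.edu/open/{isbn}",      # open repository
    "urn": "https://deposit.example.fi/metadata/{isbn}",      # legal deposit, metadata
}


def resolve_pid(pid: str) -> str:
    """Return the (hypothetical) target for a DOI, Handle or URN:ISBN."""
    scheme, sep, rest = pid.partition(":")
    scheme = scheme.lower()
    if not sep or scheme not in TARGETS:
        raise ValueError(f"unsupported identifier: {pid!r}")
    if scheme == "urn":
        # Expect urn:isbn:<ISBN>
        nid, _, isbn = rest.partition(":")
        if nid.lower() != "isbn" or not isbn:
            raise ValueError(f"unsupported URN: {pid!r}")
    else:
        # Expect <prefix>/<ISBN>, with the ISBN used as the suffix
        _, _, isbn = rest.partition("/")
        if not isbn:
            raise ValueError(f"missing suffix: {pid!r}")
    return TARGETS[scheme].format(isbn=isbn)
```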

In theory, all this functionality could be provided by the publisher's
DOI resolver. In practice, this is not likely to happen, for financial
reasons that should be quite obvious. As an aside, PID communities are
currently considering the creation of multi-PID resolvers (e.g. by
adding URN resolution functionality to the Handle software). This is
necessary since there are (open) repositories which contain documents
with different PIDs.

Juha
>
> Dale


-- 

  Juha Hakala
  Senior advisor

  The National Library of Finland
  Library Network Services
  P.O.Box 26 (Teollisuuskatu 23)
  FIN-00014 Helsinki University
  Tel. +358 9 191 44293
  Mobile +358 50 3827678