Re: [urn] Comments on draft-ietf-urnbis-rfc2141bis-urn-10

Hello,

Some comments below. 

________________________________________
From: urn <urn-bounces@ietf.org> on behalf of Svensson, Lars <L.Svensson@dnb.de>
Sent: 25 March 2015 18:24
To: urn@ietf.org
Subject: Re: [urn] Comments on draft-ietf-urnbis-rfc2141bis-urn-10

All,

Sorry for causing so much confusion. I should make clear that what I intended to say here is _not specifically about DOIs_, but about the (proposed) p-component, the NSS and urn comparison (my use of DOI was only to illustrate my point and obviously was not helpful). Specifics below:

On Wednesday, March 18, 2015 5:46 PM, Peter wrote:

> To: Svensson, Lars; urn@ietf.org
> Subject: Re: [urn] Comments on draft-ietf-urnbis-rfc2141bis-urn-10
>
> On 3/17/15 7:16 AM, Svensson, Lars wrote:

> >> Although I don't claim to fully understand the DOI numbering
> >> system (with regard to URNs, Section 2.6.3 of the DOI Handbook is
> >> especially confusing to me because it implies that in a URN the
> >> slash could be replaced by a colon), let's take your example at
> >> face value.

A slash may appear in two different roles in Handles and DOIs: as a separator between prefix and suffix, and as a character within the identifier string. These two cases should receive different treatment when DOIs are converted to URNs. That is why section 2.6.3 of the DOI Handbook says:

"To enable the use of DOIs in workflows that have already standardized on URNs, the DOI proxy servers understand the substitution of a colon in place of the initial slash in a DOI name. DOI names may therefore be expressed as URNs in the doi.org domain by writing, for example, the DOI name 10.123/456 in the form http://doi.org/urn:doi:10.123:456. Note, however, that a DOI suffix is allowed to contain other slashes, and where these occur they must be percent-encoded rather than replaced with a colon: for example, the DOI name 10.123/456ABC/zyz would become http://doi.org/urn:doi:10.123:456ABC%2Fzyz, with the final slash character encoded as %2F. "

IMO it is best not to deviate from this representation, which makes sense to me. However, the example is theoretical, since the namespace DOI has not been registered so far and the DOI community does not seem to have much interest in doing so.
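
To make the Handbook rule concrete, here is a minimal Python sketch of the substitution it describes. The function name doi_to_urn is made up for illustration, the "urn:doi:" prefix is taken from the Handbook's own example (the namespace is not registered), and only slashes are handled because that is all the quoted text specifies.

```python
def doi_to_urn(doi_name: str) -> str:
    """Hypothetical helper: apply the substitution quoted from DOI Handbook 2.6.3."""
    # The first slash separates the DOI prefix from the suffix.
    prefix, sep, suffix = doi_name.partition("/")
    if not sep or not prefix or not suffix:
        raise ValueError("expected a DOI name of the form prefix/suffix")
    # The separator becomes a colon; any further slashes belong to the
    # suffix and are percent-encoded instead.
    return "urn:doi:" + prefix + ":" + suffix.replace("/", "%2F")


print(doi_to_urn("10.123/456"))
print(doi_to_urn("10.123/456ABC/zyz"))
```

The two calls use the DOI names from the Handbook quotation, so the printed strings can be checked against the forms given there (without the http://doi.org/ part).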

> But non-technical users aren't creating URNs.

Incorrect. In Finland, graduating students often publish their master's theses on the Web. If they do that, they are asked to create metadata for their work. The template which assists them in the process retrieves a URN:NBN, which is used as the identifier for the thesis.

So with the help of tools provided by the national library non-technical users are minting thousands of URNs annually. 

> But non-technical users don't know about URNs.

Students and researchers may not know what a URN is, but the importance of persistent identifiers in general is gradually becoming more apparent due to increased usage of PIDs. A large percentage of research articles published in electronic format receive them, and the same applies to research data sets. On the other hand, increasing link rot has an impact on what people think about so-called cool URIs.

About the usage and role of persistent identifiers in science, see for instance 

http://www.pidconsortium.eu/

> > they will create
> > "urn:example:doi:10.1007/10929179_65" by simple string concatenation.
> > It will require a substantial amount of education to make people
> > substitute the slash in the DOI for a colon.
>
> I'd assume that the International DOI Foundation would create these URNs.

As far as I know, IDF is not involved with this kind of practical work.  

> And others might want to figure out what would be the URN for a given identifier (e. g. ISBN or ISSN).

As an aside, it may not be trivial to restore an identifier in its pristine form from within a DOI / Handle string. You cannot assume that the DOI suffix = URN NSS. For instance, for ISBN the recommendation is to put the publisher part of the ISBN into the DOI prefix. So ISBN 951-0-18435-7 should be represented as DOI 10.9510/184357.
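
A small Python sketch that reproduces the ISBN example above. It is only an illustration: the function name isbn10_to_doi is invented, and the rule it applies (the first two hyphen-separated ISBN elements follow "10.", the remaining digits follow the slash) is an assumption generalised from this single example, not a quotation from any registration agency's rules.

```python
def isbn10_to_doi(isbn: str) -> str:
    """Hypothetical helper: rebuild the DOI form used in the example above."""
    parts = isbn.split("-")
    if len(parts) != 4:
        raise ValueError("expected a hyphenated ISBN-10 with four elements")
    group, publisher, title, check = parts
    # Assumption: group and publisher elements go before the slash,
    # title element and check digit go after it.
    return "10." + group + publisher + "/" + title + check


print(isbn10_to_doi("951-0-18435-7"))
```

The hyphens are dropped on the way, so the element boundaries of the original ISBN cannot be read back from the DOI string alone, which is why restoring the pristine identifier is not trivial.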

> (And remember this is just an example - we have no idea if the IDF has
> any plans to use URNs. But the same reasoning would apply to any
> similarly structured identifier system.)

There are no plans to represent any other persistent identifiers as URNs (or vice versa). An ISBN can be expressed both as a DOI and as a URN, and that makes sense because the ISBN would then be resolved in two different resolution services (a DOI resolver maintained by the publisher and a URN resolver maintained by the national library, for instance). But there is no reason to stack persistent identifiers on top of one another so that the resulting combo-PID needs to be resolved twice (for instance, a URN containing a DOI would first go to the national library's URN resolver in order to find the publisher's DOI resolver, which would then retrieve the actual document).
>
> >> Further: are there alternative representations? For example, might
> >> it make sense to represent the complete DOI (prefix and suffix) as
> >> a p-component? Such as:
> >>
> >> urn:example:doi/10.1007/10929179_65
> >
> > Yes, for DOIs that would be a much better idea.

The best idea is to stick to what the DOI Handbook says. 

All the best,

Juha

>
> So, if the IDF registers a URN namespace, we'd encourage them to do the
> right thing regarding the p-component.
>
> > My point is, however, that when non-technical users create urns just
> > adding a prefix to an existing identifier,
>
> I question many assumptions in that part of your sentence. :-)
>
> > we should make sure that
> > things work as intuitively as possible. Since 2141bis allows the "/"
> > in the syntax, a generic URN parser will accept "
> > urn:example:doi:10.1007/10929179_65" (or any other URN created from
> > an identifier with a "/" in it) as a valid URN. I realise that this
> > might also be the case for other characters with special meaning, but
> > I also think that the "/" character will be more confusing to
> > implementers and users since it's more commonly used as a delimiter
> > than e. g. "@".
>
> If we're going to reuse URI syntax, then "/" is what we have as a
> delimiter between the authority and the p-component. Are you suggesting
> that we diverge from URI syntax?
>
> >>> In the syntax defined by 2141bis, the first one uses NSS +
> >>> p-component, the second one puts the complete DOI in the NSS.
> >>
> >> As far as I understand things, putting the complete DOI in the
> >> p-component makes the most sense (option 3).
> >
> > Agreed.
> >
> >>> This is potentially confusing for consumers, since §5 says that
> >>> [[ [urn-aware] applications might support display of URNs in a
> >>> more human-friendly form and might use a character set that
> >>> includes characters that are not permitted in URN syntax as
> >>> defined in this specification (e.g., when displaying URNs to
> >>> humans, such applications might replace percent-encoded strings
> >>> with characters from an extended character repertoire [...]). ]]
> >>
> >> I agree that using option 1 or option 2 is potentially confusing.
> >>
> >>> That means that if I choose option (2) (putting the complete DOI
> >>> into the NSS using percent-encoding), a conforming urn-aware
> >>> application MUST show "urn:example:doi:10.1007%2F10929179_65" and
> >>> MAY show "urn:example:doi:10.1007/10929179_65". This is likely to
> >>> be _very_ confusing to some users not aware of the difference
> >>> between the two.
> >>>
> >>> I suggest that the p-component is incorporated into the NSS; this
> >>> is more intuitive.
> >>
> >> Either that or don't split an identifier (e.g., a DOI name) across
> >> the NSS/p-component boundary.
> >

[Skipping some discussion about DOIs and coming to the point about urn comparison and p-component]

> > Right. Nonetheless I'd argue that it's more intuitive that the
> > p-component is incorporated into the NSS. If we say that a URN
> > consists of schema + NID + NSS + p-component (+ q-component +
> > f-component), this (at least to me) indicates that the p-component is
> > not a real part of the name, but something that is added to it. When
> > determining URN equality, however, the p-component is considered. It
> > would be more consistent to say two URNs are equivalent iff all parts
> > of the name are octet-by-octet equal (modulo normalisation). URN
> > components that are not part of the name (i. e. q-component and
> > f-component) are not considered when determining equality.

[Peter replied to this, but his answer is more about the syntactic aspect of the p-component and not about naming and urn comparison, which is the point I'm failing to make...]

Let me try to reformulate my concern:

Right now we have urn=assigned-name + p-component (+q-component + f-component), where assigned-name = 'urn' + NID + NSS. To me this implies that p-, q- and f-components are not part of the name and thus (my reading) not really identifying anything (sort of second-class citizens). That seems OK for q- and f-components since they are ignored when we do urn comparison. But to me it seems counter-intuitive to have a part of the identifier that is not relevant for the _identification_ (not part of the assigned name) but nonetheless matters when it comes to _comparison_.

Am I over-interpreting the use of "assigned-name" to mean that the rest of the urn is not part of the name and thus not part of the identifier?
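
The comparison behaviour described above can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the draft's normative algorithm: the function names are invented, the q-component is assumed to start at the first "?" and the f-component at the first "#", and the normalisation (case-insensitive "urn" and NID, case-insensitive hex digits in percent-escapes) is modelled on RFC 2141 lexical equivalence.

```python
import re


def urn_comparison_key(urn: str) -> str:
    """Hypothetical helper: keep only what matters for comparison."""
    # Assumption: "#" starts the f-component and "?" the q-component;
    # both are ignored when comparing.
    without_f = urn.split("#", 1)[0]
    without_q = without_f.split("?", 1)[0]
    scheme, nid, rest = without_q.split(":", 2)
    if scheme.lower() != "urn":
        raise ValueError("not a URN: " + urn)
    # "rest" is the NSS plus any p-component; it stays case-sensitive
    # except for the hex digits of percent-escapes.
    rest = re.sub(r"%[0-9A-Fa-f]{2}", lambda m: m.group(0).upper(), rest)
    return "urn:" + nid.lower() + ":" + rest


def urn_equivalent(a: str, b: str) -> bool:
    return urn_comparison_key(a) == urn_comparison_key(b)


base = "urn:example:doi:10.1007/10929179_65"
print(urn_equivalent("URN:Example:doi:10.1007/10929179_65?x=1#frag", base))
print(urn_equivalent("urn:example:doi:10.1007/other", base))
print(urn_equivalent("urn:example:doi:10.1007%2F10929179_65", base))
```

In this sketch the part after the "/" changes the outcome of the comparison even though it lies outside the assigned-name, while the q- and f-components do not. The last call also shows that the percent-encoded and the literal slash forms are not treated as the same URN.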

Best,

Lars
_______________________________________________
urn mailing list
urn@ietf.org
https://www.ietf.org/mailman/listinfo/urn