Re: [urn] Suggested PWID URN for Persistent Web IDentifiers

Hello, 

a few comments below. 

-----Original Message-----
From: urn <urn-bounces@ietf.org> On Behalf Of Henry S. Thompson
Sent: Friday, 27 July 2018 11:58
To: Eld Zierau <elzi@kb.dk>
Cc: urn@ietf.org
Subject: Re: [urn] Suggested PWID URN for Persistent Web IDentifiers - version 3

The only parts of the argument that follows that I can make sense of, as to why the proposed scheme is needed, consist of two mistaken and/or unsupported high-level claims:

 1) Citations of web-hosted resources should refer to archives, not to
    the resources themselves;

 2) Citation of archived material needs help to be sufficiently 'precise'.

The first is at best contentious as a general claim, and even in cases where an archive reference is preferred, why packaging it in a PWID urn will add value is not explained.  After all, if I know that what I want to cite is available in some archive, it's because I've found it there, in which case I already _know_ the URI I can use to retrieve it.  More on this below when we come to the question of uniqueness.

Juha: ISO TC 46/SC 9 is currently revising ISO 690 (Guidelines for bibliographic references and citations to information resources). One of the challenges is how to deal with reference rot, which consists of link rot (404) and content drift (retrieved documents change). A large percentage of all Internet resources suffer from these problems. 

Eld has been most helpful in writing the Annex that discusses the usage of Web archives in detail, and the current version is based on what she has written, with two important exceptions: 

- I don't think that referring just to a copy of the cited resource in a Web archive is always the best solution. Archive reference only must be used when the resource is dynamic (say, https://www.ed.ac.uk/) and a specific version is cited.  But when the page is supposed to be persistent (e.g. https://www.w3.org/Provider/Style/URI) the added value provided by the archive link may be modest, because the archived copy may not always be exactly the same, and web archives themselves may not last forever. The current draft of ISO 690 (which is still only a working draft) does promote the use of web archives, but not as the only solution. And if web archives are used, use of Memento should be considered, since it provides some level of protection against the disappearance of web archives.

- the current ISO 690 draft does not mention PWID. There are three reasons for this. First, and most importantly, PWID does not (yet) have a registered URN namespace. An ISO standard cannot refer to PWIDs as long as namespace registration has not been completed. And IMO it is not obvious that it will be completed. Second, every persistent identifier that will be included in the examples or elsewhere in the standard must be actionable (expressed, for the time being, as an HTTP URI). So all DOIs that were in the form doi:<DOI string> have been converted into https://doi.org/<DOI string>. This will not be possible with PWIDs, since only the Danish web archive is planning to implement PWIDs, in an archive which has strict access limitations.
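
A minimal sketch of the doi: to https://doi.org/ rewrite described above, in Python. The function name doi_to_https is hypothetical, and the sketch assumes the DOI string contains no characters that would need percent-encoding; the DOI used is the one cited later in this thread.

```python
def doi_to_https(doi: str) -> str:
    """Rewrite 'doi:<DOI string>' as 'https://doi.org/<DOI string>'."""
    prefix = "doi:"
    if not doi.lower().startswith(prefix):
        raise ValueError(f"not a doi: identifier: {doi!r}")
    # Assumption: the DOI string is used as-is, without percent-encoding.
    return "https://doi.org/" + doi[len(prefix):]


print(doi_to_https("doi:10.1038/sdata.2018.95"))
# prints https://doi.org/10.1038/sdata.2018.95
```

No equivalent one-line rewrite exists for a PWID unless some resolver is publicly reachable over HTTP.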

When you consider what to do with PWID, keep in mind that as a national library the Royal Danish Library can always use the NBN namespace for items in the web archive. For instance, something like urn:nbn:pwid:<pwid> can be constructed, as long as the <pwid> string conforms to the syntactical requirements of the NBN namespace.    

Henry: The second surely applies to any kind of citation at all, and seems to depend on a confusion between web identity, media type and the ontology of citable things and loci associated with them -- all of FRBR lurks just around the corner here.

Just because some representation is hosted on the Web does not make it easier (or harder) than it ever was to make clear in a citation what part/aspect/property of what you _name_ in a citation is what you actually are _referring to_.  More on this below when we come to the 'coverage-spec'.

Juha: Trying to express in the PWID coverage-spec which part of the resource has been cited may be more difficult than providing this information elsewhere in the citation. The ISO 690 draft specifies such an alternative approach.  And even if PWID really has to be used for this purpose, usage of f-, r- and q-components may also be considered as an alternative.

As Henry says below, coverage-spec would need a lot more alternatives to be truly useful. Coverage-spec "other" does not add much value, and it would be necessary to use it for every image, video, etc. And if there are several images on the page, how to indicate the right one (except with f-component)? 

Best, 

Juha

------------
One more specific issue with this section

  "The precision regards both regards precise reference where there
   can be no doubt about that you have the correct web material as
   well as precision about what is actually referred by the reference
   (e.g. is it the page or the whole website)"

There's nothing in the proposal which follows to back up the "you have the correct web material" claim, and the subsequent inventory of 8 'type[s] of archived item[s]' is clearly not sufficient to unambiguously determine "what is actually referred [to] by the reference".

*Syntax*

I see no problems in the ABNF.
------------
The definitions of the 'coverage-spec' values are hopelessly underspecified, and heterogeneous.  The 'part'/'page' distinction is particularly unclear/non-obvious/ambiguous/media-type dependent.

*Assignment*

The initial claim here and its subsequent gloss taken together don't hold up:

 "The PWID URNs does not have to be assigned by an authority, as they
  are based on the information created at the time of archiving:"

 "In other words: the PWID URNs are created independently, but
  following an algorithm that itself guarantees uniqueness."

On a strict reading of "uniqueness" (i.e. a one-to-one relation between items and PWIDs), this amounts to a claim that _any_ 3rd party considering _any_ item in _any_ archive will always construct the _same_ PWID.  This reading is obviously false: the presence of the two "+( unreserved )" expansions in the ABNF amounts to a _guarantee_ that there will be multiple distinct PWIDs for the same item.

RFC 8141 does not require one-to-one relations between URNs, even URNs within the same namespace.  But it does require, in the case of "URNs ... created independently", that they be created "following an algorithm that itself guarantees uniqueness".  The underspecification of the 'coverage-spec' already alluded to above makes independent _and_ inconsistently understood coinages of identical PWIDs not just possible but likely.  Consider for example

urn:pwid:web.archive.org:2018-01-01T17:03:53Z:part:http://www.ltg.ed.ac.uk/~ht/

One person might coin this to mean the text/html character sequence which was served from http://www.ltg.ed.ac.uk/~ht/ (my homepage at the University of Edinburgh) at the beginning of 2018, whereas someone else might coin it to mean the text/html character sequence you get from the Web Archive today if you do an HTTP GET on https://web.archive.org/web/20180101170353/http://www.ltg.ed.ac.uk/~ht/, which are different in important ways.
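
To make the example concrete, here is a rough Python sketch that splits such a PWID into its four fields and, for web.archive.org only, builds the corresponding Wayback URL. The field layout and the regular expression are inferred from the single example above, not taken from the draft's ABNF, and the names Pwid, parse_pwid and wayback_url are hypothetical.

```python
import re
from datetime import datetime
from typing import NamedTuple


class Pwid(NamedTuple):
    archive_id: str
    archived_at: datetime
    coverage: str
    item: str


# Assumed layout: urn:pwid:<archive-id>:<UTC timestamp>:<coverage-spec>:<archived-item>
# The timestamp itself contains colons, so a plain split(":") would not work.
_PWID_RE = re.compile(
    r"^urn:pwid:(?P<archive>[^:]+):"
    r"(?P<time>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z):"
    r"(?P<coverage>[^:]+):(?P<item>.+)$"
)


def parse_pwid(urn: str) -> Pwid:
    m = _PWID_RE.match(urn)
    if m is None:
        raise ValueError(f"not a PWID in the assumed layout: {urn!r}")
    when = datetime.strptime(m["time"], "%Y-%m-%dT%H:%M:%SZ")
    return Pwid(m["archive"], when, m["coverage"], m["item"])


def wayback_url(pwid: Pwid) -> str:
    # Only the web.archive.org URL pattern shown above is handled.
    if pwid.archive_id != "web.archive.org":
        raise ValueError("no known URL pattern for this archive")
    return f"https://web.archive.org/web/{pwid.archived_at:%Y%m%d%H%M%S}/{pwid.item}"


p = parse_pwid(
    "urn:pwid:web.archive.org:2018-01-01T17:03:53Z:part:http://www.ltg.ed.ac.uk/~ht/"
)
print(p.coverage)      # prints part
print(wayback_url(p))  # prints https://web.archive.org/web/20180101170353/http://www.ltg.ed.ac.uk/~ht/
```

The parse is purely syntactic: nothing in the string says which of the two readings of 'part' the person who coined it had in mind.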

In general, it seems that knowing whether to use 'part' or 'page' would depend on a detailed understanding of the media type of the retrieved representation, which the average creator of a citation is unlikely to have.
-------------
Some of this is corrigible, but only at the expense of a vast increase in detail and clarity.  Others not-so-much, insofar as so much is _necessarily_ left underspecified, to allow for _anything_ to be considered as an archive ('archive-id' value), with the consequent requirement to allow for arbitrarily idiosyncratic internal naming mechanisms ('archived-item' value).

In this connection, I wonder whether we actually have any URN namespaces where assignment is done "independently but following an algorithm that itself guarantees uniqueness"?  Ah, there are registrations for urn:uuid:, which qualifies, and urn:oid:, which sort of does.  OK, is there such a namespace in active use?  Which provides for resolution, at least in principle?

A crucial point here is that in the OID case a urn:oid:... carries a chain of responsibility with it.  If you run across one, you know how to find out who takes responsibility for every level in the tree that's involved in interpreting it.  In the UUID case, a urn:uuid: doesn't travel well/at all, so the question doesn't arise.  But for PWIDs, if I find one I have _no way_ to figure out who's responsible, or whom I can ask for clarification, or whom to blame if it doesn't appear to 'work'.
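
For comparison, a small Python sketch of independent assignment in the urn:uuid: case, using the standard library's uuid module. Version 4 UUIDs are random, so uniqueness here rests on overwhelming probability rather than on coordination, and nothing in the resulting URN says who minted it.

```python
import uuid

# Two parties minting identifiers with no shared registry or authority.
first = uuid.uuid4()
second = uuid.uuid4()

# The .urn attribute gives the identifier in urn:uuid: form.
print(first.urn)
print(second.urn)
```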
----------------
I don't see what the intended value of the discussion of the "SOLR-Wayback tool" is for the spec.

*Interoperability*

The above comments about ambiguity/lack of identity apply here too: if different implementors interpret e.g. a coverage-spec of 'subsite' differently, their resolvers will not interoperate.

*Resolution*

None of this is very much to the point.  In particular, it completely ignores the precedents set by doi.org, identifiers.org and n2t.net (https://doi.org/10.1038/sdata.2018.95 exemplifies the first and its content discusses the second and third).

==============

In conclusion, I think a useful way to think about this proposal is as a misguided attempt to define a urn type which _is_ a citation.  This makes sense of PWID's 3rd party nature, and clarifies that the archive-id and -time, the embedded URI and the coverage spec are just a pretty arbitrary, certainly impoverished, subset of what you would expect to find in a _real_ citation.  Trying to pack a citation into a URN just doesn't make sense to me. Neither does trying to determine exactly the _right_ subset of citations of web-hosted resources to pack up in order to be both useful and unambiguous.

(There is work underway in various venues (see e.g. [1]) to _objectify_ citations and give PIDs to _those_, which might address at least some of the goals of this work.)

ht

[1] http://sched.co/DJ3P
-- 
       Henry S. Thompson, School of Informatics, University of Edinburgh
      10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND -- (44) 131 650-4440
                Fax: (44) 131 650-4587, e-mail: ht@inf.ed.ac.uk
                       URL: http://www.ltg.ed.ac.uk/~ht/  [mail from me _always_ has a .sig like this -- mail without it is forged spam]
