Re: [urn] Suggested PWID URN for Persistent Web IDentifiers - version 3

ht@inf.ed.ac.uk (Henry S. Thompson) Fri, 27 July 2018 08:57 UTC

To: Eld Zierau <elzi@kb.dk>
Cc: "urn@ietf.org" <urn@ietf.org>
References: <76f19e6c8892422d9d9475549218d82d@kb.dk>
From: ht@inf.ed.ac.uk
In-Reply-To: <76f19e6c8892422d9d9475549218d82d@kb.dk> (Eld Zierau's message of "Mon\, 16 Jul 2018 08\:08\:36 +0000")
User-Agent: Gnus/5.1012 (Gnus v5.10.12) XEmacs/21.5-b34 (linux)
Date: Fri, 27 Jul 2018 09:57:33 +0100
Message-ID: <f5b4lgl9htu.fsf@troutbeck.inf.ed.ac.uk>
Archived-At: <https://mailarchive.ietf.org/arch/msg/urn/qZ4qcHHmPJyKEg-YaIHmr_ksG3I>
Subject: Re: [urn] Suggested PWID URN for Persistent Web IDentifiers - version 3

Eld Zierau writes:

> Please note that I have updated the suggested PWID URN for Persistent
> Web Identifiers with the following changes:
>
> ...
>
> It is available as version 3 from
> https://datatracker.ietf.org/doc/draft-pwid-urn-specification/, and it
> is attached to this email as a pdf.

[Before my comments, please note that I don't believe it's best practice
to attach drafts in mail to this list -- your original was classified as
spam as a result of the attachment, and I only found out about it as a
result of Juha's mail.]

Much of what follows is a critique of the requirements analysis and the
presumed use cases, rather than the specifics of the namespace itself,
and the overall conclusion is negative.

This is a bit unusual, in that I wouldn't normally bother with an
extended critique whose TL;DR would be "It doesn't seem likely to me
that this will solve anyone's problems".  After all, if it doesn't solve
anyone's problems, no-one will use it, let the 'market' decide, right?

This might be so, for a narrowly focussed namespace targeting a
specialist community.  But this proposal has _very_ large-scale use in
view.

_And_, the legitimacy of discourse around various aspects of
'persistent' reference has been so compromised by flawed analyses and
failed solutions that there is actually real reputational risk for the
urn infrastructure itself from yet another example, which I'm afraid I
think this is, particularly given its very high aims.

So, I'm sorry but for these reasons, and the many specific problems
discussed below, I don't personally see a future for this document.

============

Specific comments, by section

*Purpose*

As noted above, the goal stated in the opening paragraph is very ambitious.

The only parts of the argument that follows that I can make sense of, as
to why the proposed scheme is needed, consist of two mistaken and/or
unsupported high-level claims:

 1) Citations of web-hosted resources should refer to archives, not to
    the resources themselves;

 2) Citation of archived material needs help to be sufficiently 'precise'.

The first is at best contentious as a general claim, and even in cases
where an archive reference is preferred, why packaging it in a PWID urn
will add value is not explained.  After all, if I know that what I want
to cite is available in some archive, it's because I've found it there,
in which case I already _know_ the URI I can use to retrieve it.  More
on this below when we come to the question of uniqueness.

The second surely applies to any kind of citation at all, and seems to
depend on a confusion between web identity, media type and the ontology
of citable things and loci associated with them -- all of FRBR lurks
just around the corner here.

Just because some representation is hosted on the Web does not make it
easier (or harder) than it ever was to make clear in a citation what
part/aspect/property of what you _name_ in a citation is what you
actually are _referring to_.  More on this below when we come to the
'coverage-spec'.
------------
One more specific issue with this section

  "The precision regards both regards precise reference where there
   can be no doubt about that you have the correct web material as
   well as precision about what is actually referred by the reference
   (e.g. is it the page or the whole website)"

There's nothing in the proposal which follows to back up the "you have
the correct web material" claim, and the subsequent inventory of 8
'type[s] of archived item[s]' is clearly not sufficient to unambiguously
determine "what is actually referred [to] by the reference".

*Syntax*

I see no problems in the ABNF.
------------
The definitions of the 'coverage-spec' values are hopelessly
underspecified, and heterogeneous.  The 'part'/'page' distinction is
particularly unclear/non-obvious/ambiguous/media-type dependent.

*Assignment*

The initial claim here and its subsequent gloss taken together don't
hold up:

 "The PWID URNs does not have to be assigned by an authority, as they
  are based on the information created at the time of archiving:"

 "In other words: the PWID URNs are created independently, but
  following an algorithm that itself guarantees uniqueness."

On a strict reading of "uniqueness" (i.e. a one-to-one relation between
items and PWIDs), this amounts to a claim that _any_ 3rd party
considering _any_ item in _any_ archive will always construct the _same_
PWID.  This reading is obviously false: the presence of the two "+(
unreserved )" expansions in the ABNF amounts to a _guarantee_ that there
will be multiple distinct PWIDs for the same item.

RFC8141 does not require one-to-one relations between URNs, even URNs
within the same namespace.  But it does require, in the case of "URNs
. . . created independently" that they be created "following an
algorithm that itself guarantees uniqueness".  The
underspecification of the 'coverage-spec' already alluded to above
makes independent _and_ inconsistently understood coinages of identical
PWIDs not just possible but likely.  Consider for example

urn:pwid:web.archive.org:2018-01-01T17:03:53Z:part:http://www.ltg.ed.ac.uk/~ht/

One person might coin this to mean the text/html character sequence
which was served from http://www.ltg.ed.ac.uk/~ht/ (my homepage at the
University of Edinburgh) at the beginning of 2018, whereas someone else
might coin it to mean the text/html character sequence you get from the
Web Archive today if you do an http GET on
https://web.archive.org/web/20180101170353/http://www.ltg.ed.ac.uk/~ht/,
which are different in important ways.
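
To make the two readings concrete, here is a minimal Python sketch of how
a resolver might turn the example PWID into the Wayback URL given above.
The field layout (archive-id, UTC timestamp, coverage-spec, archived
item) is inferred from this one example, not taken from the draft's
ABNF, and the names PWID_RE and wayback_url are purely illustrative.

```python
import re
from datetime import datetime

# Assumed layout, inferred from the single example in this mail:
#   urn:pwid:<archive-id>:<UTC timestamp>:<coverage-spec>:<archived item>
PWID_RE = re.compile(
    r"^urn:pwid:"
    r"(?P<archive>[^:]+):"
    r"(?P<time>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z):"
    r"(?P<coverage>[^:]+):"
    r"(?P<item>.+)$"
)


def wayback_url(pwid):
    """Map a web.archive.org PWID to a Wayback-style URL (illustrative only)."""
    match = PWID_RE.match(pwid)
    if match is None:
        raise ValueError("does not look like a PWID: " + pwid)
    if match.group("archive") != "web.archive.org":
        raise ValueError("this sketch only knows about web.archive.org")
    # Wayback URLs carry the capture time as 14 digits: YYYYMMDDhhmmss.
    stamp = datetime.strptime(
        match.group("time"), "%Y-%m-%dT%H:%M:%SZ"
    ).strftime("%Y%m%d%H%M%S")
    # Note that the coverage-spec plays no role in the resulting URL.
    return "https://web.archive.org/web/" + stamp + "/" + match.group("item")


example = (
    "urn:pwid:web.archive.org:2018-01-01T17:03:53Z:part:"
    "http://www.ltg.ed.ac.uk/~ht/"
)
print(wayback_url(example))
```

Under these assumptions, swapping 'part' for 'page' yields exactly the
same URL: nothing in the string itself says which of the two character
sequences the person coining it had in mind.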

In general, it seems that knowing whether to use 'part' or 'page' would
depend on a detailed understanding of the media type of the retrieved
representation, which the average creator of a citation is unlikely to
have.
-------------
Some of this is corrigible, but only at the expense of a vast increase
in detail and clarity.  Other parts not so much, insofar as so much is
_necessarily_ left underspecified, to allow for _anything_ to be
considered as an archive ('archive-id' value), with the consequent
requirement to allow for arbitrarily idiosyncratic internal naming
mechanisms ('archived-item' value).

This makes me wonder whether we actually have any URN
namespaces where assignment is done "independently but following an
algorithm that itself guarantees uniqueness"?  Ah, there are
registrations for urn:uuid:, which qualifies, and urn:oid:, which sort
of does.  OK, is there such a namespace in active use?  Which provides
for resolution, at least in principle?
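
For comparison, a short Python sketch of what "independently but
following an algorithm" looks like for urn:uuid:, using only the
standard-library uuid module.  Choosing the name-based (version 5)
variant with the URL namespace is an assumption made for illustration,
and the helper name name_based_urn is hypothetical.

```python
import uuid


def name_based_urn(uri):
    """Return a urn:uuid: name derived deterministically from a URI."""
    # uuid5 hashes a namespace UUID plus a name, so anyone running the
    # same computation on the same URI gets the same result.
    return uuid.uuid5(uuid.NAMESPACE_URL, uri).urn


# Two independent parties, no registry involved, same outcome.
first = name_based_urn("http://www.ltg.ed.ac.uk/~ht/")
second = name_based_urn("http://www.ltg.ed.ac.uk/~ht/")
print(first == second)

# A random (version 4) UUID needs no input at all; its uniqueness is
# probabilistic rather than derived from what it names.
print(uuid.uuid4().urn.startswith("urn:uuid:"))
```

Either way the resulting name says nothing about who coined it or what
it was coined for, which is the contrast with OIDs drawn below.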

A crucial point here is that in the OID case a urn:oid:... carries a
chain of responsibility with it.  If you run across one, you know how to
find out who takes responsibility for every level in the tree that's
involved in interpreting it.  In the UUID case, a urn:uuid: doesn't
travel well/at all, so the question doesn't arise.  But for PWIDs, if I
find one I have _no way_ to figure out who's responsible, or whom I can
ask for clarification, or whom to blame if it doesn't appear to 'work'.
----------------
I don't see what the intended value of the discussion of the
"SOLR-Wayback tool" is for the spec.

*Interoperability*

The above comments about ambiguity/lack of identity apply here too: if
different implementors interpret e.g. a coverage-spec of 'subsite'
differently, their resolvers will not interoperate.

*Resolution*

None of this is very much to the point.  In particular, it completely
ignores the precedents set by doi.org, identifiers.org and n2t.net
(https://doi.org/10.1038/sdata.2018.95 exemplifies the first and its
content discusses the second and third). 

==============

In conclusion, I think a useful way to think about this proposal is as a
misguided attempt to define a urn type which _is_ a citation.  This
makes sense of PWID's 3rd party nature, and clarifies that the
archive-id and -time, the embedded URI and the coverage spec are just a
pretty arbitrary, certainly impoverished, subset of what you would
expect to find in a _real_ citation.  Trying to pack a citation into a
URN just doesn't make sense to me. Neither does trying to determine
exactly the _right_ subset of citations of web-hosted resources to pack
up in order to be both useful and unambiguous.

(There is work underway in various venues (see e.g. [1]) to _objectify_
citations and give PIDs to _those_, which might address at least some of
the goals of this work.)

ht

[1] http://sched.co/DJ3P
-- 
       Henry S. Thompson, School of Informatics, University of Edinburgh
      10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND -- (44) 131 650-4440
                Fax: (44) 131 650-4587, e-mail: ht@inf.ed.ac.uk
                       URL: http://www.ltg.ed.ac.uk/~ht/
 [mail from me _always_ has a .sig like this -- mail without it is forged spam]