Re: [urn] Suggested PWID URN for Persistent Web IDentifiers - version 3

Eld Zierau <elzi@kb.dk> Mon, 13 August 2018 10:44 UTC

From: Eld Zierau <elzi@kb.dk>
To: "Dale R. Worley" <worley@ariadne.com>
CC: "juha.hakala@helsinki.fi" <juha.hakala@helsinki.fi>, "ht@inf.ed.ac.uk" <ht@inf.ed.ac.uk>, "urn@ietf.org" <urn@ietf.org>
Thread-Topic: [urn] Suggested PWID URN for Persistent Web IDentifiers - version 3
Thread-Index: AQHUMRZFkm3uj4RtRcGB26kVRboqjKS9WtSA
Date: Mon, 13 Aug 2018 10:44:07 +0000
Message-ID: <b1b4141e5b2c4e9492a8dd51f22ad996@kb.dk>
References: <d083d07173c5455a8db9b43f73bd44d1@kb.dk> (elzi@kb.dk) <8736vlfz18.fsf@hobgoblin.ariadne.com>
In-Reply-To: <8736vlfz18.fsf@hobgoblin.ariadne.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/urn/Aki2xuDqrdan9uQ4cNAdnuTTYHE>
Subject: Re: [urn] Suggested PWID URN for Persistent Web IDentifiers - version 3

See my comments below - (marked with "--")
Best regards, Eld

-----Original Message-----
From: Dale R. Worley <worley@ariadne.com> 
Sent: Saturday, August 11, 2018 3:55 AM
To: Eld Zierau <elzi@kb.dk>
Cc: juha.hakala@helsinki.fi; ht@inf.ed.ac.uk; urn@ietf.org
Subject: Re: [urn] Suggested PWID URN for Persistent Web IDentifiers - version 3

Eld Zierau <elzi@kb.dk> writes:
> Dale R. Worley worley@ariadne.com writes:
> > A major component of pwid-urn is archive-id.  My assumption is that 
> > the archive-id is the top-level component and defines what abstract 
> > "archive" the URN is a reference into.  And that the "archive"
> > defines the exact interpretation of archival-time, coverage-spec, 
> > and archived-item.  However, in order for PWID URNs to be 
> > unambiguous, it must be unambiguous what "archive" a given 
> > archive-id refers to.  The draft suggests "it is recommended to use 
> > the web domain as the identifier for the web archive".
> >
> > What if an institution decides to create PWID URNs that use an 
> > archive-id which is a domain name that the institution does not own?
> > As written, the draft does not forbid *me* from creating PWID URNs 
> > with archive-id "netarkivet.dk".
> 
> -- Eld: I think there is a misunderstanding here - it is not the 
> archive that chooses the archive name, it is the creator of the 
> reference, - and that is why it is recommended to use the domain name.

If you are forced to say "it is not the archive that chooses the archive name", then I suggest that the "archive name" is not actually the name *of the archive*, but an identifier of some other entity.
-- Eld: If it says 'archive name' anywhere, it should be corrected - it is an identifier we are talking about.

When you say "the domain name" here, *of what* is it the domain name?
Do you mean "the archive's domain name", or "the domain name of the creator of the reference"?
-- Eld: the domain name of the archive - it identifies the archive in which you can find the resource - I actually realized that it should say top level domain name of the web archive.

> It would be best to have a formal registry, but I have to start 
> somewhere, - and with the domain name it is possible to find a 
> reference to any existing web archive (restricted access or not). 
> Within the next 5-10 years, I think there will be evidence enough 
> (e.g. from web archives) to reconstruct a registry in case there are 
> web archives changing domains within that period.  I will definitely 
> work on having such a registry formally established.

You seem to be describing how you expect these URNs to be processed, but it is not clear to me what the process is.  Can you describe how this process works, starting with "I am looking at a PWID URN" and continuing to the point where "I have discovered the particular archive in question".  (I assume that after that point, looking up the resource in the archive's index is straightforward, at least conceptually.)
-- Eld: It is indirectly described under resolution, - but it is a good point that there should be a description of construction as well. This would be:
--  PWID URNs can be constructed in different ways. If a single reference is wanted, it can be created by having/finding/creating the wanted archived resource and then constructing the PWID URN from that. Or PWID URNs can be created for a collection by using tooling that produces PWIDs.
-- In the first case, there are three situations, where the creator of the reference:
--  * already has the resource from a web archive that the creator wants to reference
--  * finds a resource from a web archive by using tools like Memento and chooses the exact web resource to reference from a specific archive
--  * creates a resource by asking a web archive/citation service to archive a live URI, and afterwards verifies the reference in the web archive
-- In these three cases the PWID URN is constructed by filling out the PWID URN pattern
--   urn:pwid:<archive-id>:<archival-time>:<coverage-spec>:<archived-uri>
-- by setting:
-- The <archive-id> to the top level domain of the web archive holding the referenced resource. For open web archives like Internet Archive's publicly available web archive collection (with current wayback on https://web.archive.org/) the top level domain is archive.org. For web archives with restricted access, choose the top level domain of the site where the web archive is publicly described, e.g. the Danish web archive is described on http://netarkivet.dk/in-english/ which has the top level domain netarkivet.dk.
-- The <archival-time> (in the form described in the syntax and given as UTC time) to the archival time for the selected web archive resource. In some wayback solutions, this archival time can be found in the online URI for the resource, e.g. in the online URL for an archived version of https://tools.ietf.org/html/rfc3986: https://web.archive.org/web/20180418152929/https://tools.ietf.org/html/rfc3986, the archival time is represented by the 14-digit number 20180418152929, where the year is the first four digits, the month the next two digits, etc. In other Wayback solutions or web archive interfaces the archival date is given as part of the metadata for the web resource.
-- The <coverage-spec> to what the creator of the reference wants the reference to cover, e.g. page or subsite.
-- The <archived-uri> to the URI that is archived, which means what the web archive has registered as the archived URI. For URIs that still exist online, the online URI and the archived URI will usually be the same. However, there are cases (e.g. for redirects) where they differ. For instance, archive.is as archived in archive.org on 2018-01-06T14:08:01 UTC is archived with the archived URI: http://archive.is:80.
-- An example of constructing a PWID URN for https://web.archive.org/web/20180418152929/https://tools.ietf.org/html/rfc3986 would therefore be
-- <archive-id> = archive.org (top level domain of web.archive.org)
-- <archival-time> = 2018-04-18T15:29:29Z (deduced from 20180418152929 which is always UTC time)
-- <coverage-spec> = webpage (the page depends on referenced style sheets. If the creator of the reference wants to indicate that the content is fully covered by this reference, without any calculation of dependencies, he/she could choose to say part, - although this may result in a resolution of the html depending on the resolution tool)
-- <archived-uri> = https://tools.ietf.org/html/rfc3986
-- Which results in the PWID URN: 
--  urn:pwid:archive.org:2018-04-18T15:29:29Z:webpage:https://tools.ietf.org/html/rfc3986
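-- The construction steps above could be sketched in Python roughly as follows. This is only an illustration and not part of the draft: the function name pwid_from_wayback, the regular expression, and the default values are assumptions made for this example, and it only handles Wayback URLs of the form https://web.archive.org/web/<14 digits>/<archived-uri>.

```python
import re
from datetime import datetime, timezone

# Assumed URL shape for the Internet Archive's Wayback interface:
#   https://web.archive.org/web/<YYYYMMDDhhmmss>/<archived-uri>
WAYBACK_RE = re.compile(r"^https?://web\.archive\.org/web/(\d{14})/(.+)$")


def pwid_from_wayback(url, coverage_spec="webpage", archive_id="archive.org"):
    """Build a PWID URN from a Wayback URL (hypothetical helper)."""
    match = WAYBACK_RE.match(url)
    if match is None:
        raise ValueError("not a recognised Wayback URL: %r" % url)
    stamp, archived_uri = match.groups()

    # The 14-digit number is the archival time in UTC.
    moment = datetime.strptime(stamp, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    archival_time = moment.strftime("%Y-%m-%dT%H:%M:%SZ")

    # Fill out urn:pwid:<archive-id>:<archival-time>:<coverage-spec>:<archived-uri>
    return "urn:pwid:%s:%s:%s:%s" % (
        archive_id, archival_time, coverage_spec, archived_uri)


if __name__ == "__main__":
    print(pwid_from_wayback(
        "https://web.archive.org/web/20180418152929/"
        "https://tools.ietf.org/html/rfc3986"))
```

-- Note that the sketch copies the archived URI unchanged, so it does nothing about characters that are not allowed in a URN.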

> > archived-item can be a URI, and a URI can contain the characters 
> > '?', '#', '[', and ']', among others.  However, those 4 characters 
> > may not appear in the NSS part of a URN.  Note that the first 2 of 
> > these characters are used in URIs only to introduce the "query" and 
> > "fragment" parts, but as the draft is written, the archived-item URI 
> > is not restricted to not having a query or fragment part.

> -- Eld: that is a problem if the archived URI does have fragments

Yes, it is a problem.  But your specification allows the archived URI to have fragments, because it refers to URIs but does not forbid them to have fragment parts.  How do you propose to resolve this contradiction?
-- Eld: I will certainly make the reader aware of it, e.g. by writing:
-- "It should be noted that a valid URN is not allowed to contain the characters '?', '#', '[', and ']'. Therefore, PWID URNs for archived URIs containing these characters cannot be constructed as legal URNs when following the instructions given."
-- Eld: I would like to discuss different alternatives:
-- 1: we can just leave it like that - which means that valid PWID URNs cannot cover these cases. - In any case the PWID URN will be a huge step forward, and cover about 80-95% of all cases. In this case, it will be up to the users whether they construct them anyway - just like the web archives' online URIs are not valid URIs since they contain "//" as part of the path.
-- 2: we can consider encoding of these characters (which will complicate things a bit)
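-- To illustrate what alternative 2 could look like, here is a minimal Python sketch that percent-encodes only the four characters mentioned above. The function name and the choice of percent-encoding are assumptions for this example; the draft does not define any encoding.

```python
# Percent-encoded forms of the four characters that may not appear
# unencoded in the NSS part of a URN.
NSS_FORBIDDEN = {"?": "%3F", "#": "%23", "[": "%5B", "]": "%5D"}


def encode_archived_uri(uri):
    """Replace '?', '#', '[' and ']' by their percent-encoded forms.

    Hypothetical helper: all other characters are copied unchanged.
    """
    return "".join(NSS_FORBIDDEN.get(ch, ch) for ch in uri)


if __name__ == "__main__":
    print(encode_archived_uri("http://example.org/a?b=1#c[0]"))
```

-- One complication is that such an encoding cannot be reversed unambiguously if the archived URI already contains e.g. %3F, so a real rule would also have to say how existing percent-encodings are treated.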

Dale