[urn] PWID URN namespace registration with latest draft version 12

Eld Zierau <elzi@kb.dk> Thu, 22 April 2021 10:07 UTC

From: Eld Zierau <elzi@kb.dk>
To: "urn@ietf.org" <urn@ietf.org>
Date: Thu, 22 Apr 2021 10:07:18 +0000
Message-ID: <b5ade01ffcfc42a5b428e0027780e724@kb.dk>
Archived-At: <https://mailarchive.ietf.org/arch/msg/urn/8Q1GB282Q-qxSKKOqDyqteUjj2I>
Subject: [urn] PWID URN namespace registration with latest draft version 12

Dear all
I hope you are all still well in spite of the pandemic.
I have not asked about progress for some time, for various reasons related to this difficult time.

As two of you have already approved this PWID URN, I would be happy if we could find a way forward for it.

It is very important for us to have the PWID URN. In Denmark, PWID is the recommended way to reference web archive material. We use it in scientific papers, as well as for specifying the elements in collections of web material. As an example, we have just produced collections of PWIDs for the "Probing a Nation's Web Domain" project (http://www.netlab.dk/research/projects/probing-a-nations-web-domain-the-historical-development-of-the-danish-web/).

I have answered all the comments and questions that have been raised, either by explanation or by slimming the suggested PWID, e.g. by leaving only page and part in the precision specification.
The only "weak" point I see in the current version is the reference to the web archive. It would be best if a registry existed, but this is not the case. However, I am convinced that the suggested way is workable until a registry is in place, and I will of course work hard to get such a registry established as soon as the PWID is accepted, e.g. in cooperation with the International Internet Preservation Consortium (IIPC) https://netpreserve.org/.

The current web archive identification is workable, since the archive can be found as long as the domain exists. If the domain changes, the change can be tracked through archived harvests of the web archive's domain. We have an example of this in Denmark, where the merger of the two libraries that jointly ran the Danish web archive resulted in https://netarkivet.dk being moved to https://www.kb.dk/find-materiale/samlinger/netarkivet, with netarkivet.dk redirecting to the new location. There are several examples of web archives changing domains or paths to their web archive material. So far it has been possible to track such changes, usually because the old URL is redirected to the new one. These redirects will probably be there for some years, but at some stage they are also likely to be removed. Whether the change is communicated through redirects or announcements, that information will be harvested and kept in web archives, so it will remain traceable by looking in web archives that have harvested it. The approach is therefore workable for now, but the information is of course best placed in a future registry, which is likely to be in place before this becomes an issue.

Through my involvement in WARCnet (https://cc.au.dk/en/warcnet/), it is my experience that there is also a need for the PWID internationally, especially when web archives change domains or paths to data. There are examples of references with archived URLs in research datasets that have become almost useless overnight for this reason. I am bringing in the PWID here as a way to solve this as part of the research data management work related to web archives.

At the International Internet Preservation Consortium (IIPC) 2021 conference, I will present the work we have done on representing collections of PWIDs (as mentioned above). Each collection from the "Probing a Nation's Web Domain" project contains elements harvested in a specific year, where each web element appears only once (although the web element was harvested many times during the year). The collections were originally produced by an extraction program, which ran on the Danish web archive (Netarkivet). The collections have now been migrated to collections of PWIDs, which are much more sustainable as a target for preservation, enabling future checks of results and the establishment of comparable results. One non-sustainable alternative would be to save the extraction program. The problem here is that we cannot be sure that the extraction program will still function in the future, or, if it does, that it will produce the same result (the archive may have been enriched with new data). Another alternative would be to preserve the output of the extraction program. However, this output is not in a standardized format, and it refers to metadata from crawl logs rather than registered archive metadata; thus, even with thorough documentation, it will be hard to retrace and understand this output even within a 10-year time horizon. Therefore, the PWID collections are in a more sustainable format as well.

Please consider final approval of the PWID URN or tell me what is needed for it to be approved and published as a URN namespace.

Best regards, Eld

PS: I have attached the latest draft, which accompanied the mail included below.

-----
Eld Zierau
Digital Preservation Specialist PhD
The Royal Danish Library
Digital Cultural Heritage
P.O. Box 2149, 1016 Copenhagen K
Ph. +45 9132 4690 
Email: elzi@kb.dk

-----Original Message-----
From: Eld Zierau 
Sent: Tuesday, May 26, 2020 10:01 AM
To: 'Dale R. Worley' <worley@ariadne.com>; urn@ietf.org
Subject: RE: PWID URN namespace registration version 10

Thank you Dale

My comments are given below, and an updated version is attached with the following corrections:
- added draft information in filename/header
- description of archive-domain
- syntax for utc-time
- date of document
Best regards, Eld

---------------------------------------------------------------------------------------

My apologies for not giving this attention sooner.

I've read version 10, and I think we should approve it.  I have the following observations, which include one editorial suggestion.


I assume that the attachment to message
https://mailarchive.ietf.org/arch/msg/urn/x_JVtfKpANKZz6Qr8iOqpsXJ8SU/
"PWID URN (shortened title)" is draft version 10, even though neither the attachment's name nor its contents state that.
> Eld: At some stage, I was told that it is version 1, maybe I 
> misunderstood, but I thought the document date then indicated the version.
> It was actually draft version 11. In the attached draft version 12, I 
> have put in the information about it being draft version 12


I particularly support the PWID proposal for the reasons I described in "PWID as citation"
https://mailarchive.ietf.org/arch/msg/urn/s-CM7hcWtUeAz7ZVBF94rCHMtsQ/
-- namely that what a PWID references is transparent enough that one could algorithmically transform a PWID pointing to one archive into a query into another archive.  This is a genuinely new capability for URNs (as far as I know) and only by deploying it in practice can we see what benefits might be obtained.
> Eld: Agree
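
The transformation described above can be illustrated with a short sketch. It is not taken from the draft: the field order `urn:pwid:<archive-domain>:<archival-time>:<precision-spec>:<archived-item>`, the helper names, the `archive.example` base URL, and the Wayback-style `<base>/<14-digit timestamp>/<URL>` query form are all assumptions made for illustration.

```python
import re
from typing import NamedTuple


class Pwid(NamedTuple):
    archive_domain: str
    archival_time: str
    precision: str
    archived_item: str


# ASSUMPTION: field order and time layout are illustrative, not quoted from the draft.
_PWID_RE = re.compile(
    r"urn:pwid:"
    r"(?P<domain>[^:]+):"
    r"(?P<time>\d{4}-\d{2}-\d{2}(?:T\d{2}(?::\d{2}(?::\d{2}(?:\.\d+)?)?)?Z)?):"
    r"(?P<precision>page|part):"
    r"(?P<item>.+)"
)


def parse_pwid(text: str) -> Pwid:
    """Split a PWID string into its four assumed components."""
    match = _PWID_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a PWID under the assumed layout: {text!r}")
    return Pwid(match["domain"], match["time"], match["precision"], match["item"])


def to_wayback_style_query(pwid: Pwid, base_url: str) -> str:
    """Build a query against another archive (hypothetical URL scheme).

    The archival time is reduced to its digits and cut to at most
    14 characters (YYYYMMDDhhmmss); the archived item is appended as-is.
    """
    timestamp = re.sub(r"\D", "", pwid.archival_time)[:14]
    return f"{base_url.rstrip('/')}/{timestamp}/{pwid.archived_item}"


pwid = parse_pwid(
    "urn:pwid:archive.org:2018-04-10T15:02:14Z:page:https://example.org/index.html"
)
query = to_wayback_style_query(pwid, "https://archive.example/web")
```

Because the archived URL and the archival time are readable directly from the identifier, no lookup in the original archive is needed to build the query.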

I still dislike that there's no well-defined way to catalog allowed values of archive-domain.  But the number of values that are used will likely remain small and there are unlikely to be "ownership conflicts"
about them, so this is unlikely to be a problem in practice.
> Eld: I agree, and I will certainly work on other fronts to make it possible to make a more 
> precise reference, but this is what exists at the moment.

A minor editorial point:
      *  'archive-domain' is defined as in (section 3.5) [RFC1034].
"archive-domain" is not defined in RFC 1034.  You need to say
      * 'archive-domain' is <subdomain> as defined in (section 3.5) [RFC1034].
(Oddly, you do want to use <subdomain> rather than <domain> defined in that section.)
> Eld: You are right - it is subdomain because <domain> ::= <subdomain> | " " and we want to avoid the " " possibility.
> I have corrected this in the attached draft version 12.
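
The difference between <domain> and <subdomain> can be made concrete with a small check. This is a sketch of the RFC 1034 section 3.5 grammar (<subdomain> ::= <label> | <subdomain> "." <label>, where a label starts with a letter, ends with a letter or digit, and is at most 63 characters long); the function name `is_subdomain` is hypothetical and not part of the draft.

```python
import re

# <label> ::= <letter> [ [ <ldh-str> ] <let-dig> ]   (RFC 1034, section 3.5)
_LABEL = r"[A-Za-z](?:[A-Za-z0-9-]*[A-Za-z0-9])?"
_SUBDOMAIN_RE = re.compile(rf"{_LABEL}(?:\.{_LABEL})*")


def is_subdomain(value: str) -> bool:
    """Return True if value matches <subdomain> from RFC 1034, section 3.5.

    Unlike <domain>, <subdomain> has no " " alternative, so a single
    blank (and the empty string) is rejected.
    """
    if _SUBDOMAIN_RE.fullmatch(value) is None:
        return False
    # RFC 1034 limits each label to 63 characters.
    return all(len(label) <= 63 for label in value.split("."))


print(is_subdomain("netarkivet.dk"))  # a well-formed subdomain
print(is_subdomain(" "))              # the <domain> alternative that is excluded
```

Note that this follows the strict RFC 1034 grammar, so a label beginning with a digit is rejected here.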


I see that if utc-time is part of archival-time, then it must contain both utc-hour and utc-minute, whereas utc-second and secfrac can be added independently.  This is a bit of an inconsistency, but I assume you intend it.
> Eld: That is actually an error. I have corrected it, so that it is possible to specify just the hour without minutes.
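
One way to read the corrected rule is that each finer time component is optional but may only appear when the coarser one before it is present. The sketch below illustrates that nesting; the concrete layout (`hh[:mm[:ss[.fraction]]]` followed by `Z`) and the name `parse_utc_time` are assumptions for illustration, not the syntax of the draft.

```python
import re
from typing import Dict, Optional

# ASSUMPTION: hh[:mm[:ss[.fraction]]]Z, with each finer part nested inside the coarser one.
_UTC_TIME_RE = re.compile(
    r"(?P<hour>\d{2})"
    r"(?::(?P<minute>\d{2})"
    r"(?::(?P<second>\d{2})(?P<secfrac>\.\d+)?)?"
    r")?Z"
)


def parse_utc_time(text: str) -> Optional[Dict[str, str]]:
    """Return the components that are present, or None if text does not match.

    Only the shape is checked; value ranges (e.g. hour <= 23) are not.
    """
    match = _UTC_TIME_RE.fullmatch(text)
    if match is None:
        return None
    return {name: value for name, value in match.groupdict().items() if value is not None}


hour_only = parse_utc_time("10Z")          # hour without minutes is accepted
with_minutes = parse_utc_time("10:01Z")    # hour and minute
dangling = parse_utc_time("10:Z")          # a trailing separator is rejected
```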

Given that precision-spec is currently limited to one of two values, later extensions can be indicated by additional values defined for this field.
> Eld: Agree

Dale

-----
Eld Zierau
Digital Preservation Specialist PhD
The Royal Danish Library
Digital Cultural Heritage
P.O. Box 2149, 1016 Copenhagen K
Ph. +45 9132 4690 
Email: elzi@kb.dk

-----Original Message-----
From: Dale R. Worley <worley@ariadne.com> 
Sent: Sunday, May 24, 2020 5:19 AM
To: Eld Zierau <elzi@kb.dk>; urn@ietf.org
Subject: Re: PWID URN namespace registration version 10

My apologies for not giving this attention sooner.

I've read version 10, and I think we should approve it.  I have the following observations, which include one editorial suggestion.

I assume that the attachment to message
https://mailarchive.ietf.org/arch/msg/urn/x_JVtfKpANKZz6Qr8iOqpsXJ8SU/
"PWID URN (shortened title)" is draft version 10, even though neither the attachment's name nor its contents state that.

I particularly support the PWID proposal for the reasons I described in "PWID as citation"
https://mailarchive.ietf.org/arch/msg/urn/s-CM7hcWtUeAz7ZVBF94rCHMtsQ/
-- namely that what a PWID references is transparent enough that one could algorithmically transform a PWID pointing to one archive into a query into another archive.  This is a genuinely new capability for URNs (as far as I know) and only by deploying it in practice can we see what benefits might be obtained.

I still dislike that there's no well-defined way to catalog allowed values of archive-domain.  But the number of values that are used will likely remain small and there are unlikely to be "ownership conflicts"
about them, so this is unlikely to be a problem in practice.

A minor editorial point:
      *  'archive-domain' is defined as in (section 3.5) [RFC1034].
"archive-domain" is not defined in RFC 1034.  You need to say
      * 'archive-domain' is <subdomain> as defined in (section 3.5) [RFC1034].
(Oddly, you do want to use <subdomain> rather than <domain> defined in that section.)

I see that if utc-time is part of archival-time, then it must contain both utc-hour and utc-minute, whereas utc-second and secfrac can be added independently.  This is a bit of an inconsistency, but I assume you intend it.

Given that precision-spec is currently limited to one of two values, later extensions can be indicated by additional values defined for this field.

Dale