[urn] new draft 10 - new form (RE: new draft 9 - RE: new urn PWID draft (7) with corrections)

Eld Zierau <elzi@kb.dk> Fri, 11 October 2019 13:43 UTC

From: Eld Zierau <elzi@kb.dk>
To: "'Hakala, Juha E'" <juha.hakala@helsinki.fi>, Peter Saint-Andre <stpeter@stpeter.im>, "urn@ietf.org" <urn@ietf.org>
Thread-Topic: new draft 10 - new form (RE: [urn] new draft 9 - RE: new urn PWID draft (7) with corrections)
Thread-Index: AdWAM/DTuXghRloBRw63xsvflzI7Qg==
Date: Fri, 11 Oct 2019 13:43:39 +0000
Message-ID: <2396dbcf66bb4c8689bcbabca2cc8492@kb.dk>
Archived-At: <https://mailarchive.ietf.org/arch/msg/urn/93kbt1i9OThN1P53pKWHyBpS-PI>

Dear Juha, all
I have attached a new “draft version 10” of the URN PWID document, in which I have adjusted the text to address your comments, as described in my replies within your mail below. I have tried to make it look like the other versions as far as that makes sense. It is a PDF; please say if you want another format.

Please also find the description of use cases and the comments on why we need the PWID, which are included as part of my comments in your mail below.

And please tell me if you need any more information.

Best regards, Eld
-----
Eld Zierau
Digital Preservation Specialist PhD
The Royal Danish Library
Digital Cultural Heritage
P.O. Box 2149, 1016 Copenhagen K
Ph. +45 9132 4690 
Email: elzi@kb.dk

-----Original Message-----
From: Hakala, Juha E <juha.hakala@helsinki.fi> 
Sent: Friday, September 6, 2019 2:46 PM
To: Eld Zierau <elzi@kb.dk>; Peter Saint-Andre <stpeter@stpeter.im>; urn@ietf.org
Subject: RE: [urn] new draft 9 - RE: new urn PWID draft (7) with corrections

Dear Eld; all, 

I have some general and specific comments about this version of the PWID requests: 

There is a fair amount of redundancy, the same text is repeated in introduction and further chapters. The draft would be easier to read if this kind of duplication were reduced. 
> Eld: Restructuring the document (leaving out the RFC template and using only the URN template) has removed the duplication

The process specified in RFC 8141 should be used to register a namespace. That is, the registration does not need to be an RFC unless there is an acceptable reason to do so (in case of NBN, there is no other document to describe NBNs than RFC 8458). Note that the RFC process is more complicated than the RFC 8141 one, even for informational RFCs. 
> Eld: We proceed as a registration of a URN namespace only according to RFC 8141

Reference to W3CDTF is out of date, since the problem the W3C profile fixes (ISO 8601 allowed presentation of year with two digits) has been corrected in ISO 8601:2004 and later versions of the standard. 
> Eld: I have removed reference to W3CDTF

ISO 8601 has been revised in 2019, so unless you want to cite the old version you should either change the year in the citation or remove it completely. Without the year specification the citation will always refer to the latest version of the standard. I recommend this latter approach. 
> Eld: I have updated the ISO8601 reference

On page 9. sentence "The following valid precision-spec values are exists:" should be fixed. Whether it is enough to specify just page and file is a good question; in theory it is possible to specify also all copies of a page or all pages harvested from a web site, but this is clearly beyond the intended scope of PWID.   
> Eld: Yes, in the minimal initial version it is beyond the scope, but it can be considered later. I removed “are” from the sentence.

Concerning the two issues you mention below: 

Domain name of the archive is indeed a weak way of identifying the archive.  It might be better not to talk about identification at all in this context, and just call this component domain name of the archive, and say that the name will separate the Web archives from one another. 
>Eld: In order to meet this I have made the following adjustments: 
>  -   In section “Section Assignment” the following is changed:
>       o  A prerequisite for assignment of a PWID is that the web archive can be located [was: identified] (with a domain describing the web archive)
>       o  archive-domain (location [was: identification] of web archive):
>       o  This must be the domain of the web archive, which assists in locating [was: as identification of] the web archive and separates the web archive from other web archives
>  -   In section “Additional Information”, I have removed the discussion of archive identifiers, i.e. the archive identifier is only mentioned in the same section in a bullet point as something that can be looked at in later versions
>  -   In the rest of the document I have removed all references to the archive identifier discussion

Web archives are organizations and they may eventually receive e.g. ISNI (International Standard Name Identifier) or other organization identifier. Surprisingly not even the Internet Archive has an ISNI yet. And even if there were one, it would be difficult to use it during the automatic PWID generation process. 
> Eld: I have removed the comment about registries from the assignment section, since it would take too much space to explain all the challenges.

I do not agree with this statement: 

"it would not make sense to have direct assignment of identifiers for all archived web material at any point. Firstly, because the volume of data is too big, and secondly, because assignment of an identifier would require that you can point to the resource, which will be impossible for web resources with restricted access, which place us back at the starting point."
   
If URNs are machine generated, it is possible to create billions of them. RFC 8141 does not impose any limits on how many URNs there can be per namespace. Of course technologies used may set some limits; Norwegian Web archive initially created URNs for everything but they had to stop when the database table became too large, after several hundreds of millions of assigned URNs. 

URN standard does not require URNs to be actionable, so the fact that some archived resources are not publicly accessible is not a problem either. 
> Eld: I may not have formulated this well enough; please see my explanation below (which also addresses the rest of your comments).

This brings me to the issue I find hard to understand: 

PWID supports references to material in web archives with restricted access. Since these archives are usually / always built using the same tools as publicly available archives (our legal deposit archive uses the same technology as the Internet Archive), all documents in our legal deposit Web archive have archive specific cool URIs that can be used for citing either all versions of a web page or a specific harvested version of it. We can create URNs which can be used as links to these resources in such a way that this URN resolves to all the harvested copies of this page, either in our Web archive or in our own archive or elsewhere. We are currently in the process of making our resolver compliant with RFC 8141, and part of the revision is this URN -> URLs functionality. Link to the legal deposit Web archive will only be seen if the user is entitled to access the archive (that is, people on dedicated work stations, with known IP addresses). 

What is the added value of URN:PWID for a user like us, because having a non-actionable URN:PWID is not attractive when we can have e.g. URN:NBN which resolves to any archived copy of the resource (and to the one on-line in the Web)?     

It is a well known problem that copies of Web pages in different archives may differ, since the individual files have been harvested at different times. But I am not sure if URN:PWID makes it easier to solve this problem (compared with e.g. usage of URN:NBN). The former seems to be limited to page and part (in precision-spec), when we may want to identify with a URN all pages harvested from a web site at all times (e.g. when harvesting newspaper Web sites) or all versions of a page, not just one. 

To sum up: this version of PWID registration request has fewer technical issues than the previous one, but I still don't know how / why for instance my library could / should use URN:PWIDs to identify resources in our restricted legal deposit Web archive. Use cases (general ones, or ones describing the plans for Denmark) might help me to see the light. My library may not find uses for PWID, but if Royal Danish Library has them and if there are no obvious technical issues in a future version of this document, I can accept this registration request.  
>Eld: Below I explain the challenges that we in Denmark have with the use of identifiers like (local) URLs to tools or the use of PID systems like NBN. Furthermore, I have listed a number of use cases that already exist, and how these use cases benefit from using the PWID. I hope this is enough to illustrate the need, at least for Danish researchers, and that you can therefore accept the PWID URN in the form presented in the attached version.
>
> COMMENTS ON WHY WE NEED PWID
> Concerning use of local URLs for web archive tools, I agree that a lot of web archives use the same tools, but the landscape is bound to change. Right now we are using Open Wayback in Denmark, but we are also about to launch SOLR Wayback, and we are considering replacing Open Wayback with pyWayback. All these tools have different access URLs; therefore, even within a closed environment, URLs for Open Wayback may not work next year, and the same can be said for the other tools over a longer period. 
> Even worse, for some of the tools we use, there are URLs pointing directly to the archived WARC file with the offset of the relevant WARC record. Until the PWID was suggested, the guidelines of the Danish web archive Netarkivet for references to Netarkivet were to address the actual WARC file record by the WARC filename and offset. Last year we migrated the whole web archive to compressed WARC files, and consequently all references to uncompressed WARC files have become invalid (new file names suffixed with .gz and new offsets because of the compression). It is these observations and experiences that initiated the research into finding a more persistent way to reference material in Netarkivet, which ended up with the PWID.
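
The breakage described above can be illustrated with a small sketch. It is only a toy model: the index contents, file names, offsets and the `migrate_to_gzip` helper are all invented for illustration and do not reflect Netarkivet's actual data or tooling. It contrasts a reference to a physical location (WARC file name plus offset) with a reference to the logical pair (harvested URI, archival time).

```python
from typing import Dict, Tuple

Location = Tuple[str, int]   # (WARC file name, byte offset), hypothetical
Key = Tuple[str, str]        # (harvested URI, archival time), hypothetical

# Toy stand-in for a CDX-style lookup table before the migration.
index_before: Dict[Key, Location] = {
    ("http://example.org/", "2016-01-22T11:20:29Z"): ("crawl-0001.warc", 48213),
}

def migrate_to_gzip(index: Dict[Key, Location]) -> Dict[Key, Location]:
    """Simulate a migration to compressed WARC files.

    Every file name gains a .gz suffix and every offset changes
    (the division by 3 is an arbitrary placeholder for the new offset).
    """
    return {key: (name + ".gz", offset // 3) for key, (name, offset) in index.items()}

index_after = migrate_to_gzip(index_before)

# A citation recorded as file name + offset no longer matches anything.
physical_ref: Location = ("crawl-0001.warc", 48213)
print(physical_ref in index_after.values())   # False

# A citation recorded as URI + archival time still resolves.
logical_ref: Key = ("http://example.org/", "2016-01-22T11:20:29Z")
print(logical_ref in index_after)             # True
```

The logical key survives because it depends only on what was harvested and when, not on how the archive currently stores it.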
>
> Concerning use of a persistent identifiers system like NBN for each element in Netarkivet, we would in Denmark have the following challenges:
>  -   It would introduce an extra index (which we want to avoid as much as possible), and it would require that we from this index have some sort of referencing mechanism into the web archive. This could be done by
>       o  Register an NBN with a URL to the current version of a “Wayback” tool for each file in the web archive. Some of the main challenges for the Royal Danish Library (which has a policy that persistent identifiers need to be reproducible from bit-preserved data) will then be:
>            # Having an index or database that can administer the large number of entries (as the Norwegian Web archive experienced)
>            # Updating if new tools are introduced, and having to decide which tool to resolve to
>            # Preserving the references, either by archiving the NBN with the web element (which would complicate ingest of harvested materials and require adjustments to existing tools) or by preserving the index/database, which would be challenging since there would be different versions over time. This is necessary according to OAIS, since you need to be able to reconstruct an archive from the preserved AIPs
>            # Finding the NBN for a piece of web material when you want to make a reference to it. This would require either the “Wayback” tool to provide the information (requiring adjustment to the “Wayback” tool) or a separate service where you could get the NBN for a web part (and most likely the parameters would be the archived URL, the archival time and, implicitly through the choice of service, the web archive)
>       o  Use the information from CDX that is essential for finding the web material in the web archive. The web archive would then have to be represented in the prefix = iso-cc *( ":" subspc). 
>           Apart from the precision specification, this is in effect the PWID, except that the PWID offers a standardized and relatively readable representation of the information. 
>  -   Not being able to be precise about what a reference covers, which leaves it open to ambiguous interpretation, i.e. whether a reference to a web page refers to the web page code or to the rendering of the page with all calculated elements.
>  -   Not solving the challenge of referencing other web archives in cases where the referred web material consists of parts from different web archives (as was the case for the reconstruction of litlive.dk, which was an important website for discussion in literary studies).
>With the PWID, we avoid all these challenges, since it is based on the essential (CDX) information about how to find the referred materials, and we can offer digital humanities researchers the means to make the precise references they need and to create persistent and precise collection definitions of web elements. 
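
As a rough illustration of how a PWID can be derived purely from the CDX-style information mentioned above (archive domain, archival time, harvested URI) plus a precision-spec of page or part, here is a minimal sketch. The colon-separated layout `urn:pwid:<archive-domain>:<archival-time>:<precision-spec>:<uri>`, the UTC timestamp format, and the names `build_pwid` and `parse_pwid` are assumptions made for this example; the attached draft is the authority on the actual syntax.

```python
from datetime import datetime, timezone
from typing import Tuple

PRECISIONS = {"page", "part"}   # the two precision-spec values discussed in the thread
STAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"
STAMP_LENGTH = 20               # len("2016-01-22T11:20:29Z")

def build_pwid(archive_domain: str, archived_at: datetime,
               precision: str, harvested_uri: str) -> str:
    """Assemble a PWID-like URN from the four components (assumed layout)."""
    if precision not in PRECISIONS:
        raise ValueError(f"precision must be one of {sorted(PRECISIONS)}")
    if archived_at.tzinfo is None:
        raise ValueError("archival time must be timezone-aware")
    stamp = archived_at.astimezone(timezone.utc).strftime(STAMP_FORMAT)
    return f"urn:pwid:{archive_domain.lower()}:{stamp}:{precision}:{harvested_uri}"

def parse_pwid(pwid: str) -> Tuple[str, str, str, str]:
    """Split a PWID-like URN back into (archive, time, precision, uri)."""
    prefix = "urn:pwid:"
    if not pwid.lower().startswith(prefix):
        raise ValueError("not a urn:pwid URN")
    archive, rest = pwid[len(prefix):].split(":", 1)
    # The timestamp itself contains colons, so cut it by its fixed width.
    stamp, rest = rest[:STAMP_LENGTH], rest[STAMP_LENGTH:]
    if not rest.startswith(":"):
        raise ValueError("malformed archival time")
    precision, uri = rest[1:].split(":", 1)
    return archive, stamp, precision, uri

example = build_pwid(
    "archive.example",   # hypothetical archive domain
    datetime(2016, 1, 22, 11, 20, 29, tzinfo=timezone.utc),
    "page",
    "http://example.org/",
)
print(example)
print(parse_pwid(example))
```

No registry or extra index is consulted at any point: the identifier is reproducible from data the archive already holds.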
>
>USE CASES
> There are several use cases that need very precise references, especially for researchers in digital humanities. The PWID has already been used in some such cases, and in other cases work on introducing it has been initiated. The following use cases are all studies in this category:
>  -   Research into how guidelines for behavior on a Danish online memorial website change over time. This study required precise references (citations) to specific versions of the guidelines, which the PWID helped support. Several papers have been published using PWIDs in the references for citations in this research. The expert on this study is Senior Researcher Caroline Nyvang, Royal Danish Library, cany@kb.dk.
>  -   Research into the discussion and use of a specific press image (a news picture of a man spitting on refugees in 2015) which appeared in different media (including social media), in different forms (e.g. memes), and in the context of different discussions and comments. In this study there was a need for precise references to specific versions of the image as well as citations into the contexts around the image versions. PWIDs have been used to make exact references to the images and page references to the contexts. A paper will be published in a journal at the end of 2019. The expert on this study is Senior Researcher Mette Kia Krabbe Meyer, Royal Danish Library, mkkm@kb.dk 
>  -   Research into trends in how the Danish web has changed since the start of harvesting the Danish web. The research is based on extracts of Netarkivet per year, in which duplicates are avoided (i.e. only one version of each page) in order to make a proper basis for extracting statistics. These extracts are referred to as corpora in the research, and these corpora are basically collections of web materials. The corpora have been calculated by a program running on the Netarkivet data as it looks today and in today's infrastructure. Therefore, the chances of reproducing the corpora in the future are very slim, since data may be added (e.g. new data from other archives, as we have already done with data from the Internet Archive) or the infrastructure may change (which makes it hard or impossible to run the program). The only way to ensure the same basis for extracting statistics on a corpus is therefore to define the corpus in terms of which web material elements it contains. The PWID is here essential for making persistent definitions of the corpora. The collection definition then includes a PWID for each web element in the collection. A paper on how to make persistent definitions of collections with PWIDs was published at RESAW 2017; the research itself has been presented at several conferences and publications are under way. The expert on this study is Professor Niels Brügger, Aarhus University, nb@cc.au.dk 
>  -   Reconstruction of the litlive.dk website for research into online discussions among literati. This website no longer exists, and everything that can be found in any web archive is valuable. Therefore this research also included a search for web elements from web archives other than Netarkivet, in order to define collections of elements for reconstructing as much as possible of the website. The PWID here contributes by providing a consistent means to define elements in a collection spanning several web archives. This work was mentioned in the first paper about the PWID (or WPID as it was called then). Contact: Senior Researcher Thomas Hvid Kromann, Royal Danish Library, thok@kb.dk 
>  -   The RESAW initiative has existed for some years with the purpose of building projects that can support a Web archive research infrastructure. Here the PWID can support requests for and specifications of web elements across web archives, independently of the individual web archive access platforms or knowledge of PID systems for the web archives.
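
The corpus use case above amounts to fixing a collection by listing one identifier per web element. A minimal sketch of such a definition follows; the one-identifier-per-line format, the function names and the use of a SHA-256 fingerprint to compare two definitions are assumptions of this example, not something the thread or the draft specifies, and the PWID strings are placeholders.

```python
import hashlib
from typing import Iterable

def collection_definition(pwids: Iterable[str]) -> str:
    """Return a canonical text form: unique PWIDs, sorted, one per line."""
    unique_sorted = sorted(set(pwids))
    return "\n".join(unique_sorted) + "\n"

def fingerprint(definition: str) -> str:
    """Hash the canonical text so two definitions can be compared cheaply."""
    return hashlib.sha256(definition.encode("utf-8")).hexdigest()

# Placeholder identifiers; the layout follows no particular draft version.
harvest_run = [
    "urn:pwid:archive.example:2016-01-22T11:20:29Z:page:http://example.org/b",
    "urn:pwid:archive.example:2016-01-22T11:20:29Z:page:http://example.org/a",
    "urn:pwid:archive.example:2016-01-22T11:20:29Z:page:http://example.org/b",
]

definition = collection_definition(harvest_run)
print(definition, end="")
print(fingerprint(definition))
```

Because the definition lists the elements themselves, it stays valid even if the program that originally selected them can no longer be run.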
>
> You are of course welcome to ask for more specific references, submitted draft, or additional information, either through me or by contacting the mentioned persons for the use cases directly.
>
>A minor additional thing that could be looked at later: Concerning references to several versions of a web element, this has not been the focus for introducing the PWID. However, I have thought that this could be included as an r-component at a later stage, e.g. ?+all-versions?=. You might claim that the timestamp specification is then without importance, but following the requirements for referencing, which are traditionally followed by digital humanities researchers in Denmark, it makes perfect sense to specify the verified resource and then have an extra parameter indicating that there are several versions.
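
For readers unfamiliar with r-components: RFC 8141 lets a URN carry an optional r-component introduced by "?+", a q-component introduced by "?=", and an f-component introduced by "#", all after the assigned name. The sketch below splits a URN along those delimiters. The `all-versions` value is only the idea floated above, not defined anywhere, and the sketch assumes that any "?" or "#" inside the harvested URI has been percent-encoded so that it cannot be mistaken for a delimiter.

```python
from typing import NamedTuple, Optional

class UrnParts(NamedTuple):
    assigned_name: str
    r_component: Optional[str]
    q_component: Optional[str]
    f_component: Optional[str]

def split_urn(urn: str) -> UrnParts:
    """Split a URN into assigned name and optional r-, q- and f-components.

    Delimiters follow RFC 8141: "?+" (r-component), "?=" (q-component),
    "#" (f-component), in that order. No validation of the parts is done.
    """
    body, _, fragment = urn.partition("#")
    body, _, query = body.partition("?=")
    assigned, _, resolution = body.partition("?+")
    return UrnParts(assigned, resolution or None, query or None, fragment or None)

# Hypothetical PWID-like URN with the suggested r-component appended.
parts = split_urn(
    "urn:pwid:archive.example:2016-01-22T11:20:29Z:page:http://example.org/?+all-versions"
)
print(parts.assigned_name)
print(parts.r_component)
```

Keeping the timestamp in the assigned name while signalling "all versions" in the r-component matches the idea above of citing the verified resource and adding a separate hint.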


Best regards, 

Juha 

-----Original Message-----
From: urn <urn-bounces@ietf.org> On Behalf Of Eld Zierau
Sent: perjantai 6. syyskuuta 2019 10.44
To: Peter Saint-Andre <stpeter@stpeter.im>; urn@ietf.org
Subject: [urn] new draft 9 - RE: new urn PWID draft (7) with corrections

Dear Peter

I urge you to reconsider. I have made a new, much slimmer draft version 09 (https://www.ietf.org/internet-drafts/draft-pwid-urn-specification-09.txt), where I have cut down on ambitions and left out parts that are not crucial for the PWID in order to fill the current gap in referencing methods for web archives. Cutting down the ambitions means that most issues are solved, as explained below. 

Let me stress that I really think it is important that this is a standard obtained within IETF, as it is very much related to the internet. The PWID can cover a very important gap, as there are no ways to make proper references to archived material in web archives with restricted access. Many web archives have restrictions due to personal data issues, and the lack of referencing techniques is a problem for researchers, because publishers in many cases only accept papers if all their references follow a standard. This is a problem worldwide, and it will have consequences for the use of web archives as sources for research, even though a lot of communication relevant for our society is actually happening on the web.

With the new version, I think there are only two considerations left:
1)
The PWID is a bit different from other URNs in having an indirect identifier assignment instead of a direct one: the identifier is indirectly assigned when the resource is archived, through the archive, the timestamp and the harvested URI, supplemented with a precision of how much the reference includes (precision-spec).
It should be noted that it would not make sense to have direct assignment of identifiers for all archived web material at any point. Firstly, because the volume of data is too big, and secondly, because assignment of an identifier would require that you can point to the resource, which will be impossible for web resources with restricted access, which places us back at the starting point.
2)
The web archive is not identified via a registry, but via the domain name of the web archive. Over the long term, a registry where the history of changed domains can easily be traced would be better. However, it would probably be possible to trace changes until a formal registry is established. I will certainly work for establishing a registry, where there are procedures for registration which ensure that no “… third party can create archive-ids associated with some other Internet domain or administrative domain over which it has no control or authority”.
If this is not acceptable, it will probably never be possible, since this is a question of whether the chicken or the egg comes first. If the PWID cannot be accepted due to a missing registry, the creation of a registry would probably be rejected, since it does not have a purpose yet. To me it is most obvious to start with the PWID URN without the registry, since it has a very good purpose by filling a gap, and it can be argued that the proposed identification of archives is workable and changes are traceable (additional discussion of this can be found in the “Additional Information” section of the draft).

The two main changes are:
1)
Narrowing down the precision specification to page or part and giving a method for how to distinguish whether a reference should have one or the other. I.e. following the method, a PWID will uniquely identify a resource, where page and part mean two different things.
2)
Moving consideration of web archive identifiers to the “Additional information” part and only describing the identification of web archives by their domains.
With these changes, there is a well-defined way to construct and resolve PWIDs unambiguously. It will of course be possible to construct invalid PWID URNs, but that is true for any URN or reference.

Below, I have responded to your earlier mail as part of your mail text. 

I really hope that you will consider this slimmed version, - as I think that it is a very important and valuable contribution – and important to have it in IETF.

Best regards, Eld

-----
Eld Zierau
Digital Preservation Specialist PhD
The Royal Danish Library
Digital Cultural Heritage
P.O. Box 2149, 1016 Copenhagen K
Ph. +45 9132 4690
Email: elzi@kb.dk

-----Original Message-----
From: Peter Saint-Andre <stpeter@stpeter.im>
Sent: Tuesday, July 9, 2019 1:58 AM
To: Eld Zierau <elzi@kb.dk>; urn@ietf.org
Subject: Re: [urn] new urn PWID draft (7) with corrections

Dear Eld,

Thank you for your continued attention to this discussion.

As the team leader for the expert reviewers [1] I feel it is incumbent on me to provide additional feedback.

I looked again at the list discussion about this namespace registration request. In particular, several list participants raised some difficult issues about the viability of the proposed namespace (see [2][3][4] for example). Summarizing the discussion, it seems we still have open issues around (at least) the following topics:

- the specification contains a great deal of ambiguity (e.g., in the definition of precision-spec types)
Eld: The precision-spec is now (in version 09) only page or part, which mean different things, and there is a described method to find out whether it should be page or part, and what the difference is between a web page being referred to as part or as page.

- it appears that the proposed generation algorithm cannot guarantee the uniqueness of the resulting URNs, which violates core requirements defined in RFC 8141
Eld: I think there is a misunderstanding here. The PWID is defined by the web archive, the time at which the archive archived the resource, and the URI of the (harvested) resource, so there is no way that two different resources can get the same PWID. If I have misunderstood this, please let me know.

- PWID URNs might be intended to define a method for citation, but the proposed syntax does not capture all the information that would be needed to produce stable citations
Eld: I do not recall discussions pointing to this. The only thing I can think of is that the archive itself is “unstable”, which would be the same case as for any URN being “ … deprecated or becomes Obsolete… “ as described in the uniqueness constraint in RFC 8141. The other thing that can be argued is the case where an archive has a new domain; I have discussed that under “additional information” and in my answer above.

- PWID URNs lack a conception of authorized archives, which means that a third party could create archive-ids associated with some other Internet domain or administrative domain over which it has no control or authority
Eld: I don’t understand this. I would expect that a future registration of archives with archive ids would have procedures to ensure that nobody but the web archives themselves can register. In version 09 I have left out the archive identifier part, since no such registry exists yet.

- several aspects of the proposed semantics depend on knowing the media type of the archived resource, but this information is not necessarily available to a person or application that constructs or interprets a PWID URN
Eld: This is not the case for version 09, since a precision-spec can only be part or page, and a method is described for finding out whether part or page would be preferred in the PWID.

- in order for PWID URNs to be usable, they might need to support f-, r-, and q-components from RFC 8141, but this usage has not yet been defined in the specification
Eld: As far as I can see it does not seem to be needed at this stage, but please let me know if you have any specific suggestions.

All in all, there are many open issues with the proposed namespace and it seems premature to approve the registration before these issues are resolved. Unfortunately, some of these issues run so deep that it's unclear whether they *can* be resolved without performing major surgery on the specification.

I wish I could be more positive in my recommendation, but at this time I am not in favor of registering this formal namespace.

Best Regards,

Peter

[1] https://www.iana.org/assignments/urn-namespaces/urn-namespaces.xhtml

[2] https://mailarchive.ietf.org/arch/msg/urn/qZ4qcHHmPJyKEg-YaIHmr_ksG3I

[3] https://mailarchive.ietf.org/arch/msg/urn/pypKM9SY2jSOiyZ5g6icQqFlzfs

[4] https://mailarchive.ietf.org/arch/msg/urn/Jzd9INZxhpNFiQH1ZxUhZ_0wkNY


On 6/4/19 2:11 AM, Eld Zierau wrote:
> I just submitted a version with a minor correction in one of the references (had the wrong title due to a copy/paste error). Can it be accepted as it is now?
> Best regards, Eld
> 
> -----Original Message-----
> From: Eld Zierau
> Sent: Thursday, May 2, 2019 2:56 PM
> To: 'Peter Saint-Andre' <stpeter@stpeter.im>; urn@ietf.org
> Subject: new urn PWID draft (7) with corrections
> 
> Thanks again for your comments
> I have uploaded a draft version 7 and described how I have addressed the comments in the below mail from Peter. Does this cover what is needed?
> 
> Best regards, Eld
> 
> -----Original Message-----
> From: Peter Saint-Andre <stpeter@stpeter.im>
> Sent: Tuesday, April 30, 2019 5:14 AM
> To: Eld Zierau <elzi@kb.dk>
> Cc: urn@ietf.org
> Subject: Re: [urn] Comments on PWID -05 - now PWID -06
> 
> Hello Eld,
> 
> Your proposed syntax (with "~") looks fine to me.
>> Eld: :)
> 
> The ABNF definition of your proposed syntax does not conform to RFC 5234. You can check the ABNF using this tool:
> 
> https://tools.ietf.org/tools/bap/abnf.cgi
> 
>> Eld: it conforms now - thank you so much for providing the link 
>> to the syntax checker - that was very helpful
> 
> In particular, it's not clear to me what a rule like this is intended to mean:
> 
>    registered-archive-id = +( unreserved )
> 
> Do you mean that a registered-archive-id can include one or more instances of characters from the `unreserved` rule? If so, change "+" to "*".
> 
>> Eld: I meant with one or more characters - but I found out it should 
>> then be 1*unreserved and likewise for other occurrences
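
For readers less familiar with RFC 5234 repetition: `*rule` matches zero or more occurrences, `1*rule` matches one or more, and there is no `+` operator. The following is a minimal Python sketch of what `registered-archive-id = 1*unreserved` would accept, assuming `unreserved` is the RFC 3986 rule (ALPHA / DIGIT / "-" / "." / "_" / "~"). The helper name is hypothetical and does not come from the draft.

```python
import re

# RFC 3986: unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = r"[A-Za-z0-9\-._~]"

# ABNF "1*unreserved" (one or more) corresponds to the regex quantifier "+";
# ABNF "*unreserved" (zero or more) corresponds to the regex quantifier "*".
ONE_OR_MORE = re.compile(rf"{UNRESERVED}+")
ZERO_OR_MORE = re.compile(rf"{UNRESERVED}*")


def is_registered_archive_id(candidate: str) -> bool:
    """Hypothetical helper: True if candidate matches 1*unreserved."""
    return ONE_OR_MORE.fullmatch(candidate) is not None
```

The only difference between the two forms is the empty string: `*unreserved` would allow an empty identifier, while `1*unreserved` does not.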
> 
> 
> To simplify the ABNF, you could use the datetime rules from RFC 3339.
> 
>> Eld: I used to in an earlier version, but Dale noticed that there was a difference (in a mail on 28 February 2019): "But comparing that to W3CDTF, I see no single nonterminal which corresponds to the set of formats allowed in W3CDTF.  I suggest you make a more rigid specification as to what is allowed for archival-time." - so I think I had better stick to the rigid version in order to be sure.
> 
> Please don't use `URI` as the name of an ABNF rule because that's already defined in RFC 3986 and could cause confusion. Perhaps call it `uri-string`.
> 
>> Eld: Done
> 
> Personally I found the `precision-spec` categories difficult to understand and sometimes ambiguous. For instance:
> 
> * A precision level of "part" seems to be an HTML file only (at least in the case when "it refers to an html web element"), however a URI can point to many file types other than HTML files. Perhaps "single" (as in a single file) would be clearer; it would also be good to specify how this is handled in the case of file types other than HTML.
> 
> * Does a precision level of "page" apply only to HTML pages with all 
> "referenced web parts"? (By the latter term I think you mean what the 
> HTML 5.2 specification defines as "embedded content"; in general it 
> would be good to align terminology.)
> 
>> Eld: I have rephrased this to make it clearer - it was explained in two 
>> steps before, so I have also restructured a bit to make it 
>> clearer (pages 11-13)
> 
> As to the registration, instead of version 6 it should be version 1 because this is the initial registration (i.e., whenever we are finished with this process it will be the initial version, whereas if you update the entire registration in the future that would be version 2).
> 
>> Eld: got it - I changed it and left the details to a change log comment. 
>> I have also changed the version at the top of the template, since I guess that is the same thing - is that correct?
> 
> The security considerations strike me as underspecified. An archived web page or part could be just as dangerous as a "live" page or part; for instance, it could include insecure scripts, malware, trackers, etc.
> Furthermore, an archived page could in fact be more dangerous, because it could include outdated scripts with known vulnerabilities that can never be patched because the script is archived for all time in a vulnerable state (an attack of this sort was recently discovered in the wild).
> 
>> Eld: You are quite right - I have taken the liberty of rephrasing your 
>> comment and adding it to the section - hope that is ok
> 
> Best Regards,
> 
> Peter
> 
> On 4/29/19 6:10 AM, Eld Zierau wrote:
>> Did any of you have comments to my previous mail?
>> Is there any action you want me to take in order to get it accepted?
>> Best Regards, Eld
>>
>> -----Original Message-----
>> From: Eld Zierau
>> Sent: Friday, March 1, 2019 1:29 PM
>> To: 'Martin J. Dürst' <duerst@it.aoyama.ac.jp>; 'Dale R. Worley' 
>> <worley@ariadne.com>
>> Cc: 'urn@ietf.org' <urn@ietf.org>; 'L.Svensson@dnb.de' 
>> <L.Svensson@dnb.de>
>> Subject: [urn] Comments on PWID -05 - now PWID -06
>>
>> I have now uploaded a new version: draft-pwid-urn-specification-06
>>  - and thanks again for comments and suggestions
>>
>> Regarding the suggestion from Martin (included below), I can as a computer scientist certainly see the reasoning as quite obvious. However, my experience with presenting the PWID is that users find syntax based on computational reasoning illogical, e.g. that the archived-item-id (usually a URI) is placed at the end of the PWID. I believe that adding a "~" for identifiers that are registered separately is acceptable for such users, but I am also convinced that a "+" before a domain will confuse (non-computer science) users a lot.
>> Also, as said in my previous mail, it is highly unlikely that there will ever be a case where "~" is the first character in a domain for a web archive. Therefore, the "+" prefix should not be necessary.
>> A further minor point is that all existing PWIDs (and the tools providing and resolving them) would no longer comply, whereas they otherwise would (none of these use registered identifiers yet, only domains and URIs).
>> In other words: I will be very sorry to add a "+" to domains, and I believe it is not necessary.
>>
>> The uploaded version does not include a "+" for domains. If 
>> required, I will of course add it (although I would be sorry to do so).
>>
>> Please let me know if it is acceptable, and I will act accordingly.
>>
>> Best regards, Eld
>>
>>
>> On 2019/03/01 11:31, Dale R. Worley wrote:
>>> Martin J. Duerst <duerst@it.aoyama.ac.jp> writes:
>>>>> [...]  E.g., one could require that any archive-id that is not 
>>>>> intended to be interpreted as a DNS name to start with one of "-", 
>>>>> ".", "_", "~".
>>>>
>>>> I haven't looked into the details, but in general, I think this is 
>>>> a bad idea. It is much better to have an explicit distinction than 
>>>> to rely on some syntax restrictions. Such syntax restrictions may 
>>>> or may not actually hold in practice. It's very easy to create a 
>>>> DNS name starting with '-' or '_', for example, even though officially, that's not allowed.
>>>
>>> I may agree with you ... But what do you mean by "an explicit 
>>> distinction"?  E.g., I would tend to consider "archive-ids starting 
>>> with '~' are registered archive names, and archive-ids that do not 
>>> are considered DNS names" to be an "explicit" distinction, but you 
>>> mean something else.
>>
>> Well, the explicit distinction would be "if it starts with '~', what follows is a registered archive name, and if it starts with '+', what follows is a DNS name" or some such. This would not exclude any leading characters in either archive names or DNS names.
>>
>> Regards,   Martin.
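
The two alternatives discussed above can be compared in a small Python sketch. It assumes only what the thread states: under the explicit scheme, "~" introduces a registered archive name and "+" introduces a DNS name; under the -06 draft, "~" introduces a registered archive name and anything else is read as a domain. The function names and the sample identifiers are hypothetical and are not taken from the draft.

```python
from typing import Tuple


def classify_explicit(archive_id: str) -> Tuple[str, str]:
    """Explicit scheme: every archive-id must carry a '~' or '+' prefix."""
    if archive_id.startswith("~"):
        return ("registered", archive_id[1:])
    if archive_id.startswith("+"):
        return ("dns", archive_id[1:])
    raise ValueError("archive-id must start with '~' or '+'")


def classify_tilde_only(archive_id: str) -> Tuple[str, str]:
    """-06 style: '~' marks a registered name, anything else is a domain."""
    if not archive_id:
        raise ValueError("archive-id must not be empty")
    if archive_id.startswith("~"):
        return ("registered", archive_id[1:])
    return ("dns", archive_id)
```

Under the explicit scheme an unprefixed identifier such as `archive.org` is rejected, which illustrates the backward-compatibility concern raised for existing PWIDs. The tilde-only scheme accepts it but relies on no domain ever starting with "~".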
>>
>>> Or maybe the right question is, What do you propose as an alternative?
>> _______________________________________________
>> urn mailing list
>> urn@ietf.org
>> https://www.ietf.org/mailman/listinfo/urn
>>
> 