Re: [urn] new draft 9 - RE: new urn PWID draft (7) with corrections

Dear Eld; all, 

I have some general and specific comments about this version of the PWID requests: 

There is a fair amount of redundancy: the same text is repeated in the introduction and in later chapters. The draft would be easier to read if this kind of duplication were reduced.

The process specified in RFC 8141 should be used to register a namespace. That is, the registration does not need to be an RFC unless there is an acceptable reason to do so (in case of NBN, there is no other document to describe NBNs than RFC 8458). Note that the RFC process is more complicated than the RFC 8141 one, even for informational RFCs. 

The reference to W3CDTF is out of date, since the problem the W3C profile fixes (ISO 8601 allowed years to be written with two digits) has been corrected in ISO 8601:2004 and later versions of the standard.

ISO 8601 was revised in 2019, so unless you want to cite the old version you should either change the year in the citation or remove it completely. Without the year, the citation will always refer to the latest version of the standard. I recommend the latter approach.

On page 9, the sentence "The following valid precision-spec values are exists:" should be fixed. Whether it is enough to specify just page and file is a good question; in theory it is also possible to specify all copies of a page or all pages harvested from a web site, but this is clearly beyond the intended scope of PWID.

Concerning the two issues you mention below: 

Domain name of the archive is indeed a weak way of identifying the archive.  It might be better not to talk about identification at all in this context, and just call this component domain name of the archive, and say that the name will separate the Web archives from one another. 

Web archives are organizations and they may eventually receive e.g. an ISNI (International Standard Name Identifier) or another organization identifier. Surprisingly, not even the Internet Archive has an ISNI yet. And even if there were one, it would be difficult to use it during the automatic PWID generation process.

I do not agree with this statement: 

"it would not make sense to have direct assignment of identifiers for all archived web material at any point. Firstly, because the volume of data is too big, and secondly, because assignment of an identifier would require that you can point to the resource, which will be impossible for web resources with restricted access, which place us back at the starting point."

If URNs are machine generated, it is possible to create billions of them. RFC 8141 does not impose any limits on how many URNs there can be per namespace. Of course technologies used may set some limits; Norwegian Web archive initially created URNs for everything but they had to stop when the database table became too large, after several hundreds of millions of assigned URNs. 

The URN standard does not require URNs to be actionable, so the fact that some archived resources are not publicly accessible is not a problem either.

This brings me to the issue I find hard to understand: 

PWID supports references to material in web archives with restricted access. Since these archives are usually / always built using the same tools as publicly available archives (our legal deposit archive uses the same technology as the Internet Archive), all documents in our legal deposit Web archive have archive-specific cool URIs that can be used for citing either all versions of a web page or a specific harvested version of it. We can create URNs which can be used as links to these resources in such a way that the URN resolves to all the harvested copies of the page, either in our Web archive or in our own archive or elsewhere. We are currently in the process of making our resolver compliant with RFC 8141, and part of the revision is this URN -> URLs functionality. The link to the legal deposit Web archive will only be seen if the user is entitled to access the archive (that is, people on dedicated work stations, with known IP addresses).

What is the added value of URN:PWID for a user like us, because having a non-actionable URN:PWID is not attractive when we can have e.g. URN:NBN which resolves to any archived copy of the resource (and to the one on-line in the Web)?     

It is a well-known problem that copies of Web pages in different archives may differ, since the individual files have been harvested at different times. But I am not sure if URN:PWID makes it easier to solve this problem (compared with e.g. usage of URN:NBN). The former seems to be limited to page and part (in precision-spec), whereas we may want to identify with a URN all pages harvested from a web site at all times (e.g. when harvesting newspaper Web sites) or all versions of a page, not just one.

To sum up: this version of PWID registration request has fewer technical issues than the previous one, but I still don't know how / why for instance my library could / should use URN:PWIDs to identify resources in our restricted legal deposit Web archive. Use cases (general ones, or ones describing the plans for Denmark) might help me to see the light. My library may not find uses for PWID, but if Royal Danish Library has them and if there are no obvious technical issues in a future version of this document, I can accept this registration request.  

Best regards, 

Juha 

-----Original Message-----
From: urn <urn-bounces@ietf.org> On Behalf Of Eld Zierau
Sent: Friday, September 6, 2019 10:44
To: Peter Saint-Andre <stpeter@stpeter.im>; urn@ietf.org
Subject: [urn] new draft 9 - RE: new urn PWID draft (7) with corrections

Dear Peter

I will urge you to reconsider. I have made a new much slimmer draft version 09 (https://www.ietf.org/internet-drafts/draft-pwid-urn-specification-09.txt), where I have cut down on ambitions and left out parts that are not crucial for the PWID in order to fill the current gap of referencing methods to web archives. The cut down of ambitions means that most issues are solved as explained below. 

Let me stress that I really think it is important that this is a standard obtained within IETF as it is very much related to the internet. The PWID can cover a very important gap, as there are no ways to make proper references to archived material in web archives with restricted access. Many web archives have restrictions due to personal data issues, and the lack of referencing techniques is a problem for researchers, because publishers in many cases accept papers only if all their references follow a standard. This is a problem worldwide, and it will have consequences for the use of web archives as sources for research, even though a lot of communication relevant for our society is actually happening on the web.

With the new version, I think there are only two considerations left:
1)
The PWID is a bit different to other URNs by having an indirect identifier assignment instead of a direct one, since the identifier is indirectly assigned when the resource is archived: archive, timestamp and harvested URI – and supplied with precision of how much the reference includes (precision-spec).
It should be noted that it would not make sense to have direct assignment of identifiers for all archived web material at any point. Firstly, because the volume of data is too big, and secondly, because assignment of an identifier would require that you can point to the resource, which will be impossible for web resources with restricted access, which place us back at the starting point.
2)
The web archive is not identified via a registry, but via the domain name of the web archive. In the long term, a registry in which the history of changed domains can easily be traced would be better. However, it would probably be possible to trace changes until a formal registry is established. I will certainly work for establishing a registry, where there are procedures for registration which ensure that no “… third party can create archive-ids associated with some other Internet domain or administrative domain over which it has no control or authority”.
If this is not acceptable, it will probably never be possible, since this is a question of whether the hen or the egg comes first. If the PWID cannot be accepted due to a missing registry, - a creation of a registry would probably be rejected, since it does not have a purpose yet. To me it is most obvious to start with the PWID URN without the registry, since it has a very good purpose by filling a gap, and it can be argued that the proposed identification of archives is workable and changes are traceable (additional discussion of this can be found in the “Additional Information” section of the draft).

The two main changes are:
1)
Narrowing down the precision specification to page or part and giving a method for how to distinguish whether a reference should have one or the other. I.e. following the method, a PWID will uniquely identify a resource, where page and part will mean two different things.
2)
Moving consideration of web archive identifiers to the “Additional information” part and only describing the identification of web archives by their domains.
With these changes, there is a well-defined way to construct and resolve PWIDs unambiguously. It will of course be possible to construct invalid PWID URNs, but that is true for any URN or reference.
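
As a rough illustration of constructing and reading back a PWID from the components named above (archive domain, archival timestamp, precision-spec and harvested URI), here is a minimal Python sketch. Note the assumptions: the colon-separated layout, the UTC timestamp form, the example values and the helper names build_pwid / parse_pwid are all hypothetical and chosen only for illustration; the ABNF in the draft is the authoritative definition.

```python
import re
from typing import NamedTuple


class Pwid(NamedTuple):
    """The four components discussed in this thread."""
    archive: str     # domain name of the web archive
    timestamp: str   # archival time (UTC)
    precision: str   # precision-spec: "page" or "part"
    uri: str         # harvested URI


# ASSUMPTION: colon-separated layout, for illustration only.
# The archive field cannot contain ":" and the timestamp has a fixed
# shape, so the URI (which may itself contain ":") is whatever remains.
PWID_RE = re.compile(
    r"^urn:pwid:"
    r"(?P<archive>[A-Za-z0-9.-]+):"
    r"(?P<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z):"
    r"(?P<precision>page|part):"
    r"(?P<uri>.+)$"
)


def build_pwid(archive: str, timestamp: str, precision: str, uri: str) -> str:
    """Assemble a PWID string from its components (hypothetical helper)."""
    if precision not in ("page", "part"):
        raise ValueError("precision-spec must be 'page' or 'part'")
    return f"urn:pwid:{archive}:{timestamp}:{precision}:{uri}"


def parse_pwid(text: str) -> Pwid:
    """Split a PWID string back into its components (hypothetical helper)."""
    match = PWID_RE.match(text)
    if match is None:
        raise ValueError(f"not a valid PWID under the assumed layout: {text!r}")
    return Pwid(**match.groupdict())


example = build_pwid("archive.org", "2016-01-22T11:20:29Z", "page", "http://www.dr.dk")
round_trip = parse_pwid(example)
```

Under these assumptions, two PWIDs are equal only if archive, timestamp, precision-spec and URI are all equal, which is the uniqueness argument made in this thread.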

Below, I have responded to your earlier mail as part of your mail text. 

I really hope that you will consider this slimmed version, - as I think that it is a very important and valuable contribution – and important to have it in IETF.

Best regards, Eld

-----
Eld Zierau
Digital Preservation Specialist PhD
The Royal Danish Library
Digital Cultural Heritage
P.O. Box 2149, 1016 Copenhagen K
Ph. +45 9132 4690
Email: elzi@kb.dk

-----Original Message-----
From: Peter Saint-Andre <stpeter@stpeter.im>
Sent: Tuesday, July 9, 2019 1:58 AM
To: Eld Zierau <elzi@kb.dk>; urn@ietf.org
Subject: Re: [urn] new urn PWID draft (7) with corrections

Dear Eld,

Thank you for your continued attention to this discussion.

As the team leader for the expert reviewers [1] I feel it is incumbent on me to provide additional feedback.

I looked again at the list discussion about this namespace registration request. In particular, several list participants raised some difficult issues about the viability of the proposed namespace (see [2][3][4] for example). Summarizing the discussion, it seems we still have open issues around (at least) the following topics:

- the specification contains a great deal of ambiguity (e.g., in the definition of precision-spec types)
Eld: The precision-spec is now (in version 09) only page or part, which mean different things. There is a described method for finding out whether it should be page or part, and for what the difference is between a web page being referred to as part or as page.

- it appears that the proposed generation algorithm cannot guarantee the uniqueness of the resulting URNs, which violates core requirements defined in RFC 8141
Eld: I think there is a misunderstanding here – the PWID is defined by the web archive, when the archive archived the resource and the URI of the (harvested) resource, - so there is no way that two different resources can get the same PWID. If I have misunderstood this, please let me know.

- PWID URNs might be intended to define a method for citation, but the proposed syntax does not capture all the information that would be needed to produce stable citations
Eld: I do not recall discussions pointing to this. The only thing I can think of is that the archive itself is “unstable”, which would be the same case as for any URN being “ … deprecated or becomes Obsolete… “ as it is described in the uniqueness constraint in RFC 8141. The other thing that can be argued is the case where an archive has a new domain – I have discussed that under “additional information” and in my answer above.

- PWID URNs lack a conception of authorized archives, which means that a third party could create archive-ids associated with some other Internet domain or administrative domain over which it has no control or authority
Eld: I don’t understand this – I would expect that a future registration of archives with archive id have procedures to ensure that nobody but the web archives themselves can register. In version 09 I have left out the archive identifier part, since no such registry exists yet.

- several aspects of the proposed semantics depend on knowing the media type of the archived resource, but this information is not necessarily available to a person or application that constructs or interprets a PWID URN
Eld: This is not the case for version 09, since a precision-spec can only be part or page, and there is described a method to find out whether it is part or page that would be preferred in the PWID.

- in order for PWID URNs to be usable, they might need to support f-, r-, and q-components from RFC 8141, but this usage has not yet been defined in the specification
Eld: As far as I can see it does not seem to be needed at this stage, but please let me know if you have any specific suggestions.

All in all, there are many open issues with the proposed namespace and it seems premature to approve the registration before these issues are resolved. Unfortunately, some of these issues run so deep that it's unclear whether they *can* be resolved without performing major surgery on the specification.

I wish I could be more positive in my recommendation, but at this time I am not in favor of registering this formal namespace.

Best Regards,

Peter

[1] https://www.iana.org/assignments/urn-namespaces/urn-namespaces.xhtml

[2] https://mailarchive.ietf.org/arch/msg/urn/qZ4qcHHmPJyKEg-YaIHmr_ksG3I

[3] https://mailarchive.ietf.org/arch/msg/urn/pypKM9SY2jSOiyZ5g6icQqFlzfs

[4] https://mailarchive.ietf.org/arch/msg/urn/Jzd9INZxhpNFiQH1ZxUhZ_0wkNY

On 6/4/19 2:11 AM, Eld Zierau wrote:
> I just submitted a version with a minor correction in one of the 
> references (had the wrong title due to a copy/paste error) Can it be accepted as it is now?
> Best regards, Eld
> 
> -----Original Message-----
> From: Eld Zierau
> Sent: Thursday, May 2, 2019 2:56 PM
> To: 'Peter Saint-Andre' <stpeter@stpeter.im>; urn@ietf.org
> Subject: new urn PWID draft (7) with corrections
> 
> Thanks again for your comments
> I have uploaded a draft version 7 and described how I have addressed the comments in the mail from Peter below. Does this cover what is needed?
> 
> Best regards, Eld
> 
> -----Original Message-----
> From: Peter Saint-Andre <stpeter@stpeter.im>
> Sent: Tuesday, April 30, 2019 5:14 AM
> To: Eld Zierau <elzi@kb.dk>
> Cc: urn@ietf.org
> Subject: Re: [urn] Comments on PWID -05 - now PWID -06
> 
> Hello Eld,
> 
> Your proposed syntax (with "~") looks fine to me.
>> Eld: :)
> 
> The ABNF definition of your proposed syntax does not conform to RFC 5234. You can check the ABNF using this tool:
> 
> https://tools.ietf.org/tools/bap/abnf.cgi
> 
>> Eld: it conforms now - thank you so much for providing this the link 
>> to the syntax checker - that was very helpful
> 
> In particular, it's not clear to me what a rule like this is intended 
> to
> mean:
> 
>    registered-archive-id = +( unreserved )
> 
> Do you mean that a registered-archive-id can include one or more instances of characters from the `unreserved` rule? If so, change "+" to "*".
> 
>> Eld: I meant with one or more characters - but I found out it should 
>> then be 1*unreserved and likewise for other occurrences
> 
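
For readers less familiar with RFC 5234 repetition: "*rule" means zero or more occurrences and "1*rule" means one or more, so "1*unreserved" is the form that excludes an empty identifier. A minimal Python sketch of the equivalent check follows; the function name is hypothetical, and the character set is the `unreserved` rule from RFC 3986 (ALPHA / DIGIT / "-" / "." / "_" / "~").

```python
import re

# unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"   (RFC 3986, section 2.3)
# "+" in the regex plays the role of ABNF "1*": at least one character.
ONE_OR_MORE_UNRESERVED = re.compile(r"[A-Za-z0-9._~-]+")


def matches_one_or_more_unreserved(value: str) -> bool:
    """Return True if value matches the ABNF 1*unreserved (hypothetical helper)."""
    return ONE_OR_MORE_UNRESERVED.fullmatch(value) is not None
```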
> 
> To simplify the ABNF, you could use the datetime rules from RFC 3339.
> 
>> Eld: I used to in an earlier version, but Dale noticed that there was a difference (in mail on 28th of February 2019): "But comparing that to W3CDTF, I see no single nontermainal which corresponds to the set of formats allowed in W3CDTF.  I suggest you make a more rigid specification as to what is allwed for archival-time." - so I think I better stick to the rigid version in order to be sure.
> 
> Please don't use `URI` as the name of an ABNF rule because that's already defined in RFC 3986 and could cause confusion. Perhaps call it `uri-string`.
> 
>> Eld: Done
> 
> Personally I found the `precision-spec` categories difficult to understand and sometimes ambiguous. For instance:
> 
> * A precision level of "part" seems to be an HTML file only (at least in the case when "it refers to an html web element"), however a URI can point to many file types other than HTML files. Perhaps "single" (as in a single file) would be clearer; it would also be good to specify how this is handled in the case of file types other than HTML.
> 
> * Does a precision level of "page" apply only to HTML pages with all 
> "referenced web parts"? (By the latter term I think you mean what the 
> HTML 5.2 specification defines as "embedded content"; in general it 
> would be good to align terminology.)
> 
>> Eld: I have rephrased to make it more clear - it was explained in two 
>> steps before, - I have therefore also restructured a bit to make it 
>> more clear (page 11-13)
> 
> As to the registration, instead of version 6 it should be version 1 because this is the initial registration (i.e., whenever we are finished with this process it will be the initial version, whereas if you update the entire registration in the future that would be version 2).
> 
>> Eld: got it - I change it and left details to change log comment Eld? 
>> I have also change the version in the top of the template - since I guess that is the same thing - is that correct?
> 
> The security considerations strike me as underspecified. An archived web page or part could be just as dangerous as a "live" page or part; for instance, it could include insecure scripts, malware, trackers, etc.
> Furthermore, an archived page could in fact be more dangerous, because it could include outdated scripts with known vulnerabilities that can never be patched because the script is archived for all time in a vulnerable state (an attack of this sort was recently discovered in the wild).
> 
>> Eld: You are quite right, - I have taken the liberty to rephrase you 
>> comment and add it to the section, - hope that is ok
> 
> Best Regards,
> 
> Peter
> 
> On 4/29/19 6:10 AM, Eld Zierau wrote:
>> Did any of you have comments to my previous mail?
>> Is there any action you want me to take in order to get it accepted?
>> Best Regards, Eld
>>
>> -----Original Message-----
>> From: Eld Zierau
>> Sent: Friday, March 1, 2019 1:29 PM
>> To: 'Martin J. Dürst' <duerst@it.aoyama.ac.jp>; 'Dale R. Worley' 
>> <worley@ariadne.com>
>> Cc: 'urn@ietf.org' <urn@ietf.org>; 'L.Svensson@dnb.de' 
>> <L.Svensson@dnb.de>
>> Subject: [urn] Comments on PWID -05 - now PWID -06
>>
>> I have now uploaded a new version: draft-pwid-urn-specification-06
>>  - and thanks again for comments and suggestions
>>
>> Regarding the suggestion from Martin (included below), I can as a computer scientist certainly see the reasoning as quite obvious. However, my experience with presentation of the PWID is that syntax based on computational reasoning is something that users find illogical, e.g. that the archived-item-id (usually URI) is included at the end of the PWID. I believe that adding a "~" for identifiers that are registered separately is acceptable for such users, but I am also convinced that a "+" before a domain will be something that confuses (non-computer science) users a lot. 
>> Also, as said in my previous mail, it is highly unlikely that there will ever be a case where "~" is the first character in a domain for a web archive. Therefore, it seems that it should not be necessary. 
>> A minor extra thing is that all existing PWIDs (and tools providing and resolving PWIDs) would not comply, which they would otherwise (none of these use registered identifiers yet only domains and URIs).
>> In other words: I will be very sorry to add a "+" to domains, and I believe it is not necessary.
>>
>> The uploaded version  does not include a "+" to domains, - If 
>> required, I will of course add it (although sorry to do so)
>>
>> Please let me know if it is acceptable, and I will act accordingly.
>>
>> Best regards, Eld
>>
>>
>> On 2019/03/01 11:31, Dale R. Worley wrote:
>>> Martin J. Duerst <duerst@it.aoyama.ac.jp> writes:
>>>>> [...]  E.g., one could require that any archive-id that is not 
>>>>> intended to be interpreted as a DNS name to start with one of "-", 
>>>>> ".", "_", "~".
>>>>
>>>> I haven't looked into the details, but in general, I think this is 
>>>> a bad idea. It is much better to have an explicit distinction than 
>>>> to rely on some syntax restrictions. Such syntax restrictions may 
>>>> or may not actually hold in practice. It's very easy to create a 
>>>> DNS name starting with '-' or '_', for example, even though officially, that's not allowed.
>>>
>>> I may agree with you ... But what do you mean by "an explicit 
>>> distinction"?  E.g., I would tend to consider "archive-ids starting 
>>> with '~' are registered archive names, and archive-ids that do not 
>>> are considered DNS names" to be an "explicit" distinction, but you 
>>> mean something else.
>>
>> Well, the explicit distinction would be "if it starts with '~', what follows is a registered archive name, and if it starts with '+', what follows is a DNS name" or some such. This would not exclude any leading characters in either archive names or DNS names.
>>
>> Regards,   Martin.
>>
>>> Or maybe the right question is, What do you propose as an alternative?
>> _______________________________________________
>> urn mailing list
>> urn@ietf.org
>> https://www.ietf.org/mailman/listinfo/urn
>>
> 