[urn] new draft 9 - RE: new urn PWID draft (7) with corrections

Dear Peter

I will urge you to reconsider. I have made a new much slimmer draft version 09 (https://www.ietf.org/internet-drafts/draft-pwid-urn-specification-09.txt), where I have cut down on ambitions and left out parts that are not crucial for the PWID in order to fill the current gap of referencing methods to web archives. The cut down of ambitions means that most issues are solved as explained below. 

Let me stress that I really think it is important that this is a standard obtained within IETF as it is very much related to the internet. The PWID can cover a very important gap, as there are no ways to make proper references to archived material in web archives with restricted access. Many web archives have restrictions due to personal data issues, and the lack of referencing techniques means that researchers have the problem that publishers in many cases require researchers to only to have papers accepted if all their references are following a standard. This is a problem worldwide, and it will have consequences for the use of web archives as sources for research, even though a lot of communication relevant for our society is actually happening on the web.

With the new version, I think there are only two considerations left:
1)
The PWID is a bit different to other URNs by having an indirect identifier assignment instead of a direct one, since the identifier is indirectly assigned when the resource is archived: archive, timestamp and harvested URI – and supplied with precision of how much the reference includes (precision-spec).
It should be noted that it would not make sense to have direct assignment of identifiers for all archived web material at any point. Firstly, because the volume of data is too big, and secondly, because assignment of an identifier would require that you can point to the resource, which will be impossible for web resources with restricted access, which place us back at the starting point.
2)
The identification of a web archive is not via a registry, but identified via the domain name for the web archive. Over long term, it would be better with a registry where the history of changed domain can easily be traced. However, it would probably be possible to trace changes until a formal registry is established. I will certainly work for establishing a registry, where there are procedures for registration which ensure that no “… third party can create archive-ids associated with some other Internet domain or administrative domain over which it has no control or authority”.
If this is not acceptable, it will probably never be possible, since this is a question of whether the hen or the egg comes first. If the PWID cannot be accepted due to a missing registry, - a creation of a registry would probably be rejected, since it does not have a purpose yet. To me it is most obvious to start with the PWID URN without the registry, since it has a very good purpose by filling a gap, and it can be argued that the proposed identification of archives is workable and changes are traceable (additional discussion of this can be found in the “Additional Information” section of the draft).

The two main changes are:
1) 
Narrowing down the precision specification to page or part and giving a method of how two distinguish whether a reference should have one or the other. I.e. following the method a PWID will uniquely identify a resource, - where page and part will mean two different things.
2)
Moving consideration of web archive identifiers to the “Additional information” part and only describing the identification of web archives by their domains.
With these changes, there are a well-defined way to construct and resolve PWIDs unambiguously. It will of course be possible to construct invalid PWID URNs, but that is true for any URN or reference.

Below, I have responded to your earlier mail as part of your mail text. 

I really hope that you will consider this slimmed version, - as I think that it is a very important and valuable contribution – and important to have it in IETF.

Best regards, Eld

-----
Eld Zierau
Digital Preservation Specialist PhD
The Royal Danish Library
Digital Cultural Heritage
P.O. Box 2149, 1016 Copenhagen K
Ph. +45 9132 4690 
Email: elzi@kb.dk

-----Original Message-----
From: Peter Saint-Andre <stpeter@stpeter.im> 
Sent: Tuesday, July 9, 2019 1:58 AM
To: Eld Zierau <elzi@kb.dk>; urn@ietf.org
Subject: Re: [urn] new urn PWID draft (7) with corrections

Dear Eld,

Thank you for your continued attention to this discussion.

As the team leader for the expert reviewers [1] I feel it is incumbent on me to provide additional feedback.

I looked again at the list discussion about this namespace registration request. In particular, several list participants raised some difficult issues about the viability of the proposed namespace (see [2][3][4] for example). Summarizing the discussion, it seems we still have open issues around (at least) the following topics:

- the specification contains a great deal of ambiguity (e.g., in the definition of precision-spec types)
Eld: The precisions-spec is now (in version 09) only page or part – which means different things and where there is a described method to find out whether it should be page or part, - and what the difference is of a web page being referred as part or page.

- it appears that the proposed generation algorithm cannot guarantee the uniqueness of the resulting URNs, which violates core requirements defined in RFC 8141
Eld: I think there is a misunderstanding here – the PWID is defined by the web archive, when the archive archived the resource and the URI of the (harvested) resource, - so there is no way that two different resources can get the same PWID. If I have misunderstood this, please let me know.

- PWID URNs might be intended to define a method for citation, but the proposed syntax does not capture all the information that would be needed to produce stable citations
Eld: I do not recall discussions pointing to this.– The only thing I can think of is the archive itself is “unstable” which would be the same case as for any URN being “ … deprecated or becomes Obsolete… “ as it is described in the uniqueness constraint in RFC 8141. The other thing that can be argued is the case where an archive has a new domain – I have discussed that under “additional information” and in my answer above.

- PWID URNs lack a conception of authorized archives, which means that a third party could create archive-ids associated with some other Internet domain or administrative domain over which it has no control or authority
Eld: I don’t understand this – I would expect that a future registration of archives with archive id have procedures to ensure that nobody but the web archives themselves can register. In version 09 I have left out the archive identifier part, since no such registry exists yet.

- several aspects of the proposed semantics depend on knowing the media type of the archived resource, but this information is not necessarily available to a person or application that constructs or interprets a PWID URN
Eld: This is not the case for version 09, since a precision-spec can only be part or page, and there is described a method to find out whether it is part or page that would be preferred in the PWID.

- in order for PWID URNs to be usable, they might need to support f-, r-, and q-components from RFC 8141, but this usage has not yet been defined in the specification
Eld: As far as I can see it does not seem to be needed at this stage, but please let me know if you have any specific suggestions.

All in all, there are many open issues with the proposed namespace and it seems premature to approve the registration before these issues are resolved. Unfortunately, some of these issues run so deep that it's unclear whether they *can* be resolved without performing major surgery on the specification.

I wish I could be more positive in my recommendation, but at this time I am not in favor of registering this formal namespace.

Best Regards,

Peter

[1] https://www.iana.org/assignments/urn-namespaces/urn-namespaces.xhtml

[2] https://mailarchive.ietf.org/arch/msg/urn/qZ4qcHHmPJyKEg-YaIHmr_ksG3I

[3] https://mailarchive.ietf.org/arch/msg/urn/pypKM9SY2jSOiyZ5g6icQqFlzfs

[4] https://mailarchive.ietf.org/arch/msg/urn/Jzd9INZxhpNFiQH1ZxUhZ_0wkNY

On 6/4/19 2:11 AM, Eld Zierau wrote:
> I just submitted a version with a minor correction in one of the 
> references (had the wrong title due to a copy/paste error) Can it be accepted as it is now?
> Best regards, Eld
> 
> -----Original Message-----
> From: Eld Zierau
> Sent: Thursday, May 2, 2019 2:56 PM
> To: 'Peter Saint-Andre' <stpeter@stpeter.im>; urn@ietf.org
> Subject: new urn PWID draft (7) with corrections
> 
> Thanks again for your comments
> I have uploaded a draft version 7 - and described how I have addressed the comments in the below mail from Peter Does this cover what is needed?
> 
> Best regards, Eld
> 
> -----Original Message-----
> From: Peter Saint-Andre <stpeter@stpeter.im>
> Sent: Tuesday, April 30, 2019 5:14 AM
> To: Eld Zierau <elzi@kb.dk>
> Cc: urn@ietf.org
> Subject: Re: [urn] Comments on PWID -05 - now PWID -06
> 
> Hello Eld,
> 
> Your proposed syntax (with "~") looks fine to me.
>> Eld: :)
> 
> The ABNF definition of your proposed syntax does not conform to RFC 5234. You can check the ABNF using this tool:
> 
> https://tools.ietf.org/tools/bap/abnf.cgi
> 
>> Eld: it conforms now - thank you so much for providing this the link 
>> to the syntax checker - that was very helpful
> 
> In particular, it's not clear to me what a rule like this is intended 
> to
> mean:
> 
>    registered-archive-id = +( unreserved )
> 
> Do you mean that a registered-archive-id can include one or more instances of characters from the `unreserved` rule? If so, change "+" to "*".
> 
>> Eld: I meant with one or more characters - but I found out it should 
>> then be 1*unreserved and likewise for other occurrences
> 
> 
> To simplify the ABNF, you could use the datetime rules from RFC 3339.
> 
>> Eld: I used to in an earlier version, but Dale noticed that there was a difference (in mail on 28th of February 2019): "But comparing that to W3CDTF, I see no single nontermainal which corresponds to the set of formats allowed in W3CDTF.  I suggest you make a more rigid specification as to what is allwed for archival-time." - so I think I better stick to the rigid version in order to be sure.
> 
> Please don't use `URI` as the name of an ABNF rule because that's already defined in RFC 3986 and could cause confusion. Perhaps call it `uri-string`.
> 
>> Eld: Done
> 
> Personally I found the `precision-spec` categories difficult to understand and sometimes ambiguous. For instance:
> 
> * A precision level of "part" seems to be an HTML file only (at least in the case when "it refers to an html web element"), however a URI can point to many file types other than HTML files. Perhaps "single" (as in a single file) would be clearer; it would also be good to specify how this is handled in the case of file types other than HTML.
> 
> * Does a precision level of "page" apply only to HTML pages with all 
> "referenced web parts"? (By the latter term I think you mean what the 
> HTML 5.2 specification defines as "embedded content"; in general it 
> would be good to align terminology.)
> 
>> Eld: I have rephrased to make it more clear - it was explained in two 
>> steps before, - I have therefore also restructured a bit to make it 
>> more clear (page 11-13)
> 
> As to the registration, instead of version 6 it should be version 1 because this is the initial registration (i.e., whenever we are finished with this process it will be the initial version, whereas if you update the entire registration in the future that would be version 2).
> 
>> Eld: got it - I change it and left details to change log comment Eld? 
>> I have also change the version in the top of the template - since I guess that is the same thing - is that correct?
> 
> The security considerations strike me as underspecified. An archived web page or part could be just as dangerous as a "live" page or part; for instance, it could include insecure scripts, malware, trackers, etc.
> Furthermore, an archived page could in fact be more dangerous, because it could include outdated scripts with known vulnerabilities that can never be patched because the script is archived for all time in a vulnerable state (an attack of this sort was recently discovered in the wild).
> 
>> Eld: You are quite right, - I have taken the liberty to rephrase you 
>> comment and add it to the section, - hope that is ok
> 
> Best Regards,
> 
> Peter
> 
> On 4/29/19 6:10 AM, Eld Zierau wrote:
>> Did any of you have comments to my previous mail?
>> Is there any action you want me to take in order to get it accepted?
>> Best Regards, Eld
>>
>> -----Original Message-----
>> From: Eld Zierau
>> Sent: Friday, March 1, 2019 1:29 PM
>> To: 'Martin J. Dürst' <duerst@it.aoyama.ac.jp>; 'Dale R. Worley' 
>> <worley@ariadne.com>
>> Cc: 'urn@ietf.org' <urn@ietf.org>; 'L.Svensson@dnb.de' 
>> <L.Svensson@dnb.de>
>> Subject: [urn] Comments on PWID -05 - now PWID -06
>>
>> I have now uploade a new version: draft-pwid-urn-specification-06
>>  - and thanks again for comments and suggestions
>>
>> Regarding the suggestion from Martin (included below), I can as a computer scientist certainly see the reasoning as quite obvious. However, my experience with presentation of the PWID is that syntax based on computational reasoning is something that users find illogically, e.g. that the archived-item-id (usually URI) is included in the end of the PWID. I believe that adding a "~" for identifiers that are registered separately is acceptable for such users, but I am also convinced that a "+" before a domain will be something that confuses (non-computer science) users a lot. 
>> Also, as said in my previous mail, it is highly unlikely that there will ever be a case where "~" is the first character in a domain for a web archive. Therefore, it seems that it should not be necessary. 
>> A minor extra thing is that all existing PWIDs (and tools providing and resolving PWIDs) would not comply, which they would otherwise (none of these use registered identifiers yet only domains and URIs).
>> In other words: I will be very sorry to add a "+" to domains, and I believe it is not necessary.
>>
>> The uploaded version  does not include a "+" to domains, - If 
>> required, I will of course add it (although sorry to do so)
>>
>> Please let me know if it acceptable, and I will act accordingly.
>>
>> Best regards, Eld
>>
>>
>> On 2019/03/01 11:31, Dale R. Worley wrote:
>>> Martin J. Duerst <duerst@it.aoyama.ac.jp> writes:
>>>>> [...]  E.g., one could require that any archive-id that is not 
>>>>> intended to be interpreted as a DNS name to start with one of "-", 
>>>>> ".", "_", "~".
>>>>
>>>> I haven't looked into the details, but in general, I think this is 
>>>> a bad idea. It is much better to have an explicit distinction than 
>>>> to rely on some syntax restrictions. Such syntax restrictions may 
>>>> or may not actually hold in practice. It's very easy to create a 
>>>> DNS name starting with '-' or '_', for example, even though officially, that's not allowed.
>>>
>>> I may agree with you ... But what do you mean by "an explicit 
>>> distinction"?  E.g., I would tend to consider "archive-ids starting 
>>> with '~' are registered archive names, and archive-ids that do not 
>>> are considered DNS names" to be an "explicit" distinction, but you 
>>> mean something else.
>>
>> Well, the explicit distinction would be "if it starts with '~', what follows is a registered archive name, and if it starts with '+', what follows is a DNS name" or some such. This would not exclude any leading characters in either archive names or DNS names.
>>
>> Regards,   Martin.
>>
>>> Or maybe the right question is, What do you propose as an alternative?
>> _______________________________________________
>> urn mailing list
>> urn@ietf.org
>> https://www.ietf.org/mailman/listinfo/urn
>>
> 
> _______________________________________________
> urn mailing list
> urn@ietf.org
> https://www.ietf.org/mailman/listinfo/urn
>