Re: [urn] PWID URN (shortened title)

Eld Zierau <elzi@kb.dk> Mon, 27 April 2020 11:00 UTC

From: Eld Zierau <elzi@kb.dk>
To: "Hakala, Juha E" <juha.hakala@helsinki.fi>, Peter Saint-Andre <stpeter@stpeter.im>, "urn@ietf.org" <urn@ietf.org>
Thread-Topic: [urn] PWID URN (shortened title)
Thread-Index: AdYcfcHybvVQ+txQQCm3roPO2mfVmg==
Date: Mon, 27 Apr 2020 10:59:53 +0000
Message-ID: <fb9d2f5794a04c2c90b2d1c4ca49a509@kb.dk>
Archived-At: <https://mailarchive.ietf.org/arch/msg/urn/JnkWTxSQY1YZiLMgCeK34WQK0bE>
Subject: Re: [urn] PWID URN (shortened title)

Dear all,
I hope you are all well, wherever you are.

I have held off on sending this mail because of the many events at the start of this year.

I would really like to hear whether you can accept the PWID URN as it is now.

Following the discussion late last year, I have updated the specification to explain a case where it is very relevant to distinguish between a reference to a part and a reference to a page: the case where the page source includes a virus. Several web archives, e.g. the Danish web archive, deliberately archive content containing viruses or malware. The reason is that it should be possible to do research on the Danish internet as it looked when harvested, and that includes research subjects such as what kinds of malware existed.

The new specification is sent in a separate mail as an attachment, just to make sure that this mail is not caught in a spam filter. The differences between this version and the last one are listed below.

Stay safe

Best regards, Eld



Differences from last time:
======================

page 2: 
2019-11-14
->
2020-04-27



page 4:
   The possibility ... implementations.
->
   The possibility ... implementations. 
   Furthermore, the precision part can tell whether the referred source 
   is to be viewed as a file or as a web page, which can be essential 
   for web page code (like html) which contain hidden information, e.g. 
   malware.


page 7:
      *  'part'
         Meaning the single archived file/web part harvested as from the
         specified URI.  For references to web pages with html code
         (i.e. pages where there is an option to "View page source"),
         this will mean the actual file with the html code.  It is
         relevant to refer to web pages this way, in case it is part of
         a collection specification or in case it is the html that is of
         interest (e.g. java scripts or hidden links that are not
         visible when rendering the web page).
         For all other types of files, the URI will be for single files
         to be interpreted a file.
->
      *  'part'
         Meaning the single archived file/web part harvested as from the
         specified URI.  For references to web pages with html code
         (viewed e.g. by the "View page source" on a web page or using a 
         get-file function), this will mean the actual file comprising 
         the html code.  It is relevant to refer to web parts this way, 
         in case it is part of a collection specification or in case it 
         is the html that is of interest. It can be relevant only to 
         view the html code in cases where the referred material is hidden 
         links, java- script or embedded malware.

page 8:
   o  precision-spec (part or page):
      The precision specification specifies how the referred item should
      be regarded.  A typical PWID URN reference in a paper would be
      'page', where a tool will be needed to render the web page.
      Alternatively, the precision-spec can be 'part', which is the most
      precise reference since it reference a specific file where no
      additional calculations are needed (e.g. as part of a collection,
      a specific html file with hidden links or to indicate that a
      single image is referenced).  In order to see whether a viewed
      browser page is a computed web page or a single file, browsers
      have a function "View page source" which is not activated if for
      single files).
->
   o  precision-spec (part or page):
      The precision specification specifies how the referred item should
      be regarded.  A typical PWID URN reference in a paper would be
      'page', where a tool will be needed to render the web page.
      Alternatively, the precision-spec can be 'part', which is the most
      precise reference since it reference a specific file where no
      additional calculations are needed (e.g. as part of a collection,
      a specific html file with hidden links or malware).  In order to 
      see whether a viewed browser page is a computed web page or a 
      single file, browsers have a function "View page source" which is 
      inactive for single files.


page 10:
   o  Precision of what is referred
      The precision contributes to the guidance ... page.  If the precision is
      'part', the "View page source" browser function can be used for
      pages to get the referred resource.  If the resource is a single
      file (this option is not activated, since the full resource is
      already shown).  The part precision can also be indicator for
      tools (e.g. a collection extraction tool) that they can fetch the
      contents by fetching the file pointed to.
->
      The precision contributes to the guidance ... page.  If the precision is
      'part', it is the file that is of interest. That means, that it
      is trivial for web parts that are files themselves, e.g. a PDF
      document. For web pages, it is the web page code that is of 
      interest. In most cases, the page can be found and the final
      result can be accessed by us of the "View page source" browser 
      function. However, in cases where the referenced code are of 
      interest because of embedded malware, the resource should instead
      be access by fetching the contents by some sort of get-file 
      function. The part precision can in this way also be used as 
      indicator for access tools, to indicate whether web pages should 
      be viewed as a web page or as the code for the web page.

-----
Eld Zierau
Digital Preservation Specialist PhD
The Royal Danish Library
Digital Cultural Heritage
P.O. Box 2149, 1016 Copenhagen K
Ph. +45 9132 4690 
Email: elzi@kb.dk

-----Original Message-----
From: Eld Zierau 
Sent: Tuesday, December 10, 2019 2:19 PM
To: Hakala, Juha E <juha.hakala@helsinki.fi>; Peter Saint-Andre <stpeter@stpeter.im>; urn@ietf.org
Cc: 'Henry S. Thompson' <ht@inf.ed.ac.uk>
Subject: FW: [urn] new draft 10 - new form (RE: new draft 9 - RE: new urn PWID draft (7) with corrections)

Forgot to reply all, so I am forwarding this.
Best regards, Eld

-----Original Message-----
From: Eld Zierau
Sent: Tuesday, December 10, 2019 11:55 AM
To: 'Henry S. Thompson' <ht@inf.ed.ac.uk>
Subject: RE: [urn] new draft 10 - new form (RE: new draft 9 - RE: new urn PWID draft (7) with corrections)

Hi

The PWID has been developed in collaboration with digital humanities researchers (literary scholars and historians), who require precise and persistent references. The precision part specifies how precise a reference is and what the purpose of the reference is. It is needed because of the nature of web material, where a web page usually consists of many different parts that are harvested and archived individually. For the example of https://www.dr.dk, the differences are:
* 'Page':
       urn:pwid:archive.org:2016-01-22T10:08:23Z:page:https://www.dr.dk
    is shorthand for the archived version of the file harvested from https://www.dr.dk on 2016-01-22T10:08:23Z,
    together with all the elements (existing in the web archive) needed to render it as a web page in a browser,
    where the version of each element is the one with the archiving date closest to 2016-01-22T10:08:23Z

* 'Part':
       urn:pwid:archive.org:2016-01-22T10:08:23Z:part:https://www.dr.dk
    is the exact file harvested from https://www.dr.dk on 2016-01-22T10:08:23Z
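
To make the structure of the two example URNs concrete, here is a minimal Python sketch that splits such a string into its four constituents. The pattern is inferred only from the two examples in this mail, not from the formal syntax in the draft, and the names `Pwid` and `parse_pwid` are hypothetical.

```python
import re
from dataclasses import dataclass

# Pattern inferred from the two example URNs in this mail; it is an
# assumption, not the formal syntax of the PWID draft.
_PWID_RE = re.compile(
    r"^urn:pwid:"
    r"(?P<archive_domain>[^:]+):"
    r"(?P<archival_time>.+?):"          # the timestamp itself contains colons
    r"(?P<precision_spec>part|page):"
    r"(?P<archived_uri>.+)$",           # the archived URI may contain colons too
    re.IGNORECASE,
)


@dataclass(frozen=True)
class Pwid:
    """Hypothetical container for the four PWID constituents."""
    archive_domain: str
    archival_time: str
    precision_spec: str
    archived_uri: str


def parse_pwid(urn: str) -> Pwid:
    """Split a PWID URN string into its constituents, or raise ValueError."""
    match = _PWID_RE.match(urn.strip())
    if match is None:
        raise ValueError(f"not a recognisable PWID URN: {urn!r}")
    return Pwid(
        archive_domain=match["archive_domain"],
        archival_time=match["archival_time"],
        precision_spec=match["precision_spec"].lower(),
        archived_uri=match["archived_uri"],
    )
```

Parsed this way, the two example URNs differ only in `precision_spec`; the archive, the time and the archived URI are identical.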

Both references are highly relevant: 'page' because this is what is needed for most referencing purposes today, and 'part' because it can be the code that is of interest, or the reference can be part of a collection specification (see my response to Juha of 11 October 2019, 15:44).

These are two very different things, and therefore they cannot share the same reference. Furthermore, the precision part lets digital humanities researchers state that precision themselves. Please note that what you say, "it's trying to pre-judge what someone may want to do with a URI", is NOT the case: it tells you what the person making the reference wants you to see when you use the reference.

If you want to view a web page consisting of different elements, the way to do that today is to use a browser and a web archive tool that calculates the elements. And yes, there will be differences between browsers, just as Word documents look very different depending on version, fonts, etc. That is not something that a reference can or should solve.

Please note that the words "browser" and "View page source" are used only in the explanation of how you can assign and resolve a PWID, and in the explanation of what 'part' means for a coded page. Nothing suggests that this is the only way to see it. The explanations and examples are needed because the PWID is for everyone, including non-technical persons who do not know about coded web pages or that a page is composed of different parts from different files. So I cannot see that the specification as it is now restricts PWIDs to URIs "that are supposed to be rendered by browsers", but since we are talking about web material, it is likely that this is what the PWID will be used for.
Please also note that both of the examples you give are files, and therefore there will not be any "View page source". (I don't know whether that is why you say that "'View Source' is literally unusable"; its absence is actually used to determine that the reference is a part.) This is explained in the instructions on how to create a PWID in the section "Assignment":
"precision-spec (part or page):
      The precision specification specifies how the referred item should
      be regarded.  A typical PWID URN reference in a paper would be
      'page', where a tool will be needed to render the web page.
      Alternatively, the precision-spec can be 'part', which is the most
      precise reference since it reference a specific file where no
      additional calculations are needed (e.g. as part of a collection,
      a specific html file with hidden links or to indicate that a
      single image is referenced).  In order to see whether a viewed
      browser page is a computed web page or a single file, browsers
      have a function "View page source" which is not activated if for
      single files)."
Best regards, Eld
-----
Eld Zierau
Digital Preservation Specialist PhD
The Royal Danish Library
Digital Cultural Heritage
P.O. Box 2149, 1016 Copenhagen K
Ph. +45 9132 4690
Email: elzi@kb.dk

-----Original Message-----
From: Henry S. Thompson <ht@inf.ed.ac.uk>
Sent: Wednesday, December 4, 2019 3:24 PM
To: Eld Zierau <elzi@kb.dk>
Cc: Hakala, Juha E <juha.hakala@helsinki.fi>; Peter Saint-Andre <stpeter@stpeter.im>; urn@ietf.org
Subject: Re: [urn] new draft 10 - new form (RE: new draft 9 - RE: new urn PWID draft (7) with corrections)

[resending with an elision of more original text which I didn't directly respond to -- please ignore the preceding version]

Eld Zierau <elzi@kb.dk> writes:

> Please see my comments below

>> -----Original Message-----
>> From: Henry S. Thompson <ht@inf.ed.ac.uk>

>> The fourth constituent of a PWID, the *precision-spec*, now has only 
>> two possible values, 'part' and 'page', which appear to be attempting 
>> to distinguish between, respectively,
>>
>>  1) the representation that is/was retrievable for the archived-uri as
>>     at the archival-time;

> Eld: I guess you mean 'part' here, which in the specification is 
> described as:
> "*  'part'
> ...

Yes.

> And I must admit that I do not see anything that can be interpreted as 
> representation in this description, so I cannot follow your point here.

The relevant RFCs (3986, 7231, 8141) all use "representation" for the actual result of a successful retrieval, that is, a message consisting of a media type and an encoded character sequence interpretable with respect to that media type.  That's precisely what you are talking about in defining what you mean by 'part'.

> ...

>>  2) a (Content-type conformant?) rendering of that representation
>>     (with all other digital objects implicated in that rendering
>>     ...

> Eld: I guess you mean 'page' here, which in the specification is described as:
> "* 'page'

Yes.

> ...

> So it is up to browser software to tell whether it is something that 
> is interpreted as a page.

> And yes a page is "with all other digital objects implicated in that 
> rendering (e.g. scripts, stylesheets, icons, graphics, fonts ...)", 
> where I have used terms that are less technical "parts (display 
> templates, images etc.)"

Precisely.  So it's a matter of rendering, or presentation, as implemented by a browser, or any application appropriate to interpreting the retrieved representation.

But, and again according to the relevant specifications, what you _do_ with a representation of a resource is a matter for the recipient of that resource, which will be different for different users/uses at different times.  The URI itself has nothing to say about that.

So now I can express more succinctly what I was confused about: the
*precision-spec* doesn't make sense, it's trying to pre-judge what someone may want to do with a URI.

Insofar as one can state what a PWID identifies, it identifies

   the archival record maintained by the *archive-domain* for the
   representation retrieved at the *archival-time* of the state of the
   resource identified by the *archived-uri*.

So, if we construct a PWID with

   *archive-domain* web.archive.org
   *archived-uri*   https://link.springer.com/article/10.1007/JHEP12(2018)098
   *archival-time*  2019-11-18T06:00:21

it identifies such an archival record at the Internet Archive, whose current representation can be retrieved via an HTTP GET request for

   https://web.archive.org/web/20191118060021/https://link.springer.com/article/10.1007/JHEP12(2018)098

If that GET request is performed using, say, the curl or wget command-line tools, you will get what you call the 'part' as a file on your local 'disk'.

If that GET request is performed using, say, Chrome or Safari, you will see a presentation of what you call the 'page' on your screen.
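
The mapping from the constituents above to the retrieval URL can be sketched in a few lines of Python. This is only an illustration: it assumes the URL layout visible in the example above (a 14-digit timestamp followed by the archived URI), assumes the archival time is in UTC, and `wayback_url` is a hypothetical helper name, not part of any specification or archive API.

```python
from datetime import datetime


def wayback_url(archived_uri: str, archival_time: str) -> str:
    """Build a web.archive.org retrieval URL from an archived URI and time.

    Assumes an ISO-style time such as 2019-11-18T06:00:21, with or without
    a trailing 'Z', and the URL layout seen in the example in this mail.
    """
    moment = datetime.strptime(archival_time.rstrip("Z"), "%Y-%m-%dT%H:%M:%S")
    # The archive URL uses the same instant written as YYYYMMDDhhmmss.
    return f"https://web.archive.org/web/{moment:%Y%m%d%H%M%S}/{archived_uri}"


url = wayback_url(
    "https://link.springer.com/article/10.1007/JHEP12(2018)098",
    "2019-11-18T06:00:21",
)
```

Whether the result of the GET is then saved as a file or rendered on screen depends on the client that issues the request, not on the URL.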

>> How does this distinction survive a change of media type?  To 
>> image/png, or application/pdf, or audio/ogg?

> Eld: The PWID is for referencing purposes, it is not concerned with 
> types. The PWID is a reference to something that has been harvested 
> from the internet and is supposed to be rendered by a browser, - so 
> this is a concern of the original publisher of the element and the 
> browser software.

URIs are not just for browsers!  Surely you don't want to restrict PWIDs to URIs "that are supposed to be rendered by browsers".  All the talk in your drafts about browsers and "View Source" is at best misleading, and at worst suggests that PWIDs are only applicable to a very narrow view of what archives may contain.

Consider PWIDs constructed from

   *archive-domain* web.archive.org
   *archived-uri*   https://media.springernature.com/w306/springer-static/cover/journal/13130/2018/12.jpg
   *archival-time*  2019-11-18T05:35:49

and

   *archive-domain* web.archive.org
   *archived-uri*   https://link.springer.com/content/pdf/10.1007%2FJHEP12%282018%29098.pdf
   *archival-time*  2019-11-18T06:00:05

which identify other archival records at the Internet Archive, whose current representations can be retrieved via an HTTP GET request for

   https://web.archive.org/web/20191118053549im_/https://media.springernature.com/w306/springer-static/cover/journal/13130/2018/12.jpg

and

   https://web.archive.org/web/20191118060005/https://link.springer.com/content/pdf/10.1007%2FJHEP12%282018%29098.pdf

Perfectly useful, but the 'page'/'part' distinction doesn't make any sense here, and 'View Source' is literally unusable.

Your spec would be much shorter, simpler and easier to understand, and its potential utility much greater, if you removed the *precision-spec* altogether, and made more use of the descriptive terminology and its underlying semantics as found in the relevant RFCs mentioned above.

ht
-- 
       Henry S. Thompson, School of Informatics, University of Edinburgh
      10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND -- (44) 131 650-4440
                Fax: (44) 131 650-4587, e-mail: ht@inf.ed.ac.uk
                       URL: http://www.ltg.ed.ac.uk/~ht/
 [mail from me _always_ has a .sig like this -- mail without it is forged spam]

The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.