Re: [urn] new draft 10 - new form (RE: new draft 9 - RE: new urn PWID draft (7) with corrections)

Eld Zierau <elzi@kb.dk> Thu, 28 November 2019 14:50 UTC

From: Eld Zierau <elzi@kb.dk>
To: "'Henry S. Thompson'" <ht@inf.ed.ac.uk>
CC: "'Hakala, Juha E'" <juha.hakala@helsinki.fi>, Peter Saint-Andre <stpeter@stpeter.im>, "urn@ietf.org" <urn@ietf.org>
Thread-Topic: [urn] new draft 10 - new form (RE: new draft 9 - RE: new urn PWID draft (7) with corrections)
Thread-Index: AQHVo9/A5kRRJ7EN1UCh4Vv2CVNmCaegmRHQ
Date: Thu, 28 Nov 2019 14:50:24 +0000
Message-ID: <4499f99af7b84135ae88b14cc444d080@kb.dk>
References: <2396dbcf66bb4c8689bcbabca2cc8492@kb.dk> <f5by2w3380y.fsf@ecclerig.inf.ed.ac.uk>
In-Reply-To: <f5by2w3380y.fsf@ecclerig.inf.ed.ac.uk>
Archived-At: <https://mailarchive.ietf.org/arch/msg/urn/MPy6zDDQvOeSEouYpyD9Dccd5pI>
Subject: Re: [urn] new draft 10 - new form (RE: new draft 9 - RE: new urn PWID draft (7) with corrections)

Please see my comments below
Best regards, Eld

PS: I have attached a new version with the few changes that I have suggested below
(and a small correction from "Archived URI or identifier" to "Archived URI", which escaped my notice in the last version).

-----Original Message-----
From: Henry S. Thompson <ht@inf.ed.ac.uk> 
Sent: Monday, November 25, 2019 11:29 PM
To: Eld Zierau <elzi@kb.dk>
Cc: 'Hakala, Juha E' <juha.hakala@helsinki.fi>; Peter Saint-Andre <stpeter@stpeter.im>; urn@ietf.org
Subject: Re: [urn] new draft 10 - new form (RE: new draft 9 - RE: new urn PWID draft (7) with corrections)

I continue to be confused by various aspects of this proposal.

First confusion (more to follow):

The fourth constituent of a PWID, the *precision-spec*, now has only two possible values, 'part' and 'page', which appear to be attempting to distinguish between, respectively,

 1) the representation that is/was retrievable for the archived-uri as
    at the archival-time;

> Eld: I guess you mean 'part' here, which in the specification is described as:
> "*  'part'
>          Meaning the single archived file/web part harvested as from the
>          specified URI.  For references to web pages with html code
>          (i.e. pages where there is an option to "View page source"),
>          this will mean the actual file with the html code.  It is
>          relevant to refer to web pages this way, in case it is part of
>          a collection specification or in case it is the html that is of
>          interest (e.g. java scripts or hidden links that are not
>          visible when rendering the web page).
>          For all other types of files, the URI will be for single files
>          to be interpreted a file."
> And I must admit that I do not see anything in this description that can be interpreted as a representation, so I cannot follow your point here.
>
> Please also note that the introduction says that
>    "The purpose of the PWID URN is also to express a web archive
>    reference as simple as possible and at the same time meet the
>    requirements for sustainability, usability and scope.  Therefore, the
>    PWID URN is focused on having only the minimum required information
>    to make a precise identification of a resource in an arbitrary web
>    archive.  Recent research have shown that this can be obtained by the
>    following information [ResawRef]:
>
>    o  Identification of web archive
>
>    o  Identification of source:
>       *  Archived URI
>       *  Archival timestamp
>
>    o  Intended precision (page, part/file)"
>
> I could of course make it sharper by saying
>       *  Archival timestamp (the archive's timestamp for when the URI was archived)
> But I must admit that I do not see that it adds value, and it is already described in several places throughout the specification
> (and it is therefore not included in the attached version).
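
To make the four pieces of information listed above concrete, here is a minimal sketch of splitting a PWID URN into its constituents. It is an illustration only, not part of the draft: the names `Pwid` and `parse_pwid` are hypothetical, the field order is taken from the example URN used later in this thread, and the sketch assumes the full `YYYY-MM-DDThh:mm:ssZ` timestamp form shown in that example (the draft may permit other forms).

```python
import re
from typing import NamedTuple


class Pwid(NamedTuple):
    """Hypothetical container for the constituents of a PWID URN."""
    archive_id: str
    archival_time: str
    precision_spec: str
    archived_uri: str


# Assumption: the timestamp has the full form used in the thread's example.
# A plain split on ':' would not work, because both the timestamp and the
# archived URI contain colons themselves.
_PWID_RE = re.compile(
    r"^urn:pwid:"
    r"(?P<archive_id>[^:]+):"
    r"(?P<archival_time>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z):"
    r"(?P<precision_spec>part|page):"
    r"(?P<archived_uri>.+)$",
    re.IGNORECASE,
)


def parse_pwid(urn: str) -> Pwid:
    """Split a PWID URN into archive, time, precision and archived URI."""
    match = _PWID_RE.match(urn)
    if match is None:
        raise ValueError(f"not a PWID URN in the assumed form: {urn!r}")
    return Pwid(
        archive_id=match["archive_id"],
        archival_time=match["archival_time"],
        precision_spec=match["precision_spec"].lower(),
        archived_uri=match["archived_uri"],
    )
```

For example, `parse_pwid("urn:pwid:archive.org:2016-01-22T10:08:23Z:page:https://www.dr.dk")` yields the archive `archive.org`, the archival time `2016-01-22T10:08:23Z`, the precision `page` and the archived URI `https://www.dr.dk`.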

 2) a (Content-type conformant?) rendering of that representation
    (with all other digital objects implicated in that rendering
    (e.g. scripts, stylesheets, icons, graphics, fonts, ..., to say
    nothing of advertisements) being also retrieved from the same
    archive?  As at the same archival-time?).  Let's just say I can't
    imagine any creator of a PWID having any idea of when they are
    supposed to use which, or any consumer of one to have any idea of
    what they are meant to do as a consequence of which is present.

> Eld: I guess you mean 'page' here, which in the specification is described as:
>
> "* 'page'
>            Meaning that an application like Wayback calculates a resulting
>            web page based on calculated referenced web parts (display
>            templates, images etc.).  For example, an html page displaying
>            an image will need both the html and the referred image."
>
> The PWID is not concerned with Content-type; it is only concerned with information created when the resource was archived.
> The assignment section explains how you can determine whether the precision-spec should be part or page:
>
> "o  precision-spec (part or page):
>       The precision specification specifies how the referred item should
>       be regarded.  A typical PWID URN reference in a paper would be
>       'page', where a tool will be needed to render the web page.
>       Alternatively, the precision-spec can be 'part', which is the most
>       precise reference since it reference a specific file where no
>       additional calculations are needed (e.g. as part of a collection,
>       a specific html file with hidden links or to indicate that a
>       single image is referenced).  In order to see whether a viewed
>       browser page is a computed web page or a single file, browsers
>       have a function "View page source" which is not activated if for
>       single files)."
>
> So it is up to the browser software to tell whether something is interpreted as a page.
>
> And yes, a page is "with all other digital objects implicated in that rendering  (e.g. scripts, stylesheets, icons, graphics, fonts ...)",
> where I have used the less technical terms "parts (display templates, images etc.)".
>
> And no, not necessarily at the same archival time, since each digital object is harvested individually and therefore usually has a different archival time.
> That is why there are calculations involved when a Wayback machine renders a page: it takes the elements it can find in the archive
> with archival times closest to the archival time of the archived file with the page code.
>
> I can make it clearer by extending the description:
>
> "* 'page'
>           Meaning that an application like Wayback calculates a resulting
>           web page based on calculated referenced web parts (display
>           templates, images etc.), which the Wayback machine can find in the
>           archive.  For example, an html page displaying an image will need
>           both the archived html file for the page and the archived file with
>           the referred image. Since the two files are archived individually,
>           they will normally have different archival times. In cases where the
>           archive has several versions of the image (harvested at different times),
>           it is the Wayback machine that calculates which archived file of the
>           image is relevant (the one with archival time closest to that of the
>           archived html file)."
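
The "closest archival time" rule described above can be sketched as follows. This is a simplification under stated assumptions, not a description of how any actual Wayback implementation works: the function names are hypothetical, the capture list is illustrative input, and timestamps are assumed to use the `YYYY-MM-DDThh:mm:ssZ` form from the example PWID.

```python
from datetime import datetime, timezone
from typing import Sequence


def parse_time(timestamp: str) -> datetime:
    """Parse a UTC timestamp of the assumed form YYYY-MM-DDThh:mm:ssZ."""
    parsed = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def closest_capture(page_time: str, capture_times: Sequence[str]) -> str:
    """Pick the capture whose archival time is nearest to the page's.

    page_time is the archival time of the archived html file;
    capture_times are the archival times of the available versions of
    one referenced part (for instance an image).
    """
    if not capture_times:
        raise ValueError("the archive holds no capture of this part")
    target = parse_time(page_time)
    # Nearest in either direction: earlier and later captures compete equally.
    return min(capture_times, key=lambda ts: abs(parse_time(ts) - target))
```

With a page archived at `2016-01-22T10:08:23Z` and image captures at `2016-01-20T00:00:00Z`, `2016-01-22T11:00:00Z` and `2016-02-01T00:00:00Z`, the sketch selects the `2016-01-22T11:00:00Z` capture, which illustrates why the parts of a rendered 'page' normally carry archival times that differ from the PWID's own timestamp.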

Or is 'page' to be understood as meaning that the PWID as a whole is meant by its minter to denote the same resource (in the RFC 3986 sense of the word) that the archived URI denoted at the time of archiving?

> Eld: see above

The discussion of the use of 'page' in the worked example in the
*Resolution* section seems to support either interpretation.

It would help to have a more detailed description of what the successful resolution of the example PWID, that is

  urn:pwid:archive.org:2016-01-22T10:08:23Z:page:https://www.dr.dk

would look like.  Per RFC3986, this should be a representation of the resource identified by the example PWID.

And, how would that be different if the example PWID had been

  urn:pwid:archive.org:2016-01-22T10:08:23Z:part:https://www.dr.dk

?

> Eld: Unfortunately I cannot insert images in the specification, and it is already described in the resolution section, in the paragraph:
>  
>    "The 'page' information is used in verification that the right
>    precision level is reached.  In case the precision-spec had been
>    'part', it would require an extra step selecting "View page source"
>    on the resulting page."
>
> but I can of course make it an example and repeat it under the description of the alternative resolution at:
>
>    "Alternative resolution (automatically or manually) of this URN PWID
>    can be deduced based on the current (2019) knowledge of Internet
>    Archive's open Wayback access web interface, which has the pattern:
>
>       https://web.archive.org/web/<time>/<uri>
>
>    Using this pattern (where only digits from the timestamp is
>    included), it is possible to deduce the online https URI:
>
>       https://web.archive.org/web/20160122100823/https://www.dr.dk"
>
> by adding the following text:
>
>   "Following this URL to Internet Archive's Wayback tool will result in the harvested web page, including
>   elements like images etc., which the Wayback machine calculates based on the referred elements and the
>   contents of the archive (usually the elements with archival time closest to the archival time of the web
>   page). This page corresponds to the PWID with precision specification 'page':
>        urn:pwid:archive.org:2016-01-22T10:08:23Z:page:https://www.dr.dk
>   If the PWID had specified 'part' instead of 'page', i.e.
>        urn:pwid:archive.org:2016-01-22T10:08:23Z:part:https://www.dr.dk
>   then the referred element (in this case the page code) will be
>   found by clicking "View page source", which will show the page code in text form."
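
The deduction of the Wayback URL quoted above can be sketched in a few lines. The pattern `https://web.archive.org/web/<time>/<uri>` and the "digits only" rule come from the quoted draft text and reflect the 2019 state of that interface; the function name `wayback_url` is hypothetical, and the sketch makes no claim about other archives.

```python
WAYBACK_PREFIX = "https://web.archive.org/web"  # pattern quoted from the draft


def wayback_url(archival_time: str, archived_uri: str) -> str:
    """Build the Wayback URL from a PWID's timestamp and archived URI.

    Only the digits of the timestamp are kept, as the draft describes;
    the archived URI is appended unchanged.
    """
    digits = "".join(ch for ch in archival_time if ch.isdigit())
    return f"{WAYBACK_PREFIX}/{digits}/{archived_uri}"


# The example from the thread:
print(wayback_url("2016-01-22T10:08:23Z", "https://www.dr.dk"))
# prints https://web.archive.org/web/20160122100823/https://www.dr.dk
```

Note that the precision-spec does not enter the URL in this sketch: per the proposed text, 'page' and 'part' lead to the same Wayback URL, and 'part' only adds the manual "View page source" step afterwards.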

How does this distinction survive a change of media type?  To image/png, or application/pdf, or audio/ogg?

> Eld: The PWID is for referencing purposes; it is not concerned with media types. The PWID is a reference to something
> that has been harvested from the internet and is supposed to be rendered by a browser, so this is a concern of
> the original publisher of the element and of the browser software.

ht
-- 
       Henry S. Thompson, School of Informatics, University of Edinburgh
      10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND -- (44) 131 650-4440
                Fax: (44) 131 650-4587, e-mail: ht@inf.ed.ac.uk
                       URL: http://www.ltg.ed.ac.uk/~ht/
 [mail from me _always_ has a .sig like this -- mail without it is forged spam]

The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.