Re: [Wpack] Web Archives, Replaying Web Pages and WPACK

Ilya Kreymer <ikreymer@gmail.com> Fri, 24 July 2020 01:45 UTC

From: Ilya Kreymer <ikreymer@gmail.com>
Date: Thu, 23 Jul 2020 18:45:08 -0700
Message-ID: <CANAUx6jmE5txboA7SeTo60s0fcidkC0XJkY0GFM=VntiMd1Fjw@mail.gmail.com>
To: Jeffrey Yasskin <jyasskin@chromium.org>
Cc: WPACK List <wpack@ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/wpack/S5TD5o0Xa-F7R1lTkVp3mYxNCSc>
Subject: Re: [Wpack] Web Archives, Replaying Web Pages and WPACK

Hi Jeffrey,

On Wed, Jul 22, 2020 at 3:45 PM Jeffrey Yasskin <jyasskin@chromium.org>
wrote:

> Thanks for reaching out, and sorry for the slowness of my reply. In
> general, I'd like this group's work to be useful to the archiving
> community. Have you seen
> https://www.iab.org/wp-content/IAB-uploads/2019/06/sawood-alam-2.pdf with
> another archiving group's take on how we could cooperate?
>

No worries, and thanks so much for writing back! Yes, I know Sawood and the
other authors of the paper quite well, and we communicate regularly.
I strongly agree with some of the general proposals they suggest, especially
the potential for improving trust on the web and for archives via signed
exchanges (more in response to point 3 below), while strongly disagreeing
with others, such as the focus on HTTP content negotiation as a mechanism
for specifying or accessing archives. I don't think content negotiation
would be relevant to bundles, and it is not a good idea for many reasons,
including those outlined in https://wiki.whatwg.org/wiki/Why_not_conneg.

I am not a researcher but a developer/implementer, so I wanted to get a more
concrete understanding of, and path to, what cooperation might look like,
and I think this conversation is a great start! More comments below.


>
> I have a couple comments on your message, but please let me know if I've
> failed to reply to something you wanted an answer to.
>
> 1. Your and other replay systems show that a change to browsers to
> natively support a bundling format isn't strictly necessary for archivists
> to be able to show the archived content, but I suspect that having some
> native support will make several aspects of your job easier. Sawood's paper
> calls out "live-leakage
> <https://ws-dl.blogspot.com/2012/10/2012-10-10-zombies-in-archives.html>, temporal
> violations <https://arxiv.org/abs/1402.0928>, origin violations
> <https://ws-dl.blogspot.com/2017/01/2017-01-20-cnncom-has-been-unarchivable.html>,
> cookie violations
> <https://ws-dl.blogspot.com/2019/03/2019-03-18-cookie-violations-cause.html>,
> broken links, and security risks
> <https://dl.acm.org/doi/10.1145/3133956.3134042>". Browser support for
> bundles won't fix all of that, but it will fix some of the security issues
> and origin violations ... if we can find a good conclusion to the
> discussion around
> https://github.com/WICG/webpackage/blob/master/explainers/bundle-urls-and-origins.md
> .
>

Indeed, I am very happy to hear that this would be important for you as
well! I definitely welcome native support for web bundle replay if it can
make my job easier, but I wasn't sure how, or if, that was going to happen.
I wouldn't expect this proposal to fix all of the issues raised in Sawood's
research: some have already been addressed, and others apply specifically
to large-scale archives. I am also working with archives at a much smaller
scale, for example an archive of a single site at a single point in time,
where things like temporal violations are less of a concern.

Thanks for pointing to the origins document; I had not seen it before.

I would say that being able to emulate the original origin/location of a
bundle/archive in an isolated environment would probably be the biggest
improvement for making my job easier: I currently maintain thousands of
lines of JS code that essentially try to emulate the original location of
the page when it's loaded from an archive.

If the browser could do this natively, that would be an immense improvement!
My immediate feedback on the proposal is that bundles should be replayed
with all contents served from their original, exact origins, but in an
isolated session, similar to Chrome's guest session mode.
Anything less than the original origin, and I'm back to injecting thousands
of lines of JS code to emulate it. If a page was archived from
'https://example.com' but is then served from 'package:example.com' or some
variation of that, it will still break any site that checks
'window.location'. (The complex JS injection I maintain deals precisely
with emulating window.location to be what the site expects.)
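To illustrate what that injected code has to do, here is a minimal sketch of
location emulation (illustrative only; the real rewriting in wabac.js
handles far more cases): a location-like object that reports the page's
original archived URL rather than the replay URL.

```javascript
// Minimal sketch: a fake "location" that reports the original archived URL.
// The names here are illustrative; this is not the actual wabac.js code.
function makeEmulatedLocation(archivedUrl) {
  const parsed = new URL(archivedUrl);
  return {
    get href() { return parsed.href; },
    get origin() { return parsed.origin; },
    get hostname() { return parsed.hostname; },
    get pathname() { return parsed.pathname; },
    toString() { return parsed.href; },
  };
}

// A page archived from https://example.com still "sees" its original origin:
const loc = makeEmulatedLocation('https://example.com/page?q=1');
```

The hard part in practice is not building such an object but rewriting every
place the page can reach the real `window.location`, which is why native
browser support would be such an improvement.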

A guest session window (or tab?) that is isolated to the contents of the
bundle, and cannot access any data outside of it, would probably be the
ideal solution, though of course I don't know if that is possible.


>
> 2. I'd like to fix the misunderstanding you've found in
> https://github.com/WICG/webpackage/blob/@%7B2020-07-22%7D/explainers/navigation-to-unsigned-bundles.md#warc.
> I didn't follow what that misunderstanding was, though. Your message seems
> to agree that WARC itself is not random-access, even though a random-access
> structure can be built on top of it.
>

I should have been clearer as well, as this is a bit confusing. The way
WARCs are structured allows for seeking to byte N to read URL U, without
having to read the entire WARC first. However, *where* to seek in the WARC
is not specified in the WARC itself, and requires maintaining an auxiliary
index. In practice, all uses of WARCs maintain such an auxiliary index to
provide fast random access to the WARC data. This is perhaps both a
strength and a weakness of the format. So yes, perhaps it's partially
random access, or random access with extra data, depending on how you
define it :) In a browser, the auxiliary index can be kept in an IndexedDB
store and the WARC data loaded over HTTP range requests, which seems to
work pretty well.
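As a rough sketch of that pattern (the index shape and names here are
illustrative assumptions, not part of the WARC spec): a prebuilt index maps
each URL to the byte offset and length of its record, so a reader can fetch
just that slice with an HTTP Range request instead of reading the whole file.

```javascript
// Illustrative auxiliary index: url -> { offset, length } of its WARC record,
// built ahead of time (e.g. a CDX-style index) and stored in IndexedDB.
const warcIndex = new Map([
  ['https://example.com/', { offset: 0, length: 1024 }],
  ['https://example.com/app.js', { offset: 1024, length: 2048 }],
]);

function rangeHeaderFor(url) {
  const entry = warcIndex.get(url);
  if (!entry) return null;
  // Usable as: fetch(warcUrl, { headers: { Range: rangeHeaderFor(url) } })
  return `bytes=${entry.offset}-${entry.offset + entry.length - 1}`;
}
```

The WARC file itself is never scanned at lookup time; only the indexed slice
is transferred, which is what lets this scale to multi-GB archives.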

This leads to another question/suggestion that could make my work immensely
easier. There are PBs of WARCs out there, and it would be great if existing
archives could be used with this system without having to be converted or
re-serialized. What if the web bundle spec could decouple replay from
storage via a customizable API?

The idea I had in mind is something similar to the CacheStorage interface,
where the user could implement a match() function, like this:

interface BaseWebBundle {
    Promise<Response> match(Request request, optional object options);
};

Then one could implement a 'class WARCBundle extends BaseWebBundle' which
handles lookup from WARC (or any other format) and performs any custom
matching rules. The current matching behavior would of course also be
provided by default. This could allow a more flexible matching system, and
perhaps simplify some of what needs to happen in
https://wicg.github.io/webpackage/loading.html#request-matching. Likely,
this would have to be limited to unsigned exchanges only, though.
For accurate replay, matching a request to a response often requires custom
'fuzzy' matching: e.g., a request for https://example.com/?_=1235 should
match the response for https://example.com/?_=1234. I think the specific
rules are outside the scope of what should be standardized, but having a
flexible way to add custom matching, and support for loading from WARC (or
HAR, or other formats), would greatly help my work as well.
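A minimal sketch of what such fuzzy matching can look like (the parameter
list and names are assumptions for illustration, not from any spec): strip
known cache-busting query parameters before comparing a live request URL
against the archived URLs.

```javascript
// Illustrative fuzzy matcher: query parameters commonly used only for
// cache busting are removed before URLs are compared.
const FUZZY_PARAMS = ['_', 'cb', 'timestamp'];

function fuzzyKey(urlStr) {
  const url = new URL(urlStr);
  for (const p of FUZZY_PARAMS) {
    url.searchParams.delete(p);
  }
  return url.href;
}

function fuzzyMatches(requestUrl, archivedUrl) {
  return fuzzyKey(requestUrl) === fuzzyKey(archivedUrl);
}
```

Real-world rules are messier (per-site regexes, parameter reordering, etc.),
which is exactly why a pluggable match() hook, rather than a fixed
standardized algorithm, would help.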



> 3. There are interesting technical constraints around giving a client the
> tools to prove that it didn't forge an archive it presents to a peer. As
> far as I've been able to design, you can't get real proof, just a
> collection of assertions by variously-trusted entities that the content is
> accurate. If the original site cooperates, it can sign a bundle of its
> content to vouch for it, and then the client could pass the bundle around
> to various archiving entities for them to add additional signatures
> representing their claims that the content is authentic. The client can add
> a signature too, but it's unlikely their signing key will be known well
> enough to be trusted.
>
>
Yes, I agree that there are additional constraints, though it seems like a
'problem worth solving' to give more power to individual users vs. larger
content aggregators, as Sawood et al. suggest. Under the current scheme,
only the content provider/server operator can choose whether their data is
verifiable, even though two parties are involved in any HTTP exchange. I
think peers could exchange public keys for verification, as they do in
other systems. Archives large and small could share their public keys, for
instance, and offer verifiable archives. I agree that more design is needed
here, though, and I don't have exact answers myself.

A major motivation for this is a project I've been working on to support
archiving/bundle creation directly in the browser via an extension. (The
extension relies on the chrome.debugger APIs and so is currently limited to
Chrome.)

Here is a video demo of how it works, archiving several pages as you
browse:
https://drive.google.com/file/d/1aH5X1GN8X84jQXSnS4NJB78BqN3ZjrEA/view

In a perfect world, the browser could provide this functionality natively,
as a working replacement for 'Save Page As...', which is of course broken
for most dynamic pages.
For now, the extension supports exporting the content as WARC, but it could
of course support exporting as a web bundle, and better still if the
exported bundle could be signed by the browser. If this kind of saving of
bundles could be done from the browser, that would really make my job, and
probably that of many others in the archiving community, easier :)

Let me know if there are any questions; I'd be happy to be more involved in
this process!

Thanks again,
Ilya



> Jeffrey
>
> On Fri, Jul 10, 2020 at 3:30 PM Ilya Kreymer <ikreymer@gmail.com> wrote:
>
>> Hi,
>>
>> I wanted to reach out again to the webpack group, as I believe I am
>> working on solving some of the same problems as wpack, but from the
>> perspective of web archiving. It would be great if there was some way to
>> collaborate with this group, though I am struggling to understand how that
>> could be done.
>>
>> The overarching goal, I believe, is the same: to replay HTTP
>> network traffic in a way that recreates an authentic representation of a
>> website, and to have a way to verify that the traffic was not forged. It
>> seems a 'web archive' and a 'bundled http exchange' are fundamentally
>> describing the same type of object, with perhaps different storage
>> requirements and use cases.
>>
>> I wanted to share a system, called https://replayweb.page/ which can
>> replay HTTP network traffic stored in a variety of formats directly in the
>> browser, using existing web standards, particularly Service Workers,
>> Fetch and IndexedDB (for caching).
>>
>> Here are a few examples, which replay bundled HTTP traffic, and can even
>> be embedded in other pages as iframes:
>> https://webrecorder.net/embed-demo-1.html - replaying smaller
>> archives/bundles
>> https://webrecorder.net/embed-demo-2.html - replaying from a 17GB
>> archive/bundle
>> https://webrecorder.net/embed-demo-3.html - replaying more complex web
>> sites, including one with 3d viewer
>>
>> These examples are all isolated and rendered independent of each other:
>> through Javascript rewriting and injection, the original Origin of the page
>> is emulated so that the site behaves as it is running on its initial
>> origin. This allows for replaying of complex, interactive web pages, though
>> is not perfect.
>>
>> As I come from the web archiving community, I've focused mostly on WARC
>> format, as that is an existing ISO standard and widely in use, and the
>> system also supports replaying from HAR and the web bundles created via the
>> WBN tool.
>>
>> However, the WARC format alone is a bit limiting, and there seems to be a misunderstanding
>> about WARC
>> <https://github.com/WICG/webpackage/blob/fc9b3e75309546c805b5cdb1db74b2d58a8e0b28/explainers/navigation-to-unsigned-bundles.md#warc>:
>> It allows random access to HTTP traffic, but does not contain the built-in
>> index necessary for random access (it is assumed the index is maintained
>> separately). To work around this, I've created a new 'bundling format', a
>> 'bespoke zip format', which can contain WARCs (and other types of data,
>> even .wbn bundles), along with other metadata, and a compressed index.
>> This ZIP-based format is explained here:
>> https://github.com/webrecorder/web-archive-collection-format
>>
>> Since the ZIP format allows for random access (see: ZipInfo
>> <https://github.com/Rob--W/zipinfo.js/>), it is possible to load all
>> bundled data on-demand via range requests. This allows the format to scale
>> to tens and probably hundreds of GBs.
>>
>> The system also supports referencing URLs via query params in the
>> fragment, for example:
>> https://replayweb.page/?source=/examples/netpreserve-twitter.warc#view=replay&url=https%3A%2F%2Ftwitter.com%2Fnetpreserve&ts=20190603053135 loads
>> the WARC file, then loads the specified URL from the archive/bundle.
>>
>> I wanted to share all of this to see if there's perhaps some way to align
>> with the work you're doing here, though I must admit it is not easy to
>> understand if that is possible or of interest to this group. Again, from my
>> perspective, it seems like you're working on a very similar problem,
>> attempting to standardize this at the browser level, but perhaps for
>> different use cases.
>>
>> One area I'm especially interested in is verification for Saving a
>> Bundle in the Browser
>> <https://github.com/WICG/webpackage/blob/2a78f2930a228ee6872630ecb023fa71151cc164/draft-yasskin-wpack-use-cases.md#save-and-share-a-web-page-snapshot>
>> .
>> Unfortunately, it seems that this use case is currently out of scope. I
>> am especially interested in building tools to solve this problem, so that
>> (to use this example) Casey can save the page in their browser, share it
>> with Dakota, and that *Dakota can verify that this is what Casey saw in
>> their browser*, and it was not forged. I think being able to sign a web
>> bundle from a client's perspective would be extremely useful for archival,
>> fact-checking, sharing, etc. use cases and could make the web more
>> trustworthy.
>>
>> Please let me know if there is any interest in collaborating, or if these
>> existing tools could somehow help this spec move forward.
>>
>> (If anyone is interested, the replayweb.page tool can be found on GitHub
>> at: https://github.com/webrecorder/replayweb.page (UI frontend) and
>> https://github.com/webrecorder/wabac.js (service worker backend).)
>>
>> Thank you,
>> Ilya
>> webrecorder.net
>>
>>
>>
>>
>>
>> _______________________________________________
>> Wpack mailing list
>> Wpack@ietf.org
>> https://www.ietf.org/mailman/listinfo/wpack
>>
>