[Wpack] Web Archives, Replaying Web Pages and WPACK

Ilya Kreymer <ikreymer@gmail.com> Fri, 10 July 2020 22:30 UTC

From: Ilya Kreymer <ikreymer@gmail.com>
Date: Fri, 10 Jul 2020 15:30:08 -0700
Message-ID: <CANAUx6juQjKmJZpj+_gzmz6i+SRK3wYDW0g0zmCr7DY2kXKqyA@mail.gmail.com>
To: wpack@ietf.org
Archived-At: <https://mailarchive.ietf.org/arch/msg/wpack/0uRzQkJeOXDv8ieaLJOZAlg3dTI>
Subject: [Wpack] Web Archives, Replaying Web Pages and WPACK

Hi,

I wanted to reach out again to the wpack group, as I believe I am working
on solving some of the same problems as wpack, but from the perspective of
web archiving. It would be great to find some way to collaborate with
this group, though I am still working out how that could be done.

The overarching goal seems to be the same: to replay HTTP network
traffic in a way that recreates an authentic representation of a website,
and to have a way to verify that the traffic was not forged. A
'web archive' and a 'bundled HTTP exchange' fundamentally describe
the same type of object, with perhaps different storage requirements and
use cases.

I wanted to share a system, ReplayWeb.page (https://replayweb.page/), which
can replay HTTP network traffic stored in a variety of formats directly in
the browser, using existing web standards, particularly Service Workers,
Fetch, and IndexedDB (for caching).
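To illustrate the service-worker replay idea, here is a minimal sketch. The in-memory map, the `replayFetch` function, and all names here are illustrative only, not wabac.js's actual API (which keeps its index and bodies in IndexedDB or remote files):

```javascript
// Sketch: replaying captured HTTP exchanges from inside a service worker.
// The "archive" here is a plain in-memory map keyed by URL; a real replay
// system would also key on capture timestamp and stream large bodies.

const archive = new Map([
  ['https://example.com/', {
    status: 200,
    headers: { 'content-type': 'text/html' },
    body: '<html><body>archived page</body></html>',
  }],
]);

// Look up a request URL in the archive; return the stored exchange or null.
function replayFetch(url) {
  return archive.get(url) || null;
}

// In an actual service worker, the fetch handler would wrap this lookup:
//
//   self.addEventListener('fetch', (event) => {
//     const rec = replayFetch(event.request.url);
//     if (rec) {
//       event.respondWith(new Response(rec.body, {
//         status: rec.status, headers: rec.headers,
//       }));
//     }
//   });
```

The key point is that the service worker intercepts every subresource request the page makes, so the whole site can be served from the archive without a server.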

Here are a few examples, which replay bundled HTTP traffic, and can even be
embedded in other pages as iframes:
https://webrecorder.net/embed-demo-1.html - replaying smaller
archives/bundles
https://webrecorder.net/embed-demo-2.html - replaying from a 17GB
archive/bundle
https://webrecorder.net/embed-demo-3.html - replaying more complex web
sites, including one with a 3D viewer

These examples are all isolated and rendered independently of each other:
through JavaScript rewriting and injection, the original origin of the page
is emulated so that the site behaves as if it were running on its initial
origin. This allows complex, interactive web pages to be replayed, though
the process is not perfect.
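As a rough sketch of the rewriting step: absolute URLs in archived HTML are rewritten to point back into the replay endpoint, so the page loads its subresources from the archive rather than the live web. The prefix and regex below are illustrative, not the actual rewriter:

```javascript
// Sketch of URL rewriting for origin emulation. Archived pages reference
// absolute URLs on their original origins; rewriting routes those through
// a replay prefix (hypothetical here) that the service worker intercepts.

const REPLAY_PREFIX = 'https://replay.example/w/20200710223000/';

function rewriteUrls(html) {
  // Rewrite href/src attributes that contain absolute http(s) URLs.
  return html.replace(
    /(href|src)=["'](https?:\/\/[^"']+)["']/g,
    (_, attr, url) => `${attr}="${REPLAY_PREFIX}${url}"`
  );
}
```

The real rewriter also has to handle CSS, inline JavaScript, and dynamically constructed URLs, which is where most of the imperfection comes from.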

As I come from the web archiving community, I've focused mostly on the WARC
format, as it is an existing ISO standard (ISO 28500) and widely used; the
system also supports replaying from HAR files and the web bundles created
by the WBN tool.
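For anyone unfamiliar with WARC: each record is a plain-text header section (a version line plus name: value headers) followed by a blank line and a content block. A minimal sketch of parsing the header section, under that layout, might look like:

```javascript
// Sketch of parsing one WARC record's header section. A record begins
// with a version line (e.g. "WARC/1.0"), followed by headers such as
// WARC-Type, WARC-Target-URI, and Content-Length, then a blank line and
// Content-Length bytes of content block.

function parseWarcHeaders(text) {
  const [headerPart] = text.split('\r\n\r\n', 1);
  const lines = headerPart.split('\r\n');
  const version = lines[0];          // e.g. "WARC/1.0"
  const headers = {};
  for (const line of lines.slice(1)) {
    const idx = line.indexOf(':');
    if (idx > 0) {
      headers[line.slice(0, idx).trim()] = line.slice(idx + 1).trim();
    }
  }
  return { version, headers };
}
```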

However, the WARC format alone is a bit limiting, and there seems to
be a misunderstanding about WARC
<https://github.com/WICG/webpackage/blob/fc9b3e75309546c805b5cdb1db74b2d58a8e0b28/explainers/navigation-to-unsigned-bundles.md#warc>:
it allows random access to HTTP traffic, but does not contain the built-in
index necessary to perform that access (the index is assumed to be
maintained separately). To work around this, I've created a new 'bundling
format', a 'bespoke ZIP format', which can contain WARCs (and other types
of data, even .wbn bundles), along with other metadata and a compressed
index. This ZIP-based format is explained here:
https://github.com/webrecorder/web-archive-collection-format
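The index inside the bundle is what makes random access work: each line maps a URL key and timestamp to the byte range of a record inside a WARC. A sketch of parsing one CDXJ-style index line (the JSON field names here reflect common CDXJ usage, but are illustrative):

```javascript
// Sketch of parsing a CDXJ-style index line. Each line is:
//   <SURT-form URL key> <14-digit timestamp> <JSON blob>
// where the JSON carries the filename/offset/length needed to issue a
// range read for the matching WARC record.

function parseCdxjLine(line) {
  const first = line.indexOf(' ');
  const second = line.indexOf(' ', first + 1);
  return {
    key: line.slice(0, first),                 // SURT-form URL key
    timestamp: line.slice(first + 1, second),  // capture time
    ...JSON.parse(line.slice(second + 1)),     // offset/length/filename etc.
  };
}
```

Because the lines are sorted by key, a replay client can binary-search the index without reading the whole thing.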

Since the ZIP format allows random access (see ZipInfo
<https://github.com/Rob--W/zipinfo.js/>), it is possible to load all
bundled data on demand via HTTP range requests. This allows the format to
scale to tens, and probably hundreds, of GBs.
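The reason ZIP supports this is its end-of-central-directory (EOCD) record: a client fetches just the tail of the file with a Range request, locates the EOCD, and reads the central directory offset from it. A sketch of the byte-layout part (the fetch itself is omitted):

```javascript
// Sketch of locating and parsing a ZIP end-of-central-directory record.
// The EOCD is at least 22 bytes and sits at the very end of the file
// (possibly followed by a comment), so a small tail Range request is
// enough to find it by scanning backwards for its signature.

const EOCD_SIG = 0x06054b50;

function parseEocd(buf) {
  for (let i = buf.length - 22; i >= 0; i--) {
    if (buf.readUInt32LE(i) === EOCD_SIG) {
      return {
        entryCount: buf.readUInt16LE(i + 10), // total central-dir entries
        cdSize: buf.readUInt32LE(i + 12),     // central directory size
        cdOffset: buf.readUInt32LE(i + 16),   // central directory offset
      };
    }
  }
  return null; // not a ZIP, or tail read was too short
}
```

With the central directory in hand, every bundled file's offset and length is known, so any member can be fetched with one more range request.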

The system also supports referencing URLs via query params in the fragment;
for example:
https://replayweb.page/?source=/examples/netpreserve-twitter.warc#view=replay&url=https%3A%2F%2Ftwitter.com%2Fnetpreserve&ts=20190603053135
loads the WARC file, then loads the specified URL from the archive/bundle.
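A sketch of how such a link can be decomposed (the field names match the example above; the parsing approach is my own illustration, not necessarily how replayweb.page implements it). The source lives in the regular query string, while view/url/ts live in the fragment, which never reaches the server:

```javascript
// Sketch: splitting a replay link into its query-string part (which
// archive to load) and its fragment part (what to show from it).

function parseReplayLink(link) {
  const u = new URL(link);
  const frag = new URLSearchParams(u.hash.slice(1)); // drop the leading '#'
  return {
    source: u.searchParams.get('source'), // archive/bundle to load
    view: frag.get('view'),               // e.g. 'replay'
    url: frag.get('url'),                 // captured URL to display
    ts: frag.get('ts'),                   // capture timestamp
  };
}
```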

I wanted to share all of this to see if there's some way to align with the
work you're doing here, though I must admit it is not easy to tell whether
that is possible or of interest to this group. Again, from my perspective,
it seems like you're working on a very similar problem, attempting to
standardize it at the browser level, but perhaps for different use cases.

One area I'm especially interested in is verification for Saving a Bundle
in the Browser
<https://github.com/WICG/webpackage/blob/2a78f2930a228ee6872630ecb023fa71151cc164/draft-yasskin-wpack-use-cases.md#save-and-share-a-web-page-snapshot>.
Unfortunately, it seems that this use case is currently out of scope. I am
especially interested in building tools to solve this problem, so that (to
use this example) Casey can save the page in their browser, share it with
Dakota, and *Dakota can verify that this is what Casey saw in their
browser* and that it was not forged. I think being able to sign a web
bundle from the client's perspective would be extremely useful for
archival, fact-checking, sharing, and other use cases, and could make the
web more trustworthy.

Please let me know if there is any interest in collaborating, or if these
existing tools could somehow help this spec move forward.

(If anyone is interested, the replayweb.page tool can be found on GitHub
at https://github.com/webrecorder/replayweb.page (UI frontend) and
https://github.com/webrecorder/wabac.js (service worker backend).)

Thank you,
Ilya
webrecorder.net