Re: [Wpack] Web Archives, Replaying Web Pages and WPACK

Ilya Kreymer <ikreymer@gmail.com> Mon, 13 July 2020 04:24 UTC

From: Ilya Kreymer <ikreymer@gmail.com>
Date: Sun, 12 Jul 2020 21:24:40 -0700
To: Larry Masinter <LMM@acm.org>
Cc: wpack@ietf.org
Archived-At: <https://mailarchive.ietf.org/arch/msg/wpack/rce8VYfk7ZEO25fReenk66WX5j4>
Subject: Re: [Wpack] Web Archives, Replaying Web Pages and WPACK

On Sun, Jul 12, 2020 at 3:48 PM Larry Masinter <LMM@acm.org> wrote:

> Most of the replay of HTTP is irrelevant to the archive use case because
> there is no point in trying to reach out to original servers long after
> HTTP/n is obsolete. Mostly it’s a privacy threat to record irrelevant
> transaction metadata.
>

Web archive replay is the replay of HTTP exchanges to recreate the original
page as accurately as possible, but within an isolated context. When
visiting, for example, the Internet Archive's Wayback Machine, what it does
is replay (unsigned) HTTP exchanges, with the modifications necessary to make
the pages load from a different origin.
I have implemented a similar system entirely in the browser, using a service
worker to match each request to an archived response. The exact HTTP protocol
itself is generally not relevant; the abstraction of matching an HTTP request
to a stored HTTP response is sufficient. The harder part is making the
response seem like it was loaded from the original origin rather than from
the origin of the archive.

The original servers are never contacted, as the goal is to load the archive
in an isolated context, much like what is proposed with web bundles for
offline use.
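
To make the request matching concrete, here is a minimal sketch of the
service worker approach described above (findArchivedResponse() and the
storage behind it are hypothetical placeholders, not the actual
implementation):

// Minimal service worker sketch: answer fetches from archived responses
// instead of the network. findArchivedResponse() is a hypothetical lookup
// into a local store (e.g. IndexedDB) keyed by the request URL.
self.addEventListener('fetch', (event) => {
  event.respondWith((async () => {
    const archived = await findArchivedResponse(event.request.url);
    if (archived) {
      // Rebuild a Response from the stored status, headers and body.
      return new Response(archived.body, {
        status: archived.status,
        headers: archived.headers,
      });
    }
    // Outside the archive boundary: do not fall through to the live web.
    return new Response('Not in archive', { status: 404 });
  })());
});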



>
>
> Instead you need to define a layer (like PDF/A did for paged documents),
> which preserves the meaning of the original experience without necessarily
> being able to enter in new data and have it recompute. For example, there
> is no good way to archive an empty chat room and preserve the experience of
> saying something new.
>
>
Yes, a way to define boundaries would be nice to have for archives;
currently there is no such spec -- if you hit an archive 'boundary', you'd
end up with a 404 when trying to replay. This is not a requirement for
reasonable replay, though.


>
>
>
>
> The archive use case needs a different security model from the online same
> origin policy.
>
> The model used in PDF is pretty simple:
>
>
>
> Intra-package links are trusted. Links from inside the package to out
> require user verification (once for that package).
>

Yes, I agree, the current security model is unfortunately insufficient --
that's why I was hoping that this spec could help the web archiving use case,
which today remains an actively used example of replaying HTTP exchanges.

Archives are generally isolated bundles and should not link outside the
archive. (Currently this is easier to enforce with a Content Security
Policy.)
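
As an illustration only (not the exact policy any existing replay system
uses), a service worker could attach a restrictive CSP to each replayed
response so that every subresource has to resolve back through the archive's
own origin:

// Hypothetical sketch: wrap an archived response with a CSP that confines
// all loads to the archive origin, so pages cannot reach outside the bundle.
function confineToArchive(archivedResponse) {
  const headers = new Headers(archivedResponse.headers);
  headers.set('Content-Security-Policy',
      "default-src 'self' 'unsafe-inline' 'unsafe-eval' data: blob:");
  return new Response(archivedResponse.body, {
    status: archivedResponse.status,
    headers,
  });
}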

In my systems, I have implemented sandboxing and isolation via JavaScript;
it has taken a lot of effort to do so and is probably not entirely
foolproof.

Generally, this is done via URL prefixes.

For example, an archive of 'https://example.com/' at:

https://my-archive.example.com/bundle-A/20200701/https://example.com/

should not have access to:

https://my-archive.example.com/bundle-B/20200701/https://example.com/

But both pages should believe they are loaded from the origin of
'https://example.com/' in order to operate properly. This is currently done
by 'emulating' the origin via JavaScript injection, since there is no other
way, but it would be great if the browser could instead support this directly
for a trusted archive.

The intent of the above URLs is to say:
'load URL https://example.com/ as archived on 2020-07-01 from
https://my-archive.example.com/bundle-A.bundle' or
'load URL https://example.com/ as archived on 2020-07-01 from
https://my-archive.example.com/bundle-B.bundle'

and the web archiving community has settled on this de-facto URL scheme for
expressing such requests.
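
As a sketch, a replay URL in that scheme can be decomposed like this (the
archive prefix and field names are just the hypothetical ones from the
example above):

// Hypothetical parser for the de-facto replay URL scheme:
//   https://my-archive.example.com/<bundle>/<timestamp>/<original-url>
function parseReplayUrl(replayUrl, archivePrefix = 'https://my-archive.example.com/') {
  if (!replayUrl.startsWith(archivePrefix)) {
    return null; // not a replay URL for this archive
  }
  const rest = replayUrl.slice(archivePrefix.length);
  // Bundle name, then a digit timestamp, then the original URL (slashes and all).
  const match = rest.match(/^([^/]+)\/(\d+)\/(.+)$/);
  if (!match) {
    return null;
  }
  return { bundle: match[1], timestamp: match[2], originalUrl: match[3] };
}

// parseReplayUrl('https://my-archive.example.com/bundle-A/20200701/https://example.com/')
// => { bundle: 'bundle-A', timestamp: '20200701', originalUrl: 'https://example.com/' }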

I think the main difference between the archival use case and some of the
other use cases being considered is the signing/verification and the duration
of the signatures. Since an HTTP exchange is a two-way exchange, what if the
client browser had an equal ability to create a signed exchange?

If the client, rather than only the server, could sign an HTTP exchange, and
the signature could remain verifiable for longer than 7 days, it could be a
significant help to the archival use case, which includes users making their
own snapshots. Archives like the Internet Archive, smaller lesser-known
archives, and even individual users could produce verifiable HTTP exchanges
that could be loaded offline and would be more trustworthy than screenshots.
The other mechanics involved in loading and replaying an HTTP exchange
bundle, such as URL fragments and request-to-response matching, are no
different for the archival use case than for any of the other ones.
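
As a rough sketch of what client-side signing could look like with the
existing WebCrypto API (the serialization of the exchange here is invented
purely for illustration, not SXG or any proposed format):

// Hypothetical sketch: a client signs a captured exchange so that anyone
// holding the matching public key can later verify the capture.
async function signCapturedExchange(requestUrl, responseBody, privateKey) {
  // Illustration-only serialization of the exchange.
  const payload = new TextEncoder().encode(JSON.stringify({
    url: requestUrl,
    body: responseBody,
    capturedAt: Date.now(),
  }));
  const signature = await crypto.subtle.sign(
      { name: 'ECDSA', hash: 'SHA-256' }, privateKey, payload);
  return { payload, signature };
}

// Key pair for the sketch above, generated on the client:
// const keyPair = await crypto.subtle.generateKey(
//     { name: 'ECDSA', namedCurve: 'P-256' }, true, ['sign', 'verify']);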

Ilya