Re: [Wpack] Web Archives, Replaying Web Pages and WPACK

Ilya Kreymer <ikreymer@gmail.com> Mon, 13 July 2020 04:24 UTC

From: Ilya Kreymer <ikreymer@gmail.com>
Date: Sun, 12 Jul 2020 21:24:40 -0700
To: Larry Masinter <LMM@acm.org>
Cc: wpack@ietf.org
Archived-At: <https://mailarchive.ietf.org/arch/msg/wpack/rce8VYfk7ZEO25fReenk66WX5j4>
Subject: Re: [Wpack] Web Archives, Replaying Web Pages and WPACK

On Sun, Jul 12, 2020 at 3:48 PM Larry Masinter <LMM@acm.org> wrote:

> Most of the replay of HTTP is irrelevant to the archive use case because
> there is no point in trying to reach out to original servers long after
> HTTP/n is obsolete. Mostly it’s a privacy threat to record irrelevant
> transaction metadata.
>

Web archive replay is the replay of HTTP exchanges to recreate the original
page as accurately as possible, but within an isolated context. When
visiting, for example, the Internet Archive's Wayback Machine, what it does
is replay (unsigned) HTTP exchanges, with the modifications necessary to make
the pages load from a different origin.
I have implemented a similar system entirely in the browser, using a service
worker to match each request to an archived response. The exact HTTP protocol
itself is generally not relevant; the abstraction of matching an HTTP request
to a stored HTTP response is sufficient. The harder part is making the
response seem like it was loaded from the original origin rather than from
the origin of the archive.

The original servers are never contacted, as the goal is to load the archive
in an isolated context, much like what is proposed with web bundles for
offline use.
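
To make the request matching concrete, here is a minimal sketch of the
service worker approach described above (findArchivedResponse() and the
storage behind it are hypothetical placeholders, not the actual
implementation):

// Minimal service worker sketch: answer fetches from archived responses
// instead of the network. findArchivedResponse() is a hypothetical lookup
// into a local store (e.g. IndexedDB) keyed by the request URL.
self.addEventListener('fetch', (event) => {
  event.respondWith((async () => {
    const archived = await findArchivedResponse(event.request.url);
    if (archived) {
      // Rebuild a Response from the stored status, headers and body.
      return new Response(archived.body, {
        status: archived.status,
        headers: archived.headers,
      });
    }
    // Outside the archive boundary: do not fall through to the live web.
    return new Response('Not in archive', { status: 404 });
  })());
});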



>
>
> Instead you need to define a layer (like PDF/A did for paged documents),
> which preserves the meaning of the original experience without necessarily
> being able to enter in new data and have it recompute. For example, there
> is no good way to archive an empty chat room and preserve the experience of
> saying something new.
>
>
Yes, a way to define boundaries would be nice to have for archives;
currently there is no such spec -- if you hit an archive 'boundary', you'd
end up with a 404 when trying to replay. This is not a requirement for
reasonable replay, though.


>
>
>
>
> The archive use case needs a different security model from the online same
> origin policy.
>
> The model used in PDF is pretty simple:
>
>
>
> Intra-package links are trusted. Links from inside the package to out
> require user verification (once for that package).
>

Yes, I agree, the current security model is unfortunately insufficient --
that's why I was hoping that this spec could help the web archiving use case,
which today remains an actively used example of replaying HTTP exchanges.

Archives are generally isolated bundles and should not link outside the
archive. (Currently this is easier to enforce with a Content Security
Policy.)
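
As an illustration only (not the exact policy any existing replay system
uses), a service worker could attach a restrictive CSP to each replayed
response so that every subresource has to resolve back through the archive's
own origin:

// Hypothetical sketch: wrap an archived response with a CSP that confines
// all loads to the archive origin, so pages cannot reach outside the bundle.
function confineToArchive(archivedResponse) {
  const headers = new Headers(archivedResponse.headers);
  headers.set('Content-Security-Policy',
      "default-src 'self' 'unsafe-inline' 'unsafe-eval' data: blob:");
  return new Response(archivedResponse.body, {
    status: archivedResponse.status,
    headers,
  });
}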

In my systems, I have implemented sandboxing and isolation via JavaScript;
it has taken a lot of effort to do so and is probably not entirely
foolproof.

Generally, this is done via URL prefixes.

For example, an archive of 'https://example.com/' at:

https://my-archive.example.com/bundle-A/20200701/https://example.com/

should not have access to:

https://my-archive.example.com/bundle-B/20200701/https://example.com/

But both pages should believe they are loaded from the origin of
'https://example.com/' in order to operate properly. This is currently done
by 'emulating' the origin via JavaScript injection, since there is no other
way, but it would be great if the browser could instead support this directly
for a trusted archive.

The intent of the above URLs is to say:
'load URL https://example.com/ as archived on 2020-07-01 from
https://my-archive.example.com/bundle-A.bundle' or
'load URL https://example.com/ as archived on 2020-07-01 from
https://my-archive.example.com/bundle-B.bundle'

and the web archiving community has settled on this de-facto URL scheme for
expressing such requests.
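
As a sketch, a replay URL in that scheme can be decomposed like this (the
archive prefix and field names are just the hypothetical ones from the
example above):

// Hypothetical parser for the de-facto replay URL scheme:
//   https://my-archive.example.com/<bundle>/<timestamp>/<original-url>
function parseReplayUrl(replayUrl, archivePrefix = 'https://my-archive.example.com/') {
  if (!replayUrl.startsWith(archivePrefix)) {
    return null; // not a replay URL for this archive
  }
  const rest = replayUrl.slice(archivePrefix.length);
  // Bundle name, then a digit timestamp, then the original URL (slashes and all).
  const match = rest.match(/^([^/]+)\/(\d+)\/(.+)$/);
  if (!match) {
    return null;
  }
  return { bundle: match[1], timestamp: match[2], originalUrl: match[3] };
}

// parseReplayUrl('https://my-archive.example.com/bundle-A/20200701/https://example.com/')
// => { bundle: 'bundle-A', timestamp: '20200701', originalUrl: 'https://example.com/' }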

I think the main difference between the archival use case and some of the
other use cases being considered is the signing/verification and the duration
of the signatures. Since an HTTP exchange is a two-way exchange, what if the
client browser had an equal ability to create a signed exchange?

If the client, rather than only the server, could sign an HTTP exchange, and
the signature could remain verifiable for longer than 7 days, it could be a
significant help to the archival use case, which includes users making their
own snapshots. Archives like the Internet Archive, smaller lesser-known
archives, and even individual users could produce verifiable HTTP exchanges
that could be loaded offline and would be more trustworthy than screenshots.
The other mechanics involved in loading and replaying an HTTP exchange
bundle, such as URL fragments and request-to-response matching, are no
different for the archival use case than for any of the other ones.
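
As a rough sketch of what client-side signing could look like with the
existing WebCrypto API (the serialization of the exchange here is invented
purely for illustration, not SXG or any proposed format):

// Hypothetical sketch: a client signs a captured exchange so that anyone
// holding the matching public key can later verify the capture.
async function signCapturedExchange(requestUrl, responseBody, privateKey) {
  // Illustration-only serialization of the exchange.
  const payload = new TextEncoder().encode(JSON.stringify({
    url: requestUrl,
    body: responseBody,
    capturedAt: Date.now(),
  }));
  const signature = await crypto.subtle.sign(
      { name: 'ECDSA', hash: 'SHA-256' }, privateKey, payload);
  return { payload, signature };
}

// Key pair for the sketch above, generated on the client:
// const keyPair = await crypto.subtle.generateKey(
//     { name: 'ECDSA', namedCurve: 'P-256' }, true, ['sign', 'verify']);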

Ilya