Re: [Wpack] Web Archives, Replaying Web Pages and WPACK
Ilya Kreymer <ikreymer@gmail.com> Mon, 13 July 2020 04:24 UTC
MIME-Version: 1.0
References: <CANAUx6juQjKmJZpj+_gzmz6i+SRK3wYDW0g0zmCr7DY2kXKqyA@mail.gmail.com> <023301d6589e$9b4ead30$d1ec0790$@acm.org>
In-Reply-To: <023301d6589e$9b4ead30$d1ec0790$@acm.org>
From: Ilya Kreymer <ikreymer@gmail.com>
Date: Sun, 12 Jul 2020 21:24:40 -0700
Message-ID: <CANAUx6hfm61DBHRu4takYRNXd6iM=_uJzv+Dbgmgb9PUGW1yLA@mail.gmail.com>
To: Larry Masinter <LMM@acm.org>
Cc: wpack@ietf.org
Content-Type: multipart/alternative; boundary="00000000000063c67b05aa4b126b"
Archived-At: <https://mailarchive.ietf.org/arch/msg/wpack/rce8VYfk7ZEO25fReenk66WX5j4>
Subject: Re: [Wpack] Web Archives, Replaying Web Pages and WPACK
On Sun, Jul 12, 2020 at 3:48 PM Larry Masinter <LMM@acm.org> wrote:

> Most of the replay of HTTP is irrelevant to the archive use case because
> there is no point in trying to reach out to original servers long after
> HTTP/n is obsolete. Mostly it’s a privacy threat to record irrelevant
> transaction metadata.

Web archive replay is the replay of HTTP exchanges to recreate the original page as accurately as possible, but within an isolated context. When you visit, for example, the Internet Archive's Wayback Machine, it replays (unsigned) HTTP exchanges, with the modifications necessary to make the pages load from a different origin. I have implemented a similar system entirely in the browser, using service workers to match a request to an archived response.

The exact HTTP protocol itself is generally not relevant, and the abstraction of matching an HTTP request to a stored HTTP response is sufficient. The harder part is making the response seem like it was loaded from the original origin rather than from the origin of the archive. The original servers are never contacted, as the goal is to load the archive in an isolated context, much like what is proposed with web bundles for offline use.

> Instead you need to define a layer (like PDF/A did for paged documents)
> which preserves the meaning of the original experience without necessarily
> being able to enter in new data and have it recompute. For example, there
> is no good way to archive an empty chat room and preserve the experience of
> saying something new.

Yes, a way to define boundaries would be nice to have for archives. Currently there is no such spec: if you hit an archive 'boundary', you end up with a 404 when trying to replay. This is not a requirement for reasonable replay, though.

> The archive use case needs a different security model from the online same
> origin policy.
>
> The model used in PDF is pretty simple:
>
> Intra-package links are trusted.
> Links from inside the package to out require user verification (once for
> that package).

Yes, I agree, the current security model is unfortunately insufficient. That's why I was hoping that this spec could help the web archiving use case, which today remains an actively used example of replaying HTTP exchanges.

Archives are generally isolated bundles, and should not link outside the archive. (Currently this is easier to enforce with a CSP policy.) In my systems, I have implemented sandboxing and isolation via JavaScript. It has taken a lot of effort, and it is probably not entirely foolproof.

Generally, this is done via URL prefixes. For example, an archive of 'https://example.com/' at:

https://my-archive.example.com/bundle-A/20200701/https://example.com/

should not have access to:

https://my-archive.example.com/bundle-B/20200701/https://example.com/

But both pages should believe they are loaded from the origin of 'https://example.com/' in order to operate properly. This is currently done by 'emulating' the origin via JavaScript injection, since there is no other way, but it would be great if the browser could instead support this for a trusted archive directly.

The intent of the above URLs is to say:

'load URL https://example.com/ archived on 2020-07-01 from https://my-archive.example.com/bundle-B.bundle'

or

'load URL https://example.com/ archived on 2020-07-01 from https://my-archive.example.com/bundle-A.bundle'

and the web archiving community has settled on this de facto URL scheme for expressing such requests.

I think the main difference between the archival use case and some of the other use cases being considered is the signing/verification and the duration of the signatures. Since the HTTP exchange is a two-way exchange, what if the client browser had equal ability to create a signed exchange?

If the client, rather than only the server, could sign an HTTP exchange, and the signature could be verified for longer than 7 days, it could be a significant help to the archival use case, which includes users making their own snapshots. Archives like the Internet Archive, smaller and less well-known archives, and even individual users could produce verifiable HTTP exchanges that could be loaded offline and would be more trustworthy than screenshots.

The other mechanics involved in loading and replaying an HTTP exchange bundle, such as URL fragments and request-to-response matching, are also no different for the archival use case than for any of the other use cases.

Ilya
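
To make the URL-prefix scheme described above concrete, here is a minimal sketch in Python. It is not how any existing replay system is implemented. It assumes the path layout `/<bundle>/<timestamp>/<original URL>` seen in the example URLs, and the names `ArchiveRequest`, `parse_replay_url`, `emulated_origin` and `same_bundle` are hypothetical, chosen only for illustration.

```python
import re
from typing import NamedTuple
from urllib.parse import urlsplit


class ArchiveRequest(NamedTuple):
    """One replay request, split into its parts (hypothetical structure)."""
    archive_origin: str  # origin that actually serves the archive
    bundle: str          # e.g. "bundle-A"
    timestamp: str       # e.g. "20200701"
    original_url: str    # URL as it was at capture time


# Assumed layout of everything after the archive origin:
#   /<bundle>/<timestamp>/<original URL>
_REPLAY_PATH = re.compile(r"^/([^/]+)/(\d+)/(https?://.+)$")


def parse_replay_url(replay_url: str) -> ArchiveRequest:
    """Split a prefix-style replay URL into bundle, timestamp and original URL."""
    parts = urlsplit(replay_url)
    archive_origin = f"{parts.scheme}://{parts.netloc}"
    # Slice the raw string so the original URL keeps its own query and fragment.
    remainder = replay_url[len(archive_origin):]
    match = _REPLAY_PATH.match(remainder)
    if match is None:
        raise ValueError(f"not a replay URL: {replay_url!r}")
    bundle, timestamp, original_url = match.groups()
    return ArchiveRequest(archive_origin, bundle, timestamp, original_url)


def emulated_origin(request: ArchiveRequest) -> str:
    """Origin the replayed page should believe it was loaded from."""
    parts = urlsplit(request.original_url)
    return f"{parts.scheme}://{parts.netloc}"


def same_bundle(a: ArchiveRequest, b: ArchiveRequest) -> bool:
    """Isolation rule assumed here: pages may only reach into their own bundle."""
    return a.archive_origin == b.archive_origin and a.bundle == b.bundle


page_a = parse_replay_url(
    "https://my-archive.example.com/bundle-A/20200701/https://example.com/"
)
page_b = parse_replay_url(
    "https://my-archive.example.com/bundle-B/20200701/https://example.com/"
)
print(page_a.bundle, page_a.timestamp, emulated_origin(page_a))
print(same_bundle(page_a, page_b))
```

Both pages report the same emulated origin while `same_bundle` keeps them apart. That is the tension described above: the browser sees a single archive origin, so the separation has to be enforced by whatever layer interprets the prefix.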