Re: [Wpack] WPACK and Web Archiving-focused bundle format

Larry Masinter <LMM@acm.org> Tue, 05 October 2021 22:36 UTC

Sender: Larry Masinter <masinter@gmail.com>
From: Larry Masinter <LMM@acm.org>
X-Google-Original-From: "Larry Masinter" <lmm@acm.org>
To: 'Ilya Kreymer' <ikreymer@gmail.com>
Cc: 'WPACK List' <wpack@ietf.org>
References: <CANAUx6iHU2ip8af0Z32Hiy_nLw24cNX3GcHQcv7WL4UrJRcrPw@mail.gmail.com>
In-Reply-To: <CANAUx6iHU2ip8af0Z32Hiy_nLw24cNX3GcHQcv7WL4UrJRcrPw@mail.gmail.com>
Date: Tue, 05 Oct 2021 15:36:40 -0700
Message-ID: <00a801d7ba39$777caa80$6675ff80$@acm.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/wpack/U549NRm4XzcpbFLSYDula_JZWRw>
Subject: Re: [Wpack] WPACK and Web Archiving-focused bundle format

A modest proposal missing the 😊 smiley. 

 

--

https://LarryMasinter.net  https://interlisp.org

 

From: Wpack <wpack-bounces@ietf.org> On Behalf Of Ilya Kreymer
Sent: Tuesday, October 5, 2021 12:19 PM
To: WPACK List <wpack@ietf.org>
Subject: [Wpack] WPACK and Web Archiving-focused bundle format

 

Hello,

 

We would like to start the process of standardizing a format that fully supports all the requirements for bundling web data, HTTP request/response pairs, page lists, and associated metadata for web archiving use cases: a 'web archive bundle' format. I wanted to inquire whether the WPACK working group would be the right place to introduce such an effort, or whether it would be out of scope.

 

Looking at the charter, there is definitely overlap with some, but not all, of the goals. In particular, bullet points #1, 2, 3, and 5 from https://datatracker.ietf.org/doc/charter-ietf-wpack/ are shared goals of the web archiving bundle format, while some of the other goals are less important.

 

The current CBOR-based bundle proposal is not sufficient to address the full scope of web archiving requirements, and that's totally fine, as key goals for that format are quite different.

 

I am wondering if this group would be open to accepting a proposal for a different standard specifically geared towards addressing all of the web archiving use cases, or if we should pursue other paths, such as another working group or a new working group within the IETF, for such standardization efforts.

 

Here is a (very) brief summary of some of the requirements for web archiving that will be fulfilled in this new format:

- forwards and backwards compatibility with the existing ISO WARC format, which will be used to store the raw HTTP request/response data.

- an index based on URL + timestamp

- index support for multiple request/response pairs of the same URL, at the same or different timestamps

- support for storing the request body, e.g. for POST requests

- a URL + timestamp index that supports random access, so that it can be partially loaded.

- support for web archive bundles of different sizes, from a single page to very large archive bundles, consisting of many GBs of data or hundreds of pages, that are loaded entirely via random access.

- support for a text index to allow full-text search

- support for combining multiple web archive bundles for a growing web archive collection.
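
As a rough illustration of the index requirements listed above (not the proposed format itself), here is a minimal Python sketch of a URL + timestamp index that allows several captures of the same URL, including at the same timestamp, and reads a single record by seeking to it. The names IndexEntry, UrlTimestampIndex, and read_record, the 14-digit timestamp strings, and the offset/length fields are all assumptions made for this example; nothing here is taken from WARC or from any bundle specification.

```python
import bisect
from typing import BinaryIO, List, NamedTuple, Optional


class IndexEntry(NamedTuple):
    url: str        # lookup key, part 1
    timestamp: str  # lookup key, part 2; assumed 14-digit UTC, e.g. "20211005223640"
    offset: int     # hypothetical: byte offset of the stored record
    length: int     # hypothetical: byte length of the stored record


class UrlTimestampIndex:
    """Sorted in-memory index; duplicate (url, timestamp) keys are allowed."""

    def __init__(self) -> None:
        self._entries: List[IndexEntry] = []

    def add(self, entry: IndexEntry) -> None:
        # insort keeps entries ordered by (url, timestamp, offset, length).
        bisect.insort(self._entries, entry)

    def lookup(self, url: str) -> List[IndexEntry]:
        """Return every capture of `url`, oldest first."""
        lo = bisect.bisect_left(self._entries, (url,))
        # url + "\x00" is the smallest string that sorts after `url`.
        hi = bisect.bisect_left(self._entries, (url + "\x00",))
        return self._entries[lo:hi]

    def at_or_before(self, url: str, timestamp: str) -> Optional[IndexEntry]:
        """Return the latest capture of `url` not newer than `timestamp`."""
        captures = self.lookup(url)
        keys = [e.timestamp for e in captures]
        i = bisect.bisect_right(keys, timestamp)
        return captures[i - 1] if i else None


def read_record(data: BinaryIO, entry: IndexEntry) -> bytes:
    """Random access: seek to one record instead of scanning the whole file."""
    data.seek(entry.offset)
    return data.read(entry.length)
```

A real bundle would keep the index on disk and load only the parts it needs; the sorted list here only stands in for that to show the lookup semantics.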

 

We of course plan to elaborate on all of these, but first want to understand whether this group would be the appropriate place for this work.

 

Any advice/additional guidance on this would be appreciated, or any next steps to pursue this effort, if this is the right place for this work.

 

Thank you,

Ilya

Webrecorder