[Wpack] WPACK and Web Archiving-focused bundle format

Ilya Kreymer <ikreymer@gmail.com> Tue, 05 October 2021 19:18 UTC

MIME-Version: 1.0
From: Ilya Kreymer <ikreymer@gmail.com>
Date: Tue, 05 Oct 2021 12:18:37 -0700
Message-ID: <CANAUx6iHU2ip8af0Z32Hiy_nLw24cNX3GcHQcv7WL4UrJRcrPw@mail.gmail.com>
To: WPACK List <wpack@ietf.org>
Content-Type: multipart/alternative; boundary="00000000000020a09e05cd9fe645"
Archived-At: <https://mailarchive.ietf.org/arch/msg/wpack/m4ofhMwnMZq4vLR6WosFDeYAG7o>
Subject: [Wpack] WPACK and Web Archiving-focused bundle format

Hello,

We would like to start the process of standardizing a 'web archive bundle'
format: one that fully supports the requirements for bundling web data (HTTP
request/response pairs, page lists, and associated metadata) for web
archiving use cases. I would like to ask whether the WPACK working group
would be the right place to introduce such an effort, or whether it would be
out of scope.

Looking at the charter, there is definitely overlap with some, but not all,
of the goals. In particular, bullet points #1, 2, 3, and 5 from
https://datatracker.ietf.org/doc/charter-ietf-wpack/ are shared goals of
the web archiving bundle format, while some of the other goals are less
important for web archiving.

The current CBOR-based bundle proposal is not sufficient to address the
full scope of web archiving requirements, and that's totally fine, as the
key goals for that format are quite different.

I am wondering if this group would be open to accepting a proposal for a
different standard specifically geared towards addressing all of the web
archiving use cases, or if we should pursue other paths for such
standardization efforts, such as another working group or a new working
group within the IETF.

Here is a (very) brief summary of some of the requirements for web
archiving that will be fulfilled in this new format:
- forwards and backwards compatibility with the existing ISO WARC format,
which will be used to store the raw HTTP request/response data.
- an index based on URL + timestamp
- index support for multiple request/response pairs of the same URL, at the
same or different timestamps
- support for storing the request body, e.g. for POST requests
- a URL + timestamp index that can be partially loaded via random access,
without reading the whole index.
- support for web archive bundles of different sizes, from a single page to
very large bundles consisting of many GBs of data or hundreds of pages,
loaded entirely via random access.
- support for a text index to allow full-text search
- support for combining multiple web archive bundles for a growing web
archive collection.
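
As a rough illustration of the index requirements above, here is a minimal
Python sketch of a URL + timestamp index that allows several captures of the
same URL and resolves a lookup to a byte range read via random access. It is
only a sketch under stated assumptions: the class and function names, the
in-memory tuple layout, and the 14-digit timestamp format are hypothetical
and are not part of any proposed format.

```python
import bisect
from datetime import datetime

# Assumption for this sketch: timestamps are 14-digit UTC strings,
# e.g. "20211005191837", which also sort correctly as plain strings.
TS_FORMAT = "%Y%m%d%H%M%S"


class UrlTimestampIndex:
    """Hypothetical in-memory index of (url, timestamp, offset, length)."""

    def __init__(self):
        self._entries = []

    def add(self, url, timestamp, offset, length):
        # insort keeps the list ordered by URL, then timestamp. Several
        # entries with the same URL (and even the same timestamp) may coexist.
        bisect.insort(self._entries, (url, timestamp, offset, length))

    def captures(self, url):
        # (url,) sorts before every (url, timestamp, ...) tuple, so
        # bisect_left lands on the first capture of this URL.
        i = bisect.bisect_left(self._entries, (url,))
        found = []
        while i < len(self._entries) and self._entries[i][0] == url:
            found.append(self._entries[i])
            i += 1
        return found

    def closest(self, url, timestamp):
        # Return the capture of `url` nearest in time to `timestamp`,
        # or None if the URL was never captured.
        candidates = self.captures(url)
        if not candidates:
            return None
        target = datetime.strptime(timestamp, TS_FORMAT)
        return min(
            candidates,
            key=lambda e: abs(datetime.strptime(e[1], TS_FORMAT) - target),
        )


def read_record(stream, entry):
    # Random access: seek straight to the record and read only its bytes.
    _, _, offset, length = entry
    stream.seek(offset)
    return stream.read(length)
```

In a real bundle the offsets would point at records in the underlying WARC
data, and the index itself would be stored in a form that can be fetched in
parts rather than held in memory as it is here.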

We of course plan to elaborate on all of these, but first want to understand
whether this group would be the appropriate place for this work.

Any advice or additional guidance would be appreciated, as would any
suggested next steps if this is the right place for this work.

Thank you,
Ilya
Webrecorder