Re: [Wpack] WPACK and Web Archiving-focused bundle format

Ilya Kreymer <ikreymer@gmail.com> Wed, 06 October 2021 17:43 UTC

MIME-Version: 1.0
References: <CANAUx6iHU2ip8af0Z32Hiy_nLw24cNX3GcHQcv7WL4UrJRcrPw@mail.gmail.com> <E7CE685A-18AA-415E-9393-0A194D694B69@sn3rd.com>
In-Reply-To: <E7CE685A-18AA-415E-9393-0A194D694B69@sn3rd.com>
From: Ilya Kreymer <ikreymer@gmail.com>
Date: Wed, 06 Oct 2021 10:42:47 -0700
Message-ID: <CANAUx6j6pnMQM_hAcMcxZjj5FOcP1Ozoe1XdjUir79RcrxVXuw@mail.gmail.com>
To: Sean Turner <sean@sn3rd.com>
Cc: WPACK List <wpack@ietf.org>
Content-Type: multipart/alternative; boundary="0000000000004fca3705cdb2ad47"
Archived-At: <https://mailarchive.ietf.org/arch/msg/wpack/YxPvEnskWx3IWbWDcsxPznTAhsA>
Subject: Re: [Wpack] WPACK and Web Archiving-focused bundle format

Hi Sean,

Thanks for the quick response! My replies are inline below:

On Wed, Oct 6, 2021 at 8:55 AM Sean Turner <sean@sn3rd.com> wrote:

> Ilya,
>
> With my chair hat on:
>
> From a purely procedural perspective, adopting work that is beyond what
> WPACK is chartered for would be considered out of scope. To do this new
> work, WPACK would need to recharter. Rechartering is not out of the
> question, but getting that process kicked off involves some additional
> mechanics.
>
> The process would likely start with a proposal to the DISPATCH WG [0][1].
> DISPATCH is where new work in the ART area is evaluated. After discussing
> the work, DISPATCH can recommend one of several paths: adding the new work
> to an existing WG, forming a new WG to address it, or rejecting the work
> entirely. The best bet to make sure that discussion goes as smoothly as
> possible is to have an I-D ready to go that people can read and comment on.
>
> I hope that helps with respect to the process of getting new work adopted.
>

Thank you for clarifying that it would be out of scope as chartered now (I
wasn't quite sure as there are several overlapping goals) and the steps
needed for moving forward.


>
>
> With my chair hat off:
>
> I do not have access to the WARC IS (ISO 28500:2017), but I found [2], and
> that made me wonder about the following (I have no doubt these questions
> would also come up during dispatch discussions or maybe not because
> everybody but me knows the answer):
>
> - If there is already an ISO IS for the WARC format, then why is an RFC
> needed? I am asking this primarily because the IETF and ISO (through ISOC)
> have a Class A liaison [3]. Part of that agreement is that we will not
> produce duplicative work.


> - If there is already an ISO IS and it is a hard requirement to maintain
> compatibility with the ISO WARC format, then what work is left to be done?
>
> - What are the differences between ISO 28500:2017 and [2]?
>
> - Is there any reason ISO/IIPC has not already registered the
> application/warc and application/warc-fields media types with IANA? I could
> not find them in the list [4].
>

Apologies if the goals weren't made clear. The intent is not to modify WARC
in any way, but to introduce a way of packaging WARC files that makes them
more usable.
The goal is to make web archives loadable (and creatable) in web browsers.
The WARC format alone is not suitable for loading in the browser because it
lacks an index and other metadata (page lists, etc.). It does, however,
contain all of the HTTP request and response data in a well-established
format. There is an existing ecosystem of tools around WARC
(https://github.com/iipc/awesome-web-archiving), and many petabytes of data
are already stored in WARC files.

The idea boils down to this: bundle WARC files, along with other data, into
a ZIP (or ZIP64) file. This allows the WARC files to be stored unaltered
while a random-access index and other requisite data are added alongside
them in the ZIP. The ZIP format (and ZIP64) already supports random-access
reads. While ZIP and WARC of course have their flaws, this simply combines
two well-established formats to solve a number of problems related to web
archives in browsers.

Since this bundle format is a ZIP containing WARCs, 'backwards
compatibility with WARC' becomes as trivial as:

    unzip <web archive bundle> /path/to/warc

after which users have a standard WARC that is compatible with the existing
ecosystem of tools.
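
The same extraction step can be sketched programmatically. This is a rough Python equivalent under assumptions: the bundle file name, the member names, and the rule "anything ending in .warc or .warc.gz is a WARC" are hypothetical choices for the example, not part of the proposal.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_warcs(bundle_path: Path, dest_dir: Path) -> list:
    """Extract every .warc / .warc.gz member of the bundle into dest_dir."""
    extracted = []
    with zipfile.ZipFile(bundle_path) as zf:
        for name in zf.namelist():
            if name.endswith((".warc", ".warc.gz")):
                extracted.append(Path(zf.extract(name, dest_dir)))
    return extracted


# Demo with a throwaway bundle; member names and bytes are made up.
warc_bytes = b"WARC/1.1\r\nWARC-Type: warcinfo\r\n\r\n"
with tempfile.TemporaryDirectory() as tmp:
    bundle_path = Path(tmp) / "example-bundle.zip"
    with zipfile.ZipFile(bundle_path, "w") as zf:
        zf.writestr("archive/data.warc", warc_bytes)
        zf.writestr("pages/pages.jsonl", "{}\n")
    paths = extract_warcs(bundle_path, Path(tmp) / "out")
    names = [p.name for p in paths]
    contents = [p.read_bytes() for p in paths]
```

The extracted file is the original WARC, unchanged, so it can be handed directly to existing WARC tooling.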

The forwards compatibility is that if/when a new ISO WARC format is
introduced, it can also be bundled in the same way. The WARC format itself
is fully orthogonal to the bundling of the WARC in the ZIP.

This proposal would be about defining the contents and layout of the files
inside the bundled ZIP, some of which will be WARC files. The other
formats will need to be more formally defined.

I wanted to offer a very specific but common example, relating to "2.1.3
Save and Share a web page", which was adopted as part of
draft-yasskin-wpack-use-cases.
Suppose a user wants to save a web page that contains two or more embedded
YouTube videos. YouTube currently uses POST requests to the same URL that
differ only by request payload. To fully replay such a page, it is
necessary to save the request payload and to match on that payload during
lookup when determining which video to load where.
Unfortunately, the CBOR-based bundle specification cannot handle this use
case, because request payloads are not saved. The web archive bundle
proposal addresses this use case (and others) by storing the data as WARC
and specifying the steps needed to include the POST data in the lookup
index.
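
As a rough illustration of why the payload has to be part of the lookup, here is a small Python sketch. Everything in it is an assumption for the example: the endpoint URL, the payloads, and the choice of a SHA-256 digest of the request body as part of the key. The actual proposal may specify a different way of folding POST data into the index.

```python
import hashlib
from typing import Dict, Optional, Tuple

Key = Tuple[str, str, str]


def lookup_key(method: str, url: str, body: Optional[bytes] = None) -> Key:
    """Build an index key; the payload digest disambiguates same-URL POSTs."""
    digest = hashlib.sha256(body).hexdigest() if body else ""
    return (method.upper(), url, digest)


# Hypothetical endpoint and payloads standing in for two embedded videos.
URL = "https://video.example/api/player"
index: Dict[Key, str] = {
    lookup_key("POST", URL, b'{"videoId": "aaa"}'): "response for video aaa",
    lookup_key("POST", URL, b'{"videoId": "bbb"}'): "response for video bbb",
}

# A URL-only index would collapse both entries into one.
url_only = {key[1] for key in index}

# At replay time, the same method, URL, and payload find the matching response.
replayed = index[lookup_key("post", URL, b'{"videoId": "bbb"}')]
```

With a URL-only key the two requests are indistinguishable, which is the failure mode described above; adding the payload to the key keeps both responses addressable.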

Of course, it'd be better if YouTube didn't do this, but we have to work
with what we have, and POST requests are commonly used in this way.
This use case also seems to fit one of the charter goals, to support "The
ability to create a snapshot of a web page without the cooperation of its
publisher."

Hopefully this helps clarify the motivation and general structure of this
proposal.  (For anyone interested in more details, an initial rough
proof-of-concept specification that we'd start from exists at:
https://github.com/webrecorder/wacz-format)

I suppose the next step is to reach out to the DISPATCH WG so we can take
this conversation there!

Thanks again,
Ilya




>
> spt
>
> [0] Honestly, the proposal starts with you trying to get folks interested
> in your idea and that’s how I interpret this email ;)
> [1] https://datatracker.ietf.org/wg/dispatch/about/
> [2]
> https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/
> [3] https://www.iso.org/organization/9418.html
> [4]
> https://www.iana.org/assignments/media-types/media-types.xhtml#application
>
>
> > On Oct 5, 2021, at 15:18, Ilya Kreymer <ikreymer@gmail.com> wrote:
> >
> > Hello,
> >
> > We would like to start the process of standardizing a format that fully
> > supports all the requirements for bundling web data, http
> > request/response pairs, page lists, and associated metadata for web
> > archiving use cases, a 'web archive bundle' format. I wanted to inquire
> > if the WPACK working group would be the right place for introducing such
> > an effort, or if it would be out of scope.
> >
> > Looking at the charter, there is definitely overlap with some, but not
> > all of the goals. In particular, the bullet points #1, 2, 3, 5 from
> > https://datatracker.ietf.org/doc/charter-ietf-wpack/ are shared goals of
> > the web archiving bundle format, while some of the other goals are less
> > important.
> >
> > The current CBOR-based bundle proposal is not sufficient to address the
> > full scope of web archiving requirements, and that's totally fine, as key
> > goals for that format are quite different.
> >
> > I am wondering if this group would be open to accepting a proposal for a
> > different standard specifically geared towards addressing all of the web
> > archiving use cases, or if we should pursue other paths, such as other
> > working groups/new working group within IETF for such standardization
> > efforts.
> >
> > Here is a (very) brief summary of some of the requirements for web
> > archiving that will be fulfilled in this new format:
> > - forwards and backwards compatibility with existing ISO WARC format,
> >   which will be used to store the raw http request/response data.
> > - an index based on URL + timestamp
> > - index support for multiple request/response pairs of same URL, at same
> >   or different timestamp
> > - support for storing request body, eg. for POST requests
> > - random-access based URL+timestamp index that can be partially loaded
> >   via random access.
> > - support for different size web archive bundles, from a single page to
> >   very large archive bundles consisting of many GBs of data or hundreds of
> >   pages that are loaded entirely via random access.
> > - support for text index to allow full-text search
> > - support for combining multiple web archive bundles for a growing web
> >   archive collection.
> >
> > We of course plan to elaborate on all of these but first want to
> > understand if this group would be the appropriate place for this work, or
> > not.
> >
> > Any advice/additional guidance on this would be appreciated, or any next
> > steps to pursue this effort, if this is the right place for this work.
> >
> > Thank you,
> > Ilya
> > Webrecorder
> >
> >
> >
>
>