Re: Dictionary Compression for HTTP (at Facebook)

Felix Handte <felixh@fb.com> Fri, 31 August 2018 06:00 UTC

Return-Path: <ietf-http-wg-request+bounce-httpbisa-archive-bis2juki=lists.ie@listhub.w3.org>
X-Original-To: ietfarch-httpbisa-archive-bis2Juki@ietfa.amsl.com
Delivered-To: ietfarch-httpbisa-archive-bis2Juki@ietfa.amsl.com
Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id 51ED312872C for <ietfarch-httpbisa-archive-bis2Juki@ietfa.amsl.com>; Thu, 30 Aug 2018 23:00:35 -0700 (PDT)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -7.899
X-Spam-Level:
X-Spam-Status: No, score=-7.899 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, HEADER_FROM_DIFFERENT_DOMAINS=0.001, HTML_MESSAGE=0.001, MAILING_LIST_MULTI=-1, RCVD_IN_DNSWL_HI=-5, SPF_PASS=-0.001] autolearn=ham autolearn_force=no
Received: from mail.ietf.org ([4.31.198.44]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id 5A0ZT4J-fRjU for <ietfarch-httpbisa-archive-bis2Juki@ietfa.amsl.com>; Thu, 30 Aug 2018 23:00:31 -0700 (PDT)
Received: from frink.w3.org (frink.w3.org [128.30.52.56]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id BC46C128CF2 for <httpbisa-archive-bis2Juki@lists.ietf.org>; Thu, 30 Aug 2018 23:00:30 -0700 (PDT)
Received: from lists by frink.w3.org with local (Exim 4.89) (envelope-from <ietf-http-wg-request@listhub.w3.org>) id 1fvcRa-0007of-6p for ietf-http-wg-dist@listhub.w3.org; Fri, 31 Aug 2018 05:58:06 +0000
Resent-Message-Id: <E1fvcRa-0007of-6p@frink.w3.org>
Received: from titan.w3.org ([128.30.52.76]) by frink.w3.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.89) (envelope-from <ylafon@w3.org>) id 1fvcRX-0007o0-6s for ietf-http-wg@listhub.w3.org; Fri, 31 Aug 2018 05:58:03 +0000
Received: from raoul.w3.org ([128.30.52.128]) by titan.w3.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.89) (envelope-from <ylafon@w3.org>) id 1fvcR0-0003yC-Da for ietf-http-wg@w3.org; Fri, 31 Aug 2018 05:58:02 +0000
Received: from platy.fdn.fr ([80.67.176.7] helo=[192.168.1.38]) by raoul.w3.org with esmtpsa (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.89) (envelope-from <ylafon@w3.org>) id 1fvcQz-0001MY-NU for ietf-http-wg@w3.org; Fri, 31 Aug 2018 05:57:30 +0000
Mime-Version: 1.0 (Mac OS X Mail 11.5 \(3445.9.1\))
Content-Type: multipart/alternative; boundary="Apple-Mail=_5945A07D-568A-4CC1-BA6E-D9E4200293B1"
From: Felix Handte <felixh@fb.com>
In-Reply-To: <ED51E194-503A-4339-B564-A6543F42D0A1@mnot.net>
Resent-From: Yves Lafon <ylafon@w3.org>
Date: Fri, 24 Aug 2018 10:24:23 +0000
Cc: Charles McCathie-Neville <chaals@yandex-team.ru>, Evgenii Kliuchnikov <eustas@google.com>, Vlad Krasnov <vlad@cloudflare.com>, Nick Terrell <terrelln@fb.com>, Yann Collet <cyan@fb.com>, HTTP Working Group <ietf-http-wg@w3.org>
Resent-Date: Fri, 31 Aug 2018 07:57:28 +0200
Message-Id: <652edc11-2d19-aef9-e3fd-ecb77ab47c1a@fb.com>
Resent-To: HTTP Working Group <ietf-http-wg@w3.org>
References: <18eb0343-640c-8b95-1cc2-273bc72ec134@fb.com> <CAPapA7RLncAsHH5pr5RJSYjvPiNk8JvgBJ8T-tKebnC1C5ptHw@mail.gmail.com> <ED51E194-503A-4339-B564-A6543F42D0A1@mnot.net>
X-Name-Md5: efe3dad792d606410c9cc49cedaffc94
To: Mark Nottingham <mnot@mnot.net>, Jyrki Alakuijala <jyrki@google.com>
X-Mailer: Apple Mail (2.3445.9.1)
X-W3C-Hub-Spam-Status: No, score=-2.1
X-W3C-Hub-Spam-Report: ALL_TRUSTED=-1, AWL=-0.000, BAYES_00=-1.9, HEADER_FROM_DIFFERENT_DOMAINS=0.25, HTML_MESSAGE=0.001, W3C_NW=0.5
X-W3C-Scan-Sig: titan.w3.org 1fvcR0-0003yC-Da c53c1986d068f11e322e2eb0b5141965
X-Original-To: ietf-http-wg@w3.org
Subject: Re: Dictionary Compression for HTTP (at Facebook)
Archived-At: <https://www.w3.org/mid/652edc11-2d19-aef9-e3fd-ecb77ab47c1a@fb.com>
X-Mailing-List: <ietf-http-wg@w3.org> archive/latest/35865
X-Loop: ietf-http-wg@w3.org
Resent-Sender: ietf-http-wg-request@w3.org
Precedence: list
List-Id: <ietf-http-wg.w3.org>
List-Help: <https://www.w3.org/Mail/>
List-Post: <mailto:ietf-http-wg@w3.org>
List-Unsubscribe: <mailto:ietf-http-wg-request@w3.org?subject=unsubscribe>

Jyrki,

Glad to hear it! We're excited too!

We look forward to playing with Brotli's new capabilities when they're available.

Yes, Zstd can accept unformatted buffers as dictionaries, which it uses as you describe. Additionally though, Zstd describes a format for structured dictionaries[1], which includes metadata in addition to an LZ77 prefix.

Mark,

I agree that hard thought is warranted about the security implications. Absolutely, if you have recommendations for who would be best to talk to, I'd welcome being pointed their way.

That was my mistake, I subscribed to the list from my personal email, but sent the message from my work email. since this was my first email to the list, it got captured into a review queue from which it hasn't yet escaped.

I wanted to briefly capture my understanding of where we're at. Broadly, it seems to me that there are fundamentally four questions that are relevant to this discussion:

(1) What are the eligible sources for content to be used as a dictionary by a particular HTTP request/response interaction?
(2) How should the client and server negotiate to use dictionaries at all, and how should they identify the particular content to use?
(3) How should the compressor make use of the dictionary content?
(4) Do these mechanisms introduce new security issues, and if so, how can they be mitigated?

Jyrki, would you mind briefly sharing where you guys stand with respect to these questions? Where do you contemplate proposing dictionaries be allowed to be drawn from? How do you intend to negotiate the use of a dictionary?

In particular, a derivative topic from questions (1) and (2) is whether and how the dictionary selection and negotiation process should be coupled to the negotiation/selection of the compression algorithm. Given that several of the content coding compressors (shared brotli, zstd, and gzip/deflate) support custom dictionaries, it seems desirable to find a mechanism that is agnostic to the algorithm in use. In fact, I can only imagine it making sense to tie a spec in this space to a specific compressor if there were a very strong reason to do so.

My reading of the shared brotli draft[2] is that, to the extent to which specific mechanisms for handling dictionaries are being contemplated by the Brotli team, they are strongly coupled to the use of Brotli and/or the Shared Brotli format. Is that your intent? If so, would you mind explaining why you feel that's advantageous over a generic solution?

For our own part, we find ourselves drawn towards a solution that makes a lot of the same choices as SDCH. That is, one that treats dictionaries as explicit resources that can be dynamically advertised by an origin, fetched and cached by a client, and then negotiated to be used in requests/responses between the two. The ability to treat a previous, cached response as a base on which to apply a "diff" (negotiated by ETag?) is also attractive to us.

Thanks,
Felix

[1] https://github.com/facebook/zstd/blob/dev/doc/zstd_compression_format.md#dictionary-format <https://github.com/facebook/zstd/blob/dev/doc/zstd_compression_format.md#dictionary-format>
[2] https://tools.ietf.org/html/draft-vandevenne-shared-brotli-format-01#section-9 <https://tools.ietf.org/html/draft-vandevenne-shared-brotli-format-01#section-9>

On 08/22/2018 11:13 PM, Mark Nottingham wrote:
> Hello Felix and Jyrki,
> Shared dictionary compression in various forms has been discussed in the Working Group for a fair amount of time. What's blocking progress is an agreed-to description of its security properties, issues therein, and acceptable mitigations for them.
> Some people have expressed interest in attempting to document that in an Internet-Draft, but to date we haven't seen any progress publicly. If you'd like, I can try to put you in touch with them to see if they need help, etc.
> Regards,
> P.S. Felix, your e-mail didn't make it to me, or into the archives. Are you subscribed to the list?
>> On 22 Aug 2018, at 6:23 pm, Jyrki Alakuijala <jyrki@google.com> wrote:
>> 
>> Fully agree! Sharing dictionaries is an amazing opportunity in making the internet faster and cheaper. SDCH never exploited that opportunity fully and it is great that we all are giving another go on this.
>> 
>> I presume Zstd dictionaries are simple:
>> 	• fill lz77 buffer with bytes
>> ... and I know that Shared brotli dictionaries are relatively complex:
>> 	• fill lz77 buffer with bytes, or,
>> 	• add special meaning for unique (distance, length) pairs (2 % more density than filling lz77 buffer with bytes), or,
>> 	• perform a binary diff on patch data (makes bsdiff obsolete by compressing 5–10 % more than bsdiff+brotli, can by 95+ % more dense than traditional lz77 dictionary for patching).
>> 	• when distance overflows for unique (distance, length) pairs, a customized word transform is applied (gives 2 % more density)
>> 	• context modeling: dictionary ordering of interpretation of (distance, length) pairs may depend on the last two bytes (unknown gains, I anticipate 1 %)
>> For data like the Google search result pages we can see a reduction of ~50 % in data when we go from "br" Brotli to Shared Brotli, and naturally very significant latency wins. Having binary diffing within shared dictionary infrastructure can allow patches for web packaging, Android apps, fonts, or other complex structured data to be efficiently compressed with shared dictionary by just using the previous version of that data as a dictionary.
>> 
>> 
>> 
>> On Wed, Aug 22, 2018 at 2:30 AM, Felix Handte <felixh@fb.com> wrote:
>> Hello all,
>> 
>> Quick introduction: I'm an engineer on the Data Compression team at Facebook. While we partner with other teams here to apply compression internally at Facebook, we primarily maintain the open source Zstandard[1][2] and LZ4[3] libraries.
>> 
>> We've seen enormous success leveraging dictionary-based compression with Zstd internally, and I'm starting to look at how we can apply the same toolkit/approach to compressing our public web traffic. As we're thinking about how to do this, both as a significant origin of HTTP traffic and as maintainers of open source compression tools, we want very much to pursue a course of action that is constructive for the broader community.
>> 
>> There are, by my count, three competing proposals for how this sort of thing might work (SDCH[4], Compression Dictionaries for HTTP/2[5], and Shared Brotli Dictionaries[6]+[7]). With no public consensus around how to do this well, it's tempting for us to simply build on the tooling we've built internally, and apply it to our traffic between our webservers and our mobile apps (where we control both ends of the connection and can do anything we want). However, it would be pretty tragic for Facebook to gin up its own spec and implementation in this space, roll it out, and end up with something mutually incompatible with anyone else's efforts, further fragmenting the community and driving consensus further off.
>> 
>> So I wanted to first resurface this topic with you all. In short, is there anyone still interested in pursuing a standard covering these topics? If so, I would like to work with you and help build something in this space that can actually see adoption.
>> 
>> Thanks,
>> Felix
>> 
>> [1] https://github.com/facebook/zstd
>> [2] https://urldefense.proofpoint.com/v2/url?u=https-3A__tools.ietf.org_html_draft-2Dkucherawy-2Ddispatch-2Dzstd-2D03&d=DwIFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=oTwXA79Pkx6Pzji3vhGR4A&m=PMTxmlcVv81wLtYeJ-1OSrLBQshZ8qFJXD0_hE2Dm7o&s=0WbAaGBnEaii-Se8oX3hykAULz2ELjd0BpsiWnUtJn8&e= <https://urldefense.proofpoint.com/v2/url?u=https-3A__tools.ietf.org_html_draft-2Dkucherawy-2Ddispatch-2Dzstd-2D03&d=DwIFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=oTwXA79Pkx6Pzji3vhGR4A&m=PMTxmlcVv81wLtYeJ-1OSrLBQshZ8qFJXD0_hE2Dm7o&s=0WbAaGBnEaii-Se8oX3hykAULz2ELjd0BpsiWnUtJn8&e=>
>> [3] https://github.com/lz4/lz4 <https://github.com/lz4/lz4>
>> [4] https://urldefense.proofpoint.com/v2/url?u=https-3A__tools.ietf.org_html_draft-2Dlee-2Dsdch-2Dspec-2D00&d=DwIFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=oTwXA79Pkx6Pzji3vhGR4A&m=PMTxmlcVv81wLtYeJ-1OSrLBQshZ8qFJXD0_hE2Dm7o&s=xSsC6Y5cdBQH26kF2EsHBo14ro6umSHXxj5isrqUROs&e= <https://urldefense.proofpoint.com/v2/url?u=https-3A__tools.ietf.org_html_draft-2Dlee-2Dsdch-2Dspec-2D00&d=DwIFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=oTwXA79Pkx6Pzji3vhGR4A&m=PMTxmlcVv81wLtYeJ-1OSrLBQshZ8qFJXD0_hE2Dm7o&s=xSsC6Y5cdBQH26kF2EsHBo14ro6umSHXxj5isrqUROs&e=>
>> [5] https://urldefense.proofpoint.com/v2/url?u=https-3A__tools.ietf.org_html_draft-2Dvkrasnov-2Dh2-2Dcompression-2Ddictionaries-2D03&d=DwIFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=oTwXA79Pkx6Pzji3vhGR4A&m=PMTxmlcVv81wLtYeJ-1OSrLBQshZ8qFJXD0_hE2Dm7o&s=tpEFC6Asr_xaTeEcJr4UWozbPOwBs1LQUVJ-s9eNlYU&e= <https://urldefense.proofpoint.com/v2/url?u=https-3A__tools.ietf.org_html_draft-2Dvkrasnov-2Dh2-2Dcompression-2Ddictionaries-2D03&d=DwIFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=oTwXA79Pkx6Pzji3vhGR4A&m=PMTxmlcVv81wLtYeJ-1OSrLBQshZ8qFJXD0_hE2Dm7o&s=tpEFC6Asr_xaTeEcJr4UWozbPOwBs1LQUVJ-s9eNlYU&e=>
>> [6] https://urldefense.proofpoint.com/v2/url?u=https-3A__tools.ietf.org_html_draft-2Dvandevenne-2Dshared-2Dbrotli-2Dformat-2D01&d=DwIFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=oTwXA79Pkx6Pzji3vhGR4A&m=PMTxmlcVv81wLtYeJ-1OSrLBQshZ8qFJXD0_hE2Dm7o&s=Y30ZAb-fBmy4gAt6S37X1t7DuD0qZPGX44u9beu8s24&e= <https://urldefense.proofpoint.com/v2/url?u=https-3A__tools.ietf.org_html_draft-2Dvandevenne-2Dshared-2Dbrotli-2Dformat-2D01&d=DwIFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=oTwXA79Pkx6Pzji3vhGR4A&m=PMTxmlcVv81wLtYeJ-1OSrLBQshZ8qFJXD0_hE2Dm7o&s=Y30ZAb-fBmy4gAt6S37X1t7DuD0qZPGX44u9beu8s24&e=>
>> [7] https://github.com/google/brotli/wiki/Fetch-Specification <https://github.com/google/brotli/wiki/Fetch-Specification>
>> 
>> 
> --
> Mark Nottingham   https://urldefense.proofpoint.com/v2/url?u=https-3A__www.mnot.net_&d=DwIFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=oTwXA79Pkx6Pzji3vhGR4A&m=PMTxmlcVv81wLtYeJ-1OSrLBQshZ8qFJXD0_hE2Dm7o&s=CVOwcJi5LnOBvYeEnIyqEsxyEavnNj9CB0O7W547ToE&e= <https://urldefense.proofpoint.com/v2/url?u=https-3A__www.mnot.net_&d=DwIFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=oTwXA79Pkx6Pzji3vhGR4A&m=PMTxmlcVv81wLtYeJ-1OSrLBQshZ8qFJXD0_hE2Dm7o&s=CVOwcJi5LnOBvYeEnIyqEsxyEavnNj9CB0O7W547ToE&e=>