Continuing discussion on Cache Digest

Mark Nottingham <mnot@mnot.net> Sat, 20 August 2016 01:04 UTC

From: Mark Nottingham <mnot@mnot.net>
To: HTTP Working Group <ietf-http-wg@w3.org>
Cc: Kazuho Oku <kazuhooku@gmail.com>
Date: Sat, 20 Aug 2016 10:59:32 +1000
Message-Id: <1C76B7AB-A7B9-4759-AC52-475C3E030137@mnot.net>
Subject: Continuing discussion on Cache Digest
Archived-At: <http://www.w3.org/mid/1C76B7AB-A7B9-4759-AC52-475C3E030137@mnot.net>

[ with my "cache digest co-author" hat on ]

In discussions about Cache Digest, one of the questions that came up was whether it was necessary to use a digest mechanism (e.g., Bloom filter, Golomb-compressed set), or whether we could just send a list of the cached representations.
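
To make that trade-off concrete, here's a minimal sketch (mine, not from the draft; the URLs and parameters are made up) of a Golomb-coded set next to a plain enumeration. It follows the general GCS shape (hash each URL into [0, N*P_inverse), sort, Rice-code the deltas) rather than any specific wire format:

  import hashlib

  def gcs_encode(urls, p_inverse=64):
      """Encode a set of URLs as a Golomb-coded set (GCS).

      Returns a '0'/'1' string for clarity; a real implementation
      would pack bits. False-positive rate is roughly 1/p_inverse.
      """
      modulus = len(urls) * p_inverse
      # Hash each URL into [0, modulus) and sort; the deltas are then
      # roughly geometric, which Rice coding compresses well.
      values = sorted(
          int.from_bytes(hashlib.sha256(u.encode()).digest(), "big") % modulus
          for u in urls
      )
      r_bits = p_inverse.bit_length() - 1  # Rice parameter (p_inverse a power of 2)
      out, prev = [], 0
      for v in values:
          q, r = divmod(v - prev, p_inverse)
          prev = v
          out.append("1" * q + "0")             # unary-coded quotient
          out.append(format(r, f"0{r_bits}b"))  # fixed-width remainder
      return "".join(out)

  urls = [f"https://example.com/asset-{i}.js" for i in range(100)]
  digest = gcs_encode(urls)
  plain = sum(len(u) for u in urls)
  print(f"plain list: ~{plain} bytes; GCS: ~{len(digest) // 8} bytes")

The digest comes out at around a byte per entry, versus tens of bytes per entry for the enumeration.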

Curious about this, I whipped up a script to parse the contents of Chrome's cache, to get some idea of how many cached responses per origin a browser keeps.

See:
  https://gist.github.com/mnot/793fcfb0d003e87ea7e8035c43eafdb9
and responses to:
  https://twitter.com/mnot/status/766542805980155905
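
(The gist above has the actual script; as a rough illustration of the tallying involved, the sketch below groups a list of cached URLs by origin and counts them. The input file name is hypothetical.)

  from collections import Counter
  from urllib.parse import urlsplit

  def per_origin_counts(urls):
      """Count cached responses per origin (scheme + host + port)."""
      counts = Counter()
      for url in urls:
          parts = urlsplit(url)
          counts[f"{parts.scheme}://{parts.netloc}"] += 1
      return counts

  # Hypothetical input: one cached URL per line.
  with open("cached-urls.txt") as f:
      counts = per_origin_counts(line.strip() for line in f if line.strip())

  for origin, n in counts.most_common(20):
      print(f"{n:6d}  {origin}")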

The caveats around this are too numerous to cover, but to mention a few:
  - this is just anecdata, and a very small sample at that
  - it's skewed towards: 
	a) people who follow me on Twitter; 
	b) people who use Chrome; 
	c) people who can easily run a Python program (leaving most Windows users out)
  - it includes both fresh and stale cached responses
  - it assumes that the chrome://cache URL gives the complete and correct state of the cache

Looking at the responses (five so far) and keeping that in mind, a few observations:

1. Unsurprisingly, the number of cached responses per origin appears to follow (roughly) a Zipf curve, like so many other Web stats do (a quick slope check is sketched after this list)
2. Origins with tens of cached responses appear to be very common
3. Origins with hundreds of cached responses appear to be not uncommon at all
4. Origins with thousands of cached responses are encountered
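
As a quick sanity check on the Zipf observation: on a log-log plot of count against rank, a roughly Zipf-shaped sample lines up with a slope near -1. A minimal sketch (the sample counts here are invented, not from the submitted data):

  import math

  def zipf_slope(counts):
      """Least-squares slope of log(count) vs. log(rank); a roughly
      Zipf-shaped sample comes out with a slope near -1."""
      counts = sorted(counts, reverse=True)
      xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
      ys = [math.log(c) for c in counts]
      n = len(xs)
      mx, my = sum(xs) / n, sum(ys) / n
      cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
      var = sum((x - mx) ** 2 for x in xs)
      return cov / var

  # Invented per-origin counts, just to show the shape of the check.
  print(f"slope: {zipf_slope([410, 200, 130, 95, 70, 33, 21, 12, 7, 3]):.2f}")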

More data is, of course, welcome.

My early take-away is that if we design a mechanism that enumerates cached responses, rather than digesting the entire cache's contents for the origin, there needs to be some way to select the most relevant cached responses.

The most likely time to do that is when the responses themselves are first cached; e.g., with a cache-control extension. I think the challenges that such a scheme would face are:

a) Keeping the advertisement concise (because it should fit into a navigation request without bumping into another round trip of congestion window; see the rough numbers after this list)
b) Being able to express the presence of a larger number of URLs (since one of the effects of HTTP/2 is atomisation into a larger number of smaller resources), with bits of state like "fresh/stale" attached
c) Being manageable for the origin (since they'll effectively have to predict what URLs are important to know about ahead of time, and in the face of site changes)
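
To put rough numbers on (a): a common initial congestion window is around 10 segments (roughly 14 KB), and a GCS costs on the order of log2(1/P) + 2 bits per entry at false-positive probability P. The constants below are assumptions for illustration, not measurements:

  import math

  INITCWND_BYTES = 10 * 1460   # assumed typical initial congestion window
  AVG_URL_BYTES = 60           # assumed average URL length

  def gcs_bytes(n, p_inverse):
      """Approximate GCS size: ~log2(1/P) + 2 bits per entry."""
      return n * (math.log2(p_inverse) + 2) / 8

  for n in (10, 100, 1000):
      print(f"{n:5d} entries: digest ~{gcs_bytes(n, 128):6.0f} B, "
            f"plain list ~{n * AVG_URL_BYTES:6d} B, window {INITCWND_BYTES} B")

Even a digest of a thousand entries fits in the window with room to spare; a plain list of a thousand URLs does not.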

To me, this makes CD more attractive, because we have more confidence that (a) and (b) are in hand, and (c) isn't a worry because the entire origin's cache state will be sent. Provided that the security/privacy issues are in hand, and that it's reasonably implementable by clients, I think CD also has a better chance of success because it decouples the sending of the cache state from its use, making it easier to reuse the data on the server side without close client coordination.

So, I think the things that we do need to work on in CD are:

1) Choosing a more efficient hash algorithm and ensuring that it's reasonable to implement in browsers (a key-derivation sketch follows this list)
2) Refining the flags / operation models so that it's as simple and sensible as possible (but we need feedback on how clients want to send it)
3) Defining a way for origins to opt into getting CD, rather than always sending it.
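
For (1), the key derivation is small enough to prototype directly. The sketch below follows the general shape the draft has used (hash the URL, then reduce into a range sized by the entry count and the false-positive probability); the specific hash and parameters here are placeholders, not a proposal:

  import hashlib

  def digest_key(url, n, p_inverse):
      """Map a URL to a digest key in [0, n * p_inverse), so that
      collisions give a roughly 1/p_inverse false-positive rate.
      SHA-256 is a placeholder here; picking the hash is the open item.
      """
      h = int.from_bytes(hashlib.sha256(url.encode("utf-8")).digest(), "big")
      return h % (n * p_inverse)

  # e.g. 100 cached responses, 1/128 false-positive probability
  print(digest_key("https://example.com/style.css", n=100, p_inverse=128))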

Does this sound reasonable?

--
Mark Nottingham   https://www.mnot.net/