Re: Compression analysis of perfect atom-based compressor

Mark Nottingham <mnot@mnot.net> Fri, 05 April 2013 00:03 UTC

Return-Path: <ietf-http-wg-request@listhub.w3.org>
X-Original-To: ietfarch-httpbisa-archive-bis2Juki@ietfa.amsl.com
Delivered-To: ietfarch-httpbisa-archive-bis2Juki@ietfa.amsl.com
Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id 497C621F8DFB for <ietfarch-httpbisa-archive-bis2Juki@ietfa.amsl.com>; Thu, 4 Apr 2013 17:03:46 -0700 (PDT)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -9.266
X-Spam-Level:
X-Spam-Status: No, score=-9.266 tagged_above=-999 required=5 tests=[AWL=1.333, BAYES_00=-2.599, RCVD_IN_DNSWL_HI=-8]
Received: from mail.ietf.org ([12.22.58.30]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id 5lDysvyT1+ZA for <ietfarch-httpbisa-archive-bis2Juki@ietfa.amsl.com>; Thu, 4 Apr 2013 17:03:45 -0700 (PDT)
Received: from frink.w3.org (frink.w3.org [128.30.52.56]) by ietfa.amsl.com (Postfix) with ESMTP id BA57A21F8BF0 for <httpbisa-archive-bis2Juki@lists.ietf.org>; Thu, 4 Apr 2013 17:03:44 -0700 (PDT)
Received: from lists by frink.w3.org with local (Exim 4.72) (envelope-from <ietf-http-wg-request@listhub.w3.org>) id 1UNu7h-0005nT-7E for ietf-http-wg-dist@listhub.w3.org; Fri, 05 Apr 2013 00:03:17 +0000
Resent-Date: Fri, 05 Apr 2013 00:03:17 +0000
Resent-Message-Id: <E1UNu7h-0005nT-7E@frink.w3.org>
Received: from lisa.w3.org ([128.30.52.41]) by frink.w3.org with esmtp (Exim 4.72) (envelope-from <mnot@mnot.net>) id 1UNu7b-0005mn-NP for ietf-http-wg@listhub.w3.org; Fri, 05 Apr 2013 00:03:11 +0000
Received: from mxout-07.mxes.net ([216.86.168.182]) by lisa.w3.org with esmtps (TLS1.0:DHE_RSA_AES_256_CBC_SHA1:32) (Exim 4.72) (envelope-from <mnot@mnot.net>) id 1UNu7W-00008D-6a for ietf-http-wg@w3.org; Fri, 05 Apr 2013 00:03:11 +0000
Received: from [192.168.1.80] (unknown [118.209.42.8]) (using TLSv1 with cipher AES128-SHA (128/128 bits)) (No client certificate requested) by smtp.mxes.net (Postfix) with ESMTPSA id 66DB122E257; Thu, 4 Apr 2013 20:02:43 -0400 (EDT)
Content-Type: multipart/mixed; boundary="Apple-Mail=_3D73D4B1-E11D-42D7-8273-EC67B905D384"
Mime-Version: 1.0 (Mac OS X Mail 6.3 \(1503\))
From: Mark Nottingham <mnot@mnot.net>
In-Reply-To: <CAP+FsNew+0ce6q24KRAdut7g4OjU5ysOSxd8Yk-FBmRrx0550w@mail.gmail.com>
Date: Fri, 05 Apr 2013 11:02:38 +1100
Cc: HTTP Working Group <ietf-http-wg@w3.org>
Message-Id: <49A31D0F-3388-459E-A1B1-8ECEAACDD31C@mnot.net>
References: <CAP+FsNew+0ce6q24KRAdut7g4OjU5ysOSxd8Yk-FBmRrx0550w@mail.gmail.com>
To: Roberto Peon <grmocg@gmail.com>
X-Mailer: Apple Mail (2.1503)
Received-SPF: pass client-ip=216.86.168.182; envelope-from=mnot@mnot.net; helo=mxout-07.mxes.net
X-W3C-Hub-Spam-Status: No, score=-3.4
X-W3C-Hub-Spam-Report: AWL=-3.365, SPF_HELO_PASS=-0.001, SPF_PASS=-0.001
X-W3C-Scan-Sig: lisa.w3.org 1UNu7W-00008D-6a 7f83e928561a242ced1032eea1f6e0c0
X-Original-To: ietf-http-wg@w3.org
Subject: Re: Compression analysis of perfect atom-based compressor
Archived-At: <http://www.w3.org/mid/49A31D0F-3388-459E-A1B1-8ECEAACDD31C@mnot.net>
Resent-From: ietf-http-wg@w3.org
X-Mailing-List: <ietf-http-wg@w3.org> archive/latest/17201
X-Loop: ietf-http-wg@w3.org
Resent-Sender: ietf-http-wg-request@w3.org
Precedence: list
List-Id: <ietf-http-wg.w3.org>
List-Help: <http://www.w3.org/Mail/>
List-Post: <mailto:ietf-http-wg@w3.org>
List-Unsubscribe: <mailto:ietf-http-wg-request@w3.org?subject=unsubscribe>

How did you handle connection coalescing (the "streamifier" issue)?

Cheers,


On 05/04/2013, at 10:55 AM, Roberto Peon <grmocg@gmail.com> wrote:

> Here is some data based on an analysis of a perfect (i.e. resource unconstrained) atom-based compressor.
> I've removed any serialization sizes from this-- this will hold true for HeaderDiff or Delta or whatever, presupposing it does atom-based compression.
> 
> Feel free to jump to the bottom for take-aways if looking at the data is boring :)
> 
> The top most compressible headers are (key% is the percentage of the compressible bytes where were from the key):
>                        |compressed
> key-name               |  bytes     key%  val%
> -----------------------+-----------------------
> user-agent             |  2443472  9.37% 90.63%
> cookie                 |  1957141  2.88% 97.12%
> referer                |  1340198 11.89% 88.11%
> via                    |   980712  4.53% 95.47%
> accept-charset         |   700174 32.62% 67.38%
> accept-language        |   685850 48.86% 51.14%
> accept-encoding        |   676516 49.53% 50.47%
> cache-control          |   571590 50.27% 49.73%
> date                   |   544049 16.31% 83.69%
> x-cache-lookup         |   523453 37.32% 62.68%
> content-type           |   521956 50.88% 49.12%
> :host                  |   506321 23.66% 76.34%
> accept                 |   416278 32.65% 67.35%
> x-cache                |   414766 28.26% 71.74%
> last-modified          |   398843 58.89% 41.11%
> proxy-connection       |   389619 63.48% 36.52%
> server                 |   369522 32.27% 67.73%
> :path                  |   337031 35.55% 64.45%
> content-length         |   313985 94.79%  5.21%
> expires                |   305406 35.69% 64.31%
> :method                |   239569 70.02% 29.98%
> :status                |   237574 70.61% 29.39%
> p3p                    |   206589  2.93% 97.07%
> accept-ranges          |   198471 73.02% 26.98%
> content-encoding       |   124912 81.30% 18.70%
> vary                   |   110368 22.38% 77.62%
> age                    |    70211 66.89% 33.11%
> set-cookie             |    66181 23.90% 76.10%
> etag                   |    62416 48.63% 51.37%
> x-powered-by           |    54561 46.49% 53.51%
> x-content-type-options |    47171 77.61% 22.39%
> x-varnish-server       |    33947 51.61% 48.39%
> x-varnish              |    33376 89.04% 10.96%
> x-xss-protection       |    28685 58.73% 41.27%
> x-cdn                  |    27428 23.06% 76.94%
> location               |    26152 19.79% 80.21%
> transfer-encoding      |    25147 72.94% 27.06%
> xcache                 |    24800 18.75% 81.25%
> ...
> 
> The distribution of backreferences is falls off extremely quickly as distance from the newest element increases:
> 1       35460
> 2       20774
> 3       16886
> 4       12926
> 5       9601
> 6       6947
> 7       5100
> 8       4304
> 9       3658
> 10      3313
> 11      2789
> 12      2710
> 13      2678
> 14      2354
> 15      2372
> 16      2180
> 17      2066
> 18      2023
> 19      1979
> 20      1885
> 21      1888
> 22      1799
> 23      1724
> 24      1673
> 25      1584
> 26      1530
> 27      1506
> 28      1435
> 29      1395
> 30      1305
> 31      1355
> 32      1283
> 33      1305
> 34      1341
> 35      1339
> 36      1178
> 37      1200
> 38      1223
> 39      1169
> 40      1180
> 41      1111
> 42      1105
> 43      1074
> 44      1079
> 45      1043
> 46      1050
> 47      1030
> 48      972
> 49      963
> 50      942
> 51      935
> 52      916
> ...
> 
> I've attached a png of a graph of this (y axis=frequency, x-axis=dist from newest element).
> The knee in the graph is very nice indeed.
> 
> 
> Take aways?
> 	• We need a better survey of headers from everywhere :)
> 	• Compression over our corpus should scale favorably with small table (and state) size.
> 	• Encoding index as dist-from-newest really works well, and LRU appears to be extremely effective as an expiration policy (the attached graph looks good).
> 	• We're getting substantial compression from both key and value backreferences/tokenization.
> 	• Algorithmically, there isn't a whole lot to do-- the devil is really in the serialization details and the tradeoffs involved in generating/parsing. There are obvious tweaks that compressors could do when space constrained (e.g. looking at the first table, above, as the likely benefit and making decisions based upon that), but the data which suggests that the LRU is so effective also suggests that this benefit is likely limited unless they can predict the future :)
> 
> -=R
> 
> 

--
Mark Nottingham   http://www.mnot.net/