Compression analysis of perfect atom-based compressor

Roberto Peon <grmocg@gmail.com> Thu, 04 April 2013 23:58 UTC

Return-Path: <ietf-http-wg-request@listhub.w3.org>
X-Original-To: ietfarch-httpbisa-archive-bis2Juki@ietfa.amsl.com
Delivered-To: ietfarch-httpbisa-archive-bis2Juki@ietfa.amsl.com
Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id 4B74021F862A for <ietfarch-httpbisa-archive-bis2Juki@ietfa.amsl.com>; Thu, 4 Apr 2013 16:58:36 -0700 (PDT)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -10.598
X-Spam-Level:
X-Spam-Status: No, score=-10.598 tagged_above=-999 required=5 tests=[BAYES_00=-2.599, HTML_MESSAGE=0.001, RCVD_IN_DNSWL_HI=-8]
Received: from mail.ietf.org ([12.22.58.30]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id HHNani0DGRGf for <ietfarch-httpbisa-archive-bis2Juki@ietfa.amsl.com>; Thu, 4 Apr 2013 16:58:35 -0700 (PDT)
Received: from frink.w3.org (frink.w3.org [128.30.52.56]) by ietfa.amsl.com (Postfix) with ESMTP id A921721F8629 for <httpbisa-archive-bis2Juki@lists.ietf.org>; Thu, 4 Apr 2013 16:58:34 -0700 (PDT)
Received: from lists by frink.w3.org with local (Exim 4.72) (envelope-from <ietf-http-wg-request@listhub.w3.org>) id 1UNu16-0001uy-SW for ietf-http-wg-dist@listhub.w3.org; Thu, 04 Apr 2013 23:56:28 +0000
Resent-Date: Thu, 04 Apr 2013 23:56:28 +0000
Resent-Message-Id: <E1UNu16-0001uy-SW@frink.w3.org>
Received: from lisa.w3.org ([128.30.52.41]) by frink.w3.org with esmtp (Exim 4.72) (envelope-from <grmocg@gmail.com>) id 1UNu0z-0001uI-BL for ietf-http-wg@listhub.w3.org; Thu, 04 Apr 2013 23:56:21 +0000
Received: from mail-ob0-f176.google.com ([209.85.214.176]) by lisa.w3.org with esmtps (TLS1.0:RSA_ARCFOUR_SHA1:16) (Exim 4.72) (envelope-from <grmocg@gmail.com>) id 1UNu0y-0008K1-4a for ietf-http-wg@w3.org; Thu, 04 Apr 2013 23:56:21 +0000
Received: by mail-ob0-f176.google.com with SMTP id er7so3164638obc.35 for <ietf-http-wg@w3.org>; Thu, 04 Apr 2013 16:55:53 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:x-received:date:message-id:subject:from:to :content-type; bh=G2A+P3l2hmHcUPgZjjcSPSs+v9deQ27cu6WyHD8dsbY=; b=PkoB2dEjxaX1MjJZ9Jt4D0K9ELucQwVXFcAcK7aeD7vZe6u706wK6teea+xKY+1yV7 EsBZBkURdmeLk/5AhSdlpXpsIFYFthiq/ApvOBLTt05zHjGxOD5XHtisGKqvV/R2WQPf 3G3Jw/KZ7kmqpHRtjMWegUfuUfd7QDKtF4ZNrU5QZMFIvh0JkzTZL6sQmFoeoqCSV95p kzkO8riPfURu+d4ttLkENTCjvzajHPXntdLFTnvm99tXkG17G9/JXC7JhIWyoOtcNw/X 3rOWyfprkXcXo3MgniiKQdI2HbW6yMLZnhnu/D5oAEwh73Lzw/yP6VkoM9Szi095nPSO l7mA==
MIME-Version: 1.0
X-Received: by 10.182.132.43 with SMTP id or11mr6212870obb.67.1365119753850; Thu, 04 Apr 2013 16:55:53 -0700 (PDT)
Received: by 10.76.109.72 with HTTP; Thu, 4 Apr 2013 16:55:53 -0700 (PDT)
Date: Thu, 04 Apr 2013 16:55:53 -0700
Message-ID: <CAP+FsNew+0ce6q24KRAdut7g4OjU5ysOSxd8Yk-FBmRrx0550w@mail.gmail.com>
From: Roberto Peon <grmocg@gmail.com>
To: HTTP Working Group <ietf-http-wg@w3.org>
Content-Type: multipart/mixed; boundary="14dae93a0c1701170d04d991b668"
Received-SPF: pass client-ip=209.85.214.176; envelope-from=grmocg@gmail.com; helo=mail-ob0-f176.google.com
X-W3C-Hub-Spam-Status: No, score=-3.5
X-W3C-Hub-Spam-Report: AWL=-2.681, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, FREEMAIL_FROM=0.001, HTML_MESSAGE=0.001, RCVD_IN_DNSWL_LOW=-0.7, SPF_PASS=-0.001
X-W3C-Scan-Sig: lisa.w3.org 1UNu0y-0008K1-4a ab3308f85d13f55241f568d0fabef85b
X-Original-To: ietf-http-wg@w3.org
Subject: Compression analysis of perfect atom-based compressor
Archived-At: <http://www.w3.org/mid/CAP+FsNew+0ce6q24KRAdut7g4OjU5ysOSxd8Yk-FBmRrx0550w@mail.gmail.com>
Resent-From: ietf-http-wg@w3.org
X-Mailing-List: <ietf-http-wg@w3.org> archive/latest/17200
X-Loop: ietf-http-wg@w3.org
Resent-Sender: ietf-http-wg-request@w3.org
Precedence: list
List-Id: <ietf-http-wg.w3.org>
List-Help: <http://www.w3.org/Mail/>
List-Post: <mailto:ietf-http-wg@w3.org>
List-Unsubscribe: <mailto:ietf-http-wg-request@w3.org?subject=unsubscribe>

Here is some data based on an analysis of a perfect (i.e. resource
unconstrained) atom-based compressor.
I've removed any serialization sizes from this-- this will hold true for
HeaderDiff or Delta or whatever, presupposing it does atom-based
compression.

Feel free to jump to the bottom for take-aways if looking at the data is
boring :)

The top most compressible headers are (key% is the percentage of the
compressible bytes where were from the key):
                       |compressed
key-name               |  bytes     key%  val%
-----------------------+-----------------------
user-agent             |  2443472  9.37% 90.63%
cookie                 |  1957141  2.88% 97.12%
referer                |  1340198 11.89% 88.11%
via                    |   980712  4.53% 95.47%
accept-charset         |   700174 32.62% 67.38%
accept-language        |   685850 48.86% 51.14%
accept-encoding        |   676516 49.53% 50.47%
cache-control          |   571590 50.27% 49.73%
date                   |   544049 16.31% 83.69%
x-cache-lookup         |   523453 37.32% 62.68%
content-type           |   521956 50.88% 49.12%
:host                  |   506321 23.66% 76.34%
accept                 |   416278 32.65% 67.35%
x-cache                |   414766 28.26% 71.74%
last-modified          |   398843 58.89% 41.11%
proxy-connection       |   389619 63.48% 36.52%
server                 |   369522 32.27% 67.73%
:path                  |   337031 35.55% 64.45%
content-length         |   313985 94.79%  5.21%
expires                |   305406 35.69% 64.31%
:method                |   239569 70.02% 29.98%
:status                |   237574 70.61% 29.39%
p3p                    |   206589  2.93% 97.07%
accept-ranges          |   198471 73.02% 26.98%
content-encoding       |   124912 81.30% 18.70%
vary                   |   110368 22.38% 77.62%
age                    |    70211 66.89% 33.11%
set-cookie             |    66181 23.90% 76.10%
etag                   |    62416 48.63% 51.37%
x-powered-by           |    54561 46.49% 53.51%
x-content-type-options |    47171 77.61% 22.39%
x-varnish-server       |    33947 51.61% 48.39%
x-varnish              |    33376 89.04% 10.96%
x-xss-protection       |    28685 58.73% 41.27%
x-cdn                  |    27428 23.06% 76.94%
location               |    26152 19.79% 80.21%
transfer-encoding      |    25147 72.94% 27.06%
xcache                 |    24800 18.75% 81.25%
...

The distribution of backreferences is falls off extremely quickly as
distance from the newest element increases:
1       35460
2       20774
3       16886
4       12926
5       9601
6       6947
7       5100
8       4304
9       3658
10      3313
11      2789
12      2710
13      2678
14      2354
15      2372
16      2180
17      2066
18      2023
19      1979
20      1885
21      1888
22      1799
23      1724
24      1673
25      1584
26      1530
27      1506
28      1435
29      1395
30      1305
31      1355
32      1283
33      1305
34      1341
35      1339
36      1178
37      1200
38      1223
39      1169
40      1180
41      1111
42      1105
43      1074
44      1079
45      1043
46      1050
47      1030
48      972
49      963
50      942
51      935
52      916
...

I've attached a png of a graph of this (y axis=frequency, x-axis=dist from
newest element).
The knee in the graph is very nice indeed.


Take aways?

   - We need a better survey of headers from everywhere :)
   - Compression over our corpus should scale favorably with small table
   (and state) size.
   - Encoding index as dist-from-newest really works well, and LRU appears
   to be extremely effective as an expiration policy (the attached graph looks
   good).
   - We're getting substantial compression from both key and value
   backreferences/tokenization.
   - Algorithmically, there isn't a whole lot to do-- the devil is really
   in the serialization details and the tradeoffs involved in
   generating/parsing. There are obvious tweaks that compressors could do when
   space constrained (e.g. looking at the first table, above, as the likely
   benefit and making decisions based upon that), but the data which suggests
   that the LRU is so effective also suggests that this benefit is likely
   limited unless they can predict the future :)


-=R