Compression analysis of perfect atom-based compressor
Roberto Peon <grmocg@gmail.com> Thu, 04 April 2013 23:58 UTC
Return-Path: <ietf-http-wg-request@listhub.w3.org>
X-Original-To: ietfarch-httpbisa-archive-bis2Juki@ietfa.amsl.com
Delivered-To: ietfarch-httpbisa-archive-bis2Juki@ietfa.amsl.com
Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id 4B74021F862A for <ietfarch-httpbisa-archive-bis2Juki@ietfa.amsl.com>; Thu, 4 Apr 2013 16:58:36 -0700 (PDT)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -10.598
X-Spam-Level:
X-Spam-Status: No, score=-10.598 tagged_above=-999 required=5 tests=[BAYES_00=-2.599, HTML_MESSAGE=0.001, RCVD_IN_DNSWL_HI=-8]
Received: from mail.ietf.org ([12.22.58.30]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id HHNani0DGRGf for <ietfarch-httpbisa-archive-bis2Juki@ietfa.amsl.com>; Thu, 4 Apr 2013 16:58:35 -0700 (PDT)
Received: from frink.w3.org (frink.w3.org [128.30.52.56]) by ietfa.amsl.com (Postfix) with ESMTP id A921721F8629 for <httpbisa-archive-bis2Juki@lists.ietf.org>; Thu, 4 Apr 2013 16:58:34 -0700 (PDT)
Received: from lists by frink.w3.org with local (Exim 4.72) (envelope-from <ietf-http-wg-request@listhub.w3.org>) id 1UNu16-0001uy-SW for ietf-http-wg-dist@listhub.w3.org; Thu, 04 Apr 2013 23:56:28 +0000
Resent-Date: Thu, 04 Apr 2013 23:56:28 +0000
Resent-Message-Id: <E1UNu16-0001uy-SW@frink.w3.org>
Received: from lisa.w3.org ([128.30.52.41]) by frink.w3.org with esmtp (Exim 4.72) (envelope-from <grmocg@gmail.com>) id 1UNu0z-0001uI-BL for ietf-http-wg@listhub.w3.org; Thu, 04 Apr 2013 23:56:21 +0000
Received: from mail-ob0-f176.google.com ([209.85.214.176]) by lisa.w3.org with esmtps (TLS1.0:RSA_ARCFOUR_SHA1:16) (Exim 4.72) (envelope-from <grmocg@gmail.com>) id 1UNu0y-0008K1-4a for ietf-http-wg@w3.org; Thu, 04 Apr 2013 23:56:21 +0000
Received: by mail-ob0-f176.google.com with SMTP id er7so3164638obc.35 for <ietf-http-wg@w3.org>; Thu, 04 Apr 2013 16:55:53 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:x-received:date:message-id:subject:from:to :content-type; bh=G2A+P3l2hmHcUPgZjjcSPSs+v9deQ27cu6WyHD8dsbY=; b=PkoB2dEjxaX1MjJZ9Jt4D0K9ELucQwVXFcAcK7aeD7vZe6u706wK6teea+xKY+1yV7 EsBZBkURdmeLk/5AhSdlpXpsIFYFthiq/ApvOBLTt05zHjGxOD5XHtisGKqvV/R2WQPf 3G3Jw/KZ7kmqpHRtjMWegUfuUfd7QDKtF4ZNrU5QZMFIvh0JkzTZL6sQmFoeoqCSV95p kzkO8riPfURu+d4ttLkENTCjvzajHPXntdLFTnvm99tXkG17G9/JXC7JhIWyoOtcNw/X 3rOWyfprkXcXo3MgniiKQdI2HbW6yMLZnhnu/D5oAEwh73Lzw/yP6VkoM9Szi095nPSO l7mA==
MIME-Version: 1.0
X-Received: by 10.182.132.43 with SMTP id or11mr6212870obb.67.1365119753850; Thu, 04 Apr 2013 16:55:53 -0700 (PDT)
Received: by 10.76.109.72 with HTTP; Thu, 4 Apr 2013 16:55:53 -0700 (PDT)
Date: Thu, 04 Apr 2013 16:55:53 -0700
Message-ID: <CAP+FsNew+0ce6q24KRAdut7g4OjU5ysOSxd8Yk-FBmRrx0550w@mail.gmail.com>
From: Roberto Peon <grmocg@gmail.com>
To: HTTP Working Group <ietf-http-wg@w3.org>
Content-Type: multipart/mixed; boundary="14dae93a0c1701170d04d991b668"
Received-SPF: pass client-ip=209.85.214.176; envelope-from=grmocg@gmail.com; helo=mail-ob0-f176.google.com
X-W3C-Hub-Spam-Status: No, score=-3.5
X-W3C-Hub-Spam-Report: AWL=-2.681, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, FREEMAIL_FROM=0.001, HTML_MESSAGE=0.001, RCVD_IN_DNSWL_LOW=-0.7, SPF_PASS=-0.001
X-W3C-Scan-Sig: lisa.w3.org 1UNu0y-0008K1-4a ab3308f85d13f55241f568d0fabef85b
X-Original-To: ietf-http-wg@w3.org
Subject: Compression analysis of perfect atom-based compressor
Archived-At: <http://www.w3.org/mid/CAP+FsNew+0ce6q24KRAdut7g4OjU5ysOSxd8Yk-FBmRrx0550w@mail.gmail.com>
Resent-From: ietf-http-wg@w3.org
X-Mailing-List: <ietf-http-wg@w3.org> archive/latest/17200
X-Loop: ietf-http-wg@w3.org
Resent-Sender: ietf-http-wg-request@w3.org
Precedence: list
List-Id: <ietf-http-wg.w3.org>
List-Help: <http://www.w3.org/Mail/>
List-Post: <mailto:ietf-http-wg@w3.org>
List-Unsubscribe: <mailto:ietf-http-wg-request@w3.org?subject=unsubscribe>
Here is some data based on an analysis of a perfect (i.e. resource unconstrained) atom-based compressor. I've removed any serialization sizes from this-- this will hold true for HeaderDiff or Delta or whatever, presupposing it does atom-based compression. Feel free to jump to the bottom for take-aways if looking at the data is boring :) The top most compressible headers are (key% is the percentage of the compressible bytes where were from the key): |compressed key-name | bytes key% val% -----------------------+----------------------- user-agent | 2443472 9.37% 90.63% cookie | 1957141 2.88% 97.12% referer | 1340198 11.89% 88.11% via | 980712 4.53% 95.47% accept-charset | 700174 32.62% 67.38% accept-language | 685850 48.86% 51.14% accept-encoding | 676516 49.53% 50.47% cache-control | 571590 50.27% 49.73% date | 544049 16.31% 83.69% x-cache-lookup | 523453 37.32% 62.68% content-type | 521956 50.88% 49.12% :host | 506321 23.66% 76.34% accept | 416278 32.65% 67.35% x-cache | 414766 28.26% 71.74% last-modified | 398843 58.89% 41.11% proxy-connection | 389619 63.48% 36.52% server | 369522 32.27% 67.73% :path | 337031 35.55% 64.45% content-length | 313985 94.79% 5.21% expires | 305406 35.69% 64.31% :method | 239569 70.02% 29.98% :status | 237574 70.61% 29.39% p3p | 206589 2.93% 97.07% accept-ranges | 198471 73.02% 26.98% content-encoding | 124912 81.30% 18.70% vary | 110368 22.38% 77.62% age | 70211 66.89% 33.11% set-cookie | 66181 23.90% 76.10% etag | 62416 48.63% 51.37% x-powered-by | 54561 46.49% 53.51% x-content-type-options | 47171 77.61% 22.39% x-varnish-server | 33947 51.61% 48.39% x-varnish | 33376 89.04% 10.96% x-xss-protection | 28685 58.73% 41.27% x-cdn | 27428 23.06% 76.94% location | 26152 19.79% 80.21% transfer-encoding | 25147 72.94% 27.06% xcache | 24800 18.75% 81.25% ... The distribution of backreferences is falls off extremely quickly as distance from the newest element increases: 1 35460 2 20774 3 16886 4 12926 5 9601 6 6947 7 5100 8 4304 9 3658 10 3313 11 2789 12 2710 13 2678 14 2354 15 2372 16 2180 17 2066 18 2023 19 1979 20 1885 21 1888 22 1799 23 1724 24 1673 25 1584 26 1530 27 1506 28 1435 29 1395 30 1305 31 1355 32 1283 33 1305 34 1341 35 1339 36 1178 37 1200 38 1223 39 1169 40 1180 41 1111 42 1105 43 1074 44 1079 45 1043 46 1050 47 1030 48 972 49 963 50 942 51 935 52 916 ... I've attached a png of a graph of this (y axis=frequency, x-axis=dist from newest element). The knee in the graph is very nice indeed. Take aways? - We need a better survey of headers from everywhere :) - Compression over our corpus should scale favorably with small table (and state) size. - Encoding index as dist-from-newest really works well, and LRU appears to be extremely effective as an expiration policy (the attached graph looks good). - We're getting substantial compression from both key and value backreferences/tokenization. - Algorithmically, there isn't a whole lot to do-- the devil is really in the serialization details and the tradeoffs involved in generating/parsing. There are obvious tweaks that compressors could do when space constrained (e.g. looking at the first table, above, as the likely benefit and making decisions based upon that), but the data which suggests that the LRU is so effective also suggests that this benefit is likely limited unless they can predict the future :) -=R
- Compression analysis of perfect atom-based compre… Roberto Peon
- Re: Compression analysis of perfect atom-based co… Mark Nottingham
- Re: Compression analysis of perfect atom-based co… Roberto Peon
- RE: Compression analysis of perfect atom-based co… RUELLAN Herve
- Re: Compression analysis of perfect atom-based co… Martin Nilsson
- Re: Compression analysis of perfect atom-based co… Mark Nottingham