bohe and delta experimentation...

James M Snell <jasnell@gmail.com> Wed, 16 January 2013 22:10 UTC

After going through a number of scenarios with bohe using a variety of
stream-compression approaches, it's painfully obvious that there is really no
way around the CRIME issue when using stream-compression. So with that, I'm
turning my attention to the use of Roberto's delta encoding and exploring
whether or not binary optimized values can make a significant difference
(as opposed to simply dropping in huffman-encoded text everywhere).

I'm starting with dates first...

Right now, dates in http/1 requests are rather inefficient. The existing
date-time format wastes a significant amount of space, albeit across only
relatively few headers. On the plus side, these tend to compress well, but
given that the dates change frequently from request to request, they will be
short-lived in the delta context.

Given this, I decided to run a test scenario for compressing RFC3339 dates
as text vs. using a compact binary encoding. I generated a sample of 100k
random RFC3339 timestamps that variably include milliseconds and timezone
offsets. I then used that sample to generate a date-time specific symbol map
and applied a static huffman coding. Then, given a sample of 100k more
randomly generated timestamps, the average compressed size was 12-13 bytes
for the date value (the average length of the uncompressed timestamp is 24
bytes). So, pretty good compression using a symbol tree specifically
optimized for date-times.
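
A scaled-down sketch of that kind of experiment is below. It is not the
original test code: the timestamp generator, the 50/50 odds for milliseconds
and offsets, the sample size and the function names are all assumptions made
for illustration, so the measured average will not necessarily match the
12-13 bytes reported above.

```python
import heapq
import random
from collections import Counter
from datetime import datetime, timedelta


def random_timestamp(rng):
    """Hypothetical generator: RFC 3339 text, optionally with millis/offset."""
    moment = datetime(1970, 1, 1) + timedelta(seconds=rng.randrange(2**31))
    text = moment.strftime("%Y-%m-%dT%H:%M:%S")
    if rng.random() < 0.5:  # assumed odds, not from the original test
        text += ".%03d" % rng.randrange(1000)
    if rng.random() < 0.5:
        text += "%s%02d:%02d" % (rng.choice("+-"), rng.randrange(13),
                                 rng.choice((0, 30)))
    else:
        text += "Z"
    return text


def huffman_code_lengths(frequencies):
    """Return {symbol: code length in bits} for a static Huffman code."""
    # Heap entries are (weight, unique tie-breaker, {symbol: depth so far}).
    heap = [(weight, index, {symbol: 0})
            for index, (symbol, weight) in enumerate(sorted(frequencies.items()))]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: d + 1 for s, d in left.items()}
        merged.update({s: d + 1 for s, d in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


def average_encoded_bytes(samples, lengths):
    """Average size per timestamp, each value padded to whole bytes."""
    sizes = [(sum(lengths[ch] for ch in s) + 7) // 8 for s in samples]
    return sum(sizes) / len(sizes)


rng = random.Random(2013)
training = [random_timestamp(rng) for _ in range(2000)]
held_out = [random_timestamp(rng) for _ in range(2000)]
lengths = huffman_code_lengths(Counter("".join(training)))

print("average text bytes:   ", sum(map(len, held_out)) / len(held_out))
print("average huffman bytes:", average_encoded_bytes(held_out, lengths))
```

The symbol map is built from one sample and applied to a second one, which
mirrors the train-then-measure setup described above.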

By comparison, I devised a simple binary coding for dates using the
following format:

+-+---+---+-------------------+
|M|TZH|TZM|   year (16-bit)   |
+-+---+---+-----+-------------+
| month (4-bit) | day (5-bit) |
+---------------+-------------+
| hour (5-bit)  | minute (6)  |
+---------------+-------------+
| second (6 bit)| millis (31) |
+---------------+-------------+
|d|tz hrs (5 bit)| tz min (6) |
+-----------------------------+

M, TZH and TZM are single bit flags. When M is set, the value includes a
31-bit millisecond field. When TZH is set, it includes timezone offset
hours, and when TZM is set, it includes timezone offset minutes. The d
field (last row) is a single bit indicating positive or negative timezone
offset.

The minimum possible binary encoding is 6 bytes, which includes the first
three flag bits, year, month, day, hour, minute and second. The maximum
possible encoding is 11 bytes, which includes the full timezone offset and
milliseconds. That gives an average encoding of 8 bytes across randomly
generated timestamps, regardless of sample size.
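
To make the size arithmetic concrete, here is a minimal packing sketch of the
layout above. Several details are assumptions rather than part of the format
as described: fields are written most-significant-bit first in diagram order,
the d bit is present whenever either timezone flag is set, and the result is
zero-padded to a byte boundary. The function name and signature are
hypothetical.

```python
def encode_date(year, month, day, hour, minute, second,
                millis=None, tz_sign=1, tz_hours=None, tz_minutes=None):
    """Pack a date-time into the bit layout above (sketch, see assumptions)."""
    fields = [  # (value, width in bits)
        (millis is not None, 1),      # M flag
        (tz_hours is not None, 1),    # TZH flag
        (tz_minutes is not None, 1),  # TZM flag
        (year, 16), (month, 4), (day, 5),
        (hour, 5), (minute, 6), (second, 6),
    ]
    if millis is not None:
        fields.append((millis, 31))
    if tz_hours is not None or tz_minutes is not None:
        fields.append((tz_sign < 0, 1))  # d bit: 1 means negative (assumed)
    if tz_hours is not None:
        fields.append((tz_hours, 5))
    if tz_minutes is not None:
        fields.append((tz_minutes, 6))

    bits, total = 0, 0
    for value, width in fields:
        value = int(value)
        if not 0 <= value < (1 << width):
            raise ValueError("value %d does not fit in %d bits" % (value, width))
        bits = (bits << width) | value
        total += width
    padding = -total % 8  # zero bits needed to reach a byte boundary
    return (bits << padding).to_bytes((total + padding) // 8, "big")


shortest = encode_date(2013, 1, 16, 22, 10, 0)
longest = encode_date(2013, 1, 16, 14, 7, 23, millis=852,
                      tz_sign=-1, tz_hours=8, tz_minutes=0)
print(len(shortest), len(longest))
```

The mandatory fields come to 45 bits, which rounds up to 6 bytes; adding the
31-bit milliseconds field and the 12 bits of timezone data gives 88 bits, or
exactly 11 bytes.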

While the binary encoding is certainly more efficient, I'm not yet certain
whether the 4 or so bytes saved per date (8 vs. 12-13 for the huffman-coded
text) are worth the effort, but it does improve the overall compression
ratio for the message as a whole.

Either way, regardless of whether we huffman code or binary code the date
values, we should require that RFC3339/ISO8601 timestamps be used for all
date headers within the http/2 header encoding as those are going to
compress much better than the current http/1 date format.

Entity Tags are another area where binary values may be useful. Currently,
ETag values generally tend to be hex or base64 encoded binary data. By
simply allowing the etag to be dropped in as a set of bytes in the encoded
header we can cut the transmitted size of hex-encoded tags in half (and of
base64-encoded tags by about a quarter). The format I'm considering for
these is:

   +-+------+-----------+
   |W|len(7)| octets... |
   +-+------+-----------+

Where W is a single bit flag indicating weak or not, and len is the number of
encoded octets for the entity tag. (I'm wondering, tho, whether or not we
could get away with dropping the entire concept of a "weak entity tag".)
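
A minimal sketch of that entity tag format follows. It assumes W is the
high-order bit of the first octet and len occupies the remaining 7 bits, which
caps a tag at 127 octets; the function names and the sample tag value are
hypothetical.

```python
def encode_etag(octets, weak=False):
    """Prefix raw entity-tag octets with a W flag and a 7-bit length."""
    if len(octets) > 127:
        raise ValueError("entity tag longer than a 7-bit length allows")
    return bytes([(0x80 if weak else 0x00) | len(octets)]) + octets


def decode_etag(data):
    """Return (weak, octets) from an encoded entity tag."""
    weak = bool(data[0] & 0x80)
    length = data[0] & 0x7F
    return weak, data[1:1 + length]


hex_tag = "686897696a7c876b7e"            # made-up hex-encoded ETag value
encoded = encode_etag(bytes.fromhex(hex_tag), weak=True)
print(len(hex_tag), "->", len(encoded))   # text octets vs. binary octets
```

The one-octet prefix is the only overhead, so a hex-encoded tag of n
characters becomes n/2 + 1 octets.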

By optimizing dates and entity tags this way, we end up with optimized
encodings for a good number of commonly used headers (date, last-modified,
expires, etag, if-none-match, if-match, if-modified-since, etc), and we can
eliminate the need for doing any compression on those values at all.

Another set of headers we can optimize within delta are the numeric values
for Content-Length, :status, Expires, etc. Rather than encoding those as
ascii strings, we would simply encode them as their numeric value.
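
No specific integer format is proposed above, so the following is only one
possible reading: a minimal-length big-endian unsigned integer, with the
octet count assumed to be carried by the surrounding header encoding. The
function names are hypothetical.

```python
def encode_number(value):
    """Encode a non-negative integer in as few big-endian octets as possible."""
    if value < 0:
        raise ValueError("only non-negative values are supported")
    return value.to_bytes(max(1, (value.bit_length() + 7) // 8), "big")


def decode_number(data):
    """Inverse of encode_number."""
    return int.from_bytes(data, "big")


for name, value in (("content-length", 1048576), (":status", 200)):
    print(name, len(str(value)), "ascii octets ->",
          len(encode_number(value)), "binary octets")
```

Under this reading a Content-Length of 1048576 drops from 7 ASCII octets to 3
binary octets, and a :status of 200 fits in a single octet.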

Will be turning my attention to cookie values next. I'm considering whether
or not we should produce a code-tree that is specific to cookie headers
and/or allow for purely binary values.

- James