Re: bohe and delta experimentation...

Mark Nottingham <mnot@mnot.net> Wed, 16 January 2013 22:30 UTC

Return-Path: <ietf-http-wg-request@listhub.w3.org>
X-Original-To: ietfarch-httpbisa-archive-bis2Juki@ietfa.amsl.com
Delivered-To: ietfarch-httpbisa-archive-bis2Juki@ietfa.amsl.com
Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id 2AD2C11E80A6 for <ietfarch-httpbisa-archive-bis2Juki@ietfa.amsl.com>; Wed, 16 Jan 2013 14:30:30 -0800 (PST)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -8.572
X-Spam-Level:
X-Spam-Status: No, score=-8.572 tagged_above=-999 required=5 tests=[AWL=2.027, BAYES_00=-2.599, RCVD_IN_DNSWL_HI=-8]
Received: from mail.ietf.org ([64.170.98.30]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id LB7LeXY9kGNH for <ietfarch-httpbisa-archive-bis2Juki@ietfa.amsl.com>; Wed, 16 Jan 2013 14:30:29 -0800 (PST)
Received: from frink.w3.org (frink.w3.org [128.30.52.56]) by ietfa.amsl.com (Postfix) with ESMTP id 1BF6E11E809C for <httpbisa-archive-bis2Juki@lists.ietf.org>; Wed, 16 Jan 2013 14:30:29 -0800 (PST)
Received: from lists by frink.w3.org with local (Exim 4.72) (envelope-from <ietf-http-wg-request@listhub.w3.org>) id 1TvbUD-0005Ae-2r for ietf-http-wg-dist@listhub.w3.org; Wed, 16 Jan 2013 22:29:33 +0000
Resent-Date: Wed, 16 Jan 2013 22:29:33 +0000
Resent-Message-Id: <E1TvbUD-0005Ae-2r@frink.w3.org>
Received: from maggie.w3.org ([128.30.52.39]) by frink.w3.org with esmtp (Exim 4.72) (envelope-from <mnot@mnot.net>) id 1TvbU7-00059B-Uo for ietf-http-wg@listhub.w3.org; Wed, 16 Jan 2013 22:29:27 +0000
Received: from mxout-07.mxes.net ([216.86.168.182]) by maggie.w3.org with esmtps (TLS1.0:DHE_RSA_AES_256_CBC_SHA1:32) (Exim 4.72) (envelope-from <mnot@mnot.net>) id 1TvbU7-0008SP-1p for ietf-http-wg@w3.org; Wed, 16 Jan 2013 22:29:27 +0000
Received: from [192.168.1.80] (unknown [118.209.240.13]) (using TLSv1 with cipher AES128-SHA (128/128 bits)) (No client certificate requested) by smtp.mxes.net (Postfix) with ESMTPSA id 23E9222E255; Wed, 16 Jan 2013 17:29:03 -0500 (EST)
Content-Type: text/plain; charset="us-ascii"
Mime-Version: 1.0 (Mac OS X Mail 6.2 \(1499\))
From: Mark Nottingham <mnot@mnot.net>
In-Reply-To: <CABP7RbeNFm3ZHdtDBUJb3idJjFj0q+fxDPzxKZBhSJqXw8zWaQ@mail.gmail.com>
Date: Thu, 17 Jan 2013 09:28:59 +1100
Cc: "ietf-http-wg@w3.org" <ietf-http-wg@w3.org>
Content-Transfer-Encoding: quoted-printable
Message-Id: <2FD0BBE1-59C6-4E49-ACCE-60C1A895FB7D@mnot.net>
References: <CABP7RbeNFm3ZHdtDBUJb3idJjFj0q+fxDPzxKZBhSJqXw8zWaQ@mail.gmail.com>
To: James M Snell <jasnell@gmail.com>
X-Mailer: Apple Mail (2.1499)
Received-SPF: pass client-ip=216.86.168.182; envelope-from=mnot@mnot.net; helo=mxout-07.mxes.net
X-W3C-Hub-Spam-Status: No, score=-3.3
X-W3C-Hub-Spam-Report: AWL=-3.267, SPF_HELO_PASS=-0.001, SPF_PASS=-0.001
X-W3C-Scan-Sig: maggie.w3.org 1TvbU7-0008SP-1p 9395ac207f29135c695f37e8417456f4
X-Original-To: ietf-http-wg@w3.org
Subject: Re: bohe and delta experimentation...
Archived-At: <http://www.w3.org/mid/2FD0BBE1-59C6-4E49-ACCE-60C1A895FB7D@mnot.net>
Resent-From: ietf-http-wg@w3.org
X-Mailing-List: <ietf-http-wg@w3.org> archive/latest/15912
X-Loop: ietf-http-wg@w3.org
Resent-Sender: ietf-http-wg-request@w3.org
Precedence: list
List-Id: <ietf-http-wg.w3.org>
List-Help: <http://www.w3.org/Mail/>
List-Post: <mailto:ietf-http-wg@w3.org>
List-Unsubscribe: <mailto:ietf-http-wg-request@w3.org?subject=unsubscribe>

On 17/01/2013, at 9:07 AM, James M Snell <jasnell@gmail.com> wrote:

> After going a number of scenarios with bohe using a variety of stream-compression scenarios it's painfully obvious that there is really no way around the CRIME issue when using stream-compression. So with that, I'm turning my attention to the use of Roberto's delta encoding and exploring whether or not binary optimized values can make a significant difference (as opposed to simply dropping in huffman-encoded text everywhere).
> 
> I'm starting with dates first...
> 
> Right now, dates in http/1 requests are rather inefficient. The existing date-time format wastes a significant amount of space, albeit across only a relatively few headers. On the plus side, these tend to compress well, but given that the dates change frequently request-to-request, they will be short-lived in the delta context. 
> 
> Given this, I decided to run a test scenario for compressing RFC3999 dates as text vs. using a compact binary encoding. I generated a sample of 100k randomly generated RFC3999 timestamps that variably include milliseconds and timezone offsets, I then used that to generate a date-time specific symbol map and used a static huffman coding. Then, given a sample of 100k more randomly generated timestamps, the average compression was 12-13 bytes for the date value. (average length of the uncompressed timestamp is 24 bytes).. so pretty good compression using a symbol tree specifically optimized for date-times.
> 
> By comparison, I devised a simple binary coding for dates using the following format:
> 
> +-+---+---+-------------------+
> |M|TZH|TZM|   year (16-bit)   |
> +-+---+---+-----+-------------+
> | month (4-bit) | day (5-bit) |
> +---------------+-------------+
> | hour (5-bit)  | minute (6)  |
> +---------------+-------------+
> | second (6 bit)| millis (31) |
> +---------------+-------------+
> |d|tz hrs (5 bit)| tz min (6) |
> +-----------------------------+
> 
> M, TZH and TZM are single bit flags. When M is set, the value includes a 31-bit millisecond field. When TZH is set, it includes timezone offset hours, and when TZM is set, it includes timezone offset minutes. The d field (last row) is a single bit indicating positive or negative timezone offset.
> 
> The minimum possible binary encoding is 6-bytes, which includes the first three flag bits, year, month, day, hour, minute and second. The maximum possible encoding is 11-bytes which includes full timezone offset and milliseconds. Giving an average encoding of 8-bytes over any sample size of randomly generated timestamps.
> 
> While the binary encoding is certainly more efficient, I'm not yet certain if those 4-bytes are worth the effort, but it does improve the overall compression ratio for the message as a whole.

Just for comparison - in the simple encoding, I get it to 8 bytes (at least before year 2038), and it's textual (just the hex for the number of seconds since the epoch).

E.g.,

HTTP/1: Sat, 03 Nov 2012 13:05:03 GMT
Simple: 5095167f


> Either way, regardless of whether we huffman code or binary code the date values, we should require that RFC3339/ISO8601 timestamps be used for all date headers within the http/2 header encoding as those are going to compress much better than the current http/1 date format.
> 
> Entity Tags are another area where binary values may be useful. Currently, ETag values generally tend to be hex or base64 encoded binary data.

That's a big assumption!


> By simply allowing the etag to be dropped in as a set of bytes in the encoded header we can cut the transmitted size of those tags in half. The format I'm considering for these is:
> 
>    +-+------+-----------+
>    |W|len(7)| octets... |
>    +-+------+-----------+
> 
> Where W is a single bit flag indicating weak or not, len is the number of encoded octets for the entity tag. (I'm wondering, tho, whether or not we could get away with dropping the entire concept of a "weak entity tag")

We decided not to drop it in HTTPbis, so assume that it'll stay (given we need to be able to convert 1 to 2 and back).


> By optimizing dates and entity tags this way, we end up with optimized encodings for a good number of commonly used headers (date, last-modified, expires, etag, if-none-match, if-match, if-modified-since, etc), and we can eliminate the need for doing any compression on those values at all.
> 
> Another set of headers we can optimize within delta are the numeric values for Content-Length, :status, Expires, etc. Rather than encoding those as ascii strings, we would simply encode them as their numeric value.
> 
> Will be turning my attention to cookie values next. I'm considering whether or not we should produce a code-tree that is specific to cookie headers and/or allow for purely binary values.

I could imagine setting a parameter on Set-Cookie that indicates its content is encoded in a certain way, which can be replayed as binary data. However, that information would also need to be in Cookie, which I *think* necessitates a new request header -- maybe Bookie?


--
Mark Nottingham   http://www.mnot.net/