Re: Header Serialization Discussion

James M Snell <jasnell@gmail.com> Tue, 16 April 2013 21:02 UTC

Return-Path: <ietf-http-wg-request@listhub.w3.org>
X-Original-To: ietfarch-httpbisa-archive-bis2Juki@ietfa.amsl.com
Delivered-To: ietfarch-httpbisa-archive-bis2Juki@ietfa.amsl.com
Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id 6E13A21F978F for <ietfarch-httpbisa-archive-bis2Juki@ietfa.amsl.com>; Tue, 16 Apr 2013 14:02:16 -0700 (PDT)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -9.349
X-Spam-Level:
X-Spam-Status: No, score=-9.349 tagged_above=-999 required=5 tests=[AWL=0.450, BAYES_00=-2.599, RCVD_IN_DNSWL_HI=-8, SARE_BAYES_5x8=0.8]
Received: from mail.ietf.org ([12.22.58.30]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id zPZwd0Lm1pcg for <ietfarch-httpbisa-archive-bis2Juki@ietfa.amsl.com>; Tue, 16 Apr 2013 14:02:15 -0700 (PDT)
Received: from frink.w3.org (frink.w3.org [128.30.52.56]) by ietfa.amsl.com (Postfix) with ESMTP id 0D9B721F977D for <httpbisa-archive-bis2Juki@lists.ietf.org>; Tue, 16 Apr 2013 14:02:14 -0700 (PDT)
Received: from lists by frink.w3.org with local (Exim 4.72) (envelope-from <ietf-http-wg-request@listhub.w3.org>) id 1USCzy-0006bg-Nn for ietf-http-wg-dist@listhub.w3.org; Tue, 16 Apr 2013 21:01:06 +0000
Resent-Date: Tue, 16 Apr 2013 21:01:06 +0000
Resent-Message-Id: <E1USCzy-0006bg-Nn@frink.w3.org>
Received: from maggie.w3.org ([128.30.52.39]) by frink.w3.org with esmtp (Exim 4.72) (envelope-from <jasnell@gmail.com>) id 1USCzu-0006ai-KV for ietf-http-wg@listhub.w3.org; Tue, 16 Apr 2013 21:01:02 +0000
Received: from mail-ve0-f173.google.com ([209.85.128.173]) by maggie.w3.org with esmtps (TLS1.0:RSA_ARCFOUR_SHA1:16) (Exim 4.72) (envelope-from <jasnell@gmail.com>) id 1USCzs-0000D6-Sq for ietf-http-wg@w3.org; Tue, 16 Apr 2013 21:01:02 +0000
Received: by mail-ve0-f173.google.com with SMTP id ox1so855140veb.18 for <ietf-http-wg@w3.org>; Tue, 16 Apr 2013 14:00:35 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=x-received:mime-version:in-reply-to:references:from:date:message-id :subject:to:content-type; bh=MZHlsAzIKGTLyn3Dgi8M5EP4WEQ66LJeRhkGWfkCurg=; b=FckYSz+I+az368HAHrjcZSU998R+L9L+A/DnqHWWtV7YYGNHChZrtL6LpwBRsamFyx mgB2Whrn1Vckcv+826/5NdPHNTwLlkSKAV8P5sQ/0rlzAwTHPL4O+BDgYK+kkM8M6iIi bQ7XUHbpUMhp0dCMjg4GsWPVPParcQZNqD98iDm2vuuN502v35pf6cUZsuHQQ9fniqBB lTkzAWU6xUJYky6McvdA2eMci879RueF/6kBm9XXdNtesYgWR05bZKa6QUoIJ8rAYBQQ PbbwAtdM6KR/kkDCh8eRTB8Cs1j+8G/0UiRvlXRLtFBPxVkSIXUqBYEikkMxG9YM6jPd Jcsg==
X-Received: by 10.52.97.131 with SMTP id ea3mr2394729vdb.71.1366146035218; Tue, 16 Apr 2013 14:00:35 -0700 (PDT)
MIME-Version: 1.0
Received: by 10.58.143.11 with HTTP; Tue, 16 Apr 2013 14:00:15 -0700 (PDT)
In-Reply-To: <CABP7RbfUH=U0hjcmEXKO1jJzy7pPffqFDE4TmAs-ahBX04qwJw@mail.gmail.com>
References: <CABP7RbfUH=U0hjcmEXKO1jJzy7pPffqFDE4TmAs-ahBX04qwJw@mail.gmail.com>
From: James M Snell <jasnell@gmail.com>
Date: Tue, 16 Apr 2013 14:00:15 -0700
Message-ID: <CABP7RbdMwbORi+-LkLKUM6zAWDkkdKngZWrUj72=OFQK9sxuGg@mail.gmail.com>
To: "ietf-http-wg@w3.org" <ietf-http-wg@w3.org>
Content-Type: text/plain; charset="UTF-8"
Received-SPF: pass client-ip=209.85.128.173; envelope-from=jasnell@gmail.com; helo=mail-ve0-f173.google.com
X-W3C-Hub-Spam-Status: No, score=-3.5
X-W3C-Hub-Spam-Report: AWL=-2.660, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, FREEMAIL_FROM=0.001, RCVD_IN_DNSWL_LOW=-0.7, SPF_PASS=-0.001
X-W3C-Scan-Sig: maggie.w3.org 1USCzs-0000D6-Sq bd32b1823bbed1cd7ef3764dfb8e55fa
X-Original-To: ietf-http-wg@w3.org
Subject: Re: Header Serialization Discussion
Archived-At: <http://www.w3.org/mid/CABP7RbdMwbORi+-LkLKUM6zAWDkkdKngZWrUj72=OFQK9sxuGg@mail.gmail.com>
Resent-From: ietf-http-wg@w3.org
X-Mailing-List: <ietf-http-wg@w3.org> archive/latest/17271
X-Loop: ietf-http-wg@w3.org
Resent-Sender: ietf-http-wg-request@w3.org
Precedence: list
List-Id: <ietf-http-wg.w3.org>
List-Help: <http://www.w3.org/Mail/>
List-Post: <mailto:ietf-http-wg@w3.org>
List-Unsubscribe: <mailto:ietf-http-wg-request@w3.org?subject=unsubscribe>

Ok... I have a working implementation of the revised compression and
typed codecs I proposed. It is checked into github here [1]

[1] https://github.com/jasnell/http2/tree/master/src/snell/http2/headers/dhe

After pulling, you can test it with the compression_test suite using
the included dhe_compression_test.sh script

./compare_compressors.py -c
fork="{path_to_repo}/dhe_compression_test.sh"
../http_samples/mnot/amazon.com.har

Here are some sample numbers using the amazon set...

* TOTAL: 366 req messages
               size  time | ratio min   max   std
     http1  200,876  0.00 | 1.00  1.00  1.00  0.00
       dhe   51,427  0.02 | 0.26  0.00  0.71  0.16
    delta2   35,850  0.40 | 0.18  0.03  0.67  0.12
headerdiff   38,673  0.23 | 0.19  0.03  0.88  0.16

* TOTAL: 366 res messages
               size  time | ratio min   max   std
     http1  160,435  0.03 | 1.00  1.00  1.00  0.00
       dhe   51,487  0.05 | 0.32  0.01  0.80  0.16
    delta2   42,764  0.42 | 0.27  0.02  0.65  0.10
headerdiff   46,321  0.62 | 0.29  0.04  0.84  0.13

Using the yahoo test set, the numbers are...

* TOTAL: 142 req messages
               size  time | ratio min   max   std
     http1  107,164  0.00 | 1.00  1.00  1.00  0.00
       dhe   45,208  0.01 | 0.42  0.00  1.14  0.25
    delta2   46,529  0.27 | 0.43  0.05  0.79  0.22
headerdiff   52,739  0.05 | 0.49  0.04  0.94  0.26

* TOTAL: 142 res messages
               size  time | ratio min   max   std
     http1   59,718  0.00 | 1.00  1.00  1.00  0.00
       dhe   14,180  0.05 | 0.24  0.00  0.69  0.17
    delta2   16,148  0.17 | 0.27  0.02  0.69  0.17
headerdiff   18,599  0.02 | 0.31  0.04  0.85  0.23

- James

On Sat, Apr 13, 2013 at 3:01 PM, James M Snell <jasnell@gmail.com> wrote:
> Ok... so I've implemented HeaderDiff serialization [1] in java [2]...
> just the serialization so far, will do deserialization early next
> week. The implementation is very rough and definitely needs
> improvement / optimization but it's functional enough to start from.
>
> [1] http://tools.ietf.org/html/draft-ruellan-headerdiff-00
> [2] https://github.com/jasnell/http2/tree/master/src/snell/http2/headers/headerdiff
>
> I know that Roberto is working on refinements to his own concept of
> the delta header serialization and hopefully he'll be sharing those
> thoughts soon, but I wanted to get my current thoughts captured for
> discussion.
>
> The HeaderDiff approach is straightforward and effective and makes
> efficient use of available bits. It's variable length integer syntax
> is a bit complicated with the notion of bit-prefix lengths but once
> you've got it it's easy to understand what's going on. The
> implementation is a bit less complicated than delta, which is good and
> I like that there's no "Header Group" notion. Headers are either on or
> off per request. There are several concerns, however.
>
>   - The true utility of the common prefix length mechanism is
> questionable. Aside from the potential security risks, I questioning
> just how effective it's going to be in practice. (What header fields
> do we expect to actually use it in practice?)
>
>   - The fact that items are never removed from the Name Table is
> concerning and poses a potential security risk. The the decompressor
> is forced to maintain a symbol table for every header name it
> encounters, a malicious compressor could cause overruns by sending a
> high number of junk header names. The compressor ought to be able to
> treat the Name Table as an LRU, and ought to be able to place strict
> limits on it's size, just as it does with the Header Table. Delta does
> not have this issue because it's name indices are already tied to the
> LRU.
>
> With HeaderDiff in mind, I'm thinking about how we can bring
> HeaderDiff and Delta together and roll in the additional concepts of
> Typed Codecs. Here's what I have in mind.
>
> Syntax:
>
> The serialization syntax borrows concepts from both HeaderDiff and
> Delta and is divided into two parts: Basic Header Serialization and
> Value Serialization.
>
> As with Delta currently, there are four operation codes (opcodes),
> represented by two-bit identifiers. Each op code has a single-bit
> "ephemeral flag". Operations are serialized into batches of no more
> than 32 instances.
>
>   00 0 00000 { HEADER }
>
> The first two most significant bits...
>
>   00 = Index (what Delta calls "Toggl")
>   01 = Index Range (what Delta calls "Trang")
>   10 = Clone
>   11 = Literal (what Delta calls "KVSto")
>
> The third most significant bit is the ephemeral toggle
>
>   0 = Persist (the header value is to be persisted in the LRU table)
>   1 = Ephemeral (the header value is not to be persisted in the LRU table)
>
> The remaining five bits specify the number of distinct instances in
> this batch. This counter is 0-based, so 0x0 = 1 item in the batch,
> 0x1F = 32 items in the batch.
>
> The ephemeral flag is ignored for opcodes 00 and 01.
>
> Indexed headers are identified by a single byte index value.
>
>   0 0000000
>
> The first most significant bit indicates whether or not the index is
> part of the static header dictionary or lru cache.
>
>   0 = LRU
>   1 = Static
>
> This gives us a strict maximum of 128 storage slots for the LRU.
> That's limited *on purpose* to place strictly finite constraints on
> the amount of state any single compressor can require. This also
> ensures that we are always able to encode indices using only a single
> byte.
>
> As for state management, we would essentially use the Header Diff
> model minus the separate Name Table. That is, there is no persistent
> Header Group in use; just the Header Table. State management depends
> on the specific opcodes used. State management occurs as the headers
> are encountered in the serialized header block.
>
>   00 = Use an indexed header instance from the header table for this request
>   01 = Use multiple sequentially indexed header instances from the
> header table for this request
>   10 = Use a new value for an already existing header name, index
> references an existing header from the table, persist the new value if
> the ephemeral bit is not set.
>   11 = Use a new key and value for this request. Persist the new pair
> if the ephemeral bit is not set.
>
> So, for example, the serialization:
>
>   00000000 10000000 =
>     Use index #0 from the static dictionary
>
>   00000001 10000000 00000001 =
>     Use index #0 from the static dictionary and index #1 from the lru cache
>
>   01000001 10000000 10000001 00000001 00000010 =
>     Use indices #0-#1 from the static dictionary and indices #1-#2
> from the lru cache
>
>   10000000 10000000 {VALUE}
>     Provide a new value for the header name identified by index #0 in
> the static dictionary and persist the value in the lru cache
>
>   10100000 10000000 {VALUE}
>     Provide a new value for the header name identified by index #0 in
> the static dictionary but do not persist the value in the lru cache
>
> Non-ephemeral items are pushed into the next available slot in the
> LRU. Once all available slots have been used, the indices rotate. The
> LRU can also be limited by storage space. As storage space is
> required, the least recently items are dropped. Keeping in mind that
> there are only 128 available storage slots, it will be important for
> compressors to be selective about which fields are persisted.
>
> (note.. the LRU is really a "least recently written" cache.. not
> "least recently used"...)
>
> These values are tracked only for the current message. There is no
> Header Group as currently defined by Delta so there is no need to
> track whether or not any given header is currently on or off. It's
> either referenced for this request or it's not.
>
> If there are more than 32 instances of any single opcode per
> serialized header block (which is unlikely), a single opcode may
> appear multiple times within the serialized block.
>
> As far as Value serialization is concerned... As I have discussed
> previously, I have been looking at support for four distinct value
> types, each represented by their own two-bit identifier. As I will
> explain further in a second, these types are represented using the two
> most significant bits in a byte:
>
>   00 = Text
>   01 = Number
>   10 = Date Time
>   11 = Binary
>
> Text can be either UTF-8 or ISO-8859-1, indicated by a single bit flag
> following the type code. All text strings are prefixed by it's length
> given as an unsigned variant length integer
> Numbers are encoded as unsigned variable length integers.
> Date Times are encoded as Numbers representing the number of seconds
> that have passed since the epoch (GMT).
> Binary data is encoded as raw binary octets prefixed by an unsigned
> variable length integer specifying the number of bytes.
>
> Any single header value instance can consist of up to 32 distinct
> value tokens. The number of tokens is specified using the 5 least
> significant bytes of type type flag.
>
> For instance:
>
>   00000000 00000001 01100001 =
>     A single ISO-8859-1 Text value with length = 1
>
>   00100000 00000001 01100001  =
>     A single UTF-8 Text value with length = 1
>
>   00100001 00000001 01100001 00000001 01100010 =
>     Two UTF-8 Text values, each with length = 1
>
>   01000000 00000001 =
>     A single Number value
>
>   10000000 00000001 =
>     A single Date value (1 second after the epoch)
>
>   11000000 00000001 00000001 =
>     A single Raw Binary value
>
> For ISO-8859-1 Text, the Static Huffman Code used by Delta would be
> used for the value. If we can develop an approach to effectively
> handling Huffman coding for arbitrary UTF-8, then we can apply Huffman
> coding to that as well.
>
> For the Number and DateTime serialization:
>
>   - 16-bit numbers serialize with a strict maximum of 3 bytes
>   - 32-bit numbers serialize with a strict maximum of 5 bytes.
>   - 64-bit numbers serialize with a strict maximum of 10 bytes.
>   - Date Times will serialize with five bytes for the reasonably
> relevant future, then six bytes for quite some time after that. Dates
> prior to the epoch cannot be represented.
>
> In order to properly deal with the backwards compatibility concerns
> for HTTP/1, there are several important rules for use of Typed Codecs
> in HTTP headers:
>
> 1. Headers must be explicitly defined to use the new header types. All
> existing HTTP/1 headers, then, will continue to be required to be
> represented as ISO-8859-1 Text unless their standard definitions are
> updated. The HTTP/2 specification would update the definition of
> specific known headers (e.g. content-length, date, if-modified-since,
> etc).
>
> 2. Extension headers that use the typed codecs will have specific
> normative transformations to ISO-8859-1 defined.
>     a. UTF-8 Text will be converted to ISO-8859-1 with extended
> characters pct-encoded
>     b. Numbers will be converted to their ASCII equivalent values.
>     c. Date Times will be converted to their HTTP-Date equivalent values.
>     d. Binary fields will be Base64-encoded.
>
> 3. There will be no normative transformation from ISO-8859-1 values
> into the typed codecs. Implementations are free to apply
> transformation where those impls determine it is appropriate, but it
> will be perfectly legal for an implementation to pass a text value
> through even if it is known that a given header type has a typed codec
> equivalent (for instance, Content-Length may come through as a number
> or a text value, either will be valid). This means that when
> translating from HTTP/1 -> HTTP/2, receiving implementations need to
> be prepared to handle either value form.
>
> This approach combines what I feel are the best concepts from
> HeaderDiff and Delta and provides simple, straightforward header
> serialization and state management. Obviously, lots of testing is
> required, however.
>
> As always, thoughts/opinions/gripes are appreciated.
>
> - James