Re: Header Serialization Discussion

James M Snell <jasnell@gmail.com> Wed, 17 April 2013 02:03 UTC

Return-Path: <ietf-http-wg-request@listhub.w3.org>
X-Original-To: ietfarch-httpbisa-archive-bis2Juki@ietfa.amsl.com
Delivered-To: ietfarch-httpbisa-archive-bis2Juki@ietfa.amsl.com
Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id F069B21F8A74 for <ietfarch-httpbisa-archive-bis2Juki@ietfa.amsl.com>; Tue, 16 Apr 2013 19:03:12 -0700 (PDT)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -9.413
X-Spam-Level:
X-Spam-Status: No, score=-9.413 tagged_above=-999 required=5 tests=[AWL=0.386, BAYES_00=-2.599, RCVD_IN_DNSWL_HI=-8, SARE_BAYES_5x8=0.8]
Received: from mail.ietf.org ([12.22.58.30]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id rYsyQP5K2TxG for <ietfarch-httpbisa-archive-bis2Juki@ietfa.amsl.com>; Tue, 16 Apr 2013 19:03:11 -0700 (PDT)
Received: from frink.w3.org (frink.w3.org [128.30.52.56]) by ietfa.amsl.com (Postfix) with ESMTP id 0F52321F8A1B for <httpbisa-archive-bis2Juki@lists.ietf.org>; Tue, 16 Apr 2013 19:03:10 -0700 (PDT)
Received: from lists by frink.w3.org with local (Exim 4.72) (envelope-from <ietf-http-wg-request@listhub.w3.org>) id 1USHhW-0003PK-A2 for ietf-http-wg-dist@listhub.w3.org; Wed, 17 Apr 2013 02:02:22 +0000
Resent-Date: Wed, 17 Apr 2013 02:02:22 +0000
Resent-Message-Id: <E1USHhW-0003PK-A2@frink.w3.org>
Received: from maggie.w3.org ([128.30.52.39]) by frink.w3.org with esmtp (Exim 4.72) (envelope-from <jasnell@gmail.com>) id 1USHhS-0003Oe-Ae for ietf-http-wg@listhub.w3.org; Wed, 17 Apr 2013 02:02:18 +0000
Received: from mail-ob0-f176.google.com ([209.85.214.176]) by maggie.w3.org with esmtps (TLS1.0:RSA_ARCFOUR_SHA1:16) (Exim 4.72) (envelope-from <jasnell@gmail.com>) id 1USHhQ-0001L3-Dk for ietf-http-wg@w3.org; Wed, 17 Apr 2013 02:02:18 +0000
Received: by mail-ob0-f176.google.com with SMTP id wd20so990167obb.7 for <ietf-http-wg@w3.org>; Tue, 16 Apr 2013 19:01:50 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=x-received:mime-version:in-reply-to:references:from:date:message-id :subject:to:content-type; bh=66j5hY8uUntHqrEPE9XlgtYZ9b+IlPm4tps5bBy1pII=; b=F/GQkdjyni+w4PBA7bAoxzYkWPP5Rj7shk07N9lB2Bat1CX5h3aVHeQmtt0zrutLss 0F8rCKTHSk/3ghI7JAivYFJ2X+35/nLWmeryP5G52YV5umAkqdvkz22U2slpAo5VYaKz 40p9EjqHQEulkecUVkHb93VEY8OmB3NPS9OfvdiLCTrTQR1ArI/k3ZoLTIFalu2MoEUm csNYtPcsp+Q07hPIycS2sTeX0eEI6YmQQwp8gsumF2FzmJSRfmwYvzPSmmmMc8fQnJhe 3EMRsElfVpJuYYwLSsSrwr8uq0WW55weiwAJMHAKctahiyt4E5/pquB9NCh2rvC2Ug4s 2ZXw==
X-Received: by 10.182.166.10 with SMTP id zc10mr1734126obb.80.1366164110376; Tue, 16 Apr 2013 19:01:50 -0700 (PDT)
MIME-Version: 1.0
Received: by 10.60.3.137 with HTTP; Tue, 16 Apr 2013 19:01:30 -0700 (PDT)
In-Reply-To: <CABP7RbfUH=U0hjcmEXKO1jJzy7pPffqFDE4TmAs-ahBX04qwJw@mail.gmail.com>
References: <CABP7RbfUH=U0hjcmEXKO1jJzy7pPffqFDE4TmAs-ahBX04qwJw@mail.gmail.com>
From: James M Snell <jasnell@gmail.com>
Date: Tue, 16 Apr 2013 19:01:30 -0700
Message-ID: <CABP7RbeEkOCkWAHf_WYVi0wNKuFpapL-N-2TT9FL_ZhJzQG85w@mail.gmail.com>
To: "ietf-http-wg@w3.org" <ietf-http-wg@w3.org>
Content-Type: text/plain; charset="UTF-8"
Received-SPF: pass client-ip=209.85.214.176; envelope-from=jasnell@gmail.com; helo=mail-ob0-f176.google.com
X-W3C-Hub-Spam-Status: No, score=-3.4
X-W3C-Hub-Spam-Report: AWL=-2.642, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, FREEMAIL_FROM=0.001, RCVD_IN_DNSWL_LOW=-0.7, SPF_PASS=-0.001
X-W3C-Scan-Sig: maggie.w3.org 1USHhQ-0001L3-Dk 2ed4cb909973f6524ff7108f94decb9d
X-Original-To: ietf-http-wg@w3.org
Subject: Re: Header Serialization Discussion
Archived-At: <http://www.w3.org/mid/CABP7RbeEkOCkWAHf_WYVi0wNKuFpapL-N-2TT9FL_ZhJzQG85w@mail.gmail.com>
Resent-From: ietf-http-wg@w3.org
X-Mailing-List: <ietf-http-wg@w3.org> archive/latest/17276
X-Loop: ietf-http-wg@w3.org
Resent-Sender: ietf-http-wg-request@w3.org
Precedence: list
List-Id: <ietf-http-wg.w3.org>
List-Help: <http://www.w3.org/Mail/>
List-Post: <mailto:ietf-http-wg@w3.org>
List-Unsubscribe: <mailto:ietf-http-wg-request@w3.org?subject=unsubscribe>

Another update... I just checked in preliminary support for UTF-8
support in the static huffman code. The basics are...

Building off of Roberto's dictionary, I added additional symbols to
the static huffman code tree, each representing a possible UTF-8
encoded length. Right now I have symbols for 2, 3, 4, 5 and 6 byte
lengths (yes, I know we don't need all those). For instance, in the
response table I have:

      (257) |110101 [6]
      (258) |110110 [6]
      (259) |01111 [5]
      (260) |10000 [5]
      (261) |10001 [5]

Where 257 is for 2 byte characters, 258 is for 3 byte, etc (I'm going
to be tweaking the tables so that this is better later).

As I iterate through each codepoint in the source text, I check to see
if the codepoint is < 256, if it is, I encode it normally. If the
codepoint > 256, I determine the appropriate escape code from the
table, then write the exact number of bits for the codepoint. So, for
instance, the supplementary character U+10400 is encoded out as [78 41
00 00], consuming a total of 26 bits.

This is purely experimental at this point, but the code is functional
and checked in to github. There are obviously a number of improvements
that could be made and I'm definitely open to more suggestions! One
problem with this, of course, is that it's not very efficient if the
field value consists primarily of extended characters -- that is, this
approach is optimized specifically for the case where we have the
random odd UTF-8 character thrown in to a mostly ASCII string.

One important side effect of this is that we ought to be able to "Just
Use UTF-8" across the board now, without requiring the ISO-8859-1
escape hatch.

- James

On Sat, Apr 13, 2013 at 3:01 PM, James M Snell <jasnell@gmail.com> wrote:
> Ok... so I've implemented HeaderDiff serialization [1] in java [2]...
> just the serialization so far, will do deserialization early next
> week. The implementation is very rough and definitely needs
> improvement / optimization but it's functional enough to start from.
>
> [1] http://tools.ietf.org/html/draft-ruellan-headerdiff-00
> [2] https://github.com/jasnell/http2/tree/master/src/snell/http2/headers/headerdiff
>
> I know that Roberto is working on refinements to his own concept of
> the delta header serialization and hopefully he'll be sharing those
> thoughts soon, but I wanted to get my current thoughts captured for
> discussion.
>
> The HeaderDiff approach is straightforward and effective and makes
> efficient use of available bits. It's variable length integer syntax
> is a bit complicated with the notion of bit-prefix lengths but once
> you've got it it's easy to understand what's going on. The
> implementation is a bit less complicated than delta, which is good and
> I like that there's no "Header Group" notion. Headers are either on or
> off per request. There are several concerns, however.
>
>   - The true utility of the common prefix length mechanism is
> questionable. Aside from the potential security risks, I questioning
> just how effective it's going to be in practice. (What header fields
> do we expect to actually use it in practice?)
>
>   - The fact that items are never removed from the Name Table is
> concerning and poses a potential security risk. The the decompressor
> is forced to maintain a symbol table for every header name it
> encounters, a malicious compressor could cause overruns by sending a
> high number of junk header names. The compressor ought to be able to
> treat the Name Table as an LRU, and ought to be able to place strict
> limits on it's size, just as it does with the Header Table. Delta does
> not have this issue because it's name indices are already tied to the
> LRU.
>
> With HeaderDiff in mind, I'm thinking about how we can bring
> HeaderDiff and Delta together and roll in the additional concepts of
> Typed Codecs. Here's what I have in mind.
>
> Syntax:
>
> The serialization syntax borrows concepts from both HeaderDiff and
> Delta and is divided into two parts: Basic Header Serialization and
> Value Serialization.
>
> As with Delta currently, there are four operation codes (opcodes),
> represented by two-bit identifiers. Each op code has a single-bit
> "ephemeral flag". Operations are serialized into batches of no more
> than 32 instances.
>
>   00 0 00000 { HEADER }
>
> The first two most significant bits...
>
>   00 = Index (what Delta calls "Toggl")
>   01 = Index Range (what Delta calls "Trang")
>   10 = Clone
>   11 = Literal (what Delta calls "KVSto")
>
> The third most significant bit is the ephemeral toggle
>
>   0 = Persist (the header value is to be persisted in the LRU table)
>   1 = Ephemeral (the header value is not to be persisted in the LRU table)
>
> The remaining five bits specify the number of distinct instances in
> this batch. This counter is 0-based, so 0x0 = 1 item in the batch,
> 0x1F = 32 items in the batch.
>
> The ephemeral flag is ignored for opcodes 00 and 01.
>
> Indexed headers are identified by a single byte index value.
>
>   0 0000000
>
> The first most significant bit indicates whether or not the index is
> part of the static header dictionary or lru cache.
>
>   0 = LRU
>   1 = Static
>
> This gives us a strict maximum of 128 storage slots for the LRU.
> That's limited *on purpose* to place strictly finite constraints on
> the amount of state any single compressor can require. This also
> ensures that we are always able to encode indices using only a single
> byte.
>
> As for state management, we would essentially use the Header Diff
> model minus the separate Name Table. That is, there is no persistent
> Header Group in use; just the Header Table. State management depends
> on the specific opcodes used. State management occurs as the headers
> are encountered in the serialized header block.
>
>   00 = Use an indexed header instance from the header table for this request
>   01 = Use multiple sequentially indexed header instances from the
> header table for this request
>   10 = Use a new value for an already existing header name, index
> references an existing header from the table, persist the new value if
> the ephemeral bit is not set.
>   11 = Use a new key and value for this request. Persist the new pair
> if the ephemeral bit is not set.
>
> So, for example, the serialization:
>
>   00000000 10000000 =
>     Use index #0 from the static dictionary
>
>   00000001 10000000 00000001 =
>     Use index #0 from the static dictionary and index #1 from the lru cache
>
>   01000001 10000000 10000001 00000001 00000010 =
>     Use indices #0-#1 from the static dictionary and indices #1-#2
> from the lru cache
>
>   10000000 10000000 {VALUE}
>     Provide a new value for the header name identified by index #0 in
> the static dictionary and persist the value in the lru cache
>
>   10100000 10000000 {VALUE}
>     Provide a new value for the header name identified by index #0 in
> the static dictionary but do not persist the value in the lru cache
>
> Non-ephemeral items are pushed into the next available slot in the
> LRU. Once all available slots have been used, the indices rotate. The
> LRU can also be limited by storage space. As storage space is
> required, the least recently items are dropped. Keeping in mind that
> there are only 128 available storage slots, it will be important for
> compressors to be selective about which fields are persisted.
>
> (note.. the LRU is really a "least recently written" cache.. not
> "least recently used"...)
>
> These values are tracked only for the current message. There is no
> Header Group as currently defined by Delta so there is no need to
> track whether or not any given header is currently on or off. It's
> either referenced for this request or it's not.
>
> If there are more than 32 instances of any single opcode per
> serialized header block (which is unlikely), a single opcode may
> appear multiple times within the serialized block.
>
> As far as Value serialization is concerned... As I have discussed
> previously, I have been looking at support for four distinct value
> types, each represented by their own two-bit identifier. As I will
> explain further in a second, these types are represented using the two
> most significant bits in a byte:
>
>   00 = Text
>   01 = Number
>   10 = Date Time
>   11 = Binary
>
> Text can be either UTF-8 or ISO-8859-1, indicated by a single bit flag
> following the type code. All text strings are prefixed by it's length
> given as an unsigned variant length integer
> Numbers are encoded as unsigned variable length integers.
> Date Times are encoded as Numbers representing the number of seconds
> that have passed since the epoch (GMT).
> Binary data is encoded as raw binary octets prefixed by an unsigned
> variable length integer specifying the number of bytes.
>
> Any single header value instance can consist of up to 32 distinct
> value tokens. The number of tokens is specified using the 5 least
> significant bytes of type type flag.
>
> For instance:
>
>   00000000 00000001 01100001 =
>     A single ISO-8859-1 Text value with length = 1
>
>   00100000 00000001 01100001  =
>     A single UTF-8 Text value with length = 1
>
>   00100001 00000001 01100001 00000001 01100010 =
>     Two UTF-8 Text values, each with length = 1
>
>   01000000 00000001 =
>     A single Number value
>
>   10000000 00000001 =
>     A single Date value (1 second after the epoch)
>
>   11000000 00000001 00000001 =
>     A single Raw Binary value
>
> For ISO-8859-1 Text, the Static Huffman Code used by Delta would be
> used for the value. If we can develop an approach to effectively
> handling Huffman coding for arbitrary UTF-8, then we can apply Huffman
> coding to that as well.
>
> For the Number and DateTime serialization:
>
>   - 16-bit numbers serialize with a strict maximum of 3 bytes
>   - 32-bit numbers serialize with a strict maximum of 5 bytes.
>   - 64-bit numbers serialize with a strict maximum of 10 bytes.
>   - Date Times will serialize with five bytes for the reasonably
> relevant future, then six bytes for quite some time after that. Dates
> prior to the epoch cannot be represented.
>
> In order to properly deal with the backwards compatibility concerns
> for HTTP/1, there are several important rules for use of Typed Codecs
> in HTTP headers:
>
> 1. Headers must be explicitly defined to use the new header types. All
> existing HTTP/1 headers, then, will continue to be required to be
> represented as ISO-8859-1 Text unless their standard definitions are
> updated. The HTTP/2 specification would update the definition of
> specific known headers (e.g. content-length, date, if-modified-since,
> etc).
>
> 2. Extension headers that use the typed codecs will have specific
> normative transformations to ISO-8859-1 defined.
>     a. UTF-8 Text will be converted to ISO-8859-1 with extended
> characters pct-encoded
>     b. Numbers will be converted to their ASCII equivalent values.
>     c. Date Times will be converted to their HTTP-Date equivalent values.
>     d. Binary fields will be Base64-encoded.
>
> 3. There will be no normative transformation from ISO-8859-1 values
> into the typed codecs. Implementations are free to apply
> transformation where those impls determine it is appropriate, but it
> will be perfectly legal for an implementation to pass a text value
> through even if it is known that a given header type has a typed codec
> equivalent (for instance, Content-Length may come through as a number
> or a text value, either will be valid). This means that when
> translating from HTTP/1 -> HTTP/2, receiving implementations need to
> be prepared to handle either value form.
>
> This approach combines what I feel are the best concepts from
> HeaderDiff and Delta and provides simple, straightforward header
> serialization and state management. Obviously, lots of testing is
> required, however.
>
> As always, thoughts/opinions/gripes are appreciated.
>
> - James