Re: Delta Compression and UTF-8 Header Values

Roberto Peon <grmocg@gmail.com> Sun, 10 February 2013 07:04 UTC

Return-Path: <ietf-http-wg-request@listhub.w3.org>
X-Original-To: ietfarch-httpbisa-archive-bis2Juki@ietfa.amsl.com
Delivered-To: ietfarch-httpbisa-archive-bis2Juki@ietfa.amsl.com
Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id C6D7621F861B for <ietfarch-httpbisa-archive-bis2Juki@ietfa.amsl.com>; Sat, 9 Feb 2013 23:04:43 -0800 (PST)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -10.456
X-Spam-Level:
X-Spam-Status: No, score=-10.456 tagged_above=-999 required=5 tests=[AWL=-0.010, BAYES_00=-2.599, HTML_MESSAGE=0.001, RCVD_IN_DNSWL_HI=-8, SARE_SUB_ENC_UTF8=0.152]
Received: from mail.ietf.org ([64.170.98.30]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id PXWa91h+UjZp for <ietfarch-httpbisa-archive-bis2Juki@ietfa.amsl.com>; Sat, 9 Feb 2013 23:04:42 -0800 (PST)
Received: from frink.w3.org (frink.w3.org [128.30.52.56]) by ietfa.amsl.com (Postfix) with ESMTP id 930A021F8617 for <httpbisa-archive-bis2Juki@lists.ietf.org>; Sat, 9 Feb 2013 23:04:42 -0800 (PST)
Received: from lists by frink.w3.org with local (Exim 4.72) (envelope-from <ietf-http-wg-request@listhub.w3.org>) id 1U4QwX-0000d8-0X for ietf-http-wg-dist@listhub.w3.org; Sun, 10 Feb 2013 07:03:17 +0000
Resent-Date: Sun, 10 Feb 2013 07:03:17 +0000
Resent-Message-Id: <E1U4QwX-0000d8-0X@frink.w3.org>
Received: from lisa.w3.org ([128.30.52.41]) by frink.w3.org with esmtp (Exim 4.72) (envelope-from <grmocg@gmail.com>) id 1U4QwP-0000bi-Rq for ietf-http-wg@listhub.w3.org; Sun, 10 Feb 2013 07:03:09 +0000
Received: from mail-oa0-f51.google.com ([209.85.219.51]) by lisa.w3.org with esmtps (TLS1.0:RSA_ARCFOUR_SHA1:16) (Exim 4.72) (envelope-from <grmocg@gmail.com>) id 1U4QwP-0004V6-03 for ietf-http-wg@w3.org; Sun, 10 Feb 2013 07:03:09 +0000
Received: by mail-oa0-f51.google.com with SMTP id h2so5340068oag.38 for <ietf-http-wg@w3.org>; Sat, 09 Feb 2013 23:02:42 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:x-received:in-reply-to:references:date:message-id :subject:from:to:cc:content-type; bh=tRWjUD2n2vh0Lt/WJIJsFnSjJPjsbROLe0DYpj61FsI=; b=cOBNNYbu2woqdUtbkcLVLdwshTWFn4jGDSeTlxVyqA+EtcVmbzgTcJHu7rZ8MviE1B jACBo4NOxQAAoD6B5+rLgZqcWldg8wFHz0g0nri3EbyDzyHm+KXBJnDDric3jR2mbNeb edWok2vT5epqrpQhabOepDZkBb8TJ5+awzPwLG7twP4ofr/4RaQXyZ8hfRUu0Y0QJN5O R757OH4NOOhZXs3V4UhnLRh6uHxwpJXVGUW8p8lzMVah3bW0/Zl62w93N0lGA+ZCLUwB bflvS/5qkr5o0bkX0dYejMTkAEUODhfXp0xK69RTUXpNbrgiHdiVOXpbZfiiR3V6RZ1X XmWg==
MIME-Version: 1.0
X-Received: by 10.182.188.101 with SMTP id fz5mr8014296obc.0.1360479762653; Sat, 09 Feb 2013 23:02:42 -0800 (PST)
Received: by 10.76.167.193 with HTTP; Sat, 9 Feb 2013 23:02:42 -0800 (PST)
In-Reply-To: <CABP7RbdBi5-FBnhumaOJ48ioKFSpoTYhaW6V3NdJHxGGJcfkUg@mail.gmail.com>
References: <CABP7RbfRLXPpL4=wip=FvqD3DM7BM8PXi7uRswHAusXUmPO_xw@mail.gmail.com> <511722AB.1040403@it.aoyama.ac.jp> <CAP+FsNc0Owd9eRR91fNEiECb7Kxe=UhZMoKS-DphZ94xYB6KXw@mail.gmail.com> <CABP7RbdBi5-FBnhumaOJ48ioKFSpoTYhaW6V3NdJHxGGJcfkUg@mail.gmail.com>
Date: Sat, 09 Feb 2013 23:02:42 -0800
Message-ID: <CAP+FsNd4WtWw2U_UbCZ0Lf-iMm_ocGP5LRy641CcYPqVb=_-8w@mail.gmail.com>
From: Roberto Peon <grmocg@gmail.com>
To: James M Snell <jasnell@gmail.com>
Cc: HTTP Working Group <ietf-http-wg@w3.org>
Content-Type: multipart/alternative; boundary="f46d04462f8afa239804d5596028"
Received-SPF: pass client-ip=209.85.219.51; envelope-from=grmocg@gmail.com; helo=mail-oa0-f51.google.com
X-W3C-Hub-Spam-Status: No, score=-3.5
X-W3C-Hub-Spam-Report: AWL=-2.695, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, FREEMAIL_FROM=0.001, HTML_MESSAGE=0.001, RCVD_IN_DNSWL_LOW=-0.7, SPF_PASS=-0.001
X-W3C-Scan-Sig: lisa.w3.org 1U4QwP-0004V6-03 c9fb58d62520d1de6df91f01c11b7047
X-Original-To: ietf-http-wg@w3.org
Subject: Re: Delta Compression and UTF-8 Header Values
Archived-At: <http://www.w3.org/mid/CAP+FsNd4WtWw2U_UbCZ0Lf-iMm_ocGP5LRy641CcYPqVb=_-8w@mail.gmail.com>
Resent-From: ietf-http-wg@w3.org
X-Mailing-List: <ietf-http-wg@w3.org> archive/latest/16499
X-Loop: ietf-http-wg@w3.org
Resent-Sender: ietf-http-wg-request@w3.org
Precedence: list
List-Id: <ietf-http-wg.w3.org>
List-Help: <http://www.w3.org/Mail/>
List-Post: <mailto:ietf-http-wg@w3.org>
List-Unsubscribe: <mailto:ietf-http-wg-request@w3.org?subject=unsubscribe>

You'll want to check that the table is full for entries above 127, and that
the EOF terminal is 256 as opposed to 128. I've gone back and forth so many
times on that part... :)
-=R


On Sat, Feb 9, 2013 at 10:51 PM, James M Snell <jasnell@gmail.com> wrote:

> When I initially tested it locally with a range of extended characters it
> blew up on me. I will put together another set of tests and track down the
> problem.
> On Feb 9, 2013 9:18 PM, "Roberto Peon" <grmocg@gmail.com> wrote:
>
>> The code checked in for the delta encoder either already encodes binary
>> values, or can be set to do so easily. The c++ version did-- the "unused"
>> symbols ended up with 32-bit outputs.
>> -=R
>>
>>
>> On Sat, Feb 9, 2013 at 8:31 PM, "Martin J. Dürst" <duerst@it.aoyama.ac.jp
>> > wrote:
>>
>>> Hello James, others,
>>>
>>>
>>> On 2013/02/09 4:28, James M Snell wrote:
>>>
>>>> One key challenge with allowing UTF-8 values, however, is that it
>>>> conflicts with the use of the static huffman encoding in the proposed
>>>> Delta Encoding for header compression. If we allow for non-ascii
>>>> characters, the static huffman coding simply becomes too inefficient
>>>> and unmanageable to be useful. There are a few ways around it but none
>>>> of the strategies are all that attractive.
>>>>
>>>
>>> [If somebody has pointers to actual code, that would be appreciated. I
>>> can't work on it for the next two weeks, but after that, I should be able
>>> to use a day or two to see what's possible.]
>>>
>>> For a static Huffman encoding, you have to decide what symbols you
>>> accept as input, give every symbol a probability (these have to add up to
>>> 1) and then you get the 'optimal' "comma-free" encoding using the algorithm
>>> devised by Huffman. Optimal is under the assumptions that the probabilities
>>> are correct (and independent) and that you have to use an integral number
>>> of bits per symbol. Arithmetic coding gets rid of the second restriction,
>>> to get rid of the first, one creates a more complex model. Comma-free just
>>> means you don't have to guess where the bits for one symbol end and those
>>> for the next symbol start.
>>>
>>> I'm not sure how the proposals came up with the probabilities, but I
>>> guess that was based on some data. Even though HTTP headers aren't English,
>>> I'd assume that 'e' has significantly higher probability than 'Q' :-).
>>>
>>> To make sure this works for UTF-8, the easiest way is to take bytes as
>>> input symbols. Some byte values with the high bit set can't appear in
>>> UTF-8, so these can be excluded. I'm not sure what the current proposals
>>> do, but I guess they also exclude many of the byte values in the C0 range
>>> (0x00-0x1F), or at least give them very low probabilities.
>>>
>>> A somewhat more complicated, but more efficient model would take the
>>> state of the UTF-8 'engine' into account, and have separate Huffman
>>> encoders for the various states, i.e. first byte of a character, second
>>> byte of a two-byte character, second byte of a three-byte character, third
>>> byte of a three-byte character, second/third/fourth byte of a four-byte
>>> character. Most if not all of the later ones can use the same Huffman
>>> encoder, and most probably can just assume a flat probability distribution
>>> and encode six raw bits (i.e. just eliminate the leading 0b10 in trailing
>>> UTF-8 bytes).
>>>
>>> If we think that ASCII is way more probable than UTF-8, then the only
>>> thing that will happen to ASCII is that one of the longest bit sequences,
>>> the one for the least probable ASCII byte, will get one bit longer. Of
>>> course, if we think that UTF-8 data is more probable, then there will be
>>> more changes, but these changes would be duly deserved.
>>>
>>> Regards,    Martin.
>>>
>>>
>>