Re: Delta Compression and UTF-8 Header Values

"Martin J. Dürst" <duerst@it.aoyama.ac.jp> Sun, 10 February 2013 04:35 UTC

Return-Path: <ietf-http-wg-request@listhub.w3.org>
X-Original-To: ietfarch-httpbisa-archive-bis2Juki@ietfa.amsl.com
Delivered-To: ietfarch-httpbisa-archive-bis2Juki@ietfa.amsl.com
Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id 194BE21F84B2 for <ietfarch-httpbisa-archive-bis2Juki@ietfa.amsl.com>; Sat, 9 Feb 2013 20:35:00 -0800 (PST)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -9.183
X-Spam-Level:
X-Spam-Status: No, score=-9.183 tagged_above=-999 required=5 tests=[AWL=0.964, BAYES_00=-2.599, MIME_8BIT_HEADER=0.3, RCVD_IN_DNSWL_HI=-8, SARE_SUB_ENC_UTF8=0.152]
Received: from mail.ietf.org ([64.170.98.30]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id K9wjTIGP0WFp for <ietfarch-httpbisa-archive-bis2Juki@ietfa.amsl.com>; Sat, 9 Feb 2013 20:34:59 -0800 (PST)
Received: from frink.w3.org (frink.w3.org [128.30.52.56]) by ietfa.amsl.com (Postfix) with ESMTP id 606F121F8499 for <httpbisa-archive-bis2Juki@lists.ietf.org>; Sat, 9 Feb 2013 20:34:59 -0800 (PST)
Received: from lists by frink.w3.org with local (Exim 4.72) (envelope-from <ietf-http-wg-request@listhub.w3.org>) id 1U4OaY-0004KO-DN for ietf-http-wg-dist@listhub.w3.org; Sun, 10 Feb 2013 04:32:26 +0000
Resent-Date: Sun, 10 Feb 2013 04:32:26 +0000
Resent-Message-Id: <E1U4OaY-0004KO-DN@frink.w3.org>
Received: from lisa.w3.org ([128.30.52.41]) by frink.w3.org with esmtp (Exim 4.72) (envelope-from <duerst@it.aoyama.ac.jp>) id 1U4OaN-0004Ja-EK for ietf-http-wg@listhub.w3.org; Sun, 10 Feb 2013 04:32:15 +0000
Received: from scintmta01.scbb.aoyama.ac.jp ([133.2.253.33]) by lisa.w3.org with esmtps (TLS1.0:DHE_RSA_AES_256_CBC_SHA1:32) (Exim 4.72) (envelope-from <duerst@it.aoyama.ac.jp>) id 1U4OaL-0001U3-PQ for ietf-http-wg@w3.org; Sun, 10 Feb 2013 04:32:15 +0000
Received: from scmse01.scbb.aoyama.ac.jp ([133.2.253.231]) by scintmta01.scbb.aoyama.ac.jp (secret/secret) with SMTP id r1A4Vifb000874 for <ietf-http-wg@w3.org>; Sun, 10 Feb 2013 13:31:44 +0900
Received: from (unknown [133.2.206.133]) by scmse01.scbb.aoyama.ac.jp with smtp id 758b_6644_c61ea476_733a_11e2_a044_001d096c566a; Sun, 10 Feb 2013 13:31:44 +0900
Received: from [IPv6:::1] ([133.2.210.1]:52524) by itmail.it.aoyama.ac.jp with [XMail 1.22 ESMTP Server] id <S16348FC> for <ietf-http-wg@w3.org> from <duerst@it.aoyama.ac.jp>; Sun, 10 Feb 2013 13:31:45 +0900
Message-ID: <511722AB.1040403@it.aoyama.ac.jp>
Date: Sun, 10 Feb 2013 13:31:39 +0900
From: "\"Martin J. Dürst\"" <duerst@it.aoyama.ac.jp>
Organization: Aoyama Gakuin University
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.9) Gecko/20100722 Eudora/3.0.4
MIME-Version: 1.0
To: James M Snell <jasnell@gmail.com>
CC: "ietf-http-wg@w3.org" <ietf-http-wg@w3.org>
References: <CABP7RbfRLXPpL4=wip=FvqD3DM7BM8PXi7uRswHAusXUmPO_xw@mail.gmail.com>
In-Reply-To: <CABP7RbfRLXPpL4=wip=FvqD3DM7BM8PXi7uRswHAusXUmPO_xw@mail.gmail.com>
Content-Type: text/plain; charset="UTF-8"; format="flowed"
Content-Transfer-Encoding: 7bit
Received-SPF: none client-ip=133.2.253.33; envelope-from=duerst@it.aoyama.ac.jp; helo=scintmta01.scbb.aoyama.ac.jp
X-W3C-Hub-Spam-Status: No, score=-4.4
X-W3C-Hub-Spam-Report: AWL=-2.464, BAYES_00=-1.9, RP_MATCHES_RCVD=-0.001
X-W3C-Scan-Sig: lisa.w3.org 1U4OaL-0001U3-PQ ff6a8f15aac84a86052ee6c38fa40a2c
X-Original-To: ietf-http-wg@w3.org
Subject: Re: Delta Compression and UTF-8 Header Values
Archived-At: <http://www.w3.org/mid/511722AB.1040403@it.aoyama.ac.jp>
Resent-From: ietf-http-wg@w3.org
X-Mailing-List: <ietf-http-wg@w3.org> archive/latest/16492
X-Loop: ietf-http-wg@w3.org
Resent-Sender: ietf-http-wg-request@w3.org
Precedence: list
List-Id: <ietf-http-wg.w3.org>
List-Help: <http://www.w3.org/Mail/>
List-Post: <mailto:ietf-http-wg@w3.org>
List-Unsubscribe: <mailto:ietf-http-wg-request@w3.org?subject=unsubscribe>

Hello James, others,

On 2013/02/09 4:28, James M Snell wrote:
> One key challenge with allowing UTF-8 values, however, is that it
> conflicts with the use of the static huffman encoding in the proposed
> Delta Encoding for header compression. If we allow for non-ascii
> characters, the static huffman coding simply becomes too inefficient
> and unmanageable to be useful. There are a few ways around it but none
> of the strategies are all that attractive.

[If somebody has pointers to actual code, that would be appreciated. I 
can't work on it for the next two weeks, but after that, I should be 
able to use a day or two to see what's possible.]

For a static Huffman encoding, you have to decide what symbols you 
accept as input, give every symbol a probability (these have to add up 
to 1) and then you get the 'optimal' "comma-free" encoding using the 
algorithm devised by Huffman. Optimal is under the assumptions that the 
probabilities are correct (and independent) and that you have to use an 
integral number of bits per symbol. Arithmetic coding gets rid of the 
second restriction, to get rid of the first, one creates a more complex 
model. Comma-free just means you don't have to guess where the bits for 
one symbol end and those for the next symbol start.

I'm not sure how the proposals came up with the probabilities, but I 
guess that was based on some data. Even though HTTP headers aren't 
English, I'd assume that 'e' has significantly higher probability than 
'Q' :-).

To make sure this works for UTF-8, the easiest way is to take bytes as 
input symbols. Some byte values with the high bit set can't appear in 
UTF-8, so these can be excluded. I'm not sure what the current proposals 
do, but I guess they also exclude many of the byte values in the C0 
range (0x00-0x1F), or at least give them very low probabilities.

A somewhat more complicated, but more efficient model would take the 
state of the UTF-8 'engine' into account, and have separate Huffman 
encoders for the various states, i.e. first byte of a character, second 
byte of a two-byte character, second byte of a three-byte character, 
third byte of a three-byte character, second/third/fourth byte of a 
four-byte character. Most if not all of the later ones can use the same 
Huffman encoder, and most probably can just assume a flat probability 
distribution and encode six raw bits (i.e. just eliminate the leading 
0b10 in trailing UTF-8 bytes).

If we think that ASCII is way more probable than UTF-8, then the only 
thing that will happen to ASCII is that one of the longest bit 
sequences, the one for the least probable ASCII byte, will get one bit 
longer. Of course, if we think that UTF-8 data is more probable, then 
there will be more changes, but these changes would be duly deserved.

Regards,    Martin.