Re: Delta Compression and UTF-8 Header Values
"Martin J. Dürst" <duerst@it.aoyama.ac.jp> Sun, 10 February 2013 04:35 UTC
Return-Path: <ietf-http-wg-request@listhub.w3.org>
X-Original-To: ietfarch-httpbisa-archive-bis2Juki@ietfa.amsl.com
Delivered-To: ietfarch-httpbisa-archive-bis2Juki@ietfa.amsl.com
Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id 194BE21F84B2 for <ietfarch-httpbisa-archive-bis2Juki@ietfa.amsl.com>; Sat, 9 Feb 2013 20:35:00 -0800 (PST)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -9.183
X-Spam-Level:
X-Spam-Status: No, score=-9.183 tagged_above=-999 required=5 tests=[AWL=0.964, BAYES_00=-2.599, MIME_8BIT_HEADER=0.3, RCVD_IN_DNSWL_HI=-8, SARE_SUB_ENC_UTF8=0.152]
Received: from mail.ietf.org ([64.170.98.30]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id K9wjTIGP0WFp for <ietfarch-httpbisa-archive-bis2Juki@ietfa.amsl.com>; Sat, 9 Feb 2013 20:34:59 -0800 (PST)
Received: from frink.w3.org (frink.w3.org [128.30.52.56]) by ietfa.amsl.com (Postfix) with ESMTP id 606F121F8499 for <httpbisa-archive-bis2Juki@lists.ietf.org>; Sat, 9 Feb 2013 20:34:59 -0800 (PST)
Received: from lists by frink.w3.org with local (Exim 4.72) (envelope-from <ietf-http-wg-request@listhub.w3.org>) id 1U4OaY-0004KO-DN for ietf-http-wg-dist@listhub.w3.org; Sun, 10 Feb 2013 04:32:26 +0000
Resent-Date: Sun, 10 Feb 2013 04:32:26 +0000
Resent-Message-Id: <E1U4OaY-0004KO-DN@frink.w3.org>
Received: from lisa.w3.org ([128.30.52.41]) by frink.w3.org with esmtp (Exim 4.72) (envelope-from <duerst@it.aoyama.ac.jp>) id 1U4OaN-0004Ja-EK for ietf-http-wg@listhub.w3.org; Sun, 10 Feb 2013 04:32:15 +0000
Received: from scintmta01.scbb.aoyama.ac.jp ([133.2.253.33]) by lisa.w3.org with esmtps (TLS1.0:DHE_RSA_AES_256_CBC_SHA1:32) (Exim 4.72) (envelope-from <duerst@it.aoyama.ac.jp>) id 1U4OaL-0001U3-PQ for ietf-http-wg@w3.org; Sun, 10 Feb 2013 04:32:15 +0000
Received: from scmse01.scbb.aoyama.ac.jp ([133.2.253.231]) by scintmta01.scbb.aoyama.ac.jp (secret/secret) with SMTP id r1A4Vifb000874 for <ietf-http-wg@w3.org>; Sun, 10 Feb 2013 13:31:44 +0900
Received: from (unknown [133.2.206.133]) by scmse01.scbb.aoyama.ac.jp with smtp id 758b_6644_c61ea476_733a_11e2_a044_001d096c566a; Sun, 10 Feb 2013 13:31:44 +0900
Received: from [IPv6:::1] ([133.2.210.1]:52524) by itmail.it.aoyama.ac.jp with [XMail 1.22 ESMTP Server] id <S16348FC> for <ietf-http-wg@w3.org> from <duerst@it.aoyama.ac.jp>; Sun, 10 Feb 2013 13:31:45 +0900
Message-ID: <511722AB.1040403@it.aoyama.ac.jp>
Date: Sun, 10 Feb 2013 13:31:39 +0900
From: "\"Martin J. Dürst\"" <duerst@it.aoyama.ac.jp>
Organization: Aoyama Gakuin University
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.9) Gecko/20100722 Eudora/3.0.4
MIME-Version: 1.0
To: James M Snell <jasnell@gmail.com>
CC: "ietf-http-wg@w3.org" <ietf-http-wg@w3.org>
References: <CABP7RbfRLXPpL4=wip=FvqD3DM7BM8PXi7uRswHAusXUmPO_xw@mail.gmail.com>
In-Reply-To: <CABP7RbfRLXPpL4=wip=FvqD3DM7BM8PXi7uRswHAusXUmPO_xw@mail.gmail.com>
Content-Type: text/plain; charset="UTF-8"; format="flowed"
Content-Transfer-Encoding: 7bit
Received-SPF: none client-ip=133.2.253.33; envelope-from=duerst@it.aoyama.ac.jp; helo=scintmta01.scbb.aoyama.ac.jp
X-W3C-Hub-Spam-Status: No, score=-4.4
X-W3C-Hub-Spam-Report: AWL=-2.464, BAYES_00=-1.9, RP_MATCHES_RCVD=-0.001
X-W3C-Scan-Sig: lisa.w3.org 1U4OaL-0001U3-PQ ff6a8f15aac84a86052ee6c38fa40a2c
X-Original-To: ietf-http-wg@w3.org
Subject: Re: Delta Compression and UTF-8 Header Values
Archived-At: <http://www.w3.org/mid/511722AB.1040403@it.aoyama.ac.jp>
Resent-From: ietf-http-wg@w3.org
X-Mailing-List: <ietf-http-wg@w3.org> archive/latest/16492
X-Loop: ietf-http-wg@w3.org
Resent-Sender: ietf-http-wg-request@w3.org
Precedence: list
List-Id: <ietf-http-wg.w3.org>
List-Help: <http://www.w3.org/Mail/>
List-Post: <mailto:ietf-http-wg@w3.org>
List-Unsubscribe: <mailto:ietf-http-wg-request@w3.org?subject=unsubscribe>
Hello James, others, On 2013/02/09 4:28, James M Snell wrote: > One key challenge with allowing UTF-8 values, however, is that it > conflicts with the use of the static huffman encoding in the proposed > Delta Encoding for header compression. If we allow for non-ascii > characters, the static huffman coding simply becomes too inefficient > and unmanageable to be useful. There are a few ways around it but none > of the strategies are all that attractive. [If somebody has pointers to actual code, that would be appreciated. I can't work on it for the next two weeks, but after that, I should be able to use a day or two to see what's possible.] For a static Huffman encoding, you have to decide what symbols you accept as input, give every symbol a probability (these have to add up to 1) and then you get the 'optimal' "comma-free" encoding using the algorithm devised by Huffman. Optimal is under the assumptions that the probabilities are correct (and independent) and that you have to use an integral number of bits per symbol. Arithmetic coding gets rid of the second restriction, to get rid of the first, one creates a more complex model. Comma-free just means you don't have to guess where the bits for one symbol end and those for the next symbol start. I'm not sure how the proposals came up with the probabilities, but I guess that was based on some data. Even though HTTP headers aren't English, I'd assume that 'e' has significantly higher probability than 'Q' :-). To make sure this works for UTF-8, the easiest way is to take bytes as input symbols. Some byte values with the high bit set can't appear in UTF-8, so these can be excluded. I'm not sure what the current proposals do, but I guess they also exclude many of the byte values in the C0 range (0x00-0x1F), or at least give them very low probabilities. A somewhat more complicated, but more efficient model would take the state of the UTF-8 'engine' into account, and have separate Huffman encoders for the various states, i.e. first byte of a character, second byte of a two-byte character, second byte of a three-byte character, third byte of a three-byte character, second/third/fourth byte of a four-byte character. Most if not all of the later ones can use the same Huffman encoder, and most probably can just assume a flat probability distribution and encode six raw bits (i.e. just eliminate the leading 0b10 in trailing UTF-8 bytes). If we think that ASCII is way more probable than UTF-8, then the only thing that will happen to ASCII is that one of the longest bit sequences, the one for the least probable ASCII byte, will get one bit longer. Of course, if we think that UTF-8 data is more probable, then there will be more changes, but these changes would be duly deserved. Regards, Martin.
- Re: Delta Compression and UTF-8 Header Values Mark Nottingham
- Re: Delta Compression and UTF-8 Header Values James M Snell
- Re: Delta Compression and UTF-8 Header Values Adrien W. de Croy
- Delta Compression and UTF-8 Header Values James M Snell
- Re: Delta Compression and UTF-8 Header Values Poul-Henning Kamp
- Re: Delta Compression and UTF-8 Header Values James M Snell
- Re: Delta Compression and UTF-8 Header Values Poul-Henning Kamp
- Re: Delta Compression and UTF-8 Header Values Roberto Peon
- Re: Delta Compression and UTF-8 Header Values James M Snell
- Re: Delta Compression and UTF-8 Header Values Bjoern Hoehrmann
- Re: Delta Compression and UTF-8 Header Values Martin J. Dürst
- Re: Delta Compression and UTF-8 Header Values Martin J. Dürst
- Re: Delta Compression and UTF-8 Header Values Poul-Henning Kamp
- Re: Delta Compression and UTF-8 Header Values Willy Tarreau
- Re: Delta Compression and UTF-8 Header Values Poul-Henning Kamp
- Re: Delta Compression and UTF-8 Header Values Martin Nilsson
- Re: Delta Compression and UTF-8 Header Values Martin Nilsson
- Re: Delta Compression and UTF-8 Header Values Albert Lunde
- Re: Delta Compression and UTF-8 Header Values Willy Tarreau
- Re: Delta Compression and UTF-8 Header Values Willy Tarreau
- Re: Delta Compression and UTF-8 Header Values Nico Williams
- Re: Delta Compression and UTF-8 Header Values Poul-Henning Kamp
- Re: Delta Compression and UTF-8 Header Values Adrien W. de Croy
- Re: Delta Compression and UTF-8 Header Values Poul-Henning Kamp
- Re: Delta Compression and UTF-8 Header Values Martin J. Dürst
- Re: Delta Compression and UTF-8 Header Values Martin J. Dürst
- Re: Delta Compression and UTF-8 Header Values Martin J. Dürst
- Re: Delta Compression and UTF-8 Header Values Roberto Peon
- Re: Delta Compression and UTF-8 Header Values Frédéric Kayser
- Re: Delta Compression and UTF-8 Header Values James M Snell
- Re: Delta Compression and UTF-8 Header Values Frédéric Kayser
- Re: Delta Compression and UTF-8 Header Values Roberto Peon
- Re: Delta Compression and UTF-8 Header Values Willy Tarreau
- Re: Delta Compression and UTF-8 Header Values James M Snell
- Re: Delta Compression and UTF-8 Header Values Frédéric Kayser
- Re: Delta Compression and UTF-8 Header Values Roberto Peon
- Re: Delta Compression and UTF-8 Header Values Nico Williams
- Re: Delta Compression and UTF-8 Header Values Roberto Peon
- Re: Delta Compression and UTF-8 Header Values Poul-Henning Kamp
- Re: Delta Compression and UTF-8 Header Values Poul-Henning Kamp
- Re: Delta Compression and UTF-8 Header Values Julian Reschke
- Re: Delta Compression and UTF-8 Header Values Julian Reschke
- Re: Delta Compression and UTF-8 Header Values Julian Reschke
- Re: Delta Compression and UTF-8 Header Values Willy Tarreau
- Re: Delta Compression and UTF-8 Header Values Poul-Henning Kamp
- Re: Delta Compression and UTF-8 Header Values Poul-Henning Kamp
- Re: Delta Compression and UTF-8 Header Values Willy Tarreau
- Re: Delta Compression and UTF-8 Header Values Poul-Henning Kamp
- Re: Delta Compression and UTF-8 Header Values Mark Nottingham
- Re: Delta Compression and UTF-8 Header Values Roberto Peon
- Re: Delta Compression and UTF-8 Header Values Zhong Yu
- Re: Delta Compression and UTF-8 Header Values Poul-Henning Kamp
- Re: Delta Compression and UTF-8 Header Values Zhong Yu
- Re: Delta Compression and UTF-8 Header Values Zhong Yu
- Re: Delta Compression and UTF-8 Header Values Zhong Yu
- Re: Delta Compression and UTF-8 Header Values Nico Williams
- Re: Delta Compression and UTF-8 Header Values Nico Williams
- Re: Delta Compression and UTF-8 Header Values Poul-Henning Kamp
- Re: Delta Compression and UTF-8 Header Values Nico Williams
- Re: Delta Compression and UTF-8 Header Values Nico Williams
- Re: Delta Compression and UTF-8 Header Values Phillip Hallam-Baker
- Re: Delta Compression and UTF-8 Header Values James Cloos
- Re: Delta Compression and UTF-8 Header Values Roberto Peon
- Re: Delta Compression and UTF-8 Header Values James Cloos
- Re: Delta Compression and UTF-8 Header Values Roberto Peon