Re: bohe and delta experimentation...

Frédéric Kayser <f.kayser@free.fr> Fri, 18 January 2013 08:14 UTC

Return-Path: <ietf-http-wg-request@listhub.w3.org>
X-Original-To: ietfarch-httpbisa-archive-bis2Juki@ietfa.amsl.com
Delivered-To: ietfarch-httpbisa-archive-bis2Juki@ietfa.amsl.com
Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id 3C0B621F8955 for <ietfarch-httpbisa-archive-bis2Juki@ietfa.amsl.com>; Fri, 18 Jan 2013 00:14:00 -0800 (PST)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -10.849
X-Spam-Level:
X-Spam-Status: No, score=-10.849 tagged_above=-999 required=5 tests=[AWL=1.100, BAYES_00=-2.599, GB_I_LETTER=-2, HELO_EQ_FR=0.35, MIME_8BIT_HEADER=0.3, RCVD_IN_DNSWL_HI=-8]
Received: from mail.ietf.org ([64.170.98.30]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id OdjCgKKR5SVV for <ietfarch-httpbisa-archive-bis2Juki@ietfa.amsl.com>; Fri, 18 Jan 2013 00:13:59 -0800 (PST)
Received: from frink.w3.org (frink.w3.org [128.30.52.56]) by ietfa.amsl.com (Postfix) with ESMTP id 80C1821F894E for <httpbisa-archive-bis2Juki@lists.ietf.org>; Fri, 18 Jan 2013 00:13:59 -0800 (PST)
Received: from lists by frink.w3.org with local (Exim 4.72) (envelope-from <ietf-http-wg-request@listhub.w3.org>) id 1Tw73l-0004DA-I0 for ietf-http-wg-dist@listhub.w3.org; Fri, 18 Jan 2013 08:12:21 +0000
Resent-Date: Fri, 18 Jan 2013 08:12:21 +0000
Resent-Message-Id: <E1Tw73l-0004DA-I0@frink.w3.org>
Received: from maggie.w3.org ([128.30.52.39]) by frink.w3.org with esmtp (Exim 4.72) (envelope-from <f.kayser@free.fr>) id 1Tw73i-0004CV-3o for ietf-http-wg@listhub.w3.org; Fri, 18 Jan 2013 08:12:18 +0000
Received: from smtp5-g21.free.fr ([212.27.42.5]) by maggie.w3.org with esmtp (Exim 4.72) (envelope-from <f.kayser@free.fr>) id 1Tw73g-0001A8-R6 for ietf-http-wg@w3.org; Fri, 18 Jan 2013 08:12:18 +0000
Received: from [192.168.0.1] (unknown [81.56.127.176]) by smtp5-g21.free.fr (Postfix) with ESMTP id ED1FDD480C9; Fri, 18 Jan 2013 09:11:49 +0100 (CET)
Mime-Version: 1.0 (Apple Message framework v1084)
Content-Type: text/plain; charset="iso-8859-1"
From: Frédéric Kayser <f.kayser@free.fr>
In-Reply-To: <CAK3OfOj=k+eG6-p54NpivLUz2ER_e1Oh=gt33uxgHJvgJioZdw@mail.gmail.com>
Date: Fri, 18 Jan 2013 09:11:48 +0100
Cc: Roberto Peon <grmocg@gmail.com>
Content-Transfer-Encoding: quoted-printable
Message-Id: <4E80B9B4-1B3D-4452-A1B1-D7C1959C2755@free.fr>
References: <CAP+FsNcMRHV241N9z9aVPmqqipXQ5pV_+GHs3J+tUU+pRZ7SPw@mail.gmail.com> <1189053143.221044883.1358453785232.JavaMail.root@zimbra71-e12.priv.proxad.net> <CAK3OfOj=k+eG6-p54NpivLUz2ER_e1Oh=gt33uxgHJvgJioZdw@mail.gmail.com>
To: ietf-http-wg@w3.org, Nico Williams <nico@cryptonector.com>
X-Mailer: Apple Mail (2.1084)
Received-SPF: none client-ip=212.27.42.5; envelope-from=f.kayser@free.fr; helo=smtp5-g21.free.fr
X-W3C-Hub-Spam-Status: No, score=-4.3
X-W3C-Hub-Spam-Report: AWL=-2.429, BAYES_00=-1.9, FREEMAIL_FROM=0.001, RCVD_IN_DNSWL_NONE=-0.0001
X-W3C-Scan-Sig: maggie.w3.org 1Tw73g-0001A8-R6 e72e44b14b0727a54bb133a7e2f76802
X-Original-To: ietf-http-wg@w3.org
Subject: Re: bohe and delta experimentation...
Archived-At: <http://www.w3.org/mid/4E80B9B4-1B3D-4452-A1B1-D7C1959C2755@free.fr>
Resent-From: ietf-http-wg@w3.org
X-Mailing-List: <ietf-http-wg@w3.org> archive/latest/15987
X-Loop: ietf-http-wg@w3.org
Resent-Sender: ietf-http-wg-request@w3.org
Precedence: list
List-Id: <ietf-http-wg.w3.org>
List-Help: <http://www.w3.org/Mail/>
List-Post: <mailto:ietf-http-wg@w3.org>
List-Unsubscribe: <mailto:ietf-http-wg-request@w3.org?subject=unsubscribe>

Well, it's basic information theory, the header in its text form can be seen as an information source (with it's own alphabet and own probability distribution), binary data is a different one (bigger alphabet) with a different probability distribution (for instance with a bias toward zero if references to common strings are numbered starting from zero).

If you treat these two different information sources as a single source (since they end up in the same file -header in this case-) you'll have to cope with a bigger alphabet and a probability distribution that never really fits either of the sources (this can be seen like two players speaking different languages trying to play Scrabble forming words in their mother tongue but the game set is in fact designed for a third language or rather something like Esperanto, it will work but only up to a point and with a limited gaming experience).

For better compression both sources should be handled separately and this could be done with BOHE since fields are being tagged as text or binary (we could even imagine handling it as 3 sources:  text fields, binary fields and BOHE structure), but for relatively small data samples (hopefully headers are smaller than 4 kB) providing 2 or 3 Huffman tables could prove itself inefficient because of the overhead involved.
Dynamic Huffman could be a solution but it implies more computation and may take to much time to reflect the underlying probability distribution (slow start).
Another way would be to mimic one information source characteristics with the other one, this works if data coming from one source outnumbers the other one (for instance 2 or 3 times more text than binary data) in this case binary data could be transformed to look like text (limited to 7-bits, avoid C0 control codes) it works even better if a bias in one source can be remapped in the other one (i.e. binary values near zero become lowercase letters).

If the compression method cannot use LZ77/LZW/BWT (to hinder CRIME attacks) we should pay special attention to data modeling and the entropy coders to achieve interesting gains.

Frédéric

Le 17 janv. 2013 à 21:50, Nico Williams a écrit :

> On Thu, Jan 17, 2013 at 2:16 PM, Frédéric Kayser <f.kayser@free.fr> wrote:
>> I find BOHE interesting but I don't like the idea of mixing what is basically ASCII text (7 bits values with most of the C0 control codes never used) and binary data (full 8 bits wide values, the way it's used in BOHE it also means a lot of values in the low end -near zero-).
> 
> This is silly: we already have text and binary mixed all over the
> place, such as in DNS, PKIX, Kerberos, SSHv2, and so on, and that's
> never been a problem.  And how else should we encode textual values in
> binary protocols anyways?  (I mean, besides using a Unicode encoding
> rather than ASCII.)
> 
>> An entropy coder will have to cope with a larger alphabet than just ASCII since it is "polluted" by binary data (if it appears infrequently it may end up taking more than 8-bits).
> 
> Well, yes, so what?  We already apply generic compression algorithms
> to so much mixed text and binary.  Clearly information theory does not
> say it's impossible.
> 
> Nico
> --
>