Re: Delta Compression and UTF-8 Header Values

Nico Williams <nico@cryptonector.com> Sun, 10 February 2013 08:19 UTC

In-Reply-To: <511722AB.1040403@it.aoyama.ac.jp>
References: <CABP7RbfRLXPpL4=wip=FvqD3DM7BM8PXi7uRswHAusXUmPO_xw@mail.gmail.com> <511722AB.1040403@it.aoyama.ac.jp>
Date: Sun, 10 Feb 2013 02:17:04 -0600
Message-ID: <CAK3OfOhFFHymH1x7t7bAnTEzE34PyWO1moOC5p3opC4qcHzA2Q@mail.gmail.com>
From: Nico Williams <nico@cryptonector.com>
To: "Martin J. Dürst" <duerst@it.aoyama.ac.jp>
Cc: James M Snell <jasnell@gmail.com>, "ietf-http-wg@w3.org" <ietf-http-wg@w3.org>
Archived-At: <http://www.w3.org/mid/CAK3OfOhFFHymH1x7t7bAnTEzE34PyWO1moOC5p3opC4qcHzA2Q@mail.gmail.com>

On Saturday, February 9, 2013, "Martin J. Dürst" wrote:

> Hello James, others,
>
> On 2013/02/09 4:28, James M Snell wrote:
>
>> One key challenge with allowing UTF-8 values, however, is that it
>> conflicts with the use of the static huffman encoding in the proposed
>> Delta Encoding for header compression. If we allow for non-ascii
>> characters, the static huffman coding simply becomes too inefficient
>> and unmanageable to be useful. There are a few ways around it but none
>> of the strategies are all that attractive.
>
>
Wait, what?  If you have non-English (worse, non-European) text in some
ASCII-only encoding such as Punycode or base64-encoded UTF-8, then static
Huffman will not be useful for compression anyway (assuming the Huffman
code is based on, say, English letter frequencies).
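
A rough sketch of that point, under stated assumptions: the training text,
the sample strings, and the add-one smoothing over printable ASCII are all
made up for illustration and are not taken from the proposed Delta Encoding.
It builds one static Huffman table from English-like header text and then
compares the cost of plain English against base64-encoded UTF-8 Japanese.

```python
import base64
import heapq
from collections import Counter
from itertools import count


def huffman_code_lengths(freqs):
    """Map each symbol to its Huffman code length in bits."""
    tie = count()  # unique tie-breaker so the heap never compares dicts
    heap = [(w, next(tie), {sym: 0}) for sym, w in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {sym: depth + 1 for sym, depth in {**left, **right}.items()}
        heapq.heappush(heap, (w1 + w2, next(tie), merged))
    return heap[0][2]


# Hypothetical training text standing in for "English-like" header values.
TRAINING = (
    b"the quick brown fox jumps over the lazy dog; "
    b"text/html, application/json; charset=utf-8; "
    b"keep-alive, no-cache, max-age=3600, gzip, deflate, "
    b"accept-language: en-us, en; content-type: text/plain"
)

# Static table over printable ASCII; every symbol starts at count 1 so
# that characters unseen in training still get a (long) code.
freqs = Counter({byte: 1 for byte in range(0x20, 0x7F)})
freqs.update(TRAINING)
LENGTHS = huffman_code_lengths(freqs)


def bits_per_char(data):
    """Average encoded size, in bits per input byte, under the static table."""
    return sum(LENGTHS[b] for b in data) / len(data)


english = b"please cache this response for an hour"
japanese = "こんにちは世界".encode("utf-8")
b64 = base64.b64encode(japanese)

print(f"English text:           {bits_per_char(english):.2f} bits/char")
print(f"base64(UTF-8 Japanese): {bits_per_char(b64):.2f} bits/char")
```

The base64 form should come out noticeably more expensive per character,
and that is before counting base64's own 4/3 expansion of the underlying
UTF-8 bytes.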


> [If somebody has pointers to actual code, that would be appreciated. I
> can't work on it for the next two weeks, but after that, I should be able
> to use a day or two to see what's possible.]
>
> For a static Huffman encoding, you have to decide what symbols you accept
> as input, give every symbol a probability (these have to add up to 1) and
> then you get the 'optimal' "comma-free" encoding using the algorithm
> devised by Huffman. Optimal is under the assumptions that the probabilities
> are correct (and independent) and that you have to use an integral number
> of bits per symbol. Arithmetic coding gets rid of the second restriction,
> to get rid of the first, one creates a more complex model. Comma-free just
> means you don't have to guess where the bits for one symbol end and those
> for the next symbol start.


Right.  I hope I put it more succinctly above.
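
To make the quoted "integral number of bits per symbol" restriction
concrete, here is the standard bound relating Huffman's expected code length
to the entropy, followed by a worked two-symbol case.  The probabilities
(0.9, 0.1) are an assumed example, not figures from this thread.

```latex
% Entropy bound for a Huffman code with symbol probabilities p_i
% and code lengths \ell_i:
H(X) = -\sum_i p_i \log_2 p_i
  \;\le\; \bar{L}_{\mathrm{Huffman}} = \sum_i p_i\,\ell_i
  \;<\; H(X) + 1

% Worked example with p = (0.9, 0.1):
H(X) = -0.9\log_2 0.9 - 0.1\log_2 0.1 \approx 0.469 \ \text{bits/symbol},
\qquad
\bar{L}_{\mathrm{Huffman}} = 0.9 \cdot 1 + 0.1 \cdot 1 = 1 \ \text{bit/symbol}
```

Each symbol must get at least one whole bit, so Huffman spends about twice
the entropy here; arithmetic coding can get close to the 0.469 figure.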

The fact is that Huffman coding for all our scripts at once just isn't
possible.  Static Huffman coding is not a good reason to not want UTF-8 or
any other Unicode encoding.

Nico
--