Re: Delta Compression and UTF-8 Header Values

Roberto Peon <grmocg@gmail.com> Sun, 10 February 2013 09:03 UTC

In-Reply-To: <CAK3OfOhFFHymH1x7t7bAnTEzE34PyWO1moOC5p3opC4qcHzA2Q@mail.gmail.com>
References: <CABP7RbfRLXPpL4=wip=FvqD3DM7BM8PXi7uRswHAusXUmPO_xw@mail.gmail.com> <511722AB.1040403@it.aoyama.ac.jp> <CAK3OfOhFFHymH1x7t7bAnTEzE34PyWO1moOC5p3opC4qcHzA2Q@mail.gmail.com>
Date: Sun, 10 Feb 2013 01:00:19 -0800
Message-ID: <CAP+FsNeXQy_69b=B-8nbzfW4UtQDDK7LWg6VQ3hoNMJo3Q47ig@mail.gmail.com>
From: Roberto Peon <grmocg@gmail.com>
To: Nico Williams <nico@cryptonector.com>
Cc: "Martin J. Dürst" <duerst@it.aoyama.ac.jp>, James M Snell <jasnell@gmail.com>, "ietf-http-wg@w3.org" <ietf-http-wg@w3.org>
Archived-At: <http://www.w3.org/mid/CAP+FsNeXQy_69b=B-8nbzfW4UtQDDK7LWg6VQ3hoNMJo3Q47ig@mail.gmail.com>

I'll point out that the only reason we're talking about static
Huffman tables is that it isn't safe to have a constantly mutating Huffman
encoding over the lifetime of the session.
Perhaps it is reasonable to negotiate a different table in the first
exchange, dunno, but it is certainly technically feasible.

That being said, dealing with UTF-8 and Unicode in metadata that isn't
being shown to the user seems silly to me.
For data that is shown to the user, however, picking one single encoding
for it would be nice, even if it is UTF-8 :)
-=R


On Sun, Feb 10, 2013 at 12:17 AM, Nico Williams <nico@cryptonector.com> wrote:

> On Saturday, February 9, 2013, "Martin J. Dürst" wrote:
>
>> Hello James, others,
>>
>> On 2013/02/09 4:28, James M Snell wrote:
>>
>>> One key challenge with allowing UTF-8 values, however, is that it
>>> conflicts with the use of the static huffman encoding in the proposed
>>> Delta Encoding for header compression. If we allow for non-ascii
>>> characters, the static huffman coding simply becomes too inefficient
>>> and unmanageable to be useful. There are a few ways around it but none
>>> of the strategies are all that attractive.
>>
>>
> Wait, what?  If you have non-English (worse, non-European) text in some
> ASCII encoding like punycode, or base64-encoded UTF-8, then static Huffman
> will not be useful for compression anyway (assuming the Huffman coding is
> based on, say, English letter frequencies).
>
>
>> [If somebody has pointers to actual code, that would be appreciated. I
>> can't work on it for the next two weeks, but after that, I should be able
>> to use a day or two to see what's possible.]
>>
>> For a static Huffman encoding, you have to decide what symbols you accept
>> as input, give every symbol a probability (these have to add up to 1) and
>> then you get the 'optimal' "comma-free" encoding using the algorithm
>> devised by Huffman. Optimal is under the assumptions that the probabilities
>> are correct (and independent) and that you have to use an integral number
>> of bits per symbol. Arithmetic coding gets rid of the second restriction;
>> to get rid of the first, one creates a more complex model. Comma-free just
>> means you don't have to guess where the bits for one symbol end and those
>> for the next symbol start.
>
>
> Right.  I hope I put it more succinctly above.
>
> The fact is that Huffman coding for all our scripts at once just isn't
> possible.  Static Huffman coding is not a good reason to not want UTF-8 or
> any other Unicode encoding.
>
> Nico
> --
>
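
To make the static Huffman construction described in the quoted text a bit
more concrete, here is a minimal Python sketch. It is an illustration only and
is not taken from any header-compression draft. The sample header text, the
weighting scheme (a floor weight of 1 for every byte so that any input stays
encodable, with observed bytes boosted by an arbitrary factor) and the names
static_weights, build_huffman_code, encode, decode and bits_per_byte are all
assumptions made for this example.

```python
import base64
import heapq
from collections import Counter

# Hypothetical sample of header-like text used to derive the static table.
SAMPLE = (
    b"accept: text/html,application/xhtml+xml\r\n"
    b"accept-encoding: gzip, deflate\r\n"
    b"accept-language: en-us,en;q=0.5\r\n"
    b"cache-control: max-age=0\r\n"
    b"user-agent: mozilla/5.0\r\n"
)


def static_weights(sample, boost=20):
    """Assign a weight to every byte value 0..255.

    Unseen bytes keep a floor weight of 1 (an assumption of this sketch),
    so the resulting table can encode arbitrary input.
    """
    counts = Counter(sample)
    return {b: 1 + boost * counts[b] for b in range(256)}


def build_huffman_code(weights):
    """Return {symbol: bitstring} using Huffman's algorithm."""
    if len(weights) == 1:
        return {next(iter(weights)): "0"}
    # Heap entries are (weight, unique tie-breaker, {symbol: partial code}).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(sorted(weights.items()))]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees, prefixing their codes with 0 and 1.
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {sym: "0" + c for sym, c in left.items()}
        merged.update({sym: "1" + c for sym, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


def encode(data, code):
    """Encode a bytes object as a string of '0'/'1' characters."""
    return "".join(code[b] for b in data)


def decode(bits, code):
    """Decode without separators; this works because no code is a prefix of another."""
    inverse = {c: sym for sym, c in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a complete symbol")
    return bytes(out)


def bits_per_byte(data, code):
    return len(encode(data, code)) / len(data)


code = build_huffman_code(static_weights(SAMPLE))
utf8_value = "日本語のヘッダー値".encode("utf-8")
for label, data in [
    ("header-like ASCII", SAMPLE),
    ("raw UTF-8", utf8_value),
    ("base64 of UTF-8", base64.b64encode(utf8_value)),
]:
    print(f"{label}: {len(data)} bytes -> {len(encode(data, code))} bits")
```

With a toy table like this, byte values that never occur in the sample cannot
get shorter codes than the bytes that do occur, which is the effect being
argued about in this thread. The exact numbers depend entirely on the assumed
sample and weighting, so they say nothing about any real proposed table.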