Re: Delta Compression and UTF-8 Header Values

Roberto Peon <grmocg@gmail.com> Mon, 11 February 2013 22:54 UTC

In-Reply-To: <m3sj52a61n.fsf@carbon.jhcloos.org>
References: <CABP7RbfRLXPpL4=wip=FvqD3DM7BM8PXi7uRswHAusXUmPO_xw@mail.gmail.com> <CE65E38D-A482-4EA9-BAF4-F6498F643A78@mnot.net> <CABP7RbcRrjV7EhwoGbkWbYJEXeWOwH4gQuaCG7N0siQqeMtcag@mail.gmail.com> <m3sj52a61n.fsf@carbon.jhcloos.org>
Date: Mon, 11 Feb 2013 14:53:01 -0800
Message-ID: <CAP+FsNf2x-K0OFQVLOKsc+ZM+BUDJGygcnUH=buQm4yA2Su2cw@mail.gmail.com>
From: Roberto Peon <grmocg@gmail.com>
To: James Cloos <cloos@jhcloos.com>
Cc: James M Snell <jasnell@gmail.com>, Mark Nottingham <mnot@mnot.net>, "ietf-http-wg@w3.org" <ietf-http-wg@w3.org>
Archived-At: <http://www.w3.org/mid/CAP+FsNf2x-K0OFQVLOKsc+ZM+BUDJGygcnUH=buQm4yA2Su2cw@mail.gmail.com>

The header names are almost completely handled by the pre-seeded
dictionary, so they have very little effect on the character frequency
counts, and therefore on the Huffman encoding.
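
To illustrate the point, here is a minimal sketch of how such a frequency count might be gathered. The dictionary contents, the sample headers, and the function name `char_frequencies` are all assumptions made up for this example, not part of any proposal:

```python
from collections import Counter

# Hypothetical pre-seeded dictionary of header names (illustrative only).
PRESEEDED_NAMES = {"host", "user-agent", "accept", "cookie"}

def char_frequencies(headers, preseeded_names=PRESEEDED_NAMES):
    """Count the characters that would actually need Huffman coding.

    Names found in the pre-seeded dictionary are assumed to be sent as
    dictionary references, so their characters are not counted.
    """
    counts = Counter()
    for name, value in headers:
        if name.lower() not in preseeded_names:
            counts.update(name)   # unknown name: sent as a literal string
        counts.update(value)      # values are always counted
    return counts

headers = [("Host", "example.com"), ("Accept", "*/*"), ("X-Custom", "abc")]
freq = char_frequencies(headers)
```

With this sample, only `X-Custom` contributes name characters; the counts are otherwise driven entirely by the values.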

Arithmetic coding gets better compression ratios, at the expense of gobs of
CPU and complexity. I don't think that is a good tradeoff :/

What we're proposing so far is to encode with the static Huffman table and,
if the end result is larger than the original text, just use the original
text. Of course, one could skip the Huffman-encoding step given a good idea
that this would be the case, but hopefully we get close enough that the
static Huffman is still of benefit. The way of doing this selection is
exactly what you propose: use up a bit to indicate that the encoding isn't
done with Huffman. There are a couple of obvious ways of doing this:
1) Use a flag in the opcode byte. The main advantage of doing this is that
it saves bits elsewhere, but there is a disadvantage: if you end up wanting
to encode strings in two different ways, you must emit two different
opcodes of the same type, and each opcode ends up consuming two bytes (one
for opcode+flags, one for the number of operations of that type).
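
A rough sketch of the "try Huffman, fall back to the original text" selection, plus the opcode overhead described in option 1. Everything here is an assumption for illustration: the toy frequency table `STATIC_FREQS` stands in for a real static table, the names `encode_string` and `opcode_header_bytes` are made up, and the overhead model assumes operations can be grouped so that one opcode is needed per distinct encoding.

```python
import heapq
from itertools import count

def build_code(freqs):
    """Build a Huffman prefix code {symbol: bitstring} from {symbol: weight}."""
    tie = count()  # unique tie-breaker so the dicts are never compared
    heap = [(w, next(tie), {s: ""}) for s, w in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + b for s, b in c1.items()}
        merged.update({s: "1" + b for s, b in c2.items()})
        heapq.heappush(heap, (w1 + w2, next(tie), merged))
    return heap[0][2]

# Toy static table (an assumption): every byte is encodable, but lowercase
# letters, digits and a little punctuation get much shorter codes.
STATIC_FREQS = {b: 1 for b in range(256)}
for ch in b"abcdefghijklmnopqrstuvwxyz0123456789-./=;, ":
    STATIC_FREQS[ch] = 200
STATIC_CODE = build_code(STATIC_FREQS)

HUFFMAN, RAW = 1, 0  # the one-bit flag saying how the string was encoded

def encode_string(data: bytes):
    """Return (flag, payload): Huffman output unless it is larger than data."""
    bits = "".join(STATIC_CODE[b] for b in data)
    nbytes = (len(bits) + 7) // 8
    if not data or nbytes > len(data):
        return RAW, data
    # Zero-pad to a byte boundary; a real format would also need a length
    # or end-of-string convention so the decoder can ignore the padding.
    padded = bits.ljust(nbytes * 8, "0")
    return HUFFMAN, int(padded, 2).to_bytes(nbytes, "big")

def opcode_header_bytes(flags):
    """Two bytes (opcode+flags, operation count) per distinct encoding used."""
    return 2 * len(set(flags))
```

For example, `encode_string(b"text/html")` comes back flagged `HUFFMAN` and shorter than the input, while a string of rare bytes such as `b"\xff\xfe\xfd\xfc"` comes back flagged `RAW` and unchanged. Mixing the two in one batch costs `opcode_header_bytes([HUFFMAN, RAW]) == 4` bytes of opcode headers instead of 2.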

-=R


On Mon, Feb 11, 2013 at 1:34 PM, James Cloos <cloos@jhcloos.com> wrote:

> >>>>> "JMS" == James M Snell <jasnell@gmail.com> writes:
>
> JMS> we'll be able to huffman code anything that is flagged
> JMS> as ASCII, and won't be able to touch the rest.
>
> Would that really be an issue?  The static huffman can only really be
> for the common strings, yes?  Which mostly means the header names and
> not the header values?  So even if the headers were limited to ascii
> the tables wouldn't help much for most of the values?
>
> (As an aside, would arithmetic be of any better value than huffman, here?)
>
> Using one bit for each string to specify utf8-text blob vs binary blob,
> and using the former for everything known to be text, seems the best
> overall choice.  And if any non-ascii utf8 sequences become common
> enough, they can be added to future revisions of the static table just
> as easily as 7-bit strings can be.
>
> -JimC
> --
> James Cloos <cloos@jhcloos.com>         OpenPGP: 1024D/ED7DAEA6
>
>