Re: Delta Compression and UTF-8 Header Values

Willy Tarreau <w@1wt.eu> Sat, 09 February 2013 15:00 UTC

Date: Sat, 09 Feb 2013 15:58:34 +0100
From: Willy Tarreau <w@1wt.eu>
To: Martin Nilsson <nilsson@opera.com>
Cc: ietf-http-wg@w3.org
Subject: Re: Delta Compression and UTF-8 Header Values
Archived-At: <http://www.w3.org/mid/20130209145834.GB8712@1wt.eu>

On Sat, Feb 09, 2013 at 03:12:32PM +0100, Martin Nilsson wrote:
> On Sat, 09 Feb 2013 14:33:41 +0100, Willy Tarreau <w@1wt.eu> wrote:
> 
> >Also, processing it is
> >particularly inefficient as you have to parse each and every byte to find
> >a length, making string comparisons quite slow.
> 
> You don't need to know the length in characters to compare strings. Just  
> comparing byte on byte works fine.

This is exactly what you want to avoid when comparing against lots of strings.
It's generally more efficient to compare lengths first, then compare byte by
byte only if the lengths match. This is equally true when checking for some
regex patterns such as "/cache/dir/../..../" where "." denotes a character.
And last but not least, the Boyer-Moore search is much less efficient with
UTF-8 encoding than it is with non-encoded data.
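
To illustrate the length-first comparison, here is a minimal sketch in C.
The struct cstr type and the function names are hypothetical and only serve
the illustration; it assumes strings are carried with their byte length. The
last helper shows why a length in characters is costly with UTF-8: every byte
has to be examined to skip continuation bytes.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical counted string: a pointer plus a length in bytes. */
struct cstr {
    const char *ptr;
    size_t len;
};

/* Returns 1 if both strings are equal, 0 otherwise. The length check
 * rejects most mismatches without reading any string byte.
 */
static int cstr_eq(const struct cstr *a, const struct cstr *b)
{
    if (a->len != b->len)
        return 0;
    return memcmp(a->ptr, b->ptr, a->len) == 0;
}

/* Returns the index of <key> in <table>, or -1 if not found. */
static int cstr_find(const struct cstr *table, size_t n,
                     const struct cstr *key)
{
    for (size_t i = 0; i < n; i++)
        if (cstr_eq(&table[i], key))
            return (int)i;
    return -1;
}

/* Counts characters in a UTF-8 buffer by skipping continuation bytes
 * (10xxxxxx). Assumes valid UTF-8 input; note that all <bytes> bytes
 * must be read to get the result.
 */
static size_t utf8_char_count(const char *s, size_t bytes)
{
    size_t n = 0;

    for (size_t i = 0; i < bytes; i++)
        if (((unsigned char)s[i] & 0xC0) != 0x80)
            n++;
    return n;
}
```

With a table of header names, most candidates are discarded by the length
test alone, and memcmp() only runs on entries of the same size.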

I'm really all for just transporting raw data as much as possible, with
only the two ends needing to understand and agree upon the encoding.
However, if some data come from sources that are commonly UTF-8 encoded,
then I'd rather keep them as-is than have to re-encode them.

Willy