Re: Delta Compression and UTF-8 Header Values

Willy Tarreau <w@1wt.eu> Sun, 10 February 2013 07:29 UTC

Date: Sun, 10 Feb 2013 08:26:42 +0100
From: Willy Tarreau <w@1wt.eu>
To: "Martin J. Dürst" <duerst@it.aoyama.ac.jp>
Cc: Mark Nottingham <mnot@mnot.net>, James M Snell <jasnell@gmail.com>, "ietf-http-wg@w3.org" <ietf-http-wg@w3.org>
Archived-At: <http://www.w3.org/mid/20130210072642.GN8712@1wt.eu>

Hello Martin,

On Sun, Feb 10, 2013 at 02:02:46PM +0900, "Martin J. Dürst" wrote:
> >The encoding can
> >become inefficient to transport for other charsets by inflating data by up
> >to 50%
> 
> Well, that's actually an urban myth. The 50% is for CJK 
> (Chinese/Japanese/Korean).

Given how fast China is developing, it is perfectly imaginable that in
10 years a significant portion of web traffic will use Chinese URLs, so
we must not ignore that case.

> For the languages/scripts of India, South 
> East Asia, and a few more places, it can be 200%. (For texts purely in 
> an alphabet in the Supplemental planes such as Old Italic, Shavian, 
> Osmanya,..., it can be 300%, but I guess we can ignore these.) But these 
> numbers only apply to cases that don't contain any ASCII at all.

I don't see how this is possible: you have 6 bits of data per byte plus
a few bits in the first byte, so you need 3 bytes to transport 16 bits,
which is 50% overhead for me :-)
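
To make the arithmetic concrete, here is a minimal sketch that measures how
much a single character grows in UTF-8 relative to a chosen baseline. The
helper name utf8_inflation and the sample characters are assumptions picked
for illustration. It suggests the two figures differ mainly in the baseline:
50% compares against a 16-bit encoding, while 200% compares against a legacy
single-byte charset (TIS-620 for Thai is used here as one such charset).

```python
def utf8_inflation(text: str, baseline_encoding: str) -> float:
    """Percent growth of the UTF-8 form over the same text in a baseline encoding."""
    utf8_len = len(text.encode("utf-8"))
    base_len = len(text.encode(baseline_encoding))
    return 100.0 * (utf8_len - base_len) / base_len


cjk = "\u4e2d"              # U+4E2D, a CJK ideograph
thai = "\u0e01"             # U+0E01, THAI CHARACTER KO KAI
old_italic = "\U00010300"   # U+10300, OLD ITALIC LETTER A (supplementary plane)

# Baseline: 16 bits per BMP character, as in the reasoning above.
print(utf8_inflation(cjk, "utf-16-be"))         # 3 bytes vs 2 bytes
print(utf8_inflation(thai, "utf-16-be"))        # 3 bytes vs 2 bytes

# Baseline: a legacy single-byte charset.
print(utf8_inflation(thai, "tis_620"))          # 3 bytes vs 1 byte

# Supplementary plane: 4 bytes in UTF-8, and also 4 bytes in UTF-16.
print(utf8_inflation(old_italic, "utf-16-be"))  # 4 bytes vs 4 bytes
```

A 300% figure would follow the same formula for a 4-byte UTF-8 character
measured against a hypothetical single-byte encoding: (4 - 1) / 1.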

> >and may make compression less efficient.
> 
> That depends very much on the method of compression that's used.

I agree, but adding unused bits or entropy in general will make compression
algorithms less efficient.

> >I'm not saying I'm totally against UTF-8 in HTTP/2 (even though I hate
> >using it), I'm saying that it's not *THE* solution to every problem.
> >It's just *A* solution to *A* problem: "how to extend character sets in
> >existing documents without having to re-encode them all". I don't think
> >this specific problem is related to the scope of the HTTP/2 work, so at
> >first glance, I'd say that UTF-8 doesn't seem to solve a known problem
> >here.
> 
> The fact that I mentioned Websockets may have led to a 
> misunderstanding. I'm not proposing to use UTF-8 only in bodies, just in 
> headers (I wouldn't object, though). My understanding was that James was 
> talking about headers, and I was doing so, too.

I was talking about header values too. As a developer of intermediaries,
I'm not interested in the body at all. I see people do ugly things all
the time, like regex-matching hosts with ".*\.example\.com" without
being aware of how slow it is to do that on each and every Host header
field. Doing that with a UTF-8-aware library is typically even slower.
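
As a rough illustration of the cost concern, the sketch below compares the
regex above with a plain byte-suffix test. The function names are
hypothetical, the regex is assumed to be anchored on the whole value
(fullmatch), and the host is assumed to be already lowercased. The timing
part prints whatever the local machine measures; no particular ratio is
claimed here.

```python
import re
import timeit

SUFFIX = b".example.com"
PATTERN = re.compile(rb".*\.example\.com")


def host_matches_regex(host: bytes) -> bool:
    """Match the whole Host value against the regex (assumed anchoring)."""
    return PATTERN.fullmatch(host) is not None


def host_matches_suffix(host: bytes) -> bool:
    """Same decision using a byte-wise suffix comparison, no regex engine."""
    return host.endswith(SUFFIX)


if __name__ == "__main__":
    host = b"static.cdn.example.com"
    for fn in (host_matches_regex, host_matches_suffix):
        elapsed = timeit.timeit(lambda: fn(host), number=100_000)
        print(f"{fn.__name__}: {elapsed:.4f}s for 100000 calls")
```

Both functions operate on raw bytes, so neither needs to decode the header
value before deciding.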

That's why I'm having some concerns.

Ideally, everything we transport should be in its original form. If hosts
come from DNS, they should appear encoded as they were returned by the DNS
server (even with the ugly IDN format). If paths are supposed to be UTF-8,
let them be sent in their raw original UTF-8 form without changing the
format. But then we don't want to mix Host and path, and the first rule
should be that only the shortest forms are allowed. If most header fields
are pure ASCII (e.g. encodings), declare them as such. If some header
fields are enums, use enums and not text. Etc...
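
The checks described above could be sketched as follows. This is only an
illustration under assumptions: the function names are hypothetical, strict
decoding is used as a stand-in for a "shortest form only" rule (it rejects
overlong sequences, but also any other malformed input), and Python's
built-in "idna" codec (IDNA 2003) stands in for the form a host takes in DNS.

```python
def is_well_formed_utf8(raw: bytes) -> bool:
    """True if raw is valid UTF-8; strict decoding rejects overlong forms."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def is_pure_ascii(raw: bytes) -> bool:
    """True if every byte is below 0x80, for fields declared ASCII-only."""
    return raw.isascii()


def host_as_in_dns(host: str) -> bytes:
    """Convert a Unicode host name to its ASCII-compatible (xn--) form."""
    return host.encode("idna")


print(is_well_formed_utf8("/caf\u00e9".encode("utf-8")))  # raw UTF-8 path
print(is_well_formed_utf8(b"\xc0\xaf"))   # overlong 2-byte encoding of "/"
print(is_pure_ascii(b"gzip, deflate"))
print(host_as_in_dns("b\u00fccher.example"))
```

Keeping the host in its DNS form and the path in raw UTF-8 means each field
can be validated by the one check that applies to it.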

Regards,
Willy