Re: Delta Compression and UTF-8 Header Values

"Martin J. Dürst" <duerst@it.aoyama.ac.jp> Sun, 10 February 2013 05:04 UTC

Message-ID: <511729F6.6000201@it.aoyama.ac.jp>
Date: Sun, 10 Feb 2013 14:02:46 +0900
From: "Martin J. Dürst" <duerst@it.aoyama.ac.jp>
Organization: Aoyama Gakuin University
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.9) Gecko/20100722 Eudora/3.0.4
MIME-Version: 1.0
To: Willy Tarreau <w@1wt.eu>
CC: Mark Nottingham <mnot@mnot.net>, James M Snell <jasnell@gmail.com>, "ietf-http-wg@w3.org" <ietf-http-wg@w3.org>
References: <CABP7RbfRLXPpL4=wip=FvqD3DM7BM8PXi7uRswHAusXUmPO_xw@mail.gmail.com> <CE65E38D-A482-4EA9-BAF4-F6498F643A78@mnot.net> <511642E9.9010607@it.aoyama.ac.jp> <20130209133341.GA8712@1wt.eu>
In-Reply-To: <20130209133341.GA8712@1wt.eu>
Content-Type: text/plain; charset="UTF-8"; format="flowed"
Content-Transfer-Encoding: 8bit
Subject: Re: Delta Compression and UTF-8 Header Values
Archived-At: <http://www.w3.org/mid/511729F6.6000201@it.aoyama.ac.jp>

Hello Willy,

On 2013/02/09 22:33, Willy Tarreau wrote:
> On Sat, Feb 09, 2013 at 09:36:57PM +0900, "Martin J. Dürst" wrote:

>> It would be a good idea to try hard to make the new protocol forward
>> looking (or actually just acknowledge the present, rather than stay
>> frozen in the past) for the next 20 years or so in terms of character
>> encoding, too, and not only in terms of CPU/network performance.
>
> Well, don't confuse UTF-8 and Unicode.

As the main author of http://www.w3.org/TR/charmod/, I sure won't.

> UTF-8 is just a space-efficient way
> of transporting Unicode characters for western countries.

And for transporting ASCII-based commands/headers/markup together with 
non-ASCII data. That's the main reason the IETF adopted it.
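
The point about mixing ASCII syntax with non-ASCII data can be sketched as follows. This is only an illustration: the header name X-Example-Title, the sample value, and the helper split_header are hypothetical and not taken from any HTTP draft. It relies on one property of UTF-8: every byte of a non-ASCII character has its high bit set, so ASCII delimiters can be located on the raw bytes without decoding.

```python
def split_header(raw: bytes) -> tuple[bytes, bytes]:
    """Split a raw header line at the first ': ' without decoding it.

    This is safe for UTF-8 values because no byte of a multi-byte
    UTF-8 sequence can be mistaken for an ASCII byte such as ':'.
    """
    name, _, value = raw.partition(b": ")
    return name, value


def non_ascii_bytes_are_high(text: str) -> bool:
    """Check that all bytes encoding non-ASCII characters are >= 0x80."""
    return all(
        byte >= 0x80
        for ch in text
        if ord(ch) > 0x7F
        for byte in ch.encode("utf-8")
    )


# Hypothetical header: ASCII name, UTF-8 value ("Dürst" plus three kanji).
HEADER = "X-Example-Title: D\u00fcrst \u65e5\u672c\u8a9e"
RAW = HEADER.encode("utf-8")
NAME, VALUE = split_header(RAW)
```

The ASCII part of the line is byte-for-byte identical to what a pure-ASCII sender would produce; only the value carries multi-byte sequences.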

> The encoding can
> become inefficient to transport for other charsets by inflating data by up
> to 50%

Well, that's actually an urban myth. The 50% is for CJK
(Chinese/Japanese/Korean). For the languages/scripts of India, South
East Asia, and a few more places, it can be 200%. (For texts purely in
an alphabet in the supplementary planes, such as Old Italic, Shavian,
or Osmanya, it can be 300%, but I guess we can ignore these.) But these
numbers only apply to text that doesn't contain any ASCII at all.
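
A rough way to check these figures is to compare UTF-8 byte counts against a legacy encoding for a few sample strings. The sketch below does that under stated assumptions: the baselines (Shift_JIS for Japanese, TIS-620 for Thai, and a hypothetical one-byte-per-character encoding for Shavian, for which no legacy codec is assumed to exist) and the sample strings are chosen for illustration and do not come from the thread.

```python
from typing import Optional


def utf8_inflation(text: str, legacy_codec: Optional[str] = None) -> float:
    """Return how much larger UTF-8 is than a legacy encoding, in percent.

    If legacy_codec is None, a hypothetical encoding that uses exactly
    one byte per character is assumed as the baseline.
    """
    utf8_len = len(text.encode("utf-8"))
    if legacy_codec is None:
        legacy_len = len(text)  # assumption: 1 byte per character
    else:
        legacy_len = len(text.encode(legacy_codec))
    return 100.0 * (utf8_len - legacy_len) / legacy_len


JAPANESE = "\u65e5\u672c\u8a9e"                     # three kanji
THAI = "\u0e20\u0e32\u0e29\u0e32\u0e44\u0e17\u0e22"  # seven Thai letters
SHAVIAN = "\U00010450\U00010451"                    # two Shavian letters
MIXED = "title=" + JAPANESE                         # ASCII plus kanji

cjk = utf8_inflation(JAPANESE, "shift_jis")   # 9 bytes vs 6
thai = utf8_inflation(THAI, "tis_620")        # 21 bytes vs 7
shavian = utf8_inflation(SHAVIAN)             # 8 bytes vs 2 (hypothetical)
mixed = utf8_inflation(MIXED, "shift_jis")    # 15 bytes vs 12
```

With these samples the pure-script strings give 50%, 200%, and 300%, while the mixed string inflates by only 25%, because its ASCII portion costs the same in both encodings.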

> and may make compression less efficient.

That depends very much on the method of compression that's used.


> Also, processing it is
> particularly inefficient as you have to parse each and every byte to find
> a length, making string comparisons quite slow.

[See the follow-up mails in this thread.]

> I'm not saying I'm totally against UTF-8 in HTTP/2 (even though I hate using
> it), I'm saying that it's not *THE* solution to every problem. It's just *A*
> solution to *A* problem: "how to extend character sets in existing documents
> without having to re-encode them all". I don't think this specific problem is
> related to the scope of the HTTP/2 work, so at first glance, I'd say that
> UTF-8 doesn't seem to solve a known problem here.

The fact that I mentioned Websockets may have led to a
misunderstanding. I'm not proposing to use UTF-8 in bodies, only in
headers (I wouldn't object to it in bodies, though). My understanding
was that James was talking about headers, and I was doing so, too.

Regards,   Martin.