Re: Delta Compression and UTF-8 Header Values

James M Snell <jasnell@gmail.com> Sun, 10 February 2013 07:40 UTC

In-Reply-To: <20130210072642.GN8712@1wt.eu>
References: <CABP7RbfRLXPpL4=wip=FvqD3DM7BM8PXi7uRswHAusXUmPO_xw@mail.gmail.com> <CE65E38D-A482-4EA9-BAF4-F6498F643A78@mnot.net> <511642E9.9010607@it.aoyama.ac.jp> <20130209133341.GA8712@1wt.eu> <511729F6.6000201@it.aoyama.ac.jp> <20130210072642.GN8712@1wt.eu>
Date: Sat, 09 Feb 2013 23:37:29 -0800
Message-ID: <CABP7RbfgR4u+n9_K1DqYqf8HUPuXWGLyHOOAPGwWxKs7M_dmKw@mail.gmail.com>
From: James M Snell <jasnell@gmail.com>
To: Willy Tarreau <w@1wt.eu>
Cc: Mark Nottingham <mnot@mnot.net>, Martin Dürst <duerst@it.aoyama.ac.jp>, "ietf-http-wg@w3.org" <ietf-http-wg@w3.org>
Archived-At: <http://www.w3.org/mid/CABP7RbfgR4u+n9_K1DqYqf8HUPuXWGLyHOOAPGwWxKs7M_dmKw@mail.gmail.com>

Keep in mind that allowing headers to potentially contain UTF-8 does not
change the definitions of existing headers. Those that are currently
defined with ASCII-only values would likely remain ASCII-only; to change
that, we would need to either update the definitions of those existing
headers (breaking backwards compatibility) or define new UTF-8-compatible
replacements.

All we need to decide at this point, really, is (a) whether UTF-8 header
values are important to us and (b), if the answer to (a) is yes, whether
our basic header encoding does or will allow for UTF-8.

On Feb 9, 2013 11:26 PM, "Willy Tarreau" <w@1wt.eu> wrote:

> Hello Martin,
>
> On Sun, Feb 10, 2013 at 02:02:46PM +0900, "Martin J. Dürst" wrote:
> > >The encoding can
> > >become inefficient to transport for other charsets by inflating data by
> > >up to 50%
> >
> > Well, that's actually an urban myth. The 50% is for CJK
> > (Chinese/Japanese/Korean).
>
> With the fast development of China, it is perfectly imaginable that
> in 10 years, a significant portion of the web traffic is made with
> Chinese URLs, so we must not ignore that.
>
> > For the languages/scripts of India, South
> > East Asia, and a few more places, it can be 200%. (For texts purely in
> > an alphabet in the Supplemental planes such as Old Italic, Shavian,
> > Osmanya,..., it can be 300%, but I guess we can ignore these.) But these
> > numbers only apply to cases that don't contain any ASCII at all.
>
> I don't see how this is possible since you have 6 bits of data per byte
> plus a few bits on the first byte, and you need 3 bytes to transport 16
> bits, which is 50% for me :-)
>
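
The percentages in this exchange depend on what UTF-8 is being compared
against. A minimal Python sketch, assuming the baseline is a legacy encoding
such as GB2312 or TIS-620, shows where 50% and 200% can each come from. The
baselines, the sample strings, and the helper name utf8_inflation are
illustrative assumptions, not taken from the thread:

```python
def utf8_inflation(text: str, baseline: str) -> float:
    """Return the percentage by which UTF-8 is larger than `baseline` for `text`."""
    utf8_len = len(text.encode("utf-8"))
    base_len = len(text.encode(baseline))
    return 100.0 * (utf8_len - base_len) / base_len


# Sample strings and baseline encodings are assumptions chosen for illustration.
samples = [
    ("example.com", "ascii"),   # pure ASCII: same size in both
    ("中文", "gb2312"),          # CJK: 3 bytes in UTF-8 vs 2 in GB2312
    ("ไทย", "tis_620"),          # Thai: 3 bytes in UTF-8 vs 1 in TIS-620
]

for text, baseline in samples:
    print(f"{text!r} vs {baseline}: {utf8_inflation(text, baseline):.0f}%")
```

Measured against UTF-16 instead, both the Chinese and the Thai sample come
out at 50%, which corresponds to the "3 bytes to transport 16 bits"
arithmetic quoted above.
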
> > >and may make compression less efficient.
> >
> > That depends very much on the method of compression that's used.
>
> I agree, but adding unused bits or entropy in general will make compression
> algorithms less efficient.
>
> > >I'm not saying I'm totally against UTF-8 in HTTP/2 (even though I hate
> > >using it), I'm saying that it's not *THE* solution to every problem.
> > >It's just *A* solution to *A* problem: "how to extend character sets in
> > >existing documents without having to re-encode them all". I don't think
> > >this specific problem is related to the scope of the HTTP/2 work, so at
> > >first glance, I'd say that UTF-8 doesn't seem to solve a known problem
> > >here.
> >
> > The fact that I mentioned Websockets may have led to a
> > misunderstanding. I'm not proposing to use UTF-8 only in bodies, just in
> > headers (I wouldn't object, though). My understanding was that James was
> > talking about headers, and I was doing so, too.
>
> I was talking about header values too. As a developer of intermediaries,
> I'm not interested in the body at all. I'm seeing people do ugly things
> all the time, like regex-matching hosts with ".*\.example\.com" without
> being aware of how slow it is to do that on each and every Host header
> field. Typically doing that with a UTF-8-aware library is even slower.
>
> That's why I'm having some concerns.
>
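
To make the Host-matching concern concrete, here is a minimal Python sketch
contrasting the regex quoted above with a plain suffix comparison. The
function names are hypothetical, and the sketch assumes the host has already
been lowercased and contains no newline; it makes no claim about how any
particular intermediary implements this:

```python
import re

# The pattern quoted above, compiled once. fullmatch() anchors both ends.
HOST_RE = re.compile(r".*\.example\.com")


def matches_regex(host: str) -> bool:
    """Match via the regex engine, which must scan past the leading '.*'."""
    return HOST_RE.fullmatch(host) is not None


def matches_suffix(host: str) -> bool:
    """Match via a plain suffix comparison, with no regex engine involved."""
    return host.endswith(".example.com")


# Hypothetical host names, used only to show that both checks agree.
for host in ["www.example.com", "a.b.example.com", "example.com", "example.org"]:
    print(host, matches_regex(host), matches_suffix(host))
```

Both functions accept the same hosts under the stated assumptions, but the
suffix check only ever compares the last 12 characters, whereas the regex
has to work through the whole value on every request.
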
> Ideally, everything we transport should be in its original form. If hosts
> come from DNS, they should appear encoded as they were returned by the DNS
> server (even with the ugly IDN format). If paths are supposed to be UTF-8,
> let them be sent in their raw original UTF-8 form without changing the
> format. But then we don't want to mix Host and path, and we want to put as
> a first rule that only the shortest forms are allowed. If most header
> fields are pure ASCII (e.g. encodings), declare them as such. If some
> header fields are enums, use enums and not text. Etc...
>
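
Assuming "only the shortest forms are allowed" refers to UTF-8's
shortest-form rule (no overlong encodings), a receiver could enforce it
with a strict decode. The sketch below relies on Python's built-in UTF-8
decoder, which rejects overlong sequences; the function name is
hypothetical:

```python
def is_shortest_form_utf8(raw: bytes) -> bool:
    """Return True if `raw` is well-formed UTF-8.

    Python's strict UTF-8 decoder rejects overlong sequences, so a
    successful decode implies every character is in its shortest form.
    """
    try:
        raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


print(is_shortest_form_utf8("/path/é".encode("utf-8")))  # well-formed input
print(is_shortest_form_utf8(b"\xc0\xaf"))                # overlong 2-byte form of "/"
print(is_shortest_form_utf8(b"\xe0\x80\xaf"))            # overlong 3-byte form of "/"
```

Rejecting the overlong forms matters for intermediaries because otherwise
the same character can arrive as several different byte sequences, and a
byte-wise comparison on a path would miss some of them.
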
> Regards,
> Willy
>
>