Re: Delta Compression and UTF-8 Header Values

Nico Williams <nico@cryptonector.com> Mon, 11 February 2013 00:11 UTC

Date: Sun, 10 Feb 2013 18:09:14 -0600
Message-ID: <CAK3OfOi+cXMLGsMCpD1cRBxzz46wVYYj8nz021fhqhM7fTDMWA@mail.gmail.com>
From: Nico Williams <nico@cryptonector.com>
To: Zhong Yu <zhong.j.yu@gmail.com>
Cc: Poul-Henning Kamp <phk@phk.freebsd.dk>, Julian Reschke <julian.reschke@gmx.de>, "\"Martin J. Dürst\"" <duerst@it.aoyama.ac.jp>, James M Snell <jasnell@gmail.com>, "ietf-http-wg@w3.org" <ietf-http-wg@w3.org>
Archived-At: <http://www.w3.org/mid/CAK3OfOi+cXMLGsMCpD1cRBxzz46wVYYj8nz021fhqhM7fTDMWA@mail.gmail.com>

On Sun, Feb 10, 2013 at 4:49 PM, Zhong Yu <zhong.j.yu@gmail.com> wrote:
> On Sun, Feb 10, 2013 at 4:24 AM, Poul-Henning Kamp <phk@phk.freebsd.dk> wrote:
>>>1) Filenames in Content-Disposition
>>
>> These only have meaning to the ultimate destinations, and if their
>> filesystems don't support UTF-8, they'll have to do $something anyway.

The filesystems pretty much do support either UTF-8 or "just-use-8".
In general "just-use-8" only really interops if everyone uses the same
codeset, and the only codeset we have that can be used universally
is... Unicode.
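
A minimal Python sketch of that interop problem, under assumptions: the
file name and the two codesets (Latin-1 on one end, KOI8-R on the other)
are illustrative choices, not anything taken from the thread.  The same
"just-use-8" bytes read back as different text unless both ends happen
to agree on the codeset.

```python
def decode_as(raw: bytes, codeset: str) -> str:
    """Interpret raw file-name bytes under the given codeset."""
    return raw.decode(codeset)


def is_utf8(raw: bytes) -> bool:
    """Return True if raw is a well-formed UTF-8 byte sequence."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Hypothetical file name, written by a sender whose locale is Latin-1.
name = "r\u00e9sum\u00e9.txt"
raw = name.encode("latin-1")

sender_view = decode_as(raw, "latin-1")    # what the sender meant
receiver_view = decode_as(raw, "koi8_r")   # what a KOI8-R receiver sees

# The same text sent as UTF-8 needs no out-of-band agreement.
wire = name.encode("utf-8")
```

Here `sender_view` matches the original name while `receiver_view` does
not, and `raw` is not even valid UTF-8.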

>> Nobody in the HTTP/2 protocol-chain can do anything but treat this
>> as an opaque bytestring.
>
> But how does the 2 ends agree on which encoding to use? It might be
> easier if HTTP just dictate UTF-8.

Not might be.  Will be.

We've done this in many other protocols.  In general we must either
tag text with codeset metadata or declare that Unicode (UTF-8,
generally) SHALL be used in the middle, pushing codeset conversions
to the edge.  No character set other than Unicode is suitable for
use "in the middle", and tagging strings with codeset metadata is
particularly difficult.
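
A rough sketch of "conversions at the edge", assuming each endpoint
knows its own local codeset.  The function names `to_wire` and
`from_wire` are hypothetical, made up for this illustration; only UTF-8
ever crosses the middle.

```python
def to_wire(local_bytes: bytes, local_codeset: str) -> bytes:
    """Sending edge: convert from the local codeset to UTF-8."""
    return local_bytes.decode(local_codeset).encode("utf-8")


def from_wire(wire_bytes: bytes, local_codeset: str) -> bytes:
    """Receiving edge: convert UTF-8 from the wire to the local codeset.

    Raises UnicodeDecodeError if the wire bytes are not valid UTF-8, and
    UnicodeEncodeError if the local codeset cannot represent the text.
    """
    return wire_bytes.decode("utf-8").encode(local_codeset)


# Illustrative round trip: a Latin-1 sender and a CP1252 receiver.
text = "D\u00fcrst"
sent = to_wire(text.encode("latin-1"), "latin-1")
received = from_wire(sent, "cp1252")
```

Intermediaries never need to know either endpoint's codeset; they can
pass `sent` along as bytes.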

It might be useful to go over what we've done in filesystems and
remote/distributed filesystem protocols.  Very briefly, in ZFS we
implemented fast normalization-insensitive string comparison and
hashing functionality; the filesystem has an option to reject any
non-UTF-8 byte sequences, but otherwise never normalizes on CREATE
(compare to HFS+).  Meanwhile NFSv4 calls for using only UTF-8 on the
wire.  This works.  It works *really* well.  The code is even open
source.  Filesystems are a great example of an application where
tagging strings with codeset metadata doesn't work: we'd need to push
process setlocale information into the kernel, and tag strings all the
way from the system call boundary, through the VFS, to the filesystem
driver -- with consequent impact on stable interfaces up and down the
stack, and massive code modification requirements.
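
To make the filesystem behaviour described above concrete, here is a toy
Python sketch.  It is not the ZFS code, and the `Directory` class and
the choice of NFC are assumptions for illustration only: names are
rejected if they are not valid UTF-8, compared in normalized form, and
stored exactly as given (no normalization on create).

```python
import unicodedata
from typing import Dict, Optional


def normalized_key(name: bytes) -> str:
    """Return a comparison/hash key for a UTF-8 name.

    Raises UnicodeDecodeError for non-UTF-8 byte sequences.
    """
    return unicodedata.normalize("NFC", name.decode("utf-8"))


class Directory:
    """Toy directory: normalization-insensitive, form-preserving."""

    def __init__(self) -> None:
        # Maps normalized key -> the original bytes supplied at create.
        self._entries: Dict[str, bytes] = {}

    def create(self, name: bytes) -> bool:
        """Add name; return False if an equivalent name already exists."""
        key = normalized_key(name)
        if key in self._entries:
            return False
        self._entries[key] = name  # stored as given, not normalized
        return True

    def lookup(self, name: bytes) -> Optional[bytes]:
        """Return the stored bytes of an equivalent name, or None."""
        return self._entries.get(normalized_key(name))


composed = "caf\u00e9".encode("utf-8")      # precomposed e-acute
decomposed = "cafe\u0301".encode("utf-8")   # e + combining acute

d = Directory()
d.create(composed)
found = d.lookup(decomposed)
```

Looking up the decomposed form finds the entry created with the composed
form, and the bytes returned are the ones originally supplied.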

Filesystems are not the only example of this, but because filesystems
cross so many layers in our stacks (user-land APIs, kernel-land APIs,
on-the-wire protocols, on-disk formats) they are perhaps the best
example.

UTF-8 in the middle.

Nico
--