Unicode sucks, get over it (Re: Delta Compression and UTF-8 Header Values)

Nico Williams <nico@cryptonector.com> Sun, 10 February 2013 22:46 UTC

Date: Sun, 10 Feb 2013 16:45:01 -0600
Message-ID: <CAK3OfOgYi-=W_QGJywf3hQbFMkfWv-ceXiJbYEdWM3-iaefP4Q@mail.gmail.com>
From: Nico Williams <nico@cryptonector.com>
To: Roberto Peon <grmocg@gmail.com>
Cc: Poul-Henning Kamp <phk@phk.freebsd.dk>, Julian Reschke <julian.reschke@gmx.de>, "Martin J. Dürst" <duerst@it.aoyama.ac.jp>, James M Snell <jasnell@gmail.com>, "ietf-http-wg@w3.org" <ietf-http-wg@w3.org>
Archived-At: <http://www.w3.org/mid/CAK3OfOgYi-=W_QGJywf3hQbFMkfWv-ceXiJbYEdWM3-iaefP4Q@mail.gmail.com>

On Sun, Feb 10, 2013 at 3:04 PM, Roberto Peon <grmocg@gmail.com> wrote:
> Another place where we may need to know about normalization is for caching.
> Does the lookup, etc. occur on the normalized form, or on the given data?
>
> All in all, utf-8 without addendum sucks for protocol work.

Normalization is not a UTF-8 thing; it's a Unicode thing.  And it's
not really a Unicode thing either, but a result of our stupid, human
scripts and their stupid collation and other rules.
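
To make this concrete, here is a minimal Python sketch (an
illustration only, not something any HTTP draft specifies).  It
assumes NFC as the form used for comparison, and `same_text` is a
hypothetical helper name.  The two spellings of the same word differ
in every Unicode encoding form, so the mismatch is not UTF-8's doing:

```python
import unicodedata

# Two spellings of the same word: precomposed U+00E9 versus
# "e" followed by the combining acute accent U+0301.
composed = "caf\u00e9"
decomposed = "cafe\u0301"

# The code point sequences differ, so the bytes differ in every
# encoding form, not just in UTF-8.
for codec in ("utf-8", "utf-16-le", "utf-32-le"):
    assert composed.encode(codec) != decomposed.encode(codec)

def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC (assumed form)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# Raw comparison says "different"; normalized comparison says "same".
print(composed == decomposed)
print(same_text(composed, decomposed))
```

A cache that keys on the raw bytes would treat these as two distinct
values; one that keys on the normalized form would not.  That choice
is exactly the question quoted above.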

There is *nothing* we can do in dealing with text that would both
a) meet the needs of our users, and b) not suck for string
comparison, collation, and other such operations.

In other words: all these arguments about how it sucks to deal with
UTF-8 or Unicode are not useful arguments.  We have to deal with text
in at least some parts of our protocols, and that means we have to
deal with I18N.

Worse, much worse than the problems Unicode brings with it, are the
problems of having either no clue what codeset some text is in
(interop failures result), or having to support many, many codesets
(trade one set of complexities for a bigger one).  Clearly it is
better to just use Unicode for text in Internet protocols.

It is also clear that we can't really have a decent one-size-fits-all
static Huffman coding table for text that may be written in any of
tens of scripts spanning ~100k codepoints.  Now, perhaps we can
encourage the world to use URIs not IRIs and so on, but really, that
would be a step backwards.

My proposal:

 - All text values in HTTP/2.0 that are also present in HTTP/1.1
should be sent as either UTF-8 or ISO-8859-1, with a one-bit tag to
indicate which it is.

   This pushes re-encoding to the ends, but it lets middleboxes
re-encode as well where they want or need to, and it gives us a nice
upgrade path.

 - All text values in HTTP/2.0 that are NOT also present in HTTP/1.1
should be sent *only* as UTF-8.
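
A rough Python sketch of how the one-bit tag could work follows.
Everything about the representation is an assumption made for
illustration: the proposal does not say where the bit lives on the
wire, and the constants `LATIN1`/`UTF8` and the functions
`decode_value` and `reencode_as_utf8` are hypothetical names.

```python
# Hypothetical values for the one-bit encoding tag.
LATIN1, UTF8 = 0, 1

def decode_value(tag: int, raw: bytes) -> str:
    """Decode a header value according to its one-bit encoding tag."""
    return raw.decode("utf-8" if tag == UTF8 else "iso-8859-1")

def reencode_as_utf8(tag: int, raw: bytes) -> tuple[int, bytes]:
    """Upgrade a legacy value to UTF-8, as an end or a middlebox might.

    Every ISO-8859-1 byte maps to a Unicode code point, so this
    direction cannot fail.
    """
    if tag == UTF8:
        return tag, raw
    return UTF8, raw.decode("iso-8859-1").encode("utf-8")

# A value as an HTTP/1.1-era sender might emit it (assumed example).
legacy = (LATIN1, "D\u00fcrst".encode("iso-8859-1"))
upgraded = reencode_as_utf8(*legacy)

# Both forms decode to the same text; only the tag and bytes changed.
print(decode_value(*legacy) == decode_value(*upgraded))
```

The point of the tag is that a receiver never has to guess: the bytes
are interpreted exactly one way, and anything along the path is free
to upgrade a value to UTF-8 without changing its meaning.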

Why UTF-8 and not some other encoding of Unicode?  Because I don't see
how UTF-16 or UTF-32 could help us here.  Other encodings seem even
less likely to be useful: sure, Punycode would be all ASCII, but it
wouldn't actually cause static Huffman coding to be useful.  At best
one can argue that UTF-8 penalizes some scripts with a 50% or 100%
expansion relative to script-specific codesets, so that we should
prefer UTF-16 or -32 for fairness reasons; let's not.
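
The 50% and 100% figures can be checked with a short Python sketch.
The sample strings and the legacy codecs (ISO-8859-5 for Cyrillic,
Shift_JIS for Japanese) are illustrative choices, not ones named in
this message, and `expansion` is a hypothetical helper.

```python
def expansion(text: str, legacy_codec: str) -> float:
    """Relative size increase of UTF-8 over a script-specific codeset."""
    legacy = len(text.encode(legacy_codec))
    utf8 = len(text.encode("utf-8"))
    return (utf8 - legacy) / legacy

# Cyrillic: 1 byte per letter in ISO-8859-5, 2 bytes in UTF-8.
cyrillic = expansion("\u041f\u0440\u0438\u0432\u0435\u0442", "iso8859_5")

# Kanji: 2 bytes per character in Shift_JIS, 3 bytes in UTF-8.
kanji = expansion("\u65e5\u672c\u8a9e", "shift_jis")

print(cyrillic, kanji)  # 1.0 (100%) and 0.5 (50%)

# The other side of the trade: UTF-16 doubles plain ASCII text.
ascii_utf8 = len("Content-Type".encode("utf-8"))
ascii_utf16 = len("Content-Type".encode("utf-16-le"))
print(ascii_utf8, ascii_utf16)  # 12 and 24
```

So UTF-16 would help some scripts only by penalizing the ASCII that
makes up most header text today.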

Nico
--