Re: Unicode sucks, get over it (Re: Delta Compression and UTF-8 Header Values)

Julian Reschke <julian.reschke@gmx.de> Mon, 11 February 2013 08:38 UTC

Message-ID: <5118AD61.6030003@gmx.de>
Date: Mon, 11 Feb 2013 09:35:45 +0100
From: Julian Reschke <julian.reschke@gmx.de>
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:17.0) Gecko/20130107 Thunderbird/17.0.2
MIME-Version: 1.0
To: Nico Williams <nico@cryptonector.com>
CC: Roberto Peon <grmocg@gmail.com>, Poul-Henning Kamp <phk@phk.freebsd.dk>, "\"Martin J. Dürst\"" <duerst@it.aoyama.ac.jp>, James M Snell <jasnell@gmail.com>, "ietf-http-wg@w3.org" <ietf-http-wg@w3.org>
References: <CAK3OfOgYi-=W_QGJywf3hQbFMkfWv-ceXiJbYEdWM3-iaefP4Q@mail.gmail.com>
In-Reply-To: <CAK3OfOgYi-=W_QGJywf3hQbFMkfWv-ceXiJbYEdWM3-iaefP4Q@mail.gmail.com>
Content-Type: text/plain; charset="UTF-8"; format="flowed"
Content-Transfer-Encoding: 7bit
Subject: Re: Unicode sucks, get over it (Re: Delta Compression and UTF-8 Header Values)
Archived-At: <http://www.w3.org/mid/5118AD61.6030003@gmx.de>

On 2013-02-10 23:45, Nico Williams wrote:
> On Sun, Feb 10, 2013 at 3:04 PM, Roberto Peon <grmocg@gmail.com> wrote:
>> Another place where we may need to know about normalization is for caching.
>> Does the lookup, etc. occur on the normalized form, or on the given data?
>>
>> All in all, utf-8 without addendum sucks for protocol work.
>
> Normalization is not a UTF-8 thing, it's a Unicode thing, and it's not
> really a Unicode thing either, but a result of our stupid, human
> scripts and their stupid collation and other rules.
>
> There is *nothing* that we can do for dealing with text that would do
> both of: a) meet the needs of our users, and b) not suck for string
> comparison, collation, and other such operations.
>
> In other words: all these arguments about how it sucks to deal with
> UTF-8 or Unicode are not useful arguments.  We have to deal with text
> in at least some parts of our protocols, and that means we have to
> deal with I18N.
>
> Worse, much worse than the problems Unicode brings with it, are the
> problems of having either no clue what codeset some text is in
> (interop failures result), or having to support many, many codesets
> (trade one set of complexities for a bigger one).  Clearly it is
> better to just use Unicode for text in Internet protocols.
>
> It is also clear that we can't really have a decent one-size-fits-all
> static Huffman coding table for text that may be written in any of
> tens of scripts spanning ~100k codepoints.  Now, perhaps we can
> encourage the world to use URIs not IRIs and so on, but really, that
> would be a step backwards.
>
> My proposal:
>
>   - All text values in HTTP/2.0 that are also present in HTTP/1.1
> should be sent as either UTF-8 or ISO8859-1, with a one-bit tag to
> indicate which it is.
> ...

Why do we need two options?

Best regards, Julian