Re: Unicode escape sequence | Re: draft-ietf-httpbis-header-structure-00, unicode range

Matthew Kerwin <matthew@kerwin.net.au> Wed, 14 December 2016 11:56 UTC

From: Matthew Kerwin <matthew@kerwin.net.au>
Date: Wed, 14 Dec 2016 21:53:45 +1000
To: Julian Reschke <julian.reschke@gmx.de>
Cc: Alexey Melnikov <alexey.melnikov@isode.com>, Poul-Henning Kamp <phk@phk.freebsd.dk>, Kari Hurtta <hurtta-ietf@elmme-mailer.org>, Ilari Liusvaara <ilariliusvaara@welho.com>, HTTP working group mailing list <ietf-http-wg@w3.org>, Poul-Henning Kamp <phk@varnish-cache.org>
Subject: Re: Unicode escape sequence | Re: draft-ietf-httpbis-header-structure-00, unicode range
Archived-At: <http://www.w3.org/mid/CACweHNBYf-UuxsKNxYakt22rgku9xEP4YK4yL2R+=vMf_uB2Vg@mail.gmail.com>

On 14 December 2016 at 20:46, Julian Reschke <julian.reschke@gmx.de> wrote:

> On 2016-12-14 11:38, Alexey Melnikov wrote:
>
>> ...
>>
>>> Has this ever been used in a protocol?
>>>
>> Some:
>> https://datatracker.ietf.org/doc/rfc5137/referencedby/
>>
>
> Actually, one.
>
> This was also extensively used in other RFCs without referencing the BCP.
>>
>
> Example?
>
> The reason why I'm asking is because the notation
>
>  \u'HHHH' or \u'HHHHHH'
>
> strikes me as:
>
> 1) verbose
>
> 2) potentially problematic because of the use of the single quote (which
> might require extra escaping in some contexts)
>
>
Yes.

BCP 137 (RFC 5137) says that "forms that use explicit string delimiters are
generally preferred over other alternatives. In many contexts, symmetric
paired delimiters are easier to recognize and understand than visually
unrelated ones." So brackets are good.

And while it advises against using Perl's \x{NNNN...} syntax (because of
potential ambiguities with two-digit hex codes), it doesn't say anything at
all about \u{N...}.

Curly braces cost 15+14 bits in HPACK's Huffman code, parentheses 10+10
(incidentally cheaper than single quotes, which are 11+11). It's also
convenient that lowercase 'u' is one bit cheaper than lowercase 'x'.
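
As a rough check of those numbers, here is a small sketch that adds up the
fixed overhead of each candidate notation. The code lengths are copied from
the Huffman table in RFC 7541, Appendix B; the 19 bits for the backslash come
from that table and are not mentioned above. The names HPACK_BITS and
overhead_bits are made up for this illustration, and the hex digits
themselves are not counted.

```python
# Huffman code lengths in bits, copied from RFC 7541 Appendix B.
HPACK_BITS = {
    "\\": 19,
    "u": 6,
    "x": 7,
    "'": 11,
    "(": 10,
    ")": 10,
    "{": 15,
    "}": 14,
}


def overhead_bits(letter: str, opener: str, closer: str) -> int:
    """Bits spent on backslash + letter + delimiters, excluding hex digits."""
    return sum(HPACK_BITS[ch] for ch in ("\\", letter, opener, closer))


if __name__ == "__main__":
    for letter, opener, closer in [("u", "(", ")"), ("u", "'", "'"),
                                   ("u", "{", "}"), ("x", "{", "}")]:
        label = "\\" + letter + opener + "..." + closer
        print(label, overhead_bits(letter, opener, closer))
```

Under these assumptions \u(...) has the smallest fixed overhead of the four
forms compared, though the backslash dominates the cost in every case.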

I don't think parentheses are at too much risk of needing escaping, so it
seems like the solution that goes with BCP 137, and compresses alright with
HPACK, is:

    %x5c.75.28 1*6HEXDIG %x29

It's still a little bit clunky for things like "Stra\u(df)e", but not so
bad for emoji "\u(1f602)" and somewhere in between for Hiragana
"\u(3053)\u(3093)\u(306b)\u(3064)".
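
To make the proposal concrete, here is a minimal sketch of a decoder for the
\u(...) form suggested above. The function name decode_escapes is
hypothetical, and leaving out-of-range values and surrogates untouched is an
assumption of this sketch; nothing in the thread specifies how a receiver
should treat them.

```python
import re

# Mirrors the proposed syntax: backslash, 'u', '(', 1 to 6 hex digits, ')'.
ESCAPE = re.compile(r"\\u\(([0-9A-Fa-f]{1,6})\)")


def decode_escapes(text: str) -> str:
    """Replace each \\u(H...) escape with the code point it names."""

    def replace(match: re.Match) -> str:
        codepoint = int(match.group(1), 16)
        # Assumption: values beyond U+10FFFF and surrogates are left as-is.
        if codepoint > 0x10FFFF or 0xD800 <= codepoint <= 0xDFFF:
            return match.group(0)
        return chr(codepoint)

    return ESCAPE.sub(replace, text)


if __name__ == "__main__":
    print(decode_escapes(r"Stra\u(df)e"))
    print(decode_escapes(r"\u(1f602)"))
```

Six hex digits can express values above U+10FFFF, so a real specification
would have to say what happens in that case.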

Cheers



> Best regards, Julian
>
> PS: and, as a nit, it's strange that the syntax uses delimiters but
> doesn't allow sequences of 1 to 3 HEXDIGs...
>
>
Having just written "\u(df)" I kind of understand; it really feels like
I'm describing an octet rather than a codepoint. I don't think there's a
*technical* reason, though. Is it alright to see "\u(9)" or an equivalent
in text?
-- 
  Matthew Kerwin
  http://matthew.kerwin.net.au/