Re: [I18ndir] Working Group Last Call: Structured Headers for HTTP

John C Klensin <john-ietf@jck.com> Wed, 05 February 2020 02:32 UTC

Date: Tue, 04 Feb 2020 21:32:25 -0500
From: John C Klensin <john-ietf@jck.com>
To: Patrik Fältström <patrik@frobbit.se>
cc: John R Levine <johnl@taugh.com>, i18ndir@ietf.org
Message-ID: <3E143C646E27AEB48F08B065@PSB>
In-Reply-To: <E3DA1665-DB13-46D2-9212-33E647D92716@frobbit.se>
References: <20200203173404.88EE813AA055@ary.qy> <E2361F8BA970A15043416C2D@PSB> <alpine.OSX.2.21.99999.374.2002031653540.31381@ary.qy> <D03AE38116EF15538E10CFAF@PSB> <7D31FE0A-D4EC-4096-83FE-97D2BF4908F5@frobbit.se> <alpine.OSX.2.21.99999.374.2002041007110.33467@ary.qy> <47AEE7D582019051ACF36647@PSB> <alpine.OSX.2.21.99999.374.2002041149130.34062@ary.qy> <4A65258034E64E1A97EFDF7A@PSB> <E3DA1665-DB13-46D2-9212-33E647D92716@frobbit.se>
Archived-At: <https://mailarchive.ietf.org/arch/msg/i18ndir/e74aCFNPChYfJprGLuIMQLA5kog>
Subject: Re: [I18ndir] Working Group Last Call: Structured Headers for HTTP


--On Tuesday, February 4, 2020 23:35 +0100 Patrik Fältström
<patrik@frobbit.se> wrote:

> On 4 Feb 2020, at 20:24, John C Klensin wrote:
> 
>>> I think we agree, normalization is a level up from what
>>> they're describing here.
>> 
>> Ack.
> 
> Ok, I have tried to really understand what they are doing
> here, and it feels as if Unicode was pasted in after they
> had designed the whole thing.

Indeed.  It feels as if they started with "all ASCII" and then
someone said "well, how about cases that need non-ASCII
characters", and they responded with a lot of circling around,
much of which amounts to "just use UTF-8".  That is something we
know is almost always inadequate or worse and often a sign of
deeper problems and lack of thought.

> Look for example at B.1, where they compare with JSON. They
> say one advantage of their format is that JSON does allow
> Unicode data, which gives interoperability issues, but they
> allow it themselves as well.
> 
> Then this is 3.3.3:
> 
>> Unicode is not directly supported in strings, because it
>> causes a number of interoperability issues, and - with few
>> exceptions - header values do not require it.
> 
> How do they know header values do not need it -- with a few
> exceptions?
> 
>> When it is necessary for a field value to convey non-ASCII
>> content, a byte sequence (Section 3.3.5) SHOULD be specified,
>> along with a character encoding (preferably [UTF-8]).
> 
> I think that the default encoding MUST be UTF-8 OR specified
> explicitly.
> 
> I further think it should be noted that comparison of
> parameter values is NOT specified in this base specification,
> as normalization might create non-interoperability. If needed,
> the specification of the header must say how comparison is
> managed.
> 
> I also think 4.2.7 should mention that the resulting sequence
> of the parsing of a binary structure might be a UTF-8 encoded
> string, with a reference to 3.3.3.
> 
> I.e. I find it hard to read (and understand why I
> misunderstood this at first) that the only mention of
> Unicode and UTF-8 is in "string", but the only thing it does
> is reference byte sequences, which in turn never talk about
> it, neither at serialization nor at deserialization. So where
> UTF-8 strings are described, they are not mentioned.
> 
> But ok...you are right.

Right about what?  We may be in agreement after all.  I think
there are at least three different issues here.

(1) I took Martin's note, especially the "in this day and age"
part, as suggesting that the document should encourage, rather
than discourage, non-ASCII strings in header field values.   I
suggested that, in most cases, discouraging such values was
entirely appropriate although it was reasonable to make
provision for them in special cases.

(2) Then we got to the question of whether non-ASCII strings
should be allowed in header field names at all.   I think we (at
least you, John Levine, and me) are in agreement that it is
almost certainly a bad idea.

(3) In the situations in which non-ASCII characters are allowed
in header field values, we are in agreement that the description
is inadequate in multiple ways.  They include that the spec
should either require UTF-8 everywhere (e.g., "encoding MUST be
Unicode in UTF-8") or the description / container syntax must
specify how and in what syntax the encoding (or "charset") is
specified.  They also include a requirement that the definitions
of any header fields defined according to this spec, at least
ones that allow non-ASCII characters, MUST include either an
"Internationalization Considerations" section or an equivalent.
And that section MUST either explain why comparison, searching,
or ordering of all or part of the header field value is never
going to be required or describe the mechanisms or conventions
needed to carry out the operations.
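
To make the comparison point concrete, here is a minimal sketch
of how two valid UTF-8 encodings of the same visible string can
fail a byte-for-byte comparison.  It is an illustration under
stated assumptions, not anything taken from the draft: the
helper name to_byte_sequence and the sample string are
hypothetical, and the colon-delimited base64 form is assumed
from the published RFC 8941 syntax, which may differ from the
draft under review.

```python
import base64
import unicodedata


def to_byte_sequence(text: str) -> str:
    """Carry text as UTF-8 inside a base64 byte sequence.

    Assumption: colon delimiters, as in the RFC 8941 syntax.
    """
    encoded = base64.b64encode(text.encode("utf-8")).decode("ascii")
    return ":" + encoded + ":"


# Hypothetical sample value: the same visible name in two forms.
composed = "F\u00e4ltstr\u00f6m"  # precomposed letters (NFC)
decomposed = unicodedata.normalize("NFD", composed)  # letter + combining mark

# Both serialize to well-formed byte sequences, but not the same one.
print(to_byte_sequence(composed))
print(to_byte_sequence(decomposed))

# Both are valid UTF-8, yet the raw bytes differ.
bytes_equal = composed.encode("utf-8") == decomposed.encode("utf-8")

# They match only after both sides agree on a normalization form,
# which is a choice the header field definition would have to make.
nfc_equal = (unicodedata.normalize("NFC", composed)
             == unicodedata.normalize("NFC", decomposed))

print(bytes_equal, nfc_equal)  # False True
```

Whether NFC, some other form, or no comparison at all is right
depends on the individual header field, which is why that
decision belongs in the field's own definition.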

For (3), I first understood you to be suggesting a discussion
of, e.g., normalization in the container/format document that
would apply to all header fields defined under it.  I think that
would be wrong because different header fields and their values
might require different treatment.  However, I certainly agree
that information has to be provided somewhere and that this
document is deficient in saying as little as it does about
strings, byte sequences, and non-ASCII characters without
spelling out the requirement for that type of information to be
specified somewhere.   I also think that the IETF must never
again produce a technical specification or BCP that deals with
or allows non-ASCII characters by saying the equivalent of "just
use UTF-8".

Does that put us closer together?

   john