Re: [I18ndir] Working Group Last Call: Structured Headers for HTTP

"John R Levine" <johnl@taugh.com> Wed, 05 February 2020 15:02 UTC

Date: Wed, 05 Feb 2020 10:02:27 -0500
Message-ID: <alpine.OSX.2.21.99999.374.2002051000100.38707@ary.qy>
From: John R Levine <johnl@taugh.com>
To: John C Klensin <john-ietf@jck.com>
Cc: Patrik Fältström <patrik@frobbit.se>, i18ndir@ietf.org, art-ads@ietf.org, "Murray S. Kucherawy" <superuser@gmail.com>
In-Reply-To: <74CA67892D7FCDB725238B1B@PSB>
References: <20200203173404.88EE813AA055@ary.qy> <E2361F8BA970A15043416C2D@PSB> <alpine.OSX.2.21.99999.374.2002031653540.31381@ary.qy> <D03AE38116EF15538E10CFAF@PSB> <7D31FE0A-D4EC-4096-83FE-97D2BF4908F5@frobbit.se> <alpine.OSX.2.21.99999.374.2002041007110.33467@ary.qy> <47AEE7D582019051ACF36647@PSB> <alpine.OSX.2.21.99999.374.2002041149130.34062@ary.qy> <4A65258034E64E1A97EFDF7A@PSB> <E3DA1665-DB13-46D2-9212-33E647D92716@frobbit.se> <3E143C646E27AEB48F08B065@PSB> <572D8717-3545-4F37-8EC9-194D6A5A0E3A@frobbit.se> <74CA67892D7FCDB725238B1B@PSB>
User-Agent: Alpine 2.21.99999 (OSX 374 2019-10-27)
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="0-369018697-1580914947=:38707"
Archived-At: <https://mailarchive.ietf.org/arch/msg/i18ndir/f3Qb0zkT01w33eRVscjNeaEKWhY>
Subject: Re: [I18ndir] Working Group Last Call: Structured Headers for HTTP

On Wed, 5 Feb 2020, John C Klensin wrote:

> Executive summary for Murray and any current ART ADs who have
> not been following the I18N Directorate (i18ndir) list:  Martin
> Dürst posted a suggestion to the list several days ago that
> suggested we take a careful look at
> draft-ietf-httpbis-header-structure-15.  The request has led to
> a lively discussion among John Levine, Patrik Fältström and
> myself, during which I, at least, have learned a great deal
> especially about how difficult it is to find just the right
> language and terminology to talk about these areas even among
> experts with decades of experience working together on them.
> This note is, I hope, at or near the end of the thread rather
> than at the beginning or middle and ends with a few paragraphs
> of draft recommendations to you as ADs.   You can probably
> skip over the rest of it and the earlier entries in the thread
> unless you are interested in either how we got here or a deeper
> understanding of the issues.

One other point worth mentioning: we are not asking them to invent new 
stuff.  For example, this draft has base64 encoded i18n text which needs a 
tag for the encoding.  MIME already does that, so they can reuse the 
same syntax.
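
As a rough illustration of what "MIME already does that" means, here is a 
minimal Python sketch of an RFC 2047 encoded-word, where the charset and 
the base64 ("B") encoding are both tagged inline.  The helper names 
encode_word and decode_word are made up for this example, the sketch 
ignores the 75-character limit on encoded-words, and it is not a proposal 
for the exact syntax the draft should adopt.

```python
import base64
from email.header import decode_header


def encode_word(text: str, charset: str = "utf-8") -> str:
    """Build an RFC 2047 encoded-word using the 'B' (base64) encoding.

    Hypothetical helper for illustration; it does not enforce the
    75-character limit that RFC 2047 places on an encoded-word.
    """
    payload = base64.b64encode(text.encode(charset)).decode("ascii")
    # Form: =?charset?encoding?encoded-text?=
    return f"=?{charset}?B?{payload}?="


def decode_word(word: str) -> str:
    """Decode a single encoded-word, using the charset tag it carries."""
    raw, charset = decode_header(word)[0]
    return raw.decode(charset)


word = encode_word("Fältström")
print(word)               # ASCII-only token carrying charset and encoding tags
print(decode_word(word))  # round-trips back to the original text
```

The receiver never has to guess the charset, because the tag travels with 
the encoded text.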

R's,
John

> --On Wednesday, February 5, 2020 11:37 +0100 Patrik Fältström
> <patrik@frobbit.se> wrote:
>
>> On 5 Feb 2020, at 3:32, John C Klensin wrote:
>>
>>>> But ok...you are right.
>>>
>>> Right about what?
>>
>> That specifics about normalization should not be in a document
>> that describes a container format.
>
> Ah.  See below.
>
>>> We may be in agreement after all.  I think there are at least
>>> three different issues here.
>>
>> I hope so as well.
>>
>>> (1) I took Martin's note, especially the "in this day and age"
>>> part, as suggesting that the document should encourage,
>>> rather than discourage, non-ASCII strings in header field
>>> values.   I suggested that, in most cases, discouraging such
>>> values was entirely appropriate although it was reasonable to
>>> make provision for them in special cases.
>>>
>>> (2) Then we got to the question of whether non-ASCII strings
>>> should be allowed in header field names at all.   I think we
>>> (at least you, John Levine, and me) are in agreement that it is
>>> almost certainly a bad idea.
>>
>> I think restricting unicode to parameter values (which is even
>> more strict than header field values, as you might have
>> key/value pairs as part of a header field value in the draft)
>> is correct without major redesign of the whole thing.
>
> Agree about the principle.   I wonder if, some day, that might
> turn out to be over-restrictive.   On the other hand, this is
> about "structured headers" and not unstructured ones, so a
> parameter restriction is probably reasonable.  As I read your
> note and think about it more, the core issue is probably
> different; see under "my question" below.
>
>>> (3) In the situations that non-ASCII characters are allowed
>>> in header field values, we are in agreement that the
>>> description is inadequate in multiple ways.  They include
>>> that the spec should either require UTF-8 everywhere (e.g.,
>>> "encoding MUST be Unicode in UTF-8") or the description /
>>> container syntax must specify how and in what syntax the
>>> encoding (or "charset") is specified. They also include a
>>> requirement that any header fields defined according to this
>>> spec, at least ones that allow non-ASCII characters, MUST be
>>> required to include in those definitions.
>>
>> Agree.
>>
>>> Those definitions would be required to have either an
>>> "Internationalization Considerations" section or an
>>> equivalent. And that section MUST either explain why
>>> comparison, searching, or ordering of all or part of the
>>> header field value is never going to be required  or describe
>>> the mechanisms or conventions needed to carry out the
>>> operations.
>>
>> Yes.
>>
>>> For (3), I first understood you to be suggesting a discussion
>>> of, e.g., normalization in the container/format document
>>> that would apply to all header fields defined under it.  I
>>> think that would be wrong because different header fields and
>>> their values might require different treatment.  However, I
>>> certainly agree that information has to be provided somewhere
>>> and that this document is deficient in saying as little as it
>>> does about strings, byte sequences, and non-ASCII characters
>>> without spelling out the requirement for that type of
>>> information to be specified somewhere.   I also think that
>>> the IETF must never again produce a technical specification
>>> or BCP that deals with or allows non-ASCII characters by
>>> saying the equivalent of "just use UTF-8".
>>>
>>> Does that put us closer together?
>>
>> Yes.
>>
>> My question was whether or not this spec:
>>
>> 1. Should point out normalization is recommended as only UTF-8
>> is not enough.
>
> Again, aha.   This is where my thinking has evolved considerably
> in the last five or six years.  Well before that, I would have
> said (and did) that saying "just use UTF-8" was not enough but
> that normalization was sufficient.  Now, after the
> non-decomposing character discovery and analysis, more complex
> debates about confusable characters, Asmus's "troublesome"
> efforts, vomiting cowboys and their various friends and
> relatives, and many other issues, I would say that recommending
> UTF-8 and normalization is not enough either.
>
> The question of whether this document should specify what should
> be done or should specify that any definition of a particular
> field written to be consistent with it must specify that is
> separate.  I think we are, at this point, in agreement that it
> should be the second.
>
>> 2. Should talk about Unicode "more" when looking at byte
>> sequences. I.e. it does not talk about it at all when looking
>> at byte sequences. It talks about non-ascii in one place, in
>> the definition of "text" and says "non-ascii is not text, it
>> is binary".
>
> And I didn't catch that until now.  That is exactly what got us into
> trouble with the DNS, i.e., the claim that if a string was
> all-ASCII (in that case, defined as none of the octets having a
> non-zero leading bit), then it obeyed strict,
> carefully-spelled-out rules but, if it wasn't, it was strictly a
> string of octets with no defined semantics at all.  That means
> two strings don't match unless they are bit-identical and that
> any sorting operations effectively have to be on the full string
> (not octet by octet and not considering the variable length for
> different code points encoding of things like UTF-8) and that
> maybe there is a big-endian versus little-endian question.  Even
> without concerns about normalization (much less the other things
> suggested above), as soon as one talks about Unicode (in UTF-8
> or otherwise), one is talking about "characters", not byte (or
> bit) sequences because any actual use or interpretation will
> depend on code points, not bits or octets.   So the
> recommendation to use UTF-8 and the recommendation to treat
> things as byte sequences are contradictory at a fundamental
> technical level and the penultimate paragraph of Section 3.3.3
> is bad news.
>
> I haven't studied things carefully enough, but there is another
> fundamental problem with this document.   RFC 8187 already
> specifies a way to do character handling in HTTP header fields,
> one that is consistent with my/our comments about encoded words
> several days ago.  It also specifies that UTF-8 is mandatory to
> implement.  That this document does not either reference or
> update that one but is inconsistent with it in suggesting
> treating character sequences as simply byte sequences seems to
> me to be a grave omission.
>
> Martin's "in this day and age" comment applies to the
> questions/comments above even if it didn't apply as an argument
> for more internationalization.  Maybe that is what he meant in
> the first place-- he has been surprisingly silent since this
> three-way discussion started.
>
> Is that about right?
>
> If it is, I think it is time for two directorate
> recommendations.  One is that this document, with respect to
> i18n issues, is not ready for IETF LC.  It may even be that the
> HTTP WG needs to clean up its act and, for i18n issues, may be
> in need of adult supervision because they are contradicting
> existing standards track documents within their own scope,
> apparently without noticing.
>
> The second is that it may be time to update the recommendations
> about character sets, Internationalization Considerations, etc.
> I don't know whether the revised recommendation should find
> expression in an IESG statement, a new BCP, or some other form
> of guidance or direction but it is now clear to me that, without
> it, the IETF is just going to dig itself in deeper and deeper...
> even more so if there is not a functioning directorate or
> equivalent than if there is.   Those recommendations would be,
> approximately:
>
> (i) For any document that discusses or allows the use of
> non-ASCII characters, UTF-8 is mandatory to implement.  Other
> encodings of Unicode or use of other coded character sets may be
> justified by circumstances but, if they are allowed, the CCS and
> any additional encoding information required with it must be
> identified, e.g., by the mechanisms defined in RFCs 2231 and
> 8187.
>
> (ii) For any document that allows or specifies the use of
> non-ASCII characters in a protocol context (even in what is
> nominally free text), neither "use UTF-8" nor "Use UTF-8 and
> specify a normalization" are sufficient.  The presence of such
> statements without further explanation is usually an indication
> that an author or WG has paid insufficient attention to i18n
> issues but is, instead and intentionally or not, trying to blow
> them off.  In almost all cases that will be visible to users,
> character strings will eventually be compared to others, sorted,
> searched and/or rendered.  For one or more languages or scripts,
> each of those operations requires treatment that goes beyond
> simply examining octets.   Documents allowing such strings must
> either specify how those issues are addressed or must explain
> why they are not applicable.  For characters or strings that are
> protocol elements not visible to users, specific justification
> is required if non-ASCII characters are to be allowed.
>
> best,
>   john
>
>
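
The point in the quoted text about byte sequences versus characters can 
be shown with a small Python sketch.  It assumes one illustrative pair of 
strings (a precomposed and a decomposed "é"); the variable names are made 
up for the example.  It only demonstrates that octet comparison and 
character comparison disagree, not that normalization is sufficient, 
which the discussion above argues it is not.

```python
import unicodedata

# Two spellings of the same user-visible character "é".
composed = "\u00e9"      # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" followed by U+0301 COMBINING ACUTE ACCENT

# Treated as opaque byte sequences, the UTF-8 encodings differ.
composed_bytes = composed.encode("utf-8")
decomposed_bytes = decomposed.encode("utf-8")
bytes_equal = composed_bytes == decomposed_bytes

# Treated as characters and normalized to NFC, they compare equal.
nfc_equal = (
    unicodedata.normalize("NFC", composed)
    == unicodedata.normalize("NFC", decomposed)
)

print("byte-identical:", bytes_equal)
print("equal after NFC:", nfc_equal)
print("octets:", len(composed_bytes), "vs", len(decomposed_bytes))
```

A field definition that says "use UTF-8" but compares octets will treat 
these two values as different, which is why each definition needs to say 
how comparison is done.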

Regards,
John Levine, johnl@taugh.com, Taughannock Networks, Trumansburg NY
Please consider the environment before reading this e-mail. https://jl.ly