Re: Benjamin Kaduk's Discuss on draft-ietf-httpbis-header-structure-18: (with DISCUSS and COMMENT)

Benjamin Kaduk <kaduk@mit.edu> Wed, 20 May 2020 15:42 UTC

Return-Path: <ietf-http-wg-request+bounce-httpbisa-archive-bis2juki=lists.ie@listhub.w3.org>
X-Original-To: ietfarch-httpbisa-archive-bis2Juki@ietfa.amsl.com
Delivered-To: ietfarch-httpbisa-archive-bis2Juki@ietfa.amsl.com
Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id 808613A0AD5 for <ietfarch-httpbisa-archive-bis2Juki@ietfa.amsl.com>; Wed, 20 May 2020 08:42:26 -0700 (PDT)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -2.649
X-Spam-Level:
X-Spam-Status: No, score=-2.649 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, HEADER_FROM_DIFFERENT_DOMAINS=0.249, MAILING_LIST_MULTI=-1, RCVD_IN_MSPIKE_H4=0.001, RCVD_IN_MSPIKE_WL=0.001, SPF_PASS=-0.001, URIBL_BLOCKED=0.001] autolearn=unavailable autolearn_force=no
Received: from mail.ietf.org ([4.31.198.44]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id jrac87hmJJkb for <ietfarch-httpbisa-archive-bis2Juki@ietfa.amsl.com>; Wed, 20 May 2020 08:42:24 -0700 (PDT)
Received: from lyra.w3.org (lyra.w3.org [128.30.52.18]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id 33D7C3A0AD3 for <httpbisa-archive-bis2Juki@lists.ietf.org>; Wed, 20 May 2020 08:42:23 -0700 (PDT)
Received: from lists by lyra.w3.org with local (Exim 4.92) (envelope-from <ietf-http-wg-request@listhub.w3.org>) id 1jbQoW-0002s4-EI for ietf-http-wg-dist@listhub.w3.org; Wed, 20 May 2020 15:39:24 +0000
Resent-Date: Wed, 20 May 2020 15:39:24 +0000
Resent-Message-Id: <E1jbQoW-0002s4-EI@lyra.w3.org>
Received: from mimas.w3.org ([128.30.52.79]) by lyra.w3.org with esmtps (TLS1.3:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.92) (envelope-from <kaduk@mit.edu>) id 1jbQoV-0002rR-SE for ietf-http-wg@listhub.w3.org; Wed, 20 May 2020 15:39:23 +0000
Received: from outgoing-auth-1.mit.edu ([18.9.28.11] helo=outgoing.mit.edu) by mimas.w3.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.92) (envelope-from <kaduk@mit.edu>) id 1jbQoT-0003mZ-Go for ietf-http-wg@w3.org; Wed, 20 May 2020 15:39:23 +0000
Received: from kduck.mit.edu ([24.16.140.251]) (authenticated bits=56) (User authenticated as kaduk@ATHENA.MIT.EDU) by outgoing.mit.edu (8.14.7/8.12.4) with ESMTP id 04KFcx5X018901 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Wed, 20 May 2020 11:39:01 -0400
Date: Wed, 20 May 2020 08:38:58 -0700
From: Benjamin Kaduk <kaduk@mit.edu>
To: Mark Nottingham <mnot@mnot.net>
Cc: The IESG <iesg@ietf.org>, draft-ietf-httpbis-header-structure@ietf.org, httpbis-chairs@ietf.org, HTTP Working Group <ietf-http-wg@w3.org>, Tommy Pauly <tpauly@apple.com>
Message-ID: <20200520153858.GZ58497@kduck.mit.edu>
References: <158985680600.32294.14997877272521602557@ietfa.amsl.com> <CFC9B0CA-6125-4705-A13E-F2260F3EC3A5@mnot.net> <20200519170036.GN58497@kduck.mit.edu> <FB915609-4CD1-416A-98AD-3950BF3BE09C@mnot.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <FB915609-4CD1-416A-98AD-3950BF3BE09C@mnot.net>
User-Agent: Mutt/1.12.1 (2019-06-15)
X-W3C-Hub-Spam-Status: No, score=-7.2
X-W3C-Hub-Spam-Report: BAYES_00=-1.9, RCVD_IN_DNSWL_MED=-2.3, SPF_HELO_NONE=0.001, SPF_PASS=-0.001, W3C_AA=-1, W3C_IRA=-1, W3C_WL=-1
X-W3C-Scan-Sig: mimas.w3.org 1jbQoT-0003mZ-Go b7ee0adfffe5d54d72fc03215eb8f1e6
X-Original-To: ietf-http-wg@w3.org
Subject: Re: Benjamin Kaduk's Discuss on draft-ietf-httpbis-header-structure-18: (with DISCUSS and COMMENT)
Archived-At: <https://www.w3.org/mid/20200520153858.GZ58497@kduck.mit.edu>
Resent-From: ietf-http-wg@w3.org
X-Mailing-List: <ietf-http-wg@w3.org> archive/latest/37680
X-Loop: ietf-http-wg@w3.org
Resent-Sender: ietf-http-wg-request@w3.org
Precedence: list
List-Id: <ietf-http-wg.w3.org>
List-Help: <https://www.w3.org/Mail/>
List-Post: <mailto:ietf-http-wg@w3.org>
List-Unsubscribe: <mailto:ietf-http-wg-request@w3.org?subject=unsubscribe>

Hi Mark,

On Wed, May 20, 2020 at 06:21:05PM +1000, Mark Nottingham wrote:
> Hi Ben,
> 
> See below. The latest commit referred to is <https://github.com/httpwg/http-extensions/commit/83f3ab6c0>.
> 
> > On 20 May 2020, at 3:00 am, Benjamin Kaduk <kaduk@mit.edu> wrote:
> >> 
> 
> >> I think the solution here is to change the last paragraph of Notational Conventions to:
> >> 
> >> """
> >> For serialization to HTTP fields, the ABNF illustrates the expected wire representation with as much fidelity as possible, and the algorithms define the recommended way to produce them. Implementations MAY vary from the specified behavior so long as the output is still correctly handled by the parsing algorithm.
> >> """
> >> 
> >> Does that work for you, Ben?
> > 
> > This makes the parsing algorithms normative for both parsing and
> > serialization, which alleviates the concerns about skew.
> > 
> > I do agree with Julian that making the (minor!) change to the ABNF to
> > remove that axis of skew seems worthwhile, though I do not think I can make
> > that a Discuss point (given that I, like him, would prefer to keep the ABNF
> > as-is over removing it entirely as you propose).
> 
> See latest commit.

I see the update to make the parsing algorithm fully normative; is there
also something on the empty-list ABNF front that I'm failing to find?

> >>> Additionally, when Section 4.2.3.2 discusses parsing parameter map keys,
> >>> the handling for duplicate map key names is specified as overwriting the
> >>> previous value, in contrast to the prose description (Section 3.1.2)
> >>> that describes these keys as "unique within the scope [of] the
> >>> Parameters they occur within".  (While dictionary key names might be
> >>> expected to have a similar behavior, I did not find conflicting text for
> >>> that behavior.)
> >> 
> >> That's because there's a difference between the underlying data model (specified in section 3) and the algorithms for how we get them into and out of textual HTTP headers (section 4). There might be another serialisation of the data model in the future that has a different way of handling duplicates.
> > 
> > Perhaps I'm just hung up on the "intentionally strict processing" preface,
> > but why allow duplicates at all vs. making them an error?
> 
> Duplicates are surprisingly common in the wild; I suspect because people apply policy to add a header at multiple points, and they end up duplicating the work. We had a discussion about whether to hard fail in this case or to swallow the duplicate, and the outcome was the latter; the important part was that behaviour was consistent, not that it's universally strict.

Okay.  Sometimes it's nice if a summary of the discussion points ends up in
the document (but not knowing the discussion points myself I can't say more
for this particular case).

> >>> (1) What aspect(s) of structured field processing are case (in)sensitive?
> >>> The only mentions I see of case sensitivity are in Section 4.2 discussing
> >>> header field names and (implicitly) Section 4.2.2 discussing a
> >>> "character-for-character" comparison of dictionary key names, but of
> >>> course we cite RFC 5234 for ABNF, which uses case-insensitive matching.
> >>> On the other hand, base64 encoding requires case sensitivity for
> >>> successful round-tripping of arbitrary binary data.
> >> 
> >> As per S 2, it's up to each field to constrain the syntax of different types as appropriate to their semantics. The only exception is keys in dictionaries and parameters, which are explicitly lowercase (to avoid differences of only case there).
> > 
> > Do we want to call out case in § 2 as something that structured field
> > specifications should consider in their definition?  It's a bit of a
> > nitpicky details, admittedly, but it has caused enough problems in the past
> > that it may be worth it.
> 
> Makes sense; see latest commit.
> 
> > Also, could you help me find where we explicitly use lowercase
> > dictionary/parameter keys?  I don't remember seeing that.
> 
> Key's ABNF uses lcalpha, and its parsing and serialisation algorithms require it.

Ah, thanks; I must have glossed over that while reading.

> >>> (2) One of the stated goals of this document is to define intentionally
> >>> strict processing rules, but there are several places where we could
> >>> have removed additional optionality but elected to not do so.  What is
> >>> the criterion for "too far" towards strictness?  For example, Section
> >>> 4.2.7 gives optionality with respect to base64 padding (see COMMENT).
> >> 
> >> It's a judgement call in each case. As discussed below, we were lax on base64 due to common implementation limitations. In other cases, e.g., whitespace handling, it was to allow use with some existing headers, should one choose to, and because some people are still going to be creating these headers by hand (e.g., in .htaccess on Apache).
> >> 
> >> I'm happy to chat about it, but why is this last one a DISCUSS?
> > 
> > If there's a specific rule, it would need to be applied consistently.
> > Given that you say "judgement call in each case" that may be moot (in the
> > US sense), though I would have been happier if there was a more intentional
> > rule.
> > 
> > Also, since its vaguely topical here (but not part of the Discuss), you
> > note that several details were chosen so as to allow use with existing
> > headers.  The follow-up discussion downthread clarifies that, yes, this
> > spec itself only explicitly applies to new field specifications, but that
> > doesn't stop someone from noticing that existing fields have a compatible
> > encoding and can reuse the same machinery.  While I don't know of a
> > requirement for it, I'm used to seeing a mention in the document of the
> > motivation for such details, so that it puts everyone on more level ground
> > and doesn't require individual cleverness (and duplicated analysis work!)
> > to reach that conclusion.  Is there a reason that we don't want to mention
> > this possibility of parser/serializer reuse for existing fields in the
> > document itself?  I would find it helpful to understand the motivation for
> > certain details, though I concede that I may be unusual in that regard and
> > not part of the target audience.
> 
> This may help:
>   https://mnot.github.io/I-D/binary-structured-headers/
> 
> Especially Section 4 and Appendix A. 

Indeed, thanks for the pointer.

> It's not adopted, but so far the response has been pretty positive IMO.  
> 
> We're definitely doing a balancing act here; we didn't want to be constrained to being interoperable with all existing headers (or even the majority), because that would have made SF much more complex (given how far over the map they are; there are trivial differences in things like whitespace handling, quoting and other aspects from header to header). OTOH we did want to be able to back port common ones without too much effort. There's also some history; we wanted to be very very clear that these algorithms can't just be used on existing headers without some accommodation for values that fall out of the SF serialisation (as BSF has).
> 
> >>> What's the motivation for "MUST omit values of boolean true" (in
> >>> parameters and dictionaries)?  It seems to make the output rules more
> >>> complicated without a significant gain in encoding size.
> >> 
> >> It allows some existing HTTP headers to be treated as structured for the purposes of serialisation.
> > 
> > The downthread discussion also made the claim that this is due to a desire
> > to have exactly one way to serialize.  But what's the good in having
> > exactly one way to serialize if you don't enforce that at the receiver?
> > (Compatibility with already-defined header fields, no doubt, but it does
> > kind of undercut some of the stated goals of this document.)
> 
> Sorry, I lost state. Where did you get that impression?

Julian said it at
https://lists.w3.org/Archives/Public/ietf-http-wg/2020AprJun/0172.html and
I didn't see anyone jumping up to correct it.

> >>> Section 3.1
> >>> 
> >>>  An empty List is denoted by not serializing the field at all.
> >>> 
> >>> So any structured field definition that allows for an empty list has to
> >>> define its semantics to be equivalent to the semantics used for when the
> >>> peer does not support that structured field at all?
> >> 
> >> Yes; this point was discussed (a lot, if memory serves).
> > 
> > This seems like the sort of (potentially security relevant) detail that's
> > worth noting explicitly.
> 
> See latest commit.
> 
> >>>  A header field whose value is defined as a List of Inner Lists with
> >>>  Parameters at both levels could look like:
> >>> 
> >>> Is this also an inner list of strings?
> >> 
> >> It could be.
> > 
> > I mean, this particular example looks to my eye to be an inner list of
> > strings.  The previous example specifically called out that its example
> > used an inner list of strings, so a reader might notice the lack of such
> > callout and wonder if this one is supposed to be different.
> 
> Right, but "A header field whose value is defined as a List of Inner Lists of Strings with Parameters" is quite a mouthful.

Okay.

> >>>  6.  Return output_string converted into an array of bytes, using
> >>>      ASCII encoding [RFC0020].
> >>> 
> >>> This implicitly assumes that output_string contains only characters
> >>> representable in ASCII.  I believe this is currently true based on the
> >>> relevant ABNF constructions and serialization procedures, but it still
> >>> feels like a needless dangling edge case.
> >> 
> >> Why? HTTP fields are already strongly encouraged to be ASCII.
> > 
> > If someone tries to hand me an output_string for which this encoding fails,
> > am I expected to fail to emit the entire structured field?
> 
> Yes, but as you point out, the other algorithms that feed into this assure that it's already ascii.
> 
> (I'm beginning to appreciate that the downside of writing an algorithmic specification is that reviewers treat it like a code review)

Bugs and edge cases are bugs and edge cases wherever they appear...
but if you want an actual code review you could check out
https://mailarchive.ietf.org/arch/msg/netmod/m7salwQonJAwDYlNXfHOVTlTni8/
and
https://mailarchive.ietf.org/arch/msg/netmod/1fYZVgwskR3pH2cYzpWAH8lI0to/

> >>> Section 4.2
> >>> 
> >>>  Strings split across multiple field lines will have unpredictable
> >>>  results, because comma(s) and whitespace inserted upon combination
> >>>  will become part of the string output by the parser.  Since
> >>>  concatenation might be done by an upstream intermediary, the results
> >>>  are not under the control of the serializer or the parser.
> >>> 
> >>> This seems to have the implication that the results are unpredictable
> >>> even if a serializer and parser collaborate to use a procedure that, in
> >>> disregard of this specification, splits individual list and/or
> >>> dictionary entries across multiple field lines, due to the potential for
> >>> an intermediary that is not complicit in the deviation from the spec.
> >>> In some sense, that might be the key implication of this statement, in
> >>> which case it's surprising to not see it spelled out more clearly.
> >> 
> >> Do you have a suggestion?
> > 
> > I'd consider adding a clause at the end, "even when the serializer and
> > parser are under the control of the same party", though that may have flaws
> > of its own.
> 
> I'll give it a try. See latest commit.

Thanks.  I don't see any fatal problems with it yet...

> >>> Section 4.2.1.2
> >>> 
> >>>  3.  While input_string is not empty:
> >>> 
> >>>      1.  Discard any leading SP characters from input_string.
> >>> 
> >>>      2.  If the first character of input_string is ")":
> >>> 
> >>> Don't we need to check if the input_string is empty again after
> >>> consuming leading SP but before attempting to consume anything else?
> >> 
> >> In that case, it will fall through to (3) - running Parsing an Item - and fail there.
> > 
> > If we ask people to inspect the first character of an empty string, is
> > everybody going to get it right and avoid invoking undefined behavior in
> > their implementation language of choice?  To my eye there is some risk of
> > security-relevant implementation errors here.
> 
> We rely primarily on the test suite for avoiding interpretation problems like this. In general, we're trying to write pseudo-code; the expectation is that the implementer knows their language better than we do (backed up by the test suite).

I don't know how likely the test suite is to expose various kinds of
problematic behavior that may not reflect themselves as bad content in the
immediately next response, but I note this is a non-blocking comment and we
shouldn't spend too much time on it.

> >>> Section 4.2.2
> >>> 
> >>>  Given an ASCII string as input_string, return an ordered map whose
> >>>  values are (item_or_inner_list, parameters) tuples. input_string is
> >>> 
> >>> Should we say anything about the map keys?
> >> 
> >> I'm struggling to come up with text that is worth the extra space taken; "keys that are keys" is hardly informative, and going into more detail seems to be re-specifying in the summary.
> > 
> > "an ordered map whose keyed values are (item_or_inner_list, parameters)
> > tuples", maybe?
> 
> I don't think that adds anything.

Okay.

> >>> Section 6
> >>> 
> >>> It seems worth mentioning the handling for duplicated key names (e.g.,
> >>> in parameters and dictionaries) w.r.t. overwrite or must-be-unique, and
> >>> how there have been previous vulnerabilities relating to different
> >>> implementations choosing "first one wins" vs. "last one wins".
> >> 
> >> That doesn't seem to apply to a correct implementation, only to headers that *aren't* structured fields.
> > 
> > It's still motivation for why we are making the choices we did and a
> > benefit that structured headers have over the existing mechanisms.
> 
> Right, but that doesn't seem appropriate in Security Considerations; it's more Introduction / motivating material.
> 
> > Also, it seems to explicitly apply to parameter map keys (per the earlier
> > discussion).
> 
> I've added a note to this effect in the Dictionary and Parameter parsing algorithms; see latest commit.

Thanks for that and the other updates,

Ben