Re: Benjamin Kaduk's Discuss on draft-ietf-httpbis-header-structure-18: (with DISCUSS and COMMENT)

Mark Nottingham <mnot@mnot.net> Wed, 20 May 2020 08:24 UTC

Resent-Date: Wed, 20 May 2020 08:21:30 +0000
Resent-Message-Id: <E1jbJyk-0001Fe-JX@lyra.w3.org>
Content-Type: text/plain; charset="utf-8"
Mime-Version: 1.0 (Mac OS X Mail 13.4 \(3608.80.23.2.2\))
From: Mark Nottingham <mnot@mnot.net>
In-Reply-To: <20200519170036.GN58497@kduck.mit.edu>
Date: Wed, 20 May 2020 18:21:05 +1000
Cc: The IESG <iesg@ietf.org>, draft-ietf-httpbis-header-structure@ietf.org, httpbis-chairs@ietf.org, HTTP Working Group <ietf-http-wg@w3.org>, Tommy Pauly <tpauly@apple.com>
Content-Transfer-Encoding: quoted-printable
Message-Id: <FB915609-4CD1-416A-98AD-3950BF3BE09C@mnot.net>
References: <158985680600.32294.14997877272521602557@ietfa.amsl.com> <CFC9B0CA-6125-4705-A13E-F2260F3EC3A5@mnot.net> <20200519170036.GN58497@kduck.mit.edu>
To: Benjamin Kaduk <kaduk@mit.edu>
Received-SPF: pass client-ip=66.111.4.25; envelope-from=mnot@mnot.net; helo=out1-smtp.messagingengine.com
Subject: Re: Benjamin Kaduk's Discuss on draft-ietf-httpbis-header-structure-18: (with DISCUSS and COMMENT)
Archived-At: <https://www.w3.org/mid/FB915609-4CD1-416A-98AD-3950BF3BE09C@mnot.net>
Resent-From: ietf-http-wg@w3.org
Resent-Sender: ietf-http-wg-request@w3.org
Precedence: list

Hi Ben,

See below. The latest commit referred to is <https://github.com/httpwg/http-extensions/commit/83f3ab6c0>.

> On 20 May 2020, at 3:00 am, Benjamin Kaduk <kaduk@mit.edu> wrote:
>> 

>> I think the solution here is to change the last paragraph of Notational Conventions to:
>> 
>> """
>> For serialization to HTTP fields, the ABNF illustrates the expected wire representation with as much fidelity as possible, and the algorithms define the recommended way to produce them. Implementations MAY vary from the specified behavior so long as the output is still correctly handled by the parsing algorithm.
>> """
>> 
>> Does that work for you, Ben?
> 
> This makes the parsing algorithms normative for both parsing and
> serialization, which alleviates the concerns about skew.
> 
> I do agree with Julian that making the (minor!) change to the ABNF to
> remove that axis of skew seems worthwhile, though I do not think I can make
> that a Discuss point (given that I, like him, would prefer to keep the ABNF
> as-is over removing it entirely as you propose).

See latest commit.

>>> Additionally, when Section 4.2.3.2 discusses parsing parameter map keys,
>>> the handling for duplicate map key names is specified as overwriting the
>>> previous value, in contrast to the prose description (Section 3.1.2)
>>> that describes these keys as "unique within the scope [of] the
>>> Parameters they occur within".  (While dictionary key names might be
>>> expected to have a similar behavior, I did not find conflicting text for
>>> that behavior.)
>> 
>> That's because there's a difference between the underlying data model (specified in section 3) and the algorithms for how we get them into and out of textual HTTP headers (section 4). There might be another serialisation of the data model in the future that has a different way of handling duplicates.
> 
> Perhaps I'm just hung up on the "intentionally strict processing" preface,
> but why allow duplicates at all vs. making them an error?

Duplicates are surprisingly common in the wild; I suspect because people apply policy to add a header at multiple points, and they end up duplicating the work. We had a discussion about whether to hard fail in this case or to swallow the duplicate, and the outcome was the latter; the important part was that behaviour was consistent, not that it's universally strict.

>>> (1) What aspect(s) of structured field processing are case (in)sensitive?
>>> The only mentions I see of case sensitivity are in Section 4.2 discussing
>>> header field names and (implicitly) Section 4.2.2 discussing a
>>> "character-for-character" comparison of dictionary key names, but of
>>> course we cite RFC 5234 for ABNF, which uses case-insensitive matching.
>>> On the other hand, base64 encoding requires case sensitivity for
>>> successful round-tripping of arbitrary binary data.
>> 
>> As per S 2, it's up to each field to constrain the syntax of different types as appropriate to their semantics. The only exception is keys in dictionaries and parameters, which are explicitly lowercase (to avoid differences of only case there).
> 
> Do we want to call out case in § 2 as something that structured field
> specifications should consider in their definition?  It's a bit of a
> nitpicky details, admittedly, but it has caused enough problems in the past
> that it may be worth it.

Makes sense; see latest commit.

> Also, could you help me find where we explicitly use lowercase
> dictionary/parameter keys?  I don't remember seeing that.

Key's ABNF uses lcalpha, and its parsing and serialisation algorithms require it.

>>> (2) One of the stated goals of this document is to define intentionally
>>> strict processing rules, but there are several places where we could
>>> have removed additional optionality but elected to not do so.  What is
>>> the criterion for "too far" towards strictness?  For example, Section
>>> 4.2.7 gives optionality with respect to base64 padding (see COMMENT).
>> 
>> It's a judgement call in each case. As discussed below, we were lax on base64 due to common implementation limitations. In other cases, e.g., whitespace handling, it was to allow use with some existing headers, should one choose to, and because some people are still going to be creating these headers by hand (e.g., in .htaccess on Apache).
>> 
>> I'm happy to chat about it, but why is this last one a DISCUSS?
> 
> If there's a specific rule, it would need to be applied consistently.
> Given that you say "judgement call in each case" that may be moot (in the
> US sense), though I would have been happier if there was a more intentional
> rule.
> 
> Also, since its vaguely topical here (but not part of the Discuss), you
> note that several details were chosen so as to allow use with existing
> headers.  The follow-up discussion downthread clarifies that, yes, this
> spec itself only explicitly applies to new field specifications, but that
> doesn't stop someone from noticing that existing fields have a compatible
> encoding and can reuse the same machinery.  While I don't know of a
> requirement for it, I'm used to seeing a mention in the document of the
> motivation for such details, so that it puts everyone on more level ground
> and doesn't require individual cleverness (and duplicated analysis work!)
> to reach that conclusion.  Is there a reason that we don't want to mention
> this possibility of parser/serializer reuse for existing fields in the
> document itself?  I would find it helpful to understand the motivation for
> certain details, though I concede that I may be unusual in that regard and
> not part of the target audience.

This may help:
  https://mnot.github.io/I-D/binary-structured-headers/

Especially Section 4 and Appendix A. 

It's not adopted, but so far the response has been pretty positive IMO.  

We're definitely doing a balancing act here; we didn't want to be constrained to being interoperable with all existing headers (or even the majority), because that would have made SF much more complex (given how far over the map they are; there are trivial differences in things like whitespace handling, quoting and other aspects from header to header). OTOH we did want to be able to back port common ones without too much effort. There's also some history; we wanted to be very very clear that these algorithms can't just be used on existing headers without some accommodation for values that fall out of the SF serialisation (as BSF has).

>>> What's the motivation for "MUST omit values of boolean true" (in
>>> parameters and dictionaries)?  It seems to make the output rules more
>>> complicated without a significant gain in encoding size.
>> 
>> It allows some existing HTTP headers to be treated as structured for the purposes of serialisation.
> 
> The downthread discussion also made the claim that this is due to a desire
> to have exactly one way to serialize.  But what's the good in having
> exactly one way to serialize if you don't enforce that at the receiver?
> (Compatibility with already-defined header fields, no doubt, but it does
> kind of undercut some of the stated goals of this document.)

Sorry, I lost state. Where did you get that impression?

>>> Section 3.1
>>> 
>>>  An empty List is denoted by not serializing the field at all.
>>> 
>>> So any structured field definition that allows for an empty list has to
>>> define its semantics to be equivalent to the semantics used for when the
>>> peer does not support that structured field at all?
>> 
>> Yes; this point was discussed (a lot, if memory serves).
> 
> This seems like the sort of (potentially security relevant) detail that's
> worth noting explicitly.

See latest commit.

>>>  A header field whose value is defined as a List of Inner Lists with
>>>  Parameters at both levels could look like:
>>> 
>>> Is this also an inner list of strings?
>> 
>> It could be.
> 
> I mean, this particular example looks to my eye to be an inner list of
> strings.  The previous example specifically called out that its example
> used an inner list of strings, so a reader might notice the lack of such
> callout and wonder if this one is supposed to be different.

Right, but "A header field whose value is defined as a List of Inner Lists of Strings with Parameters" is quite a mouthful.

>>>  6.  Return output_string converted into an array of bytes, using
>>>      ASCII encoding [RFC0020].
>>> 
>>> This implicitly assumes that output_string contains only characters
>>> representable in ASCII.  I believe this is currently true based on the
>>> relevant ABNF constructions and serialization procedures, but it still
>>> feels like a needless dangling edge case.
>> 
>> Why? HTTP fields are already strongly encouraged to be ASCII.
> 
> If someone tries to hand me an output_string for which this encoding fails,
> am I expected to fail to emit the entire structured field?

Yes, but as you point out, the other algorithms that feed into this assure that it's already ascii.

(I'm beginning to appreciate that the downside of writing an algorithmic specification is that reviewers treat it like a code review)

>>> Section 4.2
>>> 
>>>  Strings split across multiple field lines will have unpredictable
>>>  results, because comma(s) and whitespace inserted upon combination
>>>  will become part of the string output by the parser.  Since
>>>  concatenation might be done by an upstream intermediary, the results
>>>  are not under the control of the serializer or the parser.
>>> 
>>> This seems to have the implication that the results are unpredictable
>>> even if a serializer and parser collaborate to use a procedure that, in
>>> disregard of this specification, splits individual list and/or
>>> dictionary entries across multiple field lines, due to the potential for
>>> an intermediary that is not complicit in the deviation from the spec.
>>> In some sense, that might be the key implication of this statement, in
>>> which case it's surprising to not see it spelled out more clearly.
>> 
>> Do you have a suggestion?
> 
> I'd consider adding a clause at the end, "even when the serializer and
> parser are under the control of the same party", though that may have flaws
> of its own.

I'll give it a try. See latest commit.

>>> Section 4.2.1.2
>>> 
>>>  3.  While input_string is not empty:
>>> 
>>>      1.  Discard any leading SP characters from input_string.
>>> 
>>>      2.  If the first character of input_string is ")":
>>> 
>>> Don't we need to check if the input_string is empty again after
>>> consuming leading SP but before attempting to consume anything else?
>> 
>> In that case, it will fall through to (3) - running Parsing an Item - and fail there.
> 
> If we ask people to inspect the first character of an empty string, is
> everybody going to get it right and avoid invoking undefined behavior in
> their implementation language of choice?  To my eye there is some risk of
> security-relevant implementation errors here.

We rely primarily on the test suite for avoiding interpretation problems like this. In general, we're trying to write pseudo-code; the expectation is that the implementer knows their language better than we do (backed up by the test suite).

>>> Section 4.2.2
>>> 
>>>  Given an ASCII string as input_string, return an ordered map whose
>>>  values are (item_or_inner_list, parameters) tuples. input_string is
>>> 
>>> Should we say anything about the map keys?
>> 
>> I'm struggling to come up with text that is worth the extra space taken; "keys that are keys" is hardly informative, and going into more detail seems to be re-specifying in the summary.
> 
> "an ordered map whose keyed values are (item_or_inner_list, parameters)
> tuples", maybe?

I don't think that adds anything.

>>> Section 6
>>> 
>>> It seems worth mentioning the handling for duplicated key names (e.g.,
>>> in parameters and dictionaries) w.r.t. overwrite or must-be-unique, and
>>> how there have been previous vulnerabilities relating to different
>>> implementations choosing "first one wins" vs. "last one wins".
>> 
>> That doesn't seem to apply to a correct implementation, only to headers that *aren't* structured fields.
> 
> It's still motivation for why we are making the choices we did and a
> benefit that structured headers have over the existing mechanisms.

Right, but that doesn't seem appropriate in Security Considerations; it's more Introduction / motivating material.

> Also, it seems to explicitly apply to parameter map keys (per the earlier
> discussion).

I've added a note to this effect in the Dictionary and Parameter parsing algorithms; see latest commit.

Cheers,


--
Mark Nottingham   https://www.mnot.net/

Benjamin Kaduk's Discuss on draft-ietf-httpbis-he… Benjamin Kaduk via Datatracker
Re: Benjamin Kaduk's Discuss on draft-ietf-httpbi… Mark Nottingham
Boolean param notation, Re: Benjamin Kaduk's Disc… Julian Reschke
empty lists, was: Benjamin Kaduk's Discuss on dra… Julian Reschke
Re: Boolean param notation, Re: Benjamin Kaduk's … Poul-Henning Kamp
Re: empty lists, was: Benjamin Kaduk's Discuss on… Mark Nottingham
Re: empty lists, was: Benjamin Kaduk's Discuss on… Julian Reschke
Re: Benjamin Kaduk's Discuss on draft-ietf-httpbi… Benjamin Kaduk
Re: Benjamin Kaduk's Discuss on draft-ietf-httpbi… Mark Nottingham
Re: Benjamin Kaduk's Discuss on draft-ietf-httpbi… Benjamin Kaduk
Re: Benjamin Kaduk's Discuss on draft-ietf-httpbi… Benjamin Kaduk
Re: Benjamin Kaduk's Discuss on draft-ietf-httpbi… Mark Nottingham
Re: Benjamin Kaduk's Discuss on draft-ietf-httpbi… Mark Nottingham
Re: Benjamin Kaduk's Discuss on draft-ietf-httpbi… Barry Leiba