Re: [babel] Benjamin Kaduk's Discuss on draft-ietf-babel-hmac-08: (with DISCUSS and COMMENT)

Benjamin Kaduk <kaduk@mit.edu> Wed, 14 August 2019 19:48 UTC

Date: Wed, 14 Aug 2019 14:48:03 -0500
From: Benjamin Kaduk <kaduk@mit.edu>
To: Juliusz Chroboczek <jch@irif.fr>
Cc: The IESG <iesg@ietf.org>, draft-ietf-babel-hmac@ietf.org, Donald Eastlake <d3e3e3@gmail.com>, babel-chairs@ietf.org, babel@ietf.org
Message-ID: <20190814194802.GB88236@kduck.mit.edu>
References: <156521429138.8333.12124544758210076970.idtracker@ietfa.amsl.com> <87h86pcmzk.wl-jch@irif.fr>
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Disposition: inline
In-Reply-To: <87h86pcmzk.wl-jch@irif.fr>
User-Agent: Mutt/1.10.1 (2018-07-13)
Archived-At: <https://mailarchive.ietf.org/arch/msg/babel/ONy19DkRhB-UBks905d5NbJVAFQ>
Subject: Re: [babel] Benjamin Kaduk's Discuss on draft-ietf-babel-hmac-08: (with DISCUSS and COMMENT)
Precedence: list

On Sat, Aug 10, 2019 at 12:51:11PM +0200, Juliusz Chroboczek wrote:
> Dear Benjamin,
> 
> Thank you for your review.
> 
> > Are the HMAC keys required to be the hash function's block size or its
> > output size?  Section 3.1 says just "the length of each key is exactly
> > the hash size of the associated HMAC algorithm", and "hash size"
> > conventionally refers to the output length.  The referenced Section 2 of
> > RFC 2104 concerns itself with the hash's compression function's block
> > size B, which is generally different.
> 
> The block size, good catch.
> 
> > Also in Section 3.1, if we are going to claim that a "random string of
> > sufficient length" suffices to initialize a fresh index, we need to
> > provide guidance on what constitutes "sufficient length" to achieve the
> > needed property.
> 
> Section 6 says
> 
>    This
>    property can be satisfied either by using a cryptographically secure
>    random number generator to generate indices and nonces that contain
>    enough entropy (64-bit values are believed to be large enough for all
>    practical applications), or by using a reliably monotonic hardware
>    clock.

And you don't want to make a forward reference to Section 6?

> > Blake2s is a keyed MAC, but is not an HMAC construction.  If we are to
> > allow its usage for providing integrity protection of babel packets
> > directly, we therefore cannot refer to the preotection scheme as "HMAC"
> > generically.  Fixing this will, unfortunately, be somewhat invasive to
> > the document, since we mention HMAC all over the place.  I believe that
> > "Keyed Message Authentication Code (Keyed MAC)" is an appropriate
> > replacement description.
> 
> I disagree.  While you're technically correct, for the typical network
> operator "HMAC" is a familiar term, "Keyed MAC" is not.  I think that
> renaming this document to "Keyed MAC" would make it less informative.

Well, I am pretty uncomfortable about saying things that are not true!
I'm willing to accept using "HMAC" as a commonly accepted shorthand for the
actual primitives being used, but only if there's a statement explaining
that the term is used as a shorthand and that the implementation details
can be different.

> > The suggestion that the large challenge nonce size admits storage of
> > state in a secure "cookie" in the nonce is true, however, implementing
> > this properly presents some subtleties, and it seems like something of
> > an attractive nuisance to suggest that it is possible without giving
> > adequate guidance at how to do it safely.  Unfortunately, the best
> > reference I can think of, offhand, is the obsoleted RFC 5077.
> 
> I agree.  This is something that we considered in the design of the
> protocol (which is why Nonces are allowed to be so large), but never
> implemented ourselves; it would probably be some work to get it right.
> Hence the very careful formulation ("might").  Let me know if you want to
> suggest a different formulation.

It's probably okay to leave this as speculative (since there's not an
existence proof), but I think it's irresponsible to not also include a
disclaimer of at least "Note that such schemes will need to consider what
degree of "freshness" is required of a challenge and the risk of replay for
the encoded state, among other things".

> > Let's also have a discussion about whether 64 bits of randomness is
> > always sufficient; I left a longer note down in the Comment since I
> > don't expect this to end up being a blocking point.
> 
> Let's.  See below.
> 
> > ----------------------------------------------------------------------
> > COMMENT:
> > ----------------------------------------------------------------------
> 
> > This is a symmetric-keyed scenario, so most attacks that involve a
> > compromised node will be "uninteresting", in that once the key is
> > exposed all guarantees are lost.  However, it may still be worth noting
> > that a compromised node can cause disruption on multi-access links
> > without detection since there is no "end of generation" signal when a
> > node changes its index.  That is, if node B reboots or otherwise resets
> > its index/pc, then compromised node C can spoof packets from B with the
> > previous index and honest node A will accept them, and B will be unable
> > to detect that it has been spoofed.  On the flip side, we may want to
> > discuss that B can watch for messages that spoof its source address to
> > detect compromised nodes.
> 
> I may be misunderstanding what you mean, but I think that's not the case.
> If B reboots, then:
> 
>   - if A has state about B, then any previous PC will be rejected;
>   - if A has no state about B, then A will send a challenge which C won't
>   - be able to reply to.
> 
> Are we misunderstanding each other?

A and C both have state about B, which is retained when B reboots.
While B is still rebooting or otherwise initializing, C can forge a message
"from" B that A will accept, and C can continue to do so until A sees the
new (legitimate) index from B and switches over to the new generation.  So
A has accepted false input but neither A nor B are aware of it.

> > Is there any need for initial PC value randomization?
> 
> I don't see why.

I don't, either, but just wanted to get a second opinion from an actual
expert.

> > Do we want to recommend starting at 0 or 1 (or prohibit recipients from
> > assuming that the initial value for an index will be that)?
> 
> The initial value as well as the rate of increase are arbitrary.  For
> example, a node may use a hardware clock (suitably offset to fit in 32
> bits) as the PC.

I agree that they're arbitrary, but have a lot of painful experience with
protocol participants making assumptions about underspecified behavior that
ended up not being protocol invariants, with subsequent breakage.  If you
think the existing ecosystem is healthy enough that there's not a real risk
of problems down the line, I can accept that, but I wanted to raise the
topic.

> > Section 1
> 
> > "This document obsoletes RFC 7298" should be in the Introduction as well
> > as the abstract.
> 
> Done.
> 
> > Is the capability for an attacker to modify/spoof Babel packets in
> > order to cause data to get dropped or cause a routing loop worth
> > mentioning here?
> 
> No, that's a particular case of redirecting.
> 
> > Section 1.2
> 
> >    o  that the Hashed Message Authentication Code (HMAC) being used is
> >       invulnerable to pre-image attacks, i.e., that an attacker is
> >       unable to generate a packet with a correct HMAC;
> 
> > I think it's more conventional to include the caveat "without [access
> > to/knowledge of] the secret key" for this sort of statement about HMAC.
> 
> Agreed, done.
> 
> >    The first assumption is a property of the HMAC being used.  The
> >    second assumption can be met either by using a robust random number
> >    generator [RFC4086] and sufficiently large indices and nonces, by
> >    using a reliable hardware clock, or by rekeying whenever a collision
> >    becomes likely.
> 
> > Does this rekeying option require an external operation/management actor
> > to trigger it?  It might be worth mentioning with some operational
> > considerations.
> 
> Disagree.

I think my main quesetion here is who determines, and how, "whenever a
collision becomes likely".  Some protocols have in-band mechanisms for
peers to make such a determination and initiate rekeying automatically;
from what I understand, babel-hmac does not.  If we actually expect
deployments to encounter scenarios where rekeying is necessary to preserve
security, we should be pretty clear about who has the responsibility to do
so.

> >    o  among different nodes, it is only vulnerable to immediate replay:
> >       if a node A has accepted a packet from C as valid, then a node B
> >       will only accept a copy of that packet as authentic if B has
> >       accepted an older packet from C and B has received no later packet
> >       from C.
> 
> > nit: I don't think "A has accepted a packet from C" is quite the right
> > precondition; it seems to be more like "A has received a valid packet
> > from C", since whether or not A (as an attacker) considers it valid is
> > irrelevant to whether (honest) B will.
> 
> C is the attacker here, A is honest.

I was (apparently) reading this differently -- "replay" could be initiated
by an RFC 3552 attacker in the network even without knowledge of the key.
So I thought that all three of A, B, and C could be honest and the risk of
replay would still exist.  Phrased alternately, A and B make independent
assessments of a packet's validity; if they are both honest and properly
implemented those assessments will produce identical output, but it is not
the case that B accepts a packet as authentic *because* A has already done
so.

> > Section 4.1
> 
> > If we had identifiers for symmetric keys or HMAC algorithms, we could
> > include those identifiers in the pseudo-header and thereby gain some
> > protection from downgrade/HMAC-stripping attacks in the presence of a
> > weak keyed MAC algorithm.  (I think we have to include both what we are
> > sending and what we think the peer can do in order to get substantial
> > protection, though, which diminishes the appeal for multicast
> > scenarios.)
> 
> This is a symmetric algorithm, for downgrade attacks to work the victim
> would need to be configured with a weak key.  I therefore don't see how
> this added complexity helps.

I've seen plenty of cases where systems are still configured with
single-DES keys as well as AES keys (whoops!).  In this particular case the
cost/benefit analysis may not favor the complexity, but I want that to be a
conscious decision and not something we slip into by default.

> > nit: I don't think the past tense is correct for "packet was carried
> > over IPvN", since we're talking about a pseudo-header used in
> > computations before the packet is sent.
> 
> Agreed, done.
> 
> > It might be worth reiterating that every time a packet goes on the wire,
> > it gets a fresh PC, regardless of whether it's a "retransmit" after a
> > timeout or a new message.
> 
> There are no retransmits in this protocol.

Hence the scare quotes.  Can you guarantee that no one will look at this
and say "I'll just cache this challenge I send to a peer and re-send it
(rate limited) to that peer in response to traffic I can't authenticate,
until I get a valid reply"?

> >    interface MTU (Section 4 of [RFC6126bis]).  For an interface on which
> >    HMAC protection is configured, the TLV aggregation logic MUST take
> >    into account the overhead due to PC TLVs (one in each packet) and
> >    HMAC TLVs (one per configured key).
> 
> > (per configured key, and also per packet, right?)
> 
> Of course.  The MTU applies per packet.
> 
> > Does it matter whether the sender increments the PC before or after
> > inserting it in the PC TLV?  (I think the only potential impact would be
> > as it relates to the value sent in response to a challenge nonce, but
> > the "increment by a positive not-necessarily-one amount" property may
> > provide all the flexibility we need.)
> 
> I don't think so.

Thanks for thining about it.

> > Section 4.3
> 
> > Validating the HMACs is the sort of operation that we tend to recommend
> > be done in constnt-time to avoid side channel attacks.  I don't have a
> > concrete attack handy here at the moment, though.
> 
> I frankly have no idea if that's necessary or not.  FWIW, the protocol is
> asynchronous, and there is jitter applied to packets.

My super-quick assessment is that the timing channel could be used to "pick
off byte by byte" an attacker's guess at the HMAC of some fixed plaintext
that the attacker would later replay (with valid HMAC) to some other
participant.  But I think the source/destination addresses in the
pseudo-header limit the potential scope a fair amount -- the attacker would
have to use the intended destination address and spoof the source address
for the intended forgery, and once the attacker is doing that, it becomes
somewhat similar to an online brute-force attack to achieve the same
forgery (since the final transaction would have the actual effect that the
spoofed/replayed packet would), and the timing channel becomes harder to
observe if the replies are going to a different place.

> >       When a PC TLV is encountered, the enclosed PC and Index are saved
> >       for later processing; if multiple PCs are found (which should not
> >       happen, see Section 4.2 above), only the first one is processed,
> >       the remaining ones MUST be silently ignored.  If a Challenge
> 
> > Any reason to not just drop the whole packet if there are multiple PCs
> > present?  I see this is not rfc7298bis but don't know what level of
> > breaking change is reasonable.
> 
> It doesn't matter much, it's a "cannot happen" case.  (Note that the HMAC
> has already been validted at this point, so it isn't a security issue.)

Oh, it's definitely not a security issue; this is more of a philosophical
question of what to do in the face of a peer with a "totally broken"
implementation.  The TLS ecosystem, for example, has been bitten by
the need for historical compatibility quirks and is now (over?)compensating
by being quite strict about rejecting malformed things.

> >    o  The preparse phase above has yielded two pieces of data: the PC
> >       and Index from the first PC TLV, and a bit indicating whether the
> >       packet contains a successful Challenge Reply.  If the packet does
> >       not contain a PC TLV, the packet MUST be dropped and processing
> >       stops at this point.  If the packet contains a successful
> >       Challenge Reply, then the PC and Index contained in the PC TLV
> >       MUST be stored in the Neighbour Table entry corresponding to the
> >       sender (which already exists in this case), and the packet is
> >       accepted.
> 
> > I'd suggest explicitly stating that if there is a challenge reply that
> > doesn't validate, the packet should be discarded.
> > Or are there multicast scenarios where that is not the case? The key
> > point being to emphasize that just the presence of a challenge reply
> > doesn't mean anything, it has to be valid in order to have significance.
> 
> This is already the case -- a packet with an incorrect challenge reply is
> treated just like a packet with no challenge reply.

This was an attempt to "consider all the ways in which a peer could
misbehave": it could send two, one valid and one invalid; or an invalid
challenge reply even though it didn't need to supply one at all; or ...
(This is basically the same philosophical question as above.)

> I don't feel comfortable with discarding a packet just because it contains
> an obsolete challenge reply -- what if the packet also contains a challenge
> request?  It also complicates the code, by requiring three challenge
> verdicts (positive/neutral/negative) rather than just two (positive/neutral).

Okay.

> >    o  At this stage, the packet contains no successful challenge reply
> >       and the Index contained in the PC TLV is equal to the Index in the
> >       Neighbour Table entry corresponding to the sender.  The receiver
> >       compares the received PC with the PC contained in the Neighbour
> >       Table; if the received PC is smaller or equal than the PC
> >       contained in the Neighbour Table, the packet MUST be dropped and
> >       processing stops (no challenge is sent in this case, since the
> >       mismatch might be caused by harmless packet reordering on the
> >       link).  Otherwise, the PC contained in the Neighbour Table entry
> >       is set to the received PC, and the packet is accepted.
> 
> > Does this mean that if packet reordering is encountered, we will just
> > not process packets that get reordered later?  (AFAIK babel will still
> > work fine in such conditions, so I'm just checking my understanding.)
> 
> Correct on both counts.  The WG did consider using a sliding window in the
> style of DTLS, but we finally decided against it for the sake of simplicity.
> Since this is a link-local protocol, packet reordering is unlikely.

Thanks for confirming (and sharing the WG's discussions).

> >    it MAY ignore a challenge request in the case where it it contained
> 
> > nit: s/it it/it is/
> 
> Done, thanks.
> 
> >    The same is true of challenge replies.  However, since validating a
> >    challenge reply is extremely cheap (it's just a bitwise comparison of
> >    two strings of octets), a similar optimisation for challenge replies
> >    is not worthwile.
> 
> > Er, challenge reply validation still requires the HMAC validation step,
> > right?
> 
> At this stage we've already verified the HMAC.  The only thing we could
> potentially save would be a bitwise comparison, and that's not worth it.

Right, but just "validating a challenge reply" in vacuum could be taken to
include validating the HMAC on the containing packet.  An editorial comment,
to be sure, but "poses minimal incremental cost" would avoid it.

> > Section 4.3.1.1
> 
> >    When it encounters a mismatched Index during the preparse phase, a
> >    node picks a nonce that it has never used with any of the keys
> >    currently configured on the relevant interface, for example by
> >    drawing a sufficiently large random string of bytes or by consulting
> 
> > (same comment as above about "sufficiently large")
> 
> See Section 6.
> 
> > Section 4.3.1.2
> 
> >    buffered TLVs in the same packet as the Challenge Reply.  However, it
> >    MUST arrange for the Challenge Reply to be sent in a timely manner
> >    (within a few seconds), and SHOULD NOT send any other packets over
> >    the same interface before sending the Challenge Reply, as those would
> >    be dropped by the challenger.
> 
> > I think this "SHOULD NOT" (or rather, "would be dropped by the
> > challenger") is predicated on the challenge request having not been a
> > replay, but I do not see anything requiring the recipient to do nonce
> > uniqueness validation.
> 
> This doesn't require delaying any packets, quite the opposite, it says
> that you SHOULD send out the challenge reply before sending out any more
> packets.

We seem to be miscommunicating; let me try again.

Consider a multi-access link with honest A and B, and attacker C.
A sends a challenge request to B, that C observes and records.  B sends a
challenge reply, and A sends some more traffic.  What happens if C starts
spamming replayed copies of that challenge request towards B?  This note
about constructing Challenge Reply is during the preparse stage, before
index/PC validation.  Can C cause B to spend a lot of time generating
challenge replies and reduce the amount of time spent sending actual data?

(There's also a variant where C knows the HMAC key and spoofs fresh
challenge requests.)

> > Section 4.3.1.3
> 
> >    neighbour that sent the Challenge Reply.  If no challenge is in
> >    progress, i.e., if there is no Nonce stored in the Neighbour
> >    Table entry or the Challenge timer has expired, the Challenge Reply
> >    MUST be silently ignored and the challenge has failed.
> 
> > I think "the challenge has failed" is predicated on the challenge reply
> > being in response to a challenge sent by this node.  The previous
> > section's "send the Challenge Reply to the unicast address" seems to
> > imply that there are no multicast scenarios which would make that not
> > the case, but I just wanted to check my understanding.
> 
> That's right.
> 
> > Section 5
> 
> > Do we need to say whether sub-TLVs are allowed in any of these TLVs?
> > (Presumably they are not, since the length is needed in order to
> > identify the length of the variable-length fields, but being explicit
> > can be useful.)
> 
> 6126bis only specifies which TLVs do allow sub-TLVs, I think it makes
> sense to follow the same format.
> 
> > Section 5.1
> 
> >    This [HMAC] TLV is allowed in the packet trailer (see Section 4.2 of
> >    [RFC6126bis]), and MUST be ignored if it is found in the packet body.
> 
> > side note: Using "MUST ignore" vs. "discard the packet" has some
> > protocol evolution consequences -- it in practice then becomes an
> > alternative padding technique for use in packet bodies, and if ever used
> > as such then could lead to a way to fingerprint an implementation or be
> > used as a hidden channel for sending other data.  But, I see that
> > ignoring at the TLV level is something of a core babel design choice,
> 
> Right.
> 
> > and I don't see any serious consequences that would merit revisiting
> > that decision.
> 
> Good.
> 
> > Section 6
> 
> >    This mechanism relies on two assumptions, as described in
> >    Section 1.2.  First, it assumes that the hash being used is
> 
> > s/hash/MAC/
> 
> I'm confused.  This section refers to pre-image attacks, which I was under
> the impression is a property of the hash.

The thing we're sending over the wire is a MAC (sometimes HMAC-SHA-256;
sometimes Blake2s; potentially other things in the future).  MACs are
almost universally constructed using hash functions, so it's pretty natural
to think of them fairly interchangably, but there do exist other
constructions like CBC-MAC.  (Also, I think technically we're concerned
about *second* preimage resistance, since we're effectively sending the
first preimage as the message.)  Getting this second-preimage resistance
from CBC-MAC constructs that take variable-length input requires some care,
though!

> > It would require a bit more thought to convince me that 64-bit indices
> > are sufficient for *all* cases.  Specifically, if we want full 64-bit
> > strength, then the 64-bit space cannot be controlled or affected by the
> > attacker to cause collisions.  But I think there will be a reasonable
> > risk that an attacker can cause a given node to need to regenerate its
> > index on demand (e.g,. but triggering a bug that crashes it, or power
> > cycling it)
> 
> Assume the attacker is able to crash the node at will, and that the node
> needs 10s to reboot.  If we assume the attacker will start seeing
> collisions after 2^32 tries, then this will happen after 1360 years, which
> is slightly more than the duration of the Byzantine empire.

"Attacks only get better; never worse."
My point was to show that it's plausible for an attacker to be able to
trigger index generation, and that if we want to be robust in the face of
future attacks, we should consider a threat model that allows them to do so
at will.  Just because we can't see how that would happen now doesn't mean
very much unless we have some sort of proof to back it up.

> The Babel working group recommends rekeying whenever Anatolia is invaded.
> 
> >    present at the receiver.  If the attacker is able to cause the
> >    (Index, PC) pair to persist for arbitrary amounts of time (e.g., by
> >    repeatedly causing failed challenges), then it is able to delay the
> >    packet by arbitrary amounts of time, even after the sender has left
> >    the network.
> 
> > I'd suggest adding another sentence describing the potential
> > consequences of selectively delayed input (i.e., messing up the
> > routing).
> 
> We don't know about any such consequences.  Still, we believe it is good
> to avoid this situation.

Maybe I'm confused, but suppose a node advertises a really-low-metric path
to a prefix, but then drops off the network.  If we cause an announcement
of that path to be delayed for a while, then when ultimately processed it
is likely to end up being the lowest-metric path to that prefix for many of
its neighbors.  Will those neighbors not try to use that path (and fail to
deliver traffic since the node is actually down)?

> >    protocol (the data structures described in Section 3.2 of
> >    [RFC6126bis] are conceptual, any data structure that yields the same
> >    result may be used).  Implementers might also consider using the fact
> 
> > nit: that's a comma splice in the parenthetical; a semicolon would be better.
> 
> I've added a conjunction, I didn't realise such usage is unacceptable.

Introspection suggests that it's a pet peeve of mine, but I do expect it to
be on the list of things the RFC Editor staff look for (and if we can fix
the easy stuff before it gets to them, they will have more time to do a
better job with the stuff we missed).

Thanks,

Ben

[babel] Benjamin Kaduk's Discuss on draft-ietf-ba… Benjamin Kaduk via Datatracker
Re: [babel] Benjamin Kaduk's Discuss on draft-iet… Juliusz Chroboczek
Re: [babel] Benjamin Kaduk's Discuss on draft-iet… Benjamin Kaduk
Re: [babel] Benjamin Kaduk's Discuss on draft-iet… Juliusz Chroboczek
Re: [babel] Benjamin Kaduk's Discuss on draft-iet… Benjamin Kaduk