[TLS] A closer look at ROBOT, BB Attacks, timing attacks in general, and what we can do in TLS

Colm MacCárthaigh <colm@allcosts.net> Thu, 14 December 2017 22:20 UTC

MIME-Version: 1.0
From: Colm MacCárthaigh <colm@allcosts.net>
Date: Thu, 14 Dec 2017 14:20:54 -0800
Message-ID: <CAAF6GDeeo2xjv1Xu7SFXVZ_zM=XUVJHT=eqH4_-G3+4UHsfvgg@mail.gmail.com>
To: "tls@ietf.org" <tls@ietf.org>
Content-Type: multipart/alternative; boundary="001a114c8ee2265dcc0560544c50"
Archived-At: <https://mailarchive.ietf.org/arch/msg/tls/t6SKfh49fb4kRET2krZ6UoaEefs>
Subject: [TLS] A closer look at ROBOT, BB Attacks, timing attacks in general, and what we can do in TLS
Precedence: list

TLS folks,

A few weeks ago the s2n team got a mail from US CERT asking us to take a
look for any Bleichenbacher attack issues and get to back to them. We
didn't have any issues (thankfully!), but it was a good opportunity for us
to review how we defend against BB and other related attacks. The notice
was of course based on the excellent work of Hanno Böck, Juraj Soorovsky,
and Craig Young.

Based on the notice, we also went looking in some other code-bases and
implementations we have access to, and did find issues, both classic BB98
style attacks, and small timing side channels. We've worked with vendors
and code owners in each case and fixes are in place, new releases made, and
so on.

Based on our experiences with all of this over the last few weeks, I'd like
to summarize and throw out a few suggestions for making TLS stacks more
defensive and robust against problems of this class. One or two may be even
worth considering as small additions to the forthcoming TLS RFC, and I'd
love to get feedback on that.

*First: the basic specific defense against BB98*
In s2n we went with the boring basic per-the-TLS defense against BB98. We
pre-generate a random PMS:

https://github.com/awslabs/s2n/blob/master/tls/s2n_client_key_exchange.c#L64

then we over-write this PMS only if RSA succeeds:

https://github.com/awslabs/s2n/blob/55a641dc29d17780620c16713854f1c9fd31f7ce/crypto/s2n_rsa.c#L165

since it's important not to leak timing, we use a special
"constant_time_copy_or_dont" routine to do the over-write:

https://github.com/awslabs/s2n/blob/master/utils/s2n_safety.c#L70

we then allow the handshake to proceed, and fail it later:

https://github.com/awslabs/s2n/blob/088240d081953131deefb27a70fa6f728156f0cf/tls/s2n_client_finished.c#L35

So far everything is standard, though we did notice in reviewing some other
implementations that things can be unnecessarily complicated around
handling the first step. I'm including the code links here just as an
example that literally following the RFC guidance is fairly doable, but it
helps a lot to have a constant-time copy-or-not kind of routine to make it
much easier.

*Second: hide all alerts in suspicious error cases*
Next, when the handshake does fail, we do two non-standard things. The
first is that we don't return an alert message, we just close the
connection. This is intentional, and even though it violates the written
standards, we believe it's better to optimize for frustrating attackers and
not leaking information Vs making occasional debugging by TLS experts
easier. I'd be interested in the thoughts of others here. We did find
other implementations that just close() a connection, and we've never
noticed inter-operability problems. Should we tolerate this in the RFC?

*Third: mask timing side-channels with a massive delay*
The second non-standard thing we do is that in all error cases, s2n behaves
as if something suspicious is going on and in case timing is involved, we
add a random delay. It's well known that random delays are only partially
effective against timing attacks, but we add a very very big one. We wait a
random amount of time between a minimum of 10 seconds, and a maximum of 30
seconds. With the delay being granular to nanoseconds. This puts a delay of
20 seconds on average on every potential attacker measurement (though
obviously they can parallelize) and adds something around 2^60 bits of
entropy to a measurement. The effectiveness of this technique depends on
the size and distribution of the timing channel being leaked, as well the
distribution of measurement latency generally, but suppose the timing leak
is 10 microseconds, then the delay likely increases attacker difficulty by
a factor of probably hundreds of billions. I'd be interested in thoughts
on this too, as a general recommendation as "something TLS stacks
can/should do is that when there's an error, mask it with a big random
delay".

The main "downside" of this delay is that it can tie up resources - but
we're not as concerned with that. Firstly, an attacker can always cause a
connection to stall for 30 seconds, and even ordinary network conditions
like packet loss can trigger that. That's why we chose 30 seconds as our
upper bound. Second, connections are cheap, especially in
asynchronous/epoll driven models, or any model that isn't one-connection
per process. We prefer to have the defense in depth. It's worth noting that
this protection also helps with other timing attacks, such as Lucky13 (read
https://aws.amazon.com/blogs/security/s2n-and-lucky-13/ for my write up on
how the earlier version of this, with microsecond granularity, wasn't
enough to defeat Lucky13, but still made it millions of times harder for an
attacker) or even AES timing issues.

The remainder of this mail doesn't concern changes that could made in the
TLS draft, but rather just other information that may be useful for
implementors. Sorry if I'm abusing the purpose of the list a little, but
hopefully the last item in particular will make it worth the read.

*Fourth: regression tests*
We noticed that many of implementations we encountered had no regression
tests. In s2n we do have plenty of tests, but actually did not have a
specific regression test against BB. We didn't want to add one until the
Robot attack was public, but are now adding one.

*Fifth: formal constant-time verification*
Regression tests on their own do not defend against timing side channel
regressions, and we noticed in the commit history of some of the code we
reviewed that ones had snuck in over time. For example some non-s2n code we
encountered added some debug logging in the case where the padding is
mismatched, which created a very small timing difference with the cost it
takes to print the logging message. Here I'd like to share two pieces of
work we've been doing over the last year. The first is formal verification
of our code's constant time properties with ctverif and SMACK. Our
integration is at:

https://github.com/awslabs/s2n/tree/master/tests/ctverif

We've been able to instrument our most timing sensitive routines and verify
that they really do use the same number of instructions, regardless of
inputs. The annotations are very readable and easy to deal with and don't
mess up the code base, in fact the code I linked to earlier:

https://github.com/awslabs/s2n/blob/master/utils/s2n_safety.c#L70

is annotated, that's what the S2N_PUBLIC_INPUT macro is for. These
constant-time tests run on every commit and build of s2n, and defend
against regressions in the code and even compiler. For those of you who may
be at RWC2018/HACS, I'd love to show you more if you're interested in
integrating or doing anything similar!

*Sixth: fuzzy constant-time verification for code-balancing*
In building the s2n formal constant-time verification, we noticed that
there are more complicated aspects of TLS processing; such as CBC
verification and BB98 mitigation that need to be in constant-time, but
where the code also spans many functions and even messages. The solutions
here typically are not small routines that are rigorously constant time,
but instead rely on a degree of code-balancing: taking code paths that are
"similar enough" in timing so as not to be measurable.

At first, the reaction against code-balancing can often be finger-wagging,
but our our survey of TLS libraries is that most are using code-balancing
in some capacity. In fact, we can take the OpenSSL LuckyMinus20
vulnerability and regression as evidence that trying to be rigorously
constant time on large complicated algorithms that were not designed to be
constant time can seriously backfire by introducing logic errors in the
very hard-to-follow constant-time adaptations.

Because code-balancing is useful in these cases, and certainly used in many
real world implementations, we wanted to make its effect more measurable,
and more than simply good intentions. Our automated reasoning group (and in
particular Daniel Schwartz-Narbonne and Konstantinos Athanasiou) has built
and integrated a new tool called sidewinder. The PR integrating it into s2n
is here:

https://github.com/awslabs/s2n/pull/664

The great thing about Sidewinder is that it can cope with an analysis over
larger and more inter-related code units, and it can produce bounds for how
much may the execution paths differ by, for both instructions and memory
accesses.I believe it would have prevented some of the regressions we found
in our BB attack focused review of third party code bases over the last few
weeks. It's another tool I'd be excited and happy to share with folks at
RWC/HACS.

--
Colm

[TLS] A closer look at ROBOT, BB Attacks, timing … Colm MacCárthaigh
Re: [TLS] A closer look at ROBOT, BB Attacks, tim… Watson Ladd
Re: [TLS] A closer look at ROBOT, BB Attacks, tim… Colm MacCárthaigh
Re: [TLS] A closer look at ROBOT, BB Attacks, tim… Hanno Böck
Re: [TLS] A closer look at ROBOT, BB Attacks, tim… Colm MacCárthaigh
Re: [TLS] A closer look at ROBOT, BB Attacks, tim… Yoav Nir
Re: [TLS] A closer look at ROBOT, BB Attacks, tim… Nikos Mavrogiannopoulos
Re: [TLS] A closer look at ROBOT, BB Attacks, tim… Ilari Liusvaara
Re: [TLS] A closer look at ROBOT, BB Attacks, tim… Ilari Liusvaara
Re: [TLS] A closer look at ROBOT, BB Attacks, tim… Andrei Popov
Re: [TLS] A closer look at ROBOT, BB Attacks, tim… Kathleen Moriarty
Re: [TLS] A closer look at ROBOT, BB Attacks, tim… Hanno Böck
Re: [TLS] A closer look at ROBOT, BB Attacks, tim… Ilari Liusvaara
Re: [TLS] A closer look at ROBOT, BB Attacks, tim… Ilari Liusvaara
Re: [TLS] A closer look at ROBOT, BB Attacks, tim… Andrei Popov
Re: [TLS] A closer look at ROBOT, BB Attacks, tim… Eric Rescorla
Re: [TLS] A closer look at ROBOT, BB Attacks, tim… Watson Ladd
Re: [TLS] A closer look at ROBOT, BB Attacks, tim… Ilari Liusvaara
Re: [TLS] A closer look at ROBOT, BB Attacks, tim… Eric Rescorla
Re: [TLS] A closer look at ROBOT, BB Attacks, tim… Eric Rescorla
Re: [TLS] A closer look at ROBOT, BB Attacks, tim… Ilari Liusvaara
Re: [TLS] A closer look at ROBOT, BB Attacks, tim… Andrei Popov
Re: [TLS] A closer look at ROBOT, BB Attacks, tim… Ilari Liusvaara
Re: [TLS] A closer look at ROBOT, BB Attacks, tim… Tim Hollebeek
Re: [TLS] A closer look at ROBOT, BB Attacks, tim… Andrei Popov
Re: [TLS] A closer look at ROBOT, BB Attacks, tim… Tim Hollebeek
Re: [TLS] A closer look at ROBOT, BB Attacks, tim… Ilari Liusvaara
Re: [TLS] A closer look at ROBOT, BB Attacks, tim… Ilari Liusvaara
Re: [TLS] A closer look at ROBOT, BB Attacks, tim… Ilari Liusvaara
Re: [TLS] A closer look at ROBOT, BB Attacks, tim… Martin Rex
Re: [TLS] A closer look at ROBOT, BB Attacks, tim… Martin Rex
Re: [TLS] A closer look at ROBOT, BB Attacks, tim… Ilari Liusvaara
Re: [TLS] A closer look at ROBOT, BB Attacks, tim… Peter Gutmann
Re: [TLS] A closer look at ROBOT, BB Attacks, tim… Ilari Liusvaara
Re: [TLS] A closer look at ROBOT, BB Attacks, tim… Hubert Kario
Re: [TLS] A closer look at ROBOT, BB Attacks, tim… Hubert Kario
Re: [TLS] A closer look at ROBOT, BB Attacks, tim… Colm MacCárthaigh
Re: [TLS] A closer look at ROBOT, BB Attacks, tim… Hubert Kario
Re: [TLS] A closer look at ROBOT, BB Attacks, tim… Colm MacCárthaigh
Re: [TLS] A closer look at ROBOT, BB Attacks, tim… Hubert Kario
Re: [TLS] A closer look at ROBOT, BB Attacks, tim… Colm MacCárthaigh