[Acme] A single failed challenge should not invalidate an entire order

Matt Holt <matt@lightcodelabs.com> Tue, 18 August 2020 20:16 UTC

Date: Tue, 18 Aug 2020 14:16:12 -0600
From: Matt Holt <matt@lightcodelabs.com>
To: Acme <acme@ietf.org>
Message-ID: <17403370784.10c140257139874.6544499691253662216@lightcodelabs.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/acme/wIHaqikTCZ59zrWsUUus8lZ4VSg>
Subject: [Acme] A single failed challenge should not invalidate an entire order

Hi,

After working heavily with ACME clients for the past 5 years (including "ACMEv1" before RFC 8555), I've come to realize there are some unfortunate ambiguities and inefficiencies in RFC 8555 with regard to server behavior after a challenge is attempted and failed by the client.

I recently implemented an RFC 8555-compliant client library in Go (https://github.com/mholt/acmez), and am convinced that a simple revision to the spec can both reduce costs for CAs *and* greatly simplify client implementations, if only the handling of failed challenges is revised.

My realizations are spelled out in this commit: https://github.com/mholt/acmez/commit/80adb6d5e64a3d36a56c58c66965b131ea366b8c

In summary: to get a certificate, a client creates an Order. The client then has to validate all Authorizations ("authzs"). For each Authorization, the client needs to successfully complete one of the offered Challenges. One successful challenge is sufficient to validate the authz. However, one failed challenge is apparently sufficient to invalidate the authz, and thus the entire Order. To try another challenge, the client then has to deactivate the other Authorizations (expensive) and create a new Order (also expensive), repeating the whole process. Instead, the client should be able to simply try the next challenge. In other words, a single failed challenge should not invalidate an authz; an authz should be "pending" until all offered challenges have failed or one has succeeded.
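
The difference between the two rules can be sketched as a pair of small functions. This is only an illustration: the function names are hypothetical, and a server's real state machine has more to it (for example, the "processing" challenge status is simply treated as not yet final here).

```go
package main

import "fmt"

// Status strings as used by RFC 8555 for challenges and authorizations.
const (
	pending = "pending"
	valid   = "valid"
	invalid = "invalid"
)

// currentStatus models the behavior described above: any one finished
// challenge, passed or failed, finalizes the authz. Under this rule at
// most one challenge ever finishes, so the first final status wins.
func currentStatus(challenges []string) string {
	for _, s := range challenges {
		if s == valid || s == invalid {
			return s
		}
	}
	return pending
}

// proposedStatus models the proposal: one success makes the authz valid,
// and it becomes invalid only once every offered challenge has failed.
func proposedStatus(challenges []string) string {
	failed := 0
	for _, s := range challenges {
		switch s {
		case valid:
			return valid
		case invalid:
			failed++
		}
	}
	if len(challenges) > 0 && failed == len(challenges) {
		return invalid
	}
	return pending
}

func main() {
	// tls-alpn-01 failed; http-01 and dns-01 have not been attempted yet.
	challenges := []string{invalid, pending, pending}
	fmt.Println("current: ", currentStatus(challenges))
	fmt.Println("proposed:", proposedStatus(challenges))
}
```

With one failed and two untried challenges, the current rule reports "invalid" while the proposed rule leaves the authz "pending", so the client could move on to the next challenge within the same order.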

The commit I linked to above corrected my initial (overly optimistic) interpretation of RFC 8555, under which a failed challenge simply meant trying another one. The correction involves creating a whole new order and adds 250 lines of code, nearly doubling the complexity, just to handle the most common failure scenario. That is before counting the added cost of the DB transactions the CA has to perform to invalidate an entire order.

The ACME spec allows a server to offer an array of challenges for each authz. In practice, there is no point in offering more than one challenge if only one can ever be attempted.

I propose that RFC 8555 §7.5.1 be revised to say: "The server is said to 'finalize' the authorization when it has successfully completed one of the challenges or failed all of them."

My commit message is quoted below, for convenience. It goes into more detail about the difficulties of the current spec (pardon any stream of consciousness as I was writing this deep in "developer mode"):

The ACME spec (RFC 8555) is somewhat ambiguous/conflicting about
finalizing authorizations. In §7.1.4 it says:

    client should attempt to fulfill one of these challenges, and a
    server should consider any one of the challenges sufficient to
    make the authorization valid.

This makes it sound like solving any one of the possible challenges for
an authz is sufficient to make an authz "valid".

And here it says if any one of the challenges fails, the entire authz is
considered "invalid" (§7.5.1):

    The server is said to "finalize" the authorization when it has
    completed one of the validations. This is done by assigning the
    authorization a status of "valid" or "invalid", corresponding to
    whether it considers the account authorized for the identifier.

To my dismay, it appears that if any one of the challenges listed for an
authz is marked "invalid", indeed the entire order fails. This means
that a server may offer http-01, tls-alpn-01, and dns-01 challenges for
an authz, and if a client tries tls-alpn-01 and fails, it cannot simply
try http-01.

This is very unfortunate, as this is a very common use case, especially
in deployments where site owners don't control their customers' domain
names. We see a lot of cases where port 443 has TLS termination before
the ACME client (breaking the tls-alpn-01 challenge), but where port 80
is open and the http-01 challenge would succeed. We also see the reverse,
where port 80 is blocked but port 443 is open. There is often no way
for the client to know this ahead of time because it does not have an
outside perspective.

Because a single failed challenge invalidates *the entire authz* even
though other challenges *offered by the server as acceptable options*
are still perfectly capable of succeeding, we need to cancel the order
(which involves deactivating the remaining authorizations one-by-one)
and make a new one.

SUPER unfortunately, newOrder calls are rate-limited by Let's Encrypt,
effectively halving even a correctly-implemented, robust, and well-
behaved ACME client's management capacity. Orders are also associated
with a lot of state, and as such, are expensive database transactions
on the server-side. Further, client-side logic is forced to be much
more complex in order to correctly take advantage of all offered
challenge types. Clients that don't do this effectively ignore all but
the first, making it pointless to offer more than one challenge type in
the first place!

The previous logic was much cleaner and more elegant: an order was
created, its authorizations were iterated, and each authorization's
challenges were iterated until one succeeded. If any authorization
failed (i.e. all challenges failed), it simply returned that error and
the order was cancelled (other authorizations were deactivated). This
kept all error-handling and retry state local to the respective loops:

authzs -> challenges
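
A minimal sketch of that two-loop shape, assuming a failed challenge leaves the authz usable (which is what the proposal asks for). The `Authz` type, `SolveFunc`, and `solveAll` are hypothetical names for illustration, not the acmez API:

```go
package main

import (
	"errors"
	"fmt"
)

// Authz is a hypothetical, simplified authorization: an identifier plus
// the challenge types the server offered for it.
type Authz struct {
	Identifier string
	Challenges []string
}

// SolveFunc is a hypothetical stand-in for presenting a challenge and
// polling for the result; a nil error means the challenge succeeded.
type SolveFunc func(identifier, challengeType string) error

// errRefused simulates a challenge that cannot succeed in this environment.
var errRefused = errors.New("connection refused")

// solveAll is the two-loop shape: authzs -> challenges. All retry state
// (lastErr, solved) stays local to the loop that needs it.
func solveAll(authzs []Authz, solve SolveFunc) error {
	for _, az := range authzs {
		var lastErr error
		solved := false
		for _, ch := range az.Challenges {
			if lastErr = solve(az.Identifier, ch); lastErr == nil {
				solved = true
				break
			}
		}
		if !solved {
			return fmt.Errorf("authz %s: all challenges failed, last error: %v",
				az.Identifier, lastErr)
		}
	}
	return nil
}

func main() {
	authzs := []Authz{{
		Identifier: "example.com",
		Challenges: []string{"tls-alpn-01", "http-01"},
	}}
	// Simulated environment: TLS is terminated upstream, so only http-01 works.
	solve := func(id, ch string) error {
		if ch == "tls-alpn-01" {
			return errRefused
		}
		return nil
	}
	fmt.Println("error:", solveAll(authzs, solve))
}
```
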

That was the previous logic. Now, we have a third loop:

order retries -> authzs -> challenges

We need to bubble retry state up to the top-most "order" loop, which
gets manipulated in the inner-most "challenge" loop. We have to carry
failure state around through the whole retry process, mapping
identifiers to challenge types in order to remember which challenges
failed for which identifiers so we don't try them again on the next
order.
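
The three-loop shape with its carried failure state might look roughly like the sketch below. Everything here is hypothetical (the `Client` hooks, `obtain`, `firstUntried`), and it ignores details such as servers reusing already-valid authzs on a new order; it is meant only to show where the state has to live.

```go
package main

import (
	"errors"
	"fmt"
)

// Authz is a hypothetical, simplified authorization.
type Authz struct {
	Identifier string
	Challenges []string
}

// Client bundles hypothetical hooks standing in for real ACME calls.
type Client struct {
	NewOrder   func(identifiers []string) ([]Authz, error)
	Solve      func(identifier, challengeType string) error
	Deactivate func(identifier string)
}

var errChallengeFailed = errors.New("challenge failed")

// obtain is the three-loop shape: order retries -> authzs -> challenges.
func obtain(c Client, identifiers []string, maxOrders int) error {
	// failed maps identifier -> challenge type -> true. It must outlive
	// each order so a later order does not repeat a known-bad challenge.
	failed := map[string]map[string]bool{}
	for n := 0; n < maxOrders; n++ {
		authzs, err := c.NewOrder(identifiers)
		if err != nil {
			return err
		}
		orderDead := false
		for i, az := range authzs {
			ch, ok := firstUntried(az.Challenges, failed[az.Identifier])
			if !ok {
				deactivateAll(c, authzs[i:])
				return fmt.Errorf("%s: all offered challenges failed", az.Identifier)
			}
			if err := c.Solve(az.Identifier, ch); err != nil {
				if failed[az.Identifier] == nil {
					failed[az.Identifier] = map[string]bool{}
				}
				failed[az.Identifier][ch] = true
				// One failure invalidates the authz and the order, so clean
				// up the authzs not yet attempted and start a new order.
				deactivateAll(c, authzs[i+1:])
				orderDead = true
				break
			}
		}
		if !orderDead {
			return nil
		}
	}
	return errors.New("gave up after too many orders")
}

// firstUntried is the innermost loop: it picks one challenge per order.
func firstUntried(offered []string, failed map[string]bool) (string, bool) {
	for _, ch := range offered {
		if !failed[ch] { // reading a nil map is safe and yields false
			return ch, true
		}
	}
	return "", false
}

func deactivateAll(c Client, authzs []Authz) {
	for _, az := range authzs {
		c.Deactivate(az.Identifier)
	}
}

func main() {
	orders := 0
	c := Client{
		NewOrder: func(ids []string) ([]Authz, error) {
			orders++
			var out []Authz
			for _, id := range ids {
				out = append(out, Authz{Identifier: id, Challenges: []string{"tls-alpn-01", "http-01"}})
			}
			return out, nil
		},
		Solve: func(id, ch string) error {
			if ch == "tls-alpn-01" {
				return errChallengeFailed
			}
			return nil
		},
		Deactivate: func(string) {},
	}
	err := obtain(c, []string{"a.example", "b.example"}, 5)
	fmt.Println("orders:", orders, "error:", err)
}
```

In this simulation two identifiers whose first-choice challenge fails cost three orders in total, where the two-loop version would have needed one.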

Additionally, our challenge selection is necessarily made more complex.
Before, we could just randomize the order of the challenges (as a good
practice, to avoid accidental dependence on just one challenge type).
Now, because retries are expensive and complex, we absolutely need to
avoid them as much as possible. So instead of a random order, we keep
a history of challenge success rates and choose the most successful
challenge type first, every time. If it fails, we try the next-most-
successful, and so on, but each retry is part of a new order and that's
expensive.
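
As a rough illustration of that selection strategy, the sketch below ranks offered challenge types by observed success rate. The `History` type and its methods are hypothetical, it is not safe for concurrent use, and scoring never-tried types as 0 is an arbitrary choice made for brevity:

```go
package main

import (
	"fmt"
	"sort"
)

// tally counts outcomes for one challenge type.
type tally struct{ successes, attempts int }

// History tracks outcomes per challenge type (hypothetical helper).
type History map[string]tally

// Record notes the outcome of one attempt.
func (h History) Record(challengeType string, ok bool) {
	t := h[challengeType]
	t.attempts++
	if ok {
		t.successes++
	}
	h[challengeType] = t
}

// rate returns the observed success rate; never-tried types score 0
// (an arbitrary choice for this sketch).
func (h History) rate(challengeType string) float64 {
	t := h[challengeType]
	if t.attempts == 0 {
		return 0
	}
	return float64(t.successes) / float64(t.attempts)
}

// Rank returns a copy of offered, sorted most-successful first. The sort
// is stable, so ties keep the order in which the server listed them.
func (h History) Rank(offered []string) []string {
	ranked := append([]string(nil), offered...)
	sort.SliceStable(ranked, func(i, j int) bool {
		return h.rate(ranked[i]) > h.rate(ranked[j])
	})
	return ranked
}

func main() {
	h := History{}
	h.Record("tls-alpn-01", false)
	h.Record("http-01", true)
	h.Record("dns-01", true)
	h.Record("dns-01", false)
	// Prints [http-01 dns-01 tls-alpn-01]
	fmt.Println(h.Rank([]string{"tls-alpn-01", "dns-01", "http-01"}))
}
```
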

The ACME spec forces leaky, complex abstractions, and makes writing
correct clients more difficult and error-prone than is necessary. (Just
look at this commit!) I am not aware of any good reason that the spec is
the way that it is on this point. The explanations I've heard are "it's
simpler for servers that way" and "free CAs want to keep their costs
down", but it's NOT simpler this way (again, look at the code), and
order transactions are *expensive*: CAs don't want frequent polling on
order status because there is so much state attached to an order! Yet
the way the spec is written requires significantly more CPU and network
cycles than are necessary.

Because it only takes one successful challenge to mark the authz as
"valid", and because order transactions are expensive for the server,
and because the client-side logic is immeasurably more complex and
convoluted and tricky to get right this way, the current ACME spec is
nonsensical on this point. Maybe it intended to optimize for server
implementations (which it didn't do successfully, as explained), but
forgot that ACME *clients* would fill the world, not servers; and now
we have something that is unintentionally hostile toward clean, correct,
efficient, and low-cost implementations.

In summary:

The ACME protocol should be changed so that an authz is not marked as
"invalid" until ALL offered challenges fail, rather than just one.
