Re: [Idr] Secdir last call review of draft-ietf-idr-long-lived-gr-05

Valery Smyslov <valery@smyslov.net> Thu, 13 July 2023 08:43 UTC

From: Valery Smyslov <valery@smyslov.net>
To: 'John Scudder' <jgs@juniper.net>
Cc: secdir@ietf.org, draft-ietf-idr-long-lived-gr.all@ietf.org, idr@ietf.org, last-call@ietf.org
References: <168845800740.483.1479588038121884290@ietfa.amsl.com> <71207892-ADFC-417E-B8C4-66B564C53934@juniper.net>
In-Reply-To: <71207892-ADFC-417E-B8C4-66B564C53934@juniper.net>
Date: Thu, 13 Jul 2023 11:43:00 +0300
Message-ID: <06fd01d9b566$0a2ca550$1e85eff0$@smyslov.net>
Archived-At: <https://mailarchive.ietf.org/arch/msg/idr/qK7HWZ-rwy_lkDjYHbjrRU2h3nA>

Hi John,

please see inline.

> Hi Valery,
> 
> Thanks for your review. Some responses inline below.
> 
> > On Jul 4, 2023, at 4:06 AM, Valery Smyslov via Datatracker <noreply@ietf.org> wrote:
> >
> > Reviewer: Valery Smyslov
> > Review result: Has Issues
> >
> > I have reviewed this document as part of the security directorate's
> > ongoing effort to review all IETF documents being processed by the
> > IESG.  These comments were written primarily for the benefit of the
> > security area directors.  Document editors and WG chairs should treat
> > these comments just like any other last call comments.
> >
> > The document defines a new BGP capability "Long-lived Graceful Restart Capability"
> > that allows stale routes to be retained for a longer time than is currently allowed
> > by RFC 4724. The document is well written and is easy to understand.
> 
> Thank you!

You're welcome :-)

> > My concern is that the upper limit for the "Long-lived Stale Time" period is 2^24 - 1 seconds
> > (about 194 days) and the document doesn't specify any restrictions for this value.
> 
> I’m not sure if this is different from what you meant by “any restrictions”, but Section 4.2 has “These
> timers MAY be modified by local configuration.” After discussing it with my co-authors, we agree that
> this is too easy to overlook, and propose to change it to “The timers received in the Long-lived Graceful
> Restart Capability SHOULD be modifiable by local configuration, which may impose either an upper or a
> lower bound, or both, on their respective values.” Then, we return to this in our updated Security
> Considerations section, read on.

OK.
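
For illustration, the proposed wording can be read as a clamp applied to the
received timer. The following is a minimal sketch in Python, not text from the
draft: the function name, parameter names, and the choice to reject
contradictory bounds are all assumptions. It also checks the "about 194 days"
figure quoted above.

```python
# Illustrative sketch only; names and behavior are assumptions, not the draft's.
LLST_MAX = 2**24 - 1  # upper limit of the Long-lived Stale Time, in seconds


def effective_llst(received, lower_bound=None, upper_bound=None):
    """Clamp a received LLST (seconds) to optional locally configured bounds."""
    if not 0 <= received <= LLST_MAX:
        raise ValueError("LLST out of range")
    if lower_bound is not None and upper_bound is not None and lower_bound > upper_bound:
        raise ValueError("lower bound exceeds upper bound")
    value = received
    if lower_bound is not None:
        value = max(value, lower_bound)  # enforce the local minimum
    if upper_bound is not None:
        value = min(value, upper_bound)  # enforce the local maximum
    return value


# The maximum encodable value expressed in days (86400 seconds per day).
print(f"{LLST_MAX / 86400:.1f} days")  # prints 194.2 days
```

With a hypothetical one-day upper bound, a peer advertising the maximum value
would be held to 86400 seconds: effective_llst(LLST_MAX, upper_bound=86400).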

> > It seems to me that having such long lived stale routes may open new possibilities for attackers.
> > In particular, the possibility of resource exhaustion from storing a lot of stale routes
> > for a very long time, leading to a DoS attack, comes first to my mind.
> > This possibility is not mentioned in the Security Considerations.
> 
> We worked through several scenarios and as best we can determine, this is adequately covered under
> "The security implications of the LLGR mechanism defined in this document are akin to those incurred by
> the maintenance of stale routing information within a network." The outline looks like:
> 
> 1. To successfully mount a DoS attack against the network, the attacker has to be able to inject a large
> number of routes. If an attacker can do that, it’s a pre-existing vulnerability, not one created by LLGR.
> 2. The new vulnerability would be, if the DoS in (1) can be exacerbated by keeping the garbage routes
> stored in the network even after the attack against the proximate victim has been remediated.
>   2.a. But, if the attack is remediated, for instance by resetting the BGP session from the attacker to the
> victim (either manually, or as a result of the operation of an automatic defense feature such as max-
> prefixes), then the routes would promptly be flushed from the network as a consequence of the normal
> operation of the BGP protocol.
>   2.b. So, in order for the attack to succeed, the proximate victim would have to be prevented from
> withdrawing the routes. Ergo, the attacker would have to have the ability to not only inject routes in (1),
> but subsequently to silence the victim router (e.g. by crashing it into a non-recoverable state).
>   2.c. Even if that scenario were to be carried out (which implies underlying vulnerabilities probably more
> concerning than the LLGR resource-exhaustion vulnerability itself) the victim router’s next hop would
> disappear from the IGP, which would cause the LLGR routes to become non-resolvable, removing them
> from the FIB. Granted that RIB resources would still be consumed for the duration of the attack or the
> LLST, whichever is shorter, but in general FIB, not RIB, resources are the bottleneck.

I was mostly thinking of something like 2.b. You are in a better position to
analyze this scenario, so if you think that it is not a real threat, then I trust you.

> We’re not absolutely opposed to including an analysis like the above in the Security Considerations, but
> pending any further discussion, we’re comfortable with leaving it at the brief outline that’s already
> present. We did add one sentence to the introductory paragraph, so
> 
> OLD:
> The security implications of the LLGR mechanism defined in this document are akin to those incurred by
> the maintenance of stale routing information within a network.
> 
> NEW:
> The security implications of the LLGR mechanism defined in this document are akin to those incurred by
> the maintenance of stale routing information within a network. However, since the retention time may
> potentially be much longer, the window during which certain attacks are feasible may be substantially
> increased.

Fine with me, thank you.

> > Then, it seems to me that the countermeasures suggested in Section 6 to avoid VPN breach
> > may not work for large values of the "Long-lived Stale Time" period.
> >
> > And a final nit: the last para of Section 6 looks to me like some sort of excuse, which
> > in my opinion is not appropriate for a technical document. No matter how complex an attack is,
> > if it is ever feasible with the given threat model, then we should just describe it
> > with no additional sentiments that it is hard. Perhaps it is better to describe possible
> > attacks in terms of attacker's capabilities. E.g.: "If an attacker is able to inject packets
> > into the network then the following attacks are possible...".
> 
> Thanks for challenging us on these! Happily, the rewrite to fix the latter also led to improving the clarity
> of exposition regarding the countermeasure. Your point is still correct of course, that if it’s impossible to
> find a viable configuration that prevents overlap of label allocation reuse time and LLST, then the attack
> can’t be entirely ruled out; I hope the proposed text is sufficiently clear on this point. I’ve pasted the
> proposed update below.
> 
> OLD:
>    Therefore, to avoid VPN breach, before enabling BGP LLGR for a VPN
>    address family, Service Providers need to check how fast a given
>    label can be reused by a PE, taking into account:
> 
>    *  The load of the BGP route churn on a PE (in terms of the number of
>       VPN labels advertised and the churn rate).
> 
>    *  The label allocation policy on the PE (possibly depending upon the
>       size of the pool of the VPN labels (which can be restricted by
>       hardware considerations or other MPLS usages), the label
>       allocation scheme (for example per route or per VRF/CE), the re-
>       allocation policy (for example least recently used label).
> 
>    Note that [RFC4781] which defines Graceful Restart Mechanism for BGP
>    with MPLS is also applicable to BGP LLGR.
> 
>    These considerations notwithstanding, the LLGR mechanism described
>    within this document is considered to be complex to exploit
>    maliciously - in order to inject packets into a topology, there is a
>    requirement to engineer a specific LLGR state between two PE devices,
>    whilst engineering label reallocation to occur in a manner that
>    results in the two topologies overlapping.  Such allocation is
>    particularly difficult to engineer (since it is typically an internal
>    mechanism of a router).
> 
> NEW:
>    In order to exploit the vulnerability described above, there is a
>    requirement to engineer a specific LLGR state between two PE devices,
>    whilst engineering label reallocation to occur in a manner that
>    results in the two topologies overlapping.  Therefore, to avoid the
>    potential for a VPN breach, before enabling BGP LLGR for a VPN
>    address family, the operator should endeavor to ensure that the lower
>    bound on when a label might be reused is greater than the upper bound
>    on LLST.  Section 4.2 discusses the provision of an upper bound on LLST.
>    Details of features for setting a lower bound on label reuse time are
>    beyond the scope of this document; however, factors that might need
>    to be taken into account when setting this value include:
> 
>    *  The load of the BGP route churn on a PE (in terms of the number of
>       VPN labels advertised and the churn rate).
> 
>    *  The label allocation policy on the PE (possibly depending upon the
>       size of the pool of the VPN labels (which can be restricted by
>       hardware considerations or other MPLS usages), the label
>       allocation scheme (for example per route or per VRF/CE), the re-
>       allocation policy (for example least recently used label).
> 
>    Note that [RFC4781] which defines Graceful Restart Mechanism for BGP
>    with MPLS is also applicable to BGP LLGR.

Thank you, this text is much better.
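
The condition in the NEW text (the lower bound on label reuse time should be
greater than the upper bound on LLST) can be expressed as a simple
configuration check. The sketch below is only an illustration under stated
assumptions: the VpnFamilyConfig type, its field names, and the example values
are hypothetical and do not come from the draft or from any implementation.

```python
from dataclasses import dataclass


@dataclass
class VpnFamilyConfig:
    """Hypothetical per-address-family settings, both in seconds."""
    name: str
    max_llst_s: int         # locally configured upper bound on LLST
    min_label_reuse_s: int  # lower bound on when a VPN label might be reused


def unsafe_families(configs):
    """Return names of families where a label could be reused before LLST expires."""
    return [c.name for c in configs if c.min_label_reuse_s <= c.max_llst_s]


# Example values are made up for illustration.
configs = [
    VpnFamilyConfig("vpn-a", max_llst_s=3600, min_label_reuse_s=7200),
    VpnFamilyConfig("vpn-b", max_llst_s=86400, min_label_reuse_s=600),
]
print(unsafe_families(configs))  # prints ['vpn-b']
```

A family is flagged when the two windows can overlap, which is the case the
text says cannot be entirely ruled out if no viable configuration exists.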

> We’ll post a version 06 with the updates as soon as possible. Thanks again for your review.

No problem :-)

Regards,
Valery.

> —John
