Re: [bess] Benjamin Kaduk's Discuss on draft-ietf-bess-datacenter-gateway-11: (with DISCUSS and COMMENT)

Adrian Farrel <adrian@olddog.co.uk> Fri, 28 May 2021 21:03 UTC

Reply-To: adrian@olddog.co.uk
From: Adrian Farrel <adrian@olddog.co.uk>
To: 'Benjamin Kaduk' <kaduk@mit.edu>, 'The IESG' <iesg@ietf.org>
Cc: draft-ietf-bess-datacenter-gateway@ietf.org, bess-chairs@ietf.org, bess@ietf.org, 'Matthew Bocci' <matthew.bocci@nokia.com>
References: <162191416295.8400.1863947061330586900@ietfa.amsl.com>
In-Reply-To: <162191416295.8400.1863947061330586900@ietfa.amsl.com>
Date: Fri, 28 May 2021 22:03:12 +0100
Organization: Old Dog Consulting
Message-ID: <029e01d75404$df5dd570$9e198050$@olddog.co.uk>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
Thread-Index: AQHEVVNu0WLsdURlsFTMfc0myKTQG6se8OUg
Content-Language: en-gb
Archived-At: <https://mailarchive.ietf.org/arch/msg/bess/V1wV1-_KsfDIHeOtqvM0qgbSWQA>
Subject: Re: [bess] Benjamin Kaduk's Discuss on draft-ietf-bess-datacenter-gateway-11: (with DISCUSS and COMMENT)
Precedence: list

Hi Ben,

Thanks for the Discuss and detailed Comments.

Responses in line.

All changes are held in the -12 buffer as we close off issues with the other ADs.

Cheers,
Adrian

> --------------------------------------------------------------
> DISCUSS:
>
> Thanks for having the discussion with John and updating the document
> already; I benefitted a lot from being able to read the -11 that has
> started rolling in fixes from the prior discussion.  My one new discuss
> point is relatively minor, all things considered, and is really just
> trying to nail down an aspect of internal consistency.  (I also support
> Roman's discuss, but we don't need to rehash that here.)

But, FWIW, I think addressing your Discuss and one of your Comments, will further help Roman.

> When we introduce the concept of gateways, we say that they can be
> attached to the Internet or a backbone network.  We then go on to
> provide a mechanism for gateways to advertise to some tunnel ingress
> node the complete set of gateways for a given site.  It seems that we
> do fairly consistently refer to this advertisement as being over "the
> backbone network", but I'm not seeing anything that clearly disclaims
> the applicability of this technique over the Internet itself.  However, I
> think we need to have such a disclaimer, since we do have a clearly
> stated assumption that "the connected set of DCs *and the backbone
> network connecting them* are part of the same SR BGP Link State (LS)
> instance ([RFC7752] and [I-D.ietf-idr-bgpls-segment-routing-epe])"
> (emphasis mine).  If the intent is to only use this mechanism over 
> "in-BGP-LS-instance" backbones and not over the Internet, we should
> explicitly set the scope of applicability and contrast a gateway as a
> generic concept and the gateway scenarios that this mechanism
> applies to.

It's actually very common to interconnect sites using a combination of private networks and the Internet (https://www.networkworld.com/article/3031279/sd-wan-what-it-is-and-why-you-ll-use-it-one-day.html). We use the term "backbone network" as a sort of shorthand and we don't mean to preclude use of private networks or of the Internet. We can clarify this 

We will change a sentence in the first paragraph 
OLD
   DCs are attached to the Internet or a backbone network by gateway routers
NEW
  DCs (sites) are interconnected by a backbone network, which consists of any
  number of private networks and/or the Internet, by gateway routers
END

The tunnels that interconnect any pair of ASBRs or GWs appear in the BGP-LS instance as links. The transit nodes in the "backbone network" do not show up in that BGP-LS instance.

What is important (and we missed) is that the tunnels that run across "untrusted networks" should be secure tunnels. We'll make this clear with an additional paragraph in the Security considerations as follows:

     <t>Given that the gateways and ASBRs are connected by tunnels that may run across parts of the network that are not trusted, 
        data center operators using the approach set out in this network SHOULD consider using gateway-to-gateway encryption to
        protect the data center traffic.  Additionally, due consideration SHOULD be given to encrypting end-to-end traffic as it
        would be for any traffic that uses a public or untrusted network for transport.</t>

> COMMENT:
> ----------------------------------------------------------------
>
> The Abstract is perhaps pushing the bounds of reasonable length for an abstract. 
> Perhaps:
>
> % This document defines a mechanism using the BGP Tunnel Encapsulation
> % attribute to allow datacenter gateway routers to advertise routes to the 
> % prefixes reachable in the site, including advertising them on behalf of 
> % other gateways at the same site.  This allows for multiple paths across 
> % the Internet or backbone (terminating at the different gateways) to be
> % used by segment routing to steer traffic for load-balancing and
> % resiliency purposes.

The Style Guide recommends 25 lines or fewer. We have 18 lines of text.

Perhaps we can compromise a bit. We need to mention the fact that a site may have more than one gateway because that is an important part of the problem that needs to be solved. That takes us to:

OLD
   Data centers are critical components of the infrastructure used by
   network operators to provide services to their customers.  Data
   centers are attached to the Internet or a backbone network by gateway
   routers.  One data center typically has more than one gateway for
   commercial, load balancing, and resiliency reasons.

   Segment Routing is a protocol mechanism that can be used within a
   data center, and also for steering traffic that flows between two
   data center sites.  In order that one data center site may load
   balance the traffic it sends to another data center site, it needs to
   know the complete set of gateway routers at the remote data center,
   the points of connection from those gateways to the backbone network,
   and the connectivity across the backbone network.

   Other sites, such as access networks, also need to be connected
   across backbone networks through gateways.

   This document defines a mechanism using the BGP Tunnel Encapsulation
   attribute to allow each gateway router to advertise the routes to the
   prefixes reachable in the site to which it provides access, including
   advertising them on behalf of each other gateway to the same site.
NEW
   Data centers are attached to the Internet or a backbone network by
   gateway routers.  One data center typically has more than one gateway
   for commercial, load balancing, and resiliency reasons.  Other sites,
   such as access networks, also need to be connected across backbone
   networks through gateways.

   This document defines a mechanism using the BGP Tunnel Encapsulation
   attribute to allow data center gateway routers to advertise routes to the 
   prefixes reachable in the site, including advertising them on behalf of 
   other gateways at the same site.  This allows segment routing to be used
   to identify multiple paths across the Internet or backbone network 
   between different gateways.  The paths can be selected for load-balancing,
   resilience, and quality purposes.   
END

> Section 1
>
>   The solution described in this document is agnostic as to whether the
>   transit ASes do or do not have SR capabilities.  the solution uses SR
>   to stitch together path segments between GWs and through the ASBRs.
>   Thus, there is a requirement that the GWs and ASBRs are SR-capable.
>   The solution supports the SR path being extended into the ingress and
>   egress sites if they are SR-capable.
>
> There seem to be some nodes marked "ASBR" that are at the boundary 
> between the two transit ASes, in Figure 1.  This text leaves me uncertain
> whether they are expected to support SR (vs just the ASBRs that are 
> attachment points for the ingress/egress GWs).

Whether the interior nodes of the transit ASes support SRs doesn't really matter to us. The GWs and ASBRs are interconnected by tunnels as represented by links in the BGP-LS instance. SR is used to stitch these tunnels together to create an end-to-end path between an ingress site and an egress site. Thus, the GWs and ASBRs must be SR capable in order to participate.

> Section 3
>
>   o  Each GW is configured with an identifier for the site.  That
>      identifier MUST be the same across all GWs to the site (i.e., the
>      same identifier is used by all GWs to the same site), and MUST be
>      unique across all sites that are connected (i.e., across all GWs
>      to all sites that are interconnected).
>
> The advice in draft-gont-numeric-ids-sec-considerations is probably 
> relevant here.  How should we pick these identifiers?  Which properties
> are necessary and which are not needed? 

This bullet needs to be taken along with the bullet that follows, viz.

   o  A route target ([RFC4360]) MUST be attached to each GW's auto-
      discovery route (defined below) and its value MUST be set to the
      site identifier.

We will re-jig these two bullets to say...

   o  A route target ([RFC4360]) MUST be attached to each GW's auto-
      discovery route (defined below) and its value MUST be set to a 
      value that identifies the site identifier.  The rules for constructing
      a route target are detailed in [RFC4360].  It is RECOMMENDED that
      a Type x00 or x02 route target be used.

   o Site identifiers are set through configuration.  The site identifiers 
      MUST be the same across all GWs to the site (i.e., the same
      identifier is used by all GWs to the same site), and MUST be
      unique across all sites that are connected (i.e., across all GWs
      to all sites that are interconnected).
     
> o  Each GW MUST construct an import filtering rule to import any
>      route that carries a route target with the same site identifier
>      that the GW itself uses.  This means that only these GWs will
>      import those routes, and that all GWs to the same site will import
>      each other's routes and will learn (auto-discover) the current set
>      of active GWs for the site.
>
> This seems pretty fragile in the face of identifier collisions; I hope there
> is some good text in the security considerations that covers the risks here.
> [ed. it seems we cover other aspects relating to identifier selection but not
> this one] Is there any filtering that can be done other than by site identifier,
> e.g., to know that a certain peer would never be able to advertise something
> that validly has the same site identifier?

This mechanism to control the importing of routes (using route targets) has been in use in BGP networks for a while (more than 15 years since RFC 4360).

One of the big uses for route targets is in identifying which sites belong to the same VPN (see RFC 4364). There is the potential to misconfigure the route target used to identify a set of VPN sites by giving one site the wrong value. That would mean that it:
- fails to import the relevant routes
- doesn't get its routes imported by its peers
- may accidentally import false routes
- may accidentally have its routes imported by other sites in different VPNs
That is all "bad stuff" and 4364 contains no mechanisms to handle what happens if there is a misconfiguration (it's not an attack vector unless BGP or a router is compromised - in which case far worse stuff happens). Basically, if you mess up the configuration, expect the unexpected.

For us, the biggest risk is that GWs fail to discover each other rather than GWs finding false partners. This is because the BGP peerings are explicitly configured, not discovered. And that is a big reason why this is not an attack vector for us. 

Note that Section 3 concludes with (as you comment upon, below)...

   Note that if a GW is (mis)configured with a different site identifier
   from the other GWs to the same site then it will not be auto-
   discovered by the other GWs (and will not auto-discover the other
   GWs).  This would result in a GW for another site receiving only the
   Tunnel Encapsulation attribute included in the BGP best route; i.e.,
   the Tunnel Encapsulation attribute of the (mis)configured GW or that
   of the other GWs.

That said, we will call this out in the Manageability Considerations section pointing out that getting this right is important.

>   As described in Section 1, each GW will include a Tunnel
>   Encapsulation attribute with the GW encapsulation information for
>   each of the site's active GWs (including itself) in every route
>   advertised externally to that site.  [...]
>
> (I assume this is not intended to preclude the usual route filtering/split-horizon type stuff.)

Definitely doesn't preclude it. In fact that rule remains key to the way GWs behave - they don't re-advertise routes they have learned from other GWs. But (of course) all the GWs to a site advertise the same routes. 

>                                        As the current set of active GWs
>   changes (due to the addition of a new GW or the failure/removal of an
>   existing GW) each externally advertised route will be re-advertised
>   with a new Tunnel Encapsulation attribute which reflects current set
>   of active GWs.
>
> The "everybody advertises the union of what they've seen" behavior seems
> like it will latch NLRI in place as being a GW, but here we're saying that
> removal will be propagated as well as addition.  What's the mechanism for
> removing stale data (whether maliciously added or as part of maintenance? 
> If it's an explicit withdrawal, is that also propagated by everybody?  How long
> does it have to stay around for?  (I recognize that some of this is just stock
> BGP, but I am looking for more clarity on how it interacts with the "advertise
> the union of what you saw" behavior that is new to this document.

Yes, this is all handled by standard BGP mechanisms. It's how the withdraw message works and is propagated.

If a gateway auto-discovery route gets withdrawn (explicit message or dropped peering), then the remaining gateways remove its tunnel TLVs from the union and re-advertise the site's routes. 

> The text in the next paragraph mentions that there can be situations with
> broken internal routing where things land in a broken state -- how long do
> they stay broken and how can they be fixed?

It completely depends upon the situation, what the breakage is, and how the breakage is discovered. 
JGS suggested we flag up the situation rather than try to solve it (which we did in the paragraph quoted below).
While not an impossible situation, it does represent a strange brokenness that is possibly causing traffic within the site to get misrouted as well.

>   If a gateway becomes disconnected from the backbone network, or if
>   the site operator decides to terminate the gateway's activity, it
>   MUST withdraw the advertisements described above.  This means that
>   remote gateways at other sites will stop seeing advertisements from
>   this gateway.  Note that if the routing within a site is broken (for
>   example, such that there is a route from one GW to another, but not
>   in the reverse direction), then it is possible that incoming traffic
>   will be routed to the wrong GW to reach the destination prefix - in
>   this degraded network situation, traffic may be dropped.
>
> This is probably worth reiterating in the security considerations section.

Muttering about, "Not all problems are security problems" 😊
As an attack it says, "If you can break the routing within a site, then traffic coming from outside the site might also be incorrectly routed."
We don't think reiteration of this peculiarly broken situation would help.

>   Note that if a GW is (mis)configured with a different site identifier
>   from the other GWs to the same site then it will not be auto-
>   discovered by the other GWs (and will not auto-discover the other
>   GWs).  This would result in a GW for another site receiving only the
>   Tunnel Encapsulation attribute included in the BGP best route; i.e.,
>   the Tunnel Encapsulation attribute of the (mis)configured GW or that
>   of the other GWs.
>
> Are there noteworthy operational considerations of this, e.g., if all the
> traffic gets directed to a GW that lacks the bandwidth to handle it?

It is worth noting that without this mechanism, all traffic gets directed to just one gateway because that is what BGP does (plus or minus ADDPATH). 
What this document does is allow path selection, and choosing paths allows a degree of traffic engineering. So the misconfiguration situation is never worse than before this document.

> Section 4
>
>   attribute to identify the GWs through which X can be reached.  It
>   uses this information to compute SR Traffic Engineering (SR TE) paths
>   across the backbone network looking at the information advertised to
>   it in SR BGP Link State (BGP-LS)
>
> This seems to leave the reader wondering about the details of how 
> those SR TE paths are computed.  I understand that it's properly out
> of scope for this document, but a reference would go a long way.

We appreciate the inquisitive reader! In the Introduction we say...

   The solution defined in this document can be seen in the broader 
   context of site interconnection in [I-D.farrel-spring-sr-domain-interconnect].
   That document shows how other existing protocol elements may be
   combined with the solution defined in this document to provide a full
   system, but is not a necessary reference for understanding this document.

> Section 5
>
>   for a prefix X, then each GW computes an SR TE path through that site
>   to X from each of the currently active GWs, and places each in an
>   MPLS label stack sub-TLV [RFC9012] in the SR Tunnel TLV for that GW.
>
> I don't think I understand why each (egress) GW has to (re)compute
> the path through the site to X for each of the GWs at the site -- can't
> it just take the sub-TLV it got from the peer and re-propagate it?

Oh, it doesn't have to recompute it *if* it is present (i.e. if it got it from the peer). But that part of the path might not be present (that is, the sub-TLV might not be present) because:
    a. the ingress might not have visibility into the egress site beyond simple reachability 
    b. the ingress might not care 
    c. the ingress might want to let the egress site make subtle reactive choices according to local conditions
IMHO it would be unusual for the tail end of the path to be specified in this way.

BTW, the computation is not done per packet. Just like regular routing inside the site, it is done according various state-based triggers, and stored for the destination prefix.

> Section 6
>
> [The topic of which sites are allowed to send in the site's native encapsulation seems
> related to questions of what an "SR Domain" is and what boundary security it has.  I
> think that the other ADs are basically covering this topic, though, so am not sure there
> is much more to say here.]
>
>   If the GWs for a given site are configured to allow remote GWs to
>   send them a packet in that site's native encapsulation, then each GW
>   will also include multiple instances of a Tunnel TLV for that native
>   encapsulation in externally advertised routes: one for each GW and
>   each containing a Tunnel Egress Endpoint sub-TLV with that GW's
>   address.  [...]
>
> Does this implicitly require that all the GWs of the site have the same configuration
> for whether or not to allow native encapsulation from remote GWs?  How would
> things degrade if a mixed configuration did happen to occur?

This applies GW by GW since the tunnels lead to a GW, and the path has selected the tunnels. So, the sender knows.

If a gateway is configured to not allow native encapsulation, then it will receive packets in some other encapsulation that it does understand, and it will convert them to the site's native encapsulation. 

Note that the GW is part of the site, so it really (really) needs to understand the native encapsulation in the site!

Note also that (of course?) a GW that receives packets in an encapsulation it doesn't understand (and hasn't advertised that it understands) will drop the packets (just like any other data plane node will discard packets with an unknown encapsulation).

Well, there is a case where a GW advertises that it understands an encapsulation, but actually doesn't understand it. That is at best a bug, and at worst a purchasing error.

> Section 8
>
>   From a protocol point of view, the mechanisms described in this
>   document can leverage the security mechanisms already defined for
>   BGP.  Further discussion of security considerations for BGP may be
>   found in the BGP specification itself [RFC4271] and in the security
>   analysis for BGP [RFC4272].  The original discussion of the use of
>   the TCP MD5 signature option to protect BGP sessions is found in
>   [RFC5925], while [RFC6952] includes an analysis of BGP keying and
>   authentication issues.
>
> Such an elegant way of not mentioning TCP-AO :) (I do see that it is actually referenced, just not mentioned by name.)

Hmmm. The document that contains TCP-AO is referenced, but that is because of the useful discussion of MD5 that it contains.

> The whole section is quite nicely done, actually -- thank you!

We aim to please.

> Section 11
>
> I don't really understand why draft-ietf-idr-bgpls-segment-routing-epe
> is listed as normative but draft-ietf-idr-bgp-ls-segment-routing-ext is 
> listed as informative.  They seem to be used in the same place.

Oops, the former draft should be listed as Informative

> NITS/EDITORIAL
>
> Section 1
>
>   two DC sites.  In order for a source DC (also known as an ingress DC)
>   that uses SR to load balance the flows it sends to a destination DC
>
> I'd consider "also known as an ingress DC since it forms the ingress endpoint of a tunnel".

Pedant alert - the site doesn't form the ingress of the tunnel, that is the gateway.
But thanks for making me re-read because this should be "site" not "DC"!

>   sites could each be constructed differently and use different
>   technologies such as IP, MPLS with global table routing native BGP to
>   the edge, MPLS IP VPN, SR-MPLS IP VPN, or SRv6 IP VPN.  That is, the
>
> FWIW I don't think I figured out what "MPLS with global table routing
> native BGP to the edge" means with any real confidence, and my attempts
> to google it basically just found this document.  So please feel encouraged
> to take another look at the phrasing.  My current most-likely interpretation
> is that there is internally MPLS with global table, but that what's presented 
> to the outside is native BGP-based IP routing, so there's some implied
> translation layer.

Yeah, "MPLS inside the AS based on externally visible address reachability like what IP does." That's basically how MPLS-enabled ASes work.

> Section 3
>
>   To avoid the side effect of applying the Tunnel Encapsulation
>   attribute to any packet that is addressed to the GW itself, the GW
>   MUST use a different loopback address for packets intended for it.
>
> "different" is most clear when we list both things that differ.
> So perhaps it's safer to say that the address advertised for auto-
> discovery must use a different loopback address than is advertised
> for packets directed to the gateway itself.

OK

> Section 5
>
>   achieve this, each Tunnel TLV in the Tunnel Encapsulation attribute
>   contains a Prefix SID sub-TLV [RFC9012] for X.  As defined in
>   [RFC9012], the Prefix SID sub-TLV is only for IPv4/IPV6 labelled
>
> I wonder if this sentence break should really be a paragraph break, 
> since the following paragraph seems to cover MPLS in a way that
> roughly parallels how we treat IP here.

Ack

>   applies to routes of those types.  If the use of the Prefix SID sub-
>   tlv for routes of other types is defined in the future, further
>   documents will be needed to describe their use.
>
> I think that we are missing a "for SR TE tunnel encapsulation" at the end,
> as the current text is basically saying "if the use of X is defined in the future,
> future documents will be needed to describe the use of X", which is fairly
> devoid of content.

Ah, no.
Other Prefix-SID sub-TLVs might be defined for general use in SR.
If that happens, further documents will also be needed to describe their use in the context of this document.
Rephrased.

>   Alternatively, if MPLS SR is in use and if the GWs for a given site
>   are configured to allow remote GWs to perform SR TE through that site
>   for a prefix X, then each GW computes an SR TE path through that site
>
> We might benefit from sprinkling around some "ingress" and "egress"
> here.

Prefer chocolate sprinkles, but have reluctantly used ingress and egress.

[bess] Benjamin Kaduk's Discuss on draft-ietf-bes… Benjamin Kaduk via Datatracker
Re: [bess] Benjamin Kaduk's Discuss on draft-ietf… Adrian Farrel
Re: [bess] Benjamin Kaduk's Discuss on draft-ietf… Benjamin Kaduk
Re: [bess] Benjamin Kaduk's Discuss on draft-ietf… Adrian Farrel
Re: [bess] Benjamin Kaduk's Discuss on draft-ietf… Benjamin Kaduk
Re: [bess] Benjamin Kaduk's Discuss on draft-ietf… Benjamin Kaduk
Re: [bess] Benjamin Kaduk's Discuss on draft-ietf… Adrian Farrel
Re: [bess] Benjamin Kaduk's Discuss on draft-ietf… Benjamin Kaduk