Re: [bess] Benjamin Kaduk's Discuss on draft-ietf-bess-datacenter-gateway-11: (with DISCUSS and COMMENT)

Benjamin Kaduk <kaduk@mit.edu> Wed, 21 July 2021 17:19 UTC

Date: Wed, 21 Jul 2021 10:18:48 -0700
From: Benjamin Kaduk <kaduk@mit.edu>
To: Adrian Farrel <adrian@olddog.co.uk>
Cc: 'The IESG' <iesg@ietf.org>, draft-ietf-bess-datacenter-gateway@ietf.org, bess-chairs@ietf.org, bess@ietf.org, 'Matthew Bocci' <matthew.bocci@nokia.com>
Message-ID: <20210721171848.GF88594@kduck.mit.edu>
References: <162191416295.8400.1863947061330586900@ietfa.amsl.com> <029e01d75404$df5dd570$9e198050$@olddog.co.uk>
In-Reply-To: <029e01d75404$df5dd570$9e198050$@olddog.co.uk>
Archived-At: <https://mailarchive.ietf.org/arch/msg/bess/PiY_rCc4yptfkfIgxf8dmVkpTxo>
Subject: Re: [bess] Benjamin Kaduk's Discuss on draft-ietf-bess-datacenter-gateway-11: (with DISCUSS and COMMENT)

Picking up (belatedly) where I left off in my initial reply...

On Fri, May 28, 2021 at 10:03:12PM +0100, Adrian Farrel wrote:
[snip]
> >                                        As the current set of active GWs
> >   changes (due to the addition of a new GW or the failure/removal of an
> >   existing GW) each externally advertised route will be re-advertised
> >   with a new Tunnel Encapsulation attribute which reflects current set
> >   of active GWs.
> >
> > The "everybody advertises the union of what they've seen" behavior seems
> > like it will latch NLRI in place as being a GW, but here we're saying that
> > removal will be propagated as well as addition.  What's the mechanism for
> > removing stale data (whether maliciously added or as part of maintenance)?
> > If it's an explicit withdrawal, is that also propagated by everybody?  How long
> > does it have to stay around for?  (I recognize that some of this is just stock
> > BGP, but I am looking for more clarity on how it interacts with the "advertise
> > the union of what you saw" behavior that is new to this document.)
> 
> Yes, this is all handled by standard BGP mechanisms. It's how the withdraw message works and is propagated.
> 
> If a gateway auto-discovery route gets withdrawn (explicit message or dropped peering), then the remaining gateways remove its tunnel TLVs from the union and re-advertise the site's routes. 

For what it's worth, what's going on here seems to have become much more
clear to me after a re-read of the document and the month's delay.  I'm
sorry that I was confused about it the first time around.

In short: the "union of what you saw" only gets advertised externally, and
the auto-discovery (internal) advertisements will always just be for the
advertising router's own information.  There's no auto-discovery that
includes the union of what you saw, which is I think what I was concerned
about here.
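
As a minimal sketch of that distinction, assuming a toy model rather than
anything the draft specifies: the class name, the dict-shaped routes, and
the documentation addresses below are all hypothetical.  Each GW puts only
its own information in the auto-discovery route, and the union (own
information plus whatever has been discovered for the same site identifier)
appears only in the external advertisement.  A withdrawal simply removes the
peer from the set that feeds the union.

```python
from dataclasses import dataclass, field


@dataclass
class Gateway:
    """Hypothetical model of one gateway (GW) at a site."""
    address: str
    site_id: int
    # Addresses of same-site GWs learned from auto-discovery routes.
    discovered: set = field(default_factory=set)

    def auto_discovery_route(self):
        # Internal advertisement: only this router's own information.
        return {"origin": self.address,
                "site_id": self.site_id,
                "tunnels": [self.address]}

    def receive_auto_discovery(self, route):
        # A route with a different site identifier is not auto-discovered.
        if route["site_id"] == self.site_id:
            self.discovered.add(route["origin"])

    def receive_withdrawal(self, origin):
        # Explicit withdraw or dropped peering: forget that GW.
        self.discovered.discard(origin)

    def external_tunnels(self):
        # External advertisement: union of own and discovered GWs.
        return sorted({self.address} | self.discovered)


gw1 = Gateway("192.0.2.1", site_id=1)
gw2 = Gateway("192.0.2.2", site_id=1)
gw3 = Gateway("192.0.2.3", site_id=2)  # (mis)configured site identifier

for sender in (gw2, gw3):
    gw1.receive_auto_discovery(sender.auto_discovery_route())

print(gw1.external_tunnels())  # ['192.0.2.1', '192.0.2.2']
gw1.receive_withdrawal("192.0.2.2")
print(gw1.external_tunnels())  # ['192.0.2.1']
```

Because the union is never fed back into an auto-discovery route, nothing in
this model can latch a withdrawn GW in place.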

> > The text in the next paragraph mentions that there can be situations with
> > broken internal routing where things land in a broken state -- how long do
> > they stay broken and how can they be fixed?
> 
> It completely depends upon the situation, what the breakage is, and how the breakage is discovered. 
> JGS suggested we flag up the situation rather than try to solve it (which we did in the paragraph quoted below).
> While not an impossible situation, it does represent a strange brokenness that is possibly causing traffic within the site to get misrouted as well.
> 
> >   If a gateway becomes disconnected from the backbone network, or if
> >   the site operator decides to terminate the gateway's activity, it
> >   MUST withdraw the advertisements described above.  This means that
> >   remote gateways at other sites will stop seeing advertisements from
> >   this gateway.  Note that if the routing within a site is broken (for
> >   example, such that there is a route from one GW to another, but not
> >   in the reverse direction), then it is possible that incoming traffic
> >   will be routed to the wrong GW to reach the destination prefix - in
> >   this degraded network situation, traffic may be dropped.
> >
> > This is probably worth reiterating in the security considerations section.
> 
> Muttering about, "Not all problems are security problems" 😊
> As an attack it says, "If you can break the routing within a site, then traffic coming from outside the site might also be incorrectly routed."
> We don't think reiteration of this peculiarly broken situation would help.

Ok.

> >   Note that if a GW is (mis)configured with a different site identifier
> >   from the other GWs to the same site then it will not be auto-
> >   discovered by the other GWs (and will not auto-discover the other
> >   GWs).  This would result in a GW for another site receiving only the
> >   Tunnel Encapsulation attribute included in the BGP best route; i.e.,
> >   the Tunnel Encapsulation attribute of the (mis)configured GW or that
> >   of the other GWs.
> >
> > Are there noteworthy operational considerations of this, e.g., if all the
> > traffic gets directed to a GW that lacks the bandwidth to handle it?
> 
> It is worth noting that without this mechanism, all traffic gets directed to just one gateway because that is what BGP does (plus or minus ADDPATH). 
> What this document does is allow path selection, and choosing paths allows a degree of traffic engineering. So the misconfiguration situation is never worse than before this document.

Good point!

> > Section 4
> >
> >   attribute to identify the GWs through which X can be reached.  It
> >   uses this information to compute SR Traffic Engineering (SR TE) paths
> >   across the backbone network looking at the information advertised to
> >   it in SR BGP Link State (BGP-LS)
> >
> > This seems to leave the reader wondering about the details of how 
> > those SR TE paths are computed.  I understand that it's properly out
> > of scope for this document, but a reference would go a long way.
> 
> We appreciate the inquisitive reader! In the Introduction we say...
> 
>    The solution defined in this document can be seen in the broader 
>    context of site interconnection in [I-D.farrel-spring-sr-domain-interconnect].
>    That document shows how other existing protocol elements may be
>    combined with the solution defined in this document to provide a full
>    system, but is not a necessary reference for understanding this document.
> 
> > Section 5
> >
> >   for a prefix X, then each GW computes an SR TE path through that site
> >   to X from each of the currently active GWs, and places each in an
> >   MPLS label stack sub-TLV [RFC9012] in the SR Tunnel TLV for that GW.
> >
> > I don't think I understand why each (egress) GW has to (re)compute
> > the path through the site to X for each of the GWs at the site -- can't
> > it just take the sub-TLV it got from the peer and re-propagate it?
> 
> Oh, it doesn't have to recompute it *if* it is present (i.e. if it got it from the peer). But that part of the path might not be present (that is, the sub-TLV might not be present) because:
>     a. the ingress might not have visibility into the egress site beyond simple reachability 
>     b. the ingress might not care 
>     c. the ingress might want to let the egress site make subtle reactive choices according to local conditions
> IMHO it would be unusual for the tail end of the path to be specified in this way.

(Disclaimer: this is not an important topic and I'm happy to drop it for
expediency if desired.)  I am not sure that my point came across as
intended.  The text here seems to be about what the egress GW does when it
advertises the "union of all tunnel encapsulation information" route
externally.  My understanding/expectation was that each egress GW would be
able to just take the bits it got from auto-discovery (internal) routes and
squish them together.  As you note, the ingress may not care or need the
details of the path within the egress site from GW to prefix X, and so I
don't understand why the advertising egress GW would go to the trouble of
computing a route from the other GWs in its site to prefix X.  If such a
route was needed, the other GW in the site would be better placed to
compute such a route and include it in the auto-discovery route anyway.
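
The "squish them together" reading could be sketched roughly as follows.
This illustrates the expectation described above, not the behavior the
draft specifies; the function name, the dict-shaped TLVs, the
"label_stack" key (standing in for an MPLS label stack sub-TLV), and the
label values are all assumptions made for the example.

```python
import copy


def build_external_tunnel_tlvs(own_tlv, auto_discovery_tlvs):
    """Combine Tunnel TLVs without recomputing any intra-site path.

    A TLV that arrived with a "label_stack" keeps it unchanged; a TLV that
    arrived without one is passed on without one.
    """
    # Deep-copy so the advertised TLVs are independent of the stored routes.
    return [copy.deepcopy(tlv) for tlv in [own_tlv, *auto_discovery_tlvs]]


own = {"egress": "192.0.2.1", "label_stack": [16001, 16005]}
peer = {"egress": "192.0.2.2"}  # peer supplied no path to prefix X

external = build_external_tunnel_tlvs(own, [peer])
print(external)
```

In this reading the advertising GW never needs to know the path from the
other GW to X; if that path matters, the other GW supplies it.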

> BTW, the computation is not done per packet. Just like regular routing inside the site, it is done according to various state-based triggers, and stored for the destination prefix.

(Right.)

> > Section 6
> >
> > [The topic of which sites are allowed to send in the site's native encapsulation seems
> > related to questions of what an "SR Domain" is and what boundary security it has.  I
> > think that the other ADs are basically covering this topic, though, so am not sure there
> > is much more to say here.]
> >
> >   If the GWs for a given site are configured to allow remote GWs to
> >   send them a packet in that site's native encapsulation, then each GW
> >   will also include multiple instances of a Tunnel TLV for that native
> >   encapsulation in externally advertised routes: one for each GW and
> >   each containing a Tunnel Egress Endpoint sub-TLV with that GW's
> >   address.  [...]
> >
> > Does this implicitly require that all the GWs of the site have the same configuration
> > for whether or not to allow native encapsulation from remote GWs?  How would
> > things degrade if a mixed configuration did happen to occur?
> 
> This applies GW by GW since the tunnels lead to a GW, and the path has selected the tunnels. So, the sender knows.
> 
> If a gateway is configured to not allow native encapsulation, then it will receive packets in some other encapsulation that it does understand, and it will convert them to the site's native encapsulation. 
> 
> Note that the GW is part of the site, so it really (really) needs to understand the native encapsulation in the site!
> 
> Note also that (of course?) a GW that receives packets in an encapsulation it doesn't understand (and hasn't advertised that it understands) will drop the packets (just like any other data plane node will discard packets with an unknown encapsulation).
> 
> Well, there is a case where a GW advertises that it understands an encapsulation, but actually doesn't understand it. That is at best a bug, and at worst a purchasing error.

Okay.  So if there is an issue here at all (not entirely clear) it's just
an editorial one about "the GWs for a given site" vs "any GWs for a given
site", and "each GW" vs "each such GW" (or similar).
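
A small sketch of the "each such GW" reading, assuming per-GW
configuration: the "allow_native" flag, the dict layout, and the "SR-MPLS"
string used as the native encapsulation are hypothetical names for
illustration only.

```python
def native_tunnel_tlvs(gateways, native_encap):
    """Return one native-encapsulation Tunnel TLV per GW that allows it."""
    return [
        {"tunnel_type": native_encap,
         "tunnel_egress_endpoint": gw["address"]}
        for gw in gateways
        if gw["allow_native"]
    ]


site_gws = [
    {"address": "192.0.2.1", "allow_native": True},
    {"address": "192.0.2.2", "allow_native": False},  # mixed configuration
]

tlvs = native_tunnel_tlvs(site_gws, "SR-MPLS")
print(tlvs)
```

With a mixed configuration, only the GW that allows native encapsulation
gets a native Tunnel TLV; the other remains reachable only through the
encapsulations it did advertise.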

> > Section 8
> >
> >   From a protocol point of view, the mechanisms described in this
> >   document can leverage the security mechanisms already defined for
> >   BGP.  Further discussion of security considerations for BGP may be
> >   found in the BGP specification itself [RFC4271] and in the security
> >   analysis for BGP [RFC4272].  The original discussion of the use of
> >   the TCP MD5 signature option to protect BGP sessions is found in
> >   [RFC5925], while [RFC6952] includes an analysis of BGP keying and
> >   authentication issues.
> >
> > Such an elegant way of not mentioning TCP-AO :) (I do see that it is actually referenced, just not mentioned by name.)
> 
> Hmmm. The document that contains TCP-AO is referenced, but that is because of the useful discussion of MD5 that it contains.
> 
> > The whole section is quite nicely done, actually -- thank you!
> 
> We aim to please.
> 
> > Section 11
> >
> > I don't really understand why draft-ietf-idr-bgpls-segment-routing-epe
> > is listed as normative but draft-ietf-idr-bgp-ls-segment-routing-ext is 
> > listed as informative.  They seem to be used in the same place.
> 
> Oops, the former draft should be listed as Informative.

It is good to know that not all of my confusion is misguided :)

> > NITS/EDITORIAL

No further comments on the nits/editorial, but the extra commentary is
appreciated.

I will try to remember to get you the chocolate-sprinkles option if I'm
ever faced with that choice...

Thanks again for all your care in the responses and updates, as well as
your patience with my laggardly replies.  Any strain in such patience is
justified, and I accept the blame for it.

I will go post a No Objection ballot on the -12 now.

-Ben

> > Section 1
> >
> >   two DC sites.  In order for a source DC (also known as an ingress DC)
> >   that uses SR to load balance the flows it sends to a destination DC
> >
> > I'd consider "also known as an ingress DC since it forms the ingress endpoint of a tunnel".
> 
> Pedant alert - the site doesn't form the ingress of the tunnel, that is the gateway.
> But thanks for making me re-read because this should be "site" not "DC"!
> 
> >   sites could each be constructed differently and use different
> >   technologies such as IP, MPLS with global table routing native BGP to
> >   the edge, MPLS IP VPN, SR-MPLS IP VPN, or SRv6 IP VPN.  That is, the
> >
> > FWIW I don't think I figured out what "MPLS with global table routing
> > native BGP to the edge" means with any real confidence, and my attempts
> > to google it basically just found this document.  So please feel encouraged
> > to take another look at the phrasing.  My current most-likely interpretation
> > is that there is internally MPLS with global table, but that what's presented 
> > to the outside is native BGP-based IP routing, so there's some implied
> > translation layer.
> 
> Yeah, "MPLS inside the AS based on externally visible address reachability like what IP does." That's basically how MPLS-enabled ASes work.
> 
> > Section 3
> >
> >   To avoid the side effect of applying the Tunnel Encapsulation
> >   attribute to any packet that is addressed to the GW itself, the GW
> >   MUST use a different loopback address for packets intended for it.
> >
> > "different" is most clear when we list both things that differ.
> > So perhaps it's safer to say that the address advertised for auto-
> > discovery must use a different loopback address than is advertised
> > for packets directed to the gateway itself.
> 
> OK
> 
> > Section 5
> >
> >   achieve this, each Tunnel TLV in the Tunnel Encapsulation attribute
> >   contains a Prefix SID sub-TLV [RFC9012] for X.  As defined in
> >   [RFC9012], the Prefix SID sub-TLV is only for IPv4/IPV6 labelled
> >
> > I wonder if this sentence break should really be a paragraph break, 
> > since the following paragraph seems to cover MPLS in a way that
> > roughly parallels how we treat IP here.
> 
> Ack
> 
> >   applies to routes of those types.  If the use of the Prefix SID sub-
> >   tlv for routes of other types is defined in the future, further
> >   documents will be needed to describe their use.
> >
> > I think that we are missing a "for SR TE tunnel encapsulation" at the end,
> > as the current text is basically saying "if the use of X is defined in the future,
> > future documents will be needed to describe the use of X", which is fairly
> > devoid of content.
> 
> Ah, no.
> Other Prefix-SID sub-TLVs might be defined for general use in SR.
> If that happens, further documents will also be needed to describe their use in the context of this document.
> Rephrased.
> 
> >   Alternatively, if MPLS SR is in use and if the GWs for a given site
> >   are configured to allow remote GWs to perform SR TE through that site
> >   for a prefix X, then each GW computes an SR TE path through that site
> >
> > We might benefit from sprinkling around some "ingress" and "egress"
> > here.
> 
> Prefer chocolate sprinkles, but have reluctantly used ingress and egress.
>