[Gen-art] Genart last call review of draft-ietf-bess-datacenter-gateway-10

Gyan Mishra via Datatracker <noreply@ietf.org> Thu, 29 April 2021 05:46 UTC
From: Gyan Mishra via Datatracker <noreply@ietf.org>
To: gen-art@ietf.org
Cc: bess@ietf.org, draft-ietf-bess-datacenter-gateway.all@ietf.org, last-call@ietf.org
Auto-Submitted: auto-generated
Precedence: bulk
Message-ID: <161967518819.13605.6722172787091620121@ietfa.amsl.com>
Reply-To: Gyan Mishra <hayabusagsm@gmail.com>
Date: Wed, 28 Apr 2021 22:46:28 -0700
Archived-At: <https://mailarchive.ietf.org/arch/msg/gen-art/6UfclQvHDjGnGn0ql-TI3b1_xZ4>
Subject: [Gen-art] Genart last call review of draft-ietf-bess-datacenter-gateway-10
Reviewer: Gyan Mishra
Review result: Not Ready

I am the assigned Gen-ART reviewer for this draft. The General Area
Review Team (Gen-ART) reviews all IETF documents being processed
by the IESG for the IETF Chair.  Please treat these comments just
like any other last call comments.

For more information, please see the FAQ at

<https://trac.ietf.org/trac/gen/wiki/GenArtfaq>.

Document: draft-ietf-bess-datacenter-gateway-??
Reviewer: Gyan Mishra
Review Date: 2021-04-28
IETF LC End Date: 2021-04-29
IESG Telechat date: Not scheduled for a telechat

Summary:
   This document defines a mechanism using the BGP Tunnel Encapsulation
   attribute to allow each gateway router to advertise the routes to the
   prefixes in the Segment Routing domains to which it provides access,
   and also to advertise on behalf of each other gateway to the same
   Segment Routing domain.

This draft needs to provide more clarity about the use case, and about where
and how the mechanism would be used and implemented.  From reading the
specification, it appears that some technical gaps exist.  There are some
major issues with this draft, and I don’t think it is ready yet.

Major issues:

Abstract comments:
The abstract mentions the use of Segment Routing within the Data Center.  Is
that a requirement for this specification to work, given that it is mentioned
throughout the draft?  Technically, I would think that discovery of the
gateways is feasible without requiring SR within the Data Center.

Load balancing is a bigger issue: the draft presents it as the problem
statement and as what the draft is trying to solve.  I address it in the
introduction comments.

Introduction comments:
In the introduction, the use case is expanded much further, to any functional
edge AS, in the text below.

OLD

   “SR may also be operated in other domains, such as access networks.
   Those domains also need to be connected across backbone networks
   through gateways.  For illustrative purposes, consider the Ingress
   and Egress SR Domains shown in Figure 1 as separate ASes.  The
   various ASes that provide connectivity between the Ingress and Egress
   Domains could each be constructed differently and use different
   technologies such as IP, MPLS with global table routing native BGP to
   the edge, MPLS IP VPN, SR-MPLS IP VPN, or SRv6 IP VPN”

This paragraph expands the use case to any ingress or egress stub domain: Data
Center, Access, or any other.  If that is the case, should the draft name
change to something like “stub edge domain services discovery”?  As this draft
can be used for any domain, I would not preclude any use case.  I would make
the GW discovery open to any service GW edge function and change the draft
name to something more appropriate.

This paragraph also says “for illustrative purposes”, which is fine, but it
then expands the overlay/underlay use cases.  I believe this mechanism can only
be used with a technology that has an overlay and an underlay, which would
preclude any use case with only an underlay and global table routing, such as
the “IP, MPLS with global table routing native BGP to the edge” case that is
mentioned.  IP or global table routing would be an issue because this
specification requires setting an RT and an export/import RT policy for the
discovery of routes advertised by the GWs.  As I don’t think this solution
would work technically for global table routing, from what I can tell, I will
update the above paragraph to preclude global table routing.  We can add it
back in if we can figure that out.  However, I don’t think any public or
private operator would make the drastic change from a global table carrying
all BGP prefixes in the underlay to a VPN overlay, pushing all the any-any
prefixes into the overlay, as would be a prerequisite to use this draft.

From this point forward I am going to assume we are using VPN overlay
technology such as SR or MPLS.

NEW

   “SR may also be operated in other domains, such as access networks.
   Those domains also need to be connected across backbone networks
   through gateways.  For illustrative purposes, consider the Ingress
   and Egress SR Domains shown in Figure 1 as separate ASes.  The
   various ASes that provide connectivity between the Ingress and Egress
   Domains could be two as shown in Figure 1, or could be many more as is the
   case with the public internet, and each may be constructed differently
   and use different technologies such as MPLS IP VPN, SR-MPLS IP VPN, or SRv6
   IP VPN, with a “BGP Free” core.”

This may work without a “BGP Free” core, but to reduce design complexity I
suggest constraining the transport layer to a “BGP Free” core.  SR-TE path
steering also gets much more complicated if all P routers are running BGP.  I
think we can even say explicitly that this example shows the public internet,
as that would be one of the primary use cases.

This paragraph is confusing to the reader.

As a precursor to this paragraph, I think it may be a good idea to state
whether we are talking about global table IP-only routing or VPN overlay
technology with an SR/MPLS underlay transport.  That will make this section
much easier to understand.

In the Figure 1 drawing, you should give an AS number to both the ingress
domain and the egress domain, so that the reader does not have to assume
whether it is iBGP or eBGP connected to the egress or ingress domain, and
state eBGP in the text below.  Let’s also say that the intermediate ASes in the
middle of the diagram could be two, as shown illustratively, but could be many
operator domains, as in the case of traversing the public internet.

In the drawing I would replace ASBR with PE, because I am stating that this
solution has to be a VPN overlay paradigm and not global routing.  Also, in the
VPN overlay scenario, inter-AS peering of any type is almost always between
PEs and not between separate dedicated devices serving a special “ASBR-ASBR”
function, as the PE acts as the border node providing the “ASBR” type
function.  So in the rewrite I am assuming the drawing has been updated,
changing ASBR to PE.

Let’s give each node a number so that the text is clear about exactly which
node we are referring to.  In the drawing, please show that GW1 peers with PE1,
GW2 peers with PE2, and GW3 peers with PE3.  GW3 also peers with GW4, and GW2
peers with GW5, where GW4 and GW5 are part of AS3.  In the AS1-AS2 peering, the
top peering would be PE6 to PE8 and the bottom peering PE7 to PE9, so PE6 and
PE7 are in AS1 and PE8 and PE9 are in AS2.  I made the bottom two ASBRs in AS3
GW4 and GW5 for the selective deterministic load balancing used later in the
problem statement.

One major problem with this problem statement is that it is incorrect about GW
load balancing: it claims that load balancing does not work today in the
topology given in Figure 1.  Edge GW load balancing rests on the iBGP tie
breaker in BGP path selection, which is the lowest IGP underlay metric.  As
long as the metric is equal and iBGP multipath is enabled, you can load balance
to the egress PE1 and PE2 endpoints.  In this case, flows coming from AS1 into
AS2 hit an intermediate P router that has iBGP multipath enabled and has, let’s
say, equal cost for the route to the next hop attribute.  Assuming
next-hop-self is set, the cost to loopback0 on PE1 and the cost to loopback0 on
PE2 are both, let’s say, 10, so you now have a BGP multipath.

What is required, though, is that the RD be unique per PE.  In a “BGP Free”
core RR environment, all PEs are route-reflector clients that peer with the RR,
and for all the paths advertised to the RR to be reflected to all the egress PE
edges, the RD must be unique.

BGP add-paths is only used if you have Primary and Backup routing set up, where
PE1-GW1 has a 0x prepend and PE2-GW2 has a 1x prepend, so that BGP add-paths
together with BGP PIC Edge gives you a pre-programmed edge backup path.  So
add-paths does not necessarily help with load balancing and is in fact
orthogonal to it, as it is for Primary/Backup routing and not Active/Active
load balancing.  Load balancing with a VPN overlay is achieved simply with a
unique RD per PE, iBGP multipath, and equal cost paths to the underlay
recursive IGP-learned next hop attribute, which in this case is the PE
loopback0, per the next hop rewrite via “next-hop-self” done on the PE-RR
peering in a standard VPN overlay topology.

What I have stated about load balancing in the underlay is independent of
SR-TE.  However, with an SR-TE candidate path, the ECMP load-balancing spray to
the egress PEs and egress GW AS can also happen with a prefix-SID.
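
To illustrate the unique-RD and iBGP multipath argument above, here is a
minimal Python sketch.  It is a toy model, not router behaviour: the RD
values, the "PE1-lo0"/"PE2-lo0" names, and the functions reflect() and
multipath_next_hops() are all hypothetical, and real BGP best-path selection
has many more steps than the single IGP-metric comparison used here.

```python
from typing import NamedTuple


class VpnPath(NamedTuple):
    rd: str          # route distinguisher configured on the advertising PE
    prefix: str      # VPN prefix
    next_hop: str    # PE loopback0 after next-hop-self
    igp_metric: int  # IGP cost to next_hop (simplified to a single value)


def reflect(paths):
    """Toy RR without add-paths: keep one best path per NLRI (RD + prefix)."""
    best = {}
    for p in paths:
        key = (p.rd, p.prefix)
        if key not in best or p.igp_metric < best[key].igp_metric:
            best[key] = p
    return list(best.values())


def multipath_next_hops(paths, prefix):
    """Toy iBGP multipath: keep every path tied on the lowest IGP metric."""
    candidates = [p for p in paths if p.prefix == prefix]
    if not candidates:
        return []
    lowest = min(p.igp_metric for p in candidates)
    return sorted({p.next_hop for p in candidates if p.igp_metric == lowest})


# Same RD on both PEs: the RR sees one NLRI and reflects a single path.
shared_rd = [
    VpnPath("65002:1", "X", "PE1-lo0", 10),
    VpnPath("65002:1", "X", "PE2-lo0", 10),
]
# Unique RD per PE: two distinct NLRIs, so both paths are reflected.
unique_rd = [
    VpnPath("65002:101", "X", "PE1-lo0", 10),
    VpnPath("65002:102", "X", "PE2-lo0", 10),
]

print(multipath_next_hops(reflect(shared_rd), "X"))  # ['PE1-lo0']
print(multipath_next_hops(reflect(unique_rd), "X"))  # ['PE1-lo0', 'PE2-lo0']
```

With a shared RD only one next hop survives reflection, so there is nothing to
load balance across; with a unique RD per PE both equal-cost next hops remain.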

OLD
   Suppose that there are two gateways, GW1 and GW2 as shown in
   Figure 1, for a given egress SR domain and that they each advertise a
   route to prefix X which is located within the egress SR domain with
   each setting itself as next hop.  One might think that the GWs for X
   could be inferred from the routes' next hop fields, but typically it
   is not the case that both routes get distributed across the backbone:
   rather only the best route, as selected by BGP, is distributed.  This
   precludes load balancing flows across both GWs.

I am rewriting the text in NEW because there is some discrepancy about which
routes are distributed across the backbone.  I am completely rewriting it to
make clearer what we are trying to state, as the text appears to be technically
incorrect.  I will use the BGP route flow to depict the routing and to get to
the problem statement we are trying to portray.

NEW

   Suppose that there are two gateways, GW1 and GW2 as shown in
   Figure 1, for a given egress SR domain and each gateway advertises via EBGP
   a VPN prefix X to the AS2 core domain with underlay next hop set to GW1
   or GW2.  In this case we are Active/Active load balancing: PE1 and PE2
   receive the VPN prefix X and advertise it into the domain with
   next-hop-self set on the PE-RR peering to the PE’s loopback0.  The P routers
   within the domain have an ECMP path with an IGP metric tie to the egress PE1
   and egress PE2 for VPN prefix X learned from GW1 and GW2.  An SR-TE path can
   now be stitched from GW3 to PE3 SR-TE Segment-1 to PE3 to PE6 and PE7
   Segment-2 to PE8 and PE9 to the Egress Domain via PE1 and PE2 to GW1 and
   GW2.  In this case, however, we don’t want the traffic to be steered via
   SR-TE load balanced via ingress GW3; we want to take GW3 out of rotation
   and load balance traffic to GW4 and GW5 instead.

**Text above provides the updated selective deterministic gateway steering
described below to achieve the goal.  I think that may have been the intent of
the authors and I am just making it more clear**

As for the problem statement: GW load balancing can easily occur in the
underlay, as stated, so that is not the problem.

To my mind, the problem statement that we want to describe in both the
Abstract and Introduction is not plain gateway load balancing but rather a
predictable, deterministic method of selecting the gateways to be used.  That
is, each VPN prefix now has a descriptor attached, the tunnel encapsulation
attribute, which contains multiple TLVs, one or more for each “selected
gateway”, with each tunnel TLV containing an egress tunnel endpoint sub-TLV
that identifies the gateway for the tunnel.  Maybe the sub-TLV could carry a
priority field giving the order of preference in which GWs are pushed into the
GW hash selected for the SR-ERO path to be stitched end to end.  Say you had 10
GWs and broke them up into two or more tiers, with gateways 1-5 as primary and
6-10 as backup, which could be due to various reasons.  You could then pick and
choose, based on priority, which GWs get added to the GW hash.
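
A minimal Python sketch of the tiered selection suggested above follows.  The
priority field is only the suggestion made in this review, not something the
draft defines, and the Gateway type, the select_gateways() function, and the
"lower value is preferred" convention are hypothetical.

```python
from typing import NamedTuple


class Gateway(NamedTuple):
    name: str
    priority: int  # hypothetical sub-TLV field: lower value = preferred tier
    active: bool


def select_gateways(gateways):
    """Return the active gateways in the most preferred tier that has any."""
    active = [g for g in gateways if g.active]
    if not active:
        return []
    best_tier = min(g.priority for g in active)
    return [g.name for g in active if g.priority == best_tier]


# Ten gateways: GW1-GW5 are primary (tier 0), GW6-GW10 are backup (tier 1).
gws = [Gateway(f"GW{i}", 0 if i <= 5 else 1, True) for i in range(1, 11)]
print(select_gateways(gws))
# ['GW1', 'GW2', 'GW3', 'GW4', 'GW5']

# If every primary gateway fails, the backup tier is selected instead.
failed = [g._replace(active=False) if g.priority == 0 else g for g in gws]
print(select_gateways(failed))
# ['GW6', 'GW7', 'GW8', 'GW9', 'GW10']
```

The selected names are what would be added to the GW hash; a partial failure
inside the primary tier simply shrinks that tier rather than pulling in backups.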

I have some feedback and comments on the solution and how best to write the
verbiage to make it more clear to the reader.

Regarding the RT to attach for GW auto-discovery in the solution: with this
new RT we are essentially creating a new VPN RIB that holds prefixes from all
the selected gateways discovered from the tunnel encapsulation attribute
TLV.

What is really confusing in the text here is whether the tunnel encapsulation
attribute is attached to the underlay recursive route to the next hop
attribute or to the VPN overlay prefix.  The reason I think it is attached
to the VPN overlay prefix and not to the underlay next hop is this: how would
you otherwise create another transport RIB?  And if you are creating a new
transport RIB, there is already a draft by Kaliraj Vairavakkalai, or the BGP-LU
SAFI 4 labeled unicast that exists today, to advertise next hops between
domains for an end-to-end LSP load-balanced path.

https://tools.ietf.org/html/draft-kaliraj-idr-bgp-classful-transport-planes-07

IANA code point below
76      Classful-Transport SAFI
[draft-kaliraj-idr-bgp-classful-transport-planes-00]

Also, in line with CT, another option is BGP-LU SAFI 4 to import between
domains the loopbacks that are the next hop attribute to be advertised into
the core end-to-end LSP.  The BGP-LU SAFI RIB could be used for the GW next hop
advertisement between domains, so that there is visibility of all the egress PE
loopback0 addresses between domains.  You can then either stitch a segmented
LSP, like inter-AS option B with SR-TE stitching, and use next-hop-self PE-RR
next hop rewrite on each of the PEs within the internet domain, or you could
import all the PE loopbacks from all ingress and egress domains into the
internet domain, similar to inter-AS option C, to create an end-to-end LSP and
instantiate an end-to-end SR-TE path.

Maybe you could attach the RT and the tunnel encapsulation attribute (tunnel
TLV, endpoint sub-TLV) to the VPN overlay prefix.  I am not sure how that would
be beneficial, as the underlay steers the VPN overlay.

Maybe coupling the new VPN overlay GW RIB RT to the transport underlay CT
class RIB or BGP-LU RIB would have some benefit, but that would have to be
investigated, and I think it is out of scope of the goals of this draft.

I think we first have to figure out the authors’ goal and purpose for this
draft, and how the GW discovery should work in light of the CT class RIB
AFI/SAFI code point draft that exists today, as well as the BGP-LU option for
next hop advertisement within the internet domain.

Section 3 comments

      “Each GW is configured with an identifier for the SR domain.  That
      identifier is common across all GWs to the domain (i.e., the same
      identifier is used by all GWs to the same SR domain), and unique
      across all SR domains that are connected (i.e., across all GWs to
      all SR domains that are interconnected).

**No issues with the above**

      A route target ([RFC4360]) is attached to each GW's auto-discovery
      route and has its value set to the SR domain identifier.

**So here, if the RT is attached to the GW auto-discovery route, we need to
state whether that is the underlay route, and that the PE does a next-hop-self
rewrite of the eBGP link next hop toward the egress domain to its loopback0, so
that the GW next hop we are tracking across all the ingress and egress PE
domains is the egress and ingress PE loopback0.**

      Each GW constructs an import filtering rule to import any route
      that carries a route target with the same SR domain identifier
      that the GW itself uses.  This means that only these GWs will
      import those routes, and that all GWs to the same SR domain will
      import each other's routes and will learn (auto-discover) the
      current set of active GWs for the SR domain.”

**So if this is the case, and we are tracking the underlay RIB and attaching a
route target to all the ingress PE and P next hops, which are loopback0, then
this is literally identical to BGP-LU importing all the loopbacks between
domains, or to using CT class.  There is no need for this feature to use the
tunnel encapsulation attribute.  I am not following why you would not use the
BGP-LU or CT class RIB.**
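
For reference, the import rule quoted above from Section 3 can be sketched in a
few lines of Python.  This only models the filtering logic described in the
quoted text; the AutoDiscoveryRoute type, the discover_gateways() function, and
the "domain-A"/"domain-B" identifiers are hypothetical.

```python
from typing import NamedTuple


class AutoDiscoveryRoute(NamedTuple):
    gateway: str
    route_target: str  # set to the originating GW's SR domain identifier


def discover_gateways(local_domain_id, received_routes):
    """Import only routes whose route target equals the local SR domain id."""
    return sorted(
        r.gateway for r in received_routes if r.route_target == local_domain_id
    )


routes = [
    AutoDiscoveryRoute("GW1", "domain-A"),
    AutoDiscoveryRoute("GW2", "domain-A"),
    AutoDiscoveryRoute("GW3", "domain-B"),
]

# A gateway configured for domain-A learns the active GW set for that domain.
print(discover_gateways("domain-A", routes))  # ['GW1', 'GW2']
print(discover_gateways("domain-B", routes))  # ['GW3']
```

If GW2's auto-discovery route is withdrawn, it drops out of the received routes
and the discovered set for domain-A shrinks to GW1.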

   “To avoid the side effect of applying the Tunnel Encapsulation
   attribute to any packet that is addressed to the GW itself, the GW
   SHOULD use a different loopback address for packets intended for it.”

**I don’t understand this statement, as the next hop being tracked for the
gateway load balancing is the ingress and egress PE loopback0.  The subnet
between the GW and the PE is not advertised into the internet domain, because
we do next-hop-self on the PE-RR iBGP peering.**  Looking at it a second time,
I think what is meant here is a BGP-LU inter-AS option C style import of
loopbacks between domains.  So instead of importing loopback0, which carries
all packets on the GW device, you use a different loopback on GW1 so that it
does not carry the FEC of all BAU packets, similar to the "per-vrf TE" concept
used in RSVP-TE to VPN mapping.

   “As described in Section 1, each GW will include a Tunnel
   Encapsulation attribute with the GW encapsulation information for
   each of the SR domain's active GWs (including itself) in every route
   advertised externally to that SR domain.  As the current set of
   active GWs changes (due to the addition of a new GW or the failure/
   removal of an existing GW) each externally advertised route will be
   re-advertised with a new Tunnel Encapsulation attribute which
   reflects current set of active GWs.”

**What is the route being advertised externally from the GW?  The routes
advertised would be all the PE loopbacks, advertised from both the ingress and
egress domains into the internet domain, and all loopbacks from the internet
domain into the ingress and egress domains.  That could be done via BGP-LU or
the CT RIB, so there is no need to reinvent the wheel and create a new RIB.
BGP-LU or the CT RIB tracks the current set of active next hop GW loopbacks
between domains.**  If you do SR-TE stitching, then you can do next-hop-self
on each PE-RR peering for the load balancing; that would work, and the load
balancing would be to the PE loopbacks.  If instead it is an end-to-end SR-TE
path using BGP-LU or the CT RIB, importing all the PE loopbacks between
domains, then the current set of active GWs would be tracked via the BGP-LU or
CT RIB.  If the active GWs change due to GW failures, they would be withdrawn
from the BGP-LU or CT underlay RIB.  There is then no need for the tunnel
encapsulation attribute, at least for the GW auto-discovery load balancing.
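
For comparison, the behaviour in the quoted draft text (every externally
advertised route carries one tunnel entry per active GW and is re-advertised
when the set changes) can be sketched as below.  This is an illustration only:
the dictionary layout is not the wire encoding of the Tunnel Encapsulation
attribute, and build_tunnel_encap(), advertise(), and endpoints() are
hypothetical names.

```python
def build_tunnel_encap(active_gateways):
    """One tunnel TLV per active GW, each naming that GW as egress endpoint."""
    return [
        {"tunnel_tlv": {"egress_endpoint_sub_tlv": gw}}
        for gw in sorted(active_gateways)
    ]


def advertise(prefixes, active_gateways):
    """Attach the same attribute to every externally advertised prefix."""
    attribute = build_tunnel_encap(active_gateways)
    return {prefix: attribute for prefix in prefixes}


def endpoints(advertisement, prefix):
    """List the gateways named in the attribute carried by one prefix."""
    return [
        tlv["tunnel_tlv"]["egress_endpoint_sub_tlv"]
        for tlv in advertisement[prefix]
    ]


prefixes = ["X", "Y"]
before = advertise(prefixes, {"GW1", "GW2"})
after = advertise(prefixes, {"GW1"})  # GW2 has failed: re-advertise all routes

print(endpoints(before, "X"))  # ['GW1', 'GW2']
print(endpoints(after, "X"))   # ['GW1']
```

Note that a single GW failure causes every prefix to be re-advertised, whereas
in the BGP-LU or CT RIB approach only the failed GW's loopback is withdrawn.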

I think it may still be possible to retrofit this draft to use the CT RIB or
BGP-LU for the GW load balancing, so that nothing new has to be designed as far
as the underlay goes.  However, there may be some benefit in giving the
underlay visibility into the VPN overlay route, by using the tunnel
encapsulation attribute and RT import policy attached to the VPN overlay
prefixes.

As the CT draft provides a complete solution for underpinning the VPN overlay,
per VPN or per prefix, to the underlay CT RIB, the problem statement is
completely solved by either the CT draft or BGP-LU.

Minor issues:
None

Nits/editorial comments:

Please add normative and informative references below.

I would reference as normative, or maybe informative, the CT class draft,
which creates a new transport class.  I think this draft can work really well
in conjunction with CT class, coupling the GW RIB created here to the CT class
transport RIB and providing the end-to-end inter-AS stitching via the PCE CC
controller.  I am one of the co-authors of the CT draft, and I think it could
be coupled with this GW draft to meet the overall goal of selective GW load
balancing.

https://tools.ietf.org/html/draft-kaliraj-idr-bgp-classful-transport-planes-07

I would also reference this draft for CT class PCEP coloring extension.

https://tools.ietf.org/html/draft-rajagopalan-pcep-rsvp-color-00

As this solution would use a centralized PCE CC controller for inter-AS path
instantiation for the GW load balancing, I think it would be a good idea to
reference PCE CC, H-PCE, Inter-AS PCE, and the PCE SR extension as informative
and maybe even normative references.