Re: [GROW] I-D Action: draft-ietf-grow-diverse-bgp-path-dist-05.txt

"George, Wesley" <wesley.george@twcable.com> Thu, 22 September 2011 16:39 UTC

From: "George, Wesley" <wesley.george@twcable.com>
To: "robert@raszuk.net" <robert@raszuk.net>
Date: Thu, 22 Sep 2011 12:42:16 -0400
Message-ID: <34E4F50CAFA10349A41E0756550084FB0F8D19E4@PRVPEXVS04.corp.twcable.com>
References: <20110915135818.19974.94670.idtracker@ietfa.amsl.com> <34E4F50CAFA10349A41E0756550084FB0F8D14C1@PRVPEXVS04.corp.twcable.com> <4E7A5617.2080900@raszuk.net>
In-Reply-To: <4E7A5617.2080900@raszuk.net>
Cc: "grow@ietf.org" <grow@ietf.org>
Subject: Re: [GROW] I-D Action: draft-ietf-grow-diverse-bgp-path-dist-05.txt
Precedence: list

-----Original Message-----
From: Robert Raszuk [mailto:robert@raszuk.net]
Sent: Wednesday, September 21, 2011 5:25 PM
To: George, Wesley
Cc: grow@ietf.org
Subject: Re: [GROW] I-D Action: draft-ietf-grow-diverse-bgp-path-dist-05.txt

Hello Wes,

> Other stuff: 2.1 - when discussing overhead and scale concerns for
> add paths, perhaps a citation to 4984 would be appropriate?

I would prefer not to mix the growing Internet scale concerns with some
of the operational practices/configuration based scale concerns.

WEG] Understand, but I'm not sure that it's so easy to separate the two. You'll find me saying the same thing to anyone suggesting a change that has the net effect of significantly increasing the burn rate for memory and CPU resources, whether it's a configuration change or otherwise, because it still exacerbates the overall issue. (more on that in a moment)

> I've made
> similar comments to the SIDR folks, and I think generally anything
> that adds a non-trivial amount of impact to the growth curve of the
> routing system needs to consider this.

I think there is a substantial difference between local and global size
increases of the routing system. In this work, all concerns are
about the local one.

WEG] Generally, I'm not sure that I'd make so much of a distinction. While yes, in theory changes of this type only impact the ASN that chooses to implement them, rather than what it announces to the outside world, the global scaling problem is due to the intersection between available resources, their growth curve, and the growth curve of the routing table. Saying that it is only a concern if it contributes to the size of the DFZ routing table is oversimplifying the root problem, because if internal scale problems exhaust the resources available for both internal and external routes, you still have the same end state - out of resources. In that case, the only difference between a local scaling problem and a global problem is the deployment penetration. If this is widely deployed, it has now steepened the growth curve noted in RFC 4984, because it is still using some of the overall available resources. I've said on more than one occasion that the iBGP routes carried by an SP are as much or more of a problem than the growth of the global table because they don't have nearly as much of the aggregation and optimization to reduce their footprint. The only difference is the level of administrative control over growth, but that's a fairly limited knob to turn - for lots of reasons it may not be any more feasible to change things internally to reduce internal route growth than it is to change global route growth.
Besides, I think that your draft is trying to have it both ways - you malign Add Paths for having scaling problems, and then seem content to gloss over a very similar problem created by your solution simply because it appears to be slightly less severe and more localized.

> 4. This asserts that no code changes are necessary to RR clients. I'm
> not sure I totally agree with that... If the idea is to have a
> primary (best) RR and then N additional paths, the general assumption
> is that the N, N1, ... RRs are carrying routes that are less and less
> preferred. How does this system avoid the same sort of inconsistency
> of best path choice among different routers in the network if there
> is no way to identify those paths as secondary? I think you need some
> way to determine if the alternate routes are intended to be ECMP
> routes or backup routes... You may be able to cover this without code
> changes by using alternate configurations of other BGP preference
> indicators (MED, Localpref, metric, etc), perhaps with inbound route
> policy on the client or outbound on the RR, but since things like
> metric may be different based on where something is in the network,
> that may lead to inconsistency if used by itself. Even then, the
> draft doesn't discuss how this should be managed.

I stand by the claim that no code change is needed on clients. Moreover,
not even an additional policy change is required.

The best way to illustrate this is to compare the presence of additional BGP
paths on the clients in the scenario where clients are interconnected
with a full IBGP mesh or would get all paths with add-path. In neither
case is there a notion of the RR telling the client which path is best or which
is second best .. and there are a number of good reasons for that (one is
that the path numbering can be different for the RR than for the client, the other
one is that when we withdraw any path that was advertised and ordered, we
would need to re-advertise all remaining paths with the new order - that
amount of churn is non-negligible).
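
The churn argument above can be put in numbers with a minimal sketch. It assumes a purely hypothetical scheme in which the RR tags every advertised path with an explicit rank (1 is best); neither the draft nor add-path defines such a rank, and the function name and parameters are illustrative only.

```python
def updates_on_withdraw(num_paths: int, withdrawn_rank: int, ranked: bool) -> int:
    """Count messages an RR sends to one client when one path is withdrawn.

    Hypothetical model: paths carry ranks 1..num_paths, 1 being best.
    Without ranking, only the withdraw itself is sent. With ranking, every
    path ranked below the withdrawn one moves up a slot and has to be
    re-advertised with its new rank.
    """
    if not 1 <= withdrawn_rank <= num_paths:
        raise ValueError("withdrawn_rank must be within 1..num_paths")
    withdraw = 1
    if not ranked:
        return withdraw
    reordered = num_paths - withdrawn_rank  # paths that shift up by one
    return withdraw + reordered
```

Under these assumptions, withdrawing the best of four ranked paths costs four messages per prefix per client, against a single withdraw when paths are unordered.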

Each client's BGP best path selection is capable of making a safe (loop-free),
autonomous choice of paths in PIC/fast connectivity restoration/iBGP
multipath cases.

WEG] I'm sorry, maybe I'm being thick, but I still don't understand how this would work in a way that would always avoid routing loops. Under normal conditions, you have an RR reflecting its best path to its clients based on the routes it receives from the rest of its neighbors, meaning that the clients don't have visibility into the candidate alternatives that the RR does, so they're all making the same choice, at least within the local cone of influence of that RR.
You add a second set of RRs (RR') that is announcing a second-best path as if it were the best path to restore 1 (or more) of the candidate alternatives to the client. The client receives the best and 2nd-best path and evaluates them using standard methods. If the thing that makes one route better than the other is something locally interesting like metric, and the client's particular place in the universe means that the metric is different as compared to other clients, the P routers, and the RRs, it may choose the 2nd-best path as best, and this may lead to routing loops if it tries to send the route to another router that has a different belief of what the best path is. This case is much more likely if the RR and RR' are not collocated with all of their clients and/or each other. I think that this may also be the case when the tiebreaker is router-id if you're not careful about the way that you address your route-reflectors and/or are not doing next-hop self at the edges.
Only in the case where the 2nd-best path is clearly worse to all members of the ASN (lower local pref, longer AS-path, etc) are you assured of no possibility for two routers each getting a different result when evaluating those two different routes. I think that 4.2 covers some part of this case, in the way that it documents its assumptions and what must be done to enable deployment, especially the references to ignoring IGP metric, but IMO it's not clear enough in the explanation why some of these things must be done - the failure case isn't discussed.
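
The divergence described above can be sketched with a deliberately simplified decision process. This is an illustration under stated assumptions, not the full BGP best-path algorithm: only local-pref, AS-path length, and IGP cost to the next hop are compared, and the router names, costs, and attribute values are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    next_hop: str      # egress router (hypothetical names)
    local_pref: int
    as_path_len: int


def best_path(paths, igp_cost):
    """Simplified selection: highest local-pref, then shortest AS path,
    then lowest IGP cost from this router to the path's next hop."""
    return min(paths, key=lambda p: (-p.local_pref, p.as_path_len, igp_cost[p.next_hop]))


def all_agree(paths, views):
    """True if every router, given its own IGP view, picks the same exit."""
    return len({best_path(paths, cost).next_hop for cost in views.values()}) == 1


# Each client sees different IGP distances to the two exits.
views = {
    "clientA": {"PE1": 10, "PE2": 20},
    "clientB": {"PE1": 30, "PE2": 5},
}

# Paths tied on network-wide attributes: the IGP metric decides, so clients diverge.
tied = [Path("PE1", 100, 3), Path("PE2", 100, 3)]
# One path clearly worse for everyone (lower local-pref): all clients agree.
clear = [Path("PE1", 200, 3), Path("PE2", 100, 3)]

print(all_agree(tied, views))
print(all_agree(clear, views))
```

Disagreement between routers is only the precondition for a loop. Whether one actually forms depends on how traffic is forwarded between them (for example, hop-by-hop lookup versus tunneling to the exit), which this sketch does not model.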

> 4.1 Also, there's a definite scaling consideration on the RR clients
> that isn't really discussed here - they are now going to be storing
> some number of additional routes and paths that is linearly related
> to the number of additional planes that are implemented. The addition
> of more RR sessions that presumably carry a portion of the full
> routing table now drives a non-trivial increase in memory footprint
> and processing overhead (and potentially convergence time for slower
> boxes). In the simplest case of 2 primary route-reflectors (for
> diversity), and 1 2nd-best path RR, you've added one session. If you
> want to carry a 3rd-best RR or have redundant 2nd-best RRs, you've
> added 4 sessions. It's fair to say that after a certain number of
> alternate paths, you start having less routes because there are only
> so many alternative exits, but otherwise there is a potentially large
> problem even if it's not quite as bad as addpaths. I might recommend
> that you do some analysis of the routing table to know where this
> threshold makes a difference, based on how many alternate paths an
> average route carries. In addition to being a scaling consideration,
> it also helps to inform what value of N becomes diminishing returns
> because most networks don't have that many backup paths. I envision
> this being something like "80% of routes have 4 or fewer paths, so
> moving beyond 4 planes may add overhead without much benefit..."

It is absolutely correct to say that the more paths a client carries, the more
CPU cycles and memory will be used to process and store them.

However, there is one observation to be made: in 99% of the cases I have
seen for distributing more than the best path intra-domain, the sufficient
number of paths per net on each client is 2.

WEG] the document should explicitly state this. That's exactly what I was getting at when I mentioned analysis above. If nearly all applications only need one alternate to bring the total paths to two, and more would be diminishing returns, the document should recommend this, and note that more are possible if the operator's situation dictates by simply repeating the deployment more times. I will note that this guidance as well as the note at the end of 4.2 that "The additional planes of route reflectors do not need to be fully redundant as the primary one does" contradicts your example because it has both RR1' and RR2'.
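
The kind of routing table analysis suggested here could look roughly like the following sketch. It assumes the operator has already extracted, for each prefix, the number of distinct paths available (for instance from an RR's table); the function name and the sample data are made up for illustration and are not measurements.

```python
from collections import Counter


def smallest_n_covering(paths_per_prefix, target=0.8):
    """Return the smallest N such that at least `target` (a fraction in
    (0, 1]) of the prefixes have N or fewer paths."""
    if not paths_per_prefix:
        raise ValueError("need at least one prefix")
    if not 0 < target <= 1:
        raise ValueError("target must be in (0, 1]")
    counts = Counter(paths_per_prefix)
    total = len(paths_per_prefix)
    covered = 0
    for n in sorted(counts):
        covered += counts[n]
        if covered / total >= target:
            return n


# Made-up sample: number of available paths for each of ten prefixes.
sample = [1, 1, 2, 2, 2, 2, 3, 4, 6, 8]
print(smallest_n_covering(sample, target=0.8))
```

With this sample, 80% of the prefixes have 4 or fewer paths, so planes beyond the 4th would only help the remaining 20%. Running the same calculation on real data would show where the diminishing-returns threshold sits for a given network.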

IMHO the cost of bringing additional paths to the control plane is quite well
understood today. Moreover, it is quite implementation dependent. One
implementation may use X bytes per path while another uses Y bytes to
store the same path. I think a separate BGP scaling document (even as
a BCP) may be equally useful for any technique to advertise more than the best
path. I would prefer to keep this outside of the solutions work on how
to advertise and distribute those additional paths.

WEG] I'm not looking for a level of detail that requires you to discuss the number of bytes per path. Simply noting that scaling issues exist and their general categories is enough. Make the logical leap for your reader that implementing this solution brings with it the scaling problems inherent with adding an additional route reflector (and therefore its additional routes and paths).
> It may be appropriate to add a separate scaling considerations
> discussion to your deployment considerations (section 6) to discuss
> some of the above.

I agree 100% .. but as stated above I do not find this specific to
diverse-path. It seems a general issue and I would highly encourage
someone to take a stab at documenting this in IETF/IDR/GROW or maybe in
a NANOG community repository.

WEG] it may not be specific to diverse-path, but diverse-path is specifically advocating doing something that would otherwise not be done (adding additional RR<->client BGP peers w/full routes beyond what is necessary for simple RR redundancy). Therefore I still think that you need to discuss the specific scaling concerns that this implementation needs to consider, even if it's at a relatively high level and the document notes that these are not unique to this implementation. I agree that a general scaling considerations document may be appropriate, but since that does not exist and I don't want this document to be blocked awaiting completion of such, a brief discussion within this document would help a lot.
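
A high-level version of such a scaling discussion can be reduced to a back-of-the-envelope model. The sketch below assumes, as a worst case, that every RR session delivers one path for every prefix; in practice higher planes carry fewer routes, since not every prefix has an Nth-best path. The function, its parameters, and the prefix count used in the example are illustrative assumptions, not figures from the draft.

```python
def client_load(prefixes, primary_rrs=2, extra_planes=1, rrs_per_extra_plane=1):
    """Return (sessions, upper bound on received paths) for one RR client.

    Worst-case assumption: each RR session carries one path per prefix.
    """
    sessions = primary_rrs + extra_planes * rrs_per_extra_plane
    max_paths = sessions * prefixes
    return sessions, max_paths


# Hypothetical table of 400,000 prefixes.
print(client_load(400_000))                         # 2 primary RRs + 1 second-best RR
print(client_load(400_000, rrs_per_extra_plane=2))  # redundant second-best RRs
```

The point of the model is only that both session count and stored paths grow linearly with the number of planes and with the redundancy chosen within each plane.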

> There may be additional operational considerations from the
> perspective of route analysis - if you have either a homebuilt or off
> the shelf set of software that does route analysis for the purpose of
> event root-cause analysis, anomaly detection, capacity
> planning/failure analysis, etc, it has to be aware of these
> additional planes such that it returns the proper response when
> evaluating the routing table to determine what the expected behavior
> should be in the real network. This is especially important when it
> uses the table to determine how traffic will reroute during different
> failure scenarios. These tools may act like a participant in the mesh
> rather than a client in order to get a pure view of the table, and
> that may lead to undesired results if the multiple planes aren't
> taken into account. There may also be considerations for looking
> glass implementations and the actual information that is visible on
> the RRs and RR clients as the result of standard BGP show commands
> to aid in troubleshooting and verification.

Very good point. Two comments on this ..

- As to the impact on the tools, I am less worried, as the presence of
additional paths can already be a fact today, as mentioned, with a full mesh
or as used by some operators who adjust the weight
values of a pair of RRs on a per-net basis.

WEG] sure, but I don't think that it's valid to assume that all analysis tools have taken this into account in their implementation, so it's worth mentioning as an operational consideration. The comment may be helpful to characterize the level of potential impact.

- The use of "planes" in the draft is more of a conceptual nature. In
practice all paths are still kept in a single table where the normal best
path is calculated. That means that tools like looking glass should not
observe any change or impact.

WEG] a good clarification to add to the document.

