Re: [Idr] draft-uttaro-idr-bgp-persistence-00

"UTTARO, JAMES" <> Tue, 01 November 2011 15:20 UTC

From: "UTTARO, JAMES" <>
To: "'Enke Chen'" <>
Date: Tue, 1 Nov 2011 15:19:44 +0000
Cc: " List" <>, "" <>
Subject: Re: [Idr] draft-uttaro-idr-bgp-persistence-00


                Comments inline.

                Jim Uttaro

From: Enke Chen []
Sent: Friday, October 28, 2011 2:19 AM
Cc:; List; Enke Chen
Subject: Re: [Idr] draft-uttaro-idr-bgp-persistence-00


My comments are inlined.

On 10/27/11 1:17 PM, UTTARO, JAMES wrote:


GR is a solution that is essentially local in scope; it does not have the ability to inform downstream speakers of the viability of routing state from the point of possible control plane failure. OTOH, Persistence does propagate the condition of state. This provides distinct advantages in terms of customers' awareness of the SP's control plane. One could imagine a customer receiving a STALE path and responding by selecting a backup. Some of the extensions to this draft that I have considered include colouring of STALE to indicate whether the condition arises from a local (PE) or an internal iBGP (RR) failure.

GR makes no distinction between STALE state and ACTIVE state. This can lead to the STALE path still being preferred throughout the topology. IMO this is incorrect behavior regardless of the comparison.

PERSISTENCE allows a customer to indicate which paths should be candidates. Customers may want to fail over immediately to the backup for some paths and not for others. GR is not capable of doing this; it is all or nothing. The granularity is not sufficient. It needs to be at the path level. There may even be a case for having even more granularity, i.e., a per-path timer. GR is not capable of being extended for either of these cases.

I am not sure how this path level persistence would work operationally.   Without detailed information about a provider's network, how would a customer know what kind of failures and recovery they might experience?   Consider the example of the simultaneous RR failures in the draft: why wouldn't any customer want to protect against such failures?   The end result could be that the PERSISTENCE flag is always set, thus losing its significance.
[Jim U>] One example would be customers who create multiple VPNs over different SPs. A customer may want to take advantage of the knowledge that a control plane failure has occurred and migrate the traffic to the backup. This could be done at a path granularity by use of the DO_NOT_PERSIST CV. We as SPs want to provide our customers with the tools needed to manage their VPNs and not prescribe a one-size-fits-all solution.

Regarding the use of the STALE state vs ACTIVE state, clearly there is a tradeoff.   GR uses the stale routes in order to avoid forwarding churn, which has been a critical requirement for a long time.   If there is a real need for favoring an ACTIVE one over a STALE one in GR, it can be done by a simple knob.
[Jim U>] The current draft has no ability to inform downstream speakers of whether or not a path is STALE or ACTIVE. The knob may be simple but a lot of machinery would have to be built. This is one of the big reasons for the PERSIST draft. I do not understand the routing churn part in the context of vpnv4, vpnv2, 3107 etc... maybe the GR solution was constructed as a solution that primarily speaks to eBGP IPV4 connections for the IPV4 AF ( Internet ).. I could understand that..

As you know, BGP is full of knobs that adjust behaviors for different needs :-)
[Jim U>] More Knobs..

GR does not provide protection through successive restarts of the session. I believe that if this occurs the state will be invalidated. So for a session that is bouncing due to an overload condition, GR will not provide the required protection.

This can be addressed by a simple knob to set the min stale timer for GR.
[Jim U>] And yet more knobs

GR does not employ a make-before-break strategy. All state is invalidated first, then the newly learned state is processed. This leads to routing churn, especially if the majority of the state is the same, which I am pretty sure is the case.

Such behavior would be an implementation bug that needs to be fixed.  But it is not an issue with the protocol itself.

This is what we have in 4.2. Procedures for the Receiving Speaker, RFC 4724:


   The Receiving Speaker MUST replace the stale routes by the routing
   updates received from the peer.  Once the End-of-RIB marker for an
   address family is received from the peer, it MUST immediately remove
   any routes from the peer that are still marked as stale for that
   address family.
[Jim U>] This does not address the lack of clarity about make before break. It only states that the speaker must immediately remove routes still marked as stale. It should state that any paths that are learned which are the same as the STALE paths should not force the forwarding plane to be re-programmed for those paths. This should be made clear, and in general it is good practice to avoid churn.
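To make the make-before-break point concrete, here is a minimal Python sketch of a receiving speaker that retains routes as stale, replaces them as updates arrive, and purges the remainder on End-of-RIB, as in the RFC 4724 text quoted above. The class `PeerRoutes`, its method names, and the `fib_writes` log are hypothetical names used only for illustration; they do not come from RFC 4724 or from any implementation. The assumption being illustrated is that a re-learned path identical to the stale one should cause no forwarding-plane write.

```python
class PeerRoutes:
    """Routes learned from one peer for one address family (hypothetical model)."""

    def __init__(self):
        self.routes = {}      # prefix -> {"attrs": ..., "stale": bool}
        self.fib_writes = []  # log of forwarding-plane operations

    def learn(self, prefix, attrs):
        old = self.routes.get(prefix)
        # Make before break: touch the forwarding plane only if the path changed.
        if old is None or old["attrs"] != attrs:
            self.fib_writes.append(("program", prefix))
        self.routes[prefix] = {"attrs": attrs, "stale": False}

    def session_restarted(self):
        # Retain every route but mark it stale; forwarding state is untouched.
        for entry in self.routes.values():
            entry["stale"] = True

    def end_of_rib(self):
        # Remove whatever is still marked stale once End-of-RIB arrives.
        for prefix in [p for p, e in self.routes.items() if e["stale"]]:
            del self.routes[prefix]
            self.fib_writes.append(("withdraw", prefix))


rib = PeerRoutes()
rib.learn("10.0.0.0/24", "nh=PE1")
rib.learn("10.0.1.0/24", "nh=PE1")
rib.fib_writes.clear()               # ignore the initial programming

rib.session_restarted()
rib.learn("10.0.0.0/24", "nh=PE1")   # identical path re-learned after restart
rib.end_of_rib()
print(rib.fib_writes)                # [('withdraw', '10.0.1.0/24')]
```

In this sketch the only forwarding-plane operation after the restart is the withdrawal of the prefix that was never re-advertised; the unchanged prefix is not re-programmed.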

There are several possibilities for the premature purge of the stale routes. For example, the "Forwarding state" flag was somehow not set after the session was re-established, or the EOR was sent prematurely.   Further investigation will be needed in order to identify any possible implementation or config issues involved in your setup.
[Jim U>] More moving parts to worry about..

GR invalidates state in the case of a protocol error, i.e., a malformed update will invalidate all of the state. This is not the desired behavior.

It has been addressed by the following extension:

[Jim U>] A few comments here.. I do not understand, the draft does not clarify that the only thing that will force a tear down is the cease subcode and a hard reset error code.. is the intention that this is the only thing that will tear it down? I guess I would like to see which things will and will not force a session termination in the original draft.. Like

-          Holdtime Expiration

-          Malformed Update

-          Consecutive Restarts.. So what does this exactly mean   "As part of this extension, possible consecutive restarts SHOULD NOT
   delete a route (from the peer) previously marked as stale, until
   required by rules mentioned in [RFC4724]." Possible consecutive restarts means what? I really need clarity on this whole notion of when is a session truly invalidated.

What is the purpose of the following text?

   Once the session is re-established, both BGP speakers MUST set their
   "Forwarding State" bit to 1 if they want to apply planned graceful
   restart.  The handling of the "Forwarding State" bit should be done
   as specified by the procedures of the Receiving speaker in [RFC4724]
   are applied.

GR is not specific as to which events invoke it or not. From my read on the draft it is not clear if holdtime expiration invokes GR or not.. The draft is unclear.

I think that it is covered by the above extension.  If not, it should be clarified.
[Jim U>] I did not see it..

It is not clear to me how RRs and PEs differ in using GR.

I think that there is a main difference when a RR is not in the forwarding path.  In that case, the RR should always set the F bit in the GR Capability so that its clients will continue forwarding after they lose the sessions with RR.  It is a deployment issue, though.
[Jim U>] Yes.. Again from an operations perspective I have to deploy technology differently in different parts of the network across multiple vendors. This is generally not a desired starting point for the successful deployment of new technology.. I want solutions that are generic and simple to deploy.

The time that state can persist is limited to about 1 hour max.

I think that you are talking about the "Restart time" field, which has 12 bits and amounts to about 68 minutes.  The "Restart time" is for the session re-establishment.  It does not impact the duration for holding stale routes after the session is re-established.
[Jim U>]  But if the session does not become re-established then the state is invalidated as the session terminates with an error code that GR will not persist through..

If the session does not get re-established in 68 minutes, the stale routes would be purged.  That is a long time, isn't it?   However, if one really wants to extend the session re-establishment time and continue to hold stale routes, it can be done by a simple knob.
[Jim U>] And yet even more knobs
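For reference, the 68-minute figure in the exchange above follows from the width of the Restart Time field: 12 bits, counted in seconds. A short Python check of that arithmetic (the variable names are illustrative only):

```python
RESTART_TIME_BITS = 12                                 # width of the Restart Time field

max_restart_seconds = (1 << RESTART_TIME_BITS) - 1     # largest 12-bit value
max_restart_minutes = max_restart_seconds / 60

print(max_restart_seconds, max_restart_minutes)        # 4095 68.25
```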

GR does detail the behavior where convergence is not achieved between restarts.. Similar to above..

The min stale timer knob can cover it (see above).

But did you mean "does not"?  We can certainly clarify in 4724bis if that is the case.
[Jim U>] If convergence is not achieved, what is the behavior? I could not determine it from the draft.

I do not believe that the current GR paradigm can be extended to cover the majority of the cases above.

Except for the path level persistence you mentioned, I believe the GR will be able to address all other persistence requirements you listed, with some simple knobs and some implementation enhancements.
[Jim U>] IMO GR was originally designed to prevent churn due to intermittent failure on an eBGP session for the IPv4 AF. I do not want to have different knobs and implementation enhancements to solve the basics of persistence. Regardless of that, it does not inform the topology of the state of a path in re the control plane it was learned over, so there can be no independent decisions about the value of a given path by different customers/providers. This is required for my applications.


        Jim Uttaro

Thanks.   -- Enke

-----Original Message-----
From: Enke Chen []
Sent: Wednesday, October 26, 2011 8:43 PM
Cc:<>;<> List; Enke Chen
Subject: Re: [Idr] draft-uttaro-idr-bgp-persistence-00

Hi, folks:

I have a hard time understanding what new problems (beyond GR) the draft tries to solve :-(

If the concern is about the simultaneous RR failure as shown in the examples in Sect. 6 Application, that can be addressed easily using GR. As the RRs are not in the forwarding path, it means that the forwarding is not impacted (thus is preserved) during the restart of an RR.   The Forwarding State bit (F) in the GR capability should always be set by the RR when it is not in the forwarding path.

Also in the case of simultaneous RR failure, I do not see why one would want to retain some routes, but not others, using the communities specified in the draft.  As the RRs are not in the forwarding path, wouldn't it be better to retain all the routes on a PE/client?

As you might be aware, efforts have been underway to address issues with GR found during implementation and deployment. They include the spec respin, notification handling, and implementations.  If there are issues in the GR area that are not adequately addressed,  I suggest that we try to address them in the GR respin if possible, instead of creating another variation unnecessarily.

Thanks.   -- Enke

On 10/26/11 10:24 AM, Robert Raszuk wrote:


When, during the design phase of a routing protocol or of an extension or modification to it, one already realizes that enabling such a feature may cause real network issues if not done carefully, that should trigger the alarm to rethink the solution and explore alternative approaches to the problem space.

We as operators already have a hard time relating the enabling of a feature within our intradomain boundaries to making sure such a rollout is network wide. Here you are asking for the same level of awareness across ebgp boundaries. This is practically unrealistic IMHO.

Back to the proposal ... I think that if anything needs to be done, it is to employ per-prefix GR with a longer and locally configurable timer. That would address information persistence across direct IBGP sessions.

On the RRs use case of this draft we may perhaps agree to disagree, but I do not see a large enough probability of a correctly engineered RR plane experiencing simultaneous multiple ibgp session drops. If that happens the RR placement, platforms or deployment model should be


Summary .. I do not think that IDR WG should adopt this document. Just adding a warning to the deployment section is not sufficient.

Best regards,



The introduction of this technology needs to be carefully evaluated when being deployed into the network. Your example clearly calls out how a series of independent designs can culminate in incorrect behavior. Certainly the deployment of persistence on a router that has interaction with a router that does not needs to be clearly understood by the network designer. The goal of this draft is to provide a fairly sophisticated tool that will protect the majority of customers in the event of a catastrophic failure. The premise being the perfect is not the enemy of the good. I will add text in the deployment considerations section to better articulate that.

Thanks, Jim Uttaro

-----Original Message-----
From:<> [] On Behalf Of Robert Raszuk
Sent: Sunday, October 23, 2011 5:32 PM
To:<> List
Subject: [Idr]



Actually when discussing this draft a new concern surfaced which I would like to get your answer on.

The draft in section 4.2 says as one of the forwarding rules:

o  Forwarding to a "stale" route is only used if there are no other
   paths available to that route.  In other words an active path always
   wins regardless of path selection.  "Stale" state is always
   considered to be less preferred when compared with an active path.

In the light of the above rule let's consider a very simple case of a dual-PE-attached site of an L3VPN service. Two CEs would inject into their IBGP mesh routes to the remote destination: one marked as STALE and one not marked at all. (Each CE is connected to a different PE, and each PE RT imports only a single route to a remote hub headquarters to support geographic load balancing.)

Let me illustrate:

        VPN Customer HUB

        PE3      PE4
             SP
        PE1      PE2
         |        |
         |        |
        CE1      CE2
         |        |
        1|        |10
         |        |
        R1 ------ R2
             1

CE1, CE2, R1 and R2 are in an IBGP mesh. The IGP metrics of CE1-R1 and R1-R2 are 1, and that of R2-CE2 is 10.

Prefix X is advertised by the remote hub in the given VPN such that PE1's vrf towards CE1 only has X via PE3 and PE2's vrf towards CE2 only has X via PE4.

Let's assume the EBGP session from PE3 to HUB went down, but the ethernet link is up and the next hop is in the RIB while the data plane is gone. Assume no real data plane validation either. /* That is why in my former message I suggested that data plane validation would be necessary */.

R1 has X via PE1/S (stale) and X via PE2/A (active) - it understands STALE, so it selects the path via CE2 in its forwarding table.

R2 has X via PE1/S (stale) and X via PE2/A (active) - it does not understand STALE, was never upgraded to support the forwarding rule stated above in the draft, and chooses X via CE1 (NH metric 2 vs 10).

R1--R2 produce a data plane loop as long as STALE paths are present in the system. Quite fun to troubleshoot too, as the issue of PE3 injecting such STALE paths may be on the opposite side of the world.
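The loop in this example can be reproduced with a small Python sketch. It models only the customer site (CE1, R1, R2, CE2) with the IGP metrics given above, treats CE1 as the exit for the stale path and CE2 as the exit for the active one, and lets each router pick its exit by lowest IGP cost, with or without the "active always wins" rule. The function names, the data layout, and the simplification of BGP best-path selection to "closest exit" are assumptions made for illustration, not something specified in the draft.

```python
import heapq

# IGP link metrics inside the customer site, taken from the example.
LINKS = {("CE1", "R1"): 1, ("R1", "R2"): 1, ("R2", "CE2"): 10}

# One path to prefix X per exit CE: stale via CE1, active via CE2.
PATHS = [{"exit": "CE1", "stale": True}, {"exit": "CE2", "stale": False}]


def neighbors(node):
    for (a, b), cost in LINKS.items():
        if a == node:
            yield b, cost
        elif b == node:
            yield a, cost


def shortest(src, dst):
    """Return (cost, first_hop) of the IGP shortest path from src to dst."""
    heap = [(0, src, "")]
    seen = set()
    while heap:
        cost, node, first = heapq.heappop(heap)
        if node in seen:
            continue
        seen.add(node)
        if node == dst:
            return cost, first
        for nxt, weight in neighbors(node):
            if nxt not in seen:
                heapq.heappush(heap, (cost + weight, nxt, first or nxt))
    raise ValueError("unreachable")


def best_exit(router, understands_stale):
    candidates = PATHS
    if understands_stale:
        # Forwarding rule from the draft: an active path always wins.
        candidates = [p for p in PATHS if not p["stale"]] or PATHS
    return min(candidates, key=lambda p: shortest(router, p["exit"])[0])["exit"]


def trace(start, stale_aware, max_hops=4):
    """Follow hop-by-hop forwarding for prefix X until a CE is reached."""
    hops, node = [start], start
    while node not in ("CE1", "CE2") and len(hops) <= max_hops:
        node = shortest(node, best_exit(node, stale_aware[node]))[1]
        hops.append(node)
    return hops


print(trace("R1", {"R1": True, "R2": False}))  # ['R1', 'R2', 'R1', 'R2', 'R1']
print(trace("R1", {"R1": True, "R2": True}))   # ['R1', 'R2', 'CE2']
```

With R1 honouring STALE and R2 not, the packet bounces between R1 and R2 until the hop limit of the trace is hit; with both routers honouring STALE it exits via CE2.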

The issue occurs when some routers within the customer site are able to recognize the STALE transitive community and prefer non-stale paths in their forwarding planes (or BGP planes for that matter) while others are not, and when both stale and non-stale paths are present.

Question 1: How do you prevent a forwarding loop in such a case?

Question 2: How do you prevent a forwarding loop in the case when the customer has backup connectivity to his sites, or connectivity via a different VPN provider, yet routers in his site only partially understand the STALE community and only partially follow the forwarding rules?

In general, as the rule is about mandating some particular order of path forwarding selection, what is the mechanism in distributed systems like today's routing to achieve any assurance that such a rule is active and enforced across _all_ routers behind EBGP PE-CE L3VPN boundaries in customer sites?

Best regards, R.

-------- Original Message --------
Subject: [Idr] draft-uttaro-idr-bgp-persistence-00
Date: Sat, 22 Oct 2011 00:23:55 +0200
From: Robert Raszuk<><>
Reply-To:<>
To:<> List<><>


I have read the draft and have one question and one observation.


What is the point of defining the DO_NOT_PERSIST community? In other words, why would not having the PERSIST community set not mean the same as having the path marked with DO_NOT_PERSIST?


I found the below statement in section 4.2:

o  Forwarding must ensure that the Next Hop to a "stale" route is


Of course I agree. But since we are stating the obvious in the forwarding section, I think it would be good to also state explicitly in the best path selection that the next hop to a STALE best path must be


However, sessions, especially those between loopbacks, do not go down for no reason. Most likely there is a network problem which may have caused those sessions to go down. It is therefore likely that an LDP session also went down between some of the LSRs in the data path, and that in spite of having the paths in BGP and the next hops in IGP, the LSP required for both quoted L2/L3VPN applications is broken. That may particularly happen when the network chooses to use independent control mode for label allocation.

I would suggest at least adding a recommendation statement to the document that during best path selection, especially for stale paths, the validity of the required forwarding paradigm to the next hop of stale paths should be verified.

For example using techniques as described in:


Best regards, R.

_______________________________________________ Idr mailing list<>
