Re: [Idr] draft-uttaro-idr-bgp-persistence-00

Enke Chen <> Fri, 28 October 2011 06:18 UTC



My comments are inline.

On 10/27/11 1:17 PM, UTTARO, JAMES wrote:
> Enke,
> GR is a solution that is essentially local in scope; it does not have the ability to inform downstream speakers of the viability of routing state from the point of a possible control plane failure. OTOH, Persistence does propagate the condition of state. This provides distinct advantages in terms of customers' awareness of the SP's control plane. One could imagine a customer receiving a STALE path and responding by selecting a backup. Some of the extensions to this draft that I have considered include colouring of STALE to indicate whether the condition arises from a local (PE) or internal iBGP (RR) failure.
> GR makes no distinction between STALE state and ACTIVE state. This can lead to the STALE path still being preferred throughout the topology. IMO this is incorrect behavior regardless of the comparison.
> PERSISTENCE allows a customer to indicate which paths should be candidates. Customers may want to immediately fail over to the backup for some paths and not for others. GR is not capable of doing this; it is all or nothing. The granularity is not sufficient; it needs to be at the path level. There may even be a case for having even more granularity, i.e. a per-path timer. GR is not capable of being extended for either of these cases.

I am not sure how this path-level persistence would work 
operationally.  Without detailed information about a provider's 
network, how would a customer know what kinds of failures and recovery 
they might experience?  Consider the example of the simultaneous 
RR failures in the draft: why wouldn't any customer want to 
protect against such failures?  The end result could be that the 
PERSISTENCE flag is always set, thus losing its significance.

Regarding the use of the STALE state vs. ACTIVE state, clearly there is a 
tradeoff.  GR uses the stale routes in order to avoid forwarding 
churn, which has been a critical requirement for a long time.  If 
there is a real need for favoring an ACTIVE route over a STALE one in GR, 
it can be done by a simple knob.
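Such a knob could be sketched as below; the data model, the knob name, and the simplified tie-break are invented for illustration and do not come from RFC 4724 or any implementation.

```python
# Hypothetical "prefer-active" knob: demote GR-stale paths before the
# normal decision process runs.
def best_path(paths, prefer_active=True):
    """Pick a best path, optionally preferring ACTIVE over STALE."""
    if prefer_active:
        active = [p for p in paths if not p.get("stale")]
        if active:          # fall back to stale paths only when no
            paths = active  # active path exists at all
    # Stand-in for the full RFC 4271 decision process:
    # higher LOCAL_PREF wins, then shorter AS_PATH.
    return min(paths, key=lambda p: (-p["local_pref"], len(p["as_path"])))

paths = [
    {"stale": True,  "local_pref": 200, "as_path": [64500]},
    {"stale": False, "local_pref": 100, "as_path": [64500, 64501]},
]
# With the knob on, the ACTIVE path wins even though the STALE one would
# win the normal comparison; with the knob off, GR's current behavior
# (no STALE/ACTIVE distinction) is preserved.
```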

As you know, BGP is full of knobs that adjust behaviors for different 
needs :-)

> GR does not provide protection through successive restarts of the session. I believe that if this occurs the state will be invalidated. So for a session that is bouncing due to an overload condition, GR will not provide the required protection.

This can be addressed by a simple knob to set the min stale timer for GR.

> GR does not employ a make-before-break strategy. All state is invalidated first, then the newly learned state is processed. This leads to routing churn, especially if the majority of the state is the same, which I am pretty sure is the case.

Such behavior would be an implementation bug that needs to be fixed.  
But it is not an issue with the protocol itself.

This is what we have in 4.2. Procedures for the Receiving Speaker, RFC 4724:


    The Receiving Speaker MUST replace the stale routes by the routing
    updates received from the peer.  Once the End-of-RIB marker for an
    address family is received from the peer, it MUST immediately remove
    any routes from the peer that are still marked as stale for that
    address family.
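A minimal sketch of the procedure quoted above; the in-memory route-table model is invented for illustration.

```python
# Sketch of the RFC 4724 Receiving Speaker behavior: stale routes are
# retained and replaced by fresh updates, and anything still marked
# stale at End-of-RIB is purged immediately -- not before.
class ReceivingSpeaker:
    def __init__(self):
        self.rib = {}  # prefix -> {"path": ..., "stale": bool}

    def on_session_down(self):
        # Mark, but do not delete, routes learned from the peer.
        for entry in self.rib.values():
            entry["stale"] = True

    def on_update(self, prefix, path):
        # A fresh update replaces the stale route for that prefix.
        self.rib[prefix] = {"path": path, "stale": False}

    def on_end_of_rib(self):
        # Remove any route from the peer still marked stale.
        self.rib = {p: e for p, e in self.rib.items() if not e["stale"]}

s = ReceivingSpeaker()
s.on_update("10.0.0.0/8", "via PE1")
s.on_update("192.0.2.0/24", "via PE1")
s.on_session_down()                   # restart: both routes go stale
s.on_update("10.0.0.0/8", "via PE1")  # only one prefix re-advertised
s.on_end_of_rib()                     # the other is purged only here
```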


There are several possibilities for the premature purge of the stale 
routes. For example, the "Forwarding State" flag was somehow not set 
after the session was re-established, or the EOR was sent 
prematurely.  Further investigation will be needed in order to identify 
any possible implementation or config issues involved in your setup.

> GR invalidates state in the case of protocol error, i.e. a malformed update will invalidate all of the state. This is not the desired behavior.

It has been addressed by the following extension:

> GR is not specific as to which events invoke it or not. From my read of the draft it is not clear whether holdtime expiration invokes GR. The draft is unclear.

I think that it is covered by the above extension.  If not, it should be 
clarified.
> It is not clear to me how RRs and PEs differ in using GR.

I think that there is one main difference when an RR is not in the 
forwarding path.  In that case, the RR should always set the F bit in 
the GR Capability so that its clients will continue forwarding after 
they lose their sessions with the RR.  It is a deployment issue, though.
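To make the F bit concrete, here is a sketch of encoding the GR capability value per RFC 4724 (a 4-bit Restart Flags field and a 12-bit Restart Time share two octets, followed by one AFI/SAFI/Flags tuple per address family); the helper name and argument shapes are my own.

```python
import struct

F_BIT = 0x80  # "Forwarding state" bit: high bit of the per-AF flags octet

def gr_capability(restart_time, af_tuples, restart_flags=0):
    """Encode a Graceful Restart capability value (RFC 4724).

    af_tuples: iterable of (afi, safi, forwarding_preserved).
    """
    # Restart Flags (4 bits) and Restart Time (12 bits) share 2 octets.
    value = struct.pack("!H", (restart_flags << 12) | (restart_time & 0x0FFF))
    for afi, safi, preserved in af_tuples:
        flags = F_BIT if preserved else 0
        value += struct.pack("!HBB", afi, safi, flags)
    return value

# An RR off the forwarding path would always claim preserved forwarding,
# e.g. for IPv4 unicast (AFI 1, SAFI 1):
cap = gr_capability(120, [(1, 1, True)])
```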

> The time that state can persist is limited to about 1 hour max.

I think that you are talking about the "Restart Time" field, which is 12 
bits wide and amounts to about 68 minutes.  The "Restart Time" is for the 
session re-establishment.  It does not impact the duration for holding 
stale routes after the session is re-established.
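The arithmetic behind the 68-minute figure:

```python
# A 12-bit field caps the "Restart Time" at 2^12 - 1 seconds.
max_restart_time = 2**12 - 1        # 4095 seconds
minutes = max_restart_time / 60
print(round(minutes, 2))            # prints 68.25
```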

If the session does not get re-established in 68 minutes, the stale 
routes would be purged.  That is a long time, isn't it?   However, if 
one really wants to extend the session re-establishment time and 
continue to hold stale routes, it can be done by a simple knob.

> GR does detail the behavior where convergence is not achieved between restarts.. Similar to above..

The min stale timer knob can cover it (see above).

But did you mean "does not"?  We can certainly clarify this in 4724bis if 
that is the case.

> I do not believe that the current GR paradigm can be extended to cover the majority of the cases above.

Except for the path-level persistence you mentioned, I believe GR 
will be able to address all the other persistence requirements you listed, 
with some simple knobs and some implementation enhancements.

> Thanks,
> 	Jim Uttaro

Thanks.   -- Enke

> -----Original Message-----
> From: Enke Chen []
> Sent: Wednesday, October 26, 2011 8:43 PM
> Cc:; List; Enke Chen
> Subject: Re: [Idr] draft-uttaro-idr-bgp-persistence-00
> Hi, folks:
> I have a hard time understanding what new problems (beyond GR)
> the draft tries to solve :-(
> If the concern is about the simultaneous RR failure as shown in the
> examples in Sect. 6 Application, that can be addressed easily using GR.
> As the RRs are not in the forwarding path, it means that the forwarding
> is not impacted (and thus is preserved) during the restart of an RR.   The
> Forwarding State bit (F) in the GR capability should always be set by
> the RR when it is not in the forwarding path.
> Also in the case of simultaneous RR failure, I do not see why one would
> want to retain some routes, but not others, using the communities
> specified in the draft.  As the RRs are not in the forwarding path,
> wouldn't it be better to retain all the routes on a PE/client?
> As you might be aware, efforts have been underway to address issues with
> GR found during implementation and deployment. They include the spec
> respin, notification handling, and implementations.  If there are issues
> in the GR area that are not adequately addressed,  I suggest that we try
> to address them in the GR respin if possible, instead of creating
> another variation unnecessarily.
> Thanks.   -- Enke
> On 10/26/11 10:24 AM, Robert Raszuk wrote:
>> Jim,
>> When one, during the design phase of a routing protocol (or an
>> extension or modification to it), already realizes that enabling such a
>> feature may cause real network issues if not done carefully, that
>> should trigger an alarm to rethink the solution and explore
>> alternative approaches to the problem space.
>> We as operators already have a hard time ensuring that a feature
>> enabled within our intradomain boundaries is rolled out network
>> wide. Here you are asking for the same level of awareness across eBGP
>> boundaries. This is practically unrealistic IMHO.
>> Back to the proposal ... I think that if anything needs to be done, it
>> is to employ per-prefix GR with a longer and locally configurable timer.
>> That would address information persistence across direct IBGP sessions.
>> On the RRs use case of this draft we may perhaps agree to disagree,
>> but I do not see a large enough probability of a correctly engineered RR
>> plane experiencing simultaneous multiple iBGP session drops. If that
>> happens, the RR placement, platforms, or deployment model should be
>> re-engineered.
>> Summary: I do not think that the IDR WG should adopt this document. Just
>> adding a warning to the deployment section is not sufficient.
>> Best regards,
>> R.
>>> Robert,
>>> The introduction of this technology needs to be carefully evaluated
>>> when being deployed into the network. Your example clearly calls out
>>> how a series of independent designs can culminate in incorrect
>>> behavior. Certainly the deployment of persistence on a router that
>>> interacts with a router that does not deploy it needs to be clearly
>>> understood by the network designer. The goal of this draft is to
>>> provide a fairly sophisticated tool that will protect the majority of
>>> customers in the event of a catastrophic failure. The premise being
>>> that the perfect is not the enemy of the good. I will add text in the
>>> deployment considerations section to better articulate that.
>>> Thanks, Jim Uttaro
>>> -----Original Message-----
>>> From: [] On Behalf Of Robert Raszuk
>>> Sent: Sunday, October 23, 2011 5:32 PM
>>> To: List
>>> Subject: [Idr] draft-uttaro-idr-bgp-persistence-00
>>> Authors,
>>> Actually when discussing this draft a new concern surfaced which I
>>> would like to get your answer on.
>>> The draft in section 4.2 says as one of the forwarding rules:
>>> o  Forwarding to a "stale" route is only used if there are no other
>>> paths available to that route.  In other words an active path always
>>> wins regardless of path selection.  "Stale" state is always
>>> considered to be less preferred when compared with an active path.
>>> In light of the above rule, let's consider a very simple case of a
>>> dual-PE-attached site of an L3VPN service. Two CEs would inject into
>>> their IBGP mesh routes to the remote destination: one marked as STALE
>>> and one not marked at all. (Each CE is connected to a different PE and
>>> each PE RT imports only a single route to a remote hub headquarters to
>>> support geographic load balancing.)
>>> Let me illustrate:
>>> VPN Customer HUB
>>>
>>>   PE3      PE4
>>>        SP
>>>   PE1      PE2
>>>    |        |
>>>   CE1      CE2
>>>   1|        |10
>>>    R1 ------ R2
>>>         1
>>>
>>> CE1, CE2, R1 and R2 are in an IBGP mesh. The IGP metrics of CE1-R1 and
>>> R1-R2 are 1, and R2-CE2 is 10.
>>> Prefix X is advertised by remote hub in the given VPN such that PE1
>>> vrf towards CE1 only has X via PE3 and PE2's vrf towards CE2 only has
>>> X via PE4.
>>> Let's assume EBGP sessions PE3 to HUB went down, but ethernet link
>>> is up, next hop is in the RIB while data plane is gone. Assume no
>>> data plane real validation too. /* That is why in my former message
>>> I suggested that data plane validation would be necessary */.
>>> R1 has X via PE1/S (stale) and X via PE2/A (active) - it understands
>>> STALE, so it selects in its forwarding table the path via CE2.
>>> R2 has X via PE1/S (stale) and X via PE2/A (active) - it does not
>>> understand STALE, never was upgraded to support the forwarding rule
>>> stated above in the draft and chooses X via CE1 (NH metric 2 vs 10).
>>> R1--R2 produce a data plane loop as long as STALE paths are present in
>>> the system. Quite fun to troubleshoot too, as the issue of PE3
>>> injecting such STALE paths may be on the opposite side of the world.
>>> The issue occurs when some routers within the customer site are
>>> able to recognize the STALE transitive community and prefer non-stale
>>> paths in their forwarding planes (or BGP planes for that matter)
>>> while others are not, as well as when both stale and non-stale paths
>>> are present.
>>> Question 1: How do you prevent forwarding loop in such case ?
>>> Question 2: How do you prevent forwarding loop in the case when
>>> customer would have backup connectivity to his sites or connectivity
>>> via different VPN provider yet routers in his site only partially
>>> understand the STALE community and only partially follow the
>>> forwarding rules ?
>>> In general, as the rule mandates a particular order of path
>>> selection for forwarding, what mechanism exists in distributed
>>> systems like today's routing to achieve any assurance that
>>> such a rule is active and enforced across _all_ routers behind the EBGP
>>> PE-CE L3VPN boundaries in customer sites?
>>> Best regards, R.
>>> -------- Original Message --------
>>> Subject: [Idr] draft-uttaro-idr-bgp-persistence-00
>>> Date: Sat, 22 Oct 2011 00:23:55 +0200
>>> From: Robert Raszuk <>
>>> Reply-To:
>>> To: List <>
>>> Hi,
>>> I have read the draft and have one question and one observation.
>>> Question:
>>> What is the point of defining the DO_NOT_PERSIST community? In other
>>> words, why would not having the PERSIST community set not mean the same
>>> as having the path marked with DO_NOT_PERSIST?
>>> Observation:
>>> I found the below statement in section 4.2:
>>> o  Forwarding must ensure that the Next Hop to a "stale" route is
>>> viable.
>>> Of course I agree. But since we are stating the obvious in the
>>> forwarding section, I think it would be good to also explicitly state
>>> in the best path selection that the next hop of a STALE best path must
>>> be valid.
>>> However, sessions, especially those between loopbacks, do not go down
>>> for no reason. Most likely there is a network problem which may have
>>> caused those sessions to go down. It is therefore likely that an LDP
>>> session also went down between some of the LSRs in the data path, and
>>> that in spite of having the paths in BGP and next hops in the IGP, the
>>> LSP required for both quoted L2/L3VPN applications is broken. That may
>>> particularly happen when a network chooses to use independent control
>>> mode for label allocation.
>>> I would suggest at least adding a recommendation to the document
>>> that, during best path selection, especially for stale paths, the
>>> validity of the required forwarding paradigm to the next hop of stale
>>> paths should be verified.
>>> For example using techniques as described in:
>>> draft-ietf-idr-bgp-bestpath-selection-criteria
>>> Best regards, R.