RE: a draft about how BFD notifying the state change to applications

richard.spencer@bt.com Wed, 24 August 2005 17:41 UTC

From: richard.spencer@bt.com
Date: Wed, 24 Aug 2005 18:41:17 +0100
Message-ID: <B5E87B043D4C514389141E2661D255EC0A835B87@i2km41-ukdy.domain1.systemhost.net>
To: yangpingan@huawei.com
Cc: rtg-bfd@ietf.org
Subject: RE: a draft about how BFD notifying the state change to applications

Pingan,

It would help us to understand the perceived benefits of your solution if you could provide some example applications, with details of how it would work. Please see further comments inline...

> Richard,
> 
> What you said is just one possible situation of the WTR and HOLD 
> process.

Yes, that's why I said it was an example.

> Normally the hold time is less than the WTR time (the recommended
> value of the HOLD timer is several seconds or less than one 
> second, and of the WTR timer several minutes).

It doesn't matter what you set the HOLD time to: unless you set it to zero, you are still negating the benefits of fast fault detection. Regarding the WTR timer, other than addressing the issue of flapping sessions (which IMO your solution doesn't do anyway), what do you believe the benefits of using this timer are? IMO it raises more issues than it solves. For example, performance monitoring should only be performed when a link/LSP/route is available. In your proposal you would need to start measuring performance when the WTR timer expires. Don't you think this will have an adverse effect on meeting SLAs, i.e. outages will be longer (and more expensive to the operator) than they need to be? As another example, what if you are using an expensive ISDN dialup circuit for the backup path? Wouldn't you want to switch back over to the primary (e.g. leased line) circuit as soon as possible to avoid unnecessary dialup charges?
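
As a rough illustration of the HOLD timer point, here is a minimal Python sketch of the HOLD rule from section 4 of the draft as quoted further down in this thread (a session-up event stops a running HOLD timer and the application is not told). The function name, the event format and the timer values are assumptions made up for this illustration; they are not taken from the draft or from any BFD implementation, and the WTR timer is deliberately left out.

```python
def notified_down_times(transitions, hold):
    """Return the times (secs) at which the application is told 'session down'.

    transitions: list of (time_secs, state) tuples sorted by time, where
                 state is "down" or "up" as detected by BFD itself.
    hold:        HOLD timer value in secs (hypothetical parameter name).
    """
    notifications = []
    hold_started = None  # time the HOLD timer was started, None if not running

    for t, state in transitions:
        # A HOLD timer that has already expired by time t fires first.
        if hold_started is not None and hold_started + hold <= t:
            notifications.append(hold_started + hold)
            hold_started = None

        if state == "down":
            if hold_started is None:
                hold_started = t       # BFD detected the fault, start HOLD
        else:
            hold_started = None        # session up: stop HOLD, stay silent

    # A HOLD timer still running after the last event eventually expires.
    if hold_started is not None:
        notifications.append(hold_started + hold)
    return notifications


# Session flapping every 5 secs with a 10 sec HOLD timer.
flapping = [(0, "down"), (5, "up"), (10, "down"), (15, "up"), (20, "down"), (25, "up")]
print(notified_down_times(flapping, hold=10))       # []

# A solid failure is only reported once the HOLD timer has expired.
print(notified_down_times([(0, "down")], hold=10))  # [10]
```

With a 5 sec flap interval and a 10 sec HOLD timer the list of notifications stays empty however long the flapping goes on, while a solid failure is only reported 10 secs after BFD detected it.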

> So this cannot be achieved by an alarm threshold.

If you are referring here to the ability to wait for a user-configurable period of time before sending the BFD status to an application, then yes, you are correct: using an alarm threshold doesn't provide this ability. However, that's not the intention. The purpose of the threshold as I described it would be to detect session flapping in order to take the appropriate action (similar to the way in which BGP route dampening works). As I said previously, I can't think of any benefits of using a WTR timer, and the HOLD timer negates the benefits of fast detection.
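
To make the threshold idea concrete, the following is a minimal Python sketch that counts session state transitions in a sliding time window and flags the session once the count exceeds a configured threshold. The class name, parameters and values are hypothetical and chosen only for illustration. This is a plain counter, not the penalty/decay scheme that BGP route dampening uses, and what to do once the threshold is exceeded (raise an alarm, shut the session down, or both) is left to the caller.

```python
from collections import deque


class FlapDetector:
    """Counts BFD session up/down transitions inside a sliding time window."""

    def __init__(self, threshold, window_secs):
        self.threshold = threshold      # max transitions tolerated per window
        self.window_secs = window_secs  # length of the time period in secs
        self._times = deque()           # timestamps of recent transitions

    def record_transition(self, now):
        """Record one up/down transition at time `now` (secs).

        Returns True if the number of transitions within the window
        now exceeds the threshold, i.e. the session is considered flapping.
        """
        self._times.append(now)
        # Drop transitions that have fallen out of the window.
        while self._times and now - self._times[0] > self.window_secs:
            self._times.popleft()
        return len(self._times) > self.threshold


# More than 4 transitions within 60 secs counts as flapping.
detector = FlapDetector(threshold=4, window_secs=60)
for t in (0, 5, 10, 15, 20):
    if detector.record_transition(t):
        print(f"t={t}s: threshold exceeded, raise alarm / shut session down")
```

Note that BFD still reports every individual transition to the application immediately; the counter only adds a decision on top about when the flapping itself warrants action.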

> Another purpose of WTR is that, before BFD notifies the 
> forwarding engine, it gives a chance to prepare the 
> forwarding table (especially for 1-hop BFD),

Exactly what preparation do you believe can/should be done while waiting for the WTR timer to expire?

> so traffic loss can be avoided when the defect is recovered.

Why do you believe traffic will be lost when a defect is recovered from? If there is no backup for the failed LSP/route/link then any traffic being sent is going to be lost anyway until the defect is recovered from. If traffic is currently using a backup LSP/route/link then why do you think traffic will be lost when the defect is recovered from and the primary becomes available?

For example, in the backup static IP route case, once a BFD session associated with a primary static IP route comes back up then the router can simply stop forwarding using the backup route and start using the primary. You mention ARP in your draft, but IMO this shouldn't be a problem. BFD systems connected via Ethernet need ARP cache entries associated with the next hops they use to reach each other before they can send BFD control packets. The ARP cache entry for a primary static IP route next-hop must already be in the ARP cache by the time the BFD session comes back up, and therefore no user traffic using the static route should be lost.

Let's take OSPF as another example. Why would you want to wait a period of time (i.e. the WTR time) before re-establishing an OSPF adjacency or re-advertising routes to an OSPF neighbour that has become reachable? You can't send any IP traffic to a neighbour until you've exchanged routes anyway. So, as with the previous static route example, if there is no alternative route you're dropping packets anyway; if there is an alternative route then you will continue to use the alternative until you've exchanged LSAs and learned alternative routes with a shorter path.

Regards,
Richard

> ----- Original Message ----- 
> From: <richard.spencer@bt.com>
> To: <yangpingan@huawei.com>; <internet-drafts@ietf.org>
> Cc: <rtg-bfd@ietf.org>; <jhaas@nexthop.com>; 
> <dward@cisco.com>; <d.katz@juniper.com>
> Sent: Wednesday, August 24, 2005 7:22 PM
> Subject: RE: a draft about how BFD notifying the state change 
> to applications
> 
> 
> Pingan,
> 
> If I understand correctly, the problem you are trying to 
> solve is the case where we get a flapping BFD session (i.e. 
> the session is going up and down repeatedly), e.g. due to an 
> intermittent link failure. Your solution is to use hold 
> timers to delay any actions to be taken by applications as a 
> result of a session going up or down.
> 
> IMO your solution does not solve the problem. If a session is 
> flapping then the flapping needs to be detected and the 
> appropriate action taken. Your solution does not reliably 
> detect flapping and I assume is based on ignoring flaps that 
> occur within the configured hold times.
> 
> The current text in section 4 states: "When BFD session goes 
> up, if HOLD timer is running, then stop HOLD timer and does 
> not notify application the state change of the session. 
> Otherwise start WTR timer." The effect of this proposal is 
> that when you get a BFD session going up and down, although 
> it will be detected by BFD, the application will never take 
> any action if the flap interval is less than the hold timer. 
> For example, let's say the flap interval is 5 secs and the hold 
> timer is set to 10 secs. When the session goes down the hold 
> timer will start but after 5 secs the session will come back 
> up and the hold timer will be stopped and no action taken. 
> This cycle will continue indefinitely, i.e. traffic will 
> continue to be dropped intermittently and the flapping will 
> not be detected.
> 
> Even if you don't stop the hold timer when the session comes 
> back up, you still won't reliably detect flapping. Let's say a 
> session goes down and you start the hold timer (set to say 10 
> secs). The session might go up and down 5 times during the 10 
> sec hold time, but by chance when the hold timer expires the 
> session may be up. In which case, the application won't take 
> any action despite traffic being lost intermittently over the 
> 10 sec period.
> 
> Also, the main purpose of using BFD is to provide fast fault 
> detection. If hold timers are introduced to delay any actions 
> to be taken following fault detection (e.g. failover to a 
> backup route), then what's the point of having fast detection 
> in the first place?
> 
> An alternative solution would be to use a BFD session up/down 
> state transition alarm threshold. If the number of times a 
> session goes up/down within a configured time period exceeds 
> the threshold, then i) the session can be disabled or 
> administratively shut down (under control of the OSS or the BFD 
> system itself), ii) an alarm can be raised to instigate fault 
> diagnosis/correction activities, or iii) both. This solution 
> does not require any changes to the BFD protocol and could be 
> implemented today.
> 
> Regards,
> Richard
> 
> > -----Original Message-----
> > From: rtg-bfd-bounces@ietf.org [mailto:rtg-bfd-bounces@ietf.org]On
> > Behalf Of yangpingan 30338
> > Sent: 24 August 2005 03:26
> > To: internet-drafts@ietf.org
> > Cc: rtg-bfd@ietf.org; jhaas@nexthop.com; dward@cisco.com;
> > d.katz@juniper.com
> > Subject: a draft about how BFD notifying the state change to
> > applications
> > 
> > 
> > 
> > Dear all,
> > 
> > I have written a draft about how BFD notifies state changes 
> > to applications. The purpose is to reduce the effect on the 
> > system when a BFD session goes up and down frequently, and to 
> > reduce packet loss when a defect is recovered. It's 4 pages 
> > long. Comments are very welcome.
> > 
> > Please help to post this draft to the IETF website, and thank 
> > you for that.
> > 
> > 
> > Thank you and Best regards.