Re: BFD stability follow-up from IETF-91

Marc Binderberger <marc@sniff.de> Mon, 01 December 2014 08:51 UTC

Date: Mon, 01 Dec 2014 00:54:54 -0800
From: Marc Binderberger <marc@sniff.de>
To: Manav Bhatia <manavbhatia@gmail.com>
Message-ID: <20141201005454102977.238caa76@sniff.de>
In-Reply-To: <CAG1kdojJfGqDUW2_CshM58v7+sF2H-vaCN1j-9EMYH0yRG5=UQ@mail.gmail.com>
References: <20141126001931.GJ20330@pfrc> <CAG1kdoghcA=xSaXmkr68qduH2t8oC=-ZazoQztj8JK12SazKsw@mail.gmail.com> <20141126005023981392.0c488535@sniff.de> <CAG1kdojJfGqDUW2_CshM58v7+sF2H-vaCN1j-9EMYH0yRG5=UQ@mail.gmail.com>
Subject: Re: BFD stability follow-up from IETF-91
Archived-At: http://mailarchive.ietf.org/arch/msg/rtg-bfd/54q6TJehmMmYVyR0yaaexXtuwlM
Cc: "rtg-bfd@ietf.org" <rtg-bfd@ietf.org>

Hello Manav!

Quite a few emails on this topic :-) Let me nevertheless reply to this one:

> Now given just the sequence numbers it's almost impossible for A to
> know whether the issue was at the RX or the TX side.

... or in between (the link/Internet).

For the Rx side: you timestamp the packets when the hardware receives them.
This timestamp could live in some additional internal packet header ("pak
header"), something that implementations probably have (but yes, to me this
is an implementation detail, not an aspect of the BFD packet).

Then you timestamp the packet again when it reaches your software path,
assuming the BFD implementation is in software. You may even take additional
timestamps as the packet proceeds through the various functions.

You keep this information in a rotating buffer and, when the session goes
down, save the records of the last N packets of the session, plus any
unexpected packets such as delayed "Up" packets that arrive while you are
already in Down.
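
As an illustration only, such a rotating buffer could look like the Python
sketch below. The names PacketRecord and SessionHistory, the field layout and
the default N = 50 are made up for this example; they are not taken from any
BFD implementation or from the draft.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List, Optional


@dataclass(frozen=True)
class PacketRecord:
    seq: int        # sequence number carried in the BFD packet
    state: str      # session state advertised by the packet, e.g. "Up"
    hw_ts: float    # time the hardware received the packet (seconds)
    sw_ts: float    # time the software path saw the packet (seconds)


class SessionHistory:
    """Keeps the last N packet records and freezes a copy on session down."""

    def __init__(self, n: int = 50) -> None:
        self._ring: Deque[PacketRecord] = deque(maxlen=n)
        self.saved: Optional[List[PacketRecord]] = None
        # Unexpected packets seen after the session went down (also bounded).
        self.late: Deque[PacketRecord] = deque(maxlen=n)

    def on_packet(self, rec: PacketRecord) -> None:
        if self.saved is None:
            # Session still up: the oldest record drops out once N is reached.
            self._ring.append(rec)
        else:
            # Already Down: record e.g. delayed "Up" packets separately.
            self.late.append(rec)

    def on_session_down(self) -> None:
        # Freeze the last N records for later debugging.
        self.saved = list(self._ring)
```

The Tx side could keep the same kind of history with its own send timestamps
and freeze it when the first "Down" packet comes back.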

If the sender does something equivalent on the Tx side, also saving the
records of the last N packets once it receives the first "Down" packet back
from the Rx node, you have Tx and Rx delta times and can correlate them based
on the sequence number. If N is large enough, e.g. N = 10 or 50, you should
be able to debug how the session went down.

For your example the Rx timestamps would tell you that "101", "102" and "103"
arrived at your Rx hardware with a delay. If the Tx timestamps show no delay
in the Tx path then you would conclude it's the link/Internet in between.
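
A rough Python sketch of this correlation, under assumptions of my own: each
node exports a map from sequence number to timestamp in milliseconds, and a
gap is called "delayed" when it exceeds the interval plus some slack. The
function names, the slack value and the sample timestamps are invented for
illustration. Only per-node deltas are compared, so the two clocks do not need
to be synchronized here.

```python
from typing import Dict, List, Tuple


def gaps_by_seq(ts_ms: Dict[int, float]) -> Dict[int, float]:
    """Time since the previous sequence number, on one node's own clock."""
    return {
        seq: ts_ms[seq] - ts_ms[seq - 1]
        for seq in sorted(ts_ms)
        if seq - 1 in ts_ms
    }


def locate_delay(
    tx_ms: Dict[int, float],     # Tx node: time each packet was sent
    rx_hw_ms: Dict[int, float],  # Rx node: time the hardware received it
    rx_sw_ms: Dict[int, float],  # Rx node: time the BFD software saw it
    interval_ms: float,
    slack_ms: float,
) -> List[Tuple[int, str]]:
    """For each sequence number, name the first stage showing a late gap."""
    limit = interval_ms + slack_ms
    tx, hw, sw = gaps_by_seq(tx_ms), gaps_by_seq(rx_hw_ms), gaps_by_seq(rx_sw_ms)
    verdicts = []
    for seq in sorted(set(tx) & set(hw) & set(sw)):
        if tx[seq] > limit:
            where = "tx"            # packet left the sender late
        elif hw[seq] > limit:
            where = "link"          # sent on time, reached Rx hardware late
        elif sw[seq] > limit:
            where = "rx-software"   # on time at hardware, late in software
        else:
            where = "ok"
        verdicts.append((seq, where))
    return verdicts


# Made-up numbers for the example: 101..103 are sent on time but all reach
# the Rx hardware only after roughly 110 ms.
tx = {100: 0.0, 101: 33.0, 102: 66.0, 103: 99.0}
rx_hw = {100: 1.0, 101: 110.0, 102: 110.5, 103: 111.0}
rx_sw = {100: 1.2, 101: 110.2, 102: 110.7, 103: 111.2}
print(locate_delay(tx, rx_hw, rx_sw, interval_ms=33.0, slack_ms=10.0))
```

With these sample values the late gap shows up first at the Rx hardware for
"101", which points at the path in between rather than at either node.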

If you manage to synchronize the systems down to a few milliseconds then the
timestamps would even allow you to explicitly measure the link delay, confirm
that Tx packet generation happened without a delay, etc.
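
With synchronized clocks the per-packet link delay is simply the Rx hardware
timestamp minus the Tx timestamp for the same sequence number. A small Python
sketch, assuming a known bound on the clock synchronization error; the
function name and the sync_error_ms parameter are hypothetical:

```python
from typing import Dict, Tuple


def link_delay_ms(
    tx_ms: Dict[int, float],
    rx_hw_ms: Dict[int, float],
    sync_error_ms: float,
) -> Dict[int, Tuple[float, float, float]]:
    """Per sequence number: (estimate, lower bound, upper bound) in ms."""
    result = {}
    for seq in sorted(set(tx_ms) & set(rx_hw_ms)):
        est = rx_hw_ms[seq] - tx_ms[seq]
        # The estimate is only as good as the clock synchronization.
        lo = max(0.0, est - sync_error_ms)
        hi = est + sync_error_ms
        result[seq] = (est, lo, hi)
    return result
```

A delay well above the synchronization error is then attributable to the link
with some confidence; anything within the error bound is not.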

True, this requires that you have the information from both the Rx and the
Tx node.


This sounds clumsy compared to the elegant all-in-one-packet approach, but in
the end both the Tx and Rx side may need multiple timestamps and some history
anyway to debug their system if a problem shows up.


Regards, Marc





> My claim is that just using sequence numbers may NOT help.
> 
> A BFD session with an interval of 33ms on router A will flap if it
> does not see any BFD packet for 100ms.
> 
> Assume that the last seq number that A sees from the remote end is,
> say 100. A will bring down the BFD session if it now does not see 101,
> 102 and 103 in the next 100ms.
> 
> Further assume that these packets were not seen by A and the session
> flaps. However, we get these 3 BFD packets immediately after this flap
> -- at 100ms + some_delta.
> 
> Now given just the sequence numbers it's almost impossible for A to
> know whether the issue was at the RX or the TX side.
> 
> Am I missing something?





> 
> Cheers, Manav
> 
> 
> On Wed, Nov 26, 2014 at 2:20 PM, Marc Binderberger <marc@sniff.de> wrote:
>> Hello Manav,
>> 
>>> I believe the work is important and addresses something that's really
>>> required (spent too much time debugging why BFD flapped!).
>> 
>> agree :-) we should keep the discussion alive.
>> 
>> 
>>> Time stamping would have helped in debugging whether the BFD
>>> packet was sent late, or whether the packet was sent on time and also
>>> arrived on time but was delayed when passing it up the BFD
>>> stack/processor (lay in the RX buffer for a tad too long)
>> 
>> well, I can see a point in having the Tx timestamps in the packet, mainly
>> for the purpose of knowing "this" packet was okay/not okay on the Tx side
>> and to correlate it with your local Rx measurement.
>> 
>> And even this point is less relevant with sequence numbers, as this number
>> allows the identification of packets and thus the correlation of
>> information from the Tx and Rx system.
>> 
>> 
>> Regards, Marc
>> 
>> 
>> 
>> 
>> 
>> 
>> On Wed, 26 Nov 2014 12:26:41 +0530, Manav Bhatia wrote:
>>> Hi Jeff,
>>> 
>>> I vividly remember the original intent of the stability draft was to
>>> help debug BFD failures -- to isolate the issue at the RX or the TX
>>> side. Time stamping would have helped in debugging whether the BFD
>>> packet was sent late, or whether the packet was sent on time and also
>>> arrived on time but was delayed when passing it up the BFD
>>> stack/processor (lay in the RX buffer for a tad too long), etc. But then
>>> time stamping came with its own set of issues, and was hence dropped
>>> from the original draft.
>>> 
>>> Can the authors send a summary on the list on why time stamping was
>>> dropped so that we're all clear on that one.
>>> 
>>> The current proposal does help but is not complete.
>>> 
>>> Assume that the RX end loses a BFD session and learns later that it
>>> did eventually receive the missing BFD packets (based on the seq #).
>>> How would it know which end was misbehaving? Was it a delay at the TX
>>> side, or was it the RX that delayed passing the packets to the BFD
>>> process(or). This is usually what we want to debug and I want to
>>> understand how this draft with sequence numbers can unequivocally tell
>>> me that.
>>> 
>>> I believe the work is important and addresses something that's really
>>> required (spent too much time debugging why BFD flapped!). Clearly
>>> what would help is putting a small section that describes how we can
>>> use the sequence numbers to debug what and where things went wrong.
>>> 
>>> Cheers, Manav
>>> 
>>> 
>>> On Wed, Nov 26, 2014 at 5:49 AM, Jeffrey Haas <jhaas@pfrc.org> wrote:
>>>> draft-ashesh-bfd-stability-01 was presented again during IETF-91 in
>>>> Honolulu.  The slides can be viewed here:
>>>> 
>>>> http://www.ietf.org/proceedings/91/slides/slides-91-bfd-4.pptx
>>>> 
>>>> To attempt to simplify the presentation, the contentious portion of the
>>>> timers was removed from the proposal, leaving only the sequence
>>>> numbering for detecting loss of BFD async packets.
>>>> 
>>>> When the room was polled to see whether the draft should be adopted as a
>>>> WG item, the sense of the room was very quiet.  As promised, this is to
>>>> inquire for support for this draft on the WG mailing list to make sure the
>>>> whole group has a voice.
>>>> 
>>>> It should be noted that post-meeting discussion on the fate of this draft
>>>> noted that BFD authentication code points are plentiful and are available
>>>> with expert review.  Should the draft authors wish to continue this work
>>>> as Experimental, that is an option.
>>>> 
>>>> -- Jeff
>>>> 
>>> 
>