Re: BFD stability follow-up from IETF-91

Marc Binderberger <marc@sniff.de> Mon, 08 December 2014 04:56 UTC

Date: Sun, 07 Dec 2014 21:00:38 -0800
From: Marc Binderberger <marc@sniff.de>
To: "Sam K. Aldrin" <aldrin.ietf@gmail.com>, Manav Bhatia <manavbhatia@gmail.com>
Subject: Re: BFD stability follow-up from IETF-91
Archived-At: http://mailarchive.ietf.org/arch/msg/rtg-bfd/jjMt8sUN_GImT7tMDwu3CYiF1Lg
Cc: "rtg-bfd@ietf.org" <rtg-bfd@ietf.org>

Hello Sam, Manav et al.,

really a long thread ... :-)

Reading the emails, I asked myself more than once: what are we trying to 
achieve here?

There is debugging. Obviously, having more (relevant) data helps to find the 
cause, likely by eliminating the non-causes. I don't expect anything perfect; 
as Sam stated

> Debugging Packet drops, latency and mis-ordering of packets is mostly 
> shooting in the dark :D


But then there may be another motivation. Manav wrote

> Ideally even if there is some bit of congestion i would like the BFD 
> packet to get through.

and I wonder what you mean. Do you mean BFD should not flap if the cause was 
congestion?
The word "stability" and some comments in the emails make me wonder whether 
we are talking about "do not time out when ...". That would be a very 
different topic: it would mean a different trigger than the timeout defined 
in RFC 5880.

In this case we would need to talk about the new trigger condition first 
before we discuss what data needs to be exchanged.
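For concreteness, the only failure trigger RFC 5880 defines is the detection 
timeout. A minimal Python sketch of the asynchronous-mode calculation 
(section 6.8.4; the function and variable names are mine, not from the RFC):

```python
def detection_time_us(remote_detect_mult, remote_desired_min_tx_us,
                      local_required_min_rx_us):
    """Asynchronous-mode detection time per RFC 5880, section 6.8.4:
    the remote system's Detect Mult multiplied by the greater of our
    Required Min RX Interval and the last received Desired Min TX
    Interval (all in microseconds)."""
    return remote_detect_mult * max(remote_desired_min_tx_us,
                                    local_required_min_rx_us)

# e.g. Detect Mult 3 and a negotiated 50 ms interval -> 150 ms
print(detection_time_us(3, 50_000, 50_000))  # -> 150000
```

Any "stability" mechanism that suppresses a flap under congestion would have 
to replace or qualify this trigger, which is why the trigger discussion has 
to come first.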


>>> I do not see why it shouldn't be part of BFD v2 OR lead to v2 :D 
>> Sure, that would make Marc very happy ! :-)

Count me in! :-)

Seriously: if the discussion is about improving the debugging of v1, then v2 
may be a long shot. If we talk about different triggers, then a v2 header may 
make sense.


Regards, Marc



On Thu, 4 Dec 2014 19:52:19 -0800, Sam K. Aldrin wrote:
> Hi Manav,
> 
> On Dec 4, 2014, at 6:51 PM, Manav Bhatia wrote:
>> Hi Sam,
>>>> Ideally even if there is some bit of congestion i would like the BFD 
>>>> packet to get through.
>>> I understand the queuing problems but I am not clear how the ID is going 
>>> to solve the problem, if there is only congestion and not
>> 
>> It would help.
>> 
>> Assume the last sequence number you saw before the flap was s1. You timed 
>> out since you did not see s2 and s3 before the timeout. Now further assume 
>> that you know you did receive s2 and s3, but they arrived after the BFD 
>> expiry interval; then you know there wasn't a drop, and the packets 
>> arrived late because of some queuing issue. How you determine whether this 
>> delay was seen at the TX or RX side is open to discussion.
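The distinction Manav describes could be sketched as receiver-side 
bookkeeping, roughly like this (an illustration of the idea only; none of 
these names come from the draft):

```python
def classify_flap(received, expected, detect_deadline):
    """received: dict mapping sequence number -> arrival time for packets
    that eventually showed up; expected: sequence numbers due before the
    detection deadline. Decide whether the packets missing at timeout
    were dropped or merely late (e.g. delayed by queuing)."""
    late = [s for s in expected
            if s in received and received[s] > detect_deadline]
    dropped = [s for s in expected if s not in received]
    if dropped and not late:
        return "drop"
    if late and not dropped:
        return "late (queuing delay)"
    return "mixed or unknown"

# s2 and s3 did arrive, but only after the detection deadline at t=150 ms
print(classify_flap({2: 180, 3: 210}, [2, 3], 150))
```

Without the sequence numbers, the two outcomes are indistinguishable at the 
receiver, which is the point being argued here.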
> 
> As the draft doesn't say what exactly one would/should do, assuming there 
> is packet throttling happening due to various reasons, could you/the authors 
> elaborate on this? I would like to see those tangible things detailed first.
> 
> I see the problem a little differently though. BFD session flapping is an 
> indicator of the network behavior, be it device or network congestion or 
> something else. Even if the packets arrive late, as you say, one cannot 
> know where exactly the delay is happening.
> 
>> 
>> Without a sequence number you have no idea whether the packet was dropped 
>> or whether it arrived/processed late. 
>> 
>> If by some out-of-band mechanism you can figure out that the TX was done 
>> on time and the delay was at the RX end, then it's an implementation issue 
>> on the RX side. If the TX was delayed, then it's an implementation issue at 
>> the TX side.
> Don't think so. It could be a lot more than an implementation issue. It 
> could be network congestion due to bursty traffic, with nothing to do with 
> the implementation.
>> 
>> This helps in isolating the node that needs to fix the issue; otherwise 
>> we're only shooting in the dark. 
> The issue you see is what I think could be the actual network behavior and 
> not the device issue. 
> Debugging Packet drops, latency and mis-ordering of packets is mostly 
> shooting in the dark :D
> Nevertheless, I do not see this as BFD specific only. 
> 
>> 
>>> packet drop. In that case just the sequence # doesn't help and 
>>> timestamping is needed. Even if timestamping is to be done, realistically 
>>> it cannot happen the same way across multiple vendors, i.e. in terms of 
>>> where the timestamping should/could be done. For example: before it is 
>>> queued or after, or at the LC/RP/SP/process, etc.
>> 
>> This isn't a new problem. Each vendor timestamps 1588 packets differently. 
>> However, the aggregate solution works across multiple vendors.
>> 
>> While it may not solve all the issues, because of the vagaries of how each 
>> vendor does timestamping, it would certainly help in debugging a large 
>> number of BFD flaps.
> As the timestamp is not in the ID, we could defer that discussion for 
> later, but I definitely believe that where the TS is taken is a bigger 
> issue, if granularity and accuracy are important.
>> 
>>> 
>>> Secondly, if the congestion happens, the CIR/PIR should apply to data 
>>> packets too. In that case BFD flap is at least a good indicator of the 
>>> problem, isn't it? 
>> 
>> There is usually a separate CIR/PIR for different CPU-bound packets. So 
>> based on some parameters you might impose a different CIR/PIR for BFD 
>> than, say, ssh and radius packets. If BFD flaps, and you know you missed a 
>> few sequence numbers, and you see drops in that queue, then all of this 
>> could be correlated and you could fix the CPU queue parameters for BFD. I 
>> understand that this is a very implementation-specific issue, but then 
>> that's what I had said earlier -- such a mechanism can help in isolating 
>> implementation-specific issues as well.
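The correlation step described above could look roughly like this (an 
entirely hypothetical sketch; the function name and the decision rule are 
mine, not from the draft or any implementation):

```python
def correlate_flap(missed_seq_nums, bfd_queue_drop_delta):
    """Compare the sequence numbers missed around a BFD flap with the
    increase in the drop counter of the CPU queue that polices BFD
    (a per-protocol CIR/PIR queue). If the policer drops account for
    the misses, the flap likely came from local queue tuning, not
    from the network."""
    misses = len(missed_seq_nums)
    if misses > 0 and bfd_queue_drop_delta >= misses:
        return "suspect local BFD CPU-queue policer (tune CIR/PIR)"
    return "policer drops do not account for the misses; look elsewhere"

# three missed sequence numbers, three new drops on the BFD CPU queue
print(correlate_flap([101, 102, 103], 3))
```

This is exactly the kind of implementation-specific interpretation that, as 
Sam notes below, a draft would need to spell out for operators.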
> I agree that it will be helpful. But then, when a new mechanism is 
> introduced, it should clearly spell out how to interpret the problems and 
> how to deal with them. The ID has none of that, as of now.
>> 
>>> 
>>> Lastly, these improvements are a change to the existing BFD 
>>> model/protocol. 
>>> I do not see why it shouldn't be part of BFD v2 OR lead to v2 :D
>> 
>> Sure, that would make Marc very happy ! :-)
>> 
>> I am not sure we have enough momentum right now to propel us towards 
>> BFD v2.
>> 
>> OTOH, if the WG believes that this is an opportune time for us to start 
>> looking at BFDv2, then i would be more than willing to participate!
> Well, if there is a real issue where existing BFD falls short of the needs, 
> then interest automatically increases. As this ID introduces new things to 
> the existing version, it should be pursued as part of the next version, 
> rather than changing the existing model within the same version.
> 
> -sam
> 
>> 
>> Cheers, Manav 
>>> 
>>> -sam
>>>> 
>>>> Cheers, Manav
>>>> 
>>>>> - I see concerns regarding timestamps and sequence numbers expressed in 
>>>>> emails. In that case, the proposed model is still not going to identify 
>>>>> the problem completely. Am I reading it right?
>>>>> 
>>>>> -sam
>>>>> On Dec 4, 2014, at 7:47 AM, Gregory Mirsky wrote:
>>>>> 
>>>>>> Hi Jeff,
>>>>>> I can reference RFC 5357 here. The Appendix describes what is called 
>>>>> the TWAMP-Light mode with a Stateless Reflector. About a year and a half 
>>>>> ago an Erratum was accepted that describes a Stateful Reflector, which 
>>>>> supports measurement of one-way latency/jitter and packet-loss metrics.
>>>>>>
>>>>>>       Regards,
>>>>>>               Greg
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Rtg-bfd [mailto:rtg-bfd-bounces@ietf.org] On Behalf Of Jeffrey 
>>>>> Haas
>>>>>> Sent: Thursday, December 04, 2014 7:17 AM
>>>>>> To: Nobo Akiya (nobo)
>>>>>> Cc: rtg-bfd@ietf.org
>>>>>> Subject: Re: BFD stability follow-up from IETF-91
>>>>>>
>>>>>> On Thu, Dec 04, 2014 at 03:14:50PM +0000, Nobo Akiya (nobo) wrote:
>>>>>>> If what you say is the only requirement not met, one approach may be 
>>>>> to pursue a non-standard-track document describing some suggested 
>>>>> implementation techniques to locally store TX/RX timestamp.
>>>>>>>
>>>>>>> Given that echo approach will be less accurate and given that we 
>>>>> seem to be having difficulty converging, I thought I'll throw out 
>>>>> another idea.
>>>>>>
>>>>>> I think my biggest concern is that the echo approach has 
>>>>> bidirectional packet loss possibilities.  Async at least lets the 
>>>>> receiver know about unidirectional packet loss.
>>>>>>
>>>>>> Of course, if your goal is to notify the sender that their packets 
>>>>> are being lost, you need a backchannel anyway.  I just don't know if we 
>>>>> want that backchannel to be BFD.
>>>>>>
>>>>>> - Jeff
>>>>>>
>>>>> 
>>>> 
>>> 
>> 
>