Re: BFD stability follow-up from IETF-91

"Sam K. Aldrin" <aldrin.ietf@gmail.com> Fri, 05 December 2014 03:52 UTC

To: Manav Bhatia <manavbhatia@gmail.com>
Archived-At: http://mailarchive.ietf.org/arch/msg/rtg-bfd/2Ox5PDF9JTZuM_R2eesrkkNNerg
Cc: "rtg-bfd@ietf.org" <rtg-bfd@ietf.org>

Hi Manav,

On Dec 4, 2014, at 6:51 PM, Manav Bhatia wrote:

> Hi Sam,
>> Ideally, even if there is some bit of congestion, I would like the BFD packet to get through.
> I understand the queuing problems but I am not clear how the ID is going to solve the problem, if there is only congestion and not
> 
> It would help.
> 
> Assume the last sequence number you saw before the flap was s1. You timed out because you did not see s2 and s3 before the timeout. Now further assume you know that you did receive s2 and s3, but they arrived after the BFD expiry interval. Then you know that there wasn't a drop, and that the packets arrived late because of some queuing issue. How you determine whether this delay was introduced at the TX or RX side is open to discussion.
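
A minimal sketch of the s1/s2/s3 reasoning above, in Python. It assumes each BFD packet carries a sequence number (as the draft proposes) and that the receiver records a local arrival time for every packet, including ones that show up after the session has already timed out. The `Arrival` type, the `classify_missed` function, and the example times are hypothetical and are not taken from the draft.

```python
from dataclasses import dataclass


@dataclass
class Arrival:
    seq: int        # sequence number carried in the packet (assumed field)
    rx_time: float  # local receive time, in seconds


def classify_missed(expected_seqs, arrivals, expiry_time):
    """Label each expected sequence number as 'on_time', 'late', or 'lost'.

    expiry_time is the local time at which the detection timer expired.
    """
    seen = {a.seq: a.rx_time for a in arrivals}
    result = {}
    for seq in expected_seqs:
        if seq not in seen:
            result[seq] = "lost"      # never arrived: a real drop
        elif seen[seq] > expiry_time:
            result[seq] = "late"      # arrived, but after the timeout
        else:
            result[seq] = "on_time"
    return result


# s1 was seen before the flap; the timer expired at t=0.9 s;
# s2 and s3 arrived afterwards, so neither was dropped.
arrivals = [Arrival(seq=2, rx_time=1.1), Arrival(seq=3, rx_time=1.2)]
print(classify_missed([2, 3], arrivals, expiry_time=0.9))
# {2: 'late', 3: 'late'}
```

Note that this only separates "late" from "lost"; it says nothing about whether the delay was introduced on the TX side, the RX side, or in the network.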


As the draft doesn't say what exactly one would/should do when packets are being throttled for various reasons, could you/the authors elaborate on this? I would like to see those tangible things detailed first.

I see the problem a little differently, though. BFD session flapping is an indicator of network behavior, be it device or network congestion or something else. Even if the packets arrive late, as you say, one cannot know where exactly the delay is happening.

> 
> Without a sequence number you have no idea whether the packet was dropped or whether it arrived/processed late. 
> 
> If by some out-of-band mechanism you can figure out that the TX was done on time and the delay was at the RX end, then it's an implementation issue on the RX side. If the TX was delayed, then it's an implementation issue on the TX side.
I don't think so. It could be a lot more than an implementation issue. It could be network congestion caused by bursty traffic, with nothing to do with the implementation.
> 
> This helps isolate the node that needs to fix the issue; otherwise we're only shooting in the dark. 
I think the issue you see could be actual network behavior and not a device issue. 
Debugging packet drops, latency, and misordering of packets is mostly shooting in the dark :D
Nevertheless, I do not see this as specific to BFD. 

> 
> packet drop. In that case the sequence number alone doesn't help and timestamping is needed. Even if timestamping is done, realistically it cannot happen the same way across multiple vendors, i.e. where the timestamping should/could be done. For example: before or after the packet is queued, or at the LC/RP/SP/process, etc.
> 
> This isn't a new problem. Each vendor timestamps 1588 packets differently. However, the aggregate solution works across multiple vendors.
> 
> While it may not solve all the issues because of the vagaries of how each vendor does timestamping, it would certainly help in debugging a large number of BFD flaps.
As the timestamp is not in the ID, we could defer that discussion until later, but I definitely believe that where the TS is taken is a bigger issue if granularity and accuracy are important.
> 
> 
> Secondly, if congestion happens, the CIR/PIR should apply to data packets too. In that case a BFD flap is at least a good indicator of the problem, isn't it? 
> 
> There is usually a separate CIR/PIR for different CPU-bound packets. So based on some parameters you might impose a different CIR/PIR for BFD than for, say, ssh and radius packets. If BFD flaps and you know you missed a few sequence numbers and you see drops in that queue, then all of this could be correlated and you could fix the CPU queue parameters for BFD. I understand that this is a very implementation-specific issue, but then that's what I had said earlier -- such a mechanism can help in isolating implementation-specific issues as well.
I agree that it will be helpful. But then, when a new mechanism is introduced, it should clearly spell out how to interpret the problems and how to deal with them. The ID has none of it, as of now.
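
To make the kind of interpretation rule being asked for concrete, here is a rough Python sketch of the correlation described in the quoted paragraph: missed sequence numbers combined with a drop counter on the CPU queue that carries BFD. The function name, its arguments, and the returned labels are all hypothetical; the ID defines none of them, and a real implementation would need its own platform-specific counters.

```python
def diagnose_flap(missed_seqs, queue_drops_before, queue_drops_after):
    """Heuristic only: correlate a BFD flap with local CPU-queue drops.

    missed_seqs        -- sequence numbers that never arrived (assumed input)
    queue_drops_before -- drop counter of the BFD CPU queue before the flap
    queue_drops_after  -- the same counter sampled after the flap
    """
    new_drops = queue_drops_after - queue_drops_before
    if missed_seqs and new_drops > 0:
        # Packets are missing and the local queue dropped something:
        # the CIR/PIR for the BFD queue is a candidate for tuning.
        return "local_queue_drop"
    if missed_seqs:
        # Packets are missing but the local queue shows no new drops.
        return "lost_elsewhere"
    # Nothing is missing, so the flap was caused by late packets.
    return "late_not_lost"


print(diagnose_flap([2, 3], queue_drops_before=10, queue_drops_after=12))
# local_queue_drop
```

Even under these assumptions the result is only a hint: drops in the same queue may belong to other sessions, and network congestion can produce the same symptoms.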
> 
> 
> Lastly, these improvements are a change from the existing BFD model/protocol. 
> I do not see why it shouldn't be part of BFD v2 OR lead to v2 :D
> 
> Sure, that would make Marc very happy ! :-)
> 
> I am not sure if we have enough momentum right now that can propel us towards BFD v2.
> 
> OTOH, if the WG believes that this is an opportune time for us to start looking at BFDv2, then I would be more than willing to participate!
Well, if there is a real issue where existing BFD falls short of the needs, then interest will automatically increase. As this ID introduces new things to the existing version, it should be pursued as part of the next version, rather than by changing the existing model within the same version.

-sam

> 
> Cheers, Manav 
> 
> -sam
>> 
>> Cheers, Manav
>> 
>> - I see concerns regarding timestamps and sequence numbers expressed in emails. In that case, the proposed model is still not going to identify the problem completely. Am I reading it right?
>> 
>> -sam
>> On Dec 4, 2014, at 7:47 AM, Gregory Mirsky wrote:
>> 
>> > Hi Jeff,
>> > I can reference RFC 5357 here. The Appendix describes what is called TWAMP-Light mode with a Stateless Reflector. About a year and a half ago an erratum was accepted that describes a Stateful Reflector, which supports measurement of one-way latency/jitter and packet loss metrics.
>> >
>> >       Regards,
>> >               Greg
>> >
>> > -----Original Message-----
>> > From: Rtg-bfd [mailto:rtg-bfd-bounces@ietf.org] On Behalf Of Jeffrey Haas
>> > Sent: Thursday, December 04, 2014 7:17 AM
>> > To: Nobo Akiya (nobo)
>> > Cc: rtg-bfd@ietf.org
>> > Subject: Re: BFD stability follow-up from IETF-91
>> >
>> > On Thu, Dec 04, 2014 at 03:14:50PM +0000, Nobo Akiya (nobo) wrote:
>> >> If what you say is the only requirement not met, one approach may be to pursue a non-standard-track document describing some suggested implementation techniques to locally store TX/RX timestamps.
>> >>
>> >> Given that the echo approach will be less accurate and given that we seem to be having difficulty converging, I thought I'll throw out another idea.
>> >
>> > I think my biggest concern is that the echo approach has bidirectional packet loss possibilities.  Async at least lets the receiver know about unidirectional packet loss.
>> >
>> > Of course, if your goal is to notify the sender that their packets are being lost, you need a backchannel anyway.  I just don't know if we want that back channel to be bfd.
>> >
>> > - Jeff
>> >
>> 
>> 
> 
>