Re: Tuning BFD session times

Greg Mirsky <gregimirsky@gmail.com> Tue, 03 April 2018 05:28 UTC

In-Reply-To: <8424435A-4521-45EA-807C-7D6EC4BF0D29@gmail.com>
References: <20180328184959.GB25442@pfrc.org> <BL0PR0102MB3345EC535EE558FC4CC692E6FAA70@BL0PR0102MB3345.prod.exchangelabs.com> <DB3PR03MB0969877EE63E4514F3975D6C9DA70@DB3PR03MB0969.eurprd03.prod.outlook.com> <BL0PR0102MB33453186C4DE4B6646FD67FEFAA70@BL0PR0102MB3345.prod.exchangelabs.com> <CA+RyBmXayE4Nb1F-xGYnqBnrzHo3GRCA5sbx+epr=ZB60Y6OaA@mail.gmail.com> <8424435A-4521-45EA-807C-7D6EC4BF0D29@gmail.com>
From: Greg Mirsky <gregimirsky@gmail.com>
Date: Mon, 02 Apr 2018 22:27:58 -0700
Message-ID: <CA+RyBmWcAFsuK3St7L+k_FP2mtsit1aooY-Usui-jELs=s1i-A@mail.gmail.com>
Subject: Re: Tuning BFD session times
To: Mahesh Jethanandani <mjethanandani@gmail.com>
Cc: Ashesh Mishra <mishra.ashesh@outlook.com>, "rtg-bfd@ietf.org" <rtg-bfd@ietf.org>
Content-Type: multipart/alternative; boundary="0000000000002d3fed0568eaf861"
Archived-At: <https://mailarchive.ietf.org/arch/msg/rtg-bfd/9mtGBlKixqIk3pzhLBVzZ9E_vGk>

Hi Mahesh,
thank you for the additional information; it helps me understand the use case.
Please consider my follow-up notes, inlined and tagged GIM>>.

Regards,
Greg

On Mon, Apr 2, 2018 at 1:45 PM, Mahesh Jethanandani
<mjethanandani@gmail.com> wrote:

> Greg,
>
> An ICMP ping run from the control plane may or may not follow the same
> path followed by BFD in the data plane.
>
GIM>> Does that mean that this case is not only for single-hop BFD but
also includes scenarios with multi-hop BFD?

> Plus, the processing of an ICMP ping on most systems is very different
> from the way BFD is processed.
>
GIM>> I think it is very important to specify where in the processing
sequence the timestamp values are taken. For example, for DM in CFM, it is
suggested to take the egress timestamps (T1 and T3) when the first octet of
the frame is transmitted and, similarly, the ingress timestamps (T2 and T4)
when the first octet of the frame is received. The goal of this
recommendation is to exclude the queuing delays and jitter introduced by
the MEPs so as to measure, as consistently as possible, the latency and
jitter of the path. Hence the question: when are timestamps to be taken for
draft-am-bfd-performance?
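
For readers less familiar with CFM DM, here is a minimal sketch of how a
two-way delay is usually derived from the four timestamps discussed above
(egress T1 and T3, ingress T2 and T4). The class and function names are
hypothetical and the sample values are made up for illustration; nothing
here is taken from draft-am-bfd-performance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DmTimestamps:
    """Four delay-measurement timestamps, in seconds (hypothetical container)."""
    t1: float  # initiator egress: first octet of the request transmitted
    t2: float  # responder ingress: first octet of the request received
    t3: float  # responder egress: first octet of the reply transmitted
    t4: float  # initiator ingress: first octet of the reply received


def two_way_delay(ts: DmTimestamps) -> float:
    """Round-trip path delay with the responder's residence time removed.

    (t4 - t1) uses only the initiator's clock and (t3 - t2) only the
    responder's clock, so the two clocks do not need to be synchronized.
    """
    return (ts.t4 - ts.t1) - (ts.t3 - ts.t2)


if __name__ == "__main__":
    # Made-up sample: 2 ms spent inside the responder.
    sample = DmTimestamps(t1=0.000, t2=0.065, t3=0.067, t4=0.132)
    print(f"two-way delay: {two_way_delay(sample) * 1000:.1f} ms")
```

The result is only as clean as the timestamping points: if T1..T4 are taken
in the control plane rather than at the first octet on the wire, host
queuing and scheduling jitter end up inside the measured value.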

> Plus, what better way than to have what you are trying to measure give
> you the measurement?
>
GIM>> I don't think it is advisable to add time measurement to each and
every protocol that may need to characterize its performance and/or tune
the interval of its periodic messages. I believe that active measurement
methods allow reasonably accurate measurements that can be properly
interpreted based on representative statistics of test sessions.
Alternatively, hybrid measurement methods, e.g. Alternate Marking, may be
used.

>
> Cheers.
>
>
> On Apr 2, 2018, at 9:18 AM, Greg Mirsky <gregimirsky@gmail.com> wrote:
>
> Hi Ashesh,
> thank you for the very detailed explanation of the scenarios that
> motivated this work. A couple of questions to help me better understand
> the use cases:
>
>    - If I understand the first scenario correctly, you propose to use BFD
>    to measure the propagation delay, as it is the main component that
>    influences BFD interval selection. Do you consider the available
>    on-demand tools, e.g. ICMP ping, inadequate to monitor RTT? What
>    functionality is absent?
>    - For case #2, the same question as above, as I believe that ICMP
>    ping can be directed over the specific egress interface quite easily.
>
> Regards,
> Greg
>
> On Sun, Apr 1, 2018 at 8:17 PM, Ashesh Mishra <mishra.ashesh@outlook.com>
> wrote:
>
>> Hi Sasha,
>>
>> There are two scenarios here and they depend on whether the satellite is
>> in geo-stationary orbit (GEO) or non-geo-stationary orbit (NGSO).
>>
>> Scenario-1: Non-Geostationary Satellites: This is the scenario that you
>> described. Satellites in Medium Earth Orbit (MEO) or Low Earth Orbit (LEO)
>> move relative to the earth and hence, their distance from the ground
>> terminals varies as they pass over a given location. This results in
>> varying RTT (sometimes by as much as 30ms). The issue in this scenario is
>> not necessarily that the BFD detect interval must change frequently but
>> that it's difficult to accurately select the intervals as the RTT depends
>> on the location of the terminal and the gateway (and this gets quite
>> complex). If the session can automatically decide the interval, then the
>> complexity in starting a new service is reduced. Another complicating
>> factor is when the terminal moves (ship or aircraft, for example) as this
>> increases the variance of the RTT. We typically set the intervals to a high
>> enough level but that affects the performance. We see the same varying RTT
>> in GEO when the terminal is mobile, but the change is a much smaller
>> percentage of the overall RTT of GEO (because GEO satellites are much
>> farther from the earth, at ~36,000 km vs. MEO at ~8,000 km and LEO at
>> ~200-1,000 km).
>>
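As a rough cross-check of the altitudes quoted above, the sketch below
estimates the minimum propagation RTT for a bent-pipe satellite path. It
assumes the satellite is directly overhead both the terminal and the
gateway, so that a round trip crosses the altitude four times, and it
ignores processing, queuing and slant range. The function name min_rtt_ms
and these simplifications are illustrative assumptions, not part of the
thread.

```python
SPEED_OF_LIGHT_KM_S = 299_792.458  # speed of light in vacuum, km/s


def min_rtt_ms(altitude_km: float) -> float:
    """Lower bound on RTT in milliseconds for a bent-pipe satellite hop.

    Assumption: satellite directly overhead both ground stations, so one
    round trip is four traversals of the altitude (up, down, up, down).
    """
    return 4.0 * altitude_km / SPEED_OF_LIGHT_KM_S * 1000.0


if __name__ == "__main__":
    # Altitudes as quoted in the message above.
    orbits = {
        "GEO (~36,000 km)": 36_000.0,
        "MEO (~8,000 km)": 8_000.0,
        "LEO (~1,000 km)": 1_000.0,
        "LEO (~200 km)": 200.0,
    }
    for name, altitude in orbits.items():
        print(f"{name}: at least {min_rtt_ms(altitude):.1f} ms")
```

Under these assumptions GEO comes out near 480 ms and the ~8,000 km MEO
near 107 ms. Both sit below the 600 ms and 130 ms RTTs quoted elsewhere in
this thread, as expected for a lower bound.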
>> Scenario-2: Low-latency link backed up by high-latency link: In this case
>> a GEO satellite backs up NGSO-based connection or fiber (or other
>> terrestrial wired/wireless WAN options). The end-to-end service then has
>> very different RTT when the primary is active versus when the backup is
>> active. The typical solution is to base timers on the backup RTT, which is
>> very inefficient.
>>
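To make the cost of basing timers on the backup path concrete, here is a
minimal sketch of the asynchronous-mode Detection Time as defined in
RFC 5880, Section 6.8.4: the Detect Mult received from the remote system
multiplied by the agreed transmit interval of the remote system (the
greater of the local Required Min RX Interval and the remote Desired Min
TX Interval). The function name and the interval values (50 ms for the
primary, 300 ms for the backup) are hypothetical and chosen only for
illustration; they are not from the thread.

```python
def detection_time_us(remote_detect_mult: int,
                      local_required_min_rx_us: int,
                      remote_desired_min_tx_us: int) -> int:
    """Asynchronous-mode Detection Time in microseconds (RFC 5880, 6.8.4)."""
    # The remote system transmits no faster than the slower of what it
    # wants to send and what the local system is willing to receive.
    agreed_interval_us = max(local_required_min_rx_us, remote_desired_min_tx_us)
    return remote_detect_mult * agreed_interval_us


if __name__ == "__main__":
    # Hypothetical values: timers tuned for the primary vs. for the backup.
    tuned_for_primary = detection_time_us(3, 50_000, 50_000)
    tuned_for_backup = detection_time_us(3, 300_000, 300_000)
    print(f"primary-tuned detection time: {tuned_for_primary / 1000:.0f} ms")
    print(f"backup-tuned detection time:  {tuned_for_backup / 1000:.0f} ms")
```

With these made-up numbers, a session configured once for the backup
path detects a failure of the primary six times more slowly than a
session tuned for the primary would.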
>> Regards,
>> Ashesh
>> ------------------------------
>> *From:* Alexander Vainshtein <Alexander.Vainshtein@ecitele.com>
>> *Sent:* Sunday, April 1, 2018 9:28:43 AM
>> *To:* Ashesh Mishra
>> *Cc:* Jeffrey Haas; rtg-bfd@ietf.org
>> *Subject:* RE: Tuning BFD session times
>>
>>
>> Ashesh,
>>
>> I would like to understand better the use case with satellite links that
>> you have described.
>>
>> In particular, can you please explain why long RTT affects the BFD
>> detection times?
>>
>> As I see it, what could really affect these times is variable delay
>> introduced in some cases by the satellite links since the distance between
>> the satellite and the terrestrial antennae may change significantly with
>> time.
>>
>>
>> What did I miss?
>>
>>
>>
>> Regards,
>>
>> Sasha
>>
>>
>> Office: +972-39266302
>>
>> Cell:   +972-549266302
>>
>> Email:  Alexander.Vainshtein@ecitele.com
>>
>>
>> *From:* Rtg-bfd [mailto:rtg-bfd-bounces@ietf.org] *On Behalf Of *Ashesh
>> Mishra
>> *Sent:* Sunday, April 1, 2018 5:54 PM
>> *To:* Jeffrey Haas <jhaas@pfrc.org>; rtg-bfd@ietf.org
>> *Subject:* Re: Tuning BFD session times
>>
>>
>> Jeff, thanks for kicking-off this discussion on the list!
>>
>>
>> One additional comment that I wanted to make was around automation. There
>> were questions during the meeting around the need for auto-tuning and that
>> the process of determining the interval can/should be manual.
>>
>>
>> The automation of control in all aspects of dynamic behavior is a
>> priority for network operators. When configuring manually, parameters such
>> as BFD intervals are typically set at very conservative values because
>> human latency is very high when responding to changing network conditions.
>> Manual configuration also takes a lot of time and accounts for a
>> significant number of lost opportunities and value for operators.
>>
>>
>> *[JH] "applications should generally choose a detection interval that
>> is reasonable for their application rather than try to dynamically discover
>> the lowest possible stable detection interval."*
>>
>> [AM] This depends on the use-case. From the point-of-view of a service
>> provider that delivers long-haul connectivity (a typical scenario in which
>> the link characteristics have large variance), the intent is to provide
>> the best performance. As such providers deliver connectivity to critical
>> applications, and are often the only way of delivering connectivity in such
>> places, the ability to tune the system to deliver superior up-time drives
>> significant value. Consider a scenario where there is a 130ms RTT link
>> (MEO satellite; LEO will be in the 20-60ms range) whose backup is a 600ms
>> RTT link (GEO satellite), and both are being used to deliver transit
>> connectivity. The rate at which the end-to-end service can run BFD is
>> significantly faster when MEO is active than when GEO is active. The
>> application, in this scenario, may survive the RTT, but business
>> continuity is critical in many cases. Since the provider of long-haul
>> connectivity cannot control the application, it must provide the best
>> possible failover performance.
>>
>>
>> *[JH] "1. BFD is asymmetric.  This means a receiving BFD implementation
>> must provide feedback to a sending implementation in order for it to
>> understand perceived reliability."*
>>
>> [AM] May not need to be the BFD implementation providing the feedback if
>> there are other performance mechanisms running. The challenge is to
>> standardize the mechanism that BFD can use (if the measurement is not
>> self-contained in BFD). You're right in pointing out the challenge in
>> accounting for the CPU delays and that was the reason for the original
>> proposal for BFD performance measurement. If the measurement is within the
>> BFD realm, it will account for the CPU delays. However, most good BFD
>> engines have relatively deterministic performance and are quite optimized
>> so the variance with scale and time is not significant (but I concede that
>> not all BFD implementations are good).
>>
>>
>> *[JH] "2. Measurement infrastructure may negatively impact session
>> scale.  Greg, I believe, made this point when discussing host processing
>> issues vs. BFD ingress/egress."*
>>
>> [AM] This is an issue if using a measurement mechanism within BFD (other
>> performance measurement methods are always running in network for SLA
>> reporting and/or network optimization). Within a metro-area with fiber or
>> terrestrial wireless (microwave, LTE, etc.) connectivity, I would likely
>> not need constant auto-tuning. The variance between the primary and backup
>> links in such a network will not be significant enough to affect the BFD
>> parameters. On long-haul links, this may be a valuable feature, in which
>> case the additional overhead may be justified. So it depends on the
>> use-case whether continuous auto-tuning is required or one-time tuning
>> is enough.
>>
>>
>> *[JH] "3. Detection interval calculations really need to take into
>> account things that are greater than simple packet transmission times.  As
>> an example, if your measurement is always taken during low system CPU or
>> network activity, how high is your confidence about the interval?  What
>> about scaling vs. number of total BFD sessions?"*
>>
>> [AM] Great questions. Typically when running BFD or CFM (or similar) high
>> frequency OAM, CPU peaks should not affect the OAM performance (a variety
>> of methods, based on the system on which OAM is running, can ensure that).
>> CPU peaks become a bigger issue if BFD is used to detect continuity for a
>> particular flow (or QoS).
>>
>>
>> --
>>
>> Ashesh
>> ------------------------------
>>
>> *From:* Rtg-bfd <rtg-bfd-bounces@ietf.org> on behalf of Jeffrey Haas
>> <jhaas@pfrc.org>
>>
>> *Sent:* Wednesday, March 28, 2018 11:49 AM
>> *To:* rtg-bfd@ietf.org
>> *Subject:* Tuning BFD session times
>>
>>
>>
>> Working Group,
>>
>> We had very active discussion (yay!) at the microphone as part of Mahesh's
>> presentation on BFD Performance Measurement.
>> (draft-am-bfd-performance)
>>
>> I wanted to start this thread to discuss the greater underlying issues
>> this discussion raised.  In particular, active tuning of BFD session
>> parameters.  Please note that opinions I state here are as an individual
>> contributor.
>>
>> BFD clients typically want the fastest, most stable detection interval
>> that is appropriate to their application.  That stability component is
>> very important since too aggressive of timers can result in unnecessary
>> BFD session instability which will impact the subscribing application.
>> Such stability is a function of many things, scale of the system running
>> BFD being a major one.
>>
>> In my opinion, applications should generally choose a detection interval
>> that is reasonable for their application rather than try to dynamically
>> discover the lowest possible stable detection interval.  This is because a
>> number of unstable factors, such as CPU load, contention with other
>> network traffic and other things that are outside the general control of
>> many systems may impact such scale.
>>
>> That said, here's a few thoughts on active feedback mechanisms:
>> 1. BFD is asymmetric.  This means a receiving BFD implementation must
>>    provide feedback to a sending implementation in order for it to
>>    understand perceived reliability.
>> 2. Measurement infrastructure may negatively impact session scale.  Greg,
>>    I believe, made this point when discussing host processing issues vs.
>>    BFD ingress/egress.
>> 3. Detection interval calculations really need to take into account things
>>    that are greater than simple packet transmission times.  As an example,
>>    if your measurement is always taken during low system CPU or network
>>    activity, how high is your confidence about the interval?  What about
>>    scaling vs. number of total BFD sessions?
>>
>> I have no strong conclusions here, just some cautionary thoughts.
>>
>> What are yours?
>>
>> -- Jeff
>>
>>
>
>
> Mahesh Jethanandani
> mjethanandani@gmail.com
>
>