Re: [Idr] BGP Auto-Discovery Protocol State Requirements

Jeffrey Haas <jhaas@pfrc.org> Tue, 23 March 2021 19:18 UTC

Date: Tue, 23 Mar 2021 15:40:22 -0400
From: Jeffrey Haas <jhaas@pfrc.org>
To: heasley <heas@shrubbery.net>
Cc: "idr@ietf.org" <idr@ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/idr/Ir3JprMvycoqtL_3cVxHHJEsbd0>

On Tue, Mar 23, 2021 at 06:13:16PM +0000, heasley wrote:
> Tue, Mar 23, 2021 at 11:08:37AM -0400, Jeffrey Haas:
> > If a discovered peering session is unacceptable, why would you keep running
> > the BGP state machine over and over if it's not going to improve?
> > 
> > The options we have upon discovering a peer that isn't acceptable are:
> > 1. Quiesce the discovered session and require a manual clearing event.
> > 2. Keep the instantiated session running even if it never connects.
> >    Analogous to explicit configuration of a misconfigured session.
> > 3. The discovery protocol itself tells us that the situation has changed.
> >    This gives the implementation the option to quiesce without requiring a
> >    manual clearing event.
> 
> maybe 4. a RFD-like auto-discovered peer connection suppression?

The RFC 4271 machinery technically already has this sort of thing.  The two
interacting pieces in the RFC are the ConnectRetryTimer and the optional
DampPeerOscillations feature.

What is partially problematic about the "rely on the existing BGP FSM"
approach is that many people aren't thinking through those timer
considerations:

- The recommended value for ConnectRetryTimer is 120s.
- Oscillation damping is intentionally under-specified.  And it's totally
  optional.
- Many implementations (including the one I code for) do not use 120s as
  the initial value.

Operators usually want their sessions to come up promptly.  In a data center
fabric context, how long after discovery are you willing to wait before it
comes up?

What this likely will mean for people's implementations is:
- The initial ManualStart event for the session (since this is coming from
  outside of BGP timer infra) will be shortly after discovery.
- The ConnectRetryTimer is likely to be short.  Oscillation damping is also
  likely to be short, even if it's on a decay cycle that takes it to longer
  values after a small number of rounds.
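
To make the timer trade-off concrete, here is a rough sketch comparing a
fixed 120s retry against a short initial retry that decays toward 120s.
The 1s initial value, the doubling factor, and the function names are
assumptions for illustration only; RFC 4271 does not specify a decay
schedule, and real implementations will differ.

```python
from itertools import accumulate

# Suggested default ConnectRetryTime from RFC 4271, in seconds.
RFC4271_SUGGESTED_CONNECT_RETRY = 120.0


def retry_schedule(initial, factor, cap, rounds):
    """Return the wait in seconds before each of `rounds` successive retries.

    Hypothetical schedule: start at `initial`, multiply by `factor` after
    every round, and never exceed `cap`.
    """
    delays = []
    delay = initial
    for _ in range(rounds):
        delays.append(min(delay, cap))
        delay = min(delay * factor, cap)
    return delays


def elapsed_at_each_retry(delays):
    """Cumulative seconds since the first attempt at each retry."""
    return list(accumulate(delays))


# Fixed 120s between attempts, as a by-the-book configuration might do.
fixed = retry_schedule(RFC4271_SUGGESTED_CONNECT_RETRY, 1.0,
                       RFC4271_SUGGESTED_CONNECT_RETRY, 4)

# Assumed "prompt" schedule: 1s initially, doubling up to the 120s cap.
decaying = retry_schedule(1.0, 2.0, RFC4271_SUGGESTED_CONNECT_RETRY, 9)

print(elapsed_at_each_retry(fixed))
print(elapsed_at_each_retry(decaying))
```

Under these assumed numbers, the decaying schedule makes seven attempts in
roughly the time the fixed schedule makes one, which is the "hammering"
side of the same trade-off.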

What this can mean is that the BGP Speaker sending out discovery messages
can expect to be hammered very hard with connection requests.  The listen
socket for a centrally dispatched listen mechanism will likely overload.
That will either push TCP into backpressure behavior or, in aggressive
implementations, cause the socket to simply be closed, which triggers the
connecting BGP FSM to back off.
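
As a back-of-the-envelope illustration of that overload, the sketch below
models a single listen queue fed by many discovered peers retrying at a
steady interval.  Every number and name in it (peer count, accept rate,
backlog size, simulate_listen_queue) is a made-up assumption, and the model
ignores TCP retransmission and FSM backoff entirely; it only shows how the
retry interval drives the arrival rate.

```python
def simulate_listen_queue(peers, retry_interval_s, accepts_per_s,
                          backlog, seconds):
    """Return (accepted, dropped) connection attempts after `seconds`.

    Assumes attempts from `peers` are spread evenly, each peer retrying
    every `retry_interval_s` seconds, and that the listener drains at most
    `accepts_per_s` attempts per second from a queue of size `backlog`.
    """
    arrivals_per_s = peers / retry_interval_s
    queue = 0.0
    accepted = 0.0
    dropped = 0.0
    for _ in range(seconds):
        queue += arrivals_per_s
        if queue > backlog:
            # Anything beyond the backlog is refused or silently dropped.
            dropped += queue - backlog
            queue = float(backlog)
        served = min(queue, accepts_per_s)
        queue -= served
        accepted += served
    return accepted, dropped


# 512 peers retrying every second against a listener that accepts 100/s.
aggressive = simulate_listen_queue(512, 1, 100, 128, 10)

# The same peers using a 120s retry interval.
conservative = simulate_listen_queue(512, 120, 100, 128, 10)

print("1s retry   (accepted, dropped):", aggressive)
print("120s retry (accepted, dropped):", conservative)
```

With these assumed values the 1s retry case drops several attempts for
every one it accepts, while the 120s case drops none but brings sessions up
far more slowly.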

Are people going to be happy when their fabric connection takes a few
minutes to Establish?  Probably not.  They'll tweak the timers.

And that's on top of the timers for the auto-discovery mechanism itself and
its reliability mechanisms.  See Section 3.2 in the auto-discovery
considerations draft.

> expect each will have proponents.

It's engineering.  After you've dealt with correctness, you have choices and
you have consequences.

-- Jeff