Re: [Idr] draft-spaghetti-idr-bgp-sendholdtimer - Feedback requested

Jeffrey Haas <> Fri, 23 April 2021 21:00 UTC

On Tue, Apr 20, 2021 at 02:47:35PM +0100, Ben Cox wrote:
> We submitted our first draft today (
> ) and we are looking for feedback knowing that it is not complete but
> is likely in a state for some discussion.
> Since this draft tweaks the BGP FSM, we would like to make sure that
> it's done correctly and so we are soliciting help from this working
> group.

As someone who spent a lot of time working with the editors of what became
RFC 4271, you're going to find there's a lot of corner cases in the FSM to
deal with.  :-)  We can work on the text once it's clearer what we should

A number of stream-of-thought comments about the topic covered by the draft:

I don't object quite as much as Jakob does to the claim that implementations
exhibiting a stuck session contribute to stale routes and blackholes in the
Internet.  A simple truth of BGP as a
hop-by-hop protocol is that if you can't distribute your updates, the
downstreams are operating on an incorrect version of reality.  

That said, the problem statement isn't really crisp enough that I'd want to
make such a large claim.

TCP applications will exhibit zero-windowing naturally over the course of a
session's lifetime.  This typically will occur when there's some level of
backpressure on the application due to I/O issues or even CPU.  While these
situations aren't desirable since they signal congestion in the application,
congestion SHALL happen.[1]

Most of the headaches I'd discuss for the problem this draft attempts to
solve revolve around the consequences of this statement in section 2 of the
draft:

:   Generally BGP implementation have no visibility into lower-layer
:   subsystems such as TCP or the peer's current Receive Window.
:   Therefor this document banks on BGP implementations being able to
:   detect an inability to push more data to the remote peer, at which
:   point the SendHoldTimer starts.

It is quite true that it is difficult to get a sense of what's going on when
you have simple socket (for example) backpressure.  What I believe an
application really wants to know for some sort of send-hold-timer is the
following:

- A session's TCP window has reached zero availability at a given sequence
  number.
- It stays at that sequence number with zero availability for "too long".
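
As a rough illustration of those two conditions, here is a minimal sketch in
Python.  The names (SendHoldTracker, observe, send_hold_time) are
hypothetical and do not come from the draft.  It assumes the application can
somehow sample the acknowledged sequence number and the peer's advertised
window, which is exactly the visibility most implementations lack.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SendHoldTracker:
    """Hypothetical tracker: fires when the peer's window sits at zero
    on the same sequence number for at least send_hold_time seconds."""
    send_hold_time: float                 # seconds; the value is an assumption
    _stuck_seq: Optional[int] = None      # sequence number where we stalled
    _stuck_since: Optional[float] = None  # time the stall was first seen

    def observe(self, now: float, acked_seq: int, window: int) -> bool:
        """Record one sample; return True if the hold condition is met."""
        if window > 0:
            # The window opened, so there is no stall to time.
            self._stuck_seq = None
            self._stuck_since = None
            return False
        if self._stuck_seq != acked_seq:
            # Zero window, but at a new sequence number: restart the clock.
            self._stuck_seq = acked_seq
            self._stuck_since = now
            return False
        # Zero window at the same sequence number: check elapsed time.
        return now - self._stuck_since >= self.send_hold_time
```

The caller supplies the clock value, so the policy ("too long") stays
separate from however the samples are obtained.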

Typical POSIX socket APIs won't give you this kind of certitude.  The
condition, as written in the draft, effectively manifests as a write() to
the socket that fails.  But why does it fail?
- Is it because there is no space in the socket to accept data?
- Is it because there are no local resources (e.g. mbufs) to push the next
  bit of data?
- Is it because a kernel timer hasn't triggered a bit of TCP behavior to try
  to move some data?
- Is it because we had blocked and we're waiting for select/kevent to fire
  and let us know that there's space - and it simply hasn't bothered even
  though there's space in the window?
- Is it because a low-watermark feature is preventing the socket from being
  ready even though the window has advanced?
- Is it because the session is draining VERY.... SLOWLY... and the kernel
  won't wake us up even though the window is no longer zero - just not big
  enough? (A version of the prior issue.)
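
To show how little the application learns in that situation, here is a
small Python sketch.  It uses a local socket pair whose reader never reads
as a stand-in for a stalled peer; this is an assumption made for
illustration, not a real TCP session, and fill_until_blocked is a
hypothetical helper name.

```python
import errno
import socket


def fill_until_blocked(sock: socket.socket, chunk: bytes = b"\x00" * 4096):
    """Write to a non-blocking socket until the kernel refuses more data.

    Returns (bytes_accepted, errno_value).
    """
    sock.setblocking(False)
    accepted = 0
    while True:
        try:
            accepted += sock.send(chunk)
        except BlockingIOError as exc:
            # All we get back is an errno; none of the causes listed
            # above can be told apart from this alone.
            return accepted, exc.errno


# The reader end never reads, so the writer's buffer eventually fills.
writer, reader = socket.socketpair()
accepted, err = fill_until_blocked(writer)
print(accepted, errno.errorcode[err])
writer.close()
reader.close()
```

The only signal is "would block" (EAGAIN or EWOULDBLOCK, depending on the
platform).  Whether the cause is a zero window, local buffer exhaustion, or
a slow drain is not visible at this layer.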

Some things that can cause zero windowing may be worth keeping in mind to
decide how bad they can be and what an implementation may want to do about
them:

- Perhaps it's doing a live-upgrade and needs to wedge the incoming session
  to do so? (ISSU, e.g.)
- It might be configured with incoming session prioritization, and your
  socket isn't important enough by policy.  (Corollary: Not all routes are
  created equal.)
- It may be behind in servicing the socket. For example, too much work,
  perhaps due to a reconfiguration event that will resolve at some point.
- A firewall may be interfering with TCP and the receiver is actually fine.

Finally, I want to highlight something rather critical:  
Just because you reset your session, in some of these pathologies the router
you're resetting your session with may not withdraw your routes downstream
fast enough.  

This is the stale situation highlighted in the draft.  The distinction is
that sometimes thrashing makes it worse.

If you're going to disconnect from such a peer, you had better have enough
routing diversity to survive disconnecting from it.  You had also better be
prepared to keep the session down for some time.  

Keeping the session down is not addressed in the draft.

-- Jeff

[1] When I give presentations discussing BGP, I tend to note that BGP is for
the most part a simple protocol with a lot of optional features... and the
hard part is making it scale.