Re: [Idr] TCP & BGP: Some don't send terminate BGP when holdtimer expired, because TCP recv window is 0

Jeffrey Haas <> Wed, 16 December 2020 21:36 UTC

Date: Wed, 16 Dec 2020 16:53:15 -0500
From: Jeffrey Haas <>
To: Job Snijders <>
Cc: Christoph Loibl <>, Robert Raszuk <>, "" <>

On Tue, Dec 15, 2020 at 09:54:01PM +0000, Job Snijders wrote:
> All BGP implementations tested so far (5 famous ones) appear vulnerable
> because they continue to consider the BGP session healthy & stable
> (meanwhile OutQ keeps growing endlessly and zero BGP messages go across
> the wire).

Sadly, this is a general problem of TCP implementations.  There were some
rather interesting intentional window congestion attacks going around a few
years ago.

> One network operator (with thousands of EBGP sessions in the DFZ)
> reported to me the above stalled-TCP scenario is *not* a common case on
> the Internet. On a normal day, a network operator will see no (zero)
> sessions stuck this way, which leads me to believe 'recvwind=0' ...
> *for the duration of the hold timer* is a very strong indicator for a
> really broken situation which should be attempted to automatically
> resolve.

I understand the intent of your text above.  However, with very short hold
timers (say 3-10 seconds), the hold time may be too short an interval for
this test.

It's my experience across three implementations that, in extremely scaled
scenarios, a BGP receiver may easily advertise a zero window for multiple
seconds.  Basically, if you don't manage to get CPU or other resources to
drain the socket a bit at your end of things, and you have enough sockets,
you don't get around to draining them fast enough.
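
To make the short-hold-timer concern concrete, here is a minimal sketch in
Python of a sender-side stall check that only declares a peer stuck once the
zero window has persisted for the longer of the hold time and a fixed floor.
It is not taken from any implementation: the name StallTracker, its fields,
and the 30-second floor are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StallTracker:
    """Tracks how long a peer has advertised a zero receive window.

    Hypothetical helper; all times are in seconds.
    """
    hold_time: float
    min_stall_seconds: float = 30.0  # assumed floor, not from any spec
    stalled_since: Optional[float] = None

    def observe(self, now: float, peer_window: int, outq_bytes: int) -> bool:
        """Record one observation; return True if the peer looks stuck."""
        if peer_window > 0 or outq_bytes == 0:
            # Peer is accepting data, or we have nothing queued: not stalled.
            self.stalled_since = None
            return False
        if self.stalled_since is None:
            self.stalled_since = now
        # With a 3-10 second hold timer the floor dominates, so a receiver
        # that is merely busy for a few seconds is not torn down.
        threshold = max(self.hold_time, self.min_stall_seconds)
        return (now - self.stalled_since) >= threshold
```

With hold_time=3.0, a peer that has been at zero window for 5 seconds is
still left alone; only at the 30-second floor does observe() return True.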

So, while crafting your attempted mitigations, remember that when a box's
CPU goes high enough to impact sessions (e.g. during reconfig), that may not
be the time you want to drop the session.  It's exactly these sorts of
situations that pushed implementations to perhaps ignore timers expiring
when they notice there's still data on the socket to read.
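
As a rough illustration of that receiver-side behavior, the Python sketch
below defers a hold timer expiry while unread data is still pending on the
socket.  The function name, the "defer"/"close" return values, and the cap
on deferrals are assumptions of mine; I'm not claiming any implementation
bounds the deferral this way.

```python
def on_hold_timer_expiry(unread_bytes: int, deferrals: int,
                         max_deferrals: int = 3) -> str:
    """Decide what to do when the hold timer fires (hypothetical policy).

    unread_bytes: data still waiting in the socket receive buffer.
    deferrals: how many times this expiry has already been deferred.
    max_deferrals: assumed cap so a busy box cannot defer forever.
    """
    if unread_bytes > 0 and deferrals < max_deferrals:
        # The peer did send something; we just haven't had the CPU to read
        # it yet.  Restart the timer instead of dropping the session.
        return "defer"
    # Nothing pending, or we have deferred too often: treat as expired.
    return "close"
```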

> I believe BGP implementations are not helping any known deployment
> scenarios by *not* disconnecting a stuck peer, however on the other we
> now know about various operational examples where honoring recvwind=0
> for (hours, days) longer than $holdtimer led to global scale problems.

FWIW, implementations that have peer groups getting hung up on slow peers
simply exacerbate the issue.  However, I think that scenario should be
treated separately.

-- Jeff