Re: [Idr] TCP & BGP: Some don't send terminate BGP when holdtimer expired, because TCP recv window is 0

Job Snijders <> Wed, 16 December 2020 16:30 UTC

Date: Wed, 16 Dec 2020 16:29:47 +0000
From: Job Snijders <>
To: John Scudder <>
Cc: Christoph Loibl <>, "" <>, Robert Raszuk <>

Dear all,

A follow-up with some PCAP data to illustrate the issue.

On Tue, Dec 15, 2020 at 08:00:06PM +0000, John Scudder wrote:
>   rtr-A                   rtr-B
>   (congested c-p)         (uncongested c-p)
>   send window: >0         send window: 0
>   recv window: 0          recv window: >0
> In this case we expect:
>  a) rtr-B does not send any BGP packet (KEEPALIVE/UPDATE/NOTIFICATION)
> to rtr-A in normal operating circumstances.
>  b) rtr-A does not expect any KEEPALIVE/UPDATE packets from rtr-B. The
> session remains established even if no packet is received in the
> holdtime.
>  c) rtr-A continues to send KEEPALIVE packets to rtr-B.

A PCAP showing the above scenario is available here:

    webhosted decoded:
    original .pcap file:

rtr-A ==, rtr-B ==
The PCAP was captured on a wiretap applied between these BGP nodes.

rtr-B is trying to send a full routing table (~ 800,000 routes) to
rtr-A, rtr-A in turn has only a few (stale) downstream routes.

Things are normal during the first 22 seconds (packets 1-149). At packet
150 it becomes clear rtr-B is not able to send UPDATEs (or KEEPALIVEs or
WITHDRAWs) to rtr-A. This situation persisted for ~ 46 minutes, at which
point I manually killed the session.

During those 46 minutes (wall clock) the OutQ on rtr-B climbed to
thousands. Even if rtr-B has decent OutQ deduplication (which could
somewhat hide the detrimental effects of this situation), it is clear
from the packet capture that rtr-B is not able to follow the spirit and
intent of the BGP Hold Timers: BGP communication is completely and fully
stalled in one direction. All of rtr-A's BGP messages are TCP ACKed, but
zero progress is made sending anything from rtr-B to rtr-A.
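
As a rough illustration of how such a stall could be measured from a
capture, here is a minimal Python sketch. It assumes the capture has
already been reduced to (timestamp, sender, advertised TCP window)
tuples; that record layout, the function name and the sample values are
hypothetical and not taken from the PCAP above.

```python
def zero_window_duration(packets, receiver):
    """Seconds covered by the trailing run of zero-window segments
    sent by `receiver`, or 0.0 if its last segment had an open window.

    `packets` is an iterable of (timestamp, sender, window) tuples in
    capture order. This layout is an assumption made for this sketch.
    """
    run_start = None   # timestamp of the first segment in the current zero-window run
    last_ts = None     # timestamp of the last segment seen from `receiver`
    for ts, src, window in packets:
        if src != receiver:
            continue
        last_ts = ts
        if window == 0:
            if run_start is None:
                run_start = ts
        else:
            # The receiver reopened its window, so the run is over.
            run_start = None
    return 0.0 if run_start is None else last_ts - run_start


# Hypothetical sample: rtr-A closes its window at t=22s and never reopens it.
sample = [
    (0.0, "rtr-A", 65535),
    (22.0, "rtr-A", 0),
    (50.0, "rtr-B", 16384),
    (2782.0, "rtr-A", 0),
]
stalled = zero_window_duration(sample, "rtr-A")
```

Comparing the returned duration against the negotiated hold time gives a
simple on-the-wire check for whether one direction has been stalled for
longer than the hold timer would normally tolerate.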

About the OpenBGPD example solution: OpenBGPD ('bgpd') is an integral
part of the OpenBSD Network Operating System. The OpenBSD developers are
responsible for bgpd, userland, the kernel, ssh, TCP subsystem, NIC
drivers, all of it. Conceptually such a 'complete fullstack
implementation' is no different than its more impressive siblings Junos
(a complete operating system with an embedded BGP implementation), IOS
XR, or SR-OS. Customers are interested in the complete product.

I think vendors are expected to have full ownership of their entire
product: customers most likely won't care that, inside the router
chassis, the BGP daemon(s), kernel(s), hypervisor, filesystem, NIC
drivers, etc. are separate components. What matters is what actually
happens on the wire between individual BGP nodes.

I think Takt offered a valuable insight:

    "Usually there is some tx buffering going on. Just because write()
    was successful doesn't mean a message actually arrived on the other
    hand. But if write() blocked for an Holdtimer interval it is sure
    there is an issue."
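
One way to read that observation in code is a timer that only runs
while outbound data is pending and no bytes can be written. The Python
sketch below is an illustration under stated assumptions, not how
OpenBGPD or any other implementation does it: the class name, the
callback and the choice to reuse the hold time as the limit are all
hypothetical.

```python
import time


class SendHoldTimer:
    """Tracks how long queued outbound data has made no progress."""

    def __init__(self, hold_time, clock=time.monotonic):
        self.hold_time = hold_time   # seconds; assumption: reuse the negotiated hold time
        self.clock = clock           # injectable so the timer can be tested
        self.stalled_since = None    # None means nothing is waiting to be sent

    def on_write_result(self, bytes_written, bytes_pending):
        """Call after every non-blocking write attempt on the session socket."""
        if bytes_pending == 0:
            # Queue fully drained: nothing can be stuck.
            self.stalled_since = None
        elif bytes_written > 0:
            # Some progress was made, restart the countdown.
            self.stalled_since = self.clock()
        elif self.stalled_since is None:
            # First attempt that wrote nothing while data is queued.
            self.stalled_since = self.clock()

    def expired(self):
        """True once no bytes have left the queue for a full hold time."""
        return (self.stalled_since is not None
                and self.clock() - self.stalled_since >= self.hold_time)
```

An event loop would call on_write_result() after each write attempt and
tear the session down once expired() returns True. Since the peer's
receive window is closed in this scenario, a NOTIFICATION queued at that
point may never be delivered, so closing the TCP connection is what the
peer would actually observe.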

It's up to each implementation/vendor how to pull up state from lower
layers into their BGP engine. In the case of OpenBGPD we are fortunate
to have some infrastructure in place to accommodate this type of
improvement. Other implementations might have to come up with different
solutions depending on how they are designed to handle IO or buffer BGP
messages.

Compliance testing for the yet-to-be-submitted internet-draft will be
gauged simply by looking at what happens on the wire between two nodes.

Kind regards,