Re: [Idr] TCP & BGP: Some don't send terminate BGP when holdtimer expired, because TCP recv window is 0

Gyan Mishra <> Fri, 18 December 2020 20:44 UTC

To: Greg Mirsky, Jeffrey Haas
Cc: Brian Dickson, "Jakob Heitz (jheitz)"


+ Greg Mirsky

Would a simple solution be to use BFD (RFC 5880) for liveness detection,
single hop in asynchronous mode, with BGP registered as a BFD client so
that BFD can bring BGP down?

I don’t know which vendor’s router this was, as it was not mentioned.
However, I wanted to point out that the default TCP MSS for BGP is 536
bytes if PMTUD, which sets the MSS to interface MTU - 40, is not enabled on
the peer. That could have caused the receive buffer to fill, resulting in a
receive window 0 situation.
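
To make the "MTU - 40" arithmetic concrete, here is a minimal Python sketch.
It assumes IPv4 with no IP or TCP options (20 + 20 bytes of headers); the
function names and the 1,000,000-byte payload are hypothetical and only for
illustration. With the 576-byte default datagram size this gives the 536-byte
default MSS, and with a 1500-byte Ethernet MTU it gives 1460.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options (assumption)
TCP_HEADER_BYTES = 20   # TCP header without options (assumption)


def mss_from_mtu(mtu_bytes: int) -> int:
    """Return the largest TCP payload that fits in one IPv4 packet of mtu_bytes."""
    overhead = IPV4_HEADER_BYTES + TCP_HEADER_BYTES
    if mtu_bytes <= overhead:
        raise ValueError("MTU too small to carry a TCP segment")
    return mtu_bytes - overhead


def segments_needed(payload_bytes: int, mss_bytes: int) -> int:
    """Number of segments needed to carry payload_bytes at a given MSS."""
    return -(-payload_bytes // mss_bytes)  # ceiling division


if __name__ == "__main__":
    for mtu in (576, 1500, 9000):
        mss = mss_from_mtu(mtu)
        # Hypothetical 1,000,000-byte burst of UPDATE data
        print(mtu, mss, segments_needed(1_000_000, mss))
```

A smaller MSS means the same amount of UPDATE data arrives as many more
segments. Whether that contributed to the stall here is only a guess.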

Since BGP runs over TCP, it is subject to standard TCP flow control,
including window scaling where it is negotiated.

During the 3-way handshake, each side advertises its initial receive
window, so the receiver’s initial window size is established by the
SYN/ACK it sends.
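
As a rough sketch of how the advertised window relates to window scaling:
under RFC 7323 the 16-bit window field is left-shifted by the scale factor
that was exchanged in the SYN segments, with a maximum shift of 14. The
function name and the sample shift of 7 below are assumptions for
illustration, not values taken from this thread.

```python
MAX_WINDOW_FIELD = 0xFFFF  # the TCP window field is 16 bits
MAX_SHIFT = 14             # largest window scale shift allowed by RFC 7323


def effective_window(window_field: int, shift: int) -> int:
    """Return the receive window in bytes for a given window field and shift."""
    if not 0 <= window_field <= MAX_WINDOW_FIELD:
        raise ValueError("window field must fit in 16 bits")
    if not 0 <= shift <= MAX_SHIFT:
        raise ValueError("shift must be between 0 and 14")
    return window_field << shift


if __name__ == "__main__":
    print(effective_window(65535, 0))  # no scaling negotiated
    print(effective_window(65535, 7))  # hypothetical shift of 7
    print(effective_window(0, 7))      # a zero window stays zero at any scale
```

Note that a window field of 0 means a zero window no matter what scale was
negotiated, which is the condition discussed in this thread.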

So now a BGP update comes along, and the sender either advertises or
withdraws the internet table.

RtrA = Sender
RtrB = Receiver

RtrB sends win 65535 *ws=-28
RtrA sends BGP updates until the last segment
RtrB sends ACK

** All of a sudden the RtrB management plane is not able to keep up with
the BGP updates, and its TCP receive buffer is full **

At this point the TCP session goes into a “paused” state. Once the sender
RtrA has received a TCP receive window 0 advertisement, it stops sending
BGP updates and waits for an ACK from RtrB with a new non-zero window size
to unpause the TCP session so that data segments can be sent again.
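
The stall can be pictured with a toy model in Python. This is only a sketch
of receive-window flow control, not real TCP: the `Receiver` class, the
buffer size, and the segment size are all assumptions made for illustration.
The sender stops as soon as the advertised window reaches 0 and can resume
only after the receiving application reads from its buffer.

```python
from dataclasses import dataclass


@dataclass
class Receiver:
    """Toy TCP receiver: a fixed buffer that only the application can drain."""
    buffer_size: int   # receive buffer in bytes (assumed value)
    buffered: int = 0  # bytes received but not yet read by the application

    def advertised_window(self) -> int:
        return self.buffer_size - self.buffered

    def receive(self, nbytes: int) -> int:
        accepted = min(nbytes, self.advertised_window())
        self.buffered += accepted
        return accepted

    def app_read(self, nbytes: int) -> None:
        self.buffered -= min(nbytes, self.buffered)


def send_updates(receiver: Receiver, total_bytes: int, segment_size: int) -> int:
    """Send until everything is sent or the window closes; return bytes sent."""
    sent = 0
    while sent < total_bytes:
        window = receiver.advertised_window()
        if window == 0:
            break  # zero window: the sender must pause
        chunk = min(segment_size, window, total_bytes - sent)
        sent += receiver.receive(chunk)
    return sent


if __name__ == "__main__":
    rtr_b = Receiver(buffer_size=65535)
    print(send_updates(rtr_b, 200_000, 1460), rtr_b.advertised_window())
    rtr_b.app_read(4096)  # BGP on RtrB finally reads some data
    print(rtr_b.advertised_window())
```

In this model nothing on the sender side times out by itself. That matches
the concern in this thread that the session can sit paused until a BGP timer
fires.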

As the application is not a file transfer between end hosts but two
routers running BGP, I don’t know whether a BGP implementation has an IPC
call that tells BGP to hang on and wait for the receiver RtrB to clear its
buffer and signal a non-zero window.  If BGP could sense the TCP receive
window 0 via IPC, the best behavior would be to tear down BGP immediately
and send a Hold Timer Expired NOTIFICATION.

However, regardless of the TCP-to-BGP IPC communication, here is what
happens without BFD, which resulted in the internet outage.

RtrA is still waiting for an ACK with a non-zero window before it can send.

During this time, until the BGP hold time expires (default 90 seconds),
traffic is not able to reroute onto an alternate path and we are
black-holing traffic, until RtrB sends a BGP NOTIFICATION (Hold Timer
Expired) followed by a TCP RST and the BGP peer session is torn down.

RFC 4271, Section 6.5.  Hold Timer Expired Error Handling

   If a system does not receive successive KEEPALIVE, UPDATE, and/or
   NOTIFICATION messages within the period specified in the Hold Time
   field of the OPEN message, then the NOTIFICATION message with the
   Hold Timer Expired Error Code is sent and the BGP connection is
   closed.
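
The hold timer rule quoted above can be sketched in Python as follows. This
is a simplified model under stated assumptions: time is passed in explicitly
as seconds, the class and method names are hypothetical, and a hold time of
0 is treated as "timer disabled". It is not any vendor's implementation.

```python
class HoldTimer:
    """Toy BGP hold timer: restarted by every received message."""

    def __init__(self, hold_time: float, now: float = 0.0) -> None:
        self.hold_time = hold_time
        self.deadline = now + hold_time

    def on_message(self, now: float) -> None:
        """Restart the timer when a KEEPALIVE or UPDATE arrives."""
        self.deadline = now + self.hold_time

    def expired(self, now: float) -> bool:
        """True once hold_time seconds have passed with no message received."""
        return self.hold_time > 0 and now >= self.deadline


if __name__ == "__main__":
    timer = HoldTimer(hold_time=90.0)
    timer.on_message(now=30.0)       # last message before the stall
    print(timer.expired(now=119.9))  # still inside the hold time
    print(timer.expired(now=120.0))  # 90 s after the last message
```

With a 90 second hold time, traffic can be black-holed for up to 90 seconds
after the last message was received, which is the gap that a faster
liveness check is meant to close.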

Now consider the same case with BFD (RFC 5880) liveness detection, single
hop, asynchronous mode, with sub-second timers set.

Once the BFD session is established in both directions, the path is
declared operational, which serves as a continuity test. (This does not
apply to S-BFD, if used; I highly recommend using standard BFD for a MetroE
continuity check prior to bringing up routing.)

A separate BFD session is created for each communication path to monitor
bidirectional data plane liveness on behalf of the registered protocol,
BGP.  The detection time is the interval multiplied by the multiplier: if
that number of consecutive asynchronous control packets is not received by
the other end, the BFD session is declared down and BFD pulls down the
registered routing protocol, in this case BGP.

In this case we are guessing that the TCP receive buffer is full because
the link is congested, so the router cannot process any more packets on
the NIC, including BGP or BFD control packets.

So in this particular case, with BFD asynchronous mode enabled with, let’s
say, an interval of 50 ms and a multiplier of 3, as soon as the receiver
RtrB misses 3 consecutive BFD control packets, BFD pulls down the BGP
session within 150 ms. At that time RtrB sends a notification log message
that the hold time has expired, and a TCP RST is sent, closing the session
to RtrA.
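
The 150 ms figure follows from the asynchronous-mode detection time in
RFC 5880: the Detect Mult received from the remote system multiplied by the
agreed remote transmit interval, which is the greater of the local Required
Min RX Interval and the remote Desired Min TX Interval. A minimal Python
sketch, with hypothetical function and parameter names:

```python
def bfd_detection_time_ms(remote_detect_mult: int,
                          local_required_min_rx_ms: int,
                          remote_desired_min_tx_ms: int) -> int:
    """Asynchronous-mode detection time in milliseconds, per RFC 5880."""
    # The remote system transmits no faster than the slower of the two values.
    agreed_interval_ms = max(local_required_min_rx_ms, remote_desired_min_tx_ms)
    return remote_detect_mult * agreed_interval_ms


if __name__ == "__main__":
    bfd_ms = bfd_detection_time_ms(3, 50, 50)
    hold_time_ms = 90 * 1000  # the 90 second BGP hold time from this thread
    print(bfd_ms, hold_time_ms // bfd_ms)
```

With 50 ms timers and a multiplier of 3 on both sides, detection takes 150
ms, compared with 90 seconds for the BGP hold timer. This only helps if the
failure is one that BFD can actually see.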

Suppose BFD asynchronous mode was enabled on a LAG (802.1AX) but was not
using BFD on LAG (RFC 7130), where a micro-BFD session runs per member link
in the LAG.

Even if the vendor implementation is not running RFC 7130 micro-BFD
sessions per member link and, let’s say, runs a single BFD session over one
of the member links, so that there is no visibility into all the member
links, at least in this particular case it is a TCP receive buffer issue
and not a link integrity issue, so I think it would not impact BFD taking
down the link.

I doubt this was the case, but if there was no link congestion, BFD
liveness detection would not fail.  That scenario, where only the TCP
session for BGP was stuck in a paused state and there was no link
congestion, is perhaps highly unlikely.

BFD uses UDP port 3784 for single-hop control packets (6784 for RFC 7130
micro-BFD sessions) and checks link integrity and liveness, which would be
fine and would not fail if the link is not congested.  So if BGP is having
an issue with the TCP session being in a paused state, is there IPC from
TCP to BGP?
Would BFD be able to detect that BGP is hung due to TCP Paused state from
receive window 0?

My guess is that, as BGP is registered with BFD, BFD would be monitoring
the BGP protocol status and state, and would detect the stuck state and
bring down BGP.

I think this second scenario where the link is not congested and TCP is
stuck can be easily tested in a lab with a Spirent traffic generator.


On Thu, Dec 17, 2020 at 9:29 AM Jeffrey Haas wrote:

> Brian,
>
> > On Dec 16, 2020, at 6:08 PM, Brian Dickson wrote:
> > Thinking a bit bigger-picture, who could or should be able to (a)
> > detect, and (b) respond to, a situation like this in future?
> > What are the pros/cons of different approaches, in terms of risk (of
> > accidental or malicious outages induced), or effectiveness?
> >
> > I'll start:
> > A large network peering with another large network, is likely to have
> > more visibility.
> > If all of the sessions are stuck (but still up), that's a much stronger
> > indicator, and maybe that'd be a good situation when auto-reaction would
> > be appropriate.
> > Assuming the auto-reaction was limited to large peers of a large
> > network, I think this is less risky, and still very effective.
> > (There would still be a challenge of how to share this state discovery
> > across an ASN, but that's a more constrained problem to solve, IMHO.)
>
> This sort of analysis is amenable to current centrally collected telemetry
> situations.  It's just not one that would be in people's playbooks.
>
> > A large ASN may want a reliable, secure method of shutting down peers
> > for some modest duration. Is that also something to consider developing
> > a solution for?
>
> Arguably, how is this different than your billing system deciding someone
> hasn't paid the bill and you shut down their BGP?  In this case, it's just
> a matter of finding all of either the impacted peering sessions, or having
> a list of all peering sessions by AS.
>
> > I don't see that possible without signatures a la RPKI, but don't know
> > if it's something anyone would really want to have available.
>
> Pushing this action into someone else's system is probably a non-starter.
> If you think people were upset at the Dutch court attack scenario...
>
> > The conceptual model would be that of a rapid administrative shutdown
> > of peering sessions, possibly with a pre-configured timer to re-enable
> > sessions and/or a start time?
>
> I think in the scenario in question, automated and persisted shutdown is
> the desire.  Figuring out whether you have stable BGP, much less safe
> transit BGP through the provider is trickier.
>
> -- Jeff


*Gyan Mishra*

*Network Solutions Architect*

M 301 502-1347
13101 Columbia Pike
Silver Spring, MD