Re: [Idr] TCP & BGP: Some don't send terminate BGP when holdtimer expired, because TCP recv window is 0

Jeffrey Haas <> Fri, 18 December 2020 21:49 UTC

From: Jeffrey Haas <>
Date: Fri, 18 Dec 2020 16:49:48 -0500
Cc: Greg Mirsky <>, Brian Dickson <>, "Jakob Heitz (jheitz)" <>, "" <>
To: Gyan Mishra <>


> On Dec 18, 2020, at 3:44 PM, Gyan Mishra <> wrote:
> Jeffrey 
> + Greg Mirsky 
> Would a simple solution be to use BFD RFC 5880 for liveliness detection single hop in async mode with BGP to bring down the protocol BGP registered with BFD.

BFD is used for BGP regularly.  Using it for ISP to ISP connections, as in the issue described, is not very typical.  Resiliency of the session is far more important for ISP to ISP communication than fast failure detection.

For BGP sessions to customers, BFD is sometimes used for fast failure detection.

> As the application is not a file transfer between end hosts, and is two routers running BGP I don’t know if BGP implementation has a IPC call that signals BGP to hang on let’s wait for the receiver RTB to clear his buffer and signal with non zero ack.  If BGP could sense the TCP receive window 0 via IPC that would be best and immediately tear down BGP and send notification hold timer expired.

BGP implementations vary quite a bit.  Simpler implementations that pay attention only to basic socket APIs would simply see errors like EWOULDBLOCK, EAGAIN or similar if they're using non-blocking (async) sockets.  If they're using blocking sockets (unusual!), the implementation simply hangs in the write.
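As a rough illustration of what such a simple implementation sees, here is a minimal sketch using a loopback TCP connection whose receiver never reads.  The helper name fill_until_blocked and its parameters are made up for this example, and it is not taken from any BGP implementation.  Once the receiver's buffer fills and its window closes, the sender's own buffer fills too, and the non-blocking send fails with EAGAIN/EWOULDBLOCK.

```python
import errno
import socket


def fill_until_blocked(chunk_size=65536, max_bytes=256 * 1024 * 1024):
    """Write to a peer that never reads until the socket would block.

    Returns (bytes_accepted_by_kernel, errno_seen_or_None).
    Hypothetical helper for illustration only.
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))          # any free loopback port
    listener.listen(1)

    sender = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sender.connect(listener.getsockname())
    receiver, _ = listener.accept()          # accepted, but never read from

    sender.setblocking(False)                # the "async" case
    chunk = b"\x00" * chunk_size
    sent = 0
    err = None
    try:
        while sent < max_bytes:              # safety cap so the loop ends
            try:
                sent += sender.send(chunk)
            except BlockingIOError as exc:
                # Send buffer is full: the peer is not draining its side.
                err = exc.errno
                break
    finally:
        sender.close()
        receiver.close()
        listener.close()
    return sent, err


if __name__ == "__main__":
    total, err = fill_until_blocked()
    print(total, errno.errorcode.get(err))
```

Note that the error by itself only says "can't write right now".  It doesn't say why, or for how long, which is why an implementation that wants to act on this has to add its own timing on top.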

FWIW, a blocked socket in this sort of situation is usually a socket programmer's first introduction to the things that cause zero-windowing.

> During this time until the BGP hold time expires default 90 seconds traffic is not able to reroute on an alternate path and we are black holding traffic until RTRB sends BGP notification hold time expired followed by TCP RST and BGP peer session torn down.

For the general case, it's important to realize that when BGP (the control plane) is wedged, the forwarding plane may - or may not - be fine.  You can't tell from BGP.

What you do know in the abstract is that you care about the session being healthy, in particular:
1. If you're not able to receive updates from your peer, you may end up with stale forwarding via that peer.
2. If you have stuff to send to the peer, they may end up with stale forwarding to you.

In that second case, you have a better local sense as to how urgent being stuck is.  If you have thousands of updates queued, it's probably dire.  If you have a few... is it?  If it's for a low priority network, maybe not.  If it's for google, probably much more important.

But in general, being stuck or out of sync is a problem.

But similarly, in general, the cost of dropping and re-establishing a peering session is very high.  So, there's resistance to knocking a session over because it's had some level of "temporary" hiccup.  Your definition of "temporary" will vary, and thus part of the motivation for this conversation.
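To make the "how long is temporary?" trade-off concrete, here is a minimal sketch of a send-side stall check.  Everything in it is hypothetical (the class name SendStallMonitor, the stall_limit parameter, the idea of counting queued updates); it is one possible shape for such a check, not a description of how any implementation or the BGP specification does it.

```python
from dataclasses import dataclass


@dataclass
class SendStallMonitor:
    """Tracks whether queued updates to a peer have stopped draining.

    Hypothetical sketch: stall_limit is the operator's definition of
    "temporary", in seconds.  Time is passed in explicitly.
    """
    stall_limit: float
    queued: int = 0
    last_progress: float = 0.0

    def enqueue(self, now, count=1):
        if self.queued == 0:
            # A backlog is just starting, so start the clock here.
            self.last_progress = now
        self.queued += count

    def sent(self, now, count=1):
        # The socket accepted some data: that counts as progress.
        self.queued = max(0, self.queued - count)
        self.last_progress = now

    def is_stuck(self, now):
        # Stuck only if there is something to send and nothing has
        # moved for at least stall_limit seconds.
        return self.queued > 0 and (now - self.last_progress) >= self.stall_limit


if __name__ == "__main__":
    mon = SendStallMonitor(stall_limit=90.0)
    mon.enqueue(now=0.0, count=5)
    print(mon.is_stuck(now=30.0))
    print(mon.is_stuck(now=95.0))
```

A real policy would probably also weigh what is queued (a few low priority prefixes vs. thousands of updates), which this sketch deliberately leaves out.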

> In this case we are guessing that the TCP receive buffer is full because the link is congested and so cannot process any more packets on the NIC including BGP or BFD control packets.

The fate is potentially shared, but that's not guaranteed.  If the congestion is happening because traffic is being selectively dropped for your BGP session, BFD may behave fine.  Perhaps you have a congestion issue to your router's CPU, but the line card's BFD is fine.

> So in this particular case with BFD Asynchronous mode enabled let’s say with interval 50ms and multiplier 3 as soon as Receiver RTR-B misses 3 consecutive BFD control packets it pulls down the BGP session within 150ms at which time RTR-B sends notification log message that the hold time has expired and TCP RST is sent closing the session to RTR-A.

This would be way too short for most ISP scenarios. 
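For scale, here is the arithmetic behind the 150ms figure as a small sketch.  The function and parameter names are mine; the calculation follows the asynchronous mode detection time in RFC 5880 (the remote Detect Mult times the agreed transmit interval, which is the larger of the local Required Min RX Interval and the remote Desired Min TX Interval).  The 90 second hold time is the default mentioned earlier in the thread.

```python
def bfd_detection_time_ms(local_required_min_rx_ms,
                          remote_desired_min_tx_ms,
                          remote_detect_mult):
    """Asynchronous-mode BFD detection time in milliseconds.

    Hypothetical helper; names are chosen for this example.
    """
    # The peer transmits no faster than the slower of the two values.
    agreed_interval_ms = max(local_required_min_rx_ms, remote_desired_min_tx_ms)
    return remote_detect_mult * agreed_interval_ms


if __name__ == "__main__":
    bfd_ms = bfd_detection_time_ms(50, 50, 3)   # the quoted example
    bgp_hold_ms = 90 * 1000                     # default hold time from the thread
    print(bfd_ms, bgp_hold_ms // bfd_ms)
```

With these numbers BFD would declare the session down several hundred times faster than the BGP hold timer would, so a brief burst of loss that BGP rides out would tear the session down.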

> BFD used UDP 6784 and is checking link integrity liveliness which would be fine and not fail if the link is not congested.  So then if BGP is having an issue with the TCP session being in a paused state is their IPC TCP to BGP to BFD.

TCP session state is very decoupled from UDP state, so the best inference you can make is "BFD works, TCP hopefully can get through?"  But as I noted above, there's no guarantee of that.

For a different flavor of this type of problem, IS-IS doesn't use IP transport.  This means IP forwarding can be broken but you can get ISO packets through.

> I think this second scenario where the link is not congested and TCP is stuck can be easily tested in a lab with a Spirent traffic generator.

I'd suggest playing with selective packet loss on a link carrying a busy TCP session.  You should find that with no more than 15% TCP packet loss your throughput becomes terrible, and sessions may simply fail because the TCP ACKs necessary to advance the window don't get through.
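If you want a rough feel for the numbers before going to the lab, one commonly cited rule of thumb is the Mathis et al. approximation for steady-state TCP throughput, rate ≈ (MSS / RTT) * (C / sqrt(p)).  This is my substitution for illustration and not something measured in this thread; the constant C ≈ 1.22, the MSS and the RTT below are all assumptions.  The model also ignores retransmission timeouts, so at high loss rates real sessions tend to do worse than it predicts.

```python
import math


def mathis_throughput_bps(mss_bytes, rtt_seconds, loss_rate, c=1.22):
    """Approximate steady-state TCP throughput in bits per second.

    Mathis et al. rule of thumb; optimistic at high loss because it
    does not model retransmission timeouts.  Parameters are assumptions.
    """
    if not 0.0 < loss_rate <= 1.0:
        raise ValueError("loss_rate must be in (0, 1]")
    return (mss_bytes * 8 / rtt_seconds) * (c / math.sqrt(loss_rate))


if __name__ == "__main__":
    for p in (0.0001, 0.01, 0.15):
        bps = mathis_throughput_bps(mss_bytes=1460, rtt_seconds=0.05, loss_rate=p)
        print(f"loss={p:.2%}  ~{bps / 1e6:.2f} Mbit/s")
```

The useful part is the shape rather than the exact values: throughput falls with the square root of the loss rate, independent of how fast the link is.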

-- Jeff