Re: [Idr] TCP & BGP: Some don't send terminate BGP when holdtimer expired, because TCP recv window is 0

Gyan Mishra <> Fri, 18 December 2020 23:29 UTC

To: Jeffrey Haas <>
Cc: Brian Dickson <>, Greg Mirsky <>, "Jakob Heitz (jheitz)" <>, "" <>

Hi Jeff

On Fri, Dec 18, 2020 at 4:49 PM Jeffrey Haas <> wrote:

> Gyan,
> > On Dec 18, 2020, at 3:44 PM, Gyan Mishra <> wrote:
> >
> > Jeffrey
> >
> > + Greg Mirsky
> >
> > Would a simple solution be to use BFD RFC 5880 for liveliness detection
> > single hop in async mode with BGP to bring down the protocol BGP registered
> > with BFD.
>
> BFD is used for BGP regularly.  The use of it for ISP to ISP connections,
> as in the issue described, is not very typical.  Resiliency of the session
> is far more important for ISP to ISP communication than fast failure.

    Gyan> Understood and agreed: resiliency and stability come first.  Timers
are generally tuned for fast convergence within an operator's domain or on
the PE-CE customer side, whereas ISP to ISP it's about stability, so the
default 90-second BGP hold timer may be left in place, as in this case.
However, I believe ISPs are starting to use BFD and S-BFD for inter-ISP
links, but they are very careful because they don't want to create
inter-ISP instability from a flapping link due to tight

> For BFD sessions to customers, fast failure is sometimes used.

   Gyan> Agreed.  BGP fast external failover works well for L3 connections,
where the peer goes down as soon as you lose the connected route on link
down.  The gain with BFD is on MetroE links where OAM fault propagation of
a far-end link down is not sent, or where the link stays up because it is
connected to an L2 switch at an IXP/NAP peering point, so you end up
relying on default BGP timers for convergence.  Even on LAN or under-floor
L3 links, where you can rely on fast external failover, BFD or S-BFD may
still be beneficial for one-way fiber scenarios.

> >
> > As the application is not a file transfer between end hosts, and is two
> > routers running BGP I don’t know if BGP implementation has an IPC call that
> > signals BGP to hang on let’s wait for the receiver RTB to clear his buffer
> > and signal with non zero ack.  If BGP could sense the TCP receive window 0
> > via IPC that would be best and immediately tear down BGP and send
> > notification hold timer expired.
>
> BGP implementations vary quite a bit.  Simpler implementations that pay
> only attention to basic socket APIs would simply see things like
> EWOULDBLOCK, EAGAIN or similar if doing async stuff.  If they're doing
> blocking sockets (unusual!), the implementation simply hangs.

   Gyan> Understood.  I guess this could be a case where the BGP version
draft could be handy for troubleshooting these types of issues.

> FWIW, blocked sockets for this sort of thing is usually a socket
> programmer's first introduction to things that cause zero-windowing.

   Gyan> Yep
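
To illustrate the non-blocking case Jeff describes above, here is a minimal
Python sketch, not taken from any BGP implementation.  It assumes a loopback
TCP connection whose receiver never reads, so the buffers fill, the receive
window closes, and the sender's non-blocking send() raises BlockingIOError
(Python's mapping of EAGAIN/EWOULDBLOCK).  The function name and parameters
are hypothetical.

```python
import socket


def fill_until_blocked(chunk_size=65536, max_bytes=1 << 30):
    """Send on a non-blocking TCP socket to a peer that never reads.

    Returns (bytes_queued, blocked). `blocked` is True once send() raised
    BlockingIOError, i.e. the kernel refused more data because the local
    send buffer and the peer's receive window are both full.
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))      # any free loopback port
    listener.listen(1)

    sender = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sender.connect(listener.getsockname())
    receiver, _ = listener.accept()      # accepted, but never read from

    sender.setblocking(False)
    payload = b"\x00" * chunk_size
    queued = 0
    blocked = False
    try:
        while queued < max_bytes:
            try:
                queued += sender.send(payload)
            except BlockingIOError:      # EAGAIN / EWOULDBLOCK
                blocked = True
                break
    finally:
        sender.close()
        receiver.close()
        listener.close()
    return queued, blocked


if __name__ == "__main__":
    queued, blocked = fill_until_blocked()
    print(f"queued {queued} bytes before send() would block: {blocked}")
```

The byte count depends on the kernel's socket buffer sizes.  With a blocking
socket the same send() would simply not return, which is the "hang" case.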

> > During this time until the BGP hold time expires default 90 seconds
> > traffic is not able to reroute on an alternate path and we are black
> > holing traffic until RTR-B sends BGP notification hold time expired
> > followed by TCP RST and BGP peer session torn down.
>
> It's important for the general case to realize that just because BGP is
> wedged up (control plane) that the forwarding plane may - or may not - be
> fine.  You can't tell from BGP.
> What you do know in the abstract is that you care about the sessions being
> healthy in particular:
> 1. If you're not able to receive updates from your peer, you may end up
> with stale forwarding via that peer.
> 2. If you have stuff to send to the peer, they may end up with stale
> forwarding to you.

   Gyan> Good point.  The management plane is critical for monitoring health
such as resources, memory, CPU, etc., and TCP state is part of the
management plane, so the BGP process is split between the TCP sockets,
which are part of the management plane, and the BGP functions themselves.
BGP's use of a TCP socket does make it more vulnerable to DDoS and, of
course, to security concerns, so ensuring authentication is enabled
matters.  The NOC is usually looking for down peers and would not catch
missing routes until a customer reports an outage.  Peer health telemetry
is critical for operators, but the usual check is just any non-zero number
of received routes, so stale or missing routes are very difficult to
troubleshoot and easily missed until you get a ticket.

> In that second case, you have a better local sense as to how urgent being
> stuck is.  If you have thousands of updates queued, it's probably dire.  If
> you have a few... is it?  If it's for a low priority network, maybe not.
> If it's for google, probably much more important.

  Gyan> Yep.  I think operators should use BFD on any critical peering,
including inter-SP, to quickly mitigate prolonged outages where every second

> But in general, being stuck or out of sync is a problem.

   Gyan> Agreed.  I think the stuck state is generally related to the
management-plane TCP socket, while stale could have many other causes.

> But similarly, in general, the cost of dropping and re-establishing a
> peering session is very high.  So, there's resistance to knocking a session
> over because it's had some level of "temporary" hiccup.  Your definition of
> "temporary" will vary, and thus part of the motivation for this
> conversation.

   Gyan> I disagree in the case of stuck, but not stale.  I can see that
stale could possibly recover and normalize on its own, whereas stuck
prevents convergence onto an alternate path; once the peer bounces and
normalizes, it can take traffic again.  Most ISPs load balance across all
their inter-ISP connections with BGP multipath ECMP, so resetting the peer
leaves you much better off than black-holing traffic for the duration of
the hold timer.

> > In this case we are guessing that the TCP receive buffer is full because
> > the link is congested and so cannot process any more packets on the NIC
> > including BGP or BFD control packets.
>
> The fate is potentially shared, but not a guarantee.  If the congestion is
> happening because traffic is selectively dropping for your BGP session, BFD
> may behave fine.  Perhaps you have a congestion issue to your router's
> CPU, but the line card's BFD is fine.

Gyan> Agreed.  I brought up that scenario below; I was not sure it
happened in this particular instance, but it could happen, and I asked how
BFD would help with an issue localized to the router's management plane.

> >
> > So in this particular case with BFD Asynchronous mode enabled let’s say
> > with interval 50ms and multiplier 3 as soon as Receiver RTR-B
> > misses 3 consecutive BFD control packets it pulls down the BGP session
> > within 150ms at which time RTR-B sends notification log message that the
> > hold time has expired and TCP RST is sent closing the session to RTR-A.
>
> This would be way too short for most ISP scenarios.

    Gyan> Agreed.  That was just an example; for inter-ISP it would be
around a second, and something like 750 ms is reasonable.
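
The timer arithmetic in this exchange can be sketched as follows.  Per
RFC 5880 section 6.8.4, the asynchronous-mode detection time is the remote
Detect Mult times the larger of the local Required Min RX Interval and the
remote Desired Min TX Interval.  The function name and the 250 ms x 3
combination used to reach 750 ms are assumptions for illustration only; the
50 ms x 3 and 90-second figures come from the thread.

```python
def bfd_detection_time_ms(local_required_min_rx_ms,
                          remote_desired_min_tx_ms,
                          remote_detect_mult):
    """Async-mode detection time as computed by the local system."""
    # The remote system transmits at the slower (larger) of the two intervals.
    agreed_tx_ms = max(local_required_min_rx_ms, remote_desired_min_tx_ms)
    return remote_detect_mult * agreed_tx_ms


BGP_DEFAULT_HOLD_TIME_MS = 90_000  # the 90-second default discussed above

# Example from the thread: 50 ms interval, multiplier 3.
lab_example_ms = bfd_detection_time_ms(50, 50, 3)          # 150 ms

# Hypothetical inter-ISP setting that yields 750 ms: 250 ms interval x 3.
inter_isp_example_ms = bfd_detection_time_ms(250, 250, 3)  # 750 ms

if __name__ == "__main__":
    for label, value in (("lab", lab_example_ms),
                         ("inter-ISP", inter_isp_example_ms)):
        ratio = BGP_DEFAULT_HOLD_TIME_MS / value
        print(f"{label}: detect in {value} ms, "
              f"{ratio:.0f}x faster than the BGP hold timer")
```

Note that this only bounds how fast a forwarding-path failure is noticed; as
Jeff points out, it says nothing certain about a TCP session that is stuck
while BFD packets still get through.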

> > BFD uses UDP 3784 and is checking link integrity liveliness which would
> > be fine and not fail if the link is not congested.  So then if BGP is
> > having an issue with the TCP session being in a paused state is there IPC
> > TCP to BGP to BFD.
>
> TCP session state is very decoupled from UDP state, so the best inference
> you can make is "BFD works, TCP hopefully can get through?"  But as I noted
> above, there's no guarantee of that.

    Gyan> BFD detects bidirectional liveness, so if the BFD control packets
are not making it through, especially given the RFC 5880 three-way
handshake for session establishment as a continuity test, and the BFD
session cannot establish, then more than likely there is a fiber cut or
other L1 issue.  S-BFD still detects bidirectional data-plane liveness,
but without the three-way handshake session establishment continuity
test.

> For a different flavor of this type of problem, IS-IS doesn't use IP
> transport.  This means IP forwarding can be broken but you can get ISO
> packets through.

   Gyan> IS-IS can still register with single-hop BFD (RFC 5881), with the
timers tuned down for fast RFC 5880 async-mode link failure detection, to
bring down IS-IS neighbors for convergence and avoid black-holing
traffic.

> > I think this second scenario where the link is not congested and TCP is
> > stuck can be easily tested in a lab with a Spirent traffic generator.
>
> I'd suggest playing with selective packet loss for a link for a busy TCP
> session.  You should find that with no more than 15% of TCP packet loss
> that your throughput becomes terrible, and sessions may simply fail because
> the TCP ACK necessary to advance the window may simply not get through.

    Gyan> Will give it a shot
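
As a rough way to see why loss hurts so much before any lab testing, the
well-known Mathis et al. approximation bounds steady-state TCP throughput by
(MSS/RTT) * (C/sqrt(p)), with C about sqrt(3/2).  This model is not
mentioned in the thread and is only a sketch under stated assumptions: it
ignores retransmission timeouts, which dominate at loss rates as high as
15%, so real results would be worse.  The MSS and RTT values are
hypothetical.

```python
import math


def mathis_throughput_bps(mss_bytes, rtt_s, loss_rate):
    """Upper-bound estimate of TCP throughput in bits per second."""
    if not 0 < loss_rate <= 1:
        raise ValueError("loss_rate must be in (0, 1]")
    c = math.sqrt(3.0 / 2.0)  # constant from the Mathis approximation
    return (mss_bytes * 8 / rtt_s) * (c / math.sqrt(loss_rate))


if __name__ == "__main__":
    mss_bytes = 1460   # hypothetical Ethernet-sized MSS
    rtt_s = 0.050      # hypothetical 50 ms round-trip time
    for loss in (0.0001, 0.01, 0.15):
        bps = mathis_throughput_bps(mss_bytes, rtt_s, loss)
        print(f"loss {loss:.2%}: at most about {bps / 1e6:.2f} Mbit/s")
```

Because the bound scales with 1/sqrt(p), it falls off quickly as loss grows,
which is consistent with Jeff's observation that throughput becomes terrible
well before the link is fully broken.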

I think overall, for link congestion or failure where bidirectional
continuity needs to be detected, there is tremendous gain in using BFD
single hop async for BGP, OSPF or IS-IS convergence.

It would be nice if BFD, S-BFD, or IPPM IOAM could run internally on the
router's management plane to detect the control-plane health of socket
establishment.  I would have to noodle on whether that is possible, but it
could be a new, innovative draft.

> -- Jeff
> --


*Gyan Mishra*

*Network Solutions Architect*

*M 301 502-1347*

*13101 Columbia Pike*

*Silver Spring, MD*