Re: [Idr] TCP & BGP: Some don't send terminate BGP when holdtimer expired, because TCP recv window is 0

Robert Raszuk <> Fri, 11 December 2020 21:22 UTC

From: Robert Raszuk <>
Date: Fri, 11 Dec 2020 22:22:07 +0100
To: Job Snijders <>


A couple of observations/questions:

* I do share some concerns that this will make BGP peering less stable.

* Is "unable to send" only possible when the window is 0? What if a local
NIC buffer is full and we keep dropping the message locally? If we are going
there, perhaps we could say "unable to successfully send", meaning send and
get an ACK for it?

* The proposal is about reusing the HOLD TIME value to bring BGP down when
you are still receiving keepalives but the peer's ACK for the last segment
advertised a zero window. Is this right?

* What happens if our side is stuck and unable to process subsequent ACKs
that would potentially reopen the peer's window?

* Most deployments use BFD to make sure the peer is reachable. As BFD is
often offloaded from the CPU, it is not an indication of the health of the
TCP or BGP path. But what if some deployments cannot use BFD (say, it is not
supported by the peer)? Those typically reduce the BGP HOLD TIME. Would it
not be too fragile in such cases to bring TCP down that fast? Aren't we
overloading the HOLD TIME value here a bit?

* Assume we bring TCP down ... when does it attempt to come up again? Are we
OK with bringing it up only after a manual/script action from such a state?

* As proposed, it seems the change will affect all AFI/SAFIs using a given
single TCP session, even if all the others are perfectly healthy and running
on separate cores. If each SAFI ran on a separate TCP session, this would
not have such an impact.

* The change seems applicable to both iBGP and eBGP, right?
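
To make the "unable to send" question above concrete, here is a minimal
sketch of a send-side hold timer. It is not taken from any implementation:
the class and method names are hypothetical, and "sent" is assumed to mean
"fully handed to TCP", which is weaker than "sent and ACKed". The timer does
not care why the message is stuck (zero window at the peer or a full local
buffer); it only measures how long queued data has made no progress.

```python
from dataclasses import dataclass


@dataclass
class SendHoldTimer:
    """Hypothetical send-side hold timer; all times are seconds from the caller."""
    hold_time: float        # negotiated Hold Time (0 means timers are not used)
    last_progress: float    # time of the last send progress (or session start)
    pending: bool = False   # True while a message is queued but not yet sent

    def on_queued(self, now: float) -> None:
        # Start measuring only when something first has to wait.
        if not self.pending:
            self.pending = True
            self.last_progress = now

    def on_sent(self, now: float, queue_empty: bool) -> None:
        # Assumption: any message fully handed to TCP counts as progress.
        self.last_progress = now
        self.pending = not queue_empty

    def expired(self, now: float) -> bool:
        # A Hold Time of zero disables the timer, as on the receive side.
        if self.hold_time == 0:
            return False
        # Expire only if data has been waiting for a full hold time.
        return self.pending and (now - self.last_progress) >= self.hold_time
```

An idle session never expires in this sketch, because nothing is pending.
Whether expiry should then trigger a NOTIFICATION, a TCP RST, or something
else is exactly the open question and is deliberately left out.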

In summary, the attempt here is to fix application issues by cutting the
transport. Sure, a half-broken transport, for whatever reason, may not be a
good thing to keep in the UP state, especially when redundancy is in place.
But my main concern is that we are focusing on only a single low-level
trigger to detect it.

Wouldn't a per AFI/SAFI heartbeat be a better option to detect whether a
peer's BGP stack is still up and running fine for all applications?
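
As a rough illustration of what a per AFI/SAFI heartbeat could give us, the
sketch below flags only the address families that have gone quiet instead of
the whole session. No such heartbeat message exists in BGP today; the function
name, the `last_seen` table and the timeout are all assumptions made for this
example.

```python
from typing import Dict, List, Tuple

AfiSafi = Tuple[int, int]  # (AFI, SAFI), e.g. (1, 1) is IPv4 unicast


def stale_families(last_seen: Dict[AfiSafi, float],
                   now: float,
                   timeout: float) -> List[AfiSafi]:
    """Return families whose last (hypothetical) heartbeat is at least `timeout` seconds old."""
    return sorted(family for family, seen in last_seen.items()
                  if now - seen >= timeout)
```

With something like this, a receiver could react to one stuck address family
while leaving the healthy ones on the same TCP session alone.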


On Fri, Dec 11, 2020 at 9:04 PM John Scudder <jgs=> wrote:

> [all hats on]
> Hi Job,
> Thanks for bringing this up.
> To take the liberty of summarizing your wall of text :-) you’re saying
> that you believe BGP should tear down its session if it’s unable to send a
> message for the duration of the hold time.
> Given that the conversation last time was inconclusive I think this is a
> good thing for the WG to discuss again. If you want to, you (or someone)
> could turn the idea into a short draft that updates RFC 4271, and we could
> have a WG adoption discussion about it. It might help focus the discussion
> but it’s not mandatory.
> I’ll point out a few things to start with —
> - Making it mandatory to apply hold time to the sending of messages would
> potentially make BGP peerings less stable. It clearly can’t make them
> *more* stable. Of course one can argue that if you haven’t been able to
> send a message for the hold time, the session has failed its metric of
> usefulness anyway, so any veneer of stability at this point is a harmful
> sham.
> - If I recall correctly, RST doesn’t work (or may not work) if you’re
> using the MD5 TCP option. Nothing much to be done, but be aware.
> - There is nothing stopping an implementation from doing what you describe
> now. The formalism that keeps you within the letter of 4271 would be that
> the implementation supplies a configuration option, that you set to enable
> the behavior. Once you’ve done that, when the implementation notices that
> the hold time has been exceeded in the outbound direction, it generates a
> ManualStop event for the session.
> Thanks,
> —John
> > On Dec 11, 2020, at 2:23 PM, Job Snijders <> wrote:
> >
> >
> > Dear group,
> >
> > Not too long ago an incident [1] in one Autonomous System resulted in
> > the global Internet being unusable in many parts of the world for
> > multiple hours. Some have reported the root cause was a 'configuration
> > error', however I believe much of the observed communication blackouts
> > in the global routing system stemmed from a pre-existing condition: a
> > specific implementation property present in multiple implementations
> > currently in use in the default-free zone.
> >
> > Usually when an incident happens in one AS, affected parties can through
> > unilateral action 'route around the problem', but the ability to 'route
> > around problems' critically depends on the ability to distribute
> > WITHDRAW or UPDATE messages. When messages are not processed, what
> > generally was assumed to be a unilaterally solvable problem, now requires
> > coordination between *all* neighbors of the suffering AS.
> >
> > The global routing system requires every participant to process BGP
> > messages, because the alternative is intervention on thousands of BGP
> > devices to manually shut down thousands of BGP sessions disconnecting the
> > AS suffering from an incident, to help the rest of the default-free
> > zone. I speak from experience when saying that coordinating a
> > disconnection of an AS at global scale is incredibly hard and slow, and
> > many approval levels must be worked through. It takes *hours* of phone
> > calls & email chains, a time window during which internet traffic is
> > routed towards stale (now blackholing) locations.
> >
> > In the average ISP's network design using IBGP Route Reflectors, these
> > blackout effects are aggravated when BGP sessions landing in such
> > devices are not terminated when TCP causes the BGP session to stall.
> >
> > The problem of how TCP and BGP-4 can interact has been discussed before,
> > but I'm not sure the working group followed up with any publication
> > detailing the problem and the solution.
> >
> >
> >
> > Does everyone agree BGP-4 sessions MUST be terminated using a TCP RST
> > (instead of a BGP-4 Cease NOTIFICATION) if the peer has indicated for
> > the duration of the Hold Timer that the TCP receive window is zero?
> > I'm fine with there being buttons to make this different, but the
> > default for routers in the global Internet routing system should be to
> > consider the remote peer to be 'a lost cause' when it won't accept new
> > BGP messages for the duration of the hold timer.
> >
> > Perhaps RFC 4271 Section 6.5 should be amended as follows:
> >
> > OLD:
> >    If a system does not receive successive KEEPALIVE, UPDATE, and/or
> >    NOTIFICATION messages within the period specified in the Hold Time
> >    field of the OPEN message, then the NOTIFICATION message with the
> >    Hold Timer Expired Error Code is sent and the BGP connection is
> >    closed.
> >
> > NEW:
> >    If a system does not receive (or is unable to send) successive
> >    KEEPALIVE, UPDATE, and/or NOTIFICATION messages within the period
> >    specified in the Hold Time field of the OPEN message, then the
> >    NOTIFICATION message with the Hold Timer Expired Error Code is sent
> >    and the BGP connection is closed. If the NOTIFICATION message cannot
> >    be sent, the BGP connection is closed.
> >
> > This is an ongoing problem. I suspect the BGP Nyancat's discoloration at
> > the leftmost eye might have been caused by an active TCP session
> > keeping a stale BGP session alive. But also the observations from "BGP
> > Zombies: an Analysis of Beacons Stuck Routes" [3] could be explained by
> > the problematic interaction between TCP and BGP.
> >
> > I appreciate the work the IDR working group has done to *SOFTEN* the
> > blow from implementation defects on global routing (RFC 7606 is a
> > brilliant example of this), but I fear in this case there is no subtle
> > way to say goodbye when the peer doesn't process messages in a timely
> > fashion. It might be good to document this.
> >
> > Kind regards,
> >
> > Job
> >
> > [1]:
> > [2]:
> > [3]:
> >
> > _______________________________________________
> > Idr mailing list
> >
> >
> _______________________________________________
> Idr mailing list