Re: [Idr] draft-spaghetti-idr-bgp-sendholdtimer - Feedback requested

Jeffrey Haas <> Wed, 28 April 2021 13:10 UTC

Date: Wed, 28 Apr 2021 09:33:46 -0400
From: Jeffrey Haas <>
To: Adam Chappell <>
Cc: Ben Cox <>, IETF IDR WG <>


On Wed, Apr 28, 2021 at 01:01:34PM +0200, Adam Chappell wrote:
> It is significant to me that at least two production implementations
> I've seen offer visible CLI instrumentation when keepalives haven't
> been written on schedule, so it seems that implementors are definitely
> aware of the weakness, even if not acting on the situation today and
> could do.

On scaled systems, scheduler slips for timers can be a serious challenge.
And that is before the headache of whether your network stack will
actually manage to get the scheduled packet to the remote device on time.
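
As a rough illustration of the kind of instrumentation Adam mentions, here is
a minimal sketch that records how late each keepalive was handed to the
socket relative to its schedule. The class name, its fields, and the
injectable clock are all hypothetical and not taken from any implementation.
Note that "written" here only means handed to the local stack; it says
nothing about whether the packet reached the remote device on time.

```python
import time


class KeepaliveSlipMonitor:
    """Hypothetical helper: counts keepalives written later than scheduled."""

    def __init__(self, interval, clock=time.monotonic):
        self.interval = interval          # keepalive interval in seconds
        self.clock = clock                # injectable so it can be tested
        self.next_due = clock() + interval
        self.late_count = 0               # keepalives written after next_due
        self.worst_slip = 0.0             # largest observed lateness, seconds

    def on_keepalive_written(self):
        """Call when a keepalive has been handed to the socket.

        Returns the slip in seconds (0.0 if it was on time or early).
        """
        now = self.clock()
        slip = now - self.next_due
        if slip > 0:
            self.late_count += 1
            self.worst_slip = max(self.worst_slip, slip)
        # Schedule the next keepalive relative to when this one went out.
        self.next_due = now + self.interval
        return max(slip, 0.0)
```

Counters like late_count and worst_slip are the sort of values a CLI could
expose; what to do when they grow is a separate question.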

> I read your compelling blog post[2] and investigations but I have
> missed the conclusive link between the phenomenon of zombies
> (undisputed) and TCP sessions regularly stalled in this way. I
> understand that you can synthesise the problem by creating a BGP
> speaker that starves his peer of the oxygen to talk thus polluting
> your own RIB; but I missed the pointer to evidence that it is indeed
> this that is generally occurring and causing zombies. Not disputing
> that it may be.

"Stuck route" or "missing route/withdraw" bugs of various sorts are an
indisputable fact of prior and sometimes current BGP implementations.

Some of the circumstances described in this thread are, as you had noted,
perhaps not the issue itself but a visible manifestation of the issue.  When
an implementation sees zero-windowing, it can't tell whether there's a TCP
issue, a starving reader on the remote side, or a related problem.  It simply
knows that it can't push the routing state that it wants to.
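
To make that last point concrete, here is a minimal sketch of what the sender
can observe locally: whether queued output has made any progress within some
time limit. It does not reproduce the draft's text; the class, the
send_hold_time parameter, and the byte-counting approach are assumptions made
for illustration. It deliberately makes no attempt to diagnose why the peer
is not draining data, since the sender cannot know that.

```python
import time


class SendProgressTracker:
    """Hypothetical helper: detects output stalled toward a peer."""

    def __init__(self, send_hold_time, clock=time.monotonic):
        self.send_hold_time = send_hold_time  # seconds; illustrative knob
        self.clock = clock                    # injectable so it can be tested
        self.last_progress = clock()
        self.pending = 0                      # bytes queued, not yet accepted

    def queue(self, nbytes):
        """Record that nbytes of BGP messages are waiting to be sent."""
        if self.pending == 0:
            # Nothing was waiting before, so the stall clock starts now.
            self.last_progress = self.clock()
        self.pending += nbytes

    def wrote(self, nbytes):
        """Record that the socket accepted nbytes (for example, after send())."""
        if nbytes > 0:
            self.pending = max(0, self.pending - nbytes)
            self.last_progress = self.clock()

    def stalled(self):
        """True if data is pending and none was accepted for send_hold_time."""
        idle = self.clock() - self.last_progress
        return self.pending > 0 and idle >= self.send_hold_time
```

What to do when stalled() becomes true (log it, or tear the session down) is
the policy question under discussion; the sketch only covers detection.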

Prior and ongoing discussions about BGP troubleshooting (see for example the
thread in grow about the BGP Looking Glass Capability) can be partially
informed by these discussions.  This form of congestion means that an
in-band solution to such troubleshooting is problematic for some scenarios.

-- Jeff