Re: [Idr] TCP & BGP: Some don't send terminate BGP when holdtimer expired, because TCP recv window is 0

Tony Li <> Fri, 11 December 2020 22:31 UTC

To: Robert Raszuk
Cc: Job Snijders

Hi Robert,

> * Is the "unable to send" only possible under Window = 0 ? What if there is a local NIC buffer full and we keep dropping it locally ? If we are going there perhaps we could say "unable to successfully send" meaning send and get an ACK for it ? 

There are many, many reasons why we might not be able to exchange bits. The specifics aren’t particularly relevant. The point is that we’re not able to make progress, so the session is clearly broken.

> * The proposal is about reusing the HOLD TIME value to bring BGP down when you are still receiving keepalives however peer sent ACK for the last segment indicating zero window - is this right ? 

More generally, the proposal is that we apply the HOLD TIME on the transmit side as well as the receive side. If we are not able to transmit for that period of time, the receiver should give up and so should the transmitter. The session is broken, updates cannot flow, and we no longer have (eventual) consistency.
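A minimal sketch of what enforcing the HOLD TIME on the transmit side could look like. The class name, method names, and the injectable clock are hypothetical and not taken from any implementation or draft. The sketch only tracks how long queued data has made no forward progress, and it treats a hold time of zero as "timers disabled", in line with RFC 4271.

```python
import time


class SendHoldTimer:
    """Hypothetical helper: tracks how long the sender has been stuck."""

    def __init__(self, hold_time, now=time.monotonic):
        self.hold_time = hold_time   # negotiated HOLD TIME, in seconds
        self._now = now              # injectable clock (assumption, eases testing)
        self._blocked_since = None   # None while the sender is making progress

    def on_send_blocked(self):
        """Call when data is queued but the transport accepts none of it."""
        if self._blocked_since is None:
            # Remember only the start of the stall, not every failed attempt.
            self._blocked_since = self._now()

    def on_send_progress(self):
        """Call whenever the transport accepts at least some queued bytes."""
        self._blocked_since = None

    def expired(self):
        """True once the sender has been stuck for a full HOLD TIME."""
        if self.hold_time == 0 or self._blocked_since is None:
            return False
        return self._now() - self._blocked_since >= self.hold_time
```

When `expired()` returns true, the transmitter would tear the session down just as the receiver does on its own hold timer expiry. Why the transport is stuck (zero window, full local buffers, a bug) does not matter to the timer.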

> * What happens if our side is stuck and not able to process subsequent ACKs which potentially increase the window on the peer ? 

Then our side has some kind of bug. But the result is the same: the session is stuck and is not helpful.  Tearing down the session and starting over is probably the best that we can do.

> * Most deployments use BFD to make sure peer is reachable. As BFD is often offloaded from CPU this is not an indication of health of TCP or BGP path. But what if some deployments can not use BFD (say not supported by the peer) ? Then those typically reduce BGP HOLD TIME. Would it not be too fragile in some cases to bring a TCP down in such cases too fast ? Aren't we overloading HOLD TIME value here a bit ? 

The HOLD TIME is already enforced by the receiver. The only clarification that the proposal makes is that the transmitter should also enforce it.

> * Assume we bring TCP down ... when does it attempt to go up again ? Are we ok to bring it up only after manual/script action from such a state ? 

That is (and has always been) at the discretion of the implementation. Automatic periodic recovery would be preferable as a means of minimizing managerial overhead.
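As one illustration of automatic recovery, an implementation might retry on a schedule that backs off up to a ceiling. This is only a sketch: the function name, the choice of exponential backoff, and every value below are assumptions, and nothing here is mandated by the proposal or by the BGP specification.

```python
from itertools import islice


def retry_delays(base=30.0, factor=2.0, cap=600.0):
    """Yield successive wait times (seconds) before each reconnect attempt.

    All defaults are illustrative assumptions, not recommended values.
    """
    delay = base
    while True:
        yield delay
        # Grow the delay after each failed attempt, but never beyond the cap.
        delay = min(delay * factor, cap)


# First six waits an operator would see with the assumed defaults.
first_six = list(islice(retry_delays(), 6))
```

A fixed interval (`factor=1.0`) would also count as "periodic". The backoff is just one way to keep a persistently broken peer from being retried too aggressively.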

> * As proposed it seems that the change will affect all AFI/SAFIs using given single TCP session. Even if perhaps all are perfectly healthy and running on separate cores. If each SAFI would run on a separate TCP session this would not have such an impact. 

Creating more TCP sessions is not likely to improve the behavior of a TCP receiver.

> * The change seems applicable to both iBGP and eBGP right ? 

Yes.  It’s fundamental.

> In summary the attempt here is to fix application issues by cutting the transport. Sure half broken transport for whatever reason may not be a good thing to keep in UP state. Especially when redundancy is in place. But my main concerns are that we are only trying to focus on a single low level trigger to detect it. 

The point here is simply a clarification for robustness. If a (half) session is not making progress, then the receiver should terminate the session. With this clarification, we make it explicit that the transmitter may do so as well. The likely scenarios where this would come into play are serious software bugs where the transmitter or receiver is not able to make progress. This could be due to transport issues or infrastructure issues. As you note, trying to continue to work with the session is unlikely to be beneficial.

> Wouldn't per AFI/SAFI heartbeat be a better option to detect if a peer's BGP stack is still up and running fine for all applications ? 

That would add considerable complexity and still not address the stuck transmitter.