Re: [Idr] TCP & BGP: Some don't send terminate BGP when holdtimer expired, because TCP recv window is 0

Jeffrey Haas <jhaas@pfrc.org> Wed, 16 December 2020 21:44 UTC

Date: Wed, 16 Dec 2020 17:01:22 -0500
From: Jeffrey Haas <jhaas@pfrc.org>
To: Brian Dickson <brian.peter.dickson@gmail.com>
Cc: "Jakob Heitz (jheitz)" <jheitz=40cisco.com@dmarc.ietf.org>, "idr@ietf.org" <idr@ietf.org>
Message-ID: <20201216220122.GE24940@pfrc.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/idr/ru2hNpNQHdIRVEY99W4onMsnv3c>

Brian,

On Wed, Dec 16, 2020 at 11:17:37AM -0800, Brian Dickson wrote:
> Which is to say, there is EVERY reason to delete forwarding state.
> If a peer's router is so messed up that it is not accepting any TCP
> packets, the only safe assumption is that the problem is AS-wide for that
> peer.

While your later text does cover concerns about AS-wide events, I'd like to
suggest that your assumption doesn't necessarily hold.

A situation where we enter a half-duplex state and can't get a response to
the packets we're pushing might impact only a single peering session.  The
demonstration machinery Job mentions elsewhere in-thread is an example of
this.  Active attacks against TCP windowing mechanisms are another.

> While this is my opinion on the best way to handle it, the underlying facts
> aren't arguable.
> An AS-wide situation (stuck receivers with no TCP progress) would never
> result in the AS sending withdrawals.

For the incident in question, it's not possible for a single BGP
implementation to decide that something AS-wide is happening.  And even if
it could, auto-mitigation triggered by a single session on a single device
would be unwise.

> It has occurred and can occur, ergo it needs to be handled outside of the
> state machine proper.

For the demonstrated case, a coordinated response was needed.  To some
extent, the argument is for tooling to permit easy shutdown of sessions.

Operators have plenty of provisioning machinery.  Writing up the use case
and the motivations for encouraging such tooling seems appropriate for an
operational forum.

-- Jeff