Re: [Idr] TCP & BGP: Some don't send terminate BGP when holdtimer expired, because TCP recv window is 0

Brian Dickson <brian.peter.dickson@gmail.com> Wed, 16 December 2020 20:58 UTC

References: <6F7C5906-51A8-43C2-8AEC-3DB74CB9941F@tix.at> <1B4E7C9D-BBFE-4865-87F9-133ACE55D122@cisco.com> <22C381D0-2174-4828-A724-FD97B2FE0BCB@tix.at> <9D6268BD-C555-4B9A-A883-9B55EEB5D5DA@juniper.net> <91D9B9F7-0DBE-45E6-84D5-2E3D9F8C44A1@tix.at> <X9kweQ5EtTL7tOAM@bench.sobornost.net> <CAOj+MMFySPXpE8QxcO+7szKzQ78faQASYKnBUYg_h_aLd=P4Lg@mail.gmail.com> <BYAPR11MB3207412804697588E4AA3F03C0C60@BYAPR11MB3207.namprd11.prod.outlook.com> <20201216093614.GI68083@diehard.n-r-g.com> <4E9BEA12-998A-4AD1-B342-4F26AA6EBA69@cisco.com> <20201216174319.GM68083@diehard.n-r-g.com> <BYAPR11MB320759EE6ABC8AB863BC1838C0C50@BYAPR11MB3207.namprd11.prod.outlook.com> <CAH1iCipjgS4-NPTjNhc7Cj73bitWgTcw=ufax7iOCCnT+xGiZQ@mail.gmail.com> <BYAPR11MB32076892E23403C8A6E61715C0C50@BYAPR11MB3207.namprd11.prod.outlook.com>
In-Reply-To: <BYAPR11MB32076892E23403C8A6E61715C0C50@BYAPR11MB3207.namprd11.prod.outlook.com>
From: Brian Dickson <brian.peter.dickson@gmail.com>
Date: Wed, 16 Dec 2020 12:58:02 -0800
Message-ID: <CAH1iCiqXro3E7bHvf1GZj7n5JPrKxGrrjt9zwgJSo1mS-0utuw@mail.gmail.com>
To: "Jakob Heitz (jheitz)" <jheitz@cisco.com>
Cc: Claudio Jeker <cjeker@diehard.n-r-g.com>, "idr@ietf.org" <idr@ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/idr/x7NbqQdG3vNvkLXDleoPrylHYfE>
Subject: Re: [Idr] TCP & BGP: Some don't send terminate BGP when holdtimer expired, because TCP recv window is 0

On Wed, Dec 16, 2020 at 11:26 AM Jakob Heitz (jheitz) <jheitz@cisco.com>
wrote:

> Brian,
>
>
>
> We went through this discussion when developing RFC 7606
> <https://tools.ietf.org/html/rfc7606>.
>
> What errors deserve what consequences?
>
> In implementations, we ended up making them all configurable.
>
> We can certainly make this one optional too.
>
>
>
The main thing is, even if it is configurable, it needs to have a sane
default. This is analogous to the requirement to have a configured routing
policy (any policy at all).
It should be, IMNSHO, extremely rare that anyone should want to do that
(make that override) given the global consequences of a stuck ASN.

Pretty much by definition, this is the kind of failure that normal control
plane mechanisms (including RFC 7606) could never handle per se.

A router which is slow handling updates would not trigger this issue.
It would only occur if it completely stopped working for the full UPDATE
timer duration.
At which point, it is arguable as to whether the routes it has are still
reasonable and legitimate.
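
To make the slow-versus-stuck distinction concrete, here is a minimal Python
sketch. It assumes a hypothetical SendProgressWatchdog that gets told whenever
the peer's TCP accepts more of our output; the class, its fields, and the
90 second limit are illustrative only and do not come from any real BGP
implementation. A slow peer keeps resetting the clock, a stuck one does not:

```python
from dataclasses import dataclass


@dataclass
class SendProgressWatchdog:
    """Hypothetical helper: tracks whether a peer's TCP is draining our output."""

    limit_seconds: float        # assumed stand-in for the timer duration above
    last_progress: float = 0.0  # time the peer last accepted any of our bytes

    def note_progress(self, now: float) -> None:
        # Called whenever the peer's TCP accepts more data, however slowly.
        self.last_progress = now

    def is_stuck(self, now: float) -> bool:
        # True only when no progress at all was made for the whole limit.
        return now - self.last_progress >= self.limit_seconds


# A slow peer: accepts a little data every 60 s against a 90 s limit.
slow = SendProgressWatchdog(limit_seconds=90.0)
for t in (60.0, 120.0, 180.0):
    slow.note_progress(t)
slow_stuck = slow.is_stuck(200.0)

# A stuck peer: no progress at all since t=0.
stuck = SendProgressWatchdog(limit_seconds=90.0)
stuck_stuck = stuck.is_stuck(200.0)
```

Only the second peer would be flagged; the first never goes a full limit
without progress.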

It may be forwarding packets, but there is zero ability to guarantee it
isn't causing a routing loop.
It is quite likely that it would/could, which would further exacerbate any
problems the router is experiencing.

In which case, tearing down the session is likely the better outcome in
99.99% of potential scenarios.
I'd say 100% but someone may have a counter-example...
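
As a rough illustration of what "tear down the session" with a sane default
could look like, here is a simplified Python sketch. PeerSession,
teardown_when_stuck, and handle_stuck_peer are hypothetical names, the FIB is
modelled as a plain prefix-to-next-hop dict, and a real implementation would
re-run best-path selection rather than just deleting entries:

```python
from dataclasses import dataclass, field


@dataclass
class PeerSession:
    """Hypothetical model of one BGP session and the routes learned over it."""

    peer: str
    teardown_when_stuck: bool = True  # assumed knob; the default is "on"
    established: bool = True
    adj_rib_in: dict[str, str] = field(default_factory=dict)  # prefix -> next hop


def handle_stuck_peer(session: PeerSession, fib: dict[str, str]) -> list[str]:
    """Tear the session down and return the prefixes removed from the FIB."""
    if not session.teardown_when_stuck:
        return []  # operator explicitly overrode the default
    # Only remove FIB entries that currently point at this peer's routes.
    removed = [p for p, nh in session.adj_rib_in.items() if fib.get(p) == nh]
    for prefix in removed:
        del fib[prefix]
    session.adj_rib_in.clear()
    session.established = False
    return removed


# Documentation prefixes and addresses, used purely as sample data.
session = PeerSession(peer="192.0.2.1",
                      adj_rib_in={"198.51.100.0/24": "192.0.2.1"})
fib = {"198.51.100.0/24": "192.0.2.1", "203.0.113.0/24": "192.0.2.9"}
removed = handle_stuck_peer(session, fib)
```

The override exists, but someone has to set it deliberately; doing nothing
gets the safe behaviour.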

Brian


> Regards,
>
> Jakob.
>
>
>
> *From:* Brian Dickson <brian.peter.dickson@gmail.com>
> *Sent:* Wednesday, December 16, 2020 11:18 AM
> *To:* Jakob Heitz (jheitz) <jheitz@cisco.com>
> *Cc:* Claudio Jeker <cjeker@diehard.n-r-g.com>; idr@ietf.org
> *Subject:* Re: [Idr] TCP & BGP: Some don't send terminate BGP when
> holdtimer expired, because TCP recv window is 0
>
>
>
>
>
>
>
> On Wed, Dec 16, 2020 at 10:51 AM Jakob Heitz (jheitz) <jheitz=
> 40cisco.com@dmarc.ietf.org> wrote:
>
> The restarting speaker in this case did not actually restart.
> It just restarted this one session. There is no reason for it
> to delete any forwarding state. There is no evidence of any
> problem with its received routes, only with the routes it sent
> to the stuck peer. It may still set the (F) bit to force the
> stuck peer to WITHDRAW.
>
>
>
> The transitive nature of BGP pretty much requires that the safest choice
> be taken.
>
>
>
> Which is to say, there is EVERY reason to delete forwarding state.
>
> If a peer's router is so messed up that it is not accepting any TCP
> packets, the only safe assumption is that the problem is AS-wide for that
> peer.
>
> In which case, the only data point available (the TCP session) indicates
> the problem is very likely affecting other routing announcements towards
> that ASN.
>
> And the only SAFE behavior is to tear down the session completely,
> including removing received routes from the IN-RIB, update the RIB and FIB,
> and go on with life.
>
>
>
> The incident that I believe led to this proposal was super nasty, and
> only a fix like this could handle that.
>
> There is no reason to believe this can't/won't happen again in future.
>
> As Jared notes, there are likely other implementations that would fare as
> poorly if they encountered the triggering situation.
>
>
>
> While this is my opinion on the best way to handle it, the underlying
> facts aren't arguable.
>
> An AS-wide situation (stuck receivers with no TCP progress) would never
> result in the AS sending withdrawals.
>
> The current handling of that is demonstrably broken.
>
> The assumption that the BGP state machine can be fully relied upon is no
> longer sound.
>
>
>
> It probably never was a completely safe assumption.
>
> IFF the state machine is working correctly, this corner case would never
> occur.
>
> It has occurred and can occur, ergo it needs to be handled outside of the
> state machine proper.
>
>
>
> Brian
>