Re: [Idr] TCP & BGP: Some don't send terminate BGP when holdtimer expired, because TCP recv window is 0

Brian Dickson <brian.peter.dickson@gmail.com> Wed, 16 December 2020 23:09 UTC

From: Brian Dickson <brian.peter.dickson@gmail.com>
Date: Wed, 16 Dec 2020 15:08:47 -0800
Message-ID: <CAH1iCiotC-9tQcfNkcJKH=OcEovi1ztZoJ_eiKg_mA-Wp+FJNw@mail.gmail.com>
To: Jeffrey Haas <jhaas@pfrc.org>
Cc: "Jakob Heitz (jheitz)" <jheitz=40cisco.com@dmarc.ietf.org>, "idr@ietf.org" <idr@ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/idr/38oT9X1kqZ2V_PnEdA_8pfy_Tgs>

On Wed, Dec 16, 2020 at 1:43 PM Jeffrey Haas <jhaas@pfrc.org> wrote:

> Brian,
>
> On Wed, Dec 16, 2020 at 11:17:37AM -0800, Brian Dickson wrote:
>
> > While this is my opinion on the best way to handle it, the underlying
> > facts aren't arguable.
> > An AS-wide situation (stuck receivers with no TCP progress) would never
> > result in the AS sending withdrawals.
>
> For the incident in question, it's not possible for a single BGP
> implementation to decide that something AS-wide is happening.  And even so,
> auto-mitigation of this triggered by a single session on a single device
> would be unwise.
>
>
Fair enough.

Thinking a bit bigger-picture, who could or should be able to (a) detect,
and (b) respond to, a situation like this in future?
What are the pros/cons of different approaches, in terms of risk (of
accidental or malicious outages induced), or effectiveness?

I'll start:
A large network peering with another large network is likely to have more
visibility.
If all of the sessions are stuck (but still up), that's a much stronger
indicator, and may be a situation where auto-reaction would be
appropriate.
Assuming the auto-reaction was limited to large peers of a large network, I
think this is less risky, and still very effective.
(There would still be a challenge of how to share this state discovery
across an ASN, but that's a more constrained problem to solve, IMHO.)
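As a rough sketch of the "all sessions stuck but still up" check (not any
implementation's actual behavior), assume the per-session state has already
been collected AS-wide somehow. The SessionState fields, the 180-second stall
threshold, and the four-session minimum standing in for "large peer" are all
hypothetical choices made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SessionState:
    """One BGP session as seen by some AS-wide collector (hypothetical)."""
    peer_asn: int
    established: bool            # BGP FSM still reports Established
    secs_since_progress: float   # time since the peer last accepted data


def stuck_peer_asns(sessions, stall_threshold=180.0, min_sessions=4):
    """Return peer ASNs for which every session is up but stalled.

    Peers with fewer than min_sessions sessions are skipped, as a crude
    stand-in for "only large peers qualify for auto-reaction".
    """
    by_asn = defaultdict(list)
    for s in sessions:
        by_asn[s.peer_asn].append(s)

    stuck = []
    for asn, group in sorted(by_asn.items()):
        if len(group) < min_sessions:
            continue  # too little visibility to trust the signal
        if all(s.established and s.secs_since_progress >= stall_threshold
               for s in group):
            stuck.append(asn)
    return stuck
```

A single healthy session to a peer is enough to keep that peer off the list,
which is the conservative behavior argued for above.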




> > It has occurred and can occur, ergo it needs to be handled outside of the
> > state machine proper.
>
> For the demonstrated case, a coordinated response was needed.  To some
> extent, the argument is for tooling to permit easy shutdown of sessions.
>
> Operators have plenty of provisioning machinery.  Writing up the use case
> and motivations for encouraging such a thing seems like something
> appropriate to an operational forum.
>
>
A large ASN may want a reliable, secure method of shutting down peers for
some modest duration. Is that also something to consider developing a
solution for?
Basically, a "Shoot me now, shoot me now" kind of thing.
I don't see that being possible without signatures a la RPKI, but I don't
know whether it's something anyone would really want to have available.
The conceptual model would be that of a rapid administrative shutdown of
peering sessions, possibly with a start time and/or a pre-configured timer
to re-enable sessions.
Maybe signed with both a personal PGP key and an RPKI key, so as to have a
great deal of control over its use and trustworthiness.
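To make the conceptual model a little more concrete, here is a minimal
sketch of such a request as a data structure. Everything in it is an
assumption for illustration: the ShutdownRequest name, its fields, and the
two boolean flags that stand in for PGP and RPKI signature checks done
elsewhere. No actual signature verification is shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShutdownRequest:
    """Hypothetical signed request to hold sessions with an ASN down."""
    asn: int
    start: float            # epoch seconds at which the shutdown begins
    duration: float         # seconds until sessions may be re-enabled
    operator_sig_ok: bool   # result of a personal-key (PGP) check, done elsewhere
    rpki_sig_ok: bool       # result of an RPKI-key check, done elsewhere

    def sessions_should_be_down(self, now: float) -> bool:
        """True only inside the time window and only if both signatures passed."""
        if not (self.operator_sig_ok and self.rpki_sig_ok):
            return False  # ignore the request unless both checks succeeded
        return self.start <= now < self.start + self.duration
```

Requiring both checks means a single compromised key cannot trigger a
shutdown, and the fixed duration means sessions come back without a second
message.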

Brian