Re: [Idr] WG Adoption call for draft-spaghetti-idr-bgp-sendholdtimer-09 (2/28/2023 to 3/14/2023)

Nick Hilliard <nick@foobar.org> Tue, 07 March 2023 12:19 UTC

To: Robert Raszuk <robert@raszuk.net>
Cc: Mikael Abrahamsson <swmike=40swm.pp.se@dmarc.ietf.org>, idr@ietf.org
Archived-At: <https://mailarchive.ietf.org/arch/msg/idr/vaN72cmrtHXS4aBVPrwCFJ-1TMg>

The thing is that SendHoldTimer is a timer of last resort: it only 
fires when the control plane is wedged to the point that it's 
obviously broken.

You'd need to ask the authors what sort of value they would be happy 
to see as a minimum. Personally, I don't see why an operator should be 
forced into accepting 1h as a minimum if they feel that a problem could 
be caught sooner.  I'd suggest that if a local host detects that a 
session has been wedged for 15m, then it would be fair to assume it's 
dead. Even 15m makes debugging roughly twice as slow as it would be 
with an 8m minimum.

The other thing is that a recommended value is not the same as a minimum 
value.  Maybe the following approach would work:

1. SendHoldTimer MUST be strictly greater than 2x HoldTimer 
[explanation: this is the timer of last resort]

2. SendHoldTimer MUST be >= 10m [explanation: being wedged for 10m 
probably means that there is a control plane problem, and a 10m floor 
still lets the operator drop the value that far for debugging purposes 
if they need to]

3. default value for SendHoldTimer of 20m [explanation: 2x 
MinSendHoldTimer provides a more conservative default]
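
As a rough illustration of how the three rules above might be checked 
together, here is a minimal sketch. The function and constant names are 
made up for the example and do not come from the draft; all times are in 
seconds.

```python
# Illustrative sketch only: names are hypothetical, not taken from the draft.
MIN_SEND_HOLD_TIMER = 10 * 60                      # rule 2: 10 minute floor
DEFAULT_SEND_HOLD_TIMER = 2 * MIN_SEND_HOLD_TIMER  # rule 3: 20 minute default


def check_send_hold_timer(send_hold_timer, hold_timer):
    """Return a list of rule violations; an empty list means acceptable."""
    problems = []
    # Rule 1: strictly greater than twice the negotiated HoldTimer.
    if send_hold_timer <= 2 * hold_timer:
        problems.append("must be strictly greater than 2x HoldTimer")
    # Rule 2: never below the 10 minute floor.
    if send_hold_timer < MIN_SEND_HOLD_TIMER:
        problems.append("must be >= 10m")
    return problems


if __name__ == "__main__":
    # A 90s HoldTimer with the 20m default passes both checks.
    print(check_send_hold_timer(DEFAULT_SEND_HOLD_TIMER, 90))
    # 8m clears rule 1 for a 180s HoldTimer but is below the 10m floor.
    print(check_send_hold_timer(8 * 60, 180))
```

With a HoldTimer of 10m or more, the 20m default would itself fail rule 
1; the sketch only reports that and does not try to resolve it.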

There are no magic numbers here.

Nick

Robert Raszuk wrote on 07/03/2023 11:16:
> Hi Nick,
> 
> Great - so would co-authors be ok to add MUST to the draft that MIN 
> value of SendHoldTimer  should be 60 minutes ?
> 
> If so I think it would clear any of my concerns.
> 
> Thx,
> R.
> 
> 
> 
> On Tue, Mar 7, 2023 at 12:11 PM Nick Hilliard <nick@foobar.org> wrote:
> 
>     Mikael Abrahamsson wrote on 07/03/2023 10:09:
>      > On Tue, 7 Mar 2023, Robert Raszuk wrote:
>      >> But we discussed this ... If network is stable and you are only putting
>      >> keepalives say every 60 sec it will be hours before the socket buffer
>      >> gets filled and you get first notification about it to local BGP process.
>      >>
>      >> How is the solution presented in this draft helpful ?
>      >
>      > Our network (which has had this happen to us) sends data all the time,
>      > and even if it takes hours it's still better than the alternative of not
>      > detecting it at all.
> 
>     this draft is about unsticking connections which have been jammed for
>     time periods longer than hours [bgpzombies], and it's normal for BGP
>     session uptime to be measured in months rather than hours. For example,
>     here's the output of an entirely unremarkable dfz leaf router:
> 
>      > Neighbor      V           AS  MsgRcvd MsgSent     TblVer  InQ OutQ Up/Down  State/PfxRcd
>      > 185.x.x.a     4         zzzz   226429  218447 1698392790    0    0 9w6d                2
>      > 185.x.x.b     4         zzzz  1230618  591626 1698392790    0    0 26w6d           10543
>      > 185.x.x.c     4         zzzz   247092  103288 1698392790    0    0 4w4d            10543
>      > 185.x.x.d     4         zzzz   643980  670503 1698392790    0    0 30w3d               0
>      > 185.x.x.e     4         zzzz 23580593  525601 1698392790    0    0 23w6d          906760
>      > 185.x.x.f     4         zzzz   363536  376941 1698392790    0    0 17w1d              25
> 
>     Nick
>