Re: [Idr] Fwd: New Version Notification for draft-spaghetti-idr-bgp-sendholdtimer-05.txt

Job Snijders <job@fastly.com> Wed, 03 August 2022 14:36 UTC

Date: Wed, 03 Aug 2022 16:36:33 +0200
From: Job Snijders <job@fastly.com>
To: Robert Raszuk <robert@raszuk.net>
Cc: heasley <heas@shrubbery.net>, "idr@ietf.org" <idr@ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/idr/2vLPDO1_aaH947NdoBX-OP7gmNg>

On Wed, Aug 03, 2022 at 03:11:33PM +0200, Robert Raszuk wrote:
> OK, many thanks for this clarification. I think it would be great to add
> it to the draft, as far as send() returning an error is concerned.

The point is that send() does not return an error; it blocks. (A
blocking system call is one that must wait until the action can be
completed.)
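To see this without hanging a terminal, here is a small Python sketch using only the standard socket module. It writes into one end of a local socketpair whose other end never reads (a stand-in for a peer that has stopped processing inbound data). The socket is switched to non-blocking mode only so the demo terminates: where a blocking send() would simply wait forever, the non-blocking call raises BlockingIOError (EAGAIN/EWOULDBLOCK) instead. The helper name is hypothetical, and the byte count depends on the operating system's buffer sizes.

```python
import socket


def fill_until_blocked(chunk_size: int = 65536) -> int:
    """Write to a socket whose peer never reads, until send() would block.

    Returns how many bytes the kernel accepted first. Hypothetical helper,
    for illustration only.
    """
    local, remote = socket.socketpair()  # 'remote' plays a peer that never reads
    local.setblocking(False)  # a blocking socket would hang in send() below
    chunk = b"\x00" * chunk_size
    sent = 0
    try:
        while True:
            sent += local.send(chunk)
    except BlockingIOError:
        # EAGAIN/EWOULDBLOCK: the socket buffers are full. No error is
        # reported in blocking mode; send() would just wait here instead.
        pass
    finally:
        local.close()
        remote.close()
    return sent


print(f"kernel accepted {fill_until_blocked()} bytes before send() would block")
```

Nothing in the sending application sees a failure; the data just stops moving, which is why a stuck peer can go unnoticed.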

> While I appreciate the proposal, I am still not sure if this is a good
> idea. We are dealing with a peer who, in spite of not being able to
> receive our BGP messages (UPDATEs or KEEPALIVEs), keeps the session up
> for "hours/days/weeks". Isn't this behaviour a direct manifestation of
> ignoring the normative aspects of RFC4271 Section 6.5?
> 
> 6.5.  Hold Timer Expired Error Handling
> 
>    If a system does not receive successive KEEPALIVE, UPDATE, and/or
>    NOTIFICATION messages within the period specified in the Hold Time
>    field of the OPEN message, then the NOTIFICATION message with the
>    Hold Timer Expired Error Code is sent and the BGP connection is
>    closed.

The subtlety here is that the above detection mechanism cannot cope with
a bad peer that periodically sends KEEPALIVEs (which can happen if
KEEPALIVE generation is done in a separate process or thread) while not
processing inbound messages from the healthy peer. There are two
aspects to consider:

1/ A system might be unable to close the BGP connection because the
final NOTIFICATION message is stuck in the "SendQ". This in turn means
the local system can't proceed to generate withdrawals (UPDATE messages
withdrawing routes) for routes pointing to the broken peer. See this
animation for an illustration:
https://twitter.com/JobSnijders/status/1337803996657016833

2/ The § 6.5 error handling is asynchronous: moving on to the next
state of the FSM depends on successfully receiving successive
KEEPALIVEs, not on successfully writing data to the remote (bad) peer.
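Both aspects can be illustrated with a toy, simulated-clock model in Python. Everything in it (the Session class, send_hold_time, sendq_bytes, the 30 s KEEPALIVE interval, the 90 s and 180 s timer values) is a hypothetical simplification, not the normative procedure of the draft: the RFC 4271 hold timer only watches inbound messages, so a peer that keeps emitting KEEPALIVEs never trips it, while a timer that watches outbound write progress, in the spirit of the SendHoldTimer in draft-spaghetti-idr-bgp-sendholdtimer, does expire.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Session:
    """Toy model of one BGP session's liveness timers (all names hypothetical)."""
    hold_time: float               # receive side, as in RFC 4271 Section 6.5
    send_hold_time: float          # send side, loosely modelled on the draft's idea
    last_rx: float = 0.0           # last time a KEEPALIVE/UPDATE/NOTIFICATION arrived
    last_tx_progress: float = 0.0  # last time the peer drained any of our SendQ

    def on_message_received(self, now: float) -> None:
        self.last_rx = now

    def on_write_progress(self, now: float) -> None:
        self.last_tx_progress = now

    def hold_timer_expired(self, now: float) -> bool:
        # Only looks at inbound traffic; a stuck SendQ is invisible here.
        return now - self.last_rx >= self.hold_time

    def send_hold_timer_expired(self, now: float, sendq_bytes: int) -> bool:
        # Fires only while we have pending data that the peer is not draining.
        return sendq_bytes > 0 and now - self.last_tx_progress >= self.send_hold_time


def simulate_bad_peer(duration: int = 600) -> Tuple[bool, Optional[int]]:
    """Peer sends a KEEPALIVE every 30 s but never reads anything we send."""
    s = Session(hold_time=90, send_hold_time=180)
    sendq_bytes = 4096  # our UPDATEs (and any final NOTIFICATION) sit in the SendQ
    hold_fired = False
    send_hold_fired_at: Optional[int] = None
    for now in range(duration + 1):  # one tick per simulated second
        if now % 30 == 0:
            s.on_message_received(now)  # the peer's KEEPALIVE thread keeps running
        # on_write_progress() is never called: the peer never reads.
        if s.hold_timer_expired(now):
            hold_fired = True
        if send_hold_fired_at is None and s.send_hold_timer_expired(now, sendq_bytes):
            send_hold_fired_at = now
    return hold_fired, send_hold_fired_at


print(simulate_bad_peer())  # → (False, 180)
```

The receive-side timer never fires, however long the simulation runs; only the send-side check notices that nothing is leaving the SendQ. In this model, acting on that expiry cannot depend on a final NOTIFICATION being delivered, since that message would be queued behind the same stuck data.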

> But worse, we are allowing the broken peer to keep running, potentially
> poisoning the network, and just cutting ourselves off from it. He may
> have ignored our withdrawals, and hundreds of peers may be using wrong
> information.

Which is *exactly* the reason to cut the poisonous system off the
network.

> IMO such stuck peers should trigger a HIGH SEVERITY syslog message, and
> the NOC should take action by calling the other party or manually
> bringing the peer down. If you say this state is being ignored for
> "hours/days/weeks", I am afraid this is not a protocol problem but an
> operational issue, modulo the lack of proper alarming.

Why wait for manually bringing down the peer? :-)

I'm happy to see you now agree the problem scenario is a "HIGH SEVERITY"
situation!

Kind regards,

Job