Re: [Idr] TCP & BGP: Some don't send terminate BGP when holdtimer expired, because TCP recv window is 0

Robert Raszuk <robert@raszuk.net> Sat, 19 December 2020 23:46 UTC

From: Robert Raszuk <robert@raszuk.net>
Date: Sun, 20 Dec 2020 00:45:52 +0100
To: Gyan Mishra <hayabusagsm@gmail.com>
Cc: Enke Chen <enchen@paloaltonetworks.com>, John Scudder <jgs=40juniper.net@dmarc.ietf.org>, William McCall <william.mccall@gmail.com>, "idr@ietf.org" <idr@ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/idr/xVMlvTz3c2NZB62U4JimbRGDeqk>
Subject: Re: [Idr] TCP & BGP: Some don't send terminate BGP when holdtimer expired, because TCP recv window is 0

Hi Gyan,

> Something simpler than messing with TCP parameters: if, instead of
using the default 90-second BGP dead timer, it was reduced a bit
to something like 10 / 30

Sorry but again this is not the issue here.

The issue is that the receiving peer is not terminating the session after
its hold time expires.

The sender can still keep receiving updates or keepalives just fine. This
is a unidirectional issue.

The ask here is to have BGP trigger a session RST or termination at the TCP
level when we can no longer write to a TCP socket for N seconds.
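For illustration only, a minimal Python sketch of the TCP-level half of that ask, assuming a POSIX stack: setting SO_LINGER with a zero linger time before close() makes the stack discard any unsent data and reset the connection (RST) instead of sending a FIN. The function names are hypothetical and not taken from any BGP implementation; the "ii" struct layout assumes the common Linux/BSD `struct linger`.

```python
import socket
import struct


def abort_with_rst(sock: socket.socket) -> None:
    """Abortive close: l_onoff=1, l_linger=0 drops unsent data and sends RST."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
    sock.close()


def peer_sees_reset() -> bool:
    """Loopback demo: the far end observes the abortive close as ECONNRESET."""
    listener = socket.create_server(("127.0.0.1", 0))
    client = socket.create_connection(listener.getsockname())
    server, _ = listener.accept()
    try:
        abort_with_rst(client)
        server.settimeout(2.0)
        try:
            server.recv(1)
        except ConnectionResetError:
            return True  # RST arrived: connection torn down at TCP level
        return False  # orderly EOF (FIN) instead of a reset
    finally:
        server.close()
        listener.close()
```

The abortive close is the natural fit here because a plain close() on a socket whose send queue is stuck behind a zero window typically just queues the FIN behind the data that cannot be delivered.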

- - -

To summarize, from watching this thread it seems that most folks agree that,
if we do this, HOLD_SEND should be different than HOLD_RCV.

There is an ongoing discussion about keeping this at the TCP level.

There also appears to be an ask to make it the default, with a knob to disable it.

The proposed mechanics seem to be: keep a per-peer HOLD_SEND timer, start
it at each socket write failure, then stop and reset it at each socket
write success.
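A minimal Python sketch of those mechanics, for illustration only; the class and method names are hypothetical. One interpretation choice: "start" is read as "start if not already running", since restarting on every failed write would keep pushing the deadline out for as long as the peer stays stuck. The clock is injectable so the logic can be exercised without waiting.

```python
import time
from typing import Callable, Optional


class HoldSendTimer:
    """Hypothetical per-peer HOLD_SEND timer, independent of the negotiated hold time."""

    def __init__(self, hold_send: float, clock: Callable[[], float] = time.monotonic):
        self.hold_send = hold_send  # seconds without write progress before teardown
        self._clock = clock
        self._started_at: Optional[float] = None

    def on_write_failure(self) -> None:
        # Arm only on the first failure; later failures keep the original deadline.
        if self._started_at is None:
            self._started_at = self._clock()

    def on_write_success(self) -> None:
        # Any successful write stops and resets the timer.
        self._started_at = None

    def expired(self) -> bool:
        # True once the timer has been armed for at least hold_send seconds.
        return (self._started_at is not None
                and self._clock() - self._started_at >= self.hold_send)
```

The write path would call on_write_failure() when a write is refused and on_write_success() when the kernel accepts bytes, and would check expired() periodically, terminating the session when it returns True. Keeping hold_send as its own value matches the view that HOLD_SEND should differ from HOLD_RCV.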

The other day I asked how often BGP retries writing to the socket in the most
widely deployed implementations, but did not get any answer :(
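What counts as a "socket write failure" can be sketched with a non-blocking socket: once the peer stops reading and its receive window closes, the local send buffer fills and send() raises BlockingIOError (EAGAIN/EWOULDBLOCK). This is a minimal Python loopback demo under that assumption; the helper names and chunk sizes are hypothetical, and how often a real implementation retries is exactly the open question above.

```python
import socket
from typing import Tuple


def try_write(sock: socket.socket, data: bytes) -> int:
    """One non-blocking write attempt; 0 means the kernel accepted nothing."""
    try:
        return sock.send(data)
    except BlockingIOError:
        return 0  # send buffer full: a "socket write failure"


def queued_until_blocked(chunk: int = 65536, max_chunks: int = 10_000) -> Tuple[int, bool]:
    """Loopback demo of a peer that never reads, so its receive window closes."""
    listener = socket.create_server(("127.0.0.1", 0))
    sender = socket.create_connection(listener.getsockname())
    receiver, _ = listener.accept()  # accepted but never read from
    sender.setblocking(False)
    payload = b"\x00" * chunk
    total, blocked = 0, False
    try:
        for _ in range(max_chunks):
            n = try_write(sender, payload)
            if n == 0:
                blocked = True
                break
            total += n
    finally:
        for s in (sender, receiver, listener):
            s.close()
    return total, blocked
```

Here the loop stops at the first refused write; a BGP implementation would instead wait (for example in select/poll) and retry, and its retry cadence determines how quickly a HOLD_SEND-style timer notices the stall.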

Best,
R.


On Sun, Dec 20, 2020 at 12:20 AM Gyan Mishra <hayabusagsm@gmail.com> wrote:

>
> Hi Robert
>
> On the other thread it was not quite clear.
>
> So if this scenario is completely devoid of link congestion and is purely a
> management-plane issue (TCP control-plane or BGP socket processing), then
> I agree BFD won’t help at all.
>
> I agree this points to a poor RP management-plane design that either leads
> to the RP being overwhelmed (high CPU and memory) or possibly to a memory
> leak or bug.
>
> Do we know which vendor?
>
> Something simpler than messing with TCP parameters: if, instead of
> using the default 90-second BGP dead timer, it was reduced a bit
> to something like 10 / 30, that could limit the time traffic is black-holed
> and not rerouted to an alternate path until the hold timer expires.
>
>
> Kind Regards
>
> Gyan
>
> On Sat, Dec 19, 2020 at 5:18 PM Robert Raszuk <robert@raszuk.net> wrote:
>
>> Hi Gyan,
>>
>> > Going down this path does seem a lot more complicated and riskier
>> > than using BFD.
>>
>> But BFD is not going to help at all to the problem at hand.
>>
>> BFD is in the vast majority of cases distributed (and that is a feature,
>> not a bug) and responses are handled by line cards.
>>
>> Here we are dealing with bugs in RE/RP-based subsystems, regardless of
>> whether those are in the TCP or BGP layer.
>>
>> Thx,
>> R.
>>
>> On Sat, Dec 19, 2020 at 10:36 PM Gyan Mishra <hayabusagsm@gmail.com> wrote:
>>
>>>
>>> Here is RFC 5482, the TCP User Timeout Option, from the TCPM WG.
>>>
>>> https://tools.ietf.org/html/rfc5482
>>>
>>> TCPM has a bis draft updating RFC 793 that has more information than the original.
>>>
>>> https://datatracker.ietf.org/wg/tcpm/documents/
>>>
>>> https://tools.ietf.org/html/draft-ietf-tcpm-rfc793bis-19#page-42
>>>
>>>
>>> From a quick read, there are caveats around devices supporting or not
>>> supporting the option.
>>>
>>> Also, I guess setting the value is tricky as well: neither too low nor too
>>> high, as either could make matters worse through instability.
>>>
>>> Going down this path does seem a lot more complicated and riskier than
>>> using BFD.
>>>
>>>
>>> Kind Regards
>>>
>>> Gyan
>>>
>>>
>>> On Sat, Dec 19, 2020 at 5:38 AM William McCall <william.mccall@gmail.com> wrote:
>>>
>>>> On Fri, Dec 18, 2020 at 10:33 PM John Scudder <jgs=40juniper.net@dmarc.ietf.org> wrote:
>>>> >
>>>> > On Dec 18, 2020, at 1:09 PM, Enke Chen <enchen@paloaltonetworks.com> wrote:
>>>> > >
>>>> > > No, I am not assuming that packets are getting somewhere. The
>>>> > > TCP_USER_TIMEOUT would work as long as there is "pending data" (either
>>>> > > unacked, or locally queued). The data can be from the local BGP
>>>> > > Keepalives or the TCP_KEEPALIVE.
>>>> >
>>>> > Apart from the other objections to relying on TCP_USER_TIMEOUT, which
>>>> > I think are sufficient, it’s not clear to me that implementations will
>>>> > provide the desired semantics. RFC 793 seems like it specifies the right
>>>> > semantics (“get this data to the peer within N seconds or close”):
>>>> >
>>>> >         The timeout, if present, permits the caller to set up a timeout
>>>> >         for all data submitted to TCP.  If data is not successfully
>>>> >         delivered to the destination within the timeout period, the TCP
>>>> >         will abort the connection.  The present global default is five
>>>> >         minutes.
>>>> >
>>>> > However the Linux man page documents different semantics:
>>>> >
>>>> >        TCP_USER_TIMEOUT (since Linux 2.6.37)
>>>> >               This option takes an unsigned int as an argument.  When the
>>>> >               value is greater than 0, it specifies the maximum amount of
>>>> >               time in milliseconds that transmitted data may remain
>>>> >               unacknowledged before TCP will forcibly close the
>>>> >               corresponding connection and return ETIMEDOUT to the
>>>> >               application.  If the option value is specified as 0, TCP will
>>>> >               use the system default.
>>>> >
>>>> > The important difference being that whereas 793 implies data written
>>>> > to the socket, the Linux man page says “transmitted” data, which seems
>>>> > like it must mean data TCP has written to the network. These are two very
>>>> > different things! If Linux (or another stack) implements what the man
>>>> > page seems to say, it’s not useful for our purposes.
>>>> >
>>>> > —John
>>>> > _______________________________________________
>>>> > Idr mailing list
>>>> > Idr@ietf.org
>>>> > https://www.ietf.org/mailman/listinfo/idr
>>>>
>>>> I was curious too. I read the man page, the relevant Linux kernel code, the
>>>> RFC, and hacked up a test case (unicast me if you want the code).
>>>> Also, Cloudflare published a relevant blog entry[0]. For this specific
>>>> scenario, see under the sub-heading "Zero window ESTAB is...
>>>> forever?".
>>>>
>>>> TCP_USER_TIMEOUT doesn't appear to kick in until there is unACKed
>>>> data, meaning that it has already been transmitted from TCP's
>>>> perspective. Stuff hanging around in the buffers due to persist state
>>>> doesn't seem to count, per the test results and the docs. I think this
>>>> confirms your reading.
>>>>
>>>> [0] https://blog.cloudflare.com/when-tcp-sockets-refuse-to-die/
>>>>
>>>> --
>>>> William McCall
>>>>
>>> --
>>>
>>> <http://www.verizon.com/>
>>>
>>> *Gyan Mishra*
>>>
>>> *Network Solutions Architect*
>>>
>>> *M 301 502-1347*
>>> *13101 Columbia Pike, Silver Spring, MD*
>>>
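On the TCP_USER_TIMEOUT discussion quoted above, a minimal Python sketch of setting the Linux socket option and reading it back, for illustration only. It assumes a Linux build where Python exposes `socket.TCP_USER_TIMEOUT` (value in milliseconds, per the man page text quoted above) and returns None elsewhere; the helper name is hypothetical. Per William's test results, this alone should not be expected to cover a zero-window peer, since the timeout did not appear to apply to data held in persist state. It also only sets a local timeout, which is not the same as exchanging the RFC 5482 option with the peer.

```python
import socket
from typing import Optional


def set_user_timeout(sock: socket.socket, timeout_ms: int) -> Optional[int]:
    """Set TCP_USER_TIMEOUT locally and return the value the kernel reports."""
    opt = getattr(socket, "TCP_USER_TIMEOUT", None)
    if opt is None:
        return None  # option not exposed on this platform
    sock.setsockopt(socket.IPPROTO_TCP, opt, timeout_ms)
    return sock.getsockopt(socket.IPPROTO_TCP, opt)
```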