Re: [Idr] TCP & BGP: Some don't send terminate BGP when holdtimer expired, because TCP recv window is 0

Robert Raszuk <robert@raszuk.net> Sun, 20 December 2020 12:25 UTC

From: Robert Raszuk <robert@raszuk.net>
Date: Sun, 20 Dec 2020 13:25:21 +0100
Message-ID: <CAOj+MMEDuUBqo2UOj8OnbEkgwTvfioD_k2t-y9S_V+59=jMURA@mail.gmail.com>
To: Gyan Mishra <hayabusagsm@gmail.com>
Cc: Enke Chen <enchen@paloaltonetworks.com>, John Scudder <jgs=40juniper.net@dmarc.ietf.org>, William McCall <william.mccall@gmail.com>, "idr@ietf. org" <idr@ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/idr/dCeP3VRMmsulduA4SGIM44MZKFU>

Gyan,

Not sure what you are really trying to highlight in your last note, but if
we applied the same HOLD TIME used today to detect BGP liveness to
detect a "stuck" peer, it would be a pretty bad choice for obvious reasons.

Thx,
R

On Sun, Dec 20, 2020 at 8:47 AM Gyan Mishra <hayabusagsm@gmail.com> wrote:

>
> Hi Robert
>
> Thank you for the summarization.
>
> I agree this is a critical issue to be solved as it can happen to any
> global routing system.
>
> +1 with Tony, Jeff and others to apply the hold time to the transmit side,
> and plan to update the text of RFC 4271 section 6.5 to reflect this.
>
> Excerpt from Tony:
>
> “More generally, the proposal is that we apply the HOLD TIME on the
> transmit side as well as the receive side. If we are not able to transmit
> for that period of time, the receiver should give up and so should the
> transmitter. The session is broken, updates cannot flow, and we no longer
> have (eventual) consistency.”
>
> Kind Regards
>
> Gyan
>
> On Sat, Dec 19, 2020 at 6:46 PM Robert Raszuk <robert@raszuk.net> wrote:
>
>> Hi Gyan,
>>
>> > Something simple other than messing with TCP parameters, if instead of
>> > using the default 90 second BGP dead timer, if that was reduced down a bit
>> > to like 10 / 30
>>
>> Sorry but again this is not the issue here.
>>
>> The issue is that the rcv peer is not terminating the session after the
>> holdtime expires.
>>
>> The sender can still keep receiving updates or keepalives just fine. This
>> is a unidirectional issue.
>>
>> The ask here is to have BGP trigger session RST or termination at TCP
>> level when we can no longer write to a TCP socket for N seconds.
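
On the "RST" half of that ask, here is a minimal sketch in Python of how an application on a POSIX stack can arrange for close() to abort the connection instead of attempting a graceful shutdown. The helper name enable_abortive_close is hypothetical, and the option layout (two C ints) is an assumption that holds on Linux and the BSDs but not on every platform.

```python
import socket
import struct


def enable_abortive_close(sock: socket.socket) -> None:
    """Make a later close() reset the connection instead of lingering.

    Assumes a POSIX `struct linger` of two C ints (l_onoff, l_linger).
    """
    # l_onoff=1, l_linger=0: close() discards any unsent data and the
    # stack aborts the connection with a RST rather than a FIN.
    linger = struct.pack("ii", 1, 0)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, linger)
```

The point of the sketch is only that the teardown does not have to wait behind data the peer is refusing to accept; whether an implementation sends a NOTIFICATION first is a separate decision.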
>>
>> - - -
>>
>> To summarize, watching this thread it seems that most folks agree that if
>> we do that, HOLD_SEND should be different from HOLD_RCV.
>>
>> There is ongoing discussion to keep this at TCP level.
>>
>> There is an apparent ask to make it a default with a knob to disable it.
>>
>> The mechanics proposed seem to be to keep a per-peer HOLD_SEND timer,
>> start it at each socket write failure, and stop and reset it at each
>> socket write success.
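
The timer mechanics described above can be sketched roughly as follows. This is an illustration only, not any vendor's implementation: the class name SendHoldTimer, its method names, and the injectable clock are all hypothetical. It also makes one interpretive assumption: the timer starts on the first failed write in a run of failures and is not restarted by later failures, since otherwise it could never expire while retries keep failing.

```python
import time
from typing import Callable, Optional


class SendHoldTimer:
    """Tracks how long writes to one peer's socket have been failing."""

    def __init__(self, hold_send_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.hold_send_seconds = hold_send_seconds
        self._clock = clock
        # None means the timer is stopped (last write succeeded).
        self._stalled_since: Optional[float] = None

    def on_write_failure(self) -> None:
        # Start the timer on the first failure; keep it running afterwards.
        if self._stalled_since is None:
            self._stalled_since = self._clock()

    def on_write_success(self) -> None:
        # Any successful write stops and resets the timer.
        self._stalled_since = None

    def expired(self) -> bool:
        # True once writes have been failing for HOLD_SEND seconds or more.
        if self._stalled_since is None:
            return False
        return self._clock() - self._stalled_since >= self.hold_send_seconds
```

A speaker would call on_write_failure() or on_write_success() after each attempt to write to the peer's socket and tear the session down once expired() returns True. How often those write attempts happen is exactly the open question below.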
>>
>> The other day I asked how often BGP retries writing to the socket in the
>> most widely deployed implementations - but did not get any answer :(
>>
>> Best,
>> R.
>>
>>
>> On Sun, Dec 20, 2020 at 12:20 AM Gyan Mishra <hayabusagsm@gmail.com>
>> wrote:
>>
>>>
>>> Hi Robert
>>>
>>> On the other thread it was not quite clear.
>>>
>>> So if this scenario is completely devoid of link congestion and is purely
>>> a management plane / control plane issue in TCP or BGP socket processing,
>>> then I agree BFD won’t help at all.
>>>
>>> I agree the cause is poor RP design of the management plane, which either
>>> leads to the RP being overwhelmed by high CPU and memory use, or possibly
>>> a memory leak or bug.
>>>
>>> Do we know which vendor?
>>>
>>> Something simple other than messing with TCP parameters: if instead of
>>> using the default 90 second BGP dead timer it was reduced a bit, to
>>> something like 10 / 30, that could limit the time traffic is black-holed
>>> and not rerouted to an alternate path until the hold timer expires.
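
For reference, the arithmetic behind a setting like "10 / 30" can be sketched as below, reading it as a 10 second keepalive with a 30 second hold time (that reading is an assumption). The function names are hypothetical. The rules encoded are the ones in RFC 4271: the session uses the smaller of the two advertised Hold Times, a Hold Time must be zero or at least three seconds, and one third of the Hold Time is the suggested keepalive interval.

```python
def negotiated_hold_time(local: int, remote: int) -> int:
    """Return the Hold Time both peers will use, in seconds."""
    for value in (local, remote):
        if value != 0 and value < 3:
            raise ValueError("Hold Time must be 0 or at least 3 seconds")
    # Each speaker uses the smaller of its own and its peer's value.
    return min(local, remote)


def keepalive_interval(hold_time: int) -> int:
    """One third of the Hold Time; 0 means keepalives are not sent."""
    return hold_time // 3


# One side configured for 90 seconds, the other for 30 seconds.
hold = negotiated_hold_time(90, 30)
print(hold, keepalive_interval(hold))  # prints "30 10"
```

Lowering the value only shortens how long the receive side waits before declaring the peer dead; it does nothing for a sender whose writes are stuck.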
>>>
>>>
>>> Kind Regards
>>>
>>> Gyan
>>>
>>> On Sat, Dec 19, 2020 at 5:18 PM Robert Raszuk <robert@raszuk.net> wrote:
>>>
>>>> Hi Gyan,
>>>>
>>>> > Going down this path does seem a lot more complicated and riskier
>>>> > than using BFD.
>>>>
>>>> But BFD is not going to help at all with the problem at hand.
>>>>
>>>> BFD is in the vast majority of cases distributed (and that is a feature,
>>>> not a bug) and responses are handled by line cards.
>>>>
>>>> Here we are dealing with RE/RP based subsystem bugs, regardless of
>>>> whether those are in the TCP or BGP layer.
>>>>
>>>> Thx,
>>>> R.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Sat, Dec 19, 2020 at 10:36 PM Gyan Mishra <hayabusagsm@gmail.com>
>>>> wrote:
>>>>
>>>>>
>>>>> Here is the RFC 5482 TCP User Timeout Option from the TCPM WG.
>>>>>
>>>>> https://tools.ietf.org/html/rfc5482
>>>>>
>>>>> TCPM has a bis draft update to 793 that has more info than the
>>>>> original.
>>>>>
>>>>> https://datatracker.ietf.org/wg/tcpm/documents/
>>>>>
>>>>> https://tools.ietf.org/html/draft-ietf-tcpm-rfc793bis-19#page-42
>>>>>
>>>>>
>>>>> From a quick read there are caveats with devices supporting or not
>>>>> supporting the option.
>>>>>
>>>>> Also I guess setting the value is tricky as well: too low or too high
>>>>> and it could make matters worse with instability.
>>>>>
>>>>> Going down this path does seem a lot more complicated and riskier
>>>>> than using BFD.
>>>>>
>>>>>
>>>>> Kind Regards
>>>>>
>>>>> Gyan
>>>>>
>>>>>
>>>>> On Sat, Dec 19, 2020 at 5:38 AM William McCall <
>>>>> william.mccall@gmail.com> wrote:
>>>>>
>>>>>> On Fri, Dec 18, 2020 at 10:33 PM John Scudder
>>>>>> <jgs=40juniper.net@dmarc.ietf.org> wrote:
>>>>>> >
>>>>>> > On Dec 18, 2020, at 1:09 PM, Enke Chen <enchen@paloaltonetworks.com> wrote:
>>>>>> > >
>>>>>> > > No, I am not assuming that packets are getting somewhere. The
>>>>>> > > TCP_USER_TIMEOUT would work as long as there is "pending data" (either
>>>>>> > > unacked, or locally queued). The data can be from the local BGP Keepalives
>>>>>> > > or the TCP_KEEPALIVE.
>>>>>> >
>>>>>> > Apart from the other objections to relying on TCP_USER_TIMEOUT,
>>>>>> > which I think are sufficient, it’s not clear to me that implementations
>>>>>> > will provide the desired semantics. RFC 793 seems like it specifies the
>>>>>> > right semantics (“get this data to the peer within N seconds or close”):
>>>>>> >
>>>>>> >         The timeout, if present, permits the caller to set up a timeout
>>>>>> >         for all data submitted to TCP.  If data is not successfully
>>>>>> >         delivered to the destination within the timeout period, the TCP
>>>>>> >         will abort the connection.  The present global default is five
>>>>>> >         minutes.
>>>>>> >
>>>>>> > However the Linux man page documents different semantics:
>>>>>> >
>>>>>> >        TCP_USER_TIMEOUT (since Linux 2.6.37)
>>>>>> >               This option takes an unsigned int as an argument. When the
>>>>>> >               value is greater than 0, it specifies the maximum amount of
>>>>>> >               time in milliseconds that transmitted data may remain
>>>>>> >               unacknowledged before TCP will forcibly close the
>>>>>> >               corresponding connection and return ETIMEDOUT to the
>>>>>> >               application.  If the option value is specified as 0, TCP will
>>>>>> >               use the system default.
>>>>>> >
>>>>>> > The important difference being that whereas 793 implies data
>>>>>> > written to the socket, the Linux man page says “transmitted” data, which
>>>>>> > seems like it must mean data TCP has written to the network. These are two
>>>>>> > very different things! If Linux (or another stack) implements what the man
>>>>>> > page seems to say, it’s not useful for our purposes.
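
For anyone who wants to experiment with the option being discussed, a minimal sketch of setting it from Python follows. The helper name set_user_timeout is hypothetical. The socket.TCP_USER_TIMEOUT constant is only defined where the platform provides it (Linux), so the sketch checks for it instead of assuming it is there. It shows how to set the value only and makes no claim about which of the two semantics above a given kernel implements.

```python
import socket


def set_user_timeout(sock: socket.socket, timeout_ms: int) -> bool:
    """Try to set TCP_USER_TIMEOUT (milliseconds) on a TCP socket.

    Returns False when this Python build does not expose the option.
    """
    option = getattr(socket, "TCP_USER_TIMEOUT", None)
    if option is None:
        return False
    sock.setsockopt(socket.IPPROTO_TCP, option, timeout_ms)
    return True


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
supported = set_user_timeout(sock, 30_000)  # 30 seconds, an arbitrary value
sock.close()
```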
>>>>>> >
>>>>>> > —John
>>>>>> > _______________________________________________
>>>>>> > Idr mailing list
>>>>>> > Idr@ietf.org
>>>>>> > https://www.ietf.org/mailman/listinfo/idr
>>>>>>
>>>>>> I was curious too. I read the manpage, relevant linux kernel code, the
>>>>>> RFC, and hacked up a test case (unicast me if you want the code).
>>>>>> Also, Cloudflare published a relevant blog entry[0]. For this specific
>>>>>> scenario, see under the sub-heading "Zero window ESTAB is...
>>>>>> forever?".
>>>>>>
>>>>>> TCP_USER_TIMEOUT doesn't appear to kick in until there is unACKed
>>>>>> data, meaning that it has already been transmitted from TCP's
>>>>>> perspective. Stuff hanging around in the buffers due to persist state
>>>>>> doesn't seem to count, per the test results and the docs. Confirms
>>>>>> your thoughts from the reading I think.
>>>>>>
>>>>>> [0] https://blog.cloudflare.com/when-tcp-sockets-refuse-to-die/
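
This is not the test case mentioned above. It is a separate, minimal sketch of the condition under discussion as an application sees it: a non-blocking writer whose peer never reads eventually has its write refused, while everything already accepted sits in kernel buffers. The helper name fill_until_blocked is hypothetical, and socket.socketpair() is used as a portable stand-in for a TCP session, on the assumption that a non-blocking TCP socket reports a full send buffer the same way, with BlockingIOError.

```python
import socket


def fill_until_blocked(sock: socket.socket,
                       chunk: bytes = b"\0" * 4096,
                       max_bytes: int = 64 * 1024 * 1024) -> int:
    """Write until the kernel refuses more data; return bytes accepted."""
    sock.setblocking(False)
    accepted = 0
    while accepted < max_bytes:
        try:
            accepted += sock.send(chunk)
        except BlockingIOError:
            # Send buffer is full: this is the "cannot write" condition.
            break
    return accepted


writer, silent_peer = socket.socketpair()
queued = fill_until_blocked(writer)  # silent_peer never calls recv()
print(f"{queued} bytes accepted before the first refused write")
writer.close()
silent_peer.close()
```

The first refused write is the event a send-side hold timer in the application would key off, independent of what the TCP stack itself does with the queued data.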
>>>>>>
>>>>>> --
>>>>>> William McCall
>>>>>>
>>>>>>
>>>>> --
>>>>>
>>>>> <http://www.verizon.com/>
>>>>>
>>>>> Gyan Mishra
>>>>> Network Solutions Architect
>>>>> M 301 502-1347
>>>>> 13101 Columbia Pike, Silver Spring, MD
>>>>>
>>>>>