Re: [Idr] TCP & BGP: Some don't send terminate BGP when holdtimer expired, because TCP recv window is 0

Enke Chen <enchen@paloaltonetworks.com> Mon, 21 December 2020 20:51 UTC

From: Enke Chen <enchen@paloaltonetworks.com>
Date: Mon, 21 Dec 2020 12:50:52 -0800
Message-ID: <CANJ8pZ_Zqj_knpRSoWjaVV3f4rzqOAoVoJsYvNQzVti3LybXEw@mail.gmail.com>
To: William McCall <william.mccall@gmail.com>
Cc: "idr@ietf. org" <idr@ietf.org>, jgs@juniper.net
Archived-At: <https://mailarchive.ietf.org/arch/msg/idr/Gt5t6j_qsGuTI4B_OfhTe-l9rvA>

Hi, William:

There have been multiple fixes to the TCP_USER_TIMEOUT feature in Linux.  I
listed several (see below) that look important.

One of the commits has the following comment:

    In addition, we change TCP_USER_TIMEOUT to cover (life or dead)
    sockets stalled on zero-window probes. This changes the semantics
    of TCP_USER_TIMEOUT slightly because it previously only applies
    when the socket has pending transmission.


Not sure which Linux version you used, and whether it has these fixes.
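
For reference, a minimal user-space sketch of how an application (a BGP
speaker, say) would set and read back the option. The helper names are
hypothetical and not taken from any implementation. Whether the timeout
then fires in the zero-window case still depends on the kernel carrying
the fixes listed below.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>   /* TCP_USER_TIMEOUT */
#include <sys/socket.h>

/* Hypothetical helper: set TCP_USER_TIMEOUT, in milliseconds, on a TCP
 * socket. Returns 0 on success, -1 on error with errno set. */
static int set_user_timeout(int fd, unsigned int timeout_ms)
{
    return setsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT,
                      &timeout_ms, sizeof(timeout_ms));
}

/* Hypothetical helper: read the current TCP_USER_TIMEOUT back.
 * Returns 0 on success, -1 on error with errno set. */
static int get_user_timeout(int fd, unsigned int *timeout_ms)
{
    socklen_t len = sizeof(*timeout_ms);

    return getsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT, timeout_ms, &len);
}
```

A caller might pass 90000 (90 seconds) to line the option up with a BGP
hold time of that length; that value is only an example.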


Thanks.   -- Enke


------


commit 8c72c65b426b47b3c166a8fef0d8927fe5e8a28d
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Sep 13 20:30:39 2017 -0700

    tcp: update skb->skb_mstamp more carefully

    liujian reported a problem in TCP_USER_TIMEOUT processing with a patch
    in tcp_probe_timer() :
          https://www.spinics.net/lists/netdev/msg454496.html
    ...


commit 4ab688793e086ef6d1744a0f803fe9770a1ae5d0
Author: Eric Dumazet <edumazet@google.com>
Date:   Sun May 21 10:39:00 2017 -0700

    tcp: fix tcp_probe_timer() for TCP_USER_TIMEOUT

    TCP_USER_TIMEOUT is still converted to jiffies value in
    icsk_user_timeout

    So we need to make a conversion for the cases HZ != 1000
    ...


commit b248230c34970a6c1c17c591d63b464e8d2cfc33
Author: Yuchung Cheng <ycheng@google.com>
Date:   Mon Sep 29 13:20:38 2014 -0700

    tcp: abort orphan sockets stalling on zero window probes

    Currently we have two different policies for orphan sockets
    that repeatedly stall on zero window ACKs. If a socket gets
    a zero window ACK when it is transmitting data, the RTO is
    used to probe the window. The socket is aborted after roughly
    tcp_orphan_retries() retries (as in tcp_write_timeout()).

    But if the socket was idle when it received the zero window ACK,
    and later wants to send more data, we use the probe timer to
    probe the window. If the receiver always returns zero window ACKs,
    icsk_probes keeps getting reset in tcp_ack() and the orphan socket
    can stall forever until the system reaches the orphan limit (as
    commented in tcp_probe_timer()). This opens up a simple attack
    to create lots of hanging orphan sockets to burn the memory
    and the CPU, as demonstrated in the recent netdev post "TCP
    connection will hang in FIN_WAIT1 after closing if zero window is
    advertised." http://www.spinics.net/lists/netdev/msg296539.html

    This patch follows the design in RTO-based probe: we abort an orphan
    socket stalling on zero window when the probe timer reaches both
    the maximum backoff and the maximum RTO. For example, an 100ms RTT
    connection will timeout after roughly 153 seconds (0.3 + 0.6 +
    .... + 76.8) if the receiver keeps the window shut. If the orphan
    socket passes this check, but the system already has too many orphans
    (as in tcp_out_of_resources()), we still abort it but we'll also
    send an RST packet as the connection may still be active.

    In addition, we change TCP_USER_TIMEOUT to cover (life or dead)
    sockets stalled on zero-window probes. This changes the semantics
    of TCP_USER_TIMEOUT slightly because it previously only applies
    when the socket has pending transmission.
    ...
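
The "roughly 153 seconds" figure in the last commit message can be
checked with a few lines of C. This is only a sketch of the arithmetic,
not kernel code: the 300 ms first interval is read off the series in the
commit text, the 120 s cap is assumed to correspond to TCP_RTO_MAX, and
the stopping rule (stop once the next interval would exceed the cap) is
an assumption chosen to reproduce that series. The function name is
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Sum the doubling probe intervals base, 2*base, 4*base, ... for as long
 * as the interval does not exceed rto_max_ms. All values in milliseconds.
 * Illustrative arithmetic only; the kernel's abort condition is more
 * involved than this. */
static uint64_t backoff_series_total_ms(uint32_t base_ms, uint32_t rto_max_ms)
{
    uint64_t total = 0;
    uint64_t interval = base_ms;

    assert(base_ms > 0);  /* a zero base would never reach the cap */

    while (interval <= rto_max_ms) {
        total += interval;
        interval *= 2;
    }
    return total;
}
```

With base_ms = 300 and rto_max_ms = 120000, the nine intervals from
0.3 s through 76.8 s sum to 153300 ms, which matches the figure in the
commit message.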

On Sun, Dec 20, 2020 at 10:52 PM Enke Chen <enchen@paloaltonetworks.com>
wrote:

> Hi, John and William:
>
> My reading of the Linux function tcp_probe_timer() is that the data in
> the socket buffer is checked. More specifically, when there is no
> un-acked data, but there is data in the socket buffer,
> the "icsk_user_timeout" would be checked, and the probe timer would be set
> again in tcp_send_probe0().
>
> I am not sure what could have caused the failure that William observed. We
> will need someone who is familiar with the TCP code to take a look.
> There might be one potential issue in tcp_check_probe_timer() where the
> probe timer is not started (please see below).
>
> Thanks.   -- Enke
>
> ----------------------
> Linux v5.10-rc7-149-g33dc961
>
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index d4ef5bf..0b28af1 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -1328,7 +1328,8 @@ static inline unsigned long tcp_probe0_when(const struct sock *sk,
>
>  static inline void tcp_check_probe_timer(struct sock *sk)
>  {
> -       if (!tcp_sk(sk)->packets_out && !inet_csk(sk)->icsk_pending)
> +       if (!tcp_sk(sk)->packets_out &&
> +           (inet_csk(sk)->icsk_pending != ICSK_TIME_PROBE0))
>                 tcp_reset_xmit_timer(sk, ICSK_TIME_PROBE0,
>                                      tcp_probe0_base(sk), TCP_RTO_MAX);
>  }
>
> ----------
>
>
> static void tcp_probe_timer(struct sock *sk)
> {
>         struct inet_connection_sock *icsk = inet_csk(sk);
>         struct sk_buff *skb = tcp_send_head(sk);
>         struct tcp_sock *tp = tcp_sk(sk);
>         int max_probes;
>
>         if (tp->packets_out || !skb) {
>                 icsk->icsk_probes_out = 0;
>                 return;
>         }
>
>         /* RFC 1122 4.2.2.17 requires the sender to stay open indefinitely as
>          * long as the receiver continues to respond probes. We support this by
>          * default and reset icsk_probes_out with incoming ACKs. But if the
>          * socket is orphaned or the user specifies TCP_USER_TIMEOUT, we
>          * kill the socket when the retry count and the time exceeds the
>          * corresponding system limit. We also implement similar policy when
>          * we use RTO to probe window in tcp_retransmit_timer().
>          */
>         if (icsk->icsk_user_timeout) {
>                 u32 elapsed = tcp_model_timeout(sk, icsk->icsk_probes_out,
>                                                 tcp_probe0_base(sk));
>
>                 if (elapsed >= icsk->icsk_user_timeout)
>                         goto abort;
>         }
>
>
> On Sat, Dec 19, 2020 at 2:38 AM William McCall <william.mccall@gmail.com>
> wrote:
>
>> On Fri, Dec 18, 2020 at 10:33 PM John Scudder
>> <jgs=40juniper.net@dmarc.ietf.org> wrote:
>> >
>> > On Dec 18, 2020, at 1:09 PM, Enke Chen <enchen@paloaltonetworks.com> wrote:
>> > >
>> > > No, I am not assuming that packets are getting somewhere. The
>> > > TCP_USER_TIMEOUT would work as long as there is "pending data" (either
>> > > unacked, or locally queued). The data can be from the local BGP Keepalives
>> > > or the TCP_KEEPALIVE.
>> >
>> > Apart from the other objections to relying on TCP_USER_TIMEOUT, which I
>> > think are sufficient, it’s not clear to me that implementations will
>> > provide the desired semantics. RFC 793 seems like it specifies the right
>> > semantics (“get this data to the peer within N seconds or close”):
>> >
>> >         The timeout, if present, permits the caller to set up a timeout
>> >         for all data submitted to TCP.  If data is not successfully
>> >         delivered to the destination within the timeout period, the TCP
>> >         will abort the connection.  The present global default is five
>> >         minutes.
>> >
>> > However the Linux man page documents different semantics:
>> >
>> >        TCP_USER_TIMEOUT (since Linux 2.6.37)
>> >               This option takes an unsigned int as an argument.  When the
>> >               value is greater than 0, it specifies the maximum amount of
>> >               time in milliseconds that transmitted data may remain
>> >               unacknowledged before TCP will forcibly close the
>> >               corresponding connection and return ETIMEDOUT to the
>> >               application.  If the option value is specified as 0, TCP will
>> >               use the system default.
>> >
>> > The important difference being that whereas 793 implies data written to
>> > the socket, the Linux man page says “transmitted” data, which seems like it
>> > must mean data TCP has written to the network. These are two very different
>> > things! If Linux (or another stack) implements what the man page seems to
>> > say, it’s not useful for our purposes.
>> >
>> > —John
>>
>> I was curious too. I read the manpage, relevant linux kernel code, the
>> RFC, and hacked up a test case (unicast me if you want the code).
>> Also, Cloudflare published a relevant blog entry[0]. For this specific
>> scenario, see under the sub-heading "Zero window ESTAB is...
>> forever?".
>>
>> TCP_USER_TIMEOUT doesn't appear to kick in until there is unACKed
>> data, meaning that it has already been transmitted from TCP's
>> perspective. Stuff hanging around in the buffers due to persist state
>> doesn't seem to count, per the test results and the docs. Confirms
>> your thoughts from the reading I think.
>>
>> [0] https://blog.cloudflare.com/when-tcp-sockets-refuse-to-die/
>>
>> --
>> William McCall
>>
>
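
To make the reading of the quoted tcp_probe_timer() excerpt concrete,
here is a small user-space model of just the two checks it shows: the
early return when there is un-acked data or nothing queued, and the
comparison against the user timeout. Everything in it is an assumption
for illustration: the struct, the enum, the function names, the 120 s
cap, and the way elapsed time is modeled (doubling intervals capped at
that value). It is not the kernel's tcp_model_timeout(), and it ignores
the parts of the function that the excerpt leaves out.

```c
#include <stdint.h>

#define MODEL_RTO_MAX_MS 120000u  /* assumption: 120 s cap on one interval */

enum probe_action { PROBE_NOTHING, PROBE_ABORT, PROBE_AGAIN };

/* Hypothetical snapshot of the socket state the excerpt looks at. */
struct probe_model {
    unsigned int packets_out;     /* segments sent but not yet ACKed */
    unsigned int unsent_bytes;    /* data queued in the socket, not yet sent */
    unsigned int probes_out;      /* zero-window probes sent so far */
    uint32_t     probe_base_ms;   /* first probe interval */
    uint32_t     user_timeout_ms; /* 0 means TCP_USER_TIMEOUT is not set */
};

/* Modeled time covered by `probes` probes: doubling intervals, each
 * capped at MODEL_RTO_MAX_MS. */
static uint64_t modeled_elapsed_ms(unsigned int probes, uint32_t base_ms)
{
    uint64_t total = 0;
    uint64_t interval = base_ms;

    for (unsigned int i = 0; i < probes; i++) {
        total += interval;
        interval *= 2;
        if (interval > MODEL_RTO_MAX_MS)
            interval = MODEL_RTO_MAX_MS;
    }
    return total;
}

/* Mirrors the order of checks in the excerpt. */
static enum probe_action probe_timer_decision(const struct probe_model *m)
{
    /* Un-acked data is in flight, or nothing is waiting to be sent. */
    if (m->packets_out || m->unsent_bytes == 0)
        return PROBE_NOTHING;

    /* Data is queued but unsent: the user timeout is consulted here. */
    if (m->user_timeout_ms &&
        modeled_elapsed_ms(m->probes_out, m->probe_base_ms) >= m->user_timeout_ms)
        return PROBE_ABORT;

    return PROBE_AGAIN;
}
```

In this model a single queued 19-octet BGP KEEPALIVE with nothing in
flight is enough to reach the user-timeout check, which is the case the
thread is arguing about.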