Re: [tcpm] Possible deadlock scenario with retransmission on both sides at the same time

Yoshifumi Nishida <nishida@sfc.wide.ad.jp> Fri, 30 September 2016 08:06 UTC

From: Yoshifumi Nishida <nishida@sfc.wide.ad.jp>
Date: Fri, 30 Sep 2016 01:06:06 -0700
To: Yoshifumi Nishida <nishida@sfc.wide.ad.jp>
Archived-At: <https://mailarchive.ietf.org/arch/msg/tcpm/6p9q2Z9Sd8XFrVHpFs3jnHaIvVo>
Cc: Kobby Carmona <kobby.Carmona@qlogic.com>, David Borman <dab@weston.borman.com>, "tcpm@ietf.org" <tcpm@ietf.org>
Subject: Re: [tcpm] Possible deadlock scenario with retransmission on both sides at the same time

Hello,

I'm not sure how many folks are interested in this, but anyway...
I have tested this with packetdrill on Linux, NetBSD, FreeBSD and OpenBSD.

As far as I tested, Linux works just as Neal described.
On the other hand, I found a minor issue in the behavior of the others.
In their TCP input functions, they do not check the sequence number of an
incoming segment at the point where they update the ack value; the sequence
number is only checked later in the code. Because of this, they behave as
follows.

00:42:58.395100 192.0.2.1.54613 > 192.168.0.1.8080: Flags [S],  seq 0, win 20000, options [mss 1000], length 0
00:42:58.395120 192.168.0.1.8080 > 192.0.2.1.54613: Flags [S.], seq 4183682776, ack 1, win 65535, length 0
00:43:01.422728 192.168.0.1.8080 > 192.0.2.1.54613: Flags [S.], seq 4183682776, ack 1, win 65535, length 0
00:43:01.422801 192.0.2.1.54613 > 192.168.0.1.8080: Flags [.],  seq 1, ack 1, win 20000, length 0
00:43:02.443166 192.0.2.1.54613 > 192.168.0.1.8080: Flags [.],  seq 2001:3001, ack 1, win 20000, length 1000
00:43:02.468148 192.168.0.1.8080 > 192.0.2.1.54613: Flags [.],  seq 1, ack 1, win 65535, length 0
00:43:04.471409 192.168.0.1.8080 > 192.0.2.1.54613: Flags [P.], seq 1:1001, ack 1, win 65535, length 1000
00:43:06.531462 192.0.2.1.54613 > 192.168.0.1.8080: Flags [.],  seq 1:1001, ack 1001, win 20000, length 1000
00:43:06.555492 192.168.0.1.8080 > 192.0.2.1.54613: Flags [.],  seq 1001:2001, ack 1001, win 65000, length 1000
00:43:06.555973 192.168.0.1.8080 > 192.0.2.1.54613: Flags [P.], seq 2001:3001, ack 1001, win 65000, length 1000
00:43:07.056276 192.0.2.1.54613 > 192.168.0.1.8080: Flags [.],  seq 1001:2001, ack 1001, win 20000, length 1000
00:43:07.084910 192.168.0.1.8080 > 192.0.2.1.54613: Flags [.],  seq 3001, ack 3001, win 63000, length 0
00:43:07.585918 192.0.2.1.54613 > 192.168.0.1.8080: Flags [.],  seq 1001:2001, ack 2001, win 20000, length 1000
00:43:07.607117 192.168.0.1.8080 > 192.0.2.1.54613: Flags [.],  seq 3001, ack 3001, win 63000, length 0
00:43:08.109788 192.0.2.1.54613 > 192.168.0.1.8080: Flags [.],  seq 2001, ack 2001, win 20000, length 0
00:43:13.908734 192.168.0.1.8080 > 192.0.2.1.54613: Flags [P.], seq 2001:3001, ack 3001, win 63000, length 1000

In this dump file, packets with src addr 192.0.2.1 were generated by
packetdrill and packets with src addr 192.168.0.1 came from the OSes. While
the ack values in the packets at 00:43:07.585918 and 00:43:13.908734 should
be ignored, they are actually used to update snd_nxt.
As a result, seq 2001:3001 was retransmitted after the timeout instead of
1001:2001.
(However, it also means this logic won't create the deadlock situation
Kobby mentioned.)

When I added a sequence number check before updating the ack value in the
FreeBSD code, it worked as expected.
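
A minimal sketch of such a check, for illustration only. This is not the
actual FreeBSD patch: the struct conn type, its field names and the helper
functions below are all hypothetical, and the acceptability test is a
simplified form of the one in RFC 793 (it leaves out the special allowance
for ACKs when the receive window is zero). The sketch tracks SND.UNA, using
the RFC 793 variable names.

```c
#include <stdbool.h>
#include <stdint.h>

/* Sequence-space comparisons that tolerate 32-bit wraparound. */
static bool seq_lt(uint32_t a, uint32_t b)  { return (int32_t)(a - b) < 0; }
static bool seq_leq(uint32_t a, uint32_t b) { return (int32_t)(a - b) <= 0; }

/* Hypothetical, heavily simplified connection state (RFC 793 names). */
struct conn {
    uint32_t snd_una;   /* oldest unacknowledged sequence number */
    uint32_t snd_max;   /* highest sequence number sent so far */
    uint32_t rcv_nxt;   /* next sequence number expected from the peer */
    uint32_t rcv_wnd;   /* receive window currently offered */
};

/* Simplified RFC 793 test: does any part of the segment fall in the window? */
static bool seg_acceptable(const struct conn *c, uint32_t seq, uint32_t len)
{
    uint32_t wnd_end = c->rcv_nxt + c->rcv_wnd;

    if (len == 0) {
        if (c->rcv_wnd == 0)
            return seq == c->rcv_nxt;
        return seq_leq(c->rcv_nxt, seq) && seq_lt(seq, wnd_end);
    }
    if (c->rcv_wnd == 0)
        return false;

    uint32_t last = seq + len - 1;   /* last octet carried by the segment */
    return (seq_leq(c->rcv_nxt, seq) && seq_lt(seq, wnd_end)) ||
           (seq_leq(c->rcv_nxt, last) && seq_lt(last, wnd_end));
}

/*
 * Use the ack field only after the sequence number check has passed.
 * Returns false when the segment is rejected and its ack value ignored.
 */
static bool process_ack(struct conn *c, uint32_t seq, uint32_t len,
                        uint32_t ack)
{
    if (!seg_acceptable(c, seq, len))
        return false;                       /* ack value must not be used */

    if (seq_lt(c->snd_una, ack) && seq_leq(ack, c->snd_max))
        c->snd_una = ack;                   /* new data acknowledged */
    return true;
}
```

With the state the OS has just before 00:43:07.585918 (snd_una 1001,
rcv_nxt 3001), the segment seq 1001:2001, ack 2001 lies entirely below
rcv_nxt, so the sketch rejects it and snd_una stays at 1001.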
--
Yoshi


On Tue, Aug 23, 2016 at 12:06 AM, Yoshifumi Nishida <nishida@sfc.wide.ad.jp>
wrote:

> Hi Neal,
>
> Oh. I see. Thanks for the explanation.
> I'd like to think about it a bit more.
> Thanks,
> --
> Yoshi
>
>
> On Sun, Aug 21, 2016 at 12:44 PM, Neal Cardwell <ncardwell@google.com>
> wrote:
>
>> I am pretty sure Linux does not have the issue Kobby pointed out in this
>> thread.
>>
>> At a high level Linux should be OK because it follows the principle
>> David Borman laid out in his August 16 email: "ACK-only packets should
>> be sent with the largest in-window sequence number that has ever been
>> sent."
>>
>> Linux obeys that principle by using tp->snd_nxt to store the largest
>> sequence number that has ever been sent, and having
>> tcp_acceptable_seq() use tp->snd_nxt but clamp the outgoing sequence
>> number to make sure it is in-window. To be able to do this, in Linux,
>> the sender does not rewind tp->snd_nxt on retransmissions.
>>
>> neal
>>
>> On Sun, Aug 21, 2016 at 1:25 PM, Yoshifumi Nishida
>> <nishida@sfc.wide.ad.jp> wrote:
>> > Hi Neal,
>> >
>> > Thanks for the info.
>> > So, it seems to me that the linux code has the issue Kobby pointed out.
>> > Or, am I missing something?
>> > --
>> > Yoshi
>> >
>> >
>> > On Mon, Aug 15, 2016 at 6:18 PM, Neal Cardwell <ncardwell@google.com>
>> > wrote:
>> >>
>> >> On Mon, Aug 15, 2016 at 6:39 PM, Yoshifumi Nishida
>> >> <nishida@sfc.wide.ad.jp> wrote:
>> >> > Hello,
>> >> > I personally think this is an interesting corner case for discussion.
>> >> > It looks like a minor one, but I'm not very sure if we can leave it
>> >> > to each implementation.
>> >> > I also guess a question would be whether the BSD fix is the best way
>> >> > to handle the issue.
>> >>
>> >> Yes, I agree this is an interesting case for discussion.
>> >>
>> >> FWIW, as a point of comparison for discussion, Linux's approach is a
>> >> little different: in Linux, the sender does not rewind SND.NXT on
>> >> retransmissions (RTO or Fast Recovery). Then the sender usually uses
>> >> SND.NXT for the seq field of outgoing pure ACKs. I say "usually"
>> >> because the Linux code has some code to deal with the case where the
>> >> receiver has withdrawn the receive window, so that SND.NXT is now
>> >> beyond the receive window. The code tcp_send_ack() uses to pick a seq
>> >> for outgoing pure ACKs looks like:
>> >>
>> >> /* SND.NXT, if window was not shrunk.
>> >>  * If window has been shrunk, what should we make? It is not clear at all.
>> >>  * Using SND.UNA we will fail to open window, SND.NXT is out of window. :-(
>> >>  * Anything in between SND.UNA...SND.UNA+SND.WND also can be already
>> >>  * invalid. OK, let's make this for now:
>> >>  */
>> >> static inline __u32 tcp_acceptable_seq(const struct sock *sk)
>> >> {
>> >>         const struct tcp_sock *tp = tcp_sk(sk);
>> >>
>> >>         if (!before(tcp_wnd_end(tp), tp->snd_nxt))
>> >>                 return tp->snd_nxt;
>> >>         else
>> >>                 return tcp_wnd_end(tp);
>> >> }
>> >>
>> >> neal
>> >
>> >
>>
>
>