Re: [tcpm] Possible deadlock scenario with retransmission on both sides at the same time

David Borman <dab@weston.borman.com> Tue, 16 August 2016 05:09 UTC

Content-Type: text/plain; charset="utf-8"
Mime-Version: 1.0 (Mac OS X Mail 9.3 \(3124\))
From: David Borman <dab@weston.borman.com>
In-Reply-To: <CAO249yeTNRFuFcWn5ga854g24GM_7DAjeKOSw3d7h=HQQYKX1Q@mail.gmail.com>
Date: Tue, 16 Aug 2016 00:09:36 -0500
Content-Transfer-Encoding: quoted-printable
Message-Id: <DD8B8D73-A282-4C15-9123-A155B45977BD@weston.borman.com>
References: <MWHPR11MB1374A50BC599B093EA09668984070@MWHPR11MB1374.namprd11.prod.outlook.com> <7070553C-65D1-46EE-95F4-DAE82E1F5A5E@weston.borman.com> <CY4PR11MB187848FCCEF4DB140F85913E841A0@CY4PR11MB1878.namprd11.prod.outlook.com> <2D524A8D-A5CA-45A6-B94D-FA1DA0CEE609@weston.borman.com> <CY4PR11MB1878E7E911194A1032E32428841E0@CY4PR11MB1878.namprd11.prod.outlook.com> <CAO249yeTNRFuFcWn5ga854g24GM_7DAjeKOSw3d7h=HQQYKX1Q@mail.gmail.com>
To: Yoshifumi Nishida <nishida@sfc.wide.ad.jp>
X-Mailer: Apple Mail (2.3124)
Archived-At: <https://mailarchive.ietf.org/arch/msg/tcpm/YFd94q50_M1XhvXjIy5Q5BT27j8>
Cc: Kobby Carmona <kobby.Carmona@qlogic.com>, "tcpm@ietf.org" <tcpm@ietf.org>
Subject: Re: [tcpm] Possible deadlock scenario with retransmission on both sides at the same time

Probably the more correct way of stating it is that ACK-only packets should be sent with the largest in-window sequence number that has ever been sent.  It is the only way to guarantee that the sequence number of the ACK will be within the window at the receiving side.
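
As a rough illustration of the rule above, the sender-side choice might be sketched in C as follows. This is not taken from any particular stack: the struct, its field names, and ack_only_seq() are hypothetical, and clamping SND.MAX to SND.UNA + SND.WND is only one reading of "largest in-window sequence number that has ever been sent".

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t tcp_seq;

/* Wraparound-safe comparison, in the style of the BSD SEQ_* macros. */
#define SEQ_LT(a, b) ((int32_t)((a) - (b)) < 0)

/* Hypothetical, simplified send state (names follow RFC 793 / BSD usage). */
struct snd_state {
    tcp_seq  snd_una;  /* oldest unacknowledged sequence number */
    tcp_seq  snd_max;  /* highest sequence number sent so far   */
    uint32_t snd_wnd;  /* window advertised by the peer         */
};

/*
 * Sequence number for an ACK-only segment: the highest sequence number
 * ever sent, clamped to the right edge of the peer's window.  This is an
 * assumption about how to express the rule, not code from BSD.
 */
static tcp_seq ack_only_seq(const struct snd_state *s)
{
    tcp_seq right_edge = s->snd_una + s->snd_wnd;

    return SEQ_LT(right_edge, s->snd_max) ? right_edge : s->snd_max;
}

/* Example values; the 8000-byte window is assumed for illustration. */
void ack_only_seq_example(void)
{
    /* Side A of the scenario in this thread, after the RTO fired. */
    struct snd_state a = { 1000, 3000, 8000 };
    assert(ack_only_seq(&a) == 3000);

    /* Closed window with a one-byte probe sent past the right edge. */
    struct snd_state p = { 5000, 5001, 0 };
    assert(ack_only_seq(&p) == 5000);
}
```

With these values, Side A's pure ACK carries sequence number 3000 rather than the pulled-back SND.NXT, so it is not rejected as old data by the peer.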

That being said, probe packets can be out of window.  And if the window is truly closed, you don’t want to send an ACK-only packet with that sequence number, as it will also be out of window.  Hence, send ACK-only packets with the largest in-window sequence number.  In the BSD code probes are only one byte, and in persist state SND.NXT will be the largest in-window sequence number that has ever been sent.  But if we are in persist state because the ACK was lost or the window just opened and the probe data is accepted, the ACK will be to the left of the window by the size of the probe packet.  Hence the receiver needs to accept ACK packets one byte to the left of the window to prevent an ACK war.  The more general case is that you need to accept ACKs up to at least N bytes to the left of the window, where N represents the largest size of probe that you will ever send.  That, along with sending ACKs with the largest in-window sequence number, will guarantee that you don’t get into an ACK war, even if the two sides have different values for N (only one side needs to accept the ACK to stop the ACK war, and that will happen on the side with the larger value for N).
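
The receiver-side half of this can be sketched in the same spirit. The code below is a minimal illustration, assuming a zero-length (ACK-only) segment and a hypothetical constant MAX_PROBE_LEN standing in for N. The function names are made up for this example, and the first function paraphrases the RFC 793 acceptability test for empty segments.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t tcp_seq;

/* Wraparound-safe comparisons, in the style of the BSD SEQ_* macros. */
#define SEQ_LT(a, b)  ((int32_t)((a) - (b)) < 0)
#define SEQ_LEQ(a, b) ((int32_t)((a) - (b)) <= 0)

/* Hypothetical: N, the largest window probe this stack expects to see. */
#define MAX_PROBE_LEN 1u

/* RFC 793 acceptability test for a segment that carries no data. */
static bool rfc793_accept_empty(tcp_seq seg_seq, tcp_seq rcv_nxt,
                                uint32_t rcv_wnd)
{
    if (rcv_wnd == 0)
        return seg_seq == rcv_nxt;
    return SEQ_LEQ(rcv_nxt, seg_seq) && SEQ_LT(seg_seq, rcv_nxt + rcv_wnd);
}

/*
 * Relaxed test: additionally accept an empty segment that sits up to
 * MAX_PROBE_LEN bytes to the left of the window.
 */
static bool relaxed_accept_empty(tcp_seq seg_seq, tcp_seq rcv_nxt,
                                 uint32_t rcv_wnd)
{
    if (rfc793_accept_empty(seg_seq, rcv_nxt, rcv_wnd))
        return true;
    return SEQ_LT(seg_seq, rcv_nxt) &&
           (uint32_t)(rcv_nxt - seg_seq) <= MAX_PROBE_LEN;
}

/* Example: a one-byte probe at 5000 was accepted, so RCV.NXT is 5001. */
void relaxed_accept_example(void)
{
    assert(!rfc793_accept_empty(5000, 5001, 4000)); /* strict test drops it */
    assert(relaxed_accept_empty(5000, 5001, 4000)); /* one byte left: ok    */
    assert(!relaxed_accept_empty(4999, 5001, 4000)); /* two bytes left: no  */
}
```

In the example, the peer's ACK-only segment still carries sequence number 5000, one byte to the left of the window. The strict test would drop it and trigger another ACK in response, while the relaxed test accepts it.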

			-David Borman

> On Aug 15, 2016, at 5:39 PM, Yoshifumi Nishida <nishida@sfc.wide.ad.jp> wrote:
> 
> Hello,
> I personally think this is an interesting corner case for discussion. 
> It looks like a minor one, but I'm not very sure if we can leave it to each implementation. 
> I guess another question is whether BSD's fix is the best way to address the issue.
> 
> Thanks,
> --
> Yoshi
> 
> On Thu, Aug 11, 2016 at 1:46 PM, Kobby Carmona <kobby.Carmona@qlogic.com> wrote:
> Hi David,
> This makes a lot of sense. We will fix our code.
> Thanks for your help on this,
> 
> BTW,
> Is this issue mentioned in any RFC? If not, do you see a point in adding an explicit note on the SEQ of a pure ACK in case of retransmission?
> 
>        Kobby
> 
> -----Original Message-----
> From: David Borman [mailto:dab@weston.borman.com]
> Sent: Monday, August 08, 2016 12:35 AM
> To: Kobby Carmona <kobby.Carmona@qlogic.com>
> Cc: tcpm@ietf.org
> Subject: Re: [tcpm] Possible deadlock scenario with retransmission on both sides at the same time
> 
>> On Aug 7, 2016, at 3:37 AM, Kobby Carmona <kobby.Carmona@qlogic.com> wrote:
>> 
>> Hi David,
>> The code below is true for fast-retransmit.
>> But in case of retransmit timer expiration (TCPT_REXMT), snd_nxt is set to snd_una. In this case CWND will be set to 1 MSS, so in the example below the transmitters can send only a single segment from sequence 2000/12000.
> 
> Your underlying problem is that the ACK-only packets are being sent with the wrong sequence number, and that is what is causing them to be dropped.  One way to fix that is to put SND.NXT back to its previous value after doing the retransmit.  But you are correct, in the BSD code it only does that in the fast retransmit code; when the timer-based retransmit code fires, it just pulls back SND.NXT to SND.UNA.  However, the BSD code that I’m looking at keeps track of the largest sequence number sent in SND.MAX, and in the tcp_output() path there is this bit of code:
> 
>       /*
>        * If we are doing retransmissions, then snd_nxt will
>        * not reflect the first unsent octet.  For ACK only
>        * packets, we do not want the sequence number of the
>        * retransmitted packet, we want the sequence number
>        * of the next unsent octet.  So, if there is no data
>        * (and no SYN or FIN), use snd_max instead of snd_nxt
>        * when filling in th_seq.  But if we are in persist
>        * state, snd_max might reflect one byte beyond the
>        * right edge of the window, so use snd_nxt in that
>        * case, since we know we aren't doing a retransmission.
>        * (retransmit and persist are mutually exclusive...)
>        */
>       if (len || (flags & (TH_SYN|TH_FIN)) || tp->t_timer[TCPT_PERSIST])
>               th->th_seq = htonl(tp->snd_nxt);
>       else
>               th->th_seq = htonl(tp->snd_max);
> 
> That is what your implementation appears to be missing, and what is causing your ACK storm.  So yes, some variant of your problem has been seen before, and this is how the BSD code fixed it.
> 
>                        -David Borman
> 
>> 
>>      Kobby
>> 
>> -----Original Message-----
>> From: David Borman [mailto:dab@weston.borman.com]
>> Sent: Thursday, August 04, 2016 10:37 PM
>> To: Kobby Carmona <kobby.Carmona@qlogic.com>
>> Cc: tcpm@ietf.org
>> Subject: Re: [tcpm] Possible deadlock scenario with retransmission on
>> both sides at the same time
>> 
>> After you pull back SND.NXT and do the retransmission, you should then restore SND.NXT back to where it was, not leave it at the backed off value; then the ACKs wouldn’t be dropped, since they wouldn’t have old seq values.  For example, in the 4.4BSD fast retransmit code it had:
>> 
>>                      tcp_seq onxt = tp->snd_nxt;
>>                      ...
>>                      tp->snd_nxt = th->th_ack;
>>                      ...
>>                      (void) tcp_output(tp);
>>                      ...
>>                      if (SEQ_GT(onxt, tp->snd_nxt))
>>                              tp->snd_nxt = onxt;
>> 
>> 
>>                      -David Borman
>> 
>>> On Aug 4, 2016, at 4:08 AM, Kobby Carmona <kobby.Carmona@qlogic.com> wrote:
>>> 
>>> Hi all,
>>> While running a bidirectional scenario with random drops in a network simulator of our (QLogic's NIC) TCP stack, we found a case where there seems to be a deadlock in the TCP protocol (the connection keeps sending pure ACKs from both sides until the RTO expires multiple times and a RST is sent to close the connection).
>>> The scenario is as follows (there is an example with numbers for each stage, assuming the MSS and each packet are 1000B):
>>> 1. Both sides are transmitting data and a single packet is dropped on either side and the next two packets are received properly
>>>     Side A - SND.MAX=3000, SND.NXT=3000, SND.UNA=1000, RCV.NXT=11000, out-of-order block 12000-13000
>>>     Side B - SND.MAX=13000, SND.NXT=13000, SND.UNA=11000, RCV.NXT=1000, out-of-order block 2000-3000
>>> 2. RTO timer expires on both sides
>>>     Side A - SND.MAX=3000, SND.NXT=1000, SND.UNA=1000, RCV.NXT=11000, out-of-order block 12000-13000
>>>     Side B - SND.MAX=13000, SND.NXT=11000, SND.UNA=11000, RCV.NXT=1000, out-of-order block 2000-3000
>>> 3. Both sides transmit a single packet to the peer:
>>>     A->B - pkt.seq=1000, pkt.ack=11000, len=1000
>>>     B->A - pkt.seq=11000, pkt.ack=1000, len=1000
>>> 4. Both sides receive the packets and update the receive context:
>>>     Side A - SND.MAX=3000, SND.NXT=2000, SND.UNA=1000, RCV.NXT=13000
>>>     Side B - SND.MAX=13000, SND.NXT=12000, SND.UNA=11000, RCV.NXT=3000
>>> 5. Both sides send another segment:
>>>     A->B - pkt.seq=2000, pkt.ack=13000, len=1000
>>>     B->A - pkt.seq=12000, pkt.ack=3000, len=1000
>>> 6. Both sides don't accept the packet (and don't update SND.UNA) since the sequence on the packet is less than RCV.NXT (sequence number check on page 69 of RFC793) and send a pure ACK instead
>>>     A->B - pkt.seq=2000, pkt.ack=13000, len=0 (pure ACK)
>>>     B->A - pkt.seq=12000, pkt.ack=3000, len=0 (pure ACK)
>>> 7. This will continue forever (until the connection is terminated by RST) since every packet that ends before RCV.NXT (even a retransmit from SND.UNA) will be dropped.
>>> 
>>> Has anyone encountered this issue before? Is there anything we missed in this sequence?
>>> If this is indeed a real deadlock, there might be several solutions to this which will require a modification in receive processing of RFC793. But I would like to know if you think this is a real issue before dealing with solutions.
>>> Thanks,
>>> 
>>>     Kobby
>>> 
>>> 
>>> _______________________________________________
>>> tcpm mailing list
>>> tcpm@ietf.org
>>> https://www.ietf.org/mailman/listinfo/tcpm