Re: [tcpm] Possible deadlock scenario with retransmission on both sides at the same time

David Borman <dab@weston.borman.com> Sun, 07 August 2016 21:35 UTC

Content-Type: text/plain; charset="utf-8"
Mime-Version: 1.0 (Mac OS X Mail 9.3 \(3124\))
From: David Borman <dab@weston.borman.com>
In-Reply-To: <CY4PR11MB187848FCCEF4DB140F85913E841A0@CY4PR11MB1878.namprd11.prod.outlook.com>
Date: Sun, 07 Aug 2016 16:35:04 -0500
Content-Transfer-Encoding: quoted-printable
Message-Id: <2D524A8D-A5CA-45A6-B94D-FA1DA0CEE609@weston.borman.com>
References: <MWHPR11MB1374A50BC599B093EA09668984070@MWHPR11MB1374.namprd11.prod.outlook.com> <7070553C-65D1-46EE-95F4-DAE82E1F5A5E@weston.borman.com> <CY4PR11MB187848FCCEF4DB140F85913E841A0@CY4PR11MB1878.namprd11.prod.outlook.com>
To: Kobby Carmona <kobby.Carmona@qlogic.com>
X-Mailer: Apple Mail (2.3124)
Archived-At: <https://mailarchive.ietf.org/arch/msg/tcpm/5TIgQbXy__MobilFFz0ORQ9Ou2E>
Cc: "tcpm@ietf.org" <tcpm@ietf.org>
Subject: Re: [tcpm] Possible deadlock scenario with retransmission on both sides at the same time
X-BeenThere: tcpm@ietf.org
X-Mailman-Version: 2.1.17
Precedence: list
List-Id: TCP Maintenance and Minor Extensions Working Group <tcpm.ietf.org>
X-List-Received-Date: Sun, 07 Aug 2016 21:35:11 -0000

> On Aug 7, 2016, at 3:37 AM, Kobby Carmona <kobby.Carmona@qlogic.com> wrote:
> 
> Hi David,
> The code below is true for fast-retransmit.
> But when the retransmit timer expires (TCPT_REXMT), snd_nxt is set to snd_una. In that case CWND is set to 1 MSS, so in the example below each transmitter can send only a single segment, from sequence 2000/12000.

Your underlying problem is that the ACK-only packets are being sent with the wrong sequence number, and that is what is causing them to be dropped.  One way to fix that is to put SND.NXT back to its previous value after doing the retransmit.  But you are correct: the BSD code does that only in the fast retransmit path; when the timer-based retransmit fires, it just pulls SND.NXT back to SND.UNA.  However, the BSD code that I’m looking at keeps track of the largest sequence number sent in SND.MAX, and in the tcp_output() path there is this bit of code:

       /*
        * If we are doing retransmissions, then snd_nxt will
        * not reflect the first unsent octet.  For ACK only
        * packets, we do not want the sequence number of the 
        * retransmitted packet, we want the sequence number
        * of the next unsent octet.  So, if there is no data
        * (and no SYN or FIN), use snd_max instead of snd_nxt
        * when filling in th_seq.  But if we are in persist
        * state, snd_max might reflect one byte beyond the
        * right edge of the window, so use snd_nxt in that
        * case, since we know we aren't doing a retransmission.
        * (retransmit and persist are mutually exclusive...)
        */
       if (len || (flags & (TH_SYN|TH_FIN)) || tp->t_timer[TCPT_PERSIST])
               th->th_seq = htonl(tp->snd_nxt);
       else
               th->th_seq = htonl(tp->snd_max);

That is what your implementation appears to be missing, and what is causing your ACK storm.  So yes, some variant of your problem has been seen before, and this is how the BSD code fixed it.
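
To make the effect concrete, here is a minimal C sketch of that rule next to the RFC 793 segment acceptability test, using the numbers from the example in the original report. The struct and function names (snd_state, segment_seq, segment_acceptable) and the persist flag are hypothetical and pared down for illustration; they are not the actual BSD tcpcb fields or functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t tcp_seq;

/* Wraparound-safe comparisons, in the style of the BSD SEQ_* macros. */
static bool seq_lt(tcp_seq a, tcp_seq b)  { return (int32_t)(a - b) < 0; }
static bool seq_leq(tcp_seq a, tcp_seq b) { return (int32_t)(a - b) <= 0; }

/* Hypothetical, pared-down sender state (not the real struct tcpcb). */
struct snd_state {
    tcp_seq snd_una;            /* oldest unacknowledged octet */
    tcp_seq snd_nxt;            /* next octet to send; pulled back on RTO */
    tcp_seq snd_max;            /* highest sequence number sent so far */
    bool    persist_timer_on;   /* stands in for t_timer[TCPT_PERSIST] */
};

/* Sequence number to place in an outgoing segment: data, SYN, FIN, and
 * persist probes use snd_nxt; a pure ACK uses snd_max. */
static tcp_seq segment_seq(const struct snd_state *s, uint32_t len,
                           bool syn_or_fin)
{
    assert(s != NULL);
    if (len > 0 || syn_or_fin || s->persist_timer_on)
        return s->snd_nxt;
    return s->snd_max;
}

/* RFC 793 acceptability test for an incoming segment (four cases,
 * by segment length and receive window). */
static bool segment_acceptable(tcp_seq rcv_nxt, uint32_t rcv_wnd,
                               tcp_seq seg_seq, uint32_t seg_len)
{
    if (seg_len == 0) {
        if (rcv_wnd == 0)
            return seg_seq == rcv_nxt;
        return seq_leq(rcv_nxt, seg_seq) &&
               seq_lt(seg_seq, rcv_nxt + rcv_wnd);
    }
    if (rcv_wnd == 0)
        return false;
    tcp_seq last = seg_seq + seg_len - 1;   /* last octet of the segment */
    return (seq_leq(rcv_nxt, seg_seq) && seq_lt(seg_seq, rcv_nxt + rcv_wnd)) ||
           (seq_leq(rcv_nxt, last) && seq_lt(last, rcv_nxt + rcv_wnd));
}
```

With side A at SND.UNA=1000, SND.NXT=2000, SND.MAX=3000 and side B at RCV.NXT=3000 (the receive window size is not given in the report, so any nonzero value is an assumption), a pure ACK built this way carries seq=3000 and passes B's check, whereas one carrying seq=2000 fails it and its ACK field is never processed.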

			-David Borman

> 
> 	Kobby
> 
> -----Original Message-----
> From: David Borman [mailto:dab@weston.borman.com] 
> Sent: Thursday, August 04, 2016 10:37 PM
> To: Kobby Carmona <kobby.Carmona@qlogic.com>
> Cc: tcpm@ietf.org
> Subject: Re: [tcpm] Possible deadlock scenario with retransmission on both sides at the same time
> 
> After you pull back SND.NXT and do the retransmission, you should then restore SND.NXT to where it was, not leave it at the backed-off value; then the ACKs wouldn’t be dropped, since they wouldn’t have old seq values.  For example, the 4.4BSD fast retransmit code had:
> 
> 			tcp_seq onxt = tp->snd_nxt;
> 			...
> 			tp->snd_nxt = th->th_ack;
> 			...
> 			(void) tcp_output(tp);
> 			...
> 			if (SEQ_GT(onxt, tp->snd_nxt))  
> 				tp->snd_nxt = onxt;
> 
> 
> 			-David Borman
> 
>> On Aug 4, 2016, at 4:08 AM, Kobby Carmona <kobby.Carmona@qlogic.com> wrote:
>> 
>> Hi all,
>> While running a bidirectional scenario with random drops in a network simulator of our (QLogic's NIC) TCP stack, we found a case where there seems to be a deadlock in the TCP protocol (the connection keeps sending pure ACKs from both sides until the RTO expires multiple times and a RST is sent to close the connection).
>> The scenario is as follows (there is an example with numbers for each stage assuming the MSS and each packet is 1000B):
>> 1. Both sides are transmitting data and a single packet is dropped on either side and the next two packets are received properly
>> 	Side A - SND.MAX=3000, SND.NXT=3000, SND.UNA=1000, RCV.NXT=11000, out-of-order block 12000-13000
>> 	Side B - SND.MAX=13000, SND.NXT=13000, SND.UNA=11000, RCV.NXT=1000, out-of-order block 2000-3000
>> 2. RTO timer expires on both sides
>> 	Side A - SND.MAX=3000, SND.NXT=1000, SND.UNA=1000, RCV.NXT=11000, out-of-order block 12000-13000
>> 	Side B - SND.MAX=13000, SND.NXT=11000, SND.UNA=11000, RCV.NXT=1000, out-of-order block 2000-3000
>> 3. Both sides transmit a single packet to the peer:
>> 	A->B - pkt.seq=1000, pkt.ack=11000, len=1000
>> 	B->A - pkt.seq=11000, pkt.ack=1000, len=1000
>> 4. Both sides receive the packets and update the receive context:
>> 	Side A - SND.MAX=3000, SND.NXT=2000, SND.UNA=1000, RCV.NXT=13000
>> 	Side B - SND.MAX=13000, SND.NXT=12000, SND.UNA=11000, RCV.NXT=3000
>> 5. Both sides send another segment:
>> 	A->B - pkt.seq=2000, pkt.ack=13000, len=1000
>> 	B->A - pkt.seq=12000, pkt.ack=3000, len=1000
>> 6. Both sides don't accept the packet (and don't update SND.UNA) since the sequence on the packet is less than RCV.NXT (sequence number check in page 69 of RFC793) and send a pure ACK instead
>> 	A->B - pkt.seq=2000, pkt.ack=13000, len=0 (pure ACK)
>> 	B->A - pkt.seq=12000, pkt.ack=3000, len=0 (pure ACK)
>> 7. This will continue forever (until the connection will be terminated by RST) since every packet that ends before RCV.NXT (even a retransmit from SND.UNA) will be dropped.
>> 
>> Has anyone encountered this issue before? Is there anything we missed in this sequence?
>> If this is indeed a real deadlock, there might be several solutions, which would require modifying the receive processing of RFC793. But I would like to know whether you think this is a real issue before dealing with solutions.
>> Thanks,
>> 
>> 	Kobby
>> 
>> 
>> _______________________________________________
>> tcpm mailing list
>> tcpm@ietf.org
>> https://www.ietf.org/mailman/listinfo/tcpm
> 