Re: TCP and Link Layer Retransmissions

Dr G Fairhurst <gorry@erg.abdn.ac.uk> Tue, 20 February 2001 18:01 UTC

Message-ID: <3A92B0F7.F9174287@erg.abdn.ac.uk>
Date: Tue, 20 Feb 2001 18:01:27 +0000
From: Dr G Fairhurst <gorry@erg.abdn.ac.uk>
Reply-To: gorry@erg.abdn.ac.uk
Organization: erg.abdn.ac.uk
X-Mailer: Mozilla 4.73C-CCK-MCD {C-UDP; EBM-APPLE} (Macintosh; U; PPC)
X-Accept-Language: en
MIME-Version: 1.0
To: Reiner Ludwig <Reiner.Ludwig@ericsson.com>
CC: pilc@grc.nasa.gov
Subject: Re: TCP and Link Layer Retransmissions
References: <5.0.2.1.0.20010219140023.01f0a4f0@chapelle.ericsson.se>
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Sender: owner-pilc@lerc.nasa.gov
Precedence: bulk
Status: RO
Content-Length: 8797
Lines: 181

This seems like a clear description of highly persistent ARQ for 
terrestrial cellular applications at modest bit rates, based on 
your talk at the San Diego IETF. Thanks for writing this up.

I was trying to figure out the details of the link you were
describing: what bit rate and link RTT did you have in mind in the 
text?  I'd also be interested in what bit error rate (or packet loss) 
you observed outside of the outages you describe - is the coding 
such that you never see any appreciable packet loss except during 
outages, and then 100% loss?


Reiner Ludwig wrote:
> 
> Hi,
> 
> as agreed in San Diego, I have rewritten the section "Reliability and Error
> Control" of
> http://search.ietf.org/internet-drafts/draft-ietf-pilc-link-design-04.txt.
> 
> I changed the section title to "TCP and Link Layer Retransmissions". Below,
> is my first draft. Please, send comments.
> 
> ///Reiner
> 
> --------------
> 
> TCP and Link Layer Retransmissions
> 
> Error recovery denotes the function of retransmitting lost or corrupted
> data. It is implemented at a protocol layer that provides a reliable
> service. Error recovery is both an end-to-end and a link layer function.
> Examples of protocols implementing end-to-end error recovery include TCP,
> SCTP, and reliable multicast protocols. Examples of protocols implementing
> link layer error recovery are numerous, but many of them are derived from
> HDLC [???], the "mother" of all link layer protocols. Error recovery at the
> link layer is often referred to as Automatic Repeat reQuest (ARQ), and we
> therefore use the abbreviation LL ARQ in some cases.
> 
> As motivated in the introduction, end-to-end error recovery is required to
> provide a reliable service to the application. On the other hand, some
> wireless links require link layer error recovery for performance reasons.
> For example, the characteristics of many cellular links are such that the
> link's optimal retransmission unit size is often less than 100
> bytes. Comparing this with the minimum MTU of 1280 bytes required for IPv6
> makes it obvious that relying solely on end-to-end error recovery is not
> possible (or at least only possible with prohibitive end-to-end performance
> cost) when the end-to-end path includes such a link. Even on links where
> the link's optimal retransmission unit size is larger than the path's MTU,
> e.g., on many satellite links, link layer error recovery is often an
> attractive alternative for increasing end-to-end performance. Thus, link
> layer and end-to-end error recovery often operate simultaneously, leading
> to the so-called problem of "competing error recovery".
> 
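
To put rough numbers on the MTU point (my assumptions, not figures from
the draft): a 1280-byte packet carried in ~100-byte frames spans 13
frames, so per-frame losses compound quickly without LL ARQ. A quick
Python sketch:

    # Rough illustration with assumed numbers: a 1280-byte IPv6 packet
    # carried in 100-byte link frames needs ceil(1280/100) = 13 frames.
    # Without link layer recovery, ONE lost frame loses the whole packet.
    import math

    frames = math.ceil(1280 / 100)            # 13 frames per packet
    for p in (0.001, 0.01, 0.05):             # assumed per-frame loss rates
        packet_loss = 1 - (1 - p) ** frames   # lost if any frame is lost
        print(f"frame loss {p:.3f} -> packet loss {packet_loss:.3f}")

    # At 5% frame loss nearly half of all packets are lost end-to-end,
    # which end-to-end recovery alone cannot turn into usable performance.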

Which sort of satellite links do you mean?
- There's a pretty wide range of links out there, from ad hoc SCPC links
to DAMA TDMA and high-speed systems (e.g., DVB).  Many satellite links
actually use frames much smaller than the IP packet MTU.


> Competition is introduced when error recovery is run both at the link layer
> and end-to-end. It might lead to the following wasteful situation. The link
> layer is retransmitting one or more packets, i.e., delaying the packet(s)
> in the network. Simultaneously, one of the network end-points considers the
> packet(s) lost, and spuriously triggers end-to-end retransmission of the
> same packets. Those spurious end-to-end retransmissions reduce the
> end-to-end performance and put unnecessary load onto the network. It
> may even occur that two or more copies of the same packet reside in the
> send buffer of the sending link layer at the same time. In general, one
> could say that competing error recovery is caused by an inner control loop
> (link layer error recovery) reacting to the same signal as an outer control
> loop (end-to-end error recovery), with no coordination between the two
> loops.
> 
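
As an aside, here is how I picture the two loops interacting - a toy
Python sketch with made-up timings, not a model of any real system:

    # Toy sketch of "competing error recovery" (all timings assumed):
    # the inner loop (LL ARQ) is still retrying a packet when the outer
    # loop (TCP's RTO) fires, so a duplicate joins the link send buffer.

    ll_persistency_ms = 3000      # assumed: link retries for up to 3 s
    tcp_rto_ms        = 1000      # assumed: TCP retransmission timeout
    ll_retry_round_ms = 100       # assumed: one LL retry per 100 ms

    link_buffer = ["pkt-1"]       # LL ARQ is holding and retrying pkt-1
    for t in range(0, ll_persistency_ms, ll_retry_round_ms):
        if t == tcp_rto_ms:                # outer loop gives up first...
            link_buffer.append("pkt-1")    # ...spurious end-to-end copy

    print(link_buffer)            # ['pkt-1', 'pkt-1']: two copies of the
                                  # same packet queued at one link layer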

Competing error recovery is one issue, and I guess you are right that it 
has a major impact for a small number of TCP flows sharing a common link.
- Some of your papers mention per-flow methods; I wonder, do you have
practical experience of using a scheme which supports large numbers of
simultaneous TCP/UDP flows?
- That would surely be really interesting to the PILC WG.

> The problem of competing error recovery raises the question of how
> persistently a link layer should be retransmitting. We define LL ARQ
> persistency as the maximum time (in milliseconds) that a single IP packet
> may reside at a particular link layer. It is the maximum delay permitted
> for an IP packet between a sending and a receiving IP layer before the
> packet must be discarded by the link layer. Note that this definition says
> nothing about the maximum number of retransmissions, retransmission
> strategies, queue sizes, queueing disciplines, transmission delays, or the
> like. The reason we use the term LL ARQ persistency instead of a term such
> as 'maximum link layer packet holding time' is that the definition closely
> relates to link layer error recovery. For example, on links that implement
> straightforward error recovery strategies, LL ARQ persistency will often
> directly translate into a maximum number of retransmissions that may be
> permitted per link layer frame.
> 
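
For what it's worth, here is how I read the "directly translates" remark
- a sketch with assumed link timings for a simple stop-and-wait ARQ,
since the draft gives no concrete figures:

    # Converting an LL ARQ persistency budget into a per-frame
    # retransmission limit (all link timings below are assumptions).
    persistency_ms = 50       # the low setting suggested in the draft
    frame_tx_ms    = 8        # assumed: time to send one ~100-byte frame
    ack_wait_ms    = 12       # assumed: time to wait for the LL ack

    round_ms = frame_tx_ms + ack_wait_ms    # one transmission attempt
    attempts = persistency_ms // round_ms   # attempts within the budget
    max_retx = max(attempts - 1, 0)         # first attempt is not a retx

    print(max_retx)   # -> 1 retransmission before the frame is discarded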

It may be worth saying there is no bandwidth-on-demand or MAC access protocol
involved in this, and that retransmission is therefore immediate.

> For link layers that do not (e.g., because it is not supported by the
> implementation) or cannot (e.g., due to network layer encryption)
> differentiate between flows or types of flows, the LL ARQ persistency
> should be set to a low value. That way the link layer does not interfere
> with flows that carry delay-sensitive data. Any value up to about 50
> milliseconds should be reasonably safe.

Now, here I am confused - where does 50 ms come from? Is this part of
the design spec of the cellular system? I can't believe UDP breaks
above this; TCP and UDP seem to work OK over longer-delay links.

> 
> However, for link layers that treat TCP flows (alternatively, all reliable
> flows, including reliable multicast) separately from non-TCP flows, the LL
> ARQ persistency for the TCP flows should be high, while it should be low
> (up to 50 ms - see above) for the non-TCP flows. The LL ARQ persistency for
> the TCP flows should be as high as the largest link outage period expected
> on a particular link, but no higher than TCP's maximum retransmission timer
> value of 64 seconds.


I really don't understand the 64-second number.  Isn't TCP's max RTO timer 
based on the total PATH RTT experienced by the TCP flow? If this is
to be used to perform link-level calculations, shouldn't it be divided by
the number of links along the path - or at least by two, if you assume 
terrestrial cellular may never be a transit network and may only be used
at the edge? Is 64 seconds the right number to use in a link-level design?
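
For reference, my understanding of where 64 seconds comes from is the
BSD-derived TCPs, which double the RTO on each successive timeout and
clamp it at 64 s - a sender-side cap, not anything derived per link:

    # Sketch of classic BSD-style RTO exponential backoff (assumed
    # initial RTO of 1 s; the 64 s ceiling is the point of interest).
    rto = 1
    backoffs = []
    for _ in range(8):            # successive timeouts with no ACK seen
        backoffs.append(rto)
        rto = min(rto * 2, 64)    # double per timeout, 64 s ceiling

    print(backoffs)   # [1, 2, 4, 8, 16, 32, 64, 64]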


> The benefit of a high LL ARQ persistency for TCP
> flows is that it more efficiently utilizes a link's radio spectrum and a
> mobile device's transmission power (battery lifetime), e.g., see [LKJK01],
> while it commonly also provides higher end-to-end performance in the face
> of transient link outages (see Section "Recovery from Subnetwork Outages").
> A transient link outage is likely to cause a spurious timeout in TCP, which
> in turn forces the TCP sender into a go-back-N-style retransmission mode.
> That is, the TCP sender will unnecessarily retransmit all outstanding data.
> However, this is commonly still better than a potentially long idle period
> due to TCP's retransmission timer, which might have gone through multiple
> stages of exponential backoff. Besides, solutions have been proposed to
> remove the spurious go-back-N retransmits from TCP [LK00].
> 
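
On [LK00], the timestamp check at the heart of Eifel, as I read the
paper (much simplified here; the function name is mine):

    # If the first ACK after a timeout echoes a timestamp OLDER than the
    # one carried by the retransmission, the ACK was for the original
    # segment, so the timeout - and any go-back-N resending - was spurious.

    def timeout_was_spurious(ts_of_retransmit, ts_echoed_by_ack):
        # TCP timestamps: the ACK echoes the timestamp of the segment
        # that triggered it (simplified from the rules in RFC 1323).
        return ts_echoed_by_ack < ts_of_retransmit

    # Original segment sent at t=100, retransmitted at t=600 after an
    # outage; the late ACK of the ORIGINAL echoes 100, not 600:
    print(timeout_was_spurious(600, 100))    # True -> suppress go-back-N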

Does DSACK not also help remove spurious retransmissions? It seems nearer
to being adopted by the IETF.


> [LK00] R. Ludwig, R. H. Katz, "The Eifel Algorithm: Making TCP Robust
> Against Spurious Retransmissions", ACM Computer Communication Review, Vol.
> 30, No. 1, January 2000.
> 
> [LKJK01] R. Ludwig, A. Konrad, A. D. Joseph, R. H. Katz, "Optimizing the
> End-to-End Performance of Reliable Flows over Wireless Links", To appear in
> ACM/Baltzer Wireless Networks Journal (Special issue: Selected papers from
> ACM/IEEE MOBICOM 99), available at
> http://iceberg.cs.berkeley.edu/publications.html.



In the talk, you also noted that you agreed with MOST of the conclusions
of the ID below:

http://search.ietf.org/internet-drafts/draft-ietf-pilc-link-arq-issues-00.txt

I think this implies that you have some points of debate - could you let
me know which bits you agree (or disagree) with?  We're revising it at the 
moment, based on other feedback received.


Thanks for the input,

Gorry

-- 
------------------------------
http://www.erg.abdn.ac.uk/users/gorry