SUGGESTED ARQ TEXT Advice for Internet Subnetwork Designers ID

Dr G Fairhurst <gorry@erg.abdn.ac.uk> Tue, 24 April 2001 12:51 UTC

Message-ID: <3AE576D2.24827CD@erg.abdn.ac.uk>
Date: Tue, 24 Apr 2001 13:51:30 +0100
From: Dr G Fairhurst <gorry@erg.abdn.ac.uk>
Reply-To: gorry@erg.abdn.ac.uk
Organization: erg.abdn.ac.uk
X-Mailer: Mozilla 4.75 (Macintosh; U; PPC)
X-Accept-Language: en
MIME-Version: 1.0
To: pilc@grc.nasa.gov
Subject: SUGGESTED ARQ TEXT Advice for Internet Subnetwork Designers ID
References: <0703A3E1D430D411866100508BDFCF3328B89A@CTOEXCH1>
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Sender: owner-pilc@lerc.nasa.gov
Precedence: bulk

Some people have asked me about progress on clarifying the
ARQ text included in the current link ID.

Following the IETF meeting in Minneapolis, the co-authors listed below
suggest a replacement for the ARQ text in the link draft. This is based
on the text in the last issued link ID. We hope this new text clarifies
the debate on ARQ persistency at IETF-49 and IETF-50. We also intend it
to be consistent with the remainder of the link draft and with the
related ARQ draft (draft-ietf-pilc-link-arq-issues-XX.txt).

The replacement text below has not changed since 12th April 2001.

Gorry Fairhurst
Lloyd Wood
Reiner Ludwig


In the coming weeks, a revision of the ARQ draft
(draft-ietf-pilc-link-arq-issues-XX.txt), based on feedback received
and any further comments, will be issued to give a more detailed
discussion of the ARQ issues.

--------

TCP vs Link Layer Retransmission

Error recovery generally involves the retransmission of lost or
corrupted data when explicitly or implicitly requested by the
receiver. It can also involve the generation and transmission of
redundant information that lets the receiver regenerate or correct
some amount of lost or corrupted data without needing explicit
retransmission of that amount.

The retransmission approach, widely known as "ARQ" (Automatic Repeat
reQuest) for largely historical reasons, is found in many computer
networking protocols.

The redundant-information approach, using error control coding (of
which Forward Error Correction, or FEC, is a well-known example),
takes place in the data-link layer, very close to the physical layer.
Many link layers use a combination of both coding and ARQ
retransmissions to improve performance.

Depending on the layer where it is implemented, error control can
operate on an end-to-end basis or over a shorter span such as a
single link.  TCP is the most important example of an end-to-end
protocol that uses an ARQ strategy.

A large number of link layer protocols use ARQ, most
often some flavor of HDLC [ISO3309]. Examples include the X.25 link
layer, the AX.25 protocol used in amateur packet radio, 802.11
wireless LANs, and the reliable link layer specified in IEEE 802.2.

As explained in the introduction, only end-to-end error recovery can
ensure a reliable service to the application. But some subnetworks
(e.g., many wireless links) also require link layer error recovery as
a performance enhancement.  For example, many cellular links have
small physical frame sizes (< 100 bytes) and relatively high frame
loss rates. Relying entirely on end-to-end error recovery clearly
yields a performance degradation, as retransmissions across the end-to-end
path take much longer to be received than when link-local
retransmissions are used. Thus, link-layer error recovery can often 
increase end-to-end performance. As a result, link-layer and end-to-end 
recovery often co-exist; this raises the possibility of inefficient
interactions between the two layers of ARQ protocols.

This inter-layer "competition" might lead to the following wasteful
situation. When the link layer retransmits a packet, the link latency
momentarily increases. Since TCP bases its retransmission timeout on
prior measurements of end-to-end latency, including that of the link
in question, this sudden increase in latency may trigger an
unnecessary retransmission by TCP of a packet that the link layer is
still retransmitting.  Such spurious end-to-end retransmissions
generate unnecessary load and reduce end-to-end throughput. One may
even have multiple copies of the same packet in the same link queue
at the same time. In general, the competing error recovery is caused
by an inner control loop (link-layer error recovery) reacting to the
same signal as an outer control loop (end-to-end error recovery),
without any coordination between the two loops. Note that this is an
efficiency issue: TCP continues to provide reliable end-to-end
delivery over such links.
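The effect of such a latency spike on TCP's retransmission timer can be
sketched as follows. This is our illustration only, not part of the
suggested draft text: the estimator follows the classic Jacobson/Karels
SRTT/RTTVAR computation, and all constants are illustrative.

```python
# Minimal sketch of how a burst of link-layer retransmissions can
# trigger a spurious TCP timeout.  RTO estimation follows the classic
# Jacobson/Karels algorithm (SRTT/RTTVAR); constants are illustrative.

def make_rto_estimator(alpha=0.125, beta=0.25, min_rto=1.0):
    """Return (update, rto) closures tracking SRTT/RTTVAR in seconds."""
    state = {"srtt": None, "rttvar": None}

    def update(rtt_sample):
        if state["srtt"] is None:
            state["srtt"] = rtt_sample
            state["rttvar"] = rtt_sample / 2
        else:
            state["rttvar"] = ((1 - beta) * state["rttvar"]
                               + beta * abs(state["srtt"] - rtt_sample))
            state["srtt"] = (1 - alpha) * state["srtt"] + alpha * rtt_sample

    def rto():
        return max(min_rto, state["srtt"] + 4 * state["rttvar"])

    return update, rto

update, rto = make_rto_estimator()
for _ in range(20):       # quiet path: steady 100 ms RTT samples
    update(0.100)

quiet_rto = rto()         # timeout tuned to the quiet link
# Link-layer retransmissions inflate one round trip well past the RTO,
# so TCP retransmits a packet the link layer is still delivering.
spiked_rtt = 3 * quiet_rto
spurious_timeout = spiked_rtt > quiet_rto
```

The point of the sketch is only that the RTO converges to the quiet-path
latency, so any link-local delay longer than that triggers an end-to-end
retransmission even though the packet is not lost.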

This raises the question of how persistent a link-layer sender should
be in performing retransmission. We define link-layer (LL) ARQ
persistency as the maximum time that a particular link will spend
trying to transfer a packet before it may be discarded. This
deliberately simplified definition says nothing about the maximum
number of retransmissions, retransmission strategies, queue sizes,
queuing disciplines, transmission delays, or
the like. The reason we use the term LL ARQ persistency instead of a
term such as 'maximum link layer packet holding time' is that the
definition closely relates to link layer error recovery. For example,
on links that implement straightforward error recovery strategies,
LL ARQ persistency will often correspond to a maximum number of
retransmissions permitted per link layer frame [ARQ-DRAFT].
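As an illustration of that correspondence (ours, not part of the
suggested draft text): on a link with a fixed per-attempt delay, a
persistency expressed in seconds maps directly to a retransmission
budget. All names and values below are assumptions.

```python
# Hypothetical sketch: on a link with a fixed per-attempt delay, an
# LL ARQ persistency expressed in seconds corresponds to a maximum
# number of retransmissions per frame.  Values are illustrative.

def max_retransmissions(persistency_s, attempt_delay_s):
    # The first transmission is not a retransmission, hence the -1.
    return max(0, int(persistency_s // attempt_delay_s) - 1)

class ArqSender:
    """Gives up on a frame once its persistency budget is spent."""

    def __init__(self, persistency_s, attempt_delay_s):
        self.limit = max_retransmissions(persistency_s, attempt_delay_s)

    def send(self, frame, link_delivers):
        """link_delivers(frame) -> bool; True means the frame got through."""
        for _ in range(self.limit + 1):  # one try plus retransmissions
            if link_delivers(frame):
                return True
        return False  # persistency exhausted; frame is discarded
```

For example, a 1.5 s persistency on a link with a 0.25 s per-attempt
delay permits five retransmissions after the initial try.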

For link layers that do not or cannot differentiate between flows
(e.g., due to network layer encryption), the LL ARQ persistency
should be small. This avoids any harmful effects or performance
degradation resulting from indiscriminately high persistency.
A detailed discussion of these issues is provided in [ARQ-DRAFT].

However, a different policy is possible when a link layer is able to
identify separate flows [ARQ-DRAFT] and to isolate the effects of ARQ
on the different flows sharing the same link (or when all flows
observe a common pattern of loss, e.g., an outage). The link ARQ
persistency should be high for a flow using a reliable unicast
transport protocol (e.g., TCP) and must be low for all other flows.
Setting the link ARQ persistency larger than the longest expected link
outage would allow TCP to rapidly restore transmission without needing
to wait for a retransmission timeout, generally improving TCP
performance in the face of transient outages. However, excessively
high persistency may be disadvantageous (a practical upper limit of
30-60 seconds may be desirable). Implementation of such schemes
remains a research issue. (See also the section "Recovery from
Subnetwork Outages".)
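The flow-dependent persistency policy described above might be sketched
as follows. This is our illustration only; the cap, margin, and
low-persistency value are assumed constants chosen within the ranges
suggested above, not values from the draft.

```python
# Illustrative per-flow persistency policy (all values are assumptions):
# high persistency, bounded by a practical cap, for reliable unicast
# flows such as TCP; low persistency for everything else.
PRACTICAL_CAP_S = 60.0   # suggested upper limit of 30-60 seconds
OUTAGE_MARGIN_S = 2.0    # assumed margin above the longest outage

def ll_arq_persistency(reliable_unicast, longest_outage_s):
    if reliable_unicast:  # e.g., a TCP flow
        # Bridge the longest expected outage, but never exceed the cap.
        return min(longest_outage_s + OUTAGE_MARGIN_S, PRACTICAL_CAP_S)
    return 1.0            # assumed low persistency for other flows
```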

Recovery from Subnetwork Outages

Some types of subnetworks, particularly mobile radio, are subject to
frequent but temporary outages. For example, an active cellular data user
may drive or walk into an area (such as a tunnel) that is out of
range of any base station. No packets will be successfully delivered
until the user returns to an area with coverage.

The Internet protocols currently provide no standard way for a
subnetwork to explicitly notify an upper layer protocol (e.g., TCP)
that it is experiencing an outage, as distinguished from severe
congestion.

Under these circumstances TCP will, after each
unsuccessful retransmission, wait even longer before trying again;
this is its "exponential backoff" algorithm. And since there is also
currently no way for a subnetwork to explicitly notify TCP when it is
again operational, TCP will not discover this until its next
retransmission attempt. If TCP has backed off, this may take some
time.  This can lead to extremely poor TCP performance over such
subnetworks.
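The backoff arithmetic makes the problem concrete. The sketch below is
our illustration (not draft text); the initial RTO, cap, and attempt
count are assumed values.

```python
# Sketch of TCP's exponential backoff: each unsuccessful retransmission
# doubles the wait (up to a cap), so an outage that ends mid-interval
# goes unnoticed until the next attempt fires.  Constants illustrative.
def backoff_schedule(initial_rto, max_rto=64.0, attempts=7):
    """Elapsed times at which successive retransmissions are sent."""
    times, wait, now = [], initial_rto, 0.0
    for _ in range(attempts):
        now += wait
        times.append(now)
        wait = min(2 * wait, max_rto)
    return times

# With a 1-second initial RTO, the attempts fall at
# 1, 3, 7, 15, 31, 63, and 127 seconds: an outage ending at t = 32 s
# is not discovered until the retransmission at t = 63 s.
```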

It is therefore highly desirable that a subnetwork subject to outages
not silently discard packets during an outage. Ideally, it should
define an interface to the next higher layer (i.e., IP) that allows
it to refuse packets during an outage, and to automatically ask IP
for new packets when it is again able to deliver them. If it cannot
do this, then the subnetwork should hold onto at least some of the
packets it accepts during an outage and attempt to deliver them when
the subnetwork comes back up.

Note that it is *not* necessary to avoid any and all packet drops
during an outage. The purpose of holding onto a packet during an
outage, either in the subnetwork or at the IP layer, is so that its
eventual delivery will implicitly notify TCP that the subnetwork is
again operational.

This is to enhance performance, not to ensure
reliability -- a task that, as discussed earlier, can only be done
properly on an end-to-end basis.

Only a single packet per TCP connection needs to be held in this way;
its eventual delivery generates a TCP acknowledgment that causes the
TCP sender to recover from the additional losses once the flow resumes.

Because it would be a layering violation (and possibly a performance
hit) for IP or a subnetwork to look at the TCP headers of the packets
it carries (which would in any event be impossible if IPsec encryption
is in use), it would be reasonable for the IP or subnetwork layer to
choose, as a design parameter, some small number of packets to retain
during an outage.
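One way such a design parameter could be realized is sketched below.
This is a hypothetical illustration only; the buffer size and the
interface names are our assumptions, not a specification.

```python
# Hypothetical subnetwork-side holding buffer: during an outage, keep
# only the first few packets (a design parameter) and drop the rest,
# without inspecting any TCP headers.  Delivering the held packets
# when the link recovers implicitly tells TCP the path is up again.
from collections import deque

class OutageBuffer:
    def __init__(self, hold_limit=3):
        self.hold_limit = hold_limit  # small, chosen at design time
        self.held = deque()

    def accept(self, packet, link_up):
        """Returns "sent", "held", or "dropped"."""
        if link_up:
            return "sent"             # normal delivery
        if len(self.held) < self.hold_limit:
            self.held.append(packet)
            return "held"             # retained to bridge the outage
        return "dropped"              # further drops are acceptable

    def flush(self):
        """Deliver held packets once the link is up again."""
        out = list(self.held)
        self.held.clear()
        return out
```

Note that the buffer drops excess packets rather than queueing them
all, consistent with the observation above that avoiding every drop
during an outage is unnecessary.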