[aqm] Gaming ECN (again) (was: think once to mark, think twice to drop: draft-ietf-aqm-ecn-benefits-02)

Bob Briscoe <bob.briscoe@bt.com> Wed, 15 April 2015 09:23 UTC

Message-ID: <201504150923.t3F9N4Ns019432@bagheera.jungle.bt.co.uk>
X-Mailer: QUALCOMM Windows Eudora Version
Date: Wed, 15 Apr 2015 10:23:04 +0100
To: David Lang <david@lang.hm>
From: Bob Briscoe <bob.briscoe@bt.com>
In-Reply-To: <alpine.DEB.2.02.1504131441190.11469@nftneq.ynat.uz>
References: <23AFEFE3-4D93-4DD9-A22B-952C63DB9FE3@cisco.com> <BF6B00CC65FD2D45A326E74492B2C19FB75BAA82@FR711WXCHMBA05.zeu.alcatel-lucent.com> <72EE366B-05E6-454C-9E53-5054E6F9E3E3@ifi.uio.no> <55146DB9.7050501@rogers.com> <08C34E4A-DFB7-4816-92AE-2ED161799488@ifi.uio.no> <BF6B00CC65FD2D45A326E74492B2C19FB75BAFA0@FR711WXCHMBA05.zeu.alcatel-lucent.com> <alpine.DEB.2.02.1503271024550.2416@nftneq.ynat.uz> <5d58d2e21400449280173aa63069bf7a@hioexcmbx05-prd.hq.netapp.com> <20150327183659.GI39886@verdi> <72C12F6B-9DDE-4483-81F2-2D9A0F2D3A48@cs.columbia.edu> <alpine.DEB.2.02.1503271211200.19390@nftneq.ynat.uz> <D13AFCE7.46BC%kk@cs.ucr.edu> <alpine.DEB.2.02.1503271257230.19390@nftneq.ynat.uz> <AE342093-DE05-4D93-96DA-EB07E221F1D9@netapp.com> <alpine.DEB.2.02.1503291923300.26044@nftneq.ynat.uz> <201504131511.t3DFBG3R002270@bagheera.jungle.bt.co.uk> <alpine.DEB.2.02.1504131441190.11469@nftneq.ynat.uz>
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"; format=flowed
X-Scanned-By: MIMEDefang 2.56 on
Archived-At: <http://mailarchive.ietf.org/arch/msg/aqm/66txslu9-8LxRnrUD7NQErVYr5Y>
Cc: "Scheffenegger, Richard" <rs@netapp.com>, Vishal Misra <misra@cs.columbia.edu>, KK <kk@cs.ucr.edu>, John Leslie <john@jlc.net>, "aqm@ietf.org" <aqm@ietf.org>
Subject: [aqm] Gaming ECN (again) (was: think once to mark, think twice to drop: draft-ietf-aqm-ecn-benefits-02)
List-Id: "Discussion list for active queue management and flow isolation." <aqm.ietf.org>


At 22:46 13/04/2015, David Lang wrote:
>On Mon, 13 Apr 2015, Bob Briscoe wrote:
>>Returning from a fortnight offlist...
>>I think your conception of how ECN works is incorrect. You describe 
>>ECN as if the AQM marks one packet when it drops another packet. 
>>You say that the ECN-mark speeds up the retransmission of the 
>>dropped packet. On the contrary, the idea of classic ECN [RFC3168] 
>>is that the ECN marks replace the drops. In all known testing 
>>(except pathological cases), classic ECN effectively eliminates 
>>drops for all ECN-capable packets.
>That's what I thought, and if that was the case, then marking 
>packets as ECN-capable would mean that they would have an advantage 
>over non-ECN packets (by not getting dropped, so getting a higher 
>share of bandwidth)

This is a common fallacy. An ECN-capable TCP achieves the same 
throughput as an otherwise identical non-ECN TCP. The fallacy comes 
from people who think that the network caps the rate by removing 
packets. But the source determines the rate by how many drop or mark 
signals it sees. For today's ('classic') ECN, the source's rate 
reduction in response to either is the same, and in both cases the 
reduction is voluntary.

There will be a tiny difference in goodput, because of the 
retransmissions. However, the loss (or marking) probability that TCP 
uses to determine its rate is a fraction of a percent, so this 
difference is in the noise.

>that's what the gaming ECN thread was about, and if I understood the 
>responses, I was being told that marking packets as ECN-capable, but 
>not slowing down (actually responding to ECN) would not let an 
>application get any advantage because the packets would just end up 
>getting dropped anyway, since marking and dropping happen at the 
>same level, even on ECN-capable flows.


I recall that thread. People came up with a number of complicated 
arguments for why it is hard to game ECN, but none were solid. Below 
I have given a simple argument that I think is solid on its own. I 
thought about intervening at the time, but this stuff needs care and 
time that I didn't have then.

1) For the most simple and complete argument, all you need to know 
about the ECN behaviour of AQMs is: under normal load conditions, an 
AQM decides it's time to send a congestion signal irrespective of 
whether the next packet is ECN-capable or not. Then, if the next 
packet is ECN-capable, it marks it, else it drops it. This is from 
RFC3168, which also requires the source to respond equally to either 
a loss or a mark. I call this 'classic' ECN.{Note 1}
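
A minimal sketch of the decision described in (1), assuming a hypothetical AQM that signals congestion with a fixed probability p under normal load. The function names and the random-number model are illustrative only and are not taken from RFC3168 or any real AQM. The point it shows is that ECN capability changes the form of the signal, not how often it is sent.

```python
import random

def aqm_signal(ect_capable: bool, p: float, rng: random.Random) -> str:
    """Return 'forward', 'mark' or 'drop' for one arriving packet.

    The decision to signal congestion is taken first, without looking
    at the packet's ECN capability. Only the form of the signal
    (mark or drop) depends on it.
    """
    if rng.random() >= p:
        return "forward"          # no congestion signal this time
    return "mark" if ect_capable else "drop"

def signal_fraction(ect_capable: bool, p: float,
                    n: int = 200_000, seed: int = 1) -> float:
    """Fraction of n packets that receive a congestion signal."""
    rng = random.Random(seed)
    hits = sum(aqm_signal(ect_capable, p, rng) != "forward"
               for _ in range(n))
    return hits / n

# Same signalling frequency for both kinds of packet (hypothetical p = 0.5%).
print(signal_fraction(True, 0.005), signal_fraction(False, 0.005))
```

With the same seed, both calls return the same fraction, because the ECN capability of the packet never enters the decision to signal.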

2) I will try to correct your misunderstanding about "marking and 
dropping at the same level even on ECN-capable flows". However, to 
determine whether ECN can be gamed, there's no need to go there. So 
I'll come back to that as a post-script{Note 2}.

3) I will prove that it is as easy to game loss as it is to game ECN, 
first considering sender cheating, then receiver cheating:

3a) Sender Cheating
From the sender's point of view, the only difference between a loss 
and an ECN mark is that it has to retransmit the lost packet. But that has 
nothing to do with the rate it can go at. If it has been programmed 
to ignore congestion feedback (and instead to go at a constant 
unresponsive rate{Note 2}), it is as easy for it to ignore loss 
feedback as ECN feedback. See {Note 3} for an example.

3b) Receiver Cheating
* An ECN receiver can best fool an ECN-capable TCP sender into going 
faster by only feeding back a small fraction of ECN marks.{Note 4}
* A non-ECN receiver could fool a non-ECN TCP sender into going 
faster by only revealing a small fraction of the losses. However, it 
would have to ACK undelivered bytes, and most TCP-based apps won't 
work unless all bytes are delivered.{Note 5}

So it seems that it's easier for a receiver to game ECN than loss. However:
* returning to the ECN case, the sender can validate the receiver by 
randomly setting an ECN mark itself on a very small proportion of 
packets (probably only on unusually high rate connections). Then if 
it doesn't see ECN feedback on the ACK of any one of its 
self-inserted marks, it can close the connection.
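
The validation idea above can be sketched as follows. This is a toy model under stated assumptions: feedback is treated as per-packet (real TCP ECN feedback is carried differently), and `self_mark_prob` and `echo_prob` are hypothetical parameters introduced only for illustration.

```python
import random

def sender_detects_cheat(n_packets: int, self_mark_prob: float,
                         echo_prob: float, seed: int = 0) -> bool:
    """Return True if the sender catches the receiver hiding a mark.

    self_mark_prob: fraction of packets the sender marks itself as a test.
    echo_prob: fraction of marks the receiver chooses to feed back
               (1.0 models an honest receiver).
    """
    rng = random.Random(seed)
    for _ in range(n_packets):
        if rng.random() < self_mark_prob:     # sender inserts a test mark
            if rng.random() >= echo_prob:     # receiver suppresses feedback
                return True                   # missing echo: close connection
    return False

# An honest receiver is never flagged; one that hides most marks usually is.
print(sender_detects_cheat(100_000, 0.001, 1.0))
print(sender_detects_cheat(100_000, 0.001, 0.01))
```

In this model, a receiver that hides most marks is caught after only a few test marks, so the sender can keep the proportion of self-marked packets very small.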

In summary,
* a sender can't game ECN any more easily than it can game loss.
* a receiver can only game ECN if the sender doesn't take measures to 
prevent it.{Note 6}

>If the packets are just marked, but not dropped, then the 
>ECN-capable flows will occupy a disproportionate share of the 
>available buffer space, since they just get marked instead of dropped.


The arrival rates will be the same, whether or not ECN is used (see 
earlier). And recall that TCP drives the marking or loss probability 
to very small fractions in all normal conditions.

Example: if there are 10 flows in a 100Mb/s link, 5 ECN and 5 
non-ECN, they will all arrive at the buffer at 10Mb/s (all other 
factors being equal). Then, if the loss or marking probability is 
0.5%, the AQM will be marking but not dropping 1 in 200 packets in 
the ECN flows whereas it would drop 1 in 200 from the non-ECN flows.

So, assuming tail drop, if there were 399 packets in this buffer, on 
average 200 would be ECN-capable (40 in each flow) with one marked; 
and 199 would be non-ECN-capable (40 in each flow except one with 
39). And one of those 199 would be a retransmission from an earlier loss.

[Of course, we would hope that there would be 4 packets in the 
buffer, not 400. The proportions would still be the same on average. 
I merely used 399 to avoid fractions of packets for the averages.]
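
The arithmetic of this example can be checked with a short sketch. It assumes equal arrivals from the ECN and non-ECN classes and a signal on exactly 1 in 200 packets; the function name is hypothetical.

```python
def buffer_shares(arrivals_per_class: int, signal_one_in: int):
    """Queue occupancy when marked packets stay and dropped packets leave."""
    signalled = arrivals_per_class // signal_one_in
    ecn_queued = arrivals_per_class                  # marks stay in the queue
    non_ecn_queued = arrivals_per_class - signalled  # drops are removed
    total = ecn_queued + non_ecn_queued
    return ecn_queued, non_ecn_queued, ecn_queued / total

ecn, non_ecn, ecn_share = buffer_shares(200, 200)
print(ecn, non_ecn, round(ecn_share, 4))   # prints: 200 199 0.5013
```

The ECN-capable packets hold about 50.1% of the buffer rather than 50%, which is the size of the 'disproportionate share' at a 0.5% signalling probability.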


{Note 1} To be concrete, I've assumed classic ECN [RFC3168]. The 
argument is similar for research approaches like "think once to mark, 
twice to drop", but let's not make it more complicated than it needs to be.

{Note 2} RFC3168 (and draft-ietf-aqm-recommendation) require that, 
whenever the AQM decides it is time to signal congestion on the next 
packet, if the queue has been persistently long, the AQM must only 
use drop as a congestion signal, irrespective of whether the next 
packet is ECN-capable or not.

So, if a source naively just continues to increase its window until 
it drives the queue into overload, then it would cause the AQM to 
turn off ECN and consequently not be able to game ECN. But the simple 
strategy of sending at a high but constant rate avoids driving the 
queue into overload. So that's the strategy I described for gaming 
ECN. Because strategies that don't work are irrelevant if there's a 
strategy that does work.

{Note 3} Examples to show source cheating is as easy with loss as ECN:
* An ECN source sends at a constant unresponsive 90Mb/s through a 
100Mb/s bottleneck. In parallel some other responsive flows (say 10 
non-ECN TCP flows) squeeze themselves into the remaining 10Mb/s. They 
will cause themselves (say) 0.5% loss probability, while the 
unresponsive flow will experience 0.5% marking and zero loss.
* A non-ECN source can just as easily send unresponsively at 90.5Mb/s 
as 90Mb/s. The other flows will still drive loss to about 0.5%, which 
the unresponsive flow will now experience as well. Nonetheless, after 
it retransmits the 0.5% loss it still achieves goodput of about 90Mb/s.
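
A rough check of the figures in {Note 3}, assuming goodput is simply the send rate multiplied by (1 - loss probability). The helper names are hypothetical.

```python
def goodput_mbps(send_rate_mbps: float, loss_prob: float) -> float:
    """Rate of useful data when a fraction loss_prob has to be resent."""
    return send_rate_mbps * (1.0 - loss_prob)

def required_send_rate_mbps(target_goodput_mbps: float,
                            loss_prob: float) -> float:
    """Send rate needed to reach a target goodput despite losses."""
    return target_goodput_mbps / (1.0 - loss_prob)

# ECN cheat: 90Mb/s sent, marks only, nothing to retransmit.
print(goodput_mbps(90.0, 0.0))
# Loss cheat: 90.5Mb/s sent with 0.5% loss still nets about 90Mb/s.
print(round(goodput_mbps(90.5, 0.005), 4))
# Send rate that would give exactly 90Mb/s goodput at 0.5% loss.
print(round(required_send_rate_mbps(90.0, 0.005), 2))
```

Under this assumption the non-ECN cheat needs roughly 0.5Mb/s more sending rate for the same goodput, which is the only cost of gaming loss instead of ECN.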

{Note 4} Again, feeding back no marks at all would be naive, because 
it would drive the bottleneck into overload, causing it to turn off 
ECN (and driving the loss-rate over a cliff). A better strategy is to 
feed back only a small proportion. Because TCP's rate depends on the 
square root of the congestion probability, to download N times 
faster, the receiver should feed back only about 1 in N^2 of the 
marks or losses. E.g. to go 90 times faster, feed back 1 in 8100 
marks (or losses).
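
The 1 in N^2 figure follows from the square-root relation. A minimal sketch, assuming the sender's rate is proportional to 1/sqrt(p), where p is the congestion probability it perceives, and assuming the actual marking probability stays roughly unchanged. The function names are hypothetical.

```python
import math

def feedback_fraction(speedup: float) -> float:
    """Fraction of marks to feed back for a given speed-up, if rate ~ 1/sqrt(p)."""
    return 1.0 / speedup ** 2

def speedup_from_fraction(fraction: float) -> float:
    """Speed-up obtained when only this fraction of marks is fed back."""
    return 1.0 / math.sqrt(fraction)

# To go 90 times faster, feed back 1 mark in 90^2.
print(round(1.0 / feedback_fraction(90)))        # prints: 8100
print(round(speedup_from_fraction(1 / 8100)))    # prints: 90
```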

{Note 5} There are two classes of apps that use TCP but can get away 
without reliable delivery:
i) Some streaming media apps are designed with a loss-tolerant 
encoding, so they can use TCP but play out the media even if some 
retransmissions haven't arrived yet (e.g. using a raw socket at the receiver).
ii) In the specific case of HTTP, a hacked receiver can open another 
connection to the same server and download the byte-ranges it needs 
to repair the holes in the other connection.

{Note 6} ConEx (congestion exposure [RFC6789]) provides a 
comprehensive framework for the network to prevent senders and 
receivers from cheating. However, for this argument, we don't need to 
go there either.



>David Lang


>>ECN has potential cheating problems, but we have per-customer queues anyway.
>>Using flow as the unit of allocation also has its own problems, 
>>with no proposed solutions.


>>At 05:16 30/03/2015, David Lang wrote:


>>>While AQM makes the network usable, there is still additional room 
>>>for improvement. While dropping packets does result in the TCP 
>>>senders slowing down, and eventually stabilizing at around the 
>>>right speed to keep the link fully utilized, the only way that 
>>>senders have been able to detect problems is to discover that they 
>>>have not received an ack for the traffic within the allowed time. 
>>>This causes a 'bubble' in the flow as the dropped packet must be 
>>>retransmitted (and sometimes a significant amount of data after 
>>>the dropped packet that did make it to the destination, but could 
>>>not be acked because of the missing packet).
>>>This "bubble" in the data flow can be greatly compressed by 
>>>configuring the AQM algorithm to send an ECN packet to the sender 
>>>when it drops a packet in a flow. The sender can then adapt 
>>>faster, slowing down its new data, and re-sending the dropped 
>>>packet without having to wait for the timeout. This has two major 
>>>effects: by allowing the sender to retransmit the packet sooner, the 
>>>delay on the dropped data is not as long, and because the 
>>>replacement data can arrive before the timeout of the following 
>>>packets, they may not need to be re-sent. By configuring the AQM 
>>>algorithm to send the ECN notification to the sender only when the 
>>>packet is being dropped, the effect of failure of the ECN packet 
>>>to get through to the sender (the notification packet runs into 
>>>congestion and gets dropped, some network device blocks it, etc.) 
>>>is that the ECN-enabled case devolves to match the non-ECN case in 
>>>that the sender will still detect the dropped packet via the 
>>>timeout waiting for the ack as if ECN was not enabled.
>>><insert link to possible problems that can happen here, including 
>>>the potential for an app to 'game' things if packets are marked at 
>>>a different level than when they are dropped.>
>>>So a very strong recommendation to enable Active Queue Management, 
>>>while the different algorithms have different advantages and 
>>>levels of testing, even the 'worst' of the set results in a 
>>>night-and-day improvement for usability compared to unmanaged buffers.
>>>Enabling ECN at the same point as dropping packets as part of 
>>>enabling any AQM algorithm results in a noticeable improvement over 
>>>the base algorithm without ECN. When compared to the baseline, the 
>>>improvement added by ECN is tiny compared to the improvement from enabling AQM.
>>>   Is it fair to say that plain aqm vs aqm+ecn variation is on the 
>>> same order of difference as the differences between the different 
>>> AQM algorithms?
>>>Future research items (which others here may already have done, 
>>>and would not be part of my 'elevator pitch')
>>>I believe that currently ECN triggers the exact same slowdown that 
>>>a missed packet does, and it may be appropriate to have the sender 
>>>do a less drastic slowdown.
>>>It would be very interesting to provide some way for the 
>>>application sending the traffic to detect dropped packets and ECN 
>>>responses. For example, a streaming media source (especially an 
>>>interactive one like video conferencing) could adjust the bitrate 
>>>that it's sending.
>>>David Lang
>>Bob Briscoe,                                                  BT
>Bob Briscoe,                                                  BT