Re: [aqm] Review of draft-ietf-aqm-ecn-benefits-03

Gorry Fairhurst <gorry@erg.abdn.ac.uk> Tue, 28 April 2015 12:59 UTC

Message-ID: <553F8414.7050608@erg.abdn.ac.uk>
Date: Tue, 28 Apr 2015 13:59:00 +0100
From: Gorry Fairhurst <gorry@erg.abdn.ac.uk>
Organization: The University of Aberdeen is a charity registered in Scotland, No SC013683.
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:31.0) Gecko/20100101 Thunderbird/31.6.0
MIME-Version: 1.0
To: Mirja Kühlewind <mirja.kuehlewind@tik.ee.ethz.ch>, aqm@ietf.org, Michael Welzl <michawe@ifi.uio.no>
References: <55392BD6.501@tik.ee.ethz.ch>
In-Reply-To: <55392BD6.501@tik.ee.ethz.ch>
Content-Type: text/plain; charset="windows-1252"; format="flowed"
Content-Transfer-Encoding: 8bit
Archived-At: <http://mailarchive.ietf.org/arch/msg/aqm/NoBlrZ-Qosq9GYNGD4clokcNZm8>
Subject: Re: [aqm] Review of draft-ietf-aqm-ecn-benefits-03

Dear Mirja,

Thank you very much for your detailed review! Answers below:

 > Begin forwarded message:
 >
 > Date: 23 April 2015, 19:28:54 CEST
 > From: Mirja Kühlewind <mirja.kuehlewind@tik.ee.ethz.ch>
 > To: <aqm@ietf.org>, Michael Welzl <michawe@ifi.uio.no>, Gorry 
Fairhurst <gorry@erg.abdn.ac.uk>
 > Subject: Review of draft-ietf-aqm-ecn-benefits-03
 >
 > Hi Gorry, hi Michael,
 >
 > as promised here is my review of draft-ietf-aqm-ecn-benefits-03.
 >
 > My overall comment is that even after reading the document (or even 
slightly more than before) I'm not completely sure what the purpose of 
this document is or what audience it is 
directed to. Currently this document seems to do two things: 1. it lists 
benefits (which is interesting for someone who thinks about enabling 
ECN) and 2. it kind of outlines needed steps for deployment (which would 
be directed to someone who gets the task from his manager to turn on 
ECN). However, the second point is not clearly spelled out and therefore 
it might be rather confusing for some people to read the second part of 
the document. Also the second part is to some extent still 
work in progress; therefore I would recommend focusing this document 
only on the first part.
 >
 > For the first part (listing benefits) it might also be good to make 
clear/distinguish who has these benefits. I think all benefits that are 
currently listed are only advantageous for the end host/application. Are 
there any benefits for a network operator? Would it be possible to write 
this document such that I could also use it to point network operators 
to and give them an incentive to enable ECN?

MW/GF: we can only think of one benefit of ECN as currently defined 
(i.e. without basing it on ConEx documents) that obviously targets the 
network operator: making incipient congestion visible (such that it 
could be used e.g. for ConEx). This is addressed in section 3.5. Since 
this is one out of six listed benefits in the table, creating categories 
for end host / application vs. network operator seems unnecessary to us.


 > Another high level comment is that you say in the introduction that 
this document "also identifies some potential problems that might occur 
when ECN is used" but then you don't really discuss them. I think to 
show both sides of the coin in this document would make the document 
more useful (and more honest). One point that you mention slightly here 
is that cheating is easier than with loss by not providing the feedback. 
Another point might be fairness between ECN and non-ECN traffic, as 
marking will not reduce the queue length and therefore might lead to a 
higher loss rate for the non-ECN traffic instead. I guess there are 
papers about this; I don't have any at hand right now. Are there any other 
problems that should be mentioned?

MW/GF: This was discussed, and we agreed to remove the "drawbacks" 
discussion, to align with the original proposed work. So, we will remove 
this sentence from the introduction (it is in fact a left-over that 
should have been removed before). As for fairness, it seems to us that 
the related thread has concluded without a clear result. In the absence 
of evidence or references we prefer to stay away from hand-waving about 
this matter in the document.

 > Find more detailed comment by section below:
 >
 > Abstract
 > --------
 > ...says "...potential benefits when applications enable Explicit 
Congestion Notification (ECN)" -> usually an application cannot enable ECN 
because usually it's a system setting...?

MW/GF: Good catch! We'll rephrase this as "..when enabling".


 > Section 1
 > ---------
 > ... says "..separate
 >   configuration of the drop and mark thresholds is known to be
 >   supported in some network devices and this is recommended
 >   [RFC2309.bis]."
 > RFC2309bis does not recommend different settings, it only says that it 
should be possible to have different configurations of both. Further, I 
think this should not only concern THE threshold (whatever this is) but 
usually there are several parameters you might want to set independently 
of each other, e.g. the max mark/drop probability in RED.

MW/GF: Suggested update:
"While it has often been assumed
that network devices should CE-mark packets at the same level of
congestion at which they would otherwise have dropped them, separate
configuration of the drop and mark conditions is
known to be supported in some network devices and this is recommended
[RFC2309.bis]."
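
To make "separate configuration of the drop and mark conditions" concrete, here is a minimal sketch of a RED-style queue that keeps one parameter set for CE-marking and another for dropping. It is an illustration only and not text for the draft: the names RedProfile and decide, and all threshold values, are hypothetical, and the sketch ignores that real devices also drop ECN-capable packets once the queue overflows.

```python
from dataclasses import dataclass


@dataclass
class RedProfile:
    """One independent set of RED-style parameters (hypothetical names)."""
    min_th: float  # average queue length below which nothing happens
    max_th: float  # average queue length at or above which probability is 1
    max_p: float   # probability reached just below max_th

    def probability(self, avg_queue: float) -> float:
        if avg_queue < self.min_th:
            return 0.0
        if avg_queue >= self.max_th:
            return 1.0
        # Linear ramp between the two thresholds, as in classic RED.
        return self.max_p * (avg_queue - self.min_th) / (self.max_th - self.min_th)


def decide(avg_queue: float, ecn_capable: bool,
           mark: RedProfile, drop: RedProfile, u: float) -> str:
    """Return 'mark', 'drop' or 'forward' for one packet.

    u is a uniform random number in [0, 1) supplied by the caller, so the
    decision is easy to test. ECN-capable packets use the mark profile,
    all other packets use the drop profile.
    """
    if ecn_capable:
        return "mark" if u < mark.probability(avg_queue) else "forward"
    return "drop" if u < drop.probability(avg_queue) else "forward"
```

With a mark profile that starts earlier than the drop profile, for example RedProfile(10, 30, 0.2) for marking and RedProfile(20, 40, 0.1) for dropping, an ECN-capable flow receives congestion signals at an average queue length of 20 while a non-ECN flow is not yet penalised.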

 > Section 2
 > ----------
 > 1) I'm not sure I understand the purpose of this section or maybe 
just the title is wrong. I'm currently seeing this section as a 
section that provides the needed background knowledge rather than one that 
talks about deployment. For this purpose I'd put all references and 
potentially a brief summary of other RFCs/drafts on ECN in this section 
including RFC2884, RFC4774, RFC5562, RFC6040, RFC6679, 
draft-briscoe-tsvwg-ecn-encap-guidelines and draft-ietf-tcpm-accecn-reqs 
(and rename it).

MW/GF: This section lists requirements for deployment. Suggestion: 
rename to "ECN deployment requirements"

 >
 > 2) Second paragraph says:
 > "Network devices must not drop packets solely because these 
codepoints are used [RFC2309.bis]."
 > Not sure this is the right document to say this (because currently 
it does not seem to be directed to network operators/equipment vendors but 
to admins/application developers). However, if it says this, it should also 
say that network devices should not bleach these bits.

MW/GF: suggest: "Network devices must not drop packets solely because 
these codepoints are used or erase these codepoints [RFC2309.bis]."
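
For readers less familiar with the codepoints, the following minimal sketch shows what "erase" (bleaching) means at the bit level. The four values are the ECN field values from RFC 3168, carried in the low two bits of the IPv4 TOS / IPv6 Traffic Class byte; the function names and the returned labels are hypothetical and only serve the illustration.

```python
# ECN field values from RFC 3168 (low two bits of the TOS / Traffic Class byte).
NOT_ECT, ECT_1, ECT_0, CE = 0b00, 0b01, 0b10, 0b11


def ecn_field(tos_byte: int) -> int:
    """Extract the two-bit ECN field from a TOS / Traffic Class byte."""
    return tos_byte & 0b11


def classify_path(sent_tos: int, received_tos: int) -> str:
    """Compare the ECN field a sender set with what the receiver saw."""
    sent, received = ecn_field(sent_tos), ecn_field(received_tos)
    if sent == received:
        return "unchanged"
    if sent != NOT_ECT and received == NOT_ECT:
        # A non-zero codepoint was reset to zero somewhere on the path.
        return "bleached"
    if sent in (ECT_0, ECT_1) and received == CE:
        # Legitimate congestion signal from an ECN-enabled device.
        return "ce-marked"
    return "altered"
```

A measurement tool could apply such a check per packet; only the "ce-marked" and "unchanged" outcomes are consistent with the behaviour the sentence above asks of network devices.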


 > 3) First bullet in list says
 > "A recent survey reported growing support for ECN on common network 
paths [TR15]."
 > This sounds like TR15 shows that ECN is actually used in the 
Internet. However, TR15 only shows that there are only very few cases 
left where ECN packets are dropped or incorrectly altered. Please 
clarify or remove this sentence here.

MW/GF: suggest: "A recent survey reported that incorrect altering of ECN 
bits or consistent dropping of packets carrying the ECN codepoint is 
rare on common network paths [TR15]."


 > 4) You could cite draft-bensley-tcpm-dctcp-00 instead of the DCTCP 
Sigcomm paper (or both).

MW/GF: The paper is a stable reference for now.
But if/when the IETF decides on this, we can add a reference.


 > 5) I would remove the subsection headings (both 2.1 and 2.2) and just 
add the text there to the main part of the section.

MW/GF: OK


 > 6) "An AQM algorithm that supports ECN needs to define
 >   the threshold and algorithm for ECN-marking."
 > This is kind of self-redundant and therefore does not really make 
sense to me to say; of course an algo that supports ECN needs to say 
something about ECN...

MW/GF: We agree, but suggest to keep it nevertheless, it is a hint to 
document authors to not forget that they should specify ECN rather than 
just assuming some default behaviour.


 > 7) You can use TR15 to provide a reference for the first paragraph in 
section 2.2:
 > "Cases have been noted where a sending endpoint marks a packet with a
 >   non-zero ECN mark, but the packet is received with a zero ECN value
 >   by the remote endpoint."

MW/GF: OK, will add the reference there


 > 8) I'd move the second paragraph of section 2.2. ("The current..") to 
a potentially new problems section, talking about known/previous 
deployment problems.

MW/GF: the document does not accentuate problems in this way, as a 
result of prior discussion. We therefore think that this paragraph is ok 
in its current place.


 > 9) I would simply remove paragraph 3-4 of section 2.2 because this 
was basically as already mentioned by referring to 2309bis and rfc6040 
in section 2.1.

MW/GF: we do think these paragraphs add value here: they describe the 
problem in greater detail than the text before, explaining the problem 
here is different - and, we think, better - than just pointing to 
references.


 > Section 3.2
 > ------------
 > 1) Don't understand why there is a listing here...? Just remove the 
listing and make text out of it...?

MW/GF: This is to help identify the entities that need to collaborate.


 > 2) The sentence "This also
 >      avoids the inefficiency of dropping data that has already made it
 >      across at least part of the network path."
 > does not belong in this section. This sentence should just be moved 
to section 3.1 (or to its own section) and must be further explained, 
saying that a packet dropped at the end of the path has already blocked 
resources that other traffic could have used otherwise.

MW/GF: agreed. We will insert it at the beginning of section 3.1.


 > Section 3.3
 > -----------
 > 1) I'd say this section misses one part of the discussion. It is true 
that if by chance your last packet(s) get lost ECN can help. However, 
this section reads a little like, with ECN it is safe to send packet 
bursts. Which is not true because even if ECN is used by a network 
device, the queue might be too small to hold the whole burst. I believe 
this case happens very often, which might be a reason for the higher tail 
loss probability that sometimes is experienced with IW10. Please add 
this point to the discussion.

MW/GF: we agree that we shouldn't say that "with ECN it is ok to send 
packet bursts" - we want to stay away from such general recommendations 
and just state the potential benefit of ECN when it saves the last 
packet of a burst. See our next comment for more:


 > 2) I don't really get the point of the second paragraph. First of all 
it is confusing that this paragraph starts with "In addition to 
avoiding HOL blocking,.."; I guess that is left over from a previous 
version of this text...? And then you talk about a connection that is 
currently idle, so why is the performance of this connection that is 
currently not sending anything reduced?

MW/GF: indeed it seems that this paragraph has been mangled during 
updates. To address your item 1 and 2, we suggest the following replacement:

***
"While using ECN can never guarantee loss prevention, and thus losses
at the end of a burst can occur with or without ECN, using ECN can increase
the chance for that last packet to be ECN-marked instead of dropped.
This can allow the transport to avoid the consequent loss of state about
the network path it is using, which would have arisen had there been a
retransmission timeout.  Typical impacts of a transport timeout are to reset
path estimates such as the RTT, the congestion window, and possibly other
transport state that can reduce the performance of the transport
until it again adapts to the path."
***


 > 3) I don't understand what "applications that send intermittent 
bursts of data, and rely upon timer-based recovery of packet loss" 
are...? Isn't the transport responsible to not send bursts and care 
about recovery...?

MW/GF: MPEG-DASH traffic for instance, in particular when used over 
non-paced TCP. UDP-based applications too.


 > 4) For the last paragraph in section 3.3 note that stacks often 
remember RTT measurements for a certain IP address and set the initial 
RTO based on this information.

MW/GF: suggestion: replace:
***
because in this
    case TCP cannot base the timeout period on prior RTT measurements
    from the same connection.
***
with:
***
because in this
    case TCP may not be able to base the timeout period on prior RTT 
measurements.
***


 > Section 3.4
 > -----------
 > You still need FEC or some kind of error concealment even if ECN is 
used because you can never be sure that your packets do not get dropped 
(by non-ECN-enabled devices or for other reasons). Therefore using ECN will 
clearly not reduce complexity. The only thing you can do is to 
potentially reduce the amount of redundancy you send if you know that a 
certain path is ECN-enabled or don't see losses at the beginning of a 
connection. This can save network resources but actually might not 
improve user experience; in fact the user experience might be worse in 
case there are sudden losses.

MW/GF: suggestion: remove "add complexity and"


 > Further the text says "negative impact of using loss-hiding 
mechanisms"; I don't really think that FEC has a negative impact as long 
as you've sent enough redundancy...? Error concealment might, but it is used 
less and less. I'd recommend talking about error concealment only in 
this last paragraph and explaining a little further.

MW/GF: error concealment is different from FEC, and it is only mentioned 
in this last paragraph. We suggest to replace "Because this
    reduces the negative impact of using loss-hiding mechanisms,"   with 
"Because this can reduce the potential negative impact that some 
loss-hiding mechanisms can have,"


 > Section 3.5
 > ----------
 > "Recording the presence of CE-marked packets can therefore provide
 >   information about the performance of the network path."
 > Would change to:
 > "Recording the presence of CE-marked packets in absence of loss can 
therefore provide
 >   information about the performance of the network path."

MW/GF: ok


 > And also say more concretely what is meant with 'performance of the 
network path' -> congestion level or no drops by other middleboxes on 
this path...

MW/GF: This was intentionally kept vague, but we'd welcome a 
concrete recommendation by a ConEx expert (indeed "or .. or ..." is 
the problem, there are several possibilities here)


 > Section 3.6
 > -----------
 > 1) I like the section but I would phrase it differently; also it's 
not clear who needs to support what in this case. I'd like to propose 
the following text [not sure about the heading...]:
 >
 > "3.6 Opportunity to provide an improved congestion feedback signal
 >
 > Loss and ECN marking are both used as an indication for congestion. 
However, while the amount of feedback that is provided by loss should 
naturally be minimized, this is not the case for ECN. With ECN a network 
node could provide richer and more frequent feedback on the congestion 
state of a link which then could be used by the control mechanisms 
implemented in the end host to make a more appropriate decision on how to 
react to congestion and to react faster to changes in congestion state.

MW/GF: ok to add this up to here.


 > Further, while drop-based AQM mechanisms usually operate on a smoothed 
queue length estimation (instead of the instantaneous queue length) and 
therefore slightly delay the feedback signal to avoid unnecessary losses 
in case of transient congestion, this would not be necessary for ECN. If 
congestion is only transient due to short traffic bursts that are active 
for less than one RTT, the congestion signal would reach the sender at a 
time when the congestion has already cleared up. However, instead of 
delaying the feedback in the network, the end host could reduce its 
sending rate incrementally based on the extent of congestion (that was 
experienced over e.g. the last RTT), similar to DCTCP. In case the 
congestion is only transient, the end host would only reduce its rate 
slightly and be able to catch up quickly again. However, in case the 
congestion is persistent, this would help to remove additional delays 
from the network and resolve congestion faster, which after all reduces 
the average queuing delay.
 >
 > However, current ECN is defined as a 'drop equivalent' in RFC3168. To 
change the semantics of ECN both the AQM in the network nodes and the 
control mechanism in the end hosts would still need to cope with nodes 
or end hosts that rely on the old semantics. Therefore changing the 
semantics can be done more easily in a confined environment such as a data 
center. DCTCP is an example that changes both the configuration of the 
used AQM as well as the congestion response in the end host and relies 
on the fact that all nodes in the data center are configured the same way. 
[Deployment strategies to change the semantics of ECN in the Internet 
are currently under discussion in the IETF.]"

MW/GF: We think that this goes a bit too far in the direction of hinting 
about implementation and research possibilities that we don't have 
citable proof about (besides: we already refer to DCTCP twice in the 
document, and the 'drop equivalent' semantics are not a MUST in RFC3168).
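
For reference only, the incremental reduction "similar to DCTCP" that the proposed text mentions can be sketched as below. The update rule follows the published DCTCP design (a moving average of the fraction of CE-marked packets, then a window cut proportional to it); the function names, the gain g = 1/16 and the minimum window are assumptions chosen for illustration, not text for the draft.

```python
def update_alpha(alpha: float, marked: int, total: int, g: float = 1 / 16) -> float:
    """Update the running estimate of the fraction of CE-marked packets.

    marked / total is the fraction observed over roughly the last RTT;
    g is the smoothing gain (assumed value).
    """
    fraction = marked / total if total else 0.0
    return (1 - g) * alpha + g * fraction


def reduce_cwnd(cwnd: float, alpha: float, min_cwnd: float = 2.0) -> float:
    """Cut the congestion window in proportion to the congestion extent.

    alpha = 1 gives the classic halving; a small alpha gives a small cut.
    """
    return max(min_cwnd, cwnd * (1 - alpha / 2))
```

With transient congestion alpha stays small, so a window of 100 packets shrinks only to about 97 when alpha is 0.0625, whereas persistent marking drives alpha towards 1 and the reduction towards a halving.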


 > 2) I'd move the 1st and 2nd paragraphs of section 3.6.1 to the 
background/deployment section or to the intro, depending on what you are 
going to do with section 2.

MW/GF: since we intend section 2 to be about deployment requirements 
only, we don't think this fits and would rather leave these paragraphs 
in section 3.6.1.


 > Sections 4 & 5
 > ---------
 > First sentence talks about "operational
 >   difficulties when the network only partially supports the use of ECN,
 >   or to respond to the challenges due to misbehaving network devices
 >   and/or endpoints".
 > I think these are two very different things. Misbehaving network 
devices is a point for a problems section (where the lesson learned is 
that we didn't think carefully enough about incremental deployment in 
the first place but do now). However, partial deployment is not a 
problem but is a thing we simply have to cope with. The text sounds as if 
the goal would be that every router in the whole Internet would at some 
point in time be ECN-enabled. I don't think this will ever happen and it is 
also not the goal for me. Routers that are very unlikely to ever get 
congested should not be required to look at the ECN bits or monitor the 
queue length to calculate a mark/drop probability.

MW/GF: we agree, and suggest to replace this sentence with "Early 
deployment of ECN encountered a number of operational
    difficulties due to misbehaving network devices
    and/or endpoints."


 > However, as I said at the beginning, I don't really think that sections 
4 and 5 belong in this document. If you decide to keep them (you would have 
to change the abstract), I'd recommend renaming them, e.g. 4. 
'Incremental Deployment Strategy' or 'Requirements to enable Incremental 
Deployment' and 5. 'Recommendations for enabling ECN in network nodes 
and end hosts'.

MW/GF: we suggest stating in the abstract that we discuss deployment, 
and renaming these sections to 4.: "Incremental Deployment" and 
5.: "Recommendations for enabling ECN"


 > I hope that's helpful! Let me know if you have any questions!
 >
 > Mirja
 >
 >


Thank you very much,

Michael & Gorry