Re: [aqm] Gen-art LC review of draft-ietf-aqm-recommendation-08

"Fred Baker (fred)" <fred@cisco.com> Thu, 08 January 2015 01:20 UTC

From: "Fred Baker (fred)" <fred@cisco.com>
To: Elwyn Davies <elwynd@dial.pipex.com>
Date: Thu, 08 Jan 2015 01:20:29 +0000
Message-ID: <704AB199-DA52-4B23-BB9A-5049B03763E0@cisco.com>
References: <54947DCF.3030601@scss.tcd.ie> <40842d620667e7d2a33f451dcd8f502b.squirrel@spey.erg.abdn.ac.uk> <30819CFE-21D3-4EF8-ABFE-4C01940399B7@cisco.com> <54ADC3F5.3040706@dial.pipex.com>
In-Reply-To: <54ADC3F5.3040706@dial.pipex.com>
Archived-At: http://mailarchive.ietf.org/arch/msg/aqm/LiOdkBObs390SeIUmsEBYcrBnOE
Cc: "gorry@erg.abdn.ac.uk (erg)" <gorry@erg.abdn.ac.uk>, "draft-ietf-aqm-recommendation.all@tools.ietf.org" <draft-ietf-aqm-recommendation.all@tools.ietf.org>, General area reviewing team <gen-art@ietf.org>, "aqm@ietf.org" <aqm@ietf.org>
Subject: Re: [aqm] Gen-art LC review of draft-ietf-aqm-recommendation-08

> On Jan 7, 2015, at 3:40 PM, Elwyn Davies <elwynd@dial.pipex.com> wrote:
> 
> (Copied to aqm mailing list as suggested by WG chair).
> Hi.
> 
> Thanks for your responses.  Just a reminder... I am not (these days, anyway) an expert in router queue management, so my comments should not be seen as a deep critique of the individual items, but as things that come to mind as matters of general control engineering, and as areas where I feel the language needs clarification - that's what gen-art is for.
> 
> As a matter of interest, it might be useful to explain a bit what scale of routing engine you are thinking about in this paper.  This is because I got a feeling from your responses to the buffer bloat question that you are primarily thinking about big iron here.  The buffer bloat phenomenon has tended to show up in smaller boxes, where the AQM stuff may or may not be applicable.

If Dave Taht replies, he’ll mention OpenWRT. Personally, I am thinking about ISP interconnect points and access equipment such as BRAS and CMTS, which are indeed big iron, but also about small CPEs such as those that might implement Homenet technologies or run OpenWRT.

> I don't quite know what your target is here - or if you are thinking over the whole range of sizes.  The responses below clearly indicate that you have some examples in mind (Codel, for example, which I know nothing about except (now) that it is an AQM WG product) and I don't know what scale of equipment these are really relevant to.
> 
> Some more responses in line.
> 
> Regards,
> Elwyn
> 
> On 05/01/15 20:32, Fred Baker (fred) wrote:
>> 
>>> On Jan 5, 2015, at 1:13 AM, gorry@erg.abdn.ac.uk wrote:
>>> 
>>> Fred, I've applied the minor edits.
>>> 
>>> I have questions to you on the comments below (see GF:) before I
>>> proceed.
>>> 
>>> Gorry
>> 
>> Adding Elwyn, as the discussion of his comments should include him -
>> he might be able to clarify his concerns. I started last night to
>> write a note, which I will now discard and instead comment here.
>> 
>>>> I am the assigned Gen-ART reviewer for this draft. For background
>>>> on Gen-ART, please see the FAQ at
>>>> 
>>>> <http://wiki.tools.ietf.org/area/gen/trac/wiki/GenArtfaq>.
>>>> 
>>>> Please resolve these comments along with any other Last Call
>>>> comments you may receive.
>>>> 
>>>> Document: draft-ietf-aqm-recommendation-08.txt
>>>> Reviewer: Elwyn Davies
>>>> Review Date: 2014/12/19
>>>> IETF LC End Date: 2014/12/24
>>>> IESG Telechat date: (if known) -
>>>> 
>>>> Summary:  Almost ready for BCP.
>>>> 
>>>> Possibly missing issues:
>>>> 
>>>> Buffer bloat:  The suggestions/discussions are pretty much all
>>>> about keeping buffer size sufficiently large to avoid burst
>>>> dropping.  It seems to me that it might be good to mention the
>>>> possibility that one can over-provision queues, and this needs to
>>>> be avoided as well as under-provisioning.
>>>> 
>>> GF: I am not sure - to me this depends on the use case.
>> 
>> To me, this is lily-gilding. To pick one example, the Cisco ASR 8X10G
>> line card comes standard from the factory with 200 ms of queue per
>> 10G interface. If we were to implement Codel on it, Codel would try
>> desperately to keep the average induced latency less than five ms. If
>> it tried to make it be 100 microseconds, we would run into the issues
>> the draft talks about - we're trying to maximize rate while
>> minimizing mean latency, and due to TCP's dynamics, we would no
>> longer maximize rate. If 5 ms is a reasonable number (and for
>> intra-continental terrestrial delays I would think it is), and we set
>> that variable to 10, 50, or 100 ms, the only harm would be that we
>> had some probability of a higher mean induced latency than was really
>> necessary - AQM would be a little less effective. In the worst case
>> (suppose we set Codel's limit to 200 ms), it would revert to tail
>> drop, which is what we already have.
>> 
>> There are two reasonable responses to this. One would be to note that
>> in high-RTT cases, even if auto-tuning mostly works, manual tuning may
>> deliver better results or converge more quickly (on a
>> 650 ms RTT satcom link, I'd start by changing Codel's 100 ms trigger
>> to something in the neighborhood of 650 ms). The other is to simply
>> say that there is no direct harm in increasing the limits, and there
>> may be value in some use cases. But I would also tend to think that
>> anyone who actually operates a network already has a pretty good
>> handle on that fact. So I don't see the value in saying it - which is
>> mostly why it's not there already.

> My take on this would be "make as few assumptions about your audience as possible, and write them down".  It's a generally interesting topic and would interest people who are not deeply skilled in the art - as well as potentially pulling in some new researchers!

I’m still not entirely sure what you’d like to have said. Is this a one-sentence “setting limits looser than necessary is neither harmful nor helpful”, or a treatise on the topic?

>>>> Interaction between boxes using different or the same algorithms:
>>>> Buffer bloat seems to be generally about situations where chains
>>>> of boxes all have too much buffer.  One thing that is not
>>>> currently mentioned is the possibility that if different AQM
>>>> schemes are implemented in various boxes through which a flow
>>>> passes, then there could be inappropriate interaction between the
>>>> different algorithms.  The old RFC suggested RED and nothing else
>>>> so that one just had to make sure multiple RED boxes in
>>>> series didn't do anything bad.  With potentially different
>>>> algorithms in series, one had better be sure that the mechanisms
>>>> don't interact in a bad way when chained together - another
>>>> research topic, I think.
>>> 
>>> GF: I think this could be added as an area for continued research
>>> mentioned in section 4.7. At least I know of some poor
>>> interactions between PIE and CoDel on particular paths - where both
>>> algorithms are triggered. However, I doubt if this is worth much
>>> discussion in this document? thoughts?
>>> 
>>> Suggest: "The Internet presents a wide variety of paths where
>>> traffic can experience combinations of mechanisms that can
>>> potentially interact to influence the performance of applications.
>>> Research therefore needs to consider the interactions between
>>> different AQM algorithms, patterns of interaction in network
>>> traffic and other network mechanisms to ensure that multiple
>>> mechanisms do not inadvertently interact to impact performance."
>> 
>> Mentioning it as a possible research area makes sense. Your proposed
>> text is fine, from my perspective.
>> 
> Yes. I think something like this would be good.  The buffer bloat example is probably an extreme case of things not having AQM at all and interacting badly.  It would maybe be worth mentioning that any AQM mechanism also has to work in series with boxes that don't have any active AQM - just tail drop.  Ultimately, I would say this is just a matter of control engineering principles: you are potentially making a network in which various control algorithms are implemented on different legs/nodes, and the combination of transfer functions could possibly be unstable.  Has anybody applied any of the raft of control-theoretic methods to these algorithms?  I have no idea!

Well, PIE basically came out of control theory (a basic equation that describes a phase-locked loop), and I believe that Van and Kathy will say something similar about Codel. But that’s not a question for this paper, it’s a question for the various algorithmic papers.
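
For the curious, the heart of a PIE-style update is a classic PI controller on queueing delay. A toy sketch in Python - the constants roughly track the published defaults, but treat this as an illustration, not the algorithm's specification:

    # Illustrative PI controller of the kind at the heart of PIE.
    ALPHA = 0.125   # gain on the error from the target delay
    BETA = 1.25     # gain on the change in delay since the last update
    TARGET = 0.015  # target queueing delay, in seconds

    drop_prob = 0.0
    prev_qdelay = 0.0

    def update(qdelay):
        """Periodically recompute the drop probability from queue delay."""
        global drop_prob, prev_qdelay
        drop_prob += ALPHA * (qdelay - TARGET) + BETA * (qdelay - prev_qdelay)
        drop_prob = min(max(drop_prob, 0.0), 1.0)  # clamp to [0, 1]
        prev_qdelay = qdelay
        return drop_prob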

>> I start by questioning the underlying assumption, though, which is
>> that bufferbloat is about paths in which there are multiple
>> simultaneous bottlenecks. Yes, that occurs (think about paths that
>> include both Cogent and a busy BRAS or CMTS, or more generally, if
>> any link has some probability of congesting, then, as any sophomore
>> statistics course teaches, the probability of any pair of links being
>> simultaneously congested is the product of the two individual
>> probabilities), but I'd
>> be hard-pressed to make a statistically compelling argument out of
>> it. The research and practice I have seen has been about a single
>> bottleneck.
> Please don't fixate on buffer bloat!

?

AQM is absolutely fixated on buffer bloat. We have called it a lot of things over the years, none of them very pleasant, but the fundamental issue in RFC 2309 and V+K’s RED work was maximizing throughput while minimizing queue occupancy.

>>>> Minor issues: s3, para after end of bullet 3:
>>>>> The projected increase in the fraction of total Internet
>>>>> traffic for more aggressive flows in classes 2 and 3 could pose
>>>>> a threat to the performance of the future Internet.  There is
>>>>> therefore an urgent need for measurements of current conditions
>>>>> and for further research into the ways of managing such flows.
>>>>> This raises many difficult issues in finding methods with an
>>>>> acceptable overhead cost that can identify and isolate
>>>>> unresponsive flows or flows that are less responsive than TCP.
>>>> 
>>>> Question: Is there actually any published research into how one
>>>> would identify class 2 or class 3 traffic in a router/middle box?
>>>> If so it would be worth noting - the text call for "further
>>>> research" seems to indicate there is something out there.
>>>> 
>>> GF: I think the text is OK.
>> 
>> Agreed. Elwyn's objection appears to be to the use of the word
>> "further"; if we don't know of a paper, he'd like us to call for
>> "research". The papers that come quickly to my mind are various
>> papers on non-responsive flows, such as
>> http://www.icir.org/floyd/papers/collapse.may99.pdf or
>> http://www2.research.att.com/~jiawang/sstp08-camera/SSTP08_Pan.pdf.
>> We already have a pretty extensive bibliography...
> 
> Right - either remove/alter "further" if there isn't anything already out there, or put in some reference(s).

OK, Gorry. You have two papers there to refer to. There are more, but two should cover it.

>>>> s4.2, next to last para: Is it worth saying also that the
>>>> randomness should avoid targeting a single flow within a
>>>> reasonable period to give a degree of fairness.

The text from the spec is:

     Network devices SHOULD use an AQM algorithm to determine the packets
     that are marked or discarded due to congestion.  Procedures for
     dropping or marking packets within the network need to avoid
     increasing synchronization events, and hence randomness SHOULD be
     introduced in the algorithms that generate these congestion signals
     to the endpoints.

>>> GF: Thoughts?
>> 
>> I worry. The reasons for the randomness are (1) to tend to hit
>> different sessions, and (2) when the same session is hit, to minimize
>> the probability of multiple hits in the same RTT. It might be worth
>> saying as much. However, to *stipulate* that algorithms should limit
>> the hit rate on a given flow invites a discussion of stateful
>> inspection algorithms. If someone wants to do such a thing, I'm not
>> going to try to stop them (you could describe fq_* in those terms),
>> but I don't want to put the idea into their heads (see later comment
>> on privacy). Also, that is frankly more of a concern with Reno than
>> with NewReno, and with NewReno than with anything that uses SACK.
>> SACK will (usually) retransmit all dropped segments in the subsequent
>> RTT, while NewReno will retransmit the Nth dropped packet in the Nth
>> following RTT, and Reno might take that many RTO timeouts.
> 
> You have thought about what I said.  Put in what you think it needs.

Well, I didn’t think it needed a lot of saying. :-) How about this?

Network devices SHOULD use an AQM algorithm to measure local congestion and determine which packets to mark or drop to manage congestion. In general, dropping or marking multiple packets from the same session in the same RTT is ineffective and can have negative consequences. Also, dropping or marking packets from multiple sessions simultaneously has the effect of synchronizing them, meaning that subsequent peaks and troughs in traffic load are exacerbated. Hence, AQM algorithms should randomize dropping and marking in time, to desynchronize sessions and improve overall algorithmic effectiveness.
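
To make the randomization concrete, here is a toy per-packet decision in Python (my illustration, not proposed draft text); the point is only that the signal is randomized in time rather than applied deterministically to particular packets or flows:

    import random

    def congestion_signal(drop_prob, ecn_capable):
        """Randomized per-packet congestion signal: mark if the packet
        is ECN-capable, drop otherwise. Randomizing in time tends to
        desynchronize sessions, so signals do not repeatedly land on
        the same flow in the same RTT."""
        if random.random() < drop_prob:
            return "mark" if ecn_capable else "drop"
        return "forward"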

>>>> s4.2.1, next to last para:
>>>>> An AQM algorithm that supports ECN needs to define the
>>>>> threshold and algorithm for ECN-marking.  This threshold MAY
>>>>> differ from that used for dropping packets that are not marked
>>>>> as ECN-capable, and SHOULD be configurable.
>>>>> 
>>>> Is this suggestion really compatible with recommendation 3 and
>>>> s4.3 (no tuning)?
>>>> 
>>> GF: I think making a recommendation here is beyond the "BCP"
>>> experience, although I suspect that a lower marking threshold is
>>> generally good. Should we add it also to the research agenda as an
>>> item at the end of para 3 in S4.7.?
> 
> I think you may have misunderstood what I am saying here.  Rec 3 and s4.3 say things should work without tuning.  Doesn't having to set these thresholds/algorithms constitute tuning?  If so, then it makes it difficult to see these ECN schemes as meeting the constraints.  If you disagree, then explain how it isn't - or suggest that there should be research to see how to make ECN zero-config as well.

Well, I think you may have misunderstood the statement in the draft. The big problem with RED, as recommended in RFC 2309, was that it couldn’t be deployed without case-by-case tuning. We’re trying to minimize that.

Recommendation 3 is:

   3.  The algorithms that the IETF recommends SHOULD NOT require
       operational (especially manual) configuration or tuning.

The title of section 4.3, which is the section explaining recommendation 3, is "AQM algorithms deployed SHOULD NOT require operational tuning". That's not "MUST NOT"; if there is a case, there is a case. It goes on to say:

   o  SHOULD NOT require tuning of initial or configuration parameters.
      An algorithm needs to provide a default behaviour that auto-tunes
      to a reasonable performance for typical network operational
      conditions.  This is expected to ease deployment and operation.
      Initial conditions, such as the interface rate and MTU size or
      other values derived from these, MAY be required by an AQM
      algorithm.

   o  MAY support further manual tuning that could improve performance
      in a specific deployed network. 

We’re looking for “reasonable performance in the general case”, and allowing for the use of knobs to adjust that in edge cases.

Let me give you a case in point. Codel and PIE both work quite well without tuning in intra-continental applications, which is to say use cases in which RTT is on the order of tens of milliseconds. The larger the RTT is, the longer it takes to adjust, but it gets there.

One problem we specifically saw with both algorithms as delays got larger - especially satcom - is most easily explained using Codel. Codel starts a season of marking/dropping when an interface has been continually transmitting for 100 ms, meaning that the queue never emptied in 100 ms. At that point, any packet that has been delayed for more than 5 ms has some probability of being marked or dropped. TCP, in slow start, works pretty hard to keep the buffer full - as full as it can make it. So now I’m moving a large file (iperf, but you get the idea), and manage to fill the queue with 50 ms of data and hold it there for 50 ms, with the probable effect of dropping a packet near the tail of that burst.

Now, assume that I am on a geosynchronous satellite, so one-way delay is on the order of 325 ms. Codel, dropping a packet a third of the way into getting that transfer going, had a horrible time even filling the link. PIE has the same issue, but the underlying mechanism is different. If I’m on that kind of link, I’m going to either turn AQM off or look for a way to tune it.
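
The back-of-envelope arithmetic, using Codel's published defaults and the round-trip numbers above (illustrative only; a real implementation's behavior is more subtle):

    # Why Codel's defaults struggle on a satcom path: with the default
    # 100 ms interval, a terrestrial sender gets feedback within a few
    # intervals, while a geosynchronous-satellite sender cannot even
    # complete one RTT per interval.
    TARGET = 0.005    # 5 ms: acceptable standing queue delay
    INTERVAL = 0.100  # 100 ms: window in which the minimum delay must
                      # dip below TARGET before dropping begins

    for name, rtt in [("terrestrial", 0.050), ("satcom", 0.650)]:
        print(f"{name}: {INTERVAL / rtt:.2f} RTTs per Codel interval")
    # A plausible manual tuning on the satcom path is an interval on
    # the order of the 650 ms RTT, as suggested earlier in the thread.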

>> I can see adding it to the research agenda; the comment comes from
>> Bob Briscoe's research.
>> 
>> That said, any algorithm using any mechanism by definition needs to
>> specify any variables it uses - Codel, for example, tries to keep a
>> queue at 5 ms or less, and cuts in after a queue fails to empty for a
>> period of 100 ms. I don't see a good argument for saying "but an
>> ECN-based algorithm doesn't need to define its thresholds or
>> algorithms". Also, as I recall, the MAY in the text came from the
>> fact that Bob seemed to think there was value in it (which BTW I
>> agree with). To my mind, SHOULD and MUST are strong words, but absent
>> such an assertion, an implementation MAY do just about anything that
>> comes to the implementor's mind. So saying an implementation MAY <do
>> something> is mostly a suggestion that an implementor SHOULD think
>> about it. Are we to say that an implementor, given Bob's research,
>> should NOT think about giving folks the option?
>> 
>> I also don't think Elwyn's argument quite follows. When I say that an
>> algorithm should auto-tune, I'm not saying that it should not have
>> knobs; I'm saying that the default values of those knobs should be
>> adequate for the vast majority of use cases. I'm also not saying that
>> there should be exactly one initial default; I could easily imagine
>> an implementation noting the bit rate of an interface and the ping
>> RTT to a peer and pulling its initial configuration out of a table.

> That would be at least partially acceptable as a mode of operation.  But you might have a "warm-up" issue - would it work OK while the algorithm was working out what the RTT actually was?  And would the algorithms adapt autonomously (i.e., auto-tune) to close in on optimum values after picking initial values from the table?

Again, we have quite a bit of testing experience there, and the places the “warm-up” issue comes in are also the cases where RTT is large. For reasons 100% unrelated to this (it was a comment on ISOC’s so-called Measurement Project), yesterday I sent 10 pings, using IPv4 and IPv6 if possible, to the top 1000 sites in the Alexa list. Why 301 of them didn’t respond, I don’t know; I suspect it has something to do with the name in the Alexa list, like dont-go-here.com. But:

699 reachable by IPv4
median average RTT(IPv4): 0.064000
average loss (IPv4): 0.486409
average minimum RTT(IPv4): 111.923004
average RTT(IPv4): 116.361664
average maximum RTT(IPv4): 124.621014
average standard deviation(IPv4): 4.043735

118 reachable by IPv6
median average RTT(IPv6): 28.748000
average loss (IPv6): 0.000000
average minimum RTT(IPv6): 67.267008
average RTT(IPv6): 69.580873
average maximum RTT(IPv6): 74.213856
average standard deviation(IPv6): 2.133907

On a 100 ms timescale, the algorithms we’re discussing get themselves sorted out.
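
For what it's worth, the summary lines above are a few lines of Python once you have per-site results; the collection step (10 pings per site) is omitted here and the tuple layout is hypothetical:

    import statistics

    # Each entry: (loss_fraction, min_rtt, avg_rtt, max_rtt, stddev),
    # RTTs in milliseconds, one tuple per reachable site.
    def summarize(sites):
        avgs = [s[2] for s in sites]
        print(f"{len(sites)} reachable")
        print(f"median average RTT: {statistics.median(avgs):.6f}")
        print(f"average loss: {statistics.mean(s[0] for s in sites):.6f}")
        print(f"average minimum RTT: {statistics.mean(s[1] for s in sites):.6f}")
        print(f"average RTT: {statistics.mean(avgs):.6f}")
        print(f"average maximum RTT: {statistics.mean(s[3] for s in sites):.6f}")
        print(f"average standard deviation: {statistics.mean(s[4] for s in sites):.6f}")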

>>>> s7:  There is an arguable privacy concern that if schemes are
>>>> able to identify class 2 or class 3 flows, then a core device can
>>>> extract privacy related info from the identified flows.
>>>> 
>>> GF: I don't see how traffic profiles expose privacy concerns, sure
>>> users and apps can be characterised by patterns of interaction -
>>> but this isn't what is being talked about here.
>> 
>> Agreed. If the reference is to RFC 6973, I don't see a violation of
>> https://tools.ietf.org/html/rfc6973#section-7. I would if we appeared
>> to be inviting stateful inspection algorithms. To give an example of
>> how difficult sessions are managed, RFC 6057 uses the CTS message in
>> round-robin fashion to push back on top-talker users in order to
>> enable the service provider to give consistent service to all of his
>> subscribers when a few are behaving in a manner that might prevent
>> him from doing so. Note that the "session", in that case, is not a
>> single TCP session, but a bittorrent-or-whatever server engaged in
>> sessions to tens or hundreds of peers. The fact that a few users
>> receive some pushback doesn't reveal the identities of those users.
>> I'd need to hear the substance behind Elwyn's concern before I could
>> write anything.
> 
> My reaction was that if your algorithm identifies flows then you have
> [big chunk of text that I think was there by mistake dropped out]
> potentially helped a bad actor to pick off such flows or get to know who is communicating in a situation where currently it would be very difficult to know, as the queueing is basically flow agnostic.  OK, this is fairly far out, but we have seen some pretty serious stuff apparently being done around core routers, according to Snowden et al.

Again, the algorithms I’m aware of are not sensitive to the user or his/her address; they are sensitive to his/her behavior. But again, unless we’re talking about fq_*, we’re not identifying flows, and with fq_* we’re only identifying them in the sense of WFQ. If, to pick something out of the air, PIE is dropping every 100th packet with some probability, and the traffic has 1000 “mouse” flows competing with a single “elephant”, I have a 1000:1 probability of hitting the elephant. That’s not because I don’t like the user or have singled him out in some way; it’s because he’s sending the vast majority of the traffic. So I don’t see the issue you’re driving at.
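
A toy simulation of that intuition, with invented packet counts:

    import random

    # Flow-agnostic random dropping hits flows in proportion to their
    # share of the traffic, not their identity: one elephant sending
    # almost all the packets absorbs almost all the drops.
    mice = [f"mouse-{i}" for i in range(1000)]
    packets = ["elephant"] * 1_000_000 + [m for m in mice for _ in range(10)]

    drops = [random.choice(packets) for _ in range(10_000)]
    share = drops.count("elephant") / len(drops)
    print(f"fraction of drops landing on the elephant: {share:.3f}")  # ~0.99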

>>> s4.7, para 3:
>>>> the use of Map/Reduce applications in data centers
>>>> I think this needs a reference or a brief explanation.
>>> GF: Fred do you know a reference or can suggest extra text?
>> 
>> The concern has to do with incast, which is a pretty active research
>> area (http://lmgtfy.com/?q=research+incast). The paragraph asks a
>> question, which is whether the common taxonomy of network flows (mice
>> vs elephants) needs to be extended to include references to herds of
>> mice traveling together, with the result that congestion control
>> algorithms designed under the assumption that a heavy data flow
>> contains an elephant merely introduce head-of-line blocking in short
>> flows. The word "lemmings" is mine.
>> 
>> I know of at least four papers (Microsoft Research, CAIA, Tsinghua,
>> and KAIST) submitted to various journals in 2014 on the topic. It's
>> also, at least in part, the basis for the DCLC RG. The only ones we
>> could reference, among those, would relate to DCTCP, as the rest have
>> not yet been published.
>> 
>> Again, I'd like to understand the underlying issue. I doubt that it
>> is that Elwyn doesn't like the question as such. Is it that he's
>> looking for the word “incast” to replace "map/reduce"?
> 
> I was just looking for somebody to define the jargon - as far as I am concerned, at this moment "incast" would be just as "bad", since it would produce an equally blank stare followed by a grab for Google.

If a researcher had that reaction, his or her research wasn’t particularly relevant to data center operations. People working with Hadoop or other such applications are familiar with this in detail.

Imagine that I walked into Times Square on the evening of 31 December, and using a bull-horn asked everyone present to simultaneously lift their bull-horns and tell me how they were feeling. Imagine that they did. You now know what incast is and why it’s a problem. That’s what map/reduce applications do - they simultaneously open or use sessions to thousands of neighboring computers, expecting a reply from each.
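
The arithmetic behind the problem is simple; with made-up but plausible numbers:

    # Incast in one buffer: many synchronized responses converge on a
    # single shallow output-port buffer. All numbers are illustrative.
    servers = 1000                   # workers answering one request
    reply_bytes = 64 * 1024          # response size per worker
    buffer_bytes = 4 * 1024 * 1024   # shared output-port buffer

    offered = servers * reply_bytes
    print(f"offered burst: {offered / 1e6:.1f} MB into a "
          f"{buffer_bytes / 1e6:.1f} MB buffer")
    print(f"excess lost as the burst arrives: "
          f"{max(0, offered - buffer_bytes) / 1e6:.1f} MB")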

But fine. You mentioned Google, so I asked Google about “data center incast”. I first asked just about “incast”. Merriam-Webster told me that it was in their unabridged dictionary but not the free dictionary. But asking about data center incast, I got pages and pages of papers on the topic. 

Gorry, one that might be worth pointing to would be http://www.academia.edu/2160335/A_Survey_on_TCP_Incast_in_Data_Center_Networks

>>> --- The edits below have been incorporated in the XML for v-09 ---
>>>> Nits/editorial comments: General: s/e.g./e.g.,/, s/i.e./i.e.,/
>>>> 
>>>> s1.2, para 2(?) - top of p4: s/and often necessary/and is often
>>>> necessary/
>>>> 
>>>> s1.2, para 3: s/a class of technologies that/a class of
>>>> technologies that/
>>>> 
>>>> s2, first bullet 3: s/Large burst of packets/Large bursts of
>>>> packets/
>>>> 
>>>> s2, last para: Probably need to expand POP, IMAP and RDP; maybe
>>>> provide refs??
>>>> 
>>>> s2.1, last para: s/open a large numbers of short TCP flows/may
>>>> open a large number of short duration TCP flows/
>>>> 
>>>> s4, last para: s/experience occasional issues that need
>>>> moderation./can experience occasional issues that warrant
>>>> mitigation./
>>>> 
>>>> s4.2, para 6, last sentence: s/similarly react/react similarly/
>>>> 
>>>> s4.2.1, para 1: s/using AQM to decider when/using AQM to decide
>>>> when/
>>>> 
>>>> s4.7, para 3:
>>>>> In 2013,
>>>> "At the time of writing" ?
>>>> 
>>>> s4.7, para 3:
>>>>> the use of Map/Reduce applications in data centers
>>>> I think this needs a reference or a brief explanation.