[tcpm] Review of draft-bensley-tcpm-dctcp-03

Mirja Kühlewind <mirja.kuehlewind@tik.ee.ethz.ch> Thu, 18 June 2015 13:31 UTC

From: Mirja Kühlewind <mirja.kuehlewind@tik.ee.ethz.ch>
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
Date: Thu, 18 Jun 2015 15:31:33 +0200
Message-Id: <031D67EB-0676-4398-B5DF-E2F73D174E58@tik.ee.ethz.ch>
To: "Eggert, Lars" <lars@netapp.com>, Dave Thaler <dthaler@microsoft.com>, sbens@microsoft.com
Mime-Version: 1.0 (Mac OS X Mail 8.2 \(2098\))
Archived-At: <http://mailarchive.ietf.org/arch/msg/tcpm/mOULTLPfWYg5fXqs8bHiAIYCSU0>
Cc: tcpm@ietf.org
Subject: [tcpm] Review of draft-bensley-tcpm-dctcp-03
Precedence: list

Hi all,

As I think draft-bensley-tcpm-dctcp should be a tcpm working group document, and because I’m interested in DCTCP, I’ve reviewed this document. This draft documents DCTCP as implemented by Microsoft and is very well written and easy to understand; I therefore think that, if adopted by tcpm, it can be processed quickly.

I have two comments that I would like to see addressed, and a few smaller comments that might further improve the document and may or may not be addressed:

1) I would like to see the point that DCTCP is *only* intended to be used in data centers made more strongly and explicitly in this document, by adding one more small paragraph in the intro or even one more sentence in the abstract. There are two reasons why I think it is very important to say this more explicitly: a) there is no negotiation that detects whether DCTCP is also implemented by the other end (this is mainly important for the ECN feedback), and b) if DCTCP is used in parallel with "traditional" (maybe even loss-based) congestion control, it will starve this competing traffic. I don’t think this point is mentioned anywhere, but it probably should be, at least very briefly.

2) This document does not describe what the Microsoft implementation does if loss is detected. I think the normal assumption is that there should be no (congestion-based) loss anymore in a data center where all traffic is using DCTCP. However, the draft should still say what to do if loss occurs. If I remember correctly, the Linux implementation has a configuration parameter to decide whether a loss should be ignored or the window should be halved. Maybe you can clarify this with Daniel Borkmann, who implemented it in Linux; I know he has read the draft and is probably willing to review it again or even provide some text on this if needed.

Other minor comments:

1) I would like to see mentioned in the abstract as well as in the intro that the gain in lower latency is achieved by using a very low marking threshold in the AQM. This document is written mainly to describe only the endpoint changes and seems to suggest that DCTCP could work with any ECN-enabled AQM. That might be true or not; at least I have not seen any evaluation results where different AQM schemes are used. I personally would prefer to describe DCTCP as a whole system that implements a certain congestion control scheme (and ECN feedback) and also assumes a specific parameterization of the bottleneck’s AQM. I don’t think it is a problem to describe it that way, because DCTCP is clearly only intended to be used in data centers where all endpoints and network nodes are under the control of one entity. Actually, it should even be mentioned that the proposed parameterization can easily be configured on all switches that implement RED.
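
To illustrate the last point, here is a minimal sketch of how a RED-style marker collapses into a single-threshold marker when both RED thresholds are set to the same value K and the instantaneous queue length is used. The function name, the simplified RED curve (no queue averaging, no count-based correction), and the value K = 20 packets are assumptions for illustration only; they are not taken from the draft or from any switch implementation.

```python
def red_mark_probability(queue_len, min_th, max_th, max_p):
    """Simplified RED marking probability for a given queue length.

    Assumption: the instantaneous queue length is used (no averaging).
    """
    if queue_len < min_th:
        return 0.0          # short queue: never mark
    if queue_len >= max_th:
        return 1.0          # at or above the upper threshold: always mark
    # Linear ramp between the thresholds (only reachable if min_th < max_th).
    return max_p * (queue_len - min_th) / (max_th - min_th)


K = 20  # hypothetical marking threshold in packets

# With min_th == max_th == K the ramp disappears and marking becomes a step.
for q in (5, 19, 20, 40):
    print(q, red_mark_probability(q, min_th=K, max_th=K, max_p=0.1))
```

With equal thresholds the ramp branch is never reached, so the marker behaves as "mark every packet once the queue has reached K", which is the low-threshold behavior the latency gain depends on.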

2) The second point goes in the same direction: section 3.1 says:
"However, the actual
algorithm for marking congestion is an implementation detail of the
switch and will generally not be known to the sender and receiver.
Therefore, sender and receiver MUST NOT assume that a particular
marking algorithm is implemented by the switching fabric."
First of all, this informational document should not use normative language here. Second, if DCTCP is used and tested by Microsoft only with a certain configuration, I would remove this sentence and rather write something like:
"Even though the actual
algorithm for marking congestion is an implementation detail of the
switch and will generally not be known to the sender and receiver,
DCTCP has been designed and evaluated to be used with the configuration described above."

3) The first paragraph in the intro says that the problem with low-cost switches is high loss. I actually thought that the problem with the standard config of low-cost switches is high delay because there is too much buffering… Can you comment on this?

4) The second paragraph says "worker nodes complete at approximately the same time". I thought that they would even complete at exactly the same time because they often have a timeout. Can you comment here as well? Maybe also refer to the DCTCP paper here, because it contains a good analysis of incast.

5) Paragraph on ECN in the intro says
"In the presence of mild congestion, it reduces the TCP congestion window too
aggressively…"
"It" refers here to the ECN mechanism of RFC 3168. However, it is the congestion control that reduces the window. Therefore the sentence should rather be:
"In the presence of mild congestion, traditional TCP congestion control reduces the TCP congestion window too aggressively…"

6) Abstract and intro both say "DCTCP enhances ECN…"; I really don’t think "enhance" can be used here so generally. If the congestion control only needs one feedback signal per RTT, there is no enhancement. Maybe say "changes" or "improves the accuracy of the ECN feedback signal by signaling not only the occurrence of congestion but also its extent", or something like this.

7) As also mentioned above, I think the intro should say that the more aggressive congestion control of DCTCP should not be used in parallel with traditional congestion control, as it will starve this traffic.

8) I’d like to propose using the list given in section 3 not only to say which elements are "involved" but also to say very briefly what was "modified", e.g.

"There are three components that need to be modified to use DCTCP:

o The switch (or other intermediate device on the network) detects
congestion and sets the Congestion Experienced (CE) codepoint in
the IP header as soon as a very short queue builds up.

o The receiver echoes each occurrence of the Congestion Experienced (CE) mark
back to the sender using the ECN-Echo (ECE) flag in the TCP header.

o The sender reacts to the congestion indication by reducing the TCP
congestion window (cwnd) based on the extent of congestion experienced in the last RTT."

I think this gives a nice high-level overview and helps the reader understand the rest more easily.

9) Maybe add to section 3.2 the state machine image as provided in the paper. In an RFC, I personally would also describe the algorithm first and then give the reasoning, but that’s a matter of which style is preferred…
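
For readers without the paper at hand, the receiver state machine can be sketched roughly as follows. This is my reading of the mechanism and not text from the draft: the class and method names are hypothetical, ACKs are modeled simply as (segments_acked, ece) pairs, and a delayed ACK is assumed to be sent every two segments.

```python
class DctcpReceiver:
    """Sketch of a two-state ECN echo with delayed ACKs (assumed model)."""

    def __init__(self, delack_every=2):
        self.ce_state = False      # CE value of the most recent segment
        self.pending = 0           # segments received but not yet ACKed
        self.delack_every = delack_every

    def on_segment(self, ce):
        """Process one data segment; return the ACKs to send as (count, ece)."""
        acks = []
        if ce != self.ce_state:
            # CE state changes: immediately ACK what is pending, echoing the
            # OLD state, so every ACK covers only segments of one kind.
            if self.pending:
                acks.append((self.pending, self.ce_state))
                self.pending = 0
            self.ce_state = ce
        self.pending += 1
        if self.pending >= self.delack_every:
            # Regular delayed ACK, echoing the current state.
            acks.append((self.pending, self.ce_state))
            self.pending = 0
        return acks


receiver = DctcpReceiver()
for ce in (False, True, True, False):
    print(ce, receiver.on_segment(ce))
```

The point of the immediate ACK on a state change is that no single ACK has to cover a mix of marked and unmarked segments, even though delayed ACKs remain in use.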

10) Section 3.3: Quick question: Why is the fraction of marked packets called M in the draft while it’s F in the paper?

11) Again section 3.3: Maybe mention somewhere that SACK information is not used to calculate BytesAcked because it is usually assumed that there is no loss when DCTCP is used. However, I guess if a calculation as given in RFC6937 (PRR) for DeliveredData is used, that is probably okay as well and should not be a problem, right?
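
As a small sketch of why the two should agree in the loss-free case: RFC 6937 computes DeliveredData as the change in snd.una plus the (signed) change in SACKed bytes. The function and parameter names below are hypothetical, and the byte values are made up for illustration.

```python
def delivered_data(prev_una, new_una, prev_sacked, new_sacked):
    """Bytes newly reported as delivered by one ACK, in the style of RFC 6937.

    prev_una/new_una: cumulative ACK point before and after this ACK.
    prev_sacked/new_sacked: total SACKed bytes before and after this ACK.
    """
    return (new_una - prev_una) + (new_sacked - prev_sacked)


# No loss, hence no SACK blocks: identical to the plain advance of snd.una.
print(delivered_data(prev_una=1000, new_una=3000, prev_sacked=0, new_sacked=0))

# With a hole: a SACK block reports 1460 bytes although snd.una does not move.
print(delivered_data(prev_una=1000, new_una=1000, prev_sacked=0, new_sacked=1460))
```

Without loss the SACK term is always zero, so counting acknowledged bytes this way gives the same result as using the cumulative ACK alone.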

12) Further, section 3.3 says "when the sender receives an indication of congestion". Can you maybe be a little more specific here and say "if the first ECE is received in an RTT" or something like this? Please also note that this behavior reduces the congestion window while the marking fraction is still low, because the window is reduced at the first occurrence of ECE, and all other ECE markings arrive in the following RTT, during which the window is not reduced further. Therefore the window will only really be reduced if the congestion persists for more than one RTT (which delays the reaction to congestion a little). That’s okay, because it allows the sender not to react to short bursts of congestion. However, I think this should be mentioned/discussed, as it might not be obvious when reading the draft.
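
The delayed reaction can be seen in a minimal sketch of the sender side. It assumes the estimator alpha = (1 - g) * alpha + g * M with g = 1/16 and the reduction cwnd = cwnd * (1 - alpha / 2), applied once per RTT at the first ECE. The class name, the starting window of 100 segments, and the omission of window growth are simplifications for illustration, not a description of the Microsoft implementation.

```python
from dataclasses import dataclass


@dataclass
class DctcpSender:
    cwnd: float = 100.0      # congestion window in segments (hypothetical)
    alpha: float = 0.0       # running estimate of the marked fraction
    g: float = 1.0 / 16      # estimation gain (assumed value)

    def run_rtt(self, marked_fraction):
        """Process one RTT in which marked_fraction of the bytes were marked."""
        if marked_fraction > 0.0:
            # First ECE of this RTT: reduce once, using alpha as estimated
            # from PREVIOUS RTTs only.
            self.cwnd *= 1.0 - self.alpha / 2.0
        # End of the RTT: fold this RTT's marked fraction M into alpha.
        self.alpha = (1.0 - self.g) * self.alpha + self.g * marked_fraction
        return self.cwnd


sender = DctcpSender()
for rtt in range(3):
    print(rtt, sender.run_rtt(marked_fraction=1.0), sender.alpha)
```

In this sketch the first congested RTT leaves cwnd at 100.0 because alpha is still zero when the first ECE arrives; only in the second congested RTT does the window shrink, to 96.875.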

13) Next sentence is
"Thus, when no sent byte experienced congestion, DCTCP.Alpha equals
zero, and cwnd is left unchanged."
I think this sentence is nonsensical, as the previous sentence says the congestion window will only be reduced if congestion is signaled…

14) As already mentioned, section 5 should also mention deployment problems if DCTCP is used in parallel with traditional TCP congestion control.

15) Section 6 says:
"However, if an
ACK packet is dropped, it's possible that a subsequent ACK will
indeed acknowledge a mix of CE and non-CE segments."
Either I don’t understand this sentence or it is not true. To my understanding, an ACK either has ECE set or not, and therefore either acks CE marks or not. However, if e.g. only one ACK has ECE set and this ACK gets lost, the following ACK without ECE will cumulatively ack all packets as not CE-marked, and this information is completely lost.
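
A minimal sketch of what I mean, assuming the sender simply attributes all bytes newly covered by an ACK to "marked" or "not marked" depending on that ACK's ECE flag. The function name and the sequence numbers are hypothetical.

```python
def count_marked_bytes(acks, initial_una=0):
    """Return (bytes_acked, bytes_marked) for the ACKs that actually arrive.

    acks: list of (cumulative_ack_number, ece_flag) tuples, in arrival order.
    """
    una = initial_una
    bytes_acked = 0
    bytes_marked = 0
    for ack_no, ece in acks:
        newly_acked = max(0, ack_no - una)   # bytes newly covered by this ACK
        una = max(una, ack_no)
        bytes_acked += newly_acked
        if ece:
            bytes_marked += newly_acked      # all of them count as marked
    return bytes_acked, bytes_marked


sent_acks = [(1000, False), (2000, True), (3000, False)]

# All ACKs arrive: 1000 of the 3000 bytes are counted as marked.
print(count_marked_bytes(sent_acks))

# The only ECE ACK is lost: the next ACK covers its bytes without ECE.
print(count_marked_bytes([sent_acks[0], sent_acks[2]]))
```

In the second case the sender still sees 3000 bytes acknowledged but counts none of them as marked, so the lost ACK does not produce a mix; the CE information simply disappears.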

Thanks again for writing this document. I hope it can be published soon!

Mirja
