Re: [tcpm] Review of draft-ietf-tcpm-dctcp-01

Praveen Balasubramanian <pravb@microsoft.com> Sun, 13 November 2016 19:52 UTC

Return-Path: <pravb@microsoft.com>
X-Original-To: tcpm@ietfa.amsl.com
Delivered-To: tcpm@ietfa.amsl.com
Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id A2A4F12941A for <tcpm@ietfa.amsl.com>; Sun, 13 Nov 2016 11:52:23 -0800 (PST)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -1.992
X-Spam-Level:
X-Spam-Status: No, score=-1.992 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, HTML_MESSAGE=0.001, RCVD_IN_DNSWL_NONE=-0.0001, RCVD_IN_MSPIKE_H2=-0.001, SPF_HELO_PASS=-0.001, SPF_PASS=-0.001, T_KAM_HTML_FONT_INVALID=0.01] autolearn=ham autolearn_force=no
Authentication-Results: ietfa.amsl.com (amavisd-new); dkim=pass (1024-bit key) header.d=microsoft.com
Received: from mail.ietf.org ([4.31.198.44]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id zKNMZRLAYg-E for <tcpm@ietfa.amsl.com>; Sun, 13 Nov 2016 11:52:20 -0800 (PST)
Received: from NAM01-BY2-obe.outbound.protection.outlook.com (mail-by2nam01on0131.outbound.protection.outlook.com [104.47.34.131]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-SHA384 (256/256 bits)) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id 484D612961A for <tcpm@ietf.org>; Sun, 13 Nov 2016 11:52:20 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=selector1; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version; bh=+FMpILQSkhSl+cxpJAgttyrgucu31eli+CLfXiNz35g=; b=df4aOAa12IEtNZt9yjTfGp2MsHwlFJYirSkMS3E9POnyNP4UehrtiFGjAlH6kSP+TA9gH/fttND7qFRv8Tv4lZX3lEbtj3GKbKmsO9dyBpJRWwRgoEA4/V9Fa1mkAE1M0FmSk6KrA3+Ibo1tM7osxuPBMo+ZVnd5B+/gH18LGOE=
Received: from BY2PR03MB011.namprd03.prod.outlook.com (10.255.240.37) by BY2PR03MB012.namprd03.prod.outlook.com (10.255.240.38) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384_P384) id 15.1.659.11; Sun, 13 Nov 2016 19:52:17 +0000
Received: from BY2PR03MB011.namprd03.prod.outlook.com ([169.254.14.33]) by BY2PR03MB011.namprd03.prod.outlook.com ([169.254.14.33]) with mapi id 15.01.0659.028; Sun, 13 Nov 2016 19:52:16 +0000
From: Praveen Balasubramanian <pravb@microsoft.com>
To: Bob Briscoe <ietf@bobbriscoe.net>
Thread-Topic: Review of draft-ietf-tcpm-dctcp-01
Thread-Index: AQHRGMDBGTpsEllovUSobL5AqqWT0Z6S8nXAgb25KACAfKLFwA==
Date: Sun, 13 Nov 2016 19:52:15 +0000
Message-ID: <BY2PR03MB0117403DA329375E35CB146B6BD0@BY2PR03MB011.namprd03.prod.outlook.com>
References: <563CF0F7.6070302@bobbriscoe.net> <BN1PR03MB008D16E15DCBE49E7C21A4AB6350@BN1PR03MB008.namprd03.prod.outlook.com> <71a48e5c-c992-de4d-12c3-7779ee702bd6@bobbriscoe.net>
In-Reply-To: <71a48e5c-c992-de4d-12c3-7779ee702bd6@bobbriscoe.net>
Accept-Language: en-US
Content-Language: en-US
X-MS-Has-Attach:
X-MS-TNEF-Correlator:
authentication-results: spf=none (sender IP is ) smtp.mailfrom=pravb@microsoft.com;
x-originating-ip: [2001:4898:80e8:f::584]
x-microsoft-exchange-diagnostics: 1; BY2PR03MB012; 7:rQALbJWPV2HVZEreN5+L+UkYQ0FmBTvMHinkcQ2TObz54AiRTjbl61pheWCyPAMNFNrBSx04s8uVLrpdtq6r4mDKzZ1oQ1P55geVONy/qQC4ZpETmolzxFCYSj7tkccXRXFUdLLeYlcgIGnBP3yij+0lMu4SKTVtUkDo0MNKVkI+VNw+OZk4pjyRPyCMFobsOCdS0w6F2kCqHcPd+KFUjQ09U0o1d1k4suc9pZT0/vyS+z8eW80UAri5mLpzwEvVE2cIIK62IIcDXuGea7Br5UJ9jKWQDZHxr3NOk0UZTznNgU08Mcg7U5obQNmSoOPnfvxSCHbwwyoKGrs1y4sxTxESqKJFhLBY9ky5ANqAK1RrQr8pn6bTNi3XsmdsMgFI
x-ms-office365-filtering-correlation-id: 738c3d8d-b02e-4987-ebcf-08d40bfe91b3
x-microsoft-antispam: UriScan:;BCL:0;PCL:0;RULEID:(22001);SRVR:BY2PR03MB012;
x-microsoft-antispam-prvs: <BY2PR03MB012CD9BB9833C26A9ACBB5FB6BD0@BY2PR03MB012.namprd03.prod.outlook.com>
x-exchange-antispam-report-test: UriScan:(158342451672863)(192374486261705)(189930954265078)(222014917165493)(219752817060721)(21748063052155);
x-exchange-antispam-report-cfa-test: BCL:0; PCL:0; RULEID:(61425038)(6045074)(6060229)(601004)(2401047)(8121501046)(5005006)(3002001)(10201501046)(6055026)(61426038)(61427038)(6046074)(6061226); SRVR:BY2PR03MB012; BCL:0; PCL:0; RULEID:; SRVR:BY2PR03MB012;
x-forefront-prvs: 012570D5A0
x-forefront-antispam-report: SFV:NSPM; SFS:(10019020)(7916002)(336003)(189002)(199003)(377454003)(51914003)(24454002)(10290500002)(76576001)(87936001)(586003)(3280700002)(4326007)(54356999)(5005710100001)(230783001)(5660300001)(106116001)(229853002)(10090500001)(33656002)(106356001)(8990500004)(105586002)(122556002)(50986999)(99286002)(92566002)(2906002)(15395725005)(102836003)(6116002)(790700001)(86362001)(7846002)(101416001)(7906003)(76176999)(81156014)(81166006)(97736004)(8936002)(2950100002)(9686002)(110136003)(8676002)(86612001)(7736002)(7696004)(189998001)(2900100001)(68736007)(74316002)(6916009)(3660700001)(559001)(579004); DIR:OUT; SFP:1102; SCL:1; SRVR:BY2PR03MB012; H:BY2PR03MB011.namprd03.prod.outlook.com; FPR:; SPF:None; PTR:InfoNoRecords; MX:1; A:1; LANG:en;
received-spf: None (protection.outlook.com: microsoft.com does not designate permitted sender hosts)
spamdiagnosticoutput: 1:99
spamdiagnosticmetadata: NSPM
Content-Type: multipart/alternative; boundary="_000_BY2PR03MB0117403DA329375E35CB146B6BD0BY2PR03MB011namprd_"
MIME-Version: 1.0
X-OriginatorOrg: microsoft.com
X-MS-Exchange-CrossTenant-originalarrivaltime: 13 Nov 2016 19:52:15.8920 (UTC)
X-MS-Exchange-CrossTenant-fromentityheader: Hosted
X-MS-Exchange-CrossTenant-id: 72f988bf-86f1-41af-91ab-2d7cd011db47
X-MS-Exchange-Transport-CrossTenantHeadersStamped: BY2PR03MB012
Archived-At: <https://mailarchive.ietf.org/arch/msg/tcpm/BPPKq58Qh8_aevCAO95LfY3PZXE>
X-Mailman-Approved-At: Mon, 14 Nov 2016 08:01:38 -0800
Cc: tcpm IETF list <tcpm@ietf.org>
Subject: Re: [tcpm] Review of draft-ietf-tcpm-dctcp-01
X-BeenThere: tcpm@ietf.org
X-Mailman-Version: 2.1.17
Precedence: list
List-Id: TCP Maintenance and Minor Extensions Working Group <tcpm.ietf.org>
List-Unsubscribe: <https://www.ietf.org/mailman/options/tcpm>, <mailto:tcpm-request@ietf.org?subject=unsubscribe>
List-Archive: <https://mailarchive.ietf.org/arch/browse/tcpm/>
List-Post: <mailto:tcpm@ietf.org>
List-Help: <mailto:tcpm-request@ietf.org?subject=help>
List-Subscribe: <https://www.ietf.org/mailman/listinfo/tcpm>, <mailto:tcpm-request@ietf.org?subject=subscribe>
X-List-Received-Date: Sun, 13 Nov 2016 19:52:23 -0000

Thanks for the review Bob. Inline prefixed by >>

From: Bob Briscoe [mailto:ietf@bobbriscoe.net]
Sent: Thursday, August 18, 2016 9:36 AM
To: Praveen Balasubramanian <pravb@microsoft.com>
Cc: tcpm IETF list <tcpm@ietf.org>
Subject: Re: Review of draft-ietf-tcpm-dctcp-01

Praveen,

On 17/07/16 21:25, Praveen Balasubramanian wrote:
Thanks Bob. Sorry for the <ridiculous> delay.
[BB] And now sorry for my delay. I tried to finish this email a number of times, but kept getting interrupted.

Comments inline prefixed by <Praveen>. Most of these edits will be in the next revision.
[BB] I've tried very hard to only comment where I really do think there is still a problem (even if minor), and suggested specific text to help you...



Thanks

From: Bob Briscoe [mailto:ietf@bobbriscoe.net]
Sent: Friday, November 6, 2015 10:27 AM
To: Praveen Balasubramanian <pravb@microsoft.com><mailto:pravb@microsoft.com>; EGGERT, Lars <lars@netapp.com><mailto:lars@netapp.com>; Stephen Bensley <sbens@microsoft.com><mailto:sbens@microsoft.com>; Dave Thaler <dthaler@microsoft.com><mailto:dthaler@microsoft.com>; glenn.judd@morganstanley.com<mailto:glenn.judd@morganstanley.com>
Cc: tcpm IETF list <tcpm@ietf.org><mailto:tcpm@ietf.org>
Subject: Review of draft-ietf-tcpm-dctcp-01

Praveen & co-authors,

As promised, here's my review of draft-ietf-tcpm-dctcp-01

General Comments

In general, in the comments below I have tried to reduce the multiple assumptions about controlled environments down to the one assumption that they are controlled by a single authority. I believe DCTCP can work without the other assumptions, so the following do not apply in all controlled environments:
  * not always low RTT (e.g. inter-DC scenarios, or global private networks)
  * not always no risk of traffic attacks (e.g. internally arranged attacks)
I may have missed some other occurrences of these assumptions, so pls feel free to seek out them all.


<Praveen> DCTCP can be extended to other environments but this draft does not cover any of that. It is explicitly for deployment in a datacenter where the RTT is low and there is no risk of internal attack. We cannot cover other scenarios that have not been tested or deployed. All the known deployments are in controlled datacenters. The TCP Prague effort should result in a draft that covers additional requirements that will make DCTCP work well on high latency links.


Many of the comments argue with your choices of MUST, SHOULD, MAY etc. This might seem nit-picking, given the primary purpose is to document the algo. However, it would be wrong to require things that aren't required or to allow exceptions when exceptions would not be interoperable. An alternative approach would be to just remove all capitalised language.


<Praveen> I am fine with removing such capitalization except for when removing it will lead to interop problems between different implementations of DCTCP. I am responding to each comment below and adding whether the new revision will fix it not.

Section-by-section review.

Abstract

I think it needs to have an applicability statement in the abstract (and introduction) that refers to the deployment and implementation status sections at the end. Something like:

"This is an informational specification of the implementation of DCTCP in Microsoft Windows Server 2012. It is applicable to deployments in controlled environments like data centres but it MUST NOT be deployed over the public Internet without additional measures, as detailed in sections 4-6. "


<Praveen> This is already addressed in the introduction. I don’t see the value of repeating it multiple times. An implementer will read the introduction. If you still feel strongly, we can add it to the abstract and remove it from the introduction section. This draft will not cover any additional measures that are intended for deployments outside the datacenter – that’s not the intended purpose.
[BB] It is very common for the IETF to require even much less minor applicability statements to be highlighted in the abstract (e.g. see TFO RFC). This is because nowadays implementers / readers often skip introductory text assuming they understand the problem, etc. and they just scan for the normative parts. Anyone reading an IETF RFCs assumes the context is the public Internet, so any RFC that doesn't have that context needs to flag that. I know the name says Data Centre, but the abstract needs to explain that this is not an IETF RFC that is standardizing DCTCP for the public Internet.

If you still don't think so, we should put this to the WG (which will slow it down).


>> Alright including text to explicitly say so.


1. Intro
s/its many servers/
 /their many servers/


<Praveen> Fixed in next revision.


"...limited queue capacities..."
I understand that the motivation has been abbreviated, but I think this has lost too much. Perhaps mention memory shared between interfaces, so either bloated buffers or too little, making traffic susceptible to packet losses?

   [RFC3168] describes a mechanism for using Explicit Congestion
   Notification (ECN) from the switches for early detection of
   congestion, rather than waiting for packet loss to occur.

This isn't correct. 3168 requires ECN marking to occur no earlier than drop. It is precisely that 3168 does not allow early marking unless there is also early drop that is the problem.


<Praveen> Fixed in next revision.


   It is recommended that DCTCP be deployed...

"recommended" is too weak. I suggest you use something like the applicability text given above, both here and in the abstract.

<Praveen> Added the word “only” to make this stronger per Michael’s review. Fixed in next revision.
[BB] I suggest the applicability para (here in the intro) refers forward to section 5 for details (e.g. so you don't have to explain why it shouldn't be deployed in the intro).

>> Added


2. Terminology

Suggest you explain that capitalised words are not quite the same as in most RFCs:
"Normative language is used to describe how necessary the various aspects of the Microsoft implementation are for interoperability, but even compliant implementations without the measures in sections 4-6 would still only be safe to deploy in controlled environments."


<Praveen> Fair enough. Added text in next revision.


3. DCTCP Algo

The three bullets say no more than "this is normal stuff". Would it be better to supplement each sentence with a brief phrase saying what is different about each aspect compared to RFC3168?


<Praveen> The goal here it to describe the changes needed on top of 3168 so I don’t think further explanation is necessary.


3.1 Marking

This needs to say something about how you mark the IP header from L2. I suggest you refer to the up-and-forward mode of draft-ietf-tsvwg-ecn-encap.


<Praveen> Why is that relevant to this document? It is not intended as a specification for operation of L2 switches. Won’t fix.
[BB] Well, you specify a lot about the marking algo, e.g.
K > (RTT *  C)/7

A simple solution would be: s/switch/L3 switch/

I just thought that a reader would think, "You say switches (i.e. L2 devices) mark the IP header - how can that work?" unless you explain.


>> Fair enough, Changed to L3 switches and routers.



s/sending rate/
 /link rate/



<Praveen> Fixed in next revision



3.2

CURRENT:

   1.  If the CE codepoint is set and DCTCP.CE is false, send an ACK for

       any previously unacknowledged packets and set DCTCP.CE to true.



   2.  If the CE codepoint is not set and DCTCP.CE is true, send an ACK

       for any previously unacknowledged packets and set DCTCP.CE to

       false.
SUGGESTED:

   1.  If the CE codepoint is set and DCTCP.CE is false, set DCTCP.CE to

       true and send an ACK for any previously unacknowledged packets.



   2.  If the CE codepoint is not set and DCTCP.CE is true, set DCTCP.CE

       to false and send an ACK for any previously unacknowledged packets.
    The figure would have to be changed to be consistent too.
RATIONALE:
If the receiver changes DCTCP.CE /after/ sending the ACK like this, it delays each signal until the following ACK is sent. That could add a delay of anything from 1 to m packets. I've never understood why the order of these operations was specified in this way. If there is no reason, then I suggest that it is specified in the opposite order, as I describe above.

A note could be added to say the Windows Server 12 implementation does these steps in reverse order, but the order specified is preferred because it reduces signalling delay and has no interoperability issues with the original Windows implementation.
<Praveen> Won’t fix. This is how the Windows implementation behaves so the ordering is required. Both the paper and the draft are consistent in this.
[BB] The paper and the draft are not consistent. In the draft, the labels on the top and bottom arrows in the diagram have been switched round relative to those in the paper.

However, the text in the draft has not (yet?) been changed to reflect this.

There is no delay being introduced here. This ACK is sent immediately so the order does not matter.

[BB] I agree that the ACK itself is not delayed. However, even tho a CE has been received, feedback of this event is held back until the next ACK.

This is a minor 'unnecessary delay' bug.
* I believe it was fixed in the FreeBSD implementation after I pointed it out.
* I've just checked the Linux code, and it's not been fixed there.
* Obviously I cannot see the Windows source code.

If this bug is still in the Windows implementation, it's up to you if you also want the spec to require that the bugs are implemented exactly as they are in Windows ;)


>> There are two separate issues here. One is that the packet that triggers the CE state to change must itself not be accounted for in the immediate ACK. I think the text makes this clear with the use of the word “previously unacknowledged packets”. I am adding a sentence to ensure that this is interpreted correctly otherwise there can be a change in how alpha is computed if the packet was a data packet. The other issue is that the transition packet itself will not be ACKed immediately and I don’t see that as an issue. With data center latencies and the guidance around using a short delayed ACK timeout the unnecessary delay is not a big issue. Also, the delay is only present if there are no subsequent packets of the flow in flight which should be rare.






   The handling of the "Congestion Window Reduced" (CWR) bit is also

   exactly as per [RFC3168<https://na01.safelinks.protection.outlook.com/?url=https%3a%2f%2ftools.ietf.org%2fhtml%2frfc3168&data=01%7c01%7cpravb%40microsoft.com%7ca572b0ba8e1b47d3e44a08d2e6d7e03f%7c72f988bf86f141af91ab2d7cd011db47%7c1&sdata=Ce9m5q0XzkbQNWEb1rqmXrqx12Uf4LDscrUAz5C78eY%3d>] including [RFC3168-ERRATA3639<https://na01.safelinks.protection.outlook.com/?url=https%3a%2f%2ftools.ietf.org%2fhtml%2fdraft-ietf-tcpm-dctcp-01%23ref-RFC3168-ERRATA3639&data=01%7c01%7cpravb%40microsoft.com%7ca572b0ba8e1b47d3e44a08d2e6d7e03f%7c72f988bf86f141af91ab2d7cd011db47%7c1&sdata=wDJv%2bcypXj5layfzvQGhXYqB6kwl8lW%2fpdQ4iYg2cN8%3d>].  That is, on

   receipt of a segment with both the CE and CWR bits set, CWR is

   processed first and then ECE is processed.
Even tho this is the receiver section, I suggest:
s/The handling/Receiver handling/


<Praveen> Isn’t this assumed by a reader familiar with 3168? But this is a minor editorial fix so fixed in the next revision.

Whatever, this cannot be correct. A DCTCP receiver surely does nothing on receipt of CWR, whereas an RFC3168 receiver does something. A DCTCP receiver certainly cannot do exactly what RFC3168 says, because that says stop setting ECE on receipt of CWR until the next CE arrives, which would surely break this feedback protocol.


<Praveen> I don’t see why this would break the feedback protocol. The receiver will start echoing again as soon as it sees the first CE arrives again. If packet has both CWR and CE set then CE takes precedence and the echoing is done.
[BB] I'm not saying it breaks the feedback protocol. I'm just saying it's confusing for the reader.

Of course, when the receiver is already not doing something, ignoring a signal to stop has the same outcome as stopping. However, describing this as "handling CWR ... is also exactly as per RFC3168" is misleading for the reader/implementer.

I guess changing to "handling CWR ... is as per RFC3168" would be a compromise.


>> Fixed this to remove “exactly”.

3.3

   DCTCP.Alpha, which is initialized to 1 and MUST be updated

   as follows:
s/MUST/SHOULD/
And add "An alternative congestion avoidance algorithm MAY be used, but it MUST lead to flow rate inversely proportional to 1/M".

Rationale: There are many ways to implement interoperable 1/p cc. This is just one.

<Praveen> Alternate algorithms are out of scope of this draft. We cannot say without interop testing if a different algorithm will perform well in datacenters or interop well with existing implementations. If there are other ways to implement 1/p they should be covered by other drafts. This draft will only describe the algorithm that has been tested and deployed. That said I am willing to make it a SHOULD so fixed in the next revision.



   DCTCP.BytesSent
This variable actually counts bytes received. it's pretty confusing to call it BytesSent.


<Praveen> No, this is tracking bytes that have been sent.
[BB] I meant it counts bytes now acknowledged as received by the other end (which is not the same as bytes sent by this end).
DCTCP.BytesAcked would have been more meaningful. I'm not insisting on this - the original comment was just to improve comprehensibility.


>>  Fair enough now I see what you meant. Updating to BytesAcked


s/TCP SACK options [RFC2018] are ignored./
 /The sender MUST ignore TCP SACK options [RFC2018] for this computation./
RATIONALE:
  a) I think the idea is to ignore CE marks after a gap because, if the gap becomes considered a loss, the response to loss will override any need to count incoming CE marks for the following round trip. However, if the gap is filled before it is considered a loss {Note 1}, then CE marks that were ignored because they arrived after a gap will need to be "unignored".
  b) We need to make clear that SACK is not ignored altogether, only for this computation.

{Note 1}: I can think of two cases:
* Less than three dupacks then the gap is filled
* The gap is filled by a delayed segment so that an earlier retransmission was spurious
<Praveen> Ok I will update the text.



the sender MUST update cwnd as follows:

      cwnd = cwnd * (1 - DCTCP.Alpha / 2)
s/MUST/SHOULD/
Rationale: Alternative interoperable response functions are possible. For instance an algorithm like Relentless TCP that simply reduces cwnd by half the size of any CE-marked segment. Then there is no Alpha and no g variable.



<Praveen> Fixed



CURRENT:

   Just as specified in [RFC3168<https://na01.safelinks.protection.outlook.com/?url=https%3a%2f%2ftools.ietf.org%2fhtml%2frfc3168&data=01%7c01%7cpravb%40microsoft.com%7ca572b0ba8e1b47d3e44a08d2e6d7e03f%7c72f988bf86f141af91ab2d7cd011db47%7c1&sdata=Ce9m5q0XzkbQNWEb1rqmXrqx12Uf4LDscrUAz5C78eY%3d>], TCP should not react to congestion

   indications more than once for every window of data.
SUGGESTED:

   Just as specified in [RFC3168<https://na01.safelinks.protection.outlook.com/?url=https%3a%2f%2ftools.ietf.org%2fhtml%2frfc3168&data=01%7c01%7cpravb%40microsoft.com%7ca572b0ba8e1b47d3e44a08d2e6d7e03f%7c72f988bf86f141af91ab2d7cd011db47%7c1&sdata=Ce9m5q0XzkbQNWEb1rqmXrqx12Uf4LDscrUAz5C78eY%3d>], DCTCP does not react to congestion

   indications more than once for every window of data.
I don't know whether you intended this to be an upper-case SHOULD NOT. If so, I would argue against this being normative, because it isn't necessary for interop; therefore I've suggested avoiding he 'should' word.


<Praveen> Fixed




   The setting of

   the "Congestion Window Reduced" (CWR) bit is also exactly as per

   [RFC3168<https://na01.safelinks.protection.outlook.com/?url=https%3a%2f%2ftools.ietf.org%2fhtml%2frfc3168&data=01%7c01%7cpravb%40microsoft.com%7ca572b0ba8e1b47d3e44a08d2e6d7e03f%7c72f988bf86f141af91ab2d7cd011db47%7c1&sdata=Ce9m5q0XzkbQNWEb1rqmXrqx12Uf4LDscrUAz5C78eY%3d>].




It is wrong to say "exactly". The sender certainly sets CWR when it does a congestion response. But in DCTCP the congestion response occurs every RTT, triggered by its own measurement-period logic, whereas in RFC3168 it is triggered by the arrival of an ECE after the ACK of any previous segment it sent with CWR set.

I think it is worth saying that setting CWR is necessary for safety, otherwise in the case where there has been a configuration management error and the receiver complies with RFC 3168 not DCTCP, it will continue sending ECE for ever and starve the session.

It might help to explicitly confirm that:
"Other aspects of TCP congestion avoidance and control such as the slow start phase are no different to those in RFC5681."



<Praveen> Fixed that CWR is needed for safety. Not adding the text around 5681, it is presumed.


3.4 SYN, SYN-ACK, RST

   [RFC3168] requires that a compliant TCP MUST NOT set ECT on SYN or
   SYN-ACK packets.

s/MUST NOT/'MUST NOT'/
Rationale: Best to make it clear this is quoting a normative statement in another RFC, because it is going to be contradicted.
[BB] I don't know whether you saw this one - it's not been changed in draft-02.

I suggested adding quotes around 'MUST NOT'. This is good practice when quoting another RFC. Otherwise, to a careless reader (or non-native English speaker), it can look like it is also a requirement of this RFC.


>> Changed to lower case so as to not be misleading around being an ask of this RFC.

CURRENT

   These RFCs, however, are

   intended for general Internet use, and do not directly apply to a

   controlled datacenter environment.

...

   For DCTCP connections, the sender

   SHOULD set ECT for SYN, SYN-ACK and RST packets.
SUGGESTED:

   Also, a SYN is sent before the ECN-capability has been
   negotiated, so if it is CE-marked, a non-ECN TCP server would not
   recognise the marking. RFC 3168 does not specify the ECN field on
   a RST. The security concerns addressed by these RFCs
   might not apply in controlled environments like data centres, and
   it might not be necessary to cater for the presence of non-ECN
   servers.
...

   Therefore, the sender MUST set ECT for SYN/ACK and RST packets and a

   configuration option SHOULD be provided to set ECT for SYNs.
RATIONALE:
   The draft should avoid assuming that security concerns do not apply in all controlled environments.
NOTE:
   This might not reflect the way your implementation is coded, so you might want to make that clear.


<Praveen> Adding some text around security concerns.
[BB] The security sentence you have added in draft-02 doesn't make sense where it is. It ought to have been added once sentence earlier (and it would be better to start a new para). Specifically:
CURRENT:
                be dropped with high probability.  For DCTCP connections, the sender
                SHOULD set ECT for SYN, SYN-ACK and RST packets.  The security
                concerns addressed by both these RFCs might not apply in controlled
                environments like datacenters, and it might not be necessary to cater
                to both the presence of non-ECN servers.
SUGGESTED:
                be dropped with high probability.
<NEW PARA>
                The security concerns addressed by both these RFCs might not apply in controlled
                environments like datacenters, and it might not be necessary to cater
                for the presence of non-ECN servers. For DCTCP connections, the sender
                SHOULD set ECT for SYN, SYN-ACK and RST packets.

(Note I also fixed a glitch after "cater".)


>> Fixed typos but removed security implications from this section and moved them into section 8 (security).



4. Implementation Issues

CURRENT:

   the implementation MUST choose a suitable

   estimation gain.  [DCTCP10<https://na01.safelinks.protection.outlook.com/?url=https%3a%2f%2ftools.ietf.org%2fhtml%2fdraft-ietf-tcpm-dctcp-01%23ref-DCTCP10&data=01%7c01%7cpravb%40microsoft.com%7ca572b0ba8e1b47d3e44a08d2e6d7e03f%7c72f988bf86f141af91ab2d7cd011db47%7c1&sdata=33dRUUxyNa2tcX0xJtMl12nWwzIdHZa6j3kgAL1dxIM%3d>] provides a theoretical basis
SUGGEST:

   the implementation will need to choose a suitable

   estimation gain.  [ADCTCP10<https://na01.safelinks.protection.outlook.com/?url=https%3a%2f%2ftools.ietf.org%2fhtml%2fdraft-ietf-tcpm-dctcp-01%23ref-DCTCP10&data=01%7c01%7cpravb%40microsoft.com%7ca572b0ba8e1b47d3e44a08d2e6d7e03f%7c72f988bf86f141af91ab2d7cd011db47%7c1&sdata=33dRUUxyNa2tcX0xJtMl12nWwzIdHZa6j3kgAL1dxIM%3d>] provides a theoretical basis ...
RATIONALE:
   a) As above, this is only one way to implement the cc, and others could well prove to be interoperable when tested.
   b) More particularly, it would be better for an implementation to adapt the estimation gain to the RTT, so the spec ought not to imply that one value has to be chosen.
   c) The later paper [ADCTCP10] provides a corrected way to determine the gain. From memory (I'm offline at the mo) it at least makes throughput inversely proportional to 1/RTT, whereas the approach in [DCTCP10] made it inversely proportional to 1/(RTT^2). I believe the Linux implementation uses the [ADCTCP10] approach.

<Praveen> Fair enough. Updated.


   The implementation must also decide when to use DCTCP.

...

   likely in the same datacenter network.
The 2nd & 3rd paras are a mix of implementation and deployment issues, but mostly deployment, so they might be better moved to the deployment issues section.



<Praveen> I think they are mainly implementation issues – the fact that a lack of negotiation mechanism places a burden on the implementation to provide knobs or heuristics to use DCTCP.



CURRENT:

   It is RECOMMENDED that an implementation deal with loss episodes in

   the same way as conventional TCP.

...

   DCTCP implementation MAY also allow configuration of resetting the

   value of DCTCP.Alpha as part of processing any loss episodes.
SUGGESTED: Move this whole para to 3.3 Sender Behaviour section, and:

   An implementation MUST deal with loss episodes in

   the same way as conventional TCP.

...

   DCTCP implementation SHOULD also reset the

   value of DCTCP.Alpha as part of processing any loss episodes.

   This aspect of the behaviour MAY be configurable.
RATIONALE:
   a) There would never be a case for not falling back to conventional TCP  (or do you have one?), so "RECOMMENDED" is too weak.
   b) I have tried to interpret what you might have meant about resetting Alpha. I'm not sure why you think it might need to be configurable though?
<Praveen> Yes this is a useful suggestion. Making loss handling a must in sender section. Reset of alpha was discussed in tcpm list, and Linux implementation supports it.



   ...it is also RECOMMENDED that an implementation allow

   configuration of restarting the congestion window (cwnd) of idle

   DCTCP connections
Where special implementation measures are necessary for low RTT, these should be related to the measured RTT, not hard-coded. There is an assumption in a number of places  that "data centre" implies low RTT. Of course this is true in a single DC, but I believe the the single admin aspect of a DC is what characterises DCTCP, not the low RTT aspect. With the modifications in [ADCTCP], DCTCP generalises to a network with larger RTTs as long as it is still  operated by a single admin.



<Praveen> this paragraph has now been removed per Michaels’ feedback.


CURRENT:
   [RFC3168] forbids the ECN-marking of pure ACK packets, because of the
   inability of TCP to mitigate ACK-path congestion and protocol-wise
   preferential treatment by routers.  However, dropping pure ACKs -
   rather than ECN marking them - has disadvantages for typical
   datacenter traffic patterns.  Because of the prevalence of bursty
   traffic patterns that feature transient congestion, dropping of ACKs
   causes subsequent retransmissions. It is RECOMMENDED that an
   implementation provide a configuration knob that will cause ECT to be set
   on such control packes, which can be used in environments where such
   concerns do not apply.
SUGGESTED:
   [RFC3168] forbids the ECN-marking of pure ACK packets, because of the
   inability of TCP to mitigate ACK-path congestion and the extra
   advantage to injection attackers that ECN is perceived to offer.
   For the latter reason RFC 3168 also forbids ECN-marking of
   retransmissions, window probes and RSTs.

   However, dropping all these control packets -
   rather than ECN marking them - has considerable performance
   disadvantages.  It is RECOMMENDED that an
   implementation provide a configuration knob that will cause ECT to be set
   on such control packes, which can be used in environments where such
   concerns do not apply.
NOTE:
   This might not reflect the way your implementation is coded, so you might want to make that clear.




<Praveen> Fixed
[BB] A new comment...

As pointed out in draft-bagulo-tsvwg-generalized-ecn-00, RFC3168 only talks about injection attacks with retransmissions (not ACKs etc). But you imply injection attacks were the concern for other control packets as well.

I'm not insisting you change this. However, unwarranted accusations of security vulnerabilities tend to stick, whether or not they are correct. So it would be better not to throw mud around onto even more control packets than RFC3168 did.



>> Modified this to remove that suggestion.




Cheers


Bob




5. Deployment Issues

   If

   the traffic in the datacenter is a mix of conventional TCP and DCTCP,

   it is RECOMMENDED that DCTCP traffic be segregated from conventional

   TCP traffic.
In a data centre (or anywhere), is there any case where not segregating them would make sense? If not, this needs to be a MUST, not just a RECOMMENDED. I can imagine cases where there is no long-running conventional TCP or no long-running DCTCP, but that's covered by the "if" at the start of the sentence.

You might want to refer to draft-briscoe-aqm-dualq-coupled as another segregation approach here too.

   Since DCTCP relies on congestion marking by the switches, DCTCP can

   only be deployed in datacenters where the entire network

   infrastructure supports ECN.
Surely, the "fall-back to conventional TCP on loss" rule means you can delete this constraint.

<Praveen> Reworded to imply that DCTCP will only work as expected in such cases.


   A

   variant of DCTCP that can be deployed unilaterally and only requires

   standard ECN behavior has been described in [ODCTCP<https://na01.safelinks.protection.outlook.com/?url=https%3a%2f%2ftools.ietf.org%2fhtml%2fdraft-ietf-tcpm-dctcp-01%23ref-ODCTCP&data=01%7c01%7cpravb%40microsoft.com%7ca572b0ba8e1b47d3e44a08d2e6d7e03f%7c72f988bf86f141af91ab2d7cd011db47%7c1&sdata=ZwfJ0LgHQ3vk%2bMD0wiia5yLrcMMSUBcGgPYXmeRu8hE%3d>][BSDCAN],

Which part of "standard ECN behaviour" do you mean? The host part? The network part? The receiver part? And "can be deployed unilaterally" ought to be in the active voice not the passive, which would help resolve the ambiguity too.

I haven't read [ODCTCP] and [BSDCAN] (I'm offline on a plane at the mo). Are they similar to "Instant ECN" in [Wu12] listed below? If not, that could be referred to as well for a technique that uses the regular ECN host behaviour.



<Praveen> This edit predates me and  I don’t have the full context, I am leaving this as is.



6. Known Issues

First para could refer to appendix A of RFC7560, which describes the problem precisely.




A method for improving the fairness of DCTCP has been proposed

   in [ADCTCP<https://na01.safelinks.protection.outlook.com/?url=https%3a%2f%2ftools.ietf.org%2fhtml%2fdraft-ietf-tcpm-dctcp-01%23ref-ADCTCP&data=01%7c01%7cpravb%40microsoft.com%7ca572b0ba8e1b47d3e44a08d2e6d7e03f%7c72f988bf86f141af91ab2d7cd011db47%7c1&sdata=edU2TPGpLvyFGvgocDRJiYRQyjCT%2fWEOgS2MwmwzYuU%3d>], but requires additional experimental evaluation.
s/fairness/RTT fairness/

(In our experiments, it's much better than without.)



<Praveen> Goal is to cite other work that can address potential issues. It is not a recommendation. s/fairness/RTT fairness/ Fixed in next revision.


11. References

I think the following are cited normatively, not just informatively:
RFC5681, RFC5562.

Additional informational Reference:

[Wu12] Wu, H., Ju, J., Lu, G., Guo, C., Xiong, Y. & Zhang, Y., "Tuning ECN for Data Center Networks," In: Proceedings of the 8th International Conference on Emerging Networking Experiments and Technologies CoNEXT '12 pp.25-36 ACM (2012)



<Praveen> Citation to RFCs updated as normative. Not including the new reference since even though it may be relevant – it was not used as a reference in writing the draft.


That's it. HTH.


Bob



--

________________________________________________________________

Bob Briscoe                               http://bobbriscoe.net/<https://na01.safelinks.protection.outlook.com/?url=http%3a%2f%2fbobbriscoe.net%2f&data=01%7c01%7cpravb%40microsoft.com%7ca572b0ba8e1b47d3e44a08d2e6d7e03f%7c72f988bf86f141af91ab2d7cd011db47%7c1&sdata=iEvhWj9FCPozWAp8rnxk5VZdNF%2fv7W7aD1Cp4MxOiSs%3d>



--

________________________________________________________________

Bob Briscoe                               http://bobbriscoe.net/

--

________________________________________________________________

Bob Briscoe                               http://bobbriscoe.net/