Re: [tsvwg] Re: Re: Takeaways for HPCC++

"Scharf, Michael" <Michael.Scharf@hs-esslingen.de> Wed, 11 November 2020 09:39 UTC

From: "Scharf, Michael" <Michael.Scharf@hs-esslingen.de>
To: "Rui, Miao" <miao.rui@alibaba-inc.com>, Richard Li <richard.li@futurewei.com>, tsvwg <tsvwg@ietf.org>
CC: Barak Gafni <gbarak@mellanox.com>, tsvwg-chairs <tsvwg-chairs@ietf.org>, "Liu, Hongqiang(洪强)" <hongqiang.liu@alibaba-inc.com>, "Pan, Rong" <rong.pan@intel.com>, "Lee, Jeongkeun" <jk.lee@intel.com>
Date: Wed, 11 Nov 2020 09:39:17 +0000
Archived-At: <https://mailarchive.ietf.org/arch/msg/tsvwg/37SUp8JSE_tJXzKM6-tDcOhJaeA>
Subject: Re: [tsvwg] Re: Re: Takeaways for HPCC++

RFC 6077, as well as my thesis, deals with *routers*, which can operate over many different link layer technologies. As far as I know, many link layer technologies do not have a constant physical link capacity (WLAN, 3G/4G/5G, etc.). A typical reason is adaptive encoding and link-layer retransmissions.

But I’ll give you some food for thought for typical data center networks, in order to explain why I don’t buy *any* use of a “rate” parameter in any router algorithm.

All examples below use routers in data centers, and for none of them do I understand how a parameter “capacity” or “rate” would be defined in the router.

1/ For instance, I don’t understand how an L3 switch in a data center network using IP routing would know the physical capacity associated with a logical IP interface, say, if the L3 switch has multiple logical IP interfaces. At least the L3 switches I use in my lab can assign IP interfaces to VLANs instead of physical ports. In that case, e.g. with VLANs, there may be no 1-to-1 mapping between the logical IP interface and a physical port. A physical port has a given capacity (say, 100 Gbps Ethernet), but that does not imply anything for IP forwarding/routing, which is controlled by the routing table. For instance, the forwarding/routing may use statistical multiplexing across VLANs. From an IP forwarding/routing perspective, several IP interfaces may then share the physical capacity (and possibly also the buffer). Say, if there are 10 VLANs on the 100 Gbps link, which capacity would be associated with one of the logical IP interfaces? If the answer was 10 Gbps, it would be an underestimation by a factor of 10 if there is no other traffic on the other VLANs.
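The 10-VLAN arithmetic above can be written out as a toy calculation (all numbers are hypothetical; no real switch exposes an interface like this):

```python
# Toy illustration of the VLAN capacity-sharing problem described above.
# Hypothetical numbers only -- this models the arithmetic, not a real switch.

PHYSICAL_PORT_GBPS = 100   # one 100 Gbps Ethernet port
NUM_VLANS = 10             # 10 logical IP interfaces (VLANs) on that port

# Naive static assignment: split the port capacity evenly per VLAN.
naive_per_vlan_gbps = PHYSICAL_PORT_GBPS / NUM_VLANS        # 10 Gbps

# If the other 9 VLANs are idle, the capacity actually available to one
# VLAN is the full port rate, so the static figure is off by 10x.
actual_available_gbps = PHYSICAL_PORT_GBPS                  # 100 Gbps
underestimation_factor = actual_available_gbps / naive_per_vlan_gbps
```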

2/ And note that data center networks can be more complex, e.g., when they get interconnected. In a data center interconnect setting, I really don’t know how a router would know the physical capacity on router-to-router interconnects. As one example, please consider a router that is connected by a 100 Gbps interface to an optical OTN switch. The OTN switch could perhaps use 10x 10 Gbps connections inside the optical transport network with some sort of link bundling (modern optical OTN switches have such capabilities, e.g. using LAGs). Now, let’s assume a fiber failure affects 9 of these 10 connections. The available bandwidth on the router interconnect then drops to 10 Gbps (i.e., 10%), while the router still uses a physical 100 Gbps interface towards the optical switch. It is not clear to me how the router would know that the capacity has dropped to 10% other than by throughput measurements (unless one assumes technologies such as the GMPLS UNI).

3/ I also don’t understand how the router would know any “physical capacity” if a data center interconnect is realized, e.g., by multipoint-to-multipoint L2VPNs (e.g., VPLS). In that case, the available capacity in the L2VPN is defined by the SLAs, i.e., commercial contracts. It may be difficult for a router to know the terms of a commercial contract… Also, note that many point-to-multipoint VPNs specify in their SLAs a total capacity into the VPN, i.e., the sum of capacity to multiple routers in the VPN, but no dedicated capacity for any router-to-router link (the “hose model”). The same applies to L3VPNs.

4/ Finally, I think there is an overall industry trend to use overlay networks in data centers (and beyond) with all sorts of tunnels. So, any solution proposed for such settings will have to be able to cope with those tunnels. That might imply that a cross-layer information exchange is needed between the underlay and the overlay… That is already non-trivial for the ECN bits, and for any more complex information exchange a lot of engineering problems may emerge. And note that tunnels can be encrypted.

As far as I can tell, in research it is easy to set up a simulation model with some links between routers having some well-defined capacity. One can also do this with simple hardware setups. Well, I have done such theoretical “toy” setups in my thesis as well. BUT, as far as I can tell, the reality in engineering can be much more complex. BTW, in my thesis I already ran into a lot of challenges when I tried to add Quick-Start support to the IP forwarding/routing in the Linux kernel… Getting a “rate” parameter inside the Linux IP forwarding code was extremely difficult and only doable for trivial network setups. That is the difference between research and engineering…

Unless all these issues (and there are many more) can be solved, IMHO any algorithm inside a router must be able to work without any a priori knowledge of any “capacity” towards the next hop in a routing table, because there is no notion of a “physical capacity” in many scenarios, and in many other scenarios the information for a next-hop routing table entry could be arbitrarily wrong. A router can of course measure the available bandwidth towards the next hop in the routing table by statistics, but that adds a lot of complexity in the router, and it would be a heuristic like any end-to-end congestion control.
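A measurement-based heuristic of the kind mentioned above could be sketched as follows: an EWMA over interface byte counters. The class name, polling interval, and smoothing factor are my assumptions for illustration, not anything from this thread or any router vendor:

```python
# Hedged sketch: estimate next-hop throughput from interface byte counters
# with an exponentially weighted moving average (EWMA). The smoothing factor
# and polling interval are arbitrary assumptions.

class ThroughputEstimator:
    def __init__(self, alpha=0.2):
        self.alpha = alpha        # EWMA smoothing factor (assumed value)
        self.rate_bps = 0.0       # smoothed throughput estimate, bits/s
        self.last_bytes = None    # previous counter reading

    def sample(self, byte_counter, interval_s):
        """Feed the cumulative interface byte counter once per interval."""
        if self.last_bytes is not None:
            inst_bps = (byte_counter - self.last_bytes) * 8 / interval_s
            self.rate_bps += self.alpha * (inst_bps - self.rate_bps)
        self.last_bytes = byte_counter
        return self.rate_bps
```

Note that such an estimator only sees the load the link currently carries, converges slowly, and says nothing about headroom, which is exactly the limitation argued above.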

BTW, if there is an assumption that any IP interface in a router has a queue, this could also result in a lot of engineering questions, as router hardware may not be implemented that way. As far as I know, the buffer management in routers can be very complex, e.g., if H-QoS is used. But this is about router hardware design and I am not an expert for that. https://datatracker.ietf.org/meeting/104/materials/slides-104-wgtlgo-forwarding-plane-realities provides some interesting insights into the realization of modern routers.

So, in a nutshell, IMHO an algorithm for routers cannot and must not use any configured “rate” parameter. As far as I understand the HPCC++ algorithm so far, the parameter “txRate” must therefore be removed from the algorithm, unless the mentioned challenges are all solved (which may be doable – I just don’t know how). Note that this is not about the algorithm itself, but about how to obtain the required input parameters.

The situation may be different for link-layer switches, in particular if one assumes a greenfield (“clean-slate”), homogeneous link layer network infrastructure. But, as far as I can tell, there may be significant differences between a switch and a router that uses full IP forwarding/routing.

RFC 6077, as well as my thesis, tries to address the challenges in routers. If you want to propose a network-supported congestion control scheme that shall work on routers, you’ll have to detail how such a router can be engineered. I may be wrong, but I doubt that this is a low-hanging fruit given the functionality of a modern router.

Michael


From: Rui, Miao <miao.rui@alibaba-inc.com>
Sent: Wednesday, November 11, 2020 12:37 AM
To: Scharf, Michael <Michael.Scharf@hs-esslingen.de>; Richard Li <richard.li@futurewei.com>; tsvwg <tsvwg@ietf.org>
Cc: Barak Gafni <gbarak@mellanox.com>; tsvwg-chairs <tsvwg-chairs@ietf.org>; Liu, Hongqiang(洪强) <hongqiang.liu@alibaba-inc.com>; Pan, Rong <rong.pan@intel.com>; Lee, Jeongkeun <jk.lee@intel.com>
Subject: Re: [tsvwg] Re: Takeaways for HPCC++

Thanks, Michael. I have read RFC 6077 and your thesis. So much fun to read them.

Yes, you are right. HPCC++ belongs to the category that relies on network support. Instead of having a single-bit indication like ECN, HPCC++ leverages the newly-invented inband telemetry to get link load information for precise rate updates. So it can be viewed as an enhanced/ultimate version of ECN. However, we view HPCC++ as still a primary-mode algorithm because it performs rate updates in each end-system in a distributed manner. We have also shown through extensive theoretical proofs, simulations, and testbed experiments that HPCC++ is efficient and robust in various settings. Details can be found at

 https://dl.acm.org/doi/10.1145/3341302.3342085

One interesting finding from reading the Quick-Start approach is that quickly opening up the congestion window is essential in today's high-speed networks. However, we found our applications intensively use persistent TCP connections to skip the handshaking delay, and often have delays in data transmission after the connection is established (e.g., waiting for a control signal or for other connections). So we turn to congestion control itself to open up the window. Using inband telemetry, HPCC++ knows precisely how much it can increase, so it can actually use MIMD to obtain the available bandwidth in just one round-trip time. (But it still uses AIMD for fairness convergence after iterations.)
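As background for readers, the one-RTT MIMD step described above has roughly the following shape in the published HPCC paper (SIGCOMM 2019, linked above). This is a simplified sketch in the paper's notation (U is the measured normalized inflight, eta the target utilization, W_ai the small additive increment), not normative draft text:

```python
# Simplified sketch of an HPCC-style window update, after the SIGCOMM 2019
# paper. Not normative; variable names follow the paper's notation.

def update_window(W, U, eta=0.95, W_ai=0.0):
    """MIMD step: scale the window by the ratio of target utilization (eta)
    to measured normalized inflight (U), plus a small additive term W_ai
    that drives AIMD-style fairness convergence."""
    return W / (U / eta) + W_ai
```

For example, if telemetry reports the bottleneck at twice the target utilization (U = 2 * eta), the window halves in a single update; if U equals eta, the window only grows by W_ai.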

We have also compared with XCP/RCP in Section 3.3 of our paper. The main takeaway is that those congestion signals are heuristic combinations of different types, so they have scaling parameters to tune their relative significance. In contrast, the key insight of HPCC++ is to limit the "inflight bytes", which is essential to resolve congestion. HPCC++ uses inband telemetry with concrete physical meaning to quantify inflight bytes. So you may find that HPCC++ rarely uses parameters (HPCC++ has only three configuration parameters, and none of them is reliability-critical). Another issue is that the rxRate used in XCP/RCP has no concrete physical meaning, and the rxRate and qlen used in XCP/RCP overlap with each other. We show in an experiment (Figure 6) that using txRate in HPCC++ is more robust than rxRate.
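For concreteness, the per-link inflight-bytes signal referred to above is computed from telemetry fields roughly as follows. This is a sketch following the paper's formulation; B (link capacity) and T (base RTT) are assumed, configured inputs, and B is exactly the kind of capacity parameter questioned elsewhere in this thread:

```python
# Sketch of the per-link normalized inflight estimate from inband telemetry,
# following the HPCC paper's formulation. B (link capacity in bits/s) and
# T (base RTT in seconds) are assumed, configured inputs.

def normalized_inflight(qlen_bytes, tx_rate_bps, B_bps, T_s):
    inflight = qlen_bytes + tx_rate_bps / 8 * T_s   # queued + in-transit bytes
    bdp = B_bps / 8 * T_s                           # bytes at full link rate
    return inflight / bdp                           # U_j: > 1 means congested
```

With an empty queue and the link sending at exactly its capacity, the result is 1.0; any standing queue pushes it above 1.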

I do not quite get, from reading your thesis, the part where you mention that the router cannot determine the correct capacity. Here are my preliminary thoughts; please feel free to correct me if I am wrong:
- HPCC++ assumes each router knows its physical link capacity, e.g., a 100GE link. Are you suggesting the available bandwidth for HPCC++/XCP/RCP varies dynamically due to administrative reasons or queueing management? Can you please elaborate on the reasons in realistic settings? How could there be a 10x over- or under-estimation?
- HPCC++ is devised assuming the link capacity is given, not estimated. Even if we target a situation with bandwidth variation, my takeaway is that HPCC++ should be modified to be adaptive, but the design philosophy is still to precisely estimate the available inflight bytes for rate updates, without any sort of heuristics with parameter tuning.
- Yes, the congestion feedback can be delayed during congestion. But the end-system has the window to limit the total inflight bytes, so the congestion situation cannot get worse.
- Yes, there is one RTT of delay in getting congestion feedback. In control theory terms, the utilization can still converge very quickly (we show in theory and experiments that HPCC++ can converge in only one round-trip time).

It is great to talk with you about those algorithm details. I am happy to hear your thoughts.

Thanks,
Rui
------------------------------------------------------------------
From: Scharf, Michael <Michael.Scharf@hs-esslingen.de>
Sent: Monday, November 2, 2020, 04:53
To: Richard Li <richard.li@futurewei.com>; "Miao, Rui(缪睿)" <miao.rui@alibaba-inc.com>; tsvwg <tsvwg@ietf.org>
Cc: Barak Gafni <gbarak@mellanox.com>; tsvwg-chairs <tsvwg-chairs@ietf.org>
Subject: RE: [tsvwg] Re: Takeaways for HPCC++

RFC 6077 Section 3.1 summarizes challenges, known at the time of publication, for all congestion control schemes that rely on network support.

In addition to the general issues, RFC 6077 Section 3.1 also includes some specific literature pointers to known issues for XCP. As far as I recall, some of the issues in XCP were fairly fundamental. The corresponding papers are referenced in the RFC.

Before publication of RFC 6077, I had myself worked with XCP simulations and XCP hardware implementations. In my PhD thesis I compared, amongst others, XCP with a simpler solution developed in the IETF (Quick-Start). My personal findings can be found in Sections 4, 5 and 6 of my PhD thesis:

http://content.ikr.uni-stuttgart.de/Content/Publications/Archive/Sf_Diss_40112.pdf

Specifically, I ran into issues when a router supporting XCP could not correctly determine the capacity of a link (see Section 6.5.3.3 in the thesis).

RFC 6077 explains why it is not trivial to actually know the link capacity in reality (i.e., outside simulations). Examples of such cases include virtualization, tunneling, brownfield deployments with legacy devices, etc.

I don’t understand HPCC++ well enough to know how this problem is solved in actual deployments. If you want to repeat the experiment I did in my PhD thesis for XCP (cf. Section 6.5.3.3), you would have to run, e.g., experiments in which the actual link capacity is either a factor of 10 larger or a factor of 10 smaller than the capacity assumed in the algorithm (I guess it is the variable B_j in the draft). If a congestion control algorithm is to be robust, IMHO it would have to be able to deal with such situations. According to my own results, XCP did not work well in such cases of capacity over- or under-estimation.

BTW, the I-D "draft-irtf-panrg-what-not-to-do" also discusses related lessons learnt.

Michael


From: Richard Li <richard.li@futurewei.com>
Sent: Monday, November 2, 2020 1:44 AM
To: Scharf, Michael <Michael.Scharf@hs-esslingen.de>; Rui, Miao <miao.rui@alibaba-inc.com>; tsvwg <tsvwg@ietf.org>
Cc: Barak Gafni <gbarak@mellanox.com>; tsvwg-chairs <tsvwg-chairs@ietf.org>
Subject: RE: [tsvwg] Re: Takeaways for HPCC++

> it has specifically taken into account lessons learnt from protocols such as XCP

Would you mind sharing what lessons exactly you learnt from XCP?

Thanks,

Richard




From: tsvwg <tsvwg-bounces@ietf.org> On Behalf Of Scharf, Michael
Sent: Wednesday, October 28, 2020 10:48 AM
To: Rui, Miao <miao.rui@alibaba-inc.com>; tsvwg <tsvwg@ietf.org>
Cc: Barak Gafni <gbarak@mellanox.com>; tsvwg-chairs <tsvwg-chairs@ietf.org>
Subject: Re: [tsvwg] Re: Takeaways for HPCC++

Hi all,
An editorial comment: ICCRG published RFC 6077 about ten years ago. Obviously, the document may be a bit dated, but it has specifically taken into account lessons learnt from protocols such as XCP or Quick-Start. Not all content of RFC 6077 may apply to current data centers. Nonetheless, some of the content in RFC 6077 is quite generic, and it could be useful to better explain in a description of HPCC++ how this architecture addresses the challenges.
Best regards
Michael (as co-author of RFC 6077)

From: tsvwg <tsvwg-bounces@ietf.org> On Behalf Of Rui, Miao
Sent: Friday, September 11, 2020 10:53 AM
To: Miao, Rui(缪睿) <miao.rui@alibaba-inc.com>; tsvwg <tsvwg@ietf.org>
Cc: Barak Gafni <gbarak@mellanox.com>; tsvwg-chairs <tsvwg-chairs@ietf.org>
Subject: [tsvwg] Re: Takeaways for HPCC++

Dear WG members,

At IETF 108, we presented HPCC++, a new architecture for congestion control designed for high-speed networks. We have modified the draft according to the feedback from the meeting. The newer version is available at https://tools.ietf.org/html/draft-pan-tsvwg-hpccplus-02

Thanks, and please feel free to let us know if you have more comments.

Thanks,
Rui

------------------------------------------------------------------
From: Miao, Rui(缪睿) <miao.rui@alibaba-inc.com>
Sent: Sunday, August 2, 2020, 11:21
To: tsvwg <tsvwg@ietf.org>
Cc: tsvwg-chairs <tsvwg-chairs@ietf.org>; Wesley Eddy <wes@mti-systems.com>; "Pan, Rong" <rong.pan@intel.com>; Barak Gafni <gbarak@mellanox.com>
Subject: Takeaways for HPCC++


Hello TSVWG members,

It was our pleasure to present HPCC++ in the IETF-108 meeting this week. We appreciate your valuable feedback. Our current draft is located at https://tools.ietf.org/html/draft-pan-tsvwg-hpccplus-01.
Here, we have summarized a list of takeaways to further improve the draft.

1. HPCC++ is a generic congestion control scheme that works for high-speed networks. It requires a window limit and ACK self-clocking, which is outside the definition of RoCEv2 but should work well for TCP/iWARP. In addition, HPCC++ conforms to the paradigm of existing TCP congestion control, such as switch marking, receiver feedback, and sender enforcement. The major change is using inband telemetry as the congestion feedback, and it is incremental. In the next revision, we will describe clearly how it works for TCP/UDP and related protocols.

2. HPCC++ works for both data centers and the Internet. We will add descriptions and results for Internet-scale sizes and latencies in the next revision.

Thanks again for the comments. Please feel free to let us know if you have more thoughts.

All the best,
Rui, Rong, Barak