[Rdma-cc-interest] Re: Re: Side meeting plans at IETF-106

"Zhuangyan (Yan)" <zhuangyan.zhuang@huawei.com> Wed, 13 November 2019 01:03 UTC

Return-Path: <zhuangyan.zhuang@huawei.com>
X-Original-To: rdma-cc-interest@ietfa.amsl.com
Delivered-To: rdma-cc-interest@ietfa.amsl.com
Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id 226B0120091 for <rdma-cc-interest@ietfa.amsl.com>; Tue, 12 Nov 2019 17:03:21 -0800 (PST)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -4.199
X-Spam-Level:
X-Spam-Status: No, score=-4.199 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, HTML_MESSAGE=0.001, RCVD_IN_DNSWL_MED=-2.3, SPF_PASS=-0.001, URIBL_BLOCKED=0.001] autolearn=ham autolearn_force=no
Received: from mail.ietf.org ([4.31.198.44]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id WlxhBHg2cGHI for <rdma-cc-interest@ietfa.amsl.com>; Tue, 12 Nov 2019 17:03:16 -0800 (PST)
Received: from huawei.com (lhrrgout.huawei.com [185.176.76.210]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id 30AF6120019 for <rdma-cc-interest@ietf.org>; Tue, 12 Nov 2019 17:03:16 -0800 (PST)
Received: from LHREML710-CAH.china.huawei.com (unknown [172.18.7.107]) by Forcepoint Email with ESMTP id A762D8FF6E3896541B17 for <rdma-cc-interest@ietf.org>; Wed, 13 Nov 2019 01:03:13 +0000 (GMT)
Received: from nkgeml701-chm.china.huawei.com (10.98.57.156) by LHREML710-CAH.china.huawei.com (10.201.108.33) with Microsoft SMTP Server (TLS) id 14.3.408.0; Wed, 13 Nov 2019 01:03:12 +0000
Received: from nkgeml703-chm.china.huawei.com (10.98.57.159) by nkgeml701-chm.china.huawei.com (10.98.57.156) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.1713.5; Wed, 13 Nov 2019 09:03:10 +0800
Received: from nkgeml703-chm.china.huawei.com ([10.98.57.159]) by nkgeml703-chm.china.huawei.com ([10.98.57.159]) with mapi id 15.01.1713.004; Wed, 13 Nov 2019 09:03:10 +0800
From: "Zhuangyan (Yan)" <zhuangyan.zhuang@huawei.com>
To: "Black, David" <David.Black@dell.com>, Lars Eggert <lars@eggert.org>, Paul Congdon <paul.congdon@tallac.com>
CC: "rdma-cc-interest@ietf.org" <rdma-cc-interest@ietf.org>
Thread-Topic: [Rdma-cc-interest] Re: Side meeting plans at IETF-106
Thread-Index: AQHVmbrvNmjQyYwOZUCfkj0paPnSFaeIRCCC
Date: Wed, 13 Nov 2019 01:03:10 +0000
Message-ID: <eff21634c3a9470f9aa3963e0513fcb3@huawei.com>
References: <CAAMqZPu6g56PotHQJcn6vvoex3=EPomCTgrmMm8jo3ozehG-WQ@mail.gmail.com>, <1605A4E1-7C7C-4BBD-BE35-960730A678D0@eggert.org> <326e89210c104ec6856152d9a76553fb@huawei.com>, <MN2PR19MB4045F5565E6F210A0130FB8583760@MN2PR19MB4045.namprd19.prod.outlook.com>
In-Reply-To: <MN2PR19MB4045F5565E6F210A0130FB8583760@MN2PR19MB4045.namprd19.prod.outlook.com>
Accept-Language: zh-CN, en-US
Content-Language: zh-CN
X-MS-Has-Attach:
X-MS-TNEF-Correlator:
x-originating-ip: [10.220.66.99]
Content-Type: multipart/alternative; boundary="_000_eff21634c3a9470f9aa3963e0513fcb3huaweicom_"
MIME-Version: 1.0
X-CFilter-Loop: Reflected
Archived-At: <https://mailarchive.ietf.org/arch/msg/rdma-cc-interest/_jOQN5JPQJ6uWP0b3Fi2_YD02Gg>
Subject: [Rdma-cc-interest] Re: Re: Side meeting plans at IETF-106
X-BeenThere: rdma-cc-interest@ietf.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Congestion Control for Large Scale HPC/RDMA Data Centers <rdma-cc-interest.ietf.org>
List-Unsubscribe: <https://www.ietf.org/mailman/options/rdma-cc-interest>, <mailto:rdma-cc-interest-request@ietf.org?subject=unsubscribe>
List-Archive: <https://mailarchive.ietf.org/arch/browse/rdma-cc-interest/>
List-Post: <mailto:rdma-cc-interest@ietf.org>
List-Help: <mailto:rdma-cc-interest-request@ietf.org?subject=help>
List-Subscribe: <https://www.ietf.org/mailman/listinfo/rdma-cc-interest>, <mailto:rdma-cc-interest-request@ietf.org?subject=subscribe>
X-List-Received-Date: Wed, 13 Nov 2019 01:03:22 -0000

Hi David,


Thank you for the comments. Some responses as below.


Best Regards,


Yan

________________________________
From: Black, David <David.Black@dell.com>
Sent: 13 November 2019 8:39
To: Zhuangyan (Yan); Lars Eggert; Paul Congdon
Cc: rdma-cc-interest@ietf.org; Black, David
Subject: RE: [Rdma-cc-interest] Re: Side meeting plans at IETF-106


A few follow-up comments.



>>> 2.1 AI ECN
>>>

>>> Discuss feedback on https://datatracker.ietf.org/doc/draft-zhuang-tsvwg-ai-ecn-for-dcn/.

>> I don't see a proposal here, I don't even see a concrete problem statement? This is yet another "let's throw AI at it" three-pager.



[… snip …]



> The problem is stated in the background section, however it might not be so obvious… "As stated in [RFC7567], with proper parameters,

> RED can be an effective algorithm.  However, dynamically predicting the set of parameters (minimum threshold and maximum threshold)

> is difficult."  Dynamic configuration of these thresholds is somehow a problem for network configuration…



That sounds like a straw-person demolition exercise, as RFC 7567 also states:



   This memo also explicitly obsoletes the recommendation that Random
   Early Detection (RED) be used as the default AQM mechanism for the
   Internet.  This is replaced by a detailed set of recommendations for
   selecting an appropriate AQM algorithm.



The upshot is that RED is not a good AQM algorithm to use as a comparison baseline - I hope that the results to be presented will make comparisons to more recent AQM algorithms, e.g., FQ-CoDel [RFC8290].

[Y] Since this is a starting point to see how adaptive/dynamic configuration can help, we just use RED as the baseline to see whether it can be improved. Other algorithms can also be tested, but that would wait for the next meeting cycle... the results to be presented are RED-only...
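For reference, the RED marking behavior whose thresholds are under discussion can be sketched as follows. This is a minimal sketch of the classic algorithm; the function names and all parameter values are illustrative assumptions of mine, not taken from the draft:

```python
import random

def red_mark_probability(avg_queue, min_th, max_th, max_p=0.1):
    """Classic RED: marking/drop probability as a function of the
    averaged queue depth and the two tunable thresholds."""
    if avg_queue < min_th:
        return 0.0          # below min_th: never mark
    if avg_queue >= max_th:
        return 1.0          # at or above max_th: always mark
    # Between the thresholds, probability rises linearly up to max_p.
    return max_p * (avg_queue - min_th) / (max_th - min_th)

def should_mark(avg_queue, min_th=5, max_th=15):
    """One random marking decision. The min_th/max_th defaults here are
    arbitrary illustrative values -- choosing them well for changing
    traffic is exactly the hard part the draft is pointing at."""
    return random.random() < red_mark_probability(avg_queue, min_th, max_th)
```

The point of the draft, as I read the thread, is that good values for min_th and max_th depend on traffic patterns that shift over time, which is what the adaptive configuration targets.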



>>> 2.3 Mixing RDMA and TCP traffic
>>>
>>> These two traffic types with their differing congestion controllers are known to not play well with one another in the same traffic class.



Using protocol names to denote congestion control classes does not work well, even though it's common (and I've done it myself).



We are dealing with two classes of congestion controls.  For lack of better terms, the following class names are based on what the transport protocol throughput is proportional to, where 'p' is the loss and/or congestion marking probability:

               - 1/sqrt(p)-class congestion controls: Includes most existing TCP congestion control algorithms, e.g., NewReno, CUBIC.

               - 1/p-class congestion controls: Includes DCTCP congestion control.

Keep in mind that p is a probability that is usually << 1 when expressed as a decimal, e.g., p=0.01 represents a 1% loss/marking rate.
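As a rough numerical illustration of the two classes above (constant factors omitted, function names my own, not from the thread), the scaling relationship at a given p can be compared directly:

```python
import math

def throughput_1_over_sqrt_p(p):
    """Reno/CUBIC-like class: throughput proportional to 1/sqrt(p)."""
    return 1.0 / math.sqrt(p)

def throughput_1_over_p(p):
    """DCTCP/DCQCN-like class: throughput proportional to 1/p."""
    return 1.0 / p

# At p = 0.01 (a 1% loss/marking rate), the 1/p class claims 10x the
# relative throughput of the 1/sqrt(p) class at the same marking rate,
# which is why mixing the two in one queue goes badly for the
# 1/sqrt(p) flows.
ratio = throughput_1_over_p(0.01) / throughput_1_over_sqrt_p(0.01)
```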

>> When you say RDMA, you mean RoCE? Separate RoCE into a slice and move on. It's pointless to try and optimize for coexistence with a protocol that can change willy-nilly.

> [Y] yes, it means RoCE. If the network does not differentiate RoCE and TCP traffic, then they would compete anyhow… L4S might be doing similar work on classic TCP vs. DCTCP coexistence.



DCTCP would be a better protocol to focus on than RoCE, as both DCTCP congestion control and the DCQCN congestion control commonly used for RoCE are 1/p-class congestion controls.

[Y] sure, we can look at how DCTCP works, which might also apply to DCQCN.



TSVWG will be discussing L4S and SCE next week - both of those proposals are intended to enable coexistence of 1/sqrt(p)-class and 1/p-class congestion controls, among other goals.



Thanks, --David



From: Rdma-cc-interest <rdma-cc-interest-bounces@ietf.org> On Behalf Of Zhuangyan (Yan)
Sent: Tuesday, November 12, 2019 6:28 PM
To: Lars Eggert; Paul Congdon
Cc: rdma-cc-interest@ietf.org
Subject: [Rdma-cc-interest] Re: Side meeting plans at IETF-106



[EXTERNAL EMAIL]

Hi Lars,



Thank you for the review and detailed comments. Some responses inline at [Y].



Best Regards,



Yan

________________________________

From: Rdma-cc-interest <rdma-cc-interest-bounces@ietf.org<mailto:rdma-cc-interest-bounces@ietf.org>> on behalf of Lars Eggert <lars@eggert.org<mailto:lars@eggert.org>>
Sent: 12 November 2019 19:35
To: Paul Congdon
Cc: rdma-cc-interest@ietf.org<mailto:rdma-cc-interest@ietf.org>
Subject: Re: [Rdma-cc-interest] Side meeting plans at IETF-106



Hi,

On 2019-11-11, at 21:11, Paul Congdon <paul.congdon@tallac.com<mailto:paul.congdon@tallac.com>> wrote:
> Tuesday, November 19
> 8:30AM - 9:45AM
> Room: VIP A

I'll try and make the meeting, modulo NomCom duties. (Please all, send your feedback about candidates to NomCom now!)

Because I am not fully sure I can make this, some written feedback on the various agenda items:

> 1. How NICs can be designed for better CC in the HPC/RDMA/AI DCN
>
> Discuss feedback on a draft under development on OpenCC: https://datatracker.ietf.org/doc/draft-zhh-tsvwg-open-architecture/; a framework for flexible establishment of congestion control algorithms implemented by NICs and the network.  The expectation is there will be some experiment results.  The goal is to discuss the ideas with stakeholders (customers, NIC vendors, switch vendors) and explore what could/should be standardized.

There seem to be three things here:

A. A modular NIC offload interface. Unclear if the IETF is the right home for this? Also, see "Restructuring Endpoint Congestion Control" (Narayan et al.) for another direction.
[Y] yes, papers from SIGCOMM (and related conferences) have discussed decoupling congestion control on NICs a bit, like CCP, AC/DC TCP, et al. We think a modular CC design might be good for a discussion in the industry : ). However, we might be more focused on the interfaces between layers, while the implementation of the modules would be open to vendors. It could be discussed further.

B. Mixing and matching different CCs in one network over time. Given that at datacenter latencies, you really want to prevent even small-scale hiccups due to interactions between different CCs, I wonder if it would be sufficient to slice the NIC and have different slices where all traffic is handled by one CC? Seems more tractable.

[Y] Slicing the NIC, rather than running different CC algorithms, would also be an option. We also plan experiments with different configurations of one CC for different traffic types, which might appear in a later version. Based on the results in this version, two CC algorithms perform better than one CC for two types of traffic. It might be due to different feedback/reactions in the network.

C. The architectural idea to move away from in-band CC signaling from the network to the endpoints. There isn't much in the document that motivates this, and nothing about potential issues (e.g., loss of fate-sharing).
[Y] The architecture is intended to support both signaling directly from the network (rather than from receivers -- is that what you mean by in-band CC signaling?) and signaling from receivers (current practice). Each CC chooses its own way of signaling, and of course different CCs would affect each other somehow if they are used together while their signaling differs. More details will be discussed in a later version, and input on issues that we missed is greatly welcome.

A comment on this document that is really about this entire effort: we should just give up on RoCE. Mellanox has no interest in opening it, and I am therefore unwilling to spend cycles thinking about it.
[Y] :) Actually, the architecture is not bound to RoCE. The thought is to support several transport protocols, including TCP. Current experiment results for different CCs are based on TCP (Reno, CUBIC, BBR, DCTCP); however, we don't want to exclude RoCE either at this point :( ...
Besides, Mellanox provides really good SmartNIC designs, and very recently they also announced programmable congestion control on their NICs. At this point, it might be good timing to discuss this with the broader industry…

> 2. How does the network participate in CC for HPC/RDMA/AI DCN?
>
> There are a few items for discussion.
>
> 2.1 AI ECN
>
> Discuss feedback on https://datatracker.ietf.org/doc/draft-zhuang-tsvwg-ai-ecn-for-dcn/.  The idea is to use AI for adaptive configuration of the network - a hard problem.  How is necessary information collected from the devices to form models and what could/should be standardized here as well?

I don't see a proposal here, I don't even see a concrete problem statement? This is yet another "let's throw AI at it" three-pager.
[Y] "AI" might not be a precise technical term nowadays, with AI everywhere… However, we do provide a scene-based ECN reconfiguration to adapt to traffic changes in data center networks, in which scene training and dynamic scene inference are where AI technologies are applied… And we don't want to limit it to any specific AI technique, like deep forests et al...

We will share some testing results in the side meeting... to show it is not just another AI paper... hope it helps.
The problem is stated in the background section, however it might not be so obvious… "As stated in [RFC7567], with proper parameters, RED can be an effective algorithm.  However, dynamically predicting the set of parameters (minimum threshold and maximum threshold) is difficult."  Dynamic configuration of these thresholds is somehow a problem for network configuration…

> 2.2 Network Fast Feedback
>
> Discuss follow-on feedback on https://tools.ietf.org/html/draft-even-iccrg-dc-fast-congestion-00, which is expected to be introduced in ICCRG on Monday.  The draft discusses the state-of-the-art congestion controllers in use and from research, and poses a number of questions for discussion. What is to be researched, and what could/should be standardized going forward?

This is the beginnings of a survey. It misses a ton of related work esp. from academia though. HOMA, pFabric, HULL, D3, PDQ, pHost, NDP, etc., etc.

> 2.3 Mixing RDMA and TCP traffic
>
> These two traffic types with their differing congestion controllers are known to not play well with one another in the same traffic class.  There may be some analysis data to share on this topic.  A goal would be to discuss network approaches for mitigating the impact of the two on each other.

When you say RDMA, you mean RoCE? Separate RoCE into a slice and move on. It's pointless to try and optimize for coexistence with a protocol that can change willy-nilly.

[Y] yes, it means RoCE. If the network does not differentiate RoCE and TCP traffic, then they would compete anyhow… L4S might be doing similar work on classic TCP vs. DCTCP coexistence.

> 3. Metrics for HPC/RDMA/AI networks
>
> Are the current metrics and scales appropriate for HPC/RDMA/AI networks?  HPC and storage networks tend to use IOPS as a key measure, and the latency requirements can be on the order of 10us; much different than Internet latency and throughput measures.  Should there be a draft on metric requirements for DCNs?  Can we work with real customers to define some well-known scenarios and metrics for HPC/RDMA/AI DCNs?

Whose "current metrics and scales"? Papers on DC mechanisms certainly define appropriate metrics. Obviously Internet scales don't work, but who is using those?
[Y] To my understanding, current metrics in the IETF mostly concern the Internet. The question is whether it is worth discussing and getting consensus on some common metrics for applications in DCNs, especially HPC. It is an open discussion to seek feedback :)

Lars
--
Rdma-cc-interest mailing list
Rdma-cc-interest@ietf.org<mailto:Rdma-cc-interest@ietf.org>
https://www.ietf.org/mailman/listinfo/rdma-cc-interest