[Rdma-cc-interest] Re: Side meeting plans at IETF-106

"Zhuangyan (Yan)" <zhuangyan.zhuang@huawei.com> Tue, 12 November 2019 23:28 UTC

Return-Path: <zhuangyan.zhuang@huawei.com>
X-Original-To: rdma-cc-interest@ietfa.amsl.com
Delivered-To: rdma-cc-interest@ietfa.amsl.com
Received: from localhost (localhost []) by ietfa.amsl.com (Postfix) with ESMTP id 7093B1200B7 for <rdma-cc-interest@ietfa.amsl.com>; Tue, 12 Nov 2019 15:28:10 -0800 (PST)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -4.199
X-Spam-Status: No, score=-4.199 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, HTML_MESSAGE=0.001, RCVD_IN_DNSWL_MED=-2.3, SPF_PASS=-0.001, URIBL_BLOCKED=0.001] autolearn=ham autolearn_force=no
Received: from mail.ietf.org ([]) by localhost (ietfa.amsl.com []) (amavisd-new, port 10024) with ESMTP id bMWhpg4bt6qL for <rdma-cc-interest@ietfa.amsl.com>; Tue, 12 Nov 2019 15:28:08 -0800 (PST)
Received: from huawei.com (lhrrgout.huawei.com []) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id DF126120045 for <rdma-cc-interest@ietf.org>; Tue, 12 Nov 2019 15:28:07 -0800 (PST)
Received: from lhreml706-cah.china.huawei.com (unknown []) by Forcepoint Email with ESMTP id F1400A769B0E710C5D0A for <rdma-cc-interest@ietf.org>; Tue, 12 Nov 2019 23:28:04 +0000 (GMT)
Received: from nkgeml704-chm.china.huawei.com ( by lhreml706-cah.china.huawei.com ( with Microsoft SMTP Server (TLS) id 14.3.408.0; Tue, 12 Nov 2019 23:28:04 +0000
Received: from nkgeml703-chm.china.huawei.com ( by nkgeml704-chm.china.huawei.com ( with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.1713.5; Wed, 13 Nov 2019 07:28:01 +0800
Received: from nkgeml703-chm.china.huawei.com ([]) by nkgeml703-chm.china.huawei.com ([]) with mapi id 15.01.1713.004; Wed, 13 Nov 2019 07:28:01 +0800
From: "Zhuangyan (Yan)" <zhuangyan.zhuang@huawei.com>
To: Lars Eggert <lars@eggert.org>, Paul Congdon <paul.congdon@tallac.com>
CC: "rdma-cc-interest@ietf.org" <rdma-cc-interest@ietf.org>
Thread-Topic: [Rdma-cc-interest] Side meeting plans at IETF-106
Thread-Index: AQHVmMPox+RqCMcdT0qtVbw4XBqpgKeG4uIAgAE3GZU=
Date: Tue, 12 Nov 2019 23:28:01 +0000
Message-ID: <326e89210c104ec6856152d9a76553fb@huawei.com>
References: <CAAMqZPu6g56PotHQJcn6vvoex3=EPomCTgrmMm8jo3ozehG-WQ@mail.gmail.com>, <1605A4E1-7C7C-4BBD-BE35-960730A678D0@eggert.org>
In-Reply-To: <1605A4E1-7C7C-4BBD-BE35-960730A678D0@eggert.org>
Accept-Language: zh-CN, en-US
Content-Language: zh-CN
x-originating-ip: []
Content-Type: multipart/alternative; boundary="_000_326e89210c104ec6856152d9a76553fbhuaweicom_"
MIME-Version: 1.0
X-CFilter-Loop: Reflected
Archived-At: <https://mailarchive.ietf.org/arch/msg/rdma-cc-interest/EleqG3WZ-pkdpNP-FETIOBainBE>
Subject: [Rdma-cc-interest] Re: Side meeting plans at IETF-106
X-BeenThere: rdma-cc-interest@ietf.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Congestion Control for Large Scale HPC/RDMA Data Centers <rdma-cc-interest.ietf.org>
List-Unsubscribe: <https://www.ietf.org/mailman/options/rdma-cc-interest>, <mailto:rdma-cc-interest-request@ietf.org?subject=unsubscribe>
List-Archive: <https://mailarchive.ietf.org/arch/browse/rdma-cc-interest/>
List-Post: <mailto:rdma-cc-interest@ietf.org>
List-Help: <mailto:rdma-cc-interest-request@ietf.org?subject=help>
List-Subscribe: <https://www.ietf.org/mailman/listinfo/rdma-cc-interest>, <mailto:rdma-cc-interest-request@ietf.org?subject=subscribe>
X-List-Received-Date: Tue, 12 Nov 2019 23:28:10 -0000

Hi Lars,

Thank you for the review and detailed comments. Responses are inline, marked [Y].

Best Regards,


From: Rdma-cc-interest <rdma-cc-interest-bounces@ietf.org> on behalf of Lars Eggert <lars@eggert.org>
Sent: Tuesday, November 12, 2019 19:35
To: Paul Congdon
Cc: rdma-cc-interest@ietf.org
Subject: Re: [Rdma-cc-interest] Side meeting plans at IETF-106


On 2019-11-11, at 21:11, Paul Congdon <paul.congdon@tallac.com> wrote:
> Tuesday, November 19
> 8:30AM - 9:45AM
> Room: VIP A

I'll try and make the meeting, modulo NomCom duties. (Please all, send your feedback about candidates to NomCom now!)

Because I am not fully sure I can make this, some written feedback on the various agenda items:

> 1. How NICs can be designed for better CC in the HPC/RDMA/AI DCN
> Discuss feedback on a draft under development on OpenCC: https://datatracker.ietf.org/doc/draft-zhh-tsvwg-open-architecture/; a framework for flexible establishment of congestion control algorithms implemented by NICs and the network.  The expectation is there will be some experiment results.  The goal is to discuss the ideas with stakeholders (customers, NIC vendors, switch vendors) and explore what could/should be standardized.

There seem to be three things here:

A. A modular NIC offload interface. Unclear if the IETF is the right home for this? Also, see "Restructuring Endpoint Congestion Control" (Narayan et al.) for another direction.
[Y] Yes, papers from SIGCOMM and related conferences (e.g., CCP, AD/DC TCP) have discussed decoupling congestion control from the NIC to some extent, and we think a modular CC design would be a good discussion for the industry : ). However, we would focus more on the interfaces between layers, leaving the implementation of the modules open to vendors. It could be discussed further.
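To make the "interfaces between layers, implementation open to vendors" idea concrete, here is a minimal sketch of what a pluggable CC module interface might look like. All names here (the hook names, the `SimpleAimd` example) are hypothetical illustrations, not the draft's actual interface; the point is only that the NIC/network layer calls fixed hooks while the algorithm behind them is vendor-defined.

```python
from abc import ABC, abstractmethod

class CongestionControl(ABC):
    """Hypothetical pluggable CC module; the NIC datapath invokes these hooks."""

    @abstractmethod
    def on_ack(self, acked_bytes: int, rtt_us: float) -> None:
        """Called when acknowledgements arrive."""

    @abstractmethod
    def on_congestion_signal(self, signal: str) -> None:
        """Called on a congestion signal (ECN mark, loss, network telemetry)."""

    @abstractmethod
    def sending_rate(self) -> float:
        """Current rate (Mbps) the NIC should pace this flow at."""

class SimpleAimd(CongestionControl):
    """Toy AIMD module behind the interface: additive increase per ACK,
    multiplicative decrease on any congestion signal."""

    def __init__(self, rate_mbps: float = 10.0, step_mbps: float = 1.0):
        self.rate = rate_mbps
        self.step = step_mbps

    def on_ack(self, acked_bytes, rtt_us):
        self.rate += self.step            # additive increase

    def on_congestion_signal(self, signal):
        self.rate = max(self.rate / 2.0, 1.0)  # multiplicative decrease, floored

    def sending_rate(self):
        return self.rate
```

A vendor could ship any algorithm behind the same three hooks; only the hook signatures (and the set of congestion signals) would need standardizing, which matches the Linux kernel's `tcp_congestion_ops` approach of a fixed callback table with interchangeable algorithms behind it.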

B. Mixing and matching different CCs in one network over time. Given that at datacenter latencies, you really want to prevent even small-scale hiccups due to interactions between different CCs, I wonder if it would be sufficient to slice the NIC and have different slices where all traffic is handled by one CC? Seems more tractable.
[Y] Slicing the NIC, rather than mixing different CC algorithms, would also be an option. We also plan experiments with different configurations of a single CC for different traffic types, which may appear in a later version. Based on the results in this version, two CC algorithms perform better than one CC for two types of traffic, possibly because they receive different feedback from, and react differently to, the network.

C. The architectural idea to move away from in-band CC signaling from the network to the endpoints. There isn't much in the document that motivates this, and nothing about potential issues (e.g., loss of fate-sharing).
[Y] The architecture is intended to support both signaling directly from the network (rather than from receivers; is that what you mean by in-band CC signaling?) and signaling from receivers (the current practice). Each CC chooses its own form of signaling, and different CCs with different signaling will certainly affect each other somehow when used together. More details will be discussed in a later version; inputs on issues we have missed are greatly welcome.

A comment on this document that is really about this entire effort: we should just give up on RoCE. Mellanox has no interest in opening it, and I am therefore unwilling to spend cycles thinking about it.
[Y] :) Actually, the architecture is not bound to RoCE. The intent is to support several transport protocols, including TCP. The current experiment results for the different CCs are based on TCP (Reno, CUBIC, BBR, DCTCP); however, we don't want to exclude RoCE either at this point :( ...
Besides, Mellanox provides a really good SmartNIC design, and very recently they announced programmable congestion control on their NICs. This might be a good time to discuss this with the broader industry...

> 2. How does the network participate in CC for HPC/RDMA/AI DCN?
> There are a few items for discussion.
> 2.1 AI ECN
> Discuss feedback on https://datatracker.ietf.org/doc/draft-zhuang-tsvwg-ai-ecn-for-dcn/.  The idea is to use AI for adaptive configuration of the network - a hard problem.  How is necessary information collected from the devices to form models and what could/should be standardized here as well?

I don't see a proposal here; I don't even see a concrete problem statement. This is yet another "let's throw AI at it" three-pager.
[Y] "AI" may not be precise technical wording nowadays, given that AI is everywhere... However, we do provide a scene-based ECN reconfiguration to adapt to changing traffic in data center networks; scene training and dynamic scene inference are where the AI technologies are applied. We also don't want to limit it to any specific AI technique, such as deep forest, etc.
We will share some testing results in the side meeting to show it is not just another AI paper; we hope that helps.
The problem is stated in the background section, though it may not be obvious: "As stated in [RFC7567], with proper parameters, RED can be an effective algorithm. However, dynamically predicting the set of parameters (minimum threshold and maximum threshold) is difficult." Dynamic configuration of thresholds is a real problem for network configuration.
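The RED/ECN threshold problem quoted from [RFC7567] can be illustrated with a toy control loop. This is not the draft's scene-based inference (which would replace the fixed-gain rule below with trained models); it is only a sketch of the underlying idea, with hypothetical parameter names, that the marking threshold should move as observed traffic changes rather than stay statically configured.

```python
def adapt_ecn_threshold(threshold_kb: float,
                        avg_queue_kb: float,
                        target_queue_kb: float,
                        gain: float = 0.1,
                        lo_kb: float = 10.0,
                        hi_kb: float = 1000.0) -> float:
    """Nudge the ECN marking threshold toward a target average queue depth.

    If the queue runs above target, lower the threshold so marking starts
    earlier; if it runs below, raise the threshold to regain throughput.
    The result is clamped to [lo_kb, hi_kb].
    """
    error = target_queue_kb - avg_queue_kb
    new_threshold = threshold_kb + gain * error
    return min(max(new_threshold, lo_kb), hi_kb)
```

A static threshold tuned for one traffic mix is mistuned for another; any adaptive scheme, whether this fixed-gain rule or scene-based AI inference, is trying to track that moving target automatically.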

> 2.2 Network Fast Feedback
> Discuss follow-on feedback on https://tools.ietf.org/html/draft-even-iccrg-dc-fast-congestion-00, which is expected to be introduced in ICCRG on Monday.  The draft discusses the state-of-the-art congestion controllers in use and from research, and poses a number of questions for discussion. What is to be researched and what could/should be standardized going forward?

This is the beginnings of a survey. It misses a ton of related work esp. from academia though. HOMA, pFabric, HULL, D3, PDQ, pHost, NDP, etc., etc.

> 2.3 Mixing RDMA and TCP traffic
> These two traffic types with their differing congestion controllers are known to not play well with one another in the same traffic class.  There may be some analysis data to share on this topic.  A goal would be to discuss network approaches for mitigating the impact of the two on each other.

When you say RDMA, you mean RoCE? Separate RoCE into a slice and move on. It's pointless to try and optimize for coexistence with a protocol that can change willy-nilly.
[Y] Yes, it means RoCE. If the network does not differentiate RoCE and TCP traffic, they will compete anyhow... L4S addresses a similar problem for classic TCP vs. DCTCP.
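For context on why the DCTCP-style CC mentioned above coexists poorly with classic TCP in one queue: DCTCP (RFC 8257) reacts to the *fraction* of ECN-marked packets rather than treating any mark as loss, so at the same marking rate it backs off far less than Reno/CUBIC. The two update rules below follow the published DCTCP formulas (gain g defaulting to 1/16, window cut proportional to the marking estimate alpha).

```python
def dctcp_update_alpha(alpha: float, marked: int, acked: int,
                       g: float = 1.0 / 16.0) -> float:
    """One observation window of the DCTCP estimator: alpha is an EWMA of
    the fraction of ECN-marked packets (RFC 8257, Section 3.3)."""
    frac = marked / acked if acked else 0.0
    return (1.0 - g) * alpha + g * frac

def dctcp_cwnd_after_mark(cwnd: float, alpha: float) -> float:
    """DCTCP reduces cwnd in proportion to alpha instead of halving:
    cwnd <- cwnd * (1 - alpha / 2)."""
    return cwnd * (1.0 - alpha / 2.0)
```

With light marking (small alpha) DCTCP barely slows down while classic TCP halves its window on the same signal, which is exactly the unfairness that motivates either separate queues/slices for the two traffic types or an L4S-style dual-queue coupling.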

> 3. Metrics for HPC/RDMA/AI networks
> Are the current metrics and scales appropriate for HPC/RDMA/AI networks?  HPC and storage networks tend to use IOPS as a key measure, and the latency requirements can be on the order of 10us; much different from Internet latency and throughput measures.  Should there be a draft on metric requirements for DCN networks?  Can we work with real customers to define some well-known scenarios and metrics for HPC/RDMA/AI DCNs?

Whose "current metrics and scales"? Papers on DC mechanisms certainly define appropriate metrics. Obviously Internet scales don't work, but who is using those?
[Y] To my understanding, the current metrics in the IETF mostly target the Internet. The question is whether it is worth discussing, and reaching consensus on, some common metrics for applications in DCNs, especially HPC. It is an open discussion to seek feedback :)

Rdma-cc-interest mailing list