[Rdma-cc-interest] draft minutes of IETF-105 Side meeting on Large Scale Data Center HPC/RDMA Networks

Paul Congdon <paul.congdon@tallac.com> Mon, 12 August 2019 21:55 UTC

From: Paul Congdon <paul.congdon@tallac.com>
Date: Mon, 12 Aug 2019 07:13:38 -0700
Message-ID: <CAAMqZPu-JSoK10dU+LfXHqh_CBzaKroA9k_PpQUauaTjLCWj8g@mail.gmail.com>
To: rdma-cc-interest@ietf.org
Archived-At: <https://mailarchive.ietf.org/arch/msg/rdma-cc-interest/dlUj-mwkZxsrEMzp2XWyOHrJDOM>
Subject: [Rdma-cc-interest] draft minutes of IETF-105 Side meeting on Large Scale Data Center HPC/RDMA Networks

I've uploaded the initial minutes of the side meeting to the following
public location:
https://mentor.ieee.org/802.1/dcn/19/1-19-0062-01-ICne-ietf0105-sidemeeting-minutes.pdf

I've also put a copy here in text form for quick review.  Please send me
any updates or corrections you see.  Sorry for the duplication for some,
as I have also copied the non-WG email list.  Please sign up for the list
if you have not already done so.
https://www.ietf.org/mailman/listinfo/rdma-cc-interest



IETF-105 Side Meeting:

Large Scale Data Center HPC/RDMA

Monday, July 22, 2019

8:30AM – 9:45AM



Attendees (who signed the blue sheet or were recognized):

First      Last           Affiliation
---------  -------------  --------------------------------------------
Hirochika  Asai           Preferred Networks / WIDE Project
David      Black          Dell
Randy      Bush           Arrcus
Xavier     de Foy         InterDigital
Jesús      Escudero       Universidad de Castilla-La Mancha
Roni       Even           Huawei
Randy      Haagens        Microsoft
Jianfei    He             Huawei
Russ       Housley        Vigil Security, LLC
Rachel     Huang          Huawei
Georgios   Karagiannis    Huawei Technologies Dusseldorf GmbH
Younghan   Kim            SSU
Kee-Cheon  Kim
Wangbong   Lee            ETRI
Aini       Li             Huawei
Peng       Liu            China Mobile
Sonum      Mathur         Viasat
David      Melman         Marvell
Jan        Metzke         BSI
Tal        Mizrahi        Huawei
Yoshifumi  Nishida        GE Global Research / Keio Research Institute
Keyur      Patel          Arrcus
Fengwei    Qin            China Mobile
Richard    Scheffenegger  NetApp
Marcus     Sun            Huawei
KJ         Sun
Sowmini    Varadhan       Microsoft
Stephan    Wenger         Tencent America
Hua Ru     Yang           Huawei
Hyunsik    Yang           IISTRC
Xiang      Yu             Huawei
Shuai      Zhao           Tencent
Yan        Zhuang         Huawei
Ning       Zong           Huawei



Details:

1.    The meeting organizer, Paul Congdon, presented the IETF Note Well
reminder and bashed the agenda.  No changes were made to the agenda, but it
was suggested to include additional technical approaches being pursued in
the IETF, such as LSVR.

2.    The slide material presented at the meeting is available at:
https://mentor.ieee.org/802.1/dcn/19/1-19-0061-01-ICne-ietf-sidemeeting.pdf

3.    The slides from IETF-105 HotRFC that announced this side meeting are
available at:
https://datatracker.ietf.org/meeting/105/materials/slides-105-hotrfc-7-strategies-to-drastically-improve-congestion-control-in-high-performance-data-centers-next-steps-for-rdma-00

4.    Jesús Escudero presented strategies to drastically improve congestion
control in high-performance data centers and the next steps for RDMA.  The
slides are included in the above link.  It was suggested to have an
interactive discussion about the proposed solutions on the final slide,
considering their feasibility and applicability to scaling RDMA and HPC
networks.

5.    David Black provided some background on RDMA protocols running over
IP.  He clarified that, in practice, RDMA networks often use PFC
(Priority-based Flow Control), but in principle it is not required for all
transports.  David provided the history of iWARP and RoCEv2 and the
industry momentum behind RoCEv2.  He indicated that iWARP is perhaps
superior, especially when it comes to congestion control, but the industry
has adopted RoCE.  RoCEv2 is not an IETF protocol.

6.    Randy Bush asked if we are aware of NDP, the best paper at SIGCOMM
2017.  Randy pointed out that both the IETF and the research community have
a long history of doing such work, and it would be nice to see work in this
area.  It was asked whether there was any sensitivity to addressing such
transports in the IETF.

7.    David Black asked if there were any NIC vendors in the audience.
There was one in the room and one on the phone.  The data rates in the
data center require hardware offload support.

8.    In the slides, it was suggested that a new UDP-based transport for
data centers might be worth considering.  One of the key requirements would
be that it can be offloaded to hardware.  The slides allude to a current
trend toward running more applications over UDP for low latency and high
efficiency.

9.    Keyur Patel pointed out that congestion occurs in switches and that
many switch vendors are already implementing proprietary extensions to
address this.  He would like to see a problem definition, because people are
already building solutions, and he wonders what could be done beyond what
the switch vendors are currently doing.  Paul Congdon pointed out that we
are a standards community and that defining interoperable solutions is the
whole point.  Proprietary vendor solutions are not standards based, nor are
they necessarily interoperable between vendors.

10. Being able to identify the type of congestion (e.g., incast versus
in-network) could be valuable.  The appropriate congestion mitigation
approach might vary depending on where the switch is in the topology and
what type of congestion it is experiencing.  A switch's position in the
topology can be configured, but more configuration is undesirable and could
be error prone.  An automated protocol to identify the location of a switch
in the overall topology could be valuable.

11. One of the suggested improvements is to provide more congestion
information within packet headers.  Jeff pointed out that overlay networks,
with their rich and extensible headers, can help, and that they are being
used more and more.

12. Roni Even points out that RoCE is driven by Mellanox.  They were
invited to attend this side meeting but did not attend.  Roni felt the lack
of attendance and participation is political and that Mellanox prefers to
contribute to IBTA, which is a closed working group where they have control.

13. Yan Zhuang presented the material for two drafts:
https://tools.ietf.org/html/draft-zhh-tsvwg-open-architecture-00 and
https://tools.ietf.org/html/draft-yueven-tsvwg-dccm-requirements-00.html.
The slides are included in the complete side meeting slide deck referenced
above.

14. David Black asked whether any switch in the network can communicate
directly with a NIC.  The answer is yes: in the proposed architecture, the
switch will send messages back to the source.  The increasing use of
overlays creates a challenge here.  RDMA doesn’t currently do much with
overlays, but that may change.  The headers the switch sees may not be the
headers of the source/destination of the RDMA traffic, so the messages sent
back from the switch to the NIC need to carry enough data for the source
NIC to unwrap and decode the overlay.

15. There was a question about the drafts and what is next for them.  David
Black, as TSVWG chair, explained that the drafts are being presented in the
side meeting to better understand their content and the level of interest.
Roni Even asked where the appropriate place is to progress these drafts:
ICCRG or TSVWG.

16. Paul Congdon asked how the transport-agnostic congestion signaling
works in conjunction with a TCP-based transport that also has its own
congestion signals.  Since the switch signals directly to the NIC, it will
be the responsibility of the source to combine and mix the signals
appropriately.

17. There are some simulation results that show the effectiveness of the
proposals in the drafts.  David Black suggests the results should be shared
with ICCRG because it has congestion control expertise and TSV looks to
ICCRG for this guidance.

18. Paul Congdon closed the meeting with the observation that we have more
people attending this side meeting, in part due to the favorable scheduling
at the IETF.  There was a request to create an IETF mailing list for this
work.  POST MEETING NOTE: the email list has been created and is called
rdma-cc-interest@ietf.org.  List sign-up and information are available at:
https://www.ietf.org/mailman/listinfo/rdma-cc-interest

19. Jeff pointed out that NVMe over RDMA solutions are coming and that the
NIC will no longer be the bottleneck, putting more pressure and congestion
on the network.  David Black, as an author of NVMe over Fabrics using TCP,
pointed out that NVMe over TCP (without RDMA) is new and generating a lot
of interest.

20. Paul Congdon asked whether there are other applications in the data
center, besides RDMA, that might benefit from low-latency, high-throughput
networking, perhaps other control traffic (e.g., simple server-to-server
REST APIs).  Roni Even pointed out that we should look for other
applications, but it will all depend on the tradeoffs and requirements.

21. The side meeting was adjourned at approximately 9:40AM.