Re: [nfsv4] NFS/RDMA next steps

Chuck Lever <chuck.lever@oracle.com> Wed, 02 August 2017 16:22 UTC

From: Chuck Lever <chuck.lever@oracle.com>
In-Reply-To: <CADaq8jeyxWEDkdWcRvaK-Vet0dCCXgJ0HcMywP3aXawV9KVPbg@mail.gmail.com>
Date: Wed, 02 Aug 2017 12:21:56 -0400
Cc: NFSv4 <nfsv4@ietf.org>
Message-Id: <0BACD3F5-7C29-4D7D-B88C-9D1AD74443AE@oracle.com>
References: <53DF3636-D420-4FAA-B1B0-8824602CBB72@oracle.com> <CADaq8jeyxWEDkdWcRvaK-Vet0dCCXgJ0HcMywP3aXawV9KVPbg@mail.gmail.com>
To: David Noveck <davenoveck@gmail.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/nfsv4/wzLEfSNsQTWDL1JMGOm6C_eftd0>

> On Jul 31, 2017, at 4:04 PM, David Noveck <davenoveck@gmail.com> wrote:
> 
> > The slides split the possibilities into three somewhat orthogonal groupings.
> 
> You had to divide these up to produce slides, but everyone will divide things
> differently.  In any case, I don't see these divisions as all that helpful.

The purpose of this thread is to form a coherent technical
vision for the NFS/RDMA umbrella effort. I view this vision
as a necessary precursor to sorting the scattershot of I-Ds
(both extant and pending) the WG is considering.

The question of charter milestones overlaps this discussion
in some cases, and there will be milestones that are not
under this umbrella. Thus the milestone conversation is
related but orthogonal for the moment.


> I think we
> need to decide what documents are ready to become working group 
> documents based on their maturity and promise and believe that is the 
> appropriate way to frame the discussion.

I regard that as premature for two reasons:

1. Without a technical vision, there's no agreement between
client and server designers about which features take
priority. This work is supposed to be about interoperability,
and we can't get that without there being some agreement on
what features to realize in prototype and in specification.

2. Resource limitations are a reality, and mean there must be
some prioritization of which features we try for first, and
which are put off.

I would prefer a technical conversation first before diving
into the mechanics of I-D promotion. I believe that will give
us a better chance of producing specifications that will be
implemented and deployed without wasting time on items that
are of little common interest.


> > Opinions
> > are welcome as to what order, whether something is left out, or what might be
> > removed from this list. Groupings Two and Three could introduce new Working
> > Group documents, and thus have implications for our charter milestone count.
> 
> Note that Grouping One does as well.  See below.
> 
> 
> > Grouping Zero:
> > Focus on improving existing implementations of RPC-over-RDMA and NFS/RDMA. No
> > IETF action needed, which is why I didn't include this on the slides. 
> 
> For the same reason it might not be all that appropriate to this discussion.  The balance
> between implementation and specification is a decision for individual companies to 
> make.  In my case, the work is concerned with the development of a new implementation
> but the same principle applies.  My employer will have opinions about this question and
> I'm going to pay attention to my company's needs.  It is not up to me how Oracle people 
> spend their time or up to Oracle about how mine is spent.

Not talking about "Oracle people" here. I'm bound by my
employment contract not to share that kind of information
publicly.

I was also strongly reprimanded the last time I "represented
my company's needs" in this forum. Thus I speak from a
personal view:

The limited resource pool is a community reality, and must
be considered as the WG chooses which work to move forward.

And:

It is a given that protocol work has a longer lead time than
implementation improvements. IMO at this stage, there is
significant customer benefit with implementation work that can
be brought to market in the near term.


> > There
> > are substantial improvements that can be made to existing base implementations,
> > but these would be done by many of the same folks who would be working on new
> > protocol.
> 
> True.  If people are so busy doing implementation that they have no time to work on 
> specs we will have to adjust.

> > Grouping One:
> > Enable greater transport parallelism in NFS. 
> 
> This strikes me as a highly unnatural grouping to use as a basis for decision-making.
> It might make sense in a slide presentation.

The primary reason SMB Direct goes faster than NFS/RDMA is
SMB's multi-channel capability: its ability to automatically
utilize available network adapters and fabrics. In other
words, to parallelize its workload across hardware resources.

The purpose of Grouping One is to identify the ways we can
bring the same facility to NFS. The mechanisms we have
available to us are trunking and pNFS.

Those mechanisms can benefit other transports as well. That
does not mean they are not critical to the success of
NFS/RDMA.
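
As a rough illustration of what "parallelizing a workload across
hardware resources" means, the sketch below spreads requests over
several connections in round-robin order. It is not taken from SMB
Direct or from any NFS implementation; the function name
`distribute` and the connection labels are hypothetical, and real
trunking has to deal with session binding, ordering, and failover,
none of which is modeled here.

```python
from itertools import cycle


def distribute(requests, connections):
    """Assign each request to a connection in round-robin order.

    Hypothetical helper for illustration only: `connections` stands in
    for the network adapters or fabrics a client could use in parallel.
    """
    if not connections:
        raise ValueError("at least one connection is required")
    assignment = {conn: [] for conn in connections}
    # cycle() repeats the connection list; zip() stops when requests run out.
    for request, conn in zip(requests, cycle(connections)):
        assignment[conn].append(request)
    return assignment


if __name__ == "__main__":
    print(distribute(["r0", "r1", "r2", "r3", "r4"], ["conn-a", "conn-b"]))
```

With a single connection every request lands on the same queue; each
added connection gives the client another independent path.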


> > This includes multipathing 
> 
> I'm working on this.  I mentioned producing draft-dnoveck-nfsv4-mv1-msns-update at the meeting.
> 
> My understanding is that Chuck will submit draft-cel-mv0-trunking-update which addresses
> trunking/multipathing in the v4.0 context.
> 
> > and
> > use of pNFS with RDMA. No changes to RPC-over-RDMA or NFS/RDMA are necessary,
> > and this would bring important performance capabilities to NFS, especially
> > by enabling very low latency client access to Storage Class Memory.
> 
> I agree with doing this but what we heard at ietf99 was a very early draft.  This is a long-term effort which should be pursued but this document is quite a ways from being ready to become
> a working group document.

You may have misunderstood what I meant by my blanket
mention of pNFS over RDMA. This loose term includes both
the NVMe layout, which can use RDMA, and the newer RDMA-
only push-mode layout that Christoph introduced at IETF 99.

With regard to the former:

draft-hellwig-nfsv4-scsi-layout-nvme

is an integral part of bringing increased parallelism
to NFS/RDMA, and I hope that Christoph allows us to
promote this work.

The latter:

draft-hellwig-nfsv4-rdma-layout

is indeed in early stages. But overall it is less complex
than other work (like RPC-over-RDMA version 2) and is a
copy (plus-or-minus details about memory registration and
RDMA Flush) of existing block layouts.

Therefore I don't regard this as long-term at all, but
rather something that we can and should move along as
quickly as is possible. Support for push-mode in SMB
Direct, and the march toward market of Storage Class
Memory, are happening now.

This layout type can be utilized by a server with
traditional DRAM as well. There's no reason to put this
off.


> > Grouping Two:
> > Incrementally improve RPC-over-RDMA version 1. The main idea here is to
> > introduce a per-connection transport property negotiation mechanism to replace
> > CCP. This would enable variable size (i.e., larger) inline thresholds and the use
> > of Remote Invalidation in some instances with existing deployments.
> 
> Implementation of this is important and will be proceeding.

It's already implemented in Linux. Are you announcing a
product feature in a NetApp product?

Whether it's important is a question mark. I've found only
one area of significant performance improvement in my testing,
and that's with RPC Calls of size 1KB to 16KB.

Examples include metadata-type NFS operations with large
arguments, and small-to-moderate NFS WRITEs. This needs further
analysis and discussion, independent of controversy surrounding
standardization.
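
To make the role of the inline threshold concrete, here is a minimal
sketch of the decision a sender faces for each RPC Call. It assumes
the RPC-over-RDMA version 1 default threshold of 1024 bytes and a
hypothetical negotiated threshold of 16384 bytes, chosen only to match
the 1KB to 16KB range mentioned above. The function names are
illustrative, and the sketch ignores the transport header, which in
practice also counts against the threshold.

```python
# Illustrative only: how an inline threshold decides the way an RPC Call
# is conveyed. All sizes are in bytes.

DEFAULT_INLINE_THRESHOLD = 1024  # RPC-over-RDMA version 1 default


def call_conveyance(call_size, inline_threshold=DEFAULT_INLINE_THRESHOLD):
    """Return "inline" if the Call fits in one RDMA Send, else "read-chunk"."""
    if call_size < 0:
        raise ValueError("call_size must be non-negative")
    return "inline" if call_size <= inline_threshold else "read-chunk"


def newly_inline(call_size, negotiated_threshold):
    """True if a larger negotiated threshold moves this Call from a
    Read chunk to an inline Send."""
    return (call_conveyance(call_size) == "read-chunk"
            and call_conveyance(call_size, negotiated_threshold) == "inline")


if __name__ == "__main__":
    for size in (512, 4096, 65536):
        print(size, call_conveyance(size), newly_inline(size, 16384))
```

Under these assumptions only Calls between the default and the
negotiated threshold change behavior, which is consistent with the
improvement being confined to that size range.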

For instance, a similar improvement might be achievable with a
better implementation of Read chunks, with no protocol changes
necessary. Also, improving delegation heuristics can reduce
the need for large metadata operations on the wire. Tuning
clients to emit smaller NFS COMPOUNDs would also be beneficial.

There have been only a few percentage points of boost in small
IOPS, and of course no impact on I/O with larger payloads, in
my testing.


> I intend to implement it regardless of working group decisions about priorities or the prospect of controversy. 
> Perhaps this belongs in group Zero.  The work to get this
> already completed specification work through the IETF process is relatively small although there may be controversies that need to be resolved.  See below.
> 
> 
> > Grouping Three:
> > Pursue RPC-over-RDMA version 2. This would open a variety of avenues by which
> > many of the perceived shortcomings of RPC-over-RDMA version 1 could be
> > addressed.
> 
> 
> > IMO Zero and One are where we can get the greatest bang for the buck in the
> > near term.
> 
> Two issues with this:
> 	• I believe Zero is out of scope for this discussion for reasons already given.

The realities of resource limits must be considered.


> 	• The pNFS-RDMA layout type is not a near-term item.  I agree we should pursue it, but it is not going to happen soon, especially if much of the working group is busy doing implementation work.  In any case, when Christoph thinks this is ready to be a working group item, we should definitely consider this seriously.

Perhaps you misunderstand what Grouping One is about. That
grouping encompasses work that is active and ongoing, and
the RDMA push-mode layout work happens to be in that bundle.

I'm advocating that we prioritize the efforts around trunking
(the drafts you and I are authoring) and around Christoph's
NVMe SCSI layout, and help in any way we can with the RDMA
push-mode layout.


> > The current proposal for Grouping Two (draft-cel-nfsv4-rpcrdma-cm-pvt-msg) is
> > controversial. 
> 
> If Tom, or anyone else, has objections to proceeding with this, we need to discuss this on the list and work to arrive at a reasonable resolution.  My understanding is that Tom objects to this being standards-track.  While I believe it should be standards-track, I can live with an Informational RFC.   I just don't want this to be an undocumented de facto standard.

I don't object to it being Informational, nor do I object to
the work moving ahead as a lower priority to Grouping One.


> > Grouping Three would be an immense amount of work to generalize
> > some things for less gain than we might see with work in Grouping Zero or One.
> 
> I don't think "immense" is the right word but let's not argue about that.  In any case, we need
> relief from the major performance issues that were somehow baked into Version One.

That claim needs substantiation, which is why I implemented
the experiment described in

draft-cel-nfsv4-rpcrdma-cm-pvt-msg

I don't believe this work is ready to pursue as a WG document
until we have quantified the "major performance issue" and have
confirmed that large inline thresholds and Remote Invalidation
will relieve the problem, whatever it may be.


> We can live with Version One if we have the implementation corresponding to cm-pvt-msg.

Does anyone second that opinion?

My feeling is that the parallelism effort (Grouping One) will
bring much greater performance improvements, will benefit
RPC-over-RDMA version 1 consumers, and therefore should be
prioritized.

The conclusion I draw from your comments is that you believe
NetApp can't take advantage of a pNFS block layout (either
NVMe or push-mode). However, NetApp was the sponsor of the
Linux pNFS block mode (with XFS) work that Christoph recently
completed.

What about trunking?


> I'm not sure whether this is Grouping Two (as above) or Zero (because we are talking about existing implementations).

As rpcrdma-cm-pvt-msg involves protocol specification and
considerations of interoperability, and nothing in Grouping
Zero has those issues, a separate Grouping is an appropriate
place for it.


> In any case we have running code and should take advantage of Chuck's good work in specifying this regardless of any potential controversy.


--
Chuck Lever