Re: [nfsv4] NFS/RDMA next steps

David Noveck <davenoveck@gmail.com> Wed, 02 August 2017 19:50 UTC

In-Reply-To: <0BACD3F5-7C29-4D7D-B88C-9D1AD74443AE@oracle.com>
References: <53DF3636-D420-4FAA-B1B0-8824602CBB72@oracle.com> <CADaq8jeyxWEDkdWcRvaK-Vet0dCCXgJ0HcMywP3aXawV9KVPbg@mail.gmail.com> <0BACD3F5-7C29-4D7D-B88C-9D1AD74443AE@oracle.com>
From: David Noveck <davenoveck@gmail.com>
Date: Wed, 02 Aug 2017 15:50:25 -0400
Message-ID: <CADaq8jecKcLM2FzkACHrd7nxfrNLTGDPkZjsn3MgaJK7e_=bxQ@mail.gmail.com>
To: Chuck Lever <chuck.lever@oracle.com>
Cc: NFSv4 <nfsv4@ietf.org>
Content-Type: multipart/alternative; boundary="001a1141bc564ae2fe0555ca935c"
Archived-At: <https://mailarchive.ietf.org/arch/msg/nfsv4/L3N0Jb-GSlUwsZAefgr0q3kBhVc>
Subject: Re: [nfsv4] NFS/RDMA next steps

> The purpose of this thread is to form a coherent technical
> vision for the NFS/RDMA umbrella effort.

I don't think this is very realistic.  The people in this group
have different views about things and will not necessarily agree
about what is the most important.  We have to accommodate
those different opinions and try to work together on what we
do agree on, even if we do not share an overall vision.

> I view this vision
> as a necessary precursor to sorting the scattershot of I-Ds
> (both extant and pending) the WG is considering.

I don't agree.  If you and I agree that something is important
and should be worked on (and other working group members
don't disagree), we should proceed on that, even if we do not
share an overall technical vision.

> The purpose of Grouping One is to identify the ways we can
> bring the same facility to NFS. The mechanisms we have
> available to us are trunking and pNFS.

I'm working on a trunking-related document, as are you.  This
is despite the fact that I am also working on implementation.

I agree that using pNFS would be helpful and should be
encouraged.

> Perhaps you misunderstand what Grouping One is about.

Probably so.  At IETF it was billed as work above the RDMA
layer.  I never paid much attention to the grouping but focused
on the individual items and it appears we agree about a lot of
those.

> That
> grouping encompasses work that is active and ongoing, and
> the RDMA push-mode layout work happens to be in that bundle.

> I'm advocating that we prioritize the efforts around trunking
> (the drafts you and I are authoring)

OK.  I agree this is important, but I'm not sure what you mean
by "prioritize".  I'm doing the best I can, given that I also have
implementation commitments.  I promised to have
draft-dnoveck-nfsv4-mv1-msns-update-00 out by the end of
August and intend to get that done.

> and around Christoph's NVMe SCSI layout,

Agree but I don't think Christoph is slacking off here.  At IETF99,
some decisions were made that mean that Christoph has a
bunch of work to do on this specification.  Also, since it is clear
that this will be a working group document, perhaps it is time
to ask him for a milestone (cue maniacal laughter :-).

> and help in any way we can with the RDMA
> push-mode layout.

I'm waiting for Christoph to indicate he needs help.  If he does,
it would be great if he could get it.

> The latter:
>
>draft-hellwig-nfsv4-rdma-layout
>
> is indeed in early stages.

That's what I meant by saying it was a long-term effort, i.e., that it
just started recently and it will be a while until it is worked
out completely.

> But overall it is less complex
> than other work (like RPC-over-RDMA version 2) and is a
> copy (plus-or-minus details about memory registration and
> RDMA Flush) of existing block layouts.

If you ignore the difficult parts, it is trivial :-)

However, we should be realistic about how long this will take to do.

> Therefore I don't regard this as long-term at all, but
> rather something that we can and should move along as
> quickly as is possible.

I agree we should move this along as quickly as possible, but we
have to realize that will take a while.  That's what I meant by
saying it is a "long-term effort".

> This layout type can be utilized by a server with
> traditional DRAM as well.

Good point.

> There's no reason to put this off.

I agree.  When I said it was a "long-term effort", I meant precisely that
and not "Let's put this on the back burner".

> > Grouping Two:
>
>
> Implementation of this is important and will be proceeding.

> It's already implemented in Linux. Are you announcing a
> product feature in a NetApp product?

Obviously not.  I am not announcing ONTAP RDMA support, but
it is pretty obvious that NetApp would be interested in it.  Similarly, it
is obvious that, if the Linux client implements a performance-
oriented feature, then server vendors who want to interoperate
with it (i.e. all of them) will also implement this.

There is no material non-public information here.  The working group
members have no need to consider how they would look in orange :-)

> Whether it's important is a question mark.

It depends on your point of view.  You thought it was important enough to
write a specification.  Someone thought it was important enough to write
client and server implementations and get them to interoperate.

I don't see why the relatively small amount of work to get this published
officially calls for a substantially greater degree of importance but there
is no point in arguing the matter further.

> I've found only
> one area of significant performance improvement in my testing,
> and that's with RPC Calls of size 1KB to 16KB.

To me, that is a pretty significant area.

> The conclusion I draw from your comments is that you believe
> NetApp can't take advantage of a pNFS block layout (either
> NVMe or push-mode).

I don't see on what basis you draw that conclusion.

> However, NetApp was the sponsor of the
> Linux pNFS block mode (with XFS) work that Christoph recently
> completed.

Good to know.

On Wed, Aug 2, 2017 at 12:21 PM, Chuck Lever <chuck.lever@oracle.com> wrote:

>
> > On Jul 31, 2017, at 4:04 PM, David Noveck <davenoveck@gmail.com> wrote:
> >
> > > The slides split the possibilities into three somewhat orthogonal
> > > groupings.
> >
> > You had to divide these up to produce slides, but everyone will divide
> > things differently.  In any case, I don't see these divisions as all
> > that helpful.
>
> The purpose of this thread is to form a coherent technical
> vision for the NFS/RDMA umbrella effort. I view this vision
> as a necessary precursor to sorting the scattershot of I-Ds
> (both extant and pending) the WG is considering.
>
> The question of charter milestones overlaps this discussion
> in some cases, and there will be milestones that are not
> under this umbrella. Thus the milestone conversation is
> related but orthogonal for the moment.
>
>
> > I think we
> > need to decide what documents are ready to become working group
> > documents based on their maturity and promise and believe that is the
> > appropriate way to frame the discussion.
>
> I regard that as premature for two reasons:
>
> 1. Without a technical vision, there's no agreement between
> client and server designers about what are the priority
> features. This work is supposed to be about interoperability,
> and we can't get that without there being some agreement on
> what features to realize in prototype and in specification.
>
> 2. Resource limitations are a reality, and mean there must be
> some prioritization of which features we try for first, and
> which are put off.
>
> I would prefer a technical conversation first before diving
> into the mechanics of I-D promotion. I believe that will give
> us a better chance of producing specifications that will be
> implemented and deployed without wasting time on items that
> are of little common interest.
>
>
> > > Opinions
> > > are welcome as to what order, whether something is left out, or what
> > > might be removed from this list. Groupings Two and Three could introduce
> > > new Working Group documents, and thus have implications for our charter
> > > milestone count.
> >
> > Note that Grouping One does as well.  See below.
> >
> >
> > > Grouping Zero:
> > > Focus on improving existing implementations of RPC-over-RDMA and
> NFS/RDMA. No
> > > IETF action needed, which is why I didn't include this on the slides.
> >
> > For the same reason it might not be all that appropriate to this
> discussion.  The balance
> > between iimplementation and specfication is a decision for individual
> companies to
> > make.  In my case, the work is concerned with the development of a new
> implementation
> > but the same principle applies.  My employer will have opinions about
> this question and
> > I'm going to pay attenntion to my company's needs.  It is not up to me
> how Oracle people
> > spend their time or up to Oracle about how mine is spent.
>
> Not talking about "Oracle people" here. I'm bound by my
> employment contract not to share that kind of information
> publicly.
>
> I was also strongly reprimanded the last time I "represented
> my company's needs" in this forum. Thus I speak from a
> personal view:
>
> The limited resource pool is a community reality, and must
> be considered as the WG chooses which work to move forward.
>
> And:
>
> It is a given that protocol work has a longer lead time than
> implementation improvements. IMO at this stage, there is
> significant customer benefit with implementation work that can
> be brought to market in the near term.
>
>
> > > There
> > > are substantial improvements that can be made to existing base
> > > implementations, but these would be done by many of the same folks
> > > who would be working on new protocol.
> >
> > True.  If people are so busy doing implementation that they have no
> > time to work on specs we will have to adjust.
>
> > > Grouping One:
> > > Enable greater transport parallelism in NFS.
> >
> > This strikes me as a highly unnatural grouping to use as a basis for
> > decision-making.  It might make sense in a slide presentation.
>
> The primary reason SMB Direct goes faster than NFS/RDMA is
> SMB's multi-channel capability: its ability to automatically
> utilize available network adapters and fabrics. In other
> words, to parallelize its workload across hardware resources.
>
> The purpose of Grouping One is to identify the ways we can
> bring the same facility to NFS. The mechanisms we have
> available to us are trunking and pNFS.
>
> Those mechanisms can benefit other transports as well. That
> does not mean they are not critical to the success of
> NFS/RDMA.
>
>
> > > This includes multipathing
> >
> > I'm working on this.  I mentioned producing
> > draft-dnoveck-nfsv4-mv1-msns-update at the meeting.
> >
> > My understanding is that Chuck will submit draft-cel-mv0-trunking-update
> > which addresses trunking/multipathing in the v4.0 context.
> >
> > > and
> > > use of pNFS with RDMA. No changes to RPC-over-RDMA or NFS/RDMA are
> > > necessary, and this would bring important performance capabilities to
> > > NFS, especially by enabling very low latency client access to Storage
> > > Class Memory.
> >
> > I agree with doing this but what we heard at IETF 99 was a very early
> > draft.  This is a long-term effort which should be pursued but this
> > document is quite a ways from being ready to become a working group
> > document.
>
> You may have misunderstood what I meant by my blanket
> mention of pNFS over RDMA. This loose term includes both
> the NVMe layout, which can use RDMA, and the newer RDMA-
> only push-mode layout that Christoph introduced at IETF 99.
>
> With regard to the former:
>
> draft-hellwig-nfsv4-scsi-layout-nvme
>
> is an integral part of bringing increased parallelism
> to NFS/RDMA, and I hope that Christoph allows us to
> promote this work.
>
> The latter:
>
> draft-hellwig-nfsv4-rdma-layout
>
> is indeed in early stages. But overall it is less complex
> than other work (like RPC-over-RDMA version 2) and is a
> copy (plus-or-minus details about memory registration and
> RDMA Flush) of existing block layouts.
>
> Therefore I don't regard this as long-term at all, but
> rather something that we can and should move along as
> quickly as is possible. Support for push-mode in SMB
> Direct, and the march toward market of Storage Class
> Memory, is happening now.
>
> This layout type can be utilized by a server with
> traditional DRAM as well. There's no reason to put this
> off.
>
>
> > > Grouping Two:
> > > Incrementally improve RPC-over-RDMA version 1. The main idea here is to
> > > introduce a per-connection transport property negotiation mechanism to
> > > replace CCP. This would enable variable size (i.e., larger) inline
> > > thresholds and the use of Remote Invalidation in some instances with
> > > existing deployments.
> >
> > Implementation of this is important and will be proceeding.
>
> It's already implemented in Linux. Are you announcing a
> product feature in a NetApp product?
>
> Whether it's important is a question mark. I've found only
> one area of significant performance improvement in my testing,
> and that's with RPC Calls of size 1KB to 16KB.
>
> Examples include metadata-type NFS operations with large
> arguments, and small-to-moderate NFS WRITEs. This needs further
> analysis and discussion, independent of controversy surrounding
> standardization.
>
> For instance, similar improvement might be possible to see with
> better implementation of Read chunks, and no protocol changes
> would be necessary. Also, improving heuristics of delegation
> can reduce the need for large metadata operations on the wire.
> Tuning clients to emit smaller NFS COMPOUNDs would be beneficial.
>
> There have been only a few percentage points of boost in small
> IOPS, and of course no impact on I/O with larger payloads, in
> my testing.
>
>
> > I intend to implement it regardless of working group decisions about
> > priorities or the prospect of controversy.  Perhaps this belongs in
> > Grouping Zero.  The work to get this already-completed specification
> > through the IETF process is relatively small although there may be
> > controversies that need to be resolved.  See below.
> >
> >
> > > Grouping Three:
> > > Pursue RPC-over-RDMA version 2. This would open a variety of avenues
> > > by which many of the perceived shortcomings of RPC-over-RDMA version 1
> > > could be addressed.
> >
> >
> > > IMO Zero and One are where we can get the greatest bang for the buck
> > > in the near term.
> >
> > Two issues with this:
> >       • I believe Zero is out of scope for this discussion for reasons
> >         already given.
>
> The realities of resource limits must be considered.
>
>
> >       • The pNFS-RDMA layout type is not a near-term item.  I agree we
> >         should pursue it, but it is not going to happen soon, especially
> >         if much of the working group is busy doing implementation work.
> >         In any case, when Christoph thinks this is ready to be a working
> >         group item, we should definitely consider this seriously.
>
> Perhaps you misunderstand what Grouping One is about. That
> grouping encompasses work that is active and ongoing, and
> the RDMA push-mode layout work happens to be in that bundle.
>
> I'm advocating that we prioritize the efforts around trunking
> (the drafts you and I are authoring) and around Christoph's
> NVMe SCSI layout, and help in any way we can with the RDMA
> push-mode layout.
>
>
> > > The current proposal for Grouping Two
> > > (draft-cel-nfsv4-rpcrdma-cm-pvt-msg) is controversial.
> >
> > If Tom, or anyone else, has objections to proceeding with this, we need
> > to discuss this on the list and work to arrive at a reasonable
> > resolution.  My understanding is that Tom objects to this being
> > standards-track.  While I believe it should be standards-track, I can
> > live with an Informational RFC.  I just don't want this to be an
> > undocumented de facto standard.
>
> I don't object to it being Informational, nor do I object to
> the work moving ahead as a lower priority to Grouping One.
>
>
> > > Grouping Three would be an immense amount of work to generalize
> > > some things for less gain than we might see with work in Grouping Zero
> > > or One.
> >
> > I don't think "immense" is the right word but let's not argue about
> > that.  In any case, we need relief from the major performance issues
> > that were somehow baked into Version One.
>
> That claim needs substantiation, which is why I implemented
> the experiment described in
>
> draft-cel-nfsv4-rpcrdma-cm-pvt-msg
>
> I don't believe this work is ready to pursue as a WG document
> until we have quantified the "major performance issue" and have
> confirmed that large inline thresholds and Remote Invalidation
> will relieve the problem, whatever it may be.
>
>
> > We can live with Version One if we have the implementation corresponding
> > to cm-pvt-msg.
>
> Does anyone second that opinion?
>
> My feeling is that the parallelism effort (Grouping One) will
> bring much greater performance improvements, will benefit
> RPC-over-RDMA version 1 consumers, and therefore should be
> prioritized.
>
> The conclusion I draw from your comments is that you believe
> NetApp can't take advantage of a pNFS block layout (either
> NVMe or push-mode). However, NetApp was the sponsor of the
> Linux pNFS block mode (with XFS) work that Christoph recently
> completed.
>
> What about trunking?
>
>
> > I'm not sure whether this is Grouping Two (as above) or Zero (because we
> > are talking about existing implementations).
>
> As rpcrdma-cm-pvt-msg involves protocol specification and
> considerations of interoperability, and nothing in Grouping
> Zero has those issues, a separate Grouping is an appropriate
> place for it.
>
>
> > In any case we have running code and should take advantage of Chuck's
> > good work in specifying this regardless of any potential controversy.
>
>
> --
> Chuck Lever
>
>
>
>