Re: [nfsv4] I-D Action: draft-ietf-nfsv4-rfc5667bis-04.txt

David Noveck <davenoveck@gmail.com> Sat, 04 February 2017 17:37 UTC

From: David Noveck <davenoveck@gmail.com>
Date: Sat, 04 Feb 2017 12:37:37 -0500
To: Chuck Lever <chuck.lever@oracle.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/nfsv4/FM07-gk-dmY6SosBWKxIduKQNdg>
Cc: "nfsv4@ietf.org" <nfsv4@ietf.org>
Subject: Re: [nfsv4] I-D Action: draft-ietf-nfsv4-rfc5667bis-04.txt

> I wonder if we can continue to label
> this Direct Data Placement, since the buffer has to be
> "flipped" into place after receive

Good observation :-), although somewhat annoying :-(.

In any case, rtrext-02 will refer to Send-based Data Placement.

On Fri, Feb 3, 2017 at 4:09 PM, Chuck Lever <chuck.lever@oracle.com> wrote:

>
> > On Feb 3, 2017, at 1:04 PM, David Noveck <davenoveck@gmail.com> wrote:
> >
> > A lot of this reads like pieces of a review of rtrext.  I'll deal with
> some issues within that later but first I want to focus on the concept of
> DDP-eligibility.
> >
> > I'm not sure exactly what it would mean for a concept such as this to be
> completely general and I don't think that's the issue.  For me, the
> important point is that it is a useful concept and that it has more than
> one possible implementation.  XDR items are categorized as either
> DDP-eligible or not.  I am not saying that additional useful
> categorizations cannot arise in the future, but I don't know of any, and
> there doesn't seem to be any useful way to prepare for them.
> >
> > The existing categorization is such that there will be individual
> instances of items that implementations will not transfer using DDP.  This
> has always been treated as an implementation choice and ULBs do not say,
> for example, that READs of a few bytes are not DDP-eligible.  Instead, they
> are DDP-eligible and requesters have a way to specify that they be returned
> inline.  To do otherwise would make ULBs overly complicated and very hard
> (perhaps Sisyphean) to maintain.  ULBs are limited to specifying which
> items are DDP-eligible because that is what the requester and responder
> need to agree on.  If the requester wants to send data inline or have
> DDP-eligible response data returned inline that is up to him and ULBs do
> not need to say anything about this choice.
> >
> > The same applies when there are multiple possible forms of DDP with
> different performance characteristics.  The requester chooses and the ULB
> does not need to instruct him about which choice to make.  To try to create
> rules about what choices the requester could make would make a ULB just
> about impossible to do.  Let's not go there.
> >
> > Now let me give some background about rtrext.  Perhaps I've been unduly
> influenced by Talpey's figure of a million 8K random IOs per second, but
> the fact is that I was focused on that case.  In my experience, it is a
> very common case, deserving of significant attention.  I added message
> continuation to rtrext to allow reads of multiples of the block size (e.g.
> 8K) to get the same sort of benefit.
>
> There is no denying that this is an important goal.
>
> Unfortunately I've not found a way to approach that level
> of throughput by shrinking per-op latency alone. It's
> important to realize that there is a lower bound for per-op
> latency that is determined by the physics of moving packets
> on a physical network fabric.
>
> High IOPS throughput is actually achieved by using multiple
> QPs through multiple physical interfaces and networks linking
> an SMB client and several storage targets (or in NFS terms:
> pNFS).
>
> So, if one RPC-over-RDMA connection can achieve 125KIOPS (and
> I don't think that's unattainable, even with RPC-over-RDMA V1),
> then eight connections might be able to swing a million IOPS,
> given a careful client implementation. Getting to four
> 250KIOPS connections might be a challenge.
>
> There are some critical pre-requisites to multi-channel
> operation:
>
> - bidirectional RPC on RPC-over-RDMA to enable NFSv4.1
> - multi-path NFS (Andy's work), which works best with NFSv4.1
> - pNFS/RDMA, which requires NFSv4.1, to enable the use of
> multiple DSes
>
> Linux is now close to having all of this in place.
>
>
> > Perhaps I did not devote the attention I should have to I/Os that are
> not an integral number of blocks, but my experience is that clients are
> typically structured to make this an unusual case.  YMMV.
>
> > I don't claim that send-based DDP is universally applicable and I expect
> other forms of DDP to continue to exist and be useful.  With regard to the
> READ/WRITE distinction, I accept that most of the benefits are for WRITE
> and that many implementations might reasonably choose to use V1-like DDP for
> READ while using send-based DDP for WRITE.  However, there are many
> environments in which remote invalidation is not possible (e.g. user-level
> server) and for these there are significant benefits from using send-based
> DDP for READ as well.
> >
> > With regard to the sections that read like pieces of a review of rtrext,
> it is hard to respond because the text often makes reference to potential
> extensions for which there is no clear definition.   I think we have to
> address these issues:
> >       • For push mode, it makes sense to define this as another optional
> extension to Version Two.  We need a volunteer to write rpcrdma-pushext.
> If we don't have one now, we should try again after rpcrdma-version-two
> becomes a working group document.
>
> At the moment, push mode looks like a new pNFS layout type,
> thus it is independent of which version of RPC-over-RDMA is
> in use. Agreed, that all has to be written up. Christoph or
> I could make a start at it, but I'm not volunteering anyone's
> time. It would be a short document similar to
>
> http://xml2rfc.tools.ietf.org/public/rfc/bibxml3/reference.I-D.draft-hellwig-nfsv4-scsi-layout-nvme-00.xml
>
> which unfortunately appears to have expired.
>
>
> >       • With regard to speculation about a possible form of
> Send-based DDP with just Message Continuation, I'm skeptical but if there
> were to be some clear definition, it could be added as an option to
> rtrext-0x, for x>= 2.
>
> I was thinking of starting with an independent document
> (and extension) just to draw the lines clearly. I've been
> busy with other things, and haven't found an opportunity
> to write it up.
>
>
> > With regard to issues related to challenges due to large page sizes, I
> see the matter differently.  I think it is possible to avoid copying
> without using page flipping, but using what I'll call "buffer flipping",
> even when page sizes are large or there is no use of virtual memory at all.
> One might use page flipping only as a way to avoid copies on page-size
> units that are smaller than the buffer size (e.g. 4K pages with 8K buffers).
> >
> > Let's assume the buffer cache uses an 8K buffer size and that the page
> size is either larger or non-existent (i.e. no virtual memory).  Receive
> buffers are structured to include a section for inline payload (1-2K)
> together with an 8K buffer within the buffer cache.  When posted for
> receive, this buffer should have no valid data within it, and the
> associated buffer header should indicate that it is neither free nor assigned
> to a particular file location.  Once data is placed into it by a receive
> which gets a WRITE request, that buffer can be assigned to a location
> within a file and the buffer header updated to reflect this location.  The
> mapping from file and block number to buffer address is changed, but the
> address of the buffer remains the same: it is the one into which the data
> was placed.
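>
> To make that concrete, here is a minimal sketch of the "buffer flipping"
> step.  All names here are hypothetical; cache_index_insert() stands in for
> whatever index the buffer cache keeps, and the state names are made up:
>
>     #include <stdint.h>
>
>     enum buf_state { BUF_FREE, BUF_UNASSIGNED, BUF_ASSIGNED };
>
>     struct buf_hdr {
>             enum buf_state  state;
>             uint64_t        fileid;
>             uint64_t        block;
>             void            *data;   /* 8K segment the adapter filled */
>     };
>
>     /* Assumed provided by the buffer cache. */
>     void cache_index_insert(uint64_t fileid, uint64_t block,
>                             struct buf_hdr *b);
>
>     /* Bind the already-filled buffer to (fileid, block); no data copy. */
>     static void flip_into_cache(struct buf_hdr *b,
>                                 uint64_t fileid, uint64_t block)
>     {
>             b->fileid = fileid;
>             b->block  = block;
>             b->state  = BUF_ASSIGNED;
>             cache_index_insert(fileid, block, b);
>     }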
>
> In operating systems I'm familiar with, the page cache
> is the single base of all file I/O. Sub-page sized I/O
> (i.e., a separate buffer cache) is no longer utilized.
>
> Even if a particular file system implementation were
> to use it, those buffers would not be exposed above
> the VFS, and thus NFS servers would have no visibility
> of them. Again, that's what I'm familiar with, maybe
> not universally true, and perhaps I've misunderstood
> something.
>
> A Message Continuation approach could use a "flipping"
> mechanism without the requester having to parcel out
> chunk lists.
>
> The key is separating the Transport header from the
> Payload. By fixing the size of the Transport header,
> a receiver can get the incoming Payload to land in
> a separate buffer from the Transport header, always.
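>
> As a sketch of what I mean (using the verbs API; HDR_SIZE and the
> pre-registered MRs are assumptions, not anything the spec defines), a
> receiver can post each Receive with a two-element scatter list so the
> first HDR_SIZE bytes of an incoming Send land in a small header buffer
> and the remainder lands in a page-aligned payload buffer:
>
>     #include <stdint.h>
>     #include <infiniband/verbs.h>
>
>     #define HDR_SIZE 128   /* assumed fixed Transport header size */
>
>     static int post_split_recv(struct ibv_qp *qp,
>                                struct ibv_mr *hdr_mr, void *hdr_buf,
>                                struct ibv_mr *pay_mr, void *pay_buf,
>                                uint32_t pay_len, uint64_t wr_id)
>     {
>             struct ibv_sge sge[2] = {
>                     /* The Transport header always lands here... */
>                     { .addr = (uintptr_t)hdr_buf, .length = HDR_SIZE,
>                       .lkey = hdr_mr->lkey },
>                     /* ...and the Payload lands in its own aligned buffer. */
>                     { .addr = (uintptr_t)pay_buf, .length = pay_len,
>                       .lkey = pay_mr->lkey },
>             };
>             struct ibv_recv_wr wr = {
>                     .wr_id = wr_id, .sg_list = sge, .num_sge = 2,
>             };
>             struct ibv_recv_wr *bad_wr;
>
>             return ibv_post_recv(qp, &wr, &bad_wr);
>     }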
>
> A sender might then rely on the fact that the
> receiver's inline threshold is related to its
> preferred page/buffer cache size, and break each
> message up so that certain data items in the Payload
> stream land in independent buffers on the receiver
> (again, this is similar to an experimental approach
> being considered for NFS on TCP).
>
> RDMA Receive preserves the sender's Send ordering.
> The receiver simply concatenates the incoming messages
> related to the same XID, grouping them as is convenient
> to it.
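>
> A minimal sketch of that reassembly (the segment and rpc_msg types are
> invented for illustration): because Receives complete in the peer's Send
> order, appending each continuation segment for a given XID in arrival
> order is enough to reconstruct the Payload stream:
>
>     #include <stddef.h>
>     #include <stdint.h>
>     #include <stdlib.h>
>
>     struct segment {
>             struct segment *next;
>             void           *buf;
>             size_t          len;
>     };
>
>     struct rpc_msg {
>             uint32_t        xid;
>             struct segment *head, *tail;   /* kept in Receive order */
>     };
>
>     /* Append one received continuation segment for this XID. */
>     static int add_segment(struct rpc_msg *m, void *buf, size_t len)
>     {
>             struct segment *s = malloc(sizeof(*s));
>
>             if (!s)
>                     return -1;
>             s->next = NULL;
>             s->buf  = buf;
>             s->len  = len;
>             if (m->tail)
>                     m->tail->next = s;
>             else
>                     m->head = s;
>             m->tail = s;
>             return 0;
>     }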
>
> Receiving such messages should work correctly for any
> arbitrary division of Payload stream contents. Benefits
> accrue when receivers can advertise to senders exactly
> how they want to see RPC payloads divided into messages.
> Perhaps a ULB could be useful here.
>
> This arrangement is still unable to handle what I'll
> term "zero copy sub-buffer modification". In more
> abstract terms, I wonder if we can continue to label
> this Direct Data Placement, since the buffer has to be
> "flipped" into place after receive; it's not a bona
> fide update-in-place. What we are attempting to do
> here is more akin to zero-copy receive.
>
>
> > I think the conformance issues you have mentioned are worth thinking
> about but there will be differences of opinion about how important these
> are.  In any case, the discussion has resulted in some conclusions about
> use of various forms of DDP that need to be captured somewhere other than in
> an email thread.
> >
> > I can see why you don't want this stuff in rfc5667bis.  I don't think it
> belongs in nfsulb (or in an update to rfc5667bis based on it) either.  For
> now, I'll add an implementation-choice section to rtrext-02, but ultimately
> I don't think this kind of stuff belongs in a standards-track document.  I
> think that eventually, when the set of extensions in Version Two is more
> settled (and larger than one), we could publish an Informational RFC that
> provides implementation guidance about performance-relevant choices that
> implementers might need to be made aware of.
>
> Yes, that is one possibility.
>
>
> > :
> >
> > On Thu, Feb 2, 2017 at 3:45 PM, Chuck Lever <chuck.lever@oracle.com>
> wrote:
> >
> > > On Feb 2, 2017, at 8:53 AM, David Noveck <davenoveck@gmail.com> wrote:
> > >
> > > > This section also mentioned inline thresholds and Reply chunks,
> > > > and that looks superfluous. I removed it.
> > >
> > > A whole lot of rfc5667 wound up essentially repeating stuff
> > > in rfc5666.  A lot of that has been removed and that work is
> > > continuing.
> > >
> > > The interesting question is when that process should stop.  In
> > > particular, we need to understand if it can stop short of simply
> > > providing the information required of ULBs in rfc5666bis.
> >
> > I'm totally OK with removing redundant language from rfc5667bis,
> > but we should not pretend that DDP eligibility is completely
> > general or that we understand perfectly how all DDP transport
> > mechanisms will work in the future.
> >
> > I'd like to be sure we are not creating a Sisyphean task.
> >
> >
> > > When
> > > I suggested going that way, you didn't like the idea and I had
> > > doubts about whether it was workable,
> >
> > For example, some repetition of rfc5666bis language is helpful as
> > implementation guidance.
> >
> >
> > > Nevertheless, if we
> > > have material beyond what we need to meet those requirements,
> > > we should understand why it is there.
> >
> > There are other things that can go into a ULB that might not
> > have been listed in rfc5666bis. Clearly one issue we have with
> > NFS in particular is how to deal with replay caching above the
> > transport.
> >
> >
> > > Meanwhile, in nfsulb, I've been going in the opposite direction,
> > > by leaving the V1-specific material in and indicating that other
> > > implementations are possible and will be discussed in the documents
> > > for the appropriate versions and extensions.  As a result, while
> > > rfc5666bis is getting smaller, my document is larger than -04.
> > >
> > > These documents need to come together at some time, most likely
> > > after rfc5667bis is published (or at least in WGLC) and
> > > rpcrdma-version-two has been turned into a working group
> > > document.
> > >
> > > The prospect of doing so brings us back to our deferred discussion
> > > of the phyletics of the various forms of DDP.  My original plans had
> > > been to put this off until nfsulb was published (done), there was a
> > > new version of rtrext that addressed ULB issues (now working on it),
> and
> > > I added some ULB-related material to rpcrdma-version-two (on my todo
> > > list).
> > >
> > > As I looked at your comments about this issue, I still disagreed with
> them
> > > but decided that they raise issues that the discussion of ULB issues
> > > will have to address, and they might also lead to clarifications in the
> > > rest of rtrext.  So that we don't have to wait for some of these
> documents
> > > to come out (and read them all), let me address your issues below:
> > >
> > > > Explicit RDMA and send-based DDP are perhaps in the same
> > > > phylum but are truly distinct beasts.
> > >
> > > I agree they are different things.  There would be no point in defining
> > > send-based DDP if it was the same as V1-DDP.  Nevertheless, they
> > > serve the same purpose: transferring DDP-eligible items so that the
> > > receiver gets them in an appropriately sized and aligned buffer.  I
> > > think that is all that matters for this discussion.
> >
> > The numbered items in my previous e-mail demonstrate that the
> > purpose of DDP eligibility is to encourage ULB and implementation
> > choices that are appropriate when using offloaded RDMA. They do not
> > all apply to Send-based DDP, though some do.
> >
> > Or put another way, DDP eligibility is not general at all! It is
> > only "general" because Send-based DDP has been adjusted to fit it.
> >
> > For example, you have argued below that Send-based DDP is aimed at
> > improving a particular size I/O; yet data item size is one important
> > criterion for a ULB author to decide whether to make a data item
> > DDP-eligible.
> >
> > Suppose a ULP has a 512-byte opaque data item (opaque, meaning that
> > no XDR encoding or decoding is required) that would benefit from
> > avoiding a data copy because it is in a frequently-used operation.
> > You might not choose to make it DDP-eligible because an explicit
> > RDMA operation is inefficient. But it might be a suitable candidate
> > for Send-based DDP.
> >
> >
> > > To get back to phyletics, I would put the branching much lower in the
> > > tree.  To me a better analogy is to consider them as two distinct,
> > > closely-related orders, like rodentia and lagomorpha (rabbits,
> > > hares, pikas).
> > >
> > > It is true that there are reasons one might choose to move a
> DDP-eligible
> > > data item using one or the other but that is an implementation
> decision.
> > > Efficiency considerations and direction of transfer may affect this
> > > implementation decision, as is the fact, mentioned in rtrext, that if
> the
> > > client wants the response to be placed at a particular address, you
> > > shouldn't use send-based DDP.  But none of this affects the fact that
> > > only DDP-eligible items can be placed by send-based DDP, just as is
> the case
> > > with V1-DDP.
> > >
> > >
> > > > 1. The rfc5666bis definition of DDP eligibility makes sense
> > > > if you are dealing with a mechanism that is more efficient
> > > > with large payloads than small. Send-based DDP is just the
> > > > opposite.
> > >
> > > It depends on what you mean by "large" and "small".
> > > Clearly however, neither technique makes sense for very
> > > small payloads and both make most sense for payloads
> > > over 4K, although the specifics will depend on efficiency and
> > > how the send-based DDP receiver chooses to structure its
> > > buffers.
> > >
> > > For me the sweet spot for send-based DDP would be 8K
> > > transfers, and for this reason structured buffers of 9K
> > > (consisting of 1K primarily for inline payload
> > > together with an 8K aligned segment for direct placement)
> > > are a good choice.  Even larger buffers are possible, but the
> > > fact that message continuation can be used means that
> > > send-based DDP should have no problems with 64K
> > > transfers.
> >
> > With Remote Invalidation, NFS READ can be handled with a Write
> > chunk with nearly the same efficiency and latency as a mechanism
> > that uses only Send. Yes, even with small NFS READ payloads,
> > whose performance doesn't matter much anyway. Let's take that off
> > the table.
> >
> > And I think we agree that payloads larger than a few dozen KB are
> > better transferred using explicit RDMA, for a variety of reasons.
> >
> > IMO the only significant case where Send-based DDP is interesting
> > is replacing RDMA Read of moderate-sized payloads (either NFS
> > WRITE data, or Long Call messages). Let's look at NFS WRITE first.
> >
> > Using a large inline threshold (say, 12KB), the server would have
> > to copy some or all of an 8KB payload into place after a large
> > Receive. Server implementations typically already have to copy
> > this data.
> >
> > Why?
> >
> > Because page-flipping a Receive buffer cannot work when the NFS
> > WRITE is modifying data in a file page that already has data on it.
> >
> > Remember that Receive buffers are all the same; all RPC Call
> > traffic goes through them. The hardware picks the Receive buffer
> > arbitrarily; the ULP can't choose in advance which RPC goes into
> > which Receive buffer. Flipping the buffer is the only choice to
> > avoid a data copy.
> >
> > Suppose the server platform's page size is 64KB, or even 2MB.
> > Then, 8KB NFS WRITEs are all sub-page modifications, and copying
> > would be required for each I/O. Explicit RDMA works fine in that
> > case without a data copy.
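> >
> > To illustrate (hypothetical helper; 64KB is just the example page size):
> > with a sub-page WRITE the server must merge the received 8KB into the
> > page that already holds the surrounding 56KB, so a copy is unavoidable:
> >
> >     #include <stddef.h>
> >     #include <string.h>
> >
> >     #define SRV_PAGE_SIZE (64 * 1024)   /* example platform page size */
> >
> >     /* The 8KB payload lands somewhere inside an existing 64KB cached
> >      * page; flipping the Receive buffer in would clobber the other
> >      * 56KB, so the payload has to be copied into place. */
> >     static void apply_subpage_write(void *cached_page, size_t off_in_page,
> >                                     const void *recv_buf, size_t len)
> >     {
> >             memcpy((char *)cached_page + off_in_page, recv_buf, len);
> >     }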
> >
> > Host CPU data copy has latency comparable to or better than flipping
> > a receive buffer into a file, allocating a fresh receive buffer,
> > and DMA mapping it, even for payloads as large as 4KB or 8KB.
> > Eliminating a server-side data copy on small NFS WRITEs is gravy,
> > not real improvement.
> >
> > As far as I can tell, Send-based DDP is not able to handle the
> > sub-page case at all without a copy. That could add an extra rule
> > for using Send-based DDP over and above DDP eligibility, couldn't
> > it?
> >
> > As soon as one realizes that Send-based DDP is effective only
> > when the payload is aligned to the server's page size, that it depends on
> > the surrounding file contents, and that it is less efficient than
> > explicit RDMA when the payload size is larger than a few KB,
> > introducing the complexity of Send-based DDP to the protocol or
> > an implementation becomes somewhat less appealing.
> >
> > A better approach would attempt to address the actual problem,
> > which is the need to use RDMA Read for moderate payloads. This is
> > for both NFS WRITE and large RPC Call messages.
> >
> > 1. Long Call: large inline threshold or Message Continuation can
> > easily address this case; and it is actually not frequent enough
> > to be a significant performance issue.
> >
> > 2. NFS WRITE: Message Continuation works here, with care taken to
> > align the individual sub-messages to server pages, if that's
> > possible. Or, use push mode so the client uses RDMA Write in this
> > case.
> >
> > The latter case would again be similar to DDP-eligibility, but not
> > the same. The same argument data items are allowed, but additional
> > conditions and treatment are required.
> >
> >
> > > it is true that very large transfers might do better using
> > > V1-DDP but that is beside the point.  Any READ buffer is
> > > DDP-eligible (even for a 1-byte READ) and the choice of
> > > whether to use DDP (or the sort to use) in any particular
> > > case is up to the implementation.  The ULB shouldn't
> > > be affected.
> >
> > > > 2. Re-assembling an RPC message that has been split by
> > > > message continuation or send-based DDP is done below the XDR
> > > > layer, since all message data is always available once Receive
> > > > is complete, and XDR decoding is completely ordered with the
> > > > receipt of the RPC message. With explicit RDMA, receipt of
> > > > some arguments MAY be deferred until after XDR decoding
> > > > begins.
> > >
> > > First of all, message continuation isn't relevant to this
> > > discussion.
> > >
> > > With regard to send-based DDP, it is true that you have
> > > the placed data immediately, while you might have to wait
> > > in the V1-DDP case, but this doesn't affect  DDP-eligibility.
> >
> > > In either case, before you process the request or response,
> > > all the data, whether inline or placed needs to be available.
> > > I don't see how anything else matters.
> >
> > It can matter greatly for NFS WRITE / RDMA Read.
> >
> > In order for a server to set up the RDMA Read so the payload goes
> > into either a file's page cache or into the correct NVM pages,
> > the server has to parse the NFS header information that indicates
> > what filehandle is involved and the offset and count, before it
> > constructs the RDMA Read(s).
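> >
> > In verbs terms, the sketch below shows why the decode has to happen
> > first: the local SGE for the RDMA Read cannot be built until the
> > filehandle, offset, and count identify which file pages to use.  The
> > page lookup is assumed to have happened already; only the ibv_* calls
> > are real API, the rest is hypothetical:
> >
> >     #include <stdint.h>
> >     #include <infiniband/verbs.h>
> >
> >     /* file_pages: where the decoded (fh, offset, count) says the data
> >      * belongs (page cache or NVM), already registered as 'mr'.
> >      * remote_addr/rkey/count come from the Read chunk in the Call. */
> >     static int pull_write_payload(struct ibv_qp *qp, struct ibv_mr *mr,
> >                                   void *file_pages, uint32_t count,
> >                                   uint64_t remote_addr, uint32_t rkey)
> >     {
> >             struct ibv_sge sge = {
> >                     .addr   = (uintptr_t)file_pages,
> >                     .length = count,
> >                     .lkey   = mr->lkey,
> >             };
> >             struct ibv_send_wr wr = {
> >                     .opcode              = IBV_WR_RDMA_READ,
> >                     .sg_list             = &sge,
> >                     .num_sge             = 1,
> >                     .send_flags          = IBV_SEND_SIGNALED,
> >                     .wr.rdma.remote_addr = remote_addr,
> >                     .wr.rdma.rkey        = rkey,
> >             };
> >             struct ibv_send_wr *bad_wr;
> >
> >             return ibv_post_send(qp, &wr, &bad_wr);
> >     }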
> >
> > Without that information, the server performs the RDMA Read using
> > anonymous pages and either flips those pages into the file, or in
> > the case of NVM, it must copy the data into persistent storage
> > pages associated with that file.
> >
> > That's the reason delayed RDMA Read is still important.
> >
> > Now, Send-based DDP can handle NVM-backed files in one of two ways:
> >
> > 1. Receive into anonymous DRAM pages and copy the payload into NVM;
> > results in a (possibly undesirable) host CPU data copy
> >
> > 2. Receive into NVM Receive buffers; but then all of your NFS
> > operations will have to go through NVM. Maybe not an issue for
> > DRAM backed with flash, but it is a potential problem for
> > technologies like 3D XPoint which are 2x or more slower than
> > DRAM.
> >
> > Remember, for RDMA Read, we are talking about NFS WRITE or
> > pulling moderate-sized arguments from the client on a Call. In
> > that case, a server-side host CPU copy is less likely to interfere
> > with an application workload, and thus is not as onerous as it
> > might be on the client.
> >
> >
> > > > In fact, the current RPC-over-RDMA protocol is designed
> > > > around that. For example the mechanism of reduction relies on
> > > > only whole data items being excised to ensure XDR round-up of
> > > > variable-length data types is correct. That is because an
> > > > implementation MAY perform RDMA Read in the XDR layer, not
> > > > in the transport.
> > >
> > > It may be that send-based DDP could have been designed
> > > to accommodate cases that V1-DDP could not but the fact is that
> > > it wasn't because:
> > >       • There are no common cases in which it would provide a benefit.
> > >       • It would complicate the design to make this something other
> than a drop-in replacement for V1-DDP.
> > > > 3. Message continuation, like a Long message, enables an RPC
> > > > message payload to be split arbitrarily, without any regard
> > > > to DDP eligibility or data item boundaries.
> > >
> > > Exactly right but this just demonstrates that message
> > > continuation and DDP-eligibility should not affect one
> > > another and they don't.
> >
> > Seems to me you can do a form of Send-based DDP with just
> > Message Continuation; perhaps just enough that an additional
> > Send-based DDP mechanism is not necessary.
> >
> > And Send-based DDP isn't terribly valuable without a fully
> > send-based mechanism for transferring large payloads. So
> > it can't be done without Message Continuation, unless I'm
> > mistaken.
> >
> > These two mechanisms seem well-entwined.
> >
> >
> > > > 4. Send-based DDP is not an offload transfer. It involves the
> > > > host CPUs on both ends.
> > >
> > > Not sure what you mean  by that.
> >
> > For Send/Receive, the receiver's host CPU always takes an
> > interrupt, and the host CPU performs transport header
> > processing. (even worse for Message Continuation, where one
> > RPC can now mean more than one interrupt on the receiving
> > CPU).
> >
> > With RDMA Write, for example, the receiver does not take an
> > interrupt, and the wire header is processed by the HCA.
> >
> >
> > >  Both are the same in that:
> > >       • You don't need to copy data on either side
> > >       • Both sides' CPUs need to deal with where the data is.  The
> adapter has to be told where the data is to be sent/transferred from while the
> receiving side has to note where the placed data is (i.e. not in the normal
> XDR stream).
> > > > Thus it doesn't make sense to restrict
> > > > the DDP eligibility of data items that require XDR marshaling
> > > > and unmarshaling.
> > >
> > > This sounds like you think copying is required.  It isn't.
> >
> > Yes, it is. As above, sometimes with Receive a partial or
> > whole copy of the payload is unavoidable.
> >
> >
> > > There is no point in using send-based DDP to transfer an
> > > item that requires marshalling/unmarshalling and the situation
> > > is the same as  for V1-DDP.  There is no point in specifically
> > > placing data that would get no benefit from being specifically
> > > placed.
> >
> > > > A receiver has generic receive resources that have to
> > > > accommodate any RPC that is received (via Send/Receive).
> > >
> > > True.
> > >
> > > > DDP
> > > > eligibility is designed to prepare the receiver for exactly
> > > > one particular data item that MAY involve the ULP in the
> > > > reconstruction of the whole RPC message.
> > >
> > > True.
> >
> > > On Wed, Feb 1, 2017 at 6:04 PM, Chuck Lever <chuck.lever@oracle.com>
> wrote:
> > >
> > > > On Feb 1, 2017, at 5:04 PM, David Noveck <davenoveck@gmail.com>
> wrote:
> > > >
> > > > > Within NFS version 4.0, there are a few variable-length result data
> > > > > items whose maximum size cannot be estimated by clients reliably
> > > > > because there is no protocol-specified size limit on these
> > > > > structures.
> > > >
> > > > I like this but I have a problem with the word "structures".   The
> only case
> > > > we are talking about is a GETATTR response, and the XDR defines that
> as an
> > > > opaque array, even though the client and server both are supposed to
> know
> > > > what is in it.  "arrays" would work.
> > >
> > > Agreed, fixed.
> > >
> > > This section also mentioned inline thresholds and Reply chunks,
> > > and that looks superfluous. I removed it.
> > >
> > >
> > > > On Wed, Feb 1, 2017 at 4:47 PM, Chuck Lever <chuck.lever@oracle.com>
> wrote:
> > > >
> > > > > On Feb 1, 2017, at 4:27 PM, karen deitke <karen.deitke@oracle.com>
> wrote:
> > > > >
> > > > >
> > > > >
> > > > > On 1/30/2017 5:28 PM, Chuck Lever wrote:
> > > > >> Hi Karen, thanks for taking time to review.
> > > > >>
> > > > >>
> > > > >>
> > > > >>> On Jan 27, 2017, at 6:40 PM, karen deitke <
> karen.deitke@oracle.com>
> > > > >>>  wrote:
> > > > >>>
> > > > >>>
> > > > >>> Hi Chuck,
> > > > >>>
> > > > >>> o Section 2.6, second to last paragraph, typo, priori should be
> prior.
> > > > >>>
> > > > >> "a priori" is Latin for "in advance" (literally, "from the
> earlier"). I've
> > > > >> replaced this phrase.
> > > > >>
> > > > > Thanks, I don't speak Latin :-)
> > > > >
> > > > >>
> > > > >>
> > > > >>
> > > > >>> o Same section, "Note however that many operations normally
> considered non-idempotent (e.g. WRITE, SETATTR) are actually idempotent".
> This was confusing because earlier in the paragraph you are talking about
> determining the reply size for variable length replies, which doesn't seem to
> be the case for WRITE, SETATTR.
> > > > >>>
> > > > >> That's correct, WRITE doesn't have a variable length reply. This
> section
> > > > >> discusses a possible COMPOUND with WRITE (fixed length result,
> non-idempotent)
> > > > >> and GETATTR (variable length result, idempotent). I've clarified
> it.
> > > > >>
> > > > > ok
> > > > >>
> > > > >>
> > > > >>
> > > > >>> o 4.2.1 second paragraph would read better if you removed the
> word "however"
> > > > >>>
> > > > >> I was not able to find the word "however" in the second paragraph
> of S4.2.1.
> > > > >>
> > > > >> There are certain NFS version 4 data items whose size cannot be
> > > > >> estimated by clients reliably, however, because there is no
> > > > >> protocol-specified size limit on these structures. These include:
> > > > >>
> > > > >> chunk to disambiguate which chunk is associated with which argument
> > > > >> data item. However NFS version 4 server and client implementations
> > > > > Sorry, it's in 4.2:
> > > > >
> > > > > "There are certain NFS version 4 data items whose size cannot be
> estimated by clients reliably, however, because there is no
> protocol-specified size limit on these structures."
> > > >
> > > > How about this:
> > > >
> > > > Within NFS version 4.0, there are a few variable-length result data
> > > > items whose maximum size cannot be estimated by clients reliably
> > > > because there is no protocol-specified size limit on these
> > > > structures.
> > > >
> > > >
> > > > >>>
> > > > >>> "If a client implementation is equipped to recognize that a
> transport error could mean that it provisioned an inadequately sized Reply
> chunk, it can retry the operation with a larger Reply chunk.  Otherwise,
> the client must terminate the RPC transaction."
> > > > >>>
> > > > >>> So you are saying for example that if the client sends a request
> in which it knows that it has "guessed" the maximum response but that guess
> may be wrong, if it detects the connection dropped for example, it could
> potentially interpret that as the guess was too small and send again?
> > > > >>>
> > > > >> As previously discussed.
> > > > >>
> > > > >>
> > > > >>
> > > > >>> 4.3, second from last bullet, "If an READ", should be "If a READ"
> > > > >>>
> > > > >> Fixed.
> > > > >>
> > > > >>
> > > > >>
> > > > >>> 4.5.  Why does the xid in the rdma header have to be used to get
> the session info, wouldn't the session info in the sequence_op that was sent
> in the first place have this info?
> > > > >>>
> > > > >> Here's the last paragraph of Section 4.5:
> > > > >>
> > > > >>    In addition, within the error response, the requester does not
> have
> > > > >>    the result of the execution of the SEQUENCE operation, which
> > > > >>    identifies the session, slot, and sequence id for the request
> which
> > > > >>    has failed.  The xid associated with the request, obtained
> from the
> > > > >>    rdma_xid field of the RDMA_ERROR or RDMA_MSG message, must be
> used to
> > > > >>    determine the session and slot for the request which failed,
> and the
> > > > >>    slot must be properly retired.  If this is not done, the slot
> could
> > > > >>    be rendered permanently unavailable.
> > > > >>
> > > > >> The mention of RDMA_MSG is probably incorrect. In that case, the
> server
> > > > >> was able to return a SEQUENCE result, and that should be useable
> by the
> > > > >> client.
> > > > >>
> > > > >> Without an RPC Reply message, however, the client matches the XID
> in the
> > > > >> ERR_CHUNK message to a previous call and that will have the
> matching
> > > > >> SEQUENCE operation.
> > > > >>
> > > > > I'm still not following this.  Is the issue that the client would
> resend, but the seq of the slot is incremented and we wouldn't potentially
> get anything the server had in the slot replay?
> > > >
> > > > Is there a way to recover if the server can't send a SEQUENCE result
> for
> > > > that slot? Seems like the slot would be stuck on the same RPC until
> the
> > > > session is destroyed.
> > > >
> > > >
> > > > > Karen
> > > > >>
> > > > >> That makes this rather a layering violation, and perhaps a reason
> why
> > > > >> retransmitting with a larger Reply chunk might be a cure worse
> than the
> > > > >> disease.
> > > > >>
> > > > >> I'm beginning to believe that making this situation always a
> permanent
> > > > >> error, as rfc5666bis does, is a better protocol choice.
> > > > >>
> > > > >>
> > > > >>
> > > > >>> Karen
> > > > >>> On 1/20/2017 5:27 PM,
> > > > >>> internet-drafts@ietf.org
> > > > >>>  wrote:
> > > > >>>
> > > > >>>> A New Internet-Draft is available from the on-line
> Internet-Drafts directories.
> > > > >>>> This draft is a work item of the Network File System Version 4
> of the IETF.
> > > > >>>>
> > > > >>>>         Title           : Network File System (NFS) Upper Layer
> Binding To RPC-Over-RDMA
> > > > >>>>         Author          : Charles Lever
> > > > >>>>    Filename        : draft-ietf-nfsv4-rfc5667bis-04.txt
> > > > >>>>    Pages           : 18
> > > > >>>>    Date            : 2017-01-20
> > > > >>>>
> > > > >>>> Abstract:
> > > > >>>>    This document specifies Upper Layer Bindings of Network File
> System
> > > > >>>>    (NFS) protocol versions to RPC-over-RDMA.  Upper Layer
> Bindings are
> > > > >>>>    required to enable RPC-based protocols, such as NFS, to use
> Direct
> > > > >>>>    Data Placement on RPC-over-RDMA.  This document obsoletes
> RFC 5667.
> > > > >>>>
> > > > >>>>
> > > > >>>>
> > > > >>>> The IETF datatracker status page for this draft is:
> > > > >>>>
> > > > >>>>
> > > > >>>> https://datatracker.ietf.org/doc/draft-ietf-nfsv4-rfc5667bis/
> > > > >>>>
> > > > >>>>
> > > > >>>>
> > > > >>>> There's also a htmlized version available at:
> > > > >>>>
> > > > >>>>
> > > > >>>> https://tools.ietf.org/html/draft-ietf-nfsv4-rfc5667bis-04
> > > > >>>>
> > > > >>>>
> > > > >>>>
> > > > >>>> A diff from the previous version is available at:
> > > > >>>>
> > > > >>>>
> > > > >>>> https://www.ietf.org/rfcdiff?url2=draft-ietf-nfsv4-rfc5667bis-04
> > > > >>>>
> > > > >>>>
> > > > >>>>
> > > > >>>>
> > > > >>>> Please note that it may take a couple of minutes from the time
> of submission
> > > > >>>> until the htmlized version and diff are available at
> tools.ietf.org.
> > > > >>>>
> > > > >>>> Internet-Drafts are also available by anonymous FTP at:
> > > > >>>>
> > > > >>>>
> > > > >>>> ftp://ftp.ietf.org/internet-drafts/
> > > > >>>>
> > > > >>>>
> > > > >>>>
> > > > >>>> _______________________________________________
> > > > >>>> nfsv4 mailing list
> > > > >>>>
> > > > >>>>
> > > > >>>> nfsv4@ietf.org
> > > > >>>> https://www.ietf.org/mailman/listinfo/nfsv4
> > > > >>> _______________________________________________
> > > > >>> nfsv4 mailing list
> > > > >>>
> > > > >>> nfsv4@ietf.org
> > > > >>> https://www.ietf.org/mailman/listinfo/nfsv4
> > > > >> --
> > > > >> Chuck Lever
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >
> > > >
> > > > --
> > > > Chuck Lever
> > > >
> > > >
> > > >
> > > > _______________________________________________
> > > > nfsv4 mailing list
> > > > nfsv4@ietf.org
> > > > https://www.ietf.org/mailman/listinfo/nfsv4
> > > >
> > >
> > > --
> > > Chuck Lever
> > >
> > >
> > >
> > >
> >
> > --
> > Chuck Lever
> >
> >
> >
> >
>
> --
> Chuck Lever
>
>
>
>