Re: [nfsv4] I-D Action: draft-ietf-nfsv4-rfc5667bis-04.txt

Chuck Lever <chuck.lever@oracle.com> Fri, 03 February 2017 21:10 UTC

From: Chuck Lever <chuck.lever@oracle.com>
In-Reply-To: <CADaq8jfVrdZKe=VSA8z=Nqd9hsL2M2xKQ8SHNy0+BtCeQGfAQA@mail.gmail.com>
Date: Fri, 03 Feb 2017 16:09:55 -0500
Message-Id: <5B6B324E-2BA1-401E-9846-97DE023A579A@oracle.com>
References: <148495844040.13416.10356809202500126242.idtracker@ietfa.amsl.com> <338b603b-8be3-f7f3-d7e0-021d8185f8ec@oracle.com> <E0D9D91E-9245-4846-842A-1F75A9A8D4A4@oracle.com> <e69f5b01-cdeb-2159-45ff-485118a6022f@oracle.com> <33525335-7563-475D-9DC0-59BB6FE8DF18@oracle.com> <CADaq8jfMtABtHuDrMKjWSLQv-EX7dr9emUyUdJ3WG4p9X2Jfnw@mail.gmail.com> <B8FB2639-6666-4EDC-BC40-9A4F8F4BAF70@oracle.com> <CADaq8jdHJW7NFywhykOK-8Z25Qx7GhjQNnbBbz=fA8a-6_WKsg@mail.gmail.com> <D40250C2-D956-446E-9B98-FA423B75A9E9@oracle.com> <CADaq8jfVrdZKe=VSA8z=Nqd9hsL2M2xKQ8SHNy0+BtCeQGfAQA@mail.gmail.com>
To: David Noveck <davenoveck@gmail.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/nfsv4/cs7DltUrCqt0w_jtobtcQgPCBa0>
Cc: "nfsv4@ietf.org" <nfsv4@ietf.org>
Subject: Re: [nfsv4] I-D Action: draft-ietf-nfsv4-rfc5667bis-04.txt

> On Feb 3, 2017, at 1:04 PM, David Noveck <davenoveck@gmail.com> wrote:
> 
> A lot of this reads like pieces of a review of rtrext.  I'll deal with some issues within that later but first I want to focus on the concept of DDP-eligibility.
> 
> I'm not sure exactly what it would mean for a concept such as this to be completely general, and I don't think that's the issue.  For me, the important point is that it is a useful concept and that it has more than one possible implementation.  XDR items are categorized as either DDP-eligible or not.  I am not saying that additional useful categorizations cannot arise in the future, but I don't know of any, and there doesn't seem to be any useful way to prepare for them.
> 
> The existing categorization is such that there will be individual instances of items that implementations will not transfer using DDP.  This has always been treated as an implementation choice and ULBs do not say, for example, that READs of a few bytes are not DDP-eligible.  Instead, they are DDP-eligible and requesters have a way to specify that they be returned inline.  To do otherwise would make ULBs overly complicated and very hard (perhaps Sisyphean) to maintain.  ULBs are limited to specifying which items are DDP-eligible because that is what the requester and responder need to agree on.  If the requester wants to send data inline or have DDP-eligible response data returned inline that is up to him and ULBs do not need to say anything about this choice.
> 
> The same applies when there are multiple possible forms of DDP with different performance characteristics.  The requester chooses and the ULB does not need to instruct him about which choice to make.  To try to create rules about what choices the requester could make would make a ULB just about impossible to do.  Let's not go there.
> 
> Now let me give some background about rtrext.  Perhaps I've been unduly influenced by Talpey's figure of a million 8K random IOs per second, but the fact is that I was focused on that case.  In my experience, it is a very common case, deserving of significant attention.  I added message continuation to rtrext to allow reads of multiples of the block size (e.g. 8K) to get the same sort of benefit.

There is no denying that this is an important goal.

Unfortunately I've not found a way to approach that level
of throughput by shrinking per-op latency alone. It's
important to realize that there is a lower bound for per-op
latency that is determined by the physics of moving packets
on a physical network fabric.

High IOPS throughput is actually achieved by using multiple
QPs through multiple physical interfaces and networks linking
an SMB client and several storage targets (or in NFS terms:
pNFS).

So, if one RPC-over-RDMA connection can achieve 125KIOPS (and
I don't think that's unattainable, even with RPC-over-RDMA V1),
then eight connections might be able to swing a million IOPS,
given a careful client implementation. Getting to four
250KIOPS connections might be a challenge.
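To make the arithmetic concrete, here is a rough model (the
256us per-op latency and 32-slot queue depth below are
assumptions for illustration, not measurements): per-connection
IOPS is bounded by the number of RPCs in flight divided by
per-op round-trip latency.

/* Rough IOPS model: per-connection throughput is bounded by the number
 * of RPCs kept in flight divided by per-op round-trip latency.  The
 * latency and slot-count values are illustrative assumptions only. */
#include <stdio.h>

int main(void)
{
    double rtt_sec = 256e-6;    /* assumed per-op latency: 256 us   */
    unsigned int slots = 32;    /* assumed RPCs in flight per QP    */
    double target = 1e6;        /* one million 8K random IOs/sec    */

    double per_conn = slots / rtt_sec;   /* = 125,000 IOPS */

    printf("per-connection ceiling: %.0f IOPS\n", per_conn);
    printf("connections needed for %.0f IOPS: %.0f\n",
           target, target / per_conn);   /* = 8 */
    return 0;
}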

There are some critical prerequisites to multi-channel
operation:

- bidirectional RPC on RPC-over-RDMA to enable NFSv4.1
- multi-path NFS (Andy's work), which works best with NFSv4.1
- pNFS/RDMA, which requires NFSv4.1, to enable the use of
multiple DSes

Linux is now close to having all of this in place.


> Perhaps I did not devote the attention I should have to I/Os that are not an integral number of blocks, but my experience is that clients are typically structured to make this an unusual case.  YMMV.

> I don't claim that send-based DDP is universally applicable and I expect other forms of DDP to continue to exist and be useful.  With regard to the READ/WRITE distinction, I accept that most of the benefits are for WRITE and that many implementations might reasonably choose to use V1-like DDP for READ while using send-based DDP for WRITE.  However, there are many environments in which remote invalidation is not possible (e.g. user-level server) and for these there are significant benefits from using send-based DDP for READ as well.
> 
> With regard to the sections that read like pieces of a review of rtrext, it is hard to respond because the text often makes reference to potential extensions for which there is no clear definition.   I think we have to address these issues:
> 	• For push mode, it makes sense to define this as another optional extension to Version Two.  We need a volunteer to write rpcrdma-pushext.  If we don't have one now, we should try again after rpcrdma-version-two becomes a working group document.

At the moment, push mode looks like a new pNFS layout type,
thus it is independent of which version of RPC-over-RDMA is
in use. Agreed, that all has to be written up. Christoph or
I could make a start at it, but I'm not volunteering anyone's
time. It would be a short document similar to

http://xml2rfc.tools.ietf.org/public/rfc/bibxml3/reference.I-D.draft-hellwig-nfsv4-scsi-layout-nvme-00.xml

which unfortunately appears to have expired.


> 	• With regard to speculation about a possible form of Send-based DDP with just Message Continuation, I'm skeptical, but if there were to be some clear definition, it could be added as an option to rtrext-0x, for x >= 2.

I was thinking of starting with an independent document
(and extension) just to draw the lines clearly. I've been
busy with other things, and haven't found an opportunity
to write it up.


> With regard to issues related to challenges due to large page sizes, I see the matter differently.  I think it is possible to avoid copying without using page flipping, but by using what I'll call "buffer flipping", even when page sizes are large or there is no use of virtual memory at all.  One might use page flipping only as a way to avoid copies on page-size units that are smaller than the buffer size (e.g. 4K pages with 8K buffers).
> 
> Let's assume the buffer cache uses an 8K buffer size and that the page size is either larger or non-existent (i.e., no virtual memory).  Receive buffers are structured to include a section for inline payload (1-2K) together with an 8K buffer within the buffer cache.  When posted for receive, this buffer should have no valid data within it, and the associated buffer header should indicate that it is neither free nor assigned to a particular file location.  Once data is placed into it by a receive which gets a WRITE request, that buffer can be assigned to a location within a file and the buffer header updated to reflect this location.  The mapping from file and block number to buffer address is changed, but the address of the buffer remains the same: it is the one into which the data was placed.
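Here is a rough sketch of the bookkeeping described above; the
structure and function names are invented for illustration, and
the 1K/8K sizes are simply the ones assumed in this discussion.

/* Hypothetical sketch of the "buffer flipping" scheme described above:
 * each posted receive buffer carries a small inline area plus an 8K
 * buffer-cache-sized payload area.  After a WRITE lands in it, the
 * buffer header is updated to map (file, block) to this buffer; the
 * data itself is never copied.  All names here are made up. */
#include <stdint.h>
#include <stdbool.h>

#define INLINE_SIZE   1024
#define PAYLOAD_SIZE  8192

struct rbuf {
    /* buffer header */
    bool      assigned;    /* false while posted for receive */
    uint64_t  fileid;      /* valid only when assigned       */
    uint64_t  block;       /* valid only when assigned       */

    /* receive areas */
    uint8_t   inline_area[INLINE_SIZE];  /* RPC/transport header */
    uint8_t   payload[PAYLOAD_SIZE];     /* DDP-eligible data    */
};

/* Called after a received WRITE has been parsed from the inline area.
 * The payload already sits in buf->payload; only metadata changes. */
static void assign_to_file(struct rbuf *buf, uint64_t fileid, uint64_t block)
{
    buf->fileid = fileid;
    buf->block = block;
    buf->assigned = true;
    /* the buffer-cache index would now map (fileid, block) -> buf */
}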

In operating systems I'm familiar with, the page cache
is the single basis of all file I/O. Sub-page-sized I/O
(i.e., a separate buffer cache) is no longer used.

Even if a particular file system implementation were
to use it, those buffers would not be exposed above
the VFS, and thus NFS servers would have no visibility
of them. Again, that's what I'm familiar with, maybe
not universally true, and perhaps I've misunderstood
something.

A Message Continuation approach could use a "flipping"
mechanism without the requester having to parcel out
chunk lists.

The key is separating the Transport header from the
Payload. By fixing the size of the Transport header,
a receiver can get the incoming Payload to land in
a separate buffer from the Transport header, always.
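One way to do that with verbs is to post each Receive with a
two-element scatter list: a small buffer sized to the fixed
Transport header, and a separate aligned buffer for the Payload.
A rough sketch (RPCRDMA_HDR_SIZE and the buffer sizes are
assumptions, not values defined by any protocol document; setup
and error handling are omitted):

/* Sketch: post a Receive whose scatter list splits the fixed-size
 * transport header from the payload, so the payload always lands in
 * its own aligned buffer.  RPCRDMA_HDR_SIZE is an assumed value;
 * qp, hdr_mr, and pay_mr are set up elsewhere. */
#include <string.h>
#include <stdint.h>
#include <infiniband/verbs.h>

#define RPCRDMA_HDR_SIZE  128   /* assumed fixed transport header size */
#define PAYLOAD_BUF_SIZE  8192  /* assumed receiver payload buffer size */

static int post_split_recv(struct ibv_qp *qp,
                           struct ibv_mr *hdr_mr, struct ibv_mr *pay_mr)
{
    struct ibv_sge sge[2];
    struct ibv_recv_wr wr, *bad;

    memset(&wr, 0, sizeof(wr));

    sge[0].addr   = (uintptr_t)hdr_mr->addr;   /* transport header */
    sge[0].length = RPCRDMA_HDR_SIZE;
    sge[0].lkey   = hdr_mr->lkey;

    sge[1].addr   = (uintptr_t)pay_mr->addr;   /* payload stream */
    sge[1].length = PAYLOAD_BUF_SIZE;
    sge[1].lkey   = pay_mr->lkey;

    wr.sg_list = sge;
    wr.num_sge = 2;

    return ibv_post_recv(qp, &wr, &bad);
}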

A sender might then rely on the fact that the
receiver's inline threshold is related to its
preferred page/buffer cache size, and break each
message up so that certain data items in the Payload
stream land in independent buffers on the receiver
(again, this is similar to an experimental approach
being considered for NFS on TCP).

RDMA Receive preserves the sender's Send ordering.
The receiver simply concatenates the incoming messages
related to the same XID, grouping them as is convenient
to it.
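A rough sketch of that reassembly step; the structures here are
hypothetical and do not correspond to any defined wire format.

/* Hypothetical reassembly sketch: Receives complete in the sender's
 * Send order, so segments bearing the same XID are simply appended to
 * that XID's partial message until the final segment arrives. */
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

#define MAX_SEGS  16

struct partial_msg {
    uint32_t      xid;
    unsigned int  nsegs;
    struct {
        void   *buf;
        size_t  len;
    } seg[MAX_SEGS];
    bool          complete;
};

/* Append one received segment to the message identified by its XID. */
static void add_segment(struct partial_msg *msg, uint32_t xid,
                        void *buf, size_t len, bool last)
{
    if (msg->nsegs >= MAX_SEGS)
        return;              /* overflow handling elided */

    msg->xid = xid;
    msg->seg[msg->nsegs].buf = buf;
    msg->seg[msg->nsegs].len = len;
    msg->nsegs++;
    if (last)
        msg->complete = true;  /* hand the whole RPC to the ULP */
}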

Receiving such messages should work correctly for any
arbitrary division of Payload stream contents. Benefits
accrue when receivers can advertise to senders exactly
how they want to see RPC payloads divided into messages.
Perhaps a ULB could be useful here.

This arrangement is still unable to handle what I'll
term "zero copy sub-buffer modification". In more
abstract terms, I wonder if we can continue to label
this Direct Data Placement, since the buffer has to be
"flipped" into place after receive; it's not a bona
fide update-in-place. What we are attempting to do
here is more akin to zero-copy receive.


> I think the conformance issues you have mentioned are worth thinking about but there will be differences of opinion about how important these are.  In any case, the discussion has resulted in some conclusions about the use of various forms of DDP that need to be captured somewhere other than in an email thread.
> 
> I can see why you don't want this stuff in rfc5667bis.  I don't think it belongs in nfsulb (or in an update to rfc5667bis based on it) either.  For now, I'll add an implementation-choice section to rtrext-02, but ultimately I don't think this kind of stuff belongs in a standards-track document.  I think that eventually, when the set of extensions in Version Two is more settled (and larger than one), we could publish an Informational RFC that provides implementation guidance about performance-relevant choices that implementers might need to be made aware of.

Yes, that is one possibility.


> 
> On Thu, Feb 2, 2017 at 3:45 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
> 
> > On Feb 2, 2017, at 8:53 AM, David Noveck <davenoveck@gmail.com> wrote:
> >
> > > This section also mentioned inline thresholds and Reply chunks,
> > > and that looks superfluous. I removed it.
> >
> > A whole lot of rfc5667 wound up essentially repeating stuff
> > in rfc5666.  A lot of that has been removed and that work is
> > continuing.
> >
> > The interesting question is when that process should stop.  In
> > particular, we need to understand if it can stop short of simply
> > providing the information required of ULBs in rfc5666bis.
> 
> I'm totally OK with removing redundant language from rfc5667bis,
> but we should not pretend that DDP eligibility is completely
> general or that we understand perfectly how all DDP transport
> mechanisms will work in the future.
> 
> I'd like to be sure we are not creating a Sisyphean task.
> 
> 
> > When
> > I suggested going that way, you didn't like the idea and I had
> > doubts about whether it was workable,
> 
> For example, some repetition of rfc5666bis language is helpful as
> implementation guidance.
> 
> 
> > Nevertheless, if we
> > have material beyond what we need to meet those requirements,
> > we should understand why it is there.
> 
> There are other things that can go into a ULB that might not
> have been listed in rfc5666bis. Clearly one issue we have with
> NFS in particular is how to deal with replay caching above the
> transport.
> 
> 
> > Meanwhile, in nfsulb, I've been going in the opposite direction,
> > by leaving the V1-specific material in and indicating that other
> > implementations are possible and will be discussed in the documents
> > for the appropriate versions and extensions.  As a result, while
> > rfc5666bis is getting smaller, my document is larger than -04.
> >
> > These documents need to come together at some time, most likely
> > after rfc5667bis is published (or at least in WGLC) and
> > rpcrdma-version-two has been turned into a working group
> > document.
> >
> > The prospect of doing so brings us back to our deferred discussion
> > of the phyletics of the various forms of DDP.  My original plans had
> > been to put this off until nfsulb was published (done), there was a
> > new version of rtrext that addressed ULB issues (now working on it), and
> > I added some ULB-related material to rpcrdma-version-two (on my todo
> > list).
> >
> > As I looked at your comments about this issue, I still disagreed with them
> > but decided that they raise issues that the discussion of ULB issues
> > will have to address, and they might also lead to clarifications in the
> > rest of rtrext.  So that we don't have to wait for some of these documents
> > to come out (and read them all), let me address your issues below:
> >
> > > Explicit RDMA and send-based DDP are perhaps in the same
> > > phylum but are truly distinct beasts.
> >
> > I agree they are different things.  There would be no point in defining
> > send-based DDP if it was the same as V1-DDP.  Nevertheless, they
> > serve the same purpose: transferring DDP-eligible items so that the
> > receiver gets them in an appropriately sized and aligned buffer.  I
> > think that is all that matters for this discussion.
> 
> The numbered items in my previous e-mail demonstrate that the
> purpose of DDP eligibility is to encourage ULB and implementation
> choices that are appropriate when using offloaded RDMA. They do not
> all apply to Send-based DDP, though some do.
> 
> Or put another way, DDP eligibility is not general at all! It is
> only "general" because Send-based DDP has been adjusted to fit it.
> 
> For example, you have argued below that Send-based DDP is aimed at
> improving a particular size I/O; yet data item size is one important
> criterion for a ULB author to decide whether to make a data item
> DDP-eligible.
> 
> Suppose a ULP has a 512-byte opaque data item (opaque, meaning that
> no XDR encoding or decoding is required) that would benefit from
> avoiding a data copy because it is in a frequently-used operation.
> You might not choose to make it DDP-eligible because an explicit
> RDMA operation is inefficient. But it might be a suitable candidate
> for Send-based DDP.
> 
> 
> > To get back to phyletics, I would put the branching much lower in the
> > tree.  To me a better analogy is to consider them as two distinct,
> > closely-related orders, like rodentia and lagomorpha (rabbits,
> > hares, pikas).
> >
> > It is true that there are reasons one might choose to move a DDP-eligible
> > data item using one or the other but that is an implementation decision.
> > Efficiency considerations and direction of transfer may affect this
> > implementation decision, as does the fact, mentioned in rtrext, that if the
> > client wants the response to be placed at a particular address, you
> > shouldn't use send-based DDP.  But none of this affects the fact that
> > only DDP-eligible items can be placed by send-based DDP, just as is the case
> > with V1-DDP.
> >
> >
> > > 1. The rfc5666bis definition of DDP eligibility makes sense
> > > if you are dealing with a mechanism that is more efficient
> > > with large payloads than small. Send-based DDP is just the
> > > opposite.
> >
> > It depends on what you mean by "large" and "small".
> > Clearly however, neither technique makes sense for very
> > small payloads and both make most sense for payloads
> > over 4K, although the specifics will depend on efficiency and
> > how the send-based DDP receiver chooses to structure its
> > buffers.
> >
> > For me the sweet spot for send-based DDP would be 8K
> > transfers, and for this reason structured buffers of 9K
> > (consisting of 1K primarily for inline payload
> > together with an 8K aligned segment for direct placement)
> > are a good choice.  Even larger buffers are possible, but the
> > fact that message continuation can be used means that
> > send-based DDP should have no problems with 64K
> > transfers.
> 
> With Remote Invalidation, NFS READ can be handled with a Write
> chunk with nearly the same efficiency and latency as a mechanism
> that uses only Send. Yes, even with small NFS READ payloads,
> whose performance doesn't matter much anyway. Let's take that off
> the table.
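For reference, in verbs terms that server-side reply path amounts
to an RDMA Write into the Write chunk followed by a Send With
Invalidate of the chunk's rkey. A rough sketch, with setup and
error handling omitted:

/* Sketch: reply to an NFS READ by writing the payload into the
 * requester's Write chunk, then sending the RPC reply with Send With
 * Invalidate so the chunk's rkey is remotely invalidated. */
#include <string.h>
#include <stdint.h>
#include <infiniband/verbs.h>

static int send_read_reply(struct ibv_qp *qp,
                           struct ibv_sge *payload_sge,  /* local READ data   */
                           uint64_t chunk_addr, uint32_t chunk_rkey,
                           struct ibv_sge *reply_sge)    /* RPC reply message */
{
    struct ibv_send_wr wwr, swr, *bad;

    memset(&wwr, 0, sizeof(wwr));
    memset(&swr, 0, sizeof(swr));

    /* 1. Place the payload directly into the requester's Write chunk. */
    wwr.opcode              = IBV_WR_RDMA_WRITE;
    wwr.sg_list             = payload_sge;
    wwr.num_sge             = 1;
    wwr.wr.rdma.remote_addr = chunk_addr;
    wwr.wr.rdma.rkey        = chunk_rkey;
    wwr.next                = &swr;

    /* 2. Send the reply and remotely invalidate the chunk's rkey. */
    swr.opcode              = IBV_WR_SEND_WITH_INV;
    swr.sg_list             = reply_sge;
    swr.num_sge             = 1;
    swr.invalidate_rkey     = chunk_rkey;
    swr.send_flags          = IBV_SEND_SIGNALED;

    return ibv_post_send(qp, &wwr, &bad);
}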
> 
> And I think we agree that payloads larger than a few dozen KB are
> better transferred using explicit RDMA, for a variety of reasons.
> 
> IMO the only significant case where Send-based DDP is interesting
> is replacing RDMA Read of moderate-sized payloads (either NFS
> WRITE data, or Long Call messages). Let's look at NFS WRITE first.
> 
> Using a large inline threshold (say, 12KB), the server would have
> to copy some or all of an 8KB payload into place after a large
> Receive. Server implementations typically already have to copy
> this data.
> 
> Why?
> 
> Because page-flipping a Receive buffer cannot work when the NFS
> WRITE is modifying data in a file page that already has data on it.
> 
> Remember that Receive buffers are all the same, all RPC Call
> traffic goes through them. The hardware picks the Receive buffer
> arbitrarily; the ULP can't choose in advance which RPC goes into
> which Receive buffer. Flipping the buffer is the only choice to
> avoid a data copy.
> 
> Suppose the server platform's page size is 64KB, or even 2MB.
> Then, 8KB NFS WRITEs are all sub-page modifications, and copying
> would be required for each I/O. Explicit RDMA works fine in that
> case without a data copy.
> 
> A host CPU data copy has latency comparable to or better than
> flipping a receive buffer into a file, allocating a fresh receive
> buffer, and DMA-mapping it, even for payloads as large as 4KB or 8KB.
> Eliminating a server-side data copy on small NFS WRITEs is gravy,
> not real improvement.
> 
> As far as I can tell, Send-based DDP is not able to handle the
> sub-page case at all without a copy. That could add an extra rule
> for using Send-based DDP over and above DDP eligibility, couldn't
> it?
> 
> As soon as one realizes that Send-based DDP is effective only
> when the payload is aligned to the server's page size, that its
> effectiveness depends on the surrounding file contents, and that it
> is less efficient than explicit RDMA when the payload size is
> larger than a few KB,
> introducing the complexity of Send-based DDP to the protocol or
> an implementation becomes somewhat less appealing.
> 
> A better approach would attempt to address the actual problem,
> which is the need to use RDMA Read for moderate payloads. This is
> for both NFS WRITE and large RPC Call messages.
> 
> 1. Long Call: large inline threshold or Message Continuation can
> easily address this case; and it is actually not frequent enough
> to be a significant performance issue.
> 
> 2. NFS WRITE: Message Continuation works here, with care taken to
> align the individual sub-messages to server pages, if that's
> possible. Or, use push mode so the client uses RDMA Write in this
> case.
> 
> The latter case would again be similar to DDP-eligibility, but not
> the same. The same argument data items are allowed, but additional
> conditions and treatment are required.
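To illustrate the alignment idea only (Message Continuation has
no defined wire format, so the 8K receiver buffer size below is
purely an assumption), a client might plan segments so that each
continued message's payload portion fills one receiver buffer:

/* Hypothetical: split an NFS WRITE payload into continuation segments
 * whose payload portions each fill one aligned receiver buffer.  The
 * buffer size is assumed to be advertised by the receiver somehow;
 * nothing here reflects a defined Message Continuation format. */
#include <stdio.h>
#include <stddef.h>

#define RECV_BUF_SIZE  8192   /* assumed receiver payload buffer size */

static void plan_segments(size_t payload_len)
{
    size_t offset = 0;

    while (offset < payload_len) {
        size_t seg = payload_len - offset;

        if (seg > RECV_BUF_SIZE)
            seg = RECV_BUF_SIZE;
        printf("segment: payload bytes %zu..%zu\n",
               offset, offset + seg - 1);
        offset += seg;
    }
}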
> 
> 
> > It is true that very large transfers might be better off using
> > V1-DDP but that is beside the point.  Any READ buffer is
> > DDP-eligible (even for a 1-byte READ) and the choice of
> > whether to use DDP (or the sort to use) in any particular
> > case is up to the implementation.  The ULB shouldn't
> > be affected.
> 
> > > 2. Re-assembling an RPC message that has been split by
> > > message continuation or send-based DDP is done below the XDR
> > > layer, since all message data is always available once Receive
> > > is complete, and XDR decoding is completely ordered with the
> > > receipt of the RPC message. With explicit RDMA, receipt of
> > > some arguments MAY be deferred until after XDR decoding
> > > begins.
> >
> > First of all, message continuation isn't relevant to this
> > discussion.
> >
> > With regard to send-based DDP, it is true that you have
> > the placed data immediately, while you might have to wait
> > in the V1-DDP case, but this doesn't affect  DDP-eligibility.
> 
> > In either case, before you process the request or response,
> > all the data, whether inline or placed, needs to be available.
> > I don't see how anything else matters.
> 
> It can matter greatly for NFS WRITE / RDMA Read.
> 
> In order for a server to set up the RDMA Read so the payload goes
> into either a file's page cache or into the correct NVM pages,
> the server has to parse the NFS header information that indicates
> what filehandle is involved and the offset and count, before it
> constructs the RDMA Read(s).
> 
> Without that information, the server performs the RDMA Read using
> anonymous pages and either flips those pages into the file, or in
> the case of NVM, it must copy the data into persistent storage
> pages associated with that file.
> 
> That's the reason delayed RDMA Read is still important.
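A rough sketch of that delayed sequence; parse_write_args() and
file_page_sge() are hypothetical helpers standing in for real NFS
XDR decoding and page registration, and only the final verbs call
is a real API:

/* Sketch of "delayed" RDMA Read: the server parses the NFS WRITE
 * arguments first so the RDMA Read can land directly in the pages that
 * back (filehandle, offset, count). */
#include <string.h>
#include <stdint.h>
#include <stddef.h>
#include <infiniband/verbs.h>

struct write_args {
    uint64_t fileid;
    uint64_t offset;
    uint32_t count;
    uint64_t remote_addr;   /* Read chunk location on the client */
    uint32_t rkey;
};

/* Hypothetical: decode WRITE args from the inline RPC message. */
int parse_write_args(const void *inline_msg, size_t len, struct write_args *wa);

/* Hypothetical: register/map the file or NVM pages backing this range. */
int file_page_sge(const struct write_args *wa, struct ibv_sge *sge);

static int pull_write_payload(struct ibv_qp *qp,
                              const void *inline_msg, size_t len)
{
    struct write_args wa;
    struct ibv_sge sge;
    struct ibv_send_wr wr, *bad;

    if (parse_write_args(inline_msg, len, &wa) ||
        file_page_sge(&wa, &sge))
        return -1;

    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_READ;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.wr.rdma.remote_addr = wa.remote_addr;
    wr.wr.rdma.rkey        = wa.rkey;
    wr.send_flags          = IBV_SEND_SIGNALED;

    return ibv_post_send(qp, &wr, &bad);
}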
> 
> Now, Send-based DDP can handle NVM-backed files in one of two ways:
> 
> 1. Receive into anonymous DRAM pages and copy the payload into NVM;
> results in a (possibly undesirable) host CPU data copy
> 
> 2. Receive into NVM Receive buffers; but then all of your NFS
> operations will have to go through NVM. Maybe not an issue for
> DRAM backed with flash, but it is a potential problem for
> technologies like 3D XPoint which are 2x or more slower than
> DRAM.
> 
> Remember, for RDMA Read, we are talking about NFS WRITE or
> pulling moderate-sized arguments from the client on a Call. In
> that case, a server-side host CPU copy is less likely to interfere
> with an application workload, and thus is not as onerous as it
> might be on the client.
> 
> 
> > > In fact, the current RPC-over-RDMA protocol is designed
> > > around that. For example the mechanism of reduction relies on
> > > only whole data items being excised to ensure XDR round-up of
> > > variable-length data types is correct. That is because an
> > > implementation MAY perform RDMA Read in the XDR layer, not
> > > in the transport.
> >
> > It may be that send-based DDP could have been designed
> > to accommodate cases that V1-DDP could not but the fact is that
> > it wasn't because:
> >       • There are no common cases in which it would provide a benefit.
> >       • It would complicate the design to make this something other than a drop-in replacement for V1-DDP.
> > > 3. Message continuation, like a Long message, enables an RPC
> > > message payload to be split arbitrarily, without any regard
> > > to DDP eligibility or data item boundaries.
> >
> > Exactly right but this just demonstrates that message
> > continuation and DDP-eligibility should not affect one
> > another and they don't.
> 
> Seems to me you can do a form of Send-based DDP with just
> Message Continuation; perhaps just enough that an additional
> Send-based DDP mechanism is not necessary.
> 
> And Send-based DDP isn't terribly valuable without a fully
> send-based mechanism for transferring large payloads. So
> it can't be done without Message Continuation, unless I'm
> mistaken.
> 
> These two mechanisms seem well-entwined.
> 
> 
> > > 4. Send-based DDP is not an offload transfer. It involves the
> > > host CPUs on both ends.
> >
> > Not sure what you mean  by that.
> 
> For Send/Receive, the receiver's host CPU always takes an
> interrupt, and the host CPU performs transport header
> processing. (even worse for Message Continuation, where one
> RPC can now mean more than one interrupt on the receiving
> CPU).
> 
> With RDMA Write, for example, the receiver does not take an
> interrupt, and the wire header is processed by the HCA.
> 
> 
> >  Both are the same in that:
> >       • You don't need to copy data on either side
> >       • Both sides' CPUs need to deal with where the data is.  The adapter has to be told where the data is to be sent/transferred from, while the receiving side has to note where the placed data is (i.e., not in the normal XDR stream).
> > > Thus it doesn't make sense to restrict
> > > the DDP eligibility of data items that require XDR marshaling
> > > and unmarshaling.
> >
> > This sounds like you think copying is required.  It isn't.
> 
> Yes, it is. As above, sometimes with Receive a partial or
> whole copy of the payload is unavoidable.
> 
> 
> > There is no point in using send-based DDP to transfer an
> > item that requires marshalling/unmarshalling and the situation
> > is the same as  for V1-DDP.  There is no point in specifically
> > placing data that would get no benefit from being specifically
> > placed.
> 
> > > A receiver has generic receive resources that have to
> > > accommodate any RPC that is received (via Send/Receive).
> >
> > True.
> >
> > > DDP
> > > eligibility is designed to prepare the receiver for exactly
> > > one particular data item that MAY involve the ULP in the
> > > reconstruction of the whole RPC message.
> >
> > True.
> 
> > On Wed, Feb 1, 2017 at 6:04 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
> >
> > > On Feb 1, 2017, at 5:04 PM, David Noveck <davenoveck@gmail.com> wrote:
> > >
> > > > Within NFS version 4.0, there are a few variable-length result data
> > > > items whose maximum size cannot be estimated by clients reliably
> > > > because there is no protocol-specified size limit on these
> > > > structures.
> > >
> > > I like this but I have a problem with the word "structures".   The only case
> > > we are talking about is a GETATTR response, and the XDR defines that as an
> > > opaque array, even though the client and server both are supposed to know
> > > what is in it.  "arrays" would work.
> >
> > Agreed, fixed.
> >
> > This section also mentioned inline thresholds and Reply chunks,
> > and that looks superfluous. I removed it.
> >
> >
> > > On Wed, Feb 1, 2017 at 4:47 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
> > >
> > > > On Feb 1, 2017, at 4:27 PM, karen deitke <karen.deitke@oracle.com> wrote:
> > > >
> > > >
> > > >
> > > > On 1/30/2017 5:28 PM, Chuck Lever wrote:
> > > >> Hi Karen, thanks for taking time to review.
> > > >>
> > > >>
> > > >>
> > > >>> On Jan 27, 2017, at 6:40 PM, karen deitke <karen.deitke@oracle.com> wrote:
> > > >>>
> > > >>>
> > > >>> Hi Chuck,
> > > >>>
> > > >>> o Section 2.6, second to last paragraph, typo, priori should be prior.
> > > >>>
> > > >> "a priori" is Latin for "in advance" (literally, "from the earlier"). I've
> > > >> replaced this phrase.
> > > >>
> > > > Thanks, I don't speak Latin :-)
> > > >
> > > >>
> > > >>
> > > >>
> > > >>> o Same section, "Note however that many operations normally considered non-idempotent (e.g. WRITE, SETATTR) are actually idempotent".  This was confusing because earlier in the paragraph you are talking about determining the reply size for variable-length replies, which doesn't seem to be the case for WRITE, SETATTR.
> > > >>>
> > > >> That's correct, WRITE doesn't have a variable length reply. This section
> > > >> discusses a possible COMPOUND with WRITE (fixed length result, non-idempotent)
> > > >> and GETATTR (variable length result, idempotent). I've clarified it.
> > > >>
> > > > ok
> > > >>
> > > >>
> > > >>
> > > >>> o 4.2.1 second paragraph would read better if you removed the word "however"
> > > >>>
> > > >> I was not able to find the word "however" in the second paragraph of S4.2.1.
> > > >>
> > > >> There are certain NFS version 4 data items whose size cannot be
> > > >> estimated by clients reliably, however, because there is no protocol-
> > > >> specified size limit on these structures. These include:
> > > >>
> > > >> chunk to disambiguate which chunk is associated with which argument
> > > >> data item. However NFS version 4 server and client implementations
> > > > Sorry, it's in 4.2:
> > > >
> > > > "There are certain NFS version 4 data items whose size cannot be estimated by clients reliably, however, because there is no protocol-specified size limit on these structures."
> > >
> > > How about this:
> > >
> > > Within NFS version 4.0, there are a few variable-length result data
> > > items whose maximum size cannot be estimated by clients reliably
> > > because there is no protocol-specified size limit on these
> > > structures.
> > >
> > >
> > > >>>
> > > >>> "If a client implementation is equipped to recognize that a transport error could mean that it provisioned an inadequately sized Reply chunk, it can retry the operation with a larger Reply chunk.  Otherwise, the client must terminate the RPC transaction."
> > > >>>
> > > >>> So you are saying for example that if the client sends a request in which it knows that it has "guessed" the maximum response but that guess may be wrong, if it detects the connection dropped for example, it could potentially interpret that as the guess was too small and send again?
> > > >>>
> > > >> As previously discussed.
> > > >>
> > > >>
> > > >>
> > > >>> 4.3, second from last bullet, "If an READ", should be "If a READ"
> > > >>>
> > > >> Fixed.
> > > >>
> > > >>
> > > >>
> > > >>> 4.5.  Why does the xid in the rdma header have to be used to get the session info, wouldn't the session info in the sequence_op that was sent in the first place have this info?
> > > >>>
> > > >> Here's the last paragraph of Section 4.5:
> > > >>
> > > >>    In addition, within the error response, the requester does not have
> > > >>    the result of the execution of the SEQUENCE operation, which
> > > >>    identifies the session, slot, and sequence id for the request which
> > > >>    has failed.  The xid associated with the request, obtained from the
> > > >>    rdma_xid field of the RDMA_ERROR or RDMA_MSG message, must be used to
> > > >>    determine the session and slot for the request which failed, and the
> > > >>    slot must be properly retired.  If this is not done, the slot could
> > > >>    be rendered permanently unavailable.
> > > >>
> > > >> The mention of RDMA_MSG is probably incorrect. In that case, the server
> > > >> was able to return a SEQUENCE result, and that should be useable by the
> > > >> client.
> > > >>
> > > >> Without an RPC Reply message, however, the client matches the XID in the
> > > >> ERR_CHUNK message to a previous call and that will have the matching
> > > >> SEQUENCE operation.
> > > >>
> > > > I'm still not following this.  Is the issue that the  client would resend, but the seq of the slot is incremented and we wouldn't potentially get anything the server had in the slot replay?
> > >
> > > Is there a way to recover if the server can't send a SEQUENCE result for
> > > that slot? Seems like the slot would be stuck on the same RPC until the
> > > session is destroyed.
> > >
> > >
> > > > Karen
> > > >>
> > > >> That makes this rather a layering violation, and perhaps a reason why
> > > >> retransmitting with a larger Reply chunk might be a cure worse than the
> > > >> disease.
> > > >>
> > > >> I'm beginning to believe that making this situation always a permanent
> > > >> error, as rfc5666bis does, is a better protocol choice.
> > > >>
> > > >>
> > > >>
> > > >>> Karen
> > > >>> On 1/20/2017 5:27 PM, internet-drafts@ietf.org wrote:
> > > >>>
> > > >>>> A New Internet-Draft is available from the on-line Internet-Drafts directories.
> > > >>>> This draft is a work item of the Network File System Version 4 of the IETF.
> > > >>>>
> > > >>>>         Title           : Network File System (NFS) Upper Layer Binding To RPC-Over-RDMA
> > > >>>>         Author          : Charles Lever
> > > >>>>    Filename        : draft-ietf-nfsv4-rfc5667bis-04.txt
> > > >>>>    Pages           : 18
> > > >>>>    Date            : 2017-01-20
> > > >>>>
> > > >>>> Abstract:
> > > >>>>    This document specifies Upper Layer Bindings of Network File System
> > > >>>>    (NFS) protocol versions to RPC-over-RDMA.  Upper Layer Bindings are
> > > >>>>    required to enable RPC-based protocols, such as NFS, to use Direct
> > > >>>>    Data Placement on RPC-over-RDMA.  This document obsoletes RFC 5667.
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>> The IETF datatracker status page for this draft is:
> > > >>>>
> > > >>>>
> > > >>>> https://datatracker.ietf.org/doc/draft-ietf-nfsv4-rfc5667bis/
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>> There's also a htmlized version available at:
> > > >>>>
> > > >>>>
> > > >>>> https://tools.ietf.org/html/draft-ietf-nfsv4-rfc5667bis-04
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>> A diff from the previous version is available at:
> > > >>>>
> > > >>>>
> > > >>>> https://www.ietf.org/rfcdiff?url2=draft-ietf-nfsv4-rfc5667bis-04
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>> Please note that it may take a couple of minutes from the time of submission
> > > >>>> until the htmlized version and diff are available at tools.ietf.org.
> > > >>>>
> > > >>>> Internet-Drafts are also available by anonymous FTP at:
> > > >>>>
> > > >>>>
> > > >>>> ftp://ftp.ietf.org/internet-drafts/
> > > >>>>
> > > >>>>
> > > >>>>
> > > >> --
> > > >> Chuck Lever
> > > >>
> > > >>
> > > >>
> > > >>
> > > >
> > >
> > > --
> > > Chuck Lever
> > >
> > >
> > >
> > >
> >
> > --
> > Chuck Lever
> >
> >
> >
> >
> 
> --
> Chuck Lever
> 
> 
> 
> 

--
Chuck Lever