Re: [nfsv4] I-D Action: draft-ietf-nfsv4-rfc5667bis-04.txt
David Noveck <davenoveck@gmail.com> Sat, 04 February 2017 17:37 UTC
In-Reply-To: <5B6B324E-2BA1-401E-9846-97DE023A579A@oracle.com>
References: <148495844040.13416.10356809202500126242.idtracker@ietfa.amsl.com> <338b603b-8be3-f7f3-d7e0-021d8185f8ec@oracle.com> <E0D9D91E-9245-4846-842A-1F75A9A8D4A4@oracle.com> <e69f5b01-cdeb-2159-45ff-485118a6022f@oracle.com> <33525335-7563-475D-9DC0-59BB6FE8DF18@oracle.com> <CADaq8jfMtABtHuDrMKjWSLQv-EX7dr9emUyUdJ3WG4p9X2Jfnw@mail.gmail.com> <B8FB2639-6666-4EDC-BC40-9A4F8F4BAF70@oracle.com> <CADaq8jdHJW7NFywhykOK-8Z25Qx7GhjQNnbBbz=fA8a-6_WKsg@mail.gmail.com> <D40250C2-D956-446E-9B98-FA423B75A9E9@oracle.com> <CADaq8jfVrdZKe=VSA8z=Nqd9hsL2M2xKQ8SHNy0+BtCeQGfAQA@mail.gmail.com> <5B6B324E-2BA1-401E-9846-97DE023A579A@oracle.com>
From: David Noveck <davenoveck@gmail.com>
Date: Sat, 04 Feb 2017 12:37:37 -0500
Message-ID: <CADaq8jfBGgVNjOCnkcFxbAQrNYJ3LMrsuoGPQnzHSzLYnz9WWw@mail.gmail.com>
To: Chuck Lever <chuck.lever@oracle.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/nfsv4/FM07-gk-dmY6SosBWKxIduKQNdg>
Cc: "nfsv4@ietf.org" <nfsv4@ietf.org>
Subject: Re: [nfsv4] I-D Action: draft-ietf-nfsv4-rfc5667bis-04.txt
> I wonder if we can continue to label this Direct Data Placement, since the buffer has to be "flipped" into place after receive

Good observation :-), although somewhat annoying :-(. In any case, rtrext-02 will refer to Send-based Data Placement.

On Fri, Feb 3, 2017 at 4:09 PM, Chuck Lever <chuck.lever@oracle.com> wrote:

> > On Feb 3, 2017, at 1:04 PM, David Noveck <davenoveck@gmail.com> wrote:

> > A lot of this reads like pieces of a review of rtrext. I'll deal with some issues within that later, but first I want to focus on the concept of DDP-eligibility.

> > I'm not sure exactly what it would mean for a concept such as this to be completely general, and I don't think that's the issue. For me, the important point is that it is a useful concept and that it has more than one possible implementation. XDR items are categorized as either DDP-eligible or not. I am not saying that additional useful categorizations cannot arise in the future, but I don't know of any, and there doesn't seem to be any useful way to prepare for them.

> > The existing categorization is such that there will be individual instances of items that implementations will not transfer using DDP. This has always been treated as an implementation choice, and ULBs do not say, for example, that READs of a few bytes are not DDP-eligible. Instead, they are DDP-eligible and requesters have a way to specify that they be returned inline. To do otherwise would make ULBs overly complicated and very hard (perhaps Sisyphean) to maintain. ULBs are limited to specifying which items are DDP-eligible because that is what the requester and responder need to agree on. If the requester wants to send data inline, or to have DDP-eligible response data returned inline, that is up to him, and ULBs do not need to say anything about this choice.

> > The same applies when there are multiple possible forms of DDP with different performance characteristics. The requester chooses, and the ULB does not need to instruct him about which choice to make. To try to create rules about what choices the requester could make would make a ULB just about impossible to do. Let's not go there.

> > Now let me give some background about rtrext. Perhaps I've been unduly influenced by Talpey's figure of a million 8K random IOs per second, but the fact is that I was focused on that case. In my experience, it is a very common case, deserving of significant attention. I added message continuation to rtrext to allow reads of multiples of the block size (e.g. 8K) to get the same sort of benefit.

> There is no denying that is an important goal.

> Unfortunately I've not found a way to approach that level of throughput by shrinking per-op latency alone. It's important to realize that there is a lower bound for per-op latency that is determined by the physics of moving packets on a physical network fabric.

> High IOPS throughput is actually achieved by using multiple QPs through multiple physical interfaces and networks linking an SMB client and several storage targets (or in NFS terms: pNFS).

> So, if one RPC-over-RDMA connection can achieve 125 KIOPS (and I don't think that's unattainable, even with RPC-over-RDMA V1), then eight connections might be able to swing a million IOPS, given a careful client implementation. Getting to four 250 KIOPS connections might be a challenge.
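As a back-of-the-envelope illustration of the arithmetic above: per-connection IOPS is bounded by the number of requests kept in flight divided by the per-op round-trip latency. The latency and queue-depth figures below are my own assumptions, chosen only so the numbers land on the 125 KIOPS-per-connection / one-million-IOPS case being discussed.

    /* Rough model only: Little's law applied to one RPC-over-RDMA
     * connection.  Latency and queue depth are assumed values.
     */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double per_op_latency_s = 256e-6;  /* assumed 256 us per operation */
        double queue_depth      = 32.0;    /* assumed requests in flight   */

        double per_conn_iops = queue_depth / per_op_latency_s;  /* 125,000 */

        double target_iops = 1e6;
        int connections = (int)ceil(target_iops / per_conn_iops);

        printf("%.0f IOPS per connection; %d connections for %.0f IOPS\n",
               per_conn_iops, connections, target_iops);
        return 0;
    }

Shrinking per-op latency further runs into the fabric's physical floor, which is why aggregating connections (and, ultimately, pNFS data servers) carries most of the weight.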
> There are some critical prerequisites to multi-channel operation:

> - bidirectional RPC on RPC-over-RDMA, to enable NFSv4.1
> - multi-path NFS (Andy's work), which works best with NFSv4.1
> - pNFS/RDMA, which requires NFSv4.1, to enable the use of multiple DSes

> Linux is now close to having all of this in place.

> > Perhaps I did not devote the attention I should have to I/Os that are not an integral number of blocks, but my experience is that clients are typically structured to make this an unusual case. YMMV.

> > I don't claim that send-based DDP is universally applicable, and I expect other forms of DDP to continue to exist and be useful. With regard to the READ/WRITE distinction, I accept that most of the benefits are for WRITE and that many implementations might reasonably choose to use V1-like DDP for READ while using send-based DDP for WRITE. However, there are many environments in which remote invalidation is not possible (e.g. a user-level server), and for these there are significant benefits from using send-based DDP for READ as well.

> > With regard to the sections that read like pieces of a review of rtrext, it is hard to respond because the text often makes reference to potential extensions for which there is no clear definition. I think we have to address these issues:

> > • For push mode, it makes sense to define this as another optional extension to Version Two. We need a volunteer to write rpcrdma-pushext. If we don't have one now, we should try again after rpcrdma-version-two becomes a working group document.

> At the moment, push mode looks like a new pNFS layout type, thus it is independent of which version of RPC-over-RDMA is in use. Agreed, that all has to be written up. Christoph or I could make a start at it, but I'm not volunteering anyone's time. It would be a short document similar to

> http://xml2rfc.tools.ietf.org/public/rfc/bibxml3/reference.I-D.draft-hellwig-nfsv4-scsi-layout-nvme-00.xml

> which unfortunately appears to have expired.

> > • With regard to speculation about a possible form of Send-based DDP with just Message Continuation, I'm skeptical, but if there were to be some clear definition, it could be added as an option to rtrext-0x, for x >= 2.

> I was thinking of starting with an independent document (and extension) just to draw the lines clearly. I've been busy with other things, and haven't found an opportunity to write it up.

> > With regard to issues related to challenges due to large page sizes, I see the matter differently. I think it is possible to avoid copying without using page flipping, but using what I'll call "buffer flipping", even when page sizes are large or there is no use of virtual memory at all. One might use page flipping only as a way to avoid copies on page-size units that are smaller than the buffer size (e.g. 4K pages with 8K buffers).

> > Let's assume the buffer cache uses an 8K buffer size and that the page size is either larger or non-existent (i.e. no virtual memory). Receive buffers are structured to include a section for inline payload (1-2K) together with an 8K buffer within the buffer cache. When posted for receive, this buffer should have no valid data within it, and the associated buffer header should indicate that it is neither free nor assigned to a particular file location.
> > Once data is placed into it by a receive which gets a WRITE request, that buffer can be assigned to a location within a file and the buffer header updated to reflect this location. The mapping from file and block number to buffer address is changed, but the address of the buffer remains the same: it is the one into which the data was placed.

> In operating systems I'm familiar with, the page cache is the single basis of all file I/O. Sub-page sized I/O (i.e., a separate buffer cache) is no longer utilized.

> Even if a particular file system implementation were to use it, those buffers would not be exposed above the VFS, and thus NFS servers would have no visibility of them. Again, that's what I'm familiar with, maybe not universally true, and perhaps I've misunderstood something.

> A Message Continuation approach could use a "flipping" mechanism without the requester having to parcel out chunk lists.

> The key is separating the Transport header from the Payload. By fixing the size of the Transport header, a receiver can get the incoming Payload to land in a separate buffer from the Transport header, always.

> A sender might then rely on the fact that the receiver's inline threshold is related to its preferred page/buffer cache size, and break each message up so that certain data items in the Payload stream land in independent buffers on the receiver (again, this is similar to an experimental approach being considered for NFS on TCP).

> RDMA Receive preserves the sender's Send ordering. The receiver simply concatenates the incoming messages related to the same XID, grouping them as is convenient to it.

> Receiving such messages should work correctly for any arbitrary division of Payload stream contents. Benefits accrue when receivers can advertise to senders exactly how they want to see RPC payloads divided into messages. Perhaps a ULB could be useful here.

> This arrangement is still unable to handle what I'll term "zero-copy sub-buffer modification". In more abstract terms, I wonder if we can continue to label this Direct Data Placement, since the buffer has to be "flipped" into place after receive; it's not a bona fide update-in-place. What we are attempting to do here is more akin to zero-copy receive.

> > I think the conformance issues you have mentioned are worth thinking about, but there will be differences of opinion about how important these are. In any case, the discussion has resulted in some conclusions about the use of various forms of DDP that need to be captured somewhere other than in an email thread.

> > I can see why you don't want this stuff in rfc5667bis. I don't think it belongs in nfsulb (or in an update to rfc5667bis based on it) either. For now, I'll add an implementation-choice section to rtrext-02, but ultimately I don't think this kind of stuff belongs in a standards-track document. I think that eventually, when the set of extensions in Version Two is more settled (and larger than one), we could publish an Informational RFC that provides implementation guidance about performance-relevant choices that implementers might need to be made aware of.

> Yes, that is one possibility.
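A minimal sketch of the header/payload split described above, using the libibverbs receive API. The buffer sizes, the helper name, and the assumption that the sender cooperates by emitting a fixed-size transport header are all mine, not something specified by the drafts under discussion.

    /* Hypothetical sketch only: posts one Receive whose scatter/gather
     * list puts the first INLINE_BYTES of an incoming Send into a small
     * header buffer and the remainder into a separate, page-aligned 8K
     * buffer.  Assumes the QP and the two registered memory regions
     * already exist.
     */
    #include <stdint.h>
    #include <string.h>
    #include <infiniband/verbs.h>

    #define INLINE_BYTES  1024          /* assumed inline/header region */
    #define PAYLOAD_BYTES (8 * 1024)    /* assumed buffer-cache unit    */

    int post_split_recv(struct ibv_qp *qp,
                        void *hdr_buf,  struct ibv_mr *hdr_mr,
                        void *data_buf, struct ibv_mr *data_mr,
                        uint64_t wr_id)
    {
        struct ibv_sge sge[2];
        struct ibv_recv_wr wr, *bad_wr = NULL;

        /* First SGE: transport header plus small inline payload. */
        sge[0].addr   = (uintptr_t)hdr_buf;
        sge[0].length = INLINE_BYTES;
        sge[0].lkey   = hdr_mr->lkey;

        /* Second SGE: page-aligned region that could later be "flipped"
         * into a buffer or page cache after the Receive completes.     */
        sge[1].addr   = (uintptr_t)data_buf;
        sge[1].length = PAYLOAD_BYTES;
        sge[1].lkey   = data_mr->lkey;

        memset(&wr, 0, sizeof(wr));
        wr.wr_id   = wr_id;
        wr.sg_list = sge;
        wr.num_sge = 2;

        return ibv_post_recv(qp, &wr, &bad_wr);
    }

Whether the second region can then be "flipped" into place without a copy is exactly the sub-page and alignment question debated later in the thread.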
> > On Thu, Feb 2, 2017 at 3:45 PM, Chuck Lever <chuck.lever@oracle.com> wrote:

> > > On Feb 2, 2017, at 8:53 AM, David Noveck <davenoveck@gmail.com> wrote:

> > > > This section also mentioned inline thresholds and Reply chunks, and that looks superfluous. I removed it.

> > > A whole lot of rfc5667 wound up essentially repeating stuff in rfc5666. A lot of that has been removed, and that work is continuing.

> > > The interesting question is when that process should stop. In particular, we need to understand if it can stop short of simply providing the information required of ULBs in rfc5666bis.

> > I'm totally OK with removing redundant language from rfc5667bis, but we should not pretend that DDP eligibility is completely general or that we understand perfectly how all DDP transport mechanisms will work in the future.

> > I'd like to be sure we are not creating a Sisyphean task.

> > > When I suggested going that way, you didn't like the idea and I had doubts about whether it was workable.

> > For example, some repetition of rfc5666bis language is helpful as implementation guidance.

> > > Nevertheless, if we have material beyond what we need to meet those requirements, we should understand why it is there.

> > There are other things that can go into a ULB that might not have been listed in rfc5666bis. Clearly one issue we have with NFS in particular is how to deal with replay caching above the transport.

> > > Meanwhile, in nfsulb, I've been going in the opposite direction, by leaving the V1-specific material in and indicating that other implementations are possible and will be discussed in the documents for the appropriate versions and extensions. As a result, while rfc5666bis is getting smaller, my document is larger than -04.

> > > These documents need to come together at some time, most likely after rfc5667bis is published (or at least in WGLC) and rpcrdma-version-two has been turned into a working group document.

> > > The prospect of doing so brings us back to our deferred discussion of the phyletics of the various forms of DDP. My original plans had been to put this off until nfsulb was published (done), there was a new version of rtrext that addressed ULB issues (now working on it), and I added some ULB-related material to rpcrdma-version-two (on my todo list).

> > > As I looked at your comments about this issue, I still disagreed with them but decided that they raise issues that the discussion of ULB issues will have to address, and they might also lead to clarifications in the rest of rtrext. So that we don't have to wait for some of these documents to come out (and read them all), let me address your issues below:

> > > > Explicit RDMA and send-based DDP are perhaps in the same phylum but are truly distinct beasts.

> > > I agree they are different things. There would be no point in defining send-based DDP if it were the same as V1-DDP. Nevertheless, they serve the same purpose: transferring DDP-eligible items so that the receiver gets them in an appropriately sized and aligned buffer. I think that is all that matters for this discussion.
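To make the "implementation choice" point concrete: a ULB only declares which XDR data items are DDP-eligible; how each instance is actually moved is up to the requester. The sketch below is purely illustrative, with invented names and thresholds, and is not drawn from rfc5667bis or the rtrext proposal.

    /* Illustrative sketch only: the enum, thresholds, and helper are
     * invented.  A ULB declares which XDR data items are DDP-eligible;
     * how a particular instance is moved remains a per-request
     * implementation choice.
     */
    #include <stdbool.h>
    #include <stddef.h>

    enum xfer_method {
        XFER_INLINE,         /* carry the item in the RPC message itself     */
        XFER_EXPLICIT_RDMA,  /* Read/Write chunk, RPC-over-RDMA V1-style DDP */
        XFER_SEND_DDP        /* hypothetical Send-based DDP (rtrext idea)    */
    };

    enum xfer_method choose_method(size_t len, bool ddp_eligible, bool write_dir,
                                   bool remote_inval_ok, size_t inline_threshold)
    {
        if (!ddp_eligible || len <= inline_threshold)
            return XFER_INLINE;             /* small instances stay inline   */

        if (len > 64 * 1024)                /* large payloads: offloaded RDMA */
            return XFER_EXPLICIT_RDMA;

        /* Moderate payloads: the thread suggests Send-based DDP is most
         * attractive for WRITE-direction data, or for READ when the server
         * cannot use Remote Invalidation (e.g. a user-level server).       */
        if (write_dir || !remote_inval_ok)
            return XFER_SEND_DDP;

        return XFER_EXPLICIT_RDMA;
    }

Keeping logic like this out of the ULB is exactly the argument above: requester and responder only need to agree on eligibility, not on the per-instance choice.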
> > The numbered items in my previous e-mail demonstrate that the purpose of DDP eligibility is to encourage ULB and implementation choices that are appropriate when using offloaded RDMA. They do not all apply to Send-based DDP, though some do.

> > Or, put another way, DDP eligibility is not general at all! It is only "general" because Send-based DDP has been adjusted to fit it.

> > For example, you have argued below that Send-based DDP is aimed at improving a particular size of I/O; yet data item size is one important criterion for a ULB author to decide whether to make a data item DDP-eligible.

> > Suppose a ULP has a 512-byte opaque data item (opaque, meaning that no XDR encoding or decoding is required) that would benefit from avoiding a data copy because it is in a frequently-used operation. You might not choose to make it DDP-eligible because an explicit RDMA operation is inefficient. But it might be a suitable candidate for Send-based DDP.

> > > To get back to phyletics, I would put the branching much lower in the tree. To me a better analogy is to consider them as two distinct, closely-related orders, like rodentia and lagomorpha (rabbits, hares, pikas).

> > > It is true that there are reasons one might choose to move a DDP-eligible data item using one or the other, but that is an implementation decision. Efficiency considerations and direction of transfer may affect this implementation decision, as does the fact, mentioned in rtrext, that if the client wants the response to be placed at a particular address, you shouldn't use send-based DDP. But none of this affects the fact that only DDP-eligible items can be placed by send-based DDP, just as is the case with V1-DDP.

> > > > 1. The rfc5666bis definition of DDP eligibility makes sense if you are dealing with a mechanism that is more efficient with large payloads than small. Send-based DDP is just the opposite.

> > > It depends on what you mean by "large" and "small". Clearly, however, neither technique makes sense for very small payloads, and both make the most sense for payloads over 4K, although the specifics will depend on efficiency and on how the send-based DDP receiver chooses to structure its buffers.

> > > For me, the sweet spot for send-based DDP would be 8K transfers, and for this reason structured buffers of 9K (consisting of 1K primarily for inline payload together with an 8K aligned segment for direct placement) are a good choice. Even larger buffers are possible, but the fact that message continuation can be used means that send-based DDP should have no problems with 64K transfers.

> > With Remote Invalidation, NFS READ can be handled with a Write chunk with nearly the same efficiency and latency as a mechanism that uses only Send. Yes, even with small NFS READ payloads, whose performance doesn't matter much anyway. Let's take that off the table.

> > And I think we agree that payloads larger than a few dozen KB are better transferred using explicit RDMA, for a variety of reasons.

> > IMO the only significant case where Send-based DDP is interesting is replacing RDMA Read of moderate-sized payloads (either NFS WRITE data, or Long Call messages). Let's look at NFS WRITE first.
> > Using a large inline threshold (say, 12KB), the server would have to copy some or all of an 8KB payload into place after a large Receive. Server implementations typically already have to copy this data.

> > Why?

> > Because page-flipping a Receive buffer cannot work when the NFS WRITE is modifying data in a file page that already has data on it.

> > Remember that Receive buffers are all the same; all RPC Call traffic goes through them. The hardware picks the Receive buffer arbitrarily; the ULP can't choose in advance which RPC goes into which Receive buffer. Flipping the buffer is the only choice to avoid a data copy.

> > Suppose the server platform's page size is 64KB, or even 2MB. Then 8KB NFS WRITEs are all sub-page modifications, and copying would be required for each I/O. Explicit RDMA works fine in that case without a data copy.

> > Host CPU data copy has comparable or better latency than flipping a receive buffer into a file, allocating a fresh receive buffer, and DMA mapping it, even for payloads as large as 4KB or 8KB. Eliminating a server-side data copy on small NFS WRITEs is gravy, not real improvement.

> > As far as I can tell, Send-based DDP is not able to handle the sub-page case at all without a copy. That could add an extra rule for using Send-based DDP over and above DDP eligibility, couldn't it?

> > As soon as one realizes that Send-based DDP is effective only when the payload is aligned to the server's page size, depends on the surrounding file contents, and is less efficient than explicit RDMA when the payload size is larger than a few KB, introducing the complexity of Send-based DDP to the protocol or an implementation becomes somewhat less appealing.

> > A better approach would attempt to address the actual problem, which is the need to use RDMA Read for moderate payloads. This is for both NFS WRITE and large RPC Call messages.

> > 1. Long Call: a large inline threshold or Message Continuation can easily address this case; and it is actually not frequent enough to be a significant performance issue.

> > 2. NFS WRITE: Message Continuation works here, with care taken to align the individual sub-messages to server pages, if that's possible. Or, use push mode so the client uses RDMA Write in this case.

> > The latter case would again be similar to DDP-eligibility, but not the same. The same argument data items are allowed, but additional conditions and treatment are required.

> > > It is true that very large transfers might be better using V1-DDP, but that is beside the point. Any READ buffer is DDP-eligible (even for a 1-byte READ), and the choice of whether to use DDP (or the sort to use) in any particular case is up to the implementation. The ULB shouldn't be affected.

> > > > 2. Re-assembling an RPC message that has been split by message continuation or send-based DDP is done below the XDR layer, since all message data is always available once Receive is complete, and XDR decoding is completely ordered with the receipt of the RPC message. With explicit RDMA, receipt of some arguments MAY be deferred until after XDR decoding begins.
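Referring back to the sub-page point above: a toy illustration (invented helper, not from any implementation) of why an 8KB WRITE can be "flipped" into place when the server's page size is 4KB, but must be copied when the page size is 64KB.

    /* Illustrative only: a Receive buffer can be "flipped" into the
     * file's cache without a copy only if the WRITE payload covers
     * whole, page-aligned units of the server's page size; otherwise
     * the server must merge it with existing file data, i.e. copy.
     */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    static bool can_flip(uint64_t file_offset, uint64_t len, uint64_t page_size)
    {
        return (file_offset % page_size) == 0 && (len % page_size) == 0;
    }

    int main(void)
    {
        /* 8K WRITE with 4K pages: whole-page update, flip is possible.   */
        printf("8K @ 0, 4K pages:  %s\n", can_flip(0, 8192, 4096)  ? "flip" : "copy");
        /* Same 8K WRITE with 64K pages: sub-page modification, must copy. */
        printf("8K @ 0, 64K pages: %s\n", can_flip(0, 8192, 65536) ? "flip" : "copy");
        return 0;
    }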
> > > First of all, message continuation isn't relevant to this discussion.

> > > With regard to send-based DDP, it is true that you have the placed data immediately, while you might have to wait in the V1-DDP case, but this doesn't affect DDP-eligibility. In either case, before you process the request or response, all the data, whether inline or placed, needs to be available. I don't see how anything else matters.

> > It can matter greatly for NFS WRITE / RDMA Read.

> > In order for a server to set up the RDMA Read so the payload goes into either a file's page cache or into the correct NVM pages, the server has to parse the NFS header information that indicates what filehandle is involved and the offset and count, before it constructs the RDMA Read(s).

> > Without that information, the server performs the RDMA Read using anonymous pages and either flips those pages into the file or, in the case of NVM, copies the data into persistent storage pages associated with that file.

> > That's the reason delayed RDMA Read is still important.

> > Now, Send-based DDP can handle NVM-backed files in one of two ways:

> > 1. Receive into anonymous DRAM pages and copy the payload into NVM; this results in a (possibly undesirable) host CPU data copy.

> > 2. Receive into NVM Receive buffers; but then all of your NFS operations will have to go through NVM. Maybe not an issue for DRAM backed with flash, but it is a potential problem for technologies like 3D XPoint, which are 2x or more slower than DRAM.

> > Remember, for RDMA Read, we are talking about NFS WRITE or pulling moderate-sized arguments from the client on a Call. In that case, a server-side host CPU copy is less likely to interfere with an application workload, and thus is not as onerous as it might be on the client.

> > > > In fact, the current RPC-over-RDMA protocol is designed around that. For example, the mechanism of reduction relies on only whole data items being excised to ensure XDR round-up of variable-length data types is correct. That is because an implementation MAY perform RDMA Read in the XDR layer, not in the transport.

> > > It may be that send-based DDP could have been designed to accommodate cases that V1-DDP could not, but the fact is that it wasn't, because:
> > > • There are no common cases in which it would provide a benefit.
> > > • It would complicate the design to make this something other than a drop-in replacement for V1-DDP.

> > > > 3. Message continuation, like a Long message, enables an RPC message payload to be split arbitrarily, without any regard to DDP eligibility or data item boundaries.

> > > Exactly right, but this just demonstrates that message continuation and DDP-eligibility should not affect one another, and they don't.

> > Seems to me you can do a form of Send-based DDP with just Message Continuation; perhaps just enough that an additional Send-based DDP mechanism is not necessary.

> > And Send-based DDP isn't terribly valuable without a fully send-based mechanism for transferring large payloads. So it can't be done without Message Continuation, unless I'm mistaken.

> > These two mechanisms seem well-entwined.

> > > > 4. Send-based DDP is not an offload transfer. It involves the host CPUs on both ends.

> > > Not sure what you mean by that.
> > For Send/Receive, the receiver's host CPU always takes an interrupt, and the host CPU performs transport header processing. (Even worse for Message Continuation, where one RPC can now mean more than one interrupt on the receiving CPU.)

> > With RDMA Write, for example, the receiver does not take an interrupt, and the wire header is processed by the HCA.

> > > Both are the same in that:
> > > • You don't need to copy data on either side.
> > > • Both sides' CPUs need to deal with where the data is. The adapter has to be told where the data is to be sent/transferred from, while the receiving side has to note where the placed data is (i.e. not in the normal XDR stream).

> > > > Thus it doesn't make sense to restrict the DDP eligibility of data items that require XDR marshaling and unmarshaling.

> > > This sounds like you think copying is required. It isn't.

> > Yes, it is. As above, sometimes with Receive a partial or whole copy of the payload is unavoidable.

> > > There is no point in using send-based DDP to transfer an item that requires marshalling/unmarshalling, and the situation is the same as for V1-DDP. There is no point in specifically placing data that would get no benefit from being specifically placed.

> > > > A receiver has generic receive resources that have to accommodate any RPC that is received (via Send/Receive).

> > > True.

> > > > DDP eligibility is designed to prepare the receiver for exactly one particular data item that MAY involve the ULP in the reconstruction of the whole RPC message.

> > > True.

> > > On Wed, Feb 1, 2017 at 6:04 PM, Chuck Lever <chuck.lever@oracle.com> wrote:

> > > > On Feb 1, 2017, at 5:04 PM, David Noveck <davenoveck@gmail.com> wrote:

> > > > > Within NFS version 4.0, there are a few variable-length result data items whose maximum size cannot be estimated by clients reliably because there is no protocol-specified size limit on these structures.

> > > > I like this, but I have a problem with the word "structures". The only case we are talking about is a GETATTR response, and the XDR defines that as an opaque array, even though the client and server both are supposed to know what is in it. "Arrays" would work.

> > > Agreed, fixed.

> > > This section also mentioned inline thresholds and Reply chunks, and that looks superfluous. I removed it.

> > > > On Wed, Feb 1, 2017 at 4:47 PM, Chuck Lever <chuck.lever@oracle.com> wrote:

> > > > > On Feb 1, 2017, at 4:27 PM, karen deitke <karen.deitke@oracle.com> wrote:

> > > > > On 1/30/2017 5:28 PM, Chuck Lever wrote:

> > > > >> Hi Karen, thanks for taking time to review.

> > > > >>> On Jan 27, 2017, at 6:40 PM, karen deitke <karen.deitke@oracle.com> wrote:

> > > > >>> Hi Chuck,

> > > > >>> o Section 2.6, second to last paragraph, typo, priori should be prior.

> > > > >> "A priori" is Latin for "in advance" (literally, "from the earlier"). I've replaced this phrase.

> > > > > Thanks, I don't speak Latin :-)
> > > > >>> o Same section, "Note however that many operations normally considered non-idempotent (e.g. WRITE, SETATTR) are actually idempotent". This was confusing because earlier in the paragraph you are talking about determining the reply size for variable-length replies, which doesn't seem to be the case for WRITE, SETATTR.

> > > > >> That's correct, WRITE doesn't have a variable length reply. This section discusses a possible COMPOUND with WRITE (fixed length result, non-idempotent) and GETATTR (variable length result, idempotent). I've clarified it.

> > > > > ok

> > > > >>> o 4.2.1 second paragraph would read better if you removed the word "however"

> > > > >> I was not able to find the word "however" in the second paragraph of S4.2.1:

> > > > >>    ... chunk to disambiguate which chunk is associated with which argument data item. However NFS version 4 server and client implementations ...

> > > > >>    There are certain NFS version 4 data items whose size cannot be estimated by clients reliably, however, because there is no protocol-specified size limit on these structures. These include:

> > > > > Sorry, it is in 4.2:

> > > > > "There are certain NFS version 4 data items whose size cannot be estimated by clients reliably, however, because there is no protocol-specified size limit on these structures."

> > > > How about this:

> > > >    Within NFS version 4.0, there are a few variable-length result data items whose maximum size cannot be estimated by clients reliably because there is no protocol-specified size limit on these structures.

> > > > >>> "If a client implementation is equipped to recognize that a transport error could mean that it provisioned an inadequately sized Reply chunk, it can retry the operation with a larger Reply chunk. Otherwise, the client must terminate the RPC transaction."

> > > > >>> So you are saying, for example, that if the client sends a request in which it knows it has "guessed" the maximum response size, but that guess may be wrong, then if it detects that the connection dropped, it could potentially interpret that as the guess being too small and send again?

> > > > >> As previously discussed.

> > > > >>> 4.3, second from last bullet, "If an READ" should be "If a READ"

> > > > >> Fixed.

> > > > >>> 4.5. Why does the xid in the rdma header have to be used to get the session info? Wouldn't the session info in the sequence_op that was sent in the first place have this info?

> > > > >> Here's the last paragraph of Section 4.5:

> > > > >>    In addition, within the error response, the requester does not have the result of the execution of the SEQUENCE operation, which identifies the session, slot, and sequence id for the request which has failed. The xid associated with the request, obtained from the rdma_xid field of the RDMA_ERROR or RDMA_MSG message, must be used to determine the session and slot for the request which failed, and the slot must be properly retired. If this is not done, the slot could be rendered permanently unavailable.

> > > > >> The mention of RDMA_MSG is probably incorrect.
> > > > >> In that case, the server was able to return a SEQUENCE result, and that should be useable by the client.

> > > > >> Without an RPC Reply message, however, the client matches the XID in the ERR_CHUNK message to a previous call, and that will have the matching SEQUENCE operation.

> > > > > I'm still not following this. Is the issue that the client would resend, but the seq of the slot is incremented and we wouldn't potentially get anything the server had in the slot replay?

> > > > Is there a way to recover if the server can't send a SEQUENCE result for that slot? Seems like the slot would be stuck on the same RPC until the session is destroyed.

> > > > > Karen

> > > > >> That makes this rather a layering violation, and perhaps a reason why retransmitting with a larger Reply chunk might be a cure worse than the disease.

> > > > >> I'm beginning to believe that making this situation always a permanent error, as rfc5666bis does, is a better protocol choice.

> > > > >>> Karen

> > > > >>> On 1/20/2017 5:27 PM, internet-drafts@ietf.org wrote:

> > > > >>>> A New Internet-Draft is available from the on-line Internet-Drafts directories. This draft is a work item of the Network File System Version 4 of the IETF.

> > > > >>>> Title    : Network File System (NFS) Upper Layer Binding To RPC-Over-RDMA
> > > > >>>> Author   : Charles Lever
> > > > >>>> Filename : draft-ietf-nfsv4-rfc5667bis-04.txt
> > > > >>>> Pages    : 18
> > > > >>>> Date     : 2017-01-20

> > > > >>>> Abstract:
> > > > >>>>    This document specifies Upper Layer Bindings of Network File System (NFS) protocol versions to RPC-over-RDMA. Upper Layer Bindings are required to enable RPC-based protocols, such as NFS, to use Direct Data Placement on RPC-over-RDMA. This document obsoletes RFC 5667.

> > > > >>>> The IETF datatracker status page for this draft is:
> > > > >>>> https://datatracker.ietf.org/doc/draft-ietf-nfsv4-rfc5667bis/

> > > > >>>> There's also a htmlized version available at:
> > > > >>>> https://tools.ietf.org/html/draft-ietf-nfsv4-rfc5667bis-04

> > > > >>>> A diff from the previous version is available at:
> > > > >>>> https://www.ietf.org/rfcdiff?url2=draft-ietf-nfsv4-rfc5667bis-04

> > > > >>>> Please note that it may take a couple of minutes from the time of submission until the htmlized version and diff are available at tools.ietf.org.
> --
> Chuck Lever