Re: [nfsv4] rfc5667bis open issues

Chuck Lever <chuck.lever@oracle.com> Mon, 26 September 2016 17:20 UTC

From: Chuck Lever <chuck.lever@oracle.com>
In-Reply-To: <CADaq8jcSxc6BQKJ1SZ=OrpRcEGpgpfdLDcPpBp=GfGQJwkbLEw@mail.gmail.com>
Date: Mon, 26 Sep 2016 13:20:11 -0400
Message-Id: <11DB3812-B605-4426-A316-176CE31910B2@oracle.com>
References: <15F62327-B73F-45CF-B4A5-8535955E954F@oracle.com> <65E80EDE-6031-4A83-9B73-3A88C91F8E6A@oracle.com> <CADaq8jc50Ca6eDZ3D6zRvfG+Q2DngNN6+mN9WKXj9AS=d1iQVg@mail.gmail.com> <D0ECCDF7-F785-4419-AA93-33B2054C4737@oracle.com> <CADaq8jcSxc6BQKJ1SZ=OrpRcEGpgpfdLDcPpBp=GfGQJwkbLEw@mail.gmail.com>
To: David Noveck <davenoveck@gmail.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/nfsv4/pa-fq07uq35AFrOck7l0bIbG0_4>
Cc: NFSv4 <nfsv4@ietf.org>
Subject: Re: [nfsv4] rfc5667bis open issues

> On Sep 24, 2016, at 8:11 AM, David Noveck <davenoveck@gmail.com> wrote:
> 
> >The issue is that the language allows ERR_CHUNK to be
> > returned after a server has processed the RPC request when 
> > a client has not provided adequate Write list or Reply chunk
> > resources to convey the reply.
> 
> I forgot that we made ERR_CHUNK ambiguous, apparently because of reluctance to add things to the XDR.  There is no pressing need to do this since the responder is aware of the difference and his management of the DRC can follow based on his knowledge.
> 
> However, in the context of Version Two it would be better if we avoided those ambiguities, given that we have lots of space for distinct error codes.
> 
> > In that case, it makes sense for the server to have added 
> > the request to its DRC.
> 
> Agree that in the case in which the server has executed the request, it should add the request to the DRC.  In practical terms, there are not likely to be cases in which there is a non-idempotent request with a reply longer than 1K.

For NFSv3, that is largely true.

For NFSv4, I believe a large reply to a non-idempotent request is
possible, and may even be common. Any compound like this:

  { SEQUENCE, PUTFH, SETATTR, GETATTR }

where the GETATTR requests an ACL or security label is problematic
if the client does not estimate the reply buffer size correctly.
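
To make the sizing problem concrete, here is a minimal sketch of the
kind of estimate a client has to make before posting such a compound.
The threshold, overhead values, and helper names are all hypothetical;
nothing here is taken from an existing implementation.

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define INLINE_THRESHOLD  1024  /* assumed inline reply limit */
#define FIXED_RESULTS      512  /* rough bound: SEQUENCE, PUTFH, SETATTR results */
#define BOUNDED_GETATTR    256  /* rough bound for fixed-size attributes */

/* ACLs and security labels have no small fixed bound, so the client
 * can only guess at their size. */
static size_t getattr_estimate(bool wants_acl, size_t acl_guess)
{
        return wants_acl ? acl_guess : BOUNDED_GETATTR;
}

/* True if the estimated reply needs a Reply chunk instead of going
 * inline. */
static bool needs_reply_chunk(bool wants_acl, size_t acl_guess)
{
        return FIXED_RESULTS + getattr_estimate(wants_acl, acl_guess) >
               INLINE_THRESHOLD;
}

int main(void)
{
        /* If acl_guess is too small and the actual ACL is larger, the
         * server may have nowhere to put the reply: ERR_CHUNK. */
        printf("reply chunk needed: %d\n", needs_reply_chunk(true, 200));
        return 0;
}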

However, this is a case where the client has a bug; it's not one
where we expect either side to perform heroic recovery. ERR_CHUNK
would terminate the RPC on the client, which would very likely
return EIO to the application. I think that's about the sanest
outcome we can expect.
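
As a strawman, the client-side handling might amount to no more than
the following. The table and names are hypothetical, not from any
real implementation:

#include <errno.h>
#include <stddef.h>
#include <stdint.h>

struct outstanding_call {
        uint32_t xid;
        int      status;        /* 0 while the call is in flight */
};

static struct outstanding_call calls[16];

/* Invoked when an RDMA_ERROR message carrying ERR_CHUNK arrives:
 * terminate the matching RPC with a local error.  The upper layer
 * then surfaces EIO to the application; there is no retry. */
void complete_with_eio(uint32_t rdma_xid)
{
        for (size_t i = 0; i < sizeof(calls) / sizeof(calls[0]); i++) {
                if (calls[i].xid == rdma_xid) {
                        calls[i].status = -EIO;
                        return;
                }
        }
}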


> As far as the session case, the server will consider the request executed but the client does not have a reply containing the slot and sequence.  To deal with that case, he would have to use rdma_xid in the message with the ERR_CHUNK to get the needed context and so conclude that the slot was available for reuse.

This feels like something that belongs in 5667bis. I'm no real
expert on session behavior. Can you sketch some text that can be
added?
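
Not being a sessions expert, I can only offer a rough sketch of the
sort of thing I imagine that text would describe. The structures and
the seqid handling below are my assumptions, not established NFSv4.1
or 5667bis behavior:

#include <stdbool.h>
#include <stdint.h>

struct session_slot {
        uint32_t seqid;         /* sequence the client last sent on this slot */
        bool     in_use;
};

struct pending_req {
        uint32_t             rdma_xid;  /* XID echoed in the ERR_CHUNK message */
        struct session_slot *slot;      /* slot consumed by this request's SEQUENCE */
};

/* The ERR_CHUNK reply carries no SEQUENCE result, but its rdma_xid
 * identifies the original request, and therefore the slot and seqid
 * that request used.  Since the server may have executed and cached
 * the request, the client advances the seqid before marking the slot
 * available for reuse. */
void release_slot_after_err_chunk(struct pending_req *req)
{
        req->slot->seqid++;
        req->slot->in_use = false;
}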


> > What I propose is that if the first READ_PLUS returns 
> > NFS4_CONTENT_HOLE, the server would return an empty first
> > Write chunk. Then the second READ_PLUS result always 
> > lines up with the second Write chunk, which IMO is much better for clients.
> 
> I'm OK with this but I think you will need to adjust the text to reflect the fact that READ_PLUS can return an array of read_plus_content's, although, in practice, those that return more than one are extremely rare.

I hadn't realized READ_PLUS returned an array.

If an NFS server is allowed to structure its reply in a way that
the client cannot predict, then I think we'll have to limit the
way READ_PLUS uses DDP. I propose these rules:

- The client can provide no more than one Write chunk if it expects
NFS4_CONTENT_DATA. (Following the previous rules, the client sends
no Write chunk, or an empty Write chunk, when it predicts that the
reply can go inline.)

- If that Write chunk is non-empty, it MUST be large enough to
receive all expected payload bytes in a single NFS4_CONTENT_DATA
element.

- The server uses that Write chunk for the first array element that
has an NFS4_CONTENT_DATA arm.
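
Here is a sketch of how a responder might apply these rules, using a
simplified picture of the READ_PLUS result array and of the single
client-provided Write chunk. The types and names are my own, not
from the XDR:

#include <stdbool.h>
#include <stddef.h>

enum elem_type { ELEM_HOLE, ELEM_DATA };  /* stand-ins for the content arms */

struct result_elem {
        enum elem_type type;
        size_t         length;          /* data length, or hole size */
};

struct write_chunk {
        bool   present;                 /* client provided a Write chunk */
        size_t capacity;                /* total registered length */
};

/* Returns the index of the result element to direct into the Write
 * chunk, or -1 if the reply must go entirely inline (no chunk, or an
 * empty chunk), there is no data element, or the chunk is too small
 * to hold the whole data element (a violation of the second rule). */
int elem_for_write_chunk(const struct result_elem *elems, int nelems,
                         const struct write_chunk *wc)
{
        if (!wc->present || wc->capacity == 0)
                return -1;

        for (int i = 0; i < nelems; i++) {
                if (elems[i].type != ELEM_DATA)
                        continue;       /* holes are returned inline */
                return elems[i].length <= wc->capacity ? i : -1;
        }
        return -1;
}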


Then we have a choice, depending on whether it is more desirable
to return data in a single round-trip, or more desirable to preserve
holes. Either:

- If the server finds that the array has grown larger than can be
returned inline or via the supplied Reply chunk, it MUST return
the payload data in a single NFS4_CONTENT_DATA element via the
provided Write chunk.

Or:

- The server MUST return as much payload as it can fit within the
resources provided by the client, and return it as a short READ
result. The client is responsible for retrying the READ_PLUS to
read the remaining payload.
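
The second alternative might boil down to something like this on the
responder, with the requester responsible for the follow-up READ_PLUS
(again, purely illustrative names):

#include <stddef.h>

struct short_result {
        size_t returned;        /* bytes the responder sent this round trip */
        size_t remaining;       /* bytes the requester must ask for again */
};

/* Trim the reply to whatever the client-provided resources can carry
 * and report it as a short READ_PLUS result. */
struct short_result shorten_reply(size_t requested, size_t reply_capacity)
{
        struct short_result res;

        res.returned  = requested < reply_capacity ? requested : reply_capacity;
        res.remaining = requested - res.returned;
        return res;
}

/* The client's retry would then begin at the next unreturned byte. */
size_t retry_offset(size_t orig_offset, const struct short_result *res)
{
        return orig_offset + res->returned;
}

Note that when reply_capacity is zero this degenerates to returning
nothing at all, which is the open case below.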

Somehow we have to deal with the case where the server cannot fit
any of the payload in the client-provided resources.


READ_PLUS is actually a poor fit for offloaded DDP anyway. The whole
point of offload is that the client has to do no work; the payload
arrives in its memory without any effort on its part.

I would just as soon require that, on RDMA transports, READ_PLUS
return only a hole or exactly one contiguous piece of content.


--
Chuck Lever