Re: [nfsv4] rfc5667bis open issues

Chuck Lever <chuck.lever@oracle.com> Mon, 26 September 2016 17:20 UTC

From: Chuck Lever <chuck.lever@oracle.com>
In-Reply-To: <CADaq8jcSxc6BQKJ1SZ=OrpRcEGpgpfdLDcPpBp=GfGQJwkbLEw@mail.gmail.com>
Date: Mon, 26 Sep 2016 13:20:11 -0400
Message-Id: <11DB3812-B605-4426-A316-176CE31910B2@oracle.com>
References: <15F62327-B73F-45CF-B4A5-8535955E954F@oracle.com> <65E80EDE-6031-4A83-9B73-3A88C91F8E6A@oracle.com> <CADaq8jc50Ca6eDZ3D6zRvfG+Q2DngNN6+mN9WKXj9AS=d1iQVg@mail.gmail.com> <D0ECCDF7-F785-4419-AA93-33B2054C4737@oracle.com> <CADaq8jcSxc6BQKJ1SZ=OrpRcEGpgpfdLDcPpBp=GfGQJwkbLEw@mail.gmail.com>
To: David Noveck <davenoveck@gmail.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/nfsv4/pa-fq07uq35AFrOck7l0bIbG0_4>
Cc: NFSv4 <nfsv4@ietf.org>
Subject: Re: [nfsv4] rfc5667bis open issues

> On Sep 24, 2016, at 8:11 AM, David Noveck <davenoveck@gmail.com> wrote:
> 
> >The issue is that the language allows ERR_CHUNK to be
> > returned after a server has processed the RPC request when 
> > a client has not provided adequate Write list or Reply chunk
> > resources to convey the reply.
> 
> I forgot that we made ERR_CHUNK ambiguous, apparently because of reluctance to add things to the XDR.  There is no pressing need to do this since the responder is aware of the difference and his management of the DRC can follow based on his knowledge.
> 
> However, in the context of Version Two it would be better if we avoided those ambiguities, given that we have lots of space for distinct error codes.
> 
> > In that case, it makes sense for the server to have added 
> > the request to its DRC.
> 
> Agree that in the case in which the server has executed the request, it should add the request to the DRC.  In practical terms, there are not likely to be cases in which there is a non-idempotent request with a reply longer than 1K.

For NFSv3, that is largely true.

For NFSv4, I believe a large reply to a non-idempotent request is
possible, and may even be common. Any compound like this:

  { SEQUENCE, PUTFH, SETATTR, GETATTR }

where the GETATTR requests an ACL or security label is problematic
if the client does not estimate the reply buffer size correctly.
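
To make the sizing problem concrete, here is a minimal sketch of the
kind of estimate a client has to make before posting such a compound.
The threshold, overhead values, and helper names are all hypothetical;
nothing here is taken from an existing implementation.

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define INLINE_THRESHOLD  1024  /* assumed inline reply limit */
#define FIXED_RESULTS      512  /* rough bound: SEQUENCE, PUTFH, SETATTR results */
#define BOUNDED_GETATTR    256  /* rough bound for fixed-size attributes */

/* ACLs and security labels have no small fixed bound, so the client
 * can only guess at their size. */
static size_t getattr_estimate(bool wants_acl, size_t acl_guess)
{
        return wants_acl ? acl_guess : BOUNDED_GETATTR;
}

/* True if the estimated reply needs a Reply chunk instead of going
 * inline. */
static bool needs_reply_chunk(bool wants_acl, size_t acl_guess)
{
        return FIXED_RESULTS + getattr_estimate(wants_acl, acl_guess) >
               INLINE_THRESHOLD;
}

int main(void)
{
        /* If acl_guess is too small and the actual ACL is larger, the
         * server may have nowhere to put the reply: ERR_CHUNK. */
        printf("reply chunk needed: %d\n", needs_reply_chunk(true, 200));
        return 0;
}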

However, this is a case where the client has a bug; it's not one
where we expect either side to perform heroic recovery. ERR_CHUNK
would terminate the RPC on the client, which would very likely
return EIO to the application. I think that's about the sanest
outcome we can expect.
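
As a strawman, the client-side handling might amount to no more than
the following. The table and names are hypothetical, not from any
real implementation:

#include <errno.h>
#include <stddef.h>
#include <stdint.h>

struct outstanding_call {
        uint32_t xid;
        int      status;        /* 0 while the call is in flight */
};

static struct outstanding_call calls[16];

/* Invoked when an RDMA_ERROR message carrying ERR_CHUNK arrives:
 * terminate the matching RPC with a local error.  The upper layer
 * then surfaces EIO to the application; there is no retry. */
void complete_with_eio(uint32_t rdma_xid)
{
        for (size_t i = 0; i < sizeof(calls) / sizeof(calls[0]); i++) {
                if (calls[i].xid == rdma_xid) {
                        calls[i].status = -EIO;
                        return;
                }
        }
}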


> As far as the session case, the server will consider the request executed but the client does not have a reply containing the slot and sequence.  To deal with that case, he would have to use rdma_xid in the message with the ERR_CHUNK to get the needed context and so conclude that the slot was available for reuse.

This feels like something that belongs in 5667bis. I'm no real
expert on session behavior. Can you sketch some text that can be
added?
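
Not being a sessions expert, I can only offer a rough sketch of the
sort of thing I imagine that text would describe. The structures and
the seqid handling below are my assumptions, not established NFSv4.1
or 5667bis behavior:

#include <stdbool.h>
#include <stdint.h>

struct session_slot {
        uint32_t seqid;         /* sequence the client last sent on this slot */
        bool     in_use;
};

struct pending_req {
        uint32_t             rdma_xid;  /* XID echoed in the ERR_CHUNK message */
        struct session_slot *slot;      /* slot consumed by this request's SEQUENCE */
};

/* The ERR_CHUNK reply carries no SEQUENCE result, but its rdma_xid
 * identifies the original request, and therefore the slot and seqid
 * that request used.  Since the server may have executed and cached
 * the request, the client advances the seqid before marking the slot
 * available for reuse. */
void release_slot_after_err_chunk(struct pending_req *req)
{
        req->slot->seqid++;
        req->slot->in_use = false;
}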


> > What I propose is that if the first READ_PLUS returns 
> > NFS4_CONTENT_HOLE, the server would return an empty first
> > Write chunk. Then the second READ_PLUS result always 
> > lines up with the second Write chunk, which IMO is much better for clients.
> 
> I'm OK with this but I think you will need to adjust the text to reflect the fact that READ_PLUS can return an array of read_plus_content's, although, in practice, those that return more than one are extremely rare.

I hadn't realized READ_PLUS returned an array.

If an NFS server is allowed to structure its reply in a way that
the client cannot predict, then I think we'll have to limit the
way READ_PLUS uses DDP. I propose these rules:

- The client can provide no more than one Write chunk if it expects
NFS4_CONTENT_DATA. (Following the previous rules, the client sends
no Write chunk, or an empty Write chunk, when it predicts that the
reply can go inline.)

- If that Write chunk is non-empty, it MUST be large enough to
receive all expected payload bytes in a single NFS4_CONTENT_DATA
element.

- The server uses that Write chunk for the first array element that
has an NFS4_CONTENT_DATA arm.
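
Here is a sketch of how a responder might apply these rules, using a
simplified picture of the READ_PLUS result array and of the single
client-provided Write chunk. The types and names are my own, not
from the XDR:

#include <stdbool.h>
#include <stddef.h>

enum elem_type { ELEM_HOLE, ELEM_DATA };  /* stand-ins for the content arms */

struct result_elem {
        enum elem_type type;
        size_t         length;          /* data length, or hole size */
};

struct write_chunk {
        bool   present;                 /* client provided a Write chunk */
        size_t capacity;                /* total registered length */
};

/* Returns the index of the result element to direct into the Write
 * chunk, or -1 if the reply must go entirely inline (no chunk, or an
 * empty chunk), there is no data element, or the chunk is too small
 * to hold the whole data element (a violation of the second rule). */
int elem_for_write_chunk(const struct result_elem *elems, int nelems,
                         const struct write_chunk *wc)
{
        if (!wc->present || wc->capacity == 0)
                return -1;

        for (int i = 0; i < nelems; i++) {
                if (elems[i].type != ELEM_DATA)
                        continue;       /* holes are returned inline */
                return elems[i].length <= wc->capacity ? i : -1;
        }
        return -1;
}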


Then we have a choice, depending on whether it is more desirable
to return data in a single round-trip, or more desirable to preserve
holes. Either:

- If the server finds that the array has grown larger than can be
returned inline or via the supplied Reply chunk, it MUST return
the payload data in a single NFS4_CONTENT_DATA element via the
provided Write chunk.

Or:

- The server MUST return as much payload as it can fit within the
resources provided by the client, and return it as a short READ
result. The client is responsible for retrying the READ_PLUS to
read the remaining payload.
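
The second alternative might boil down to something like this on the
responder, with the requester responsible for the follow-up READ_PLUS
(again, purely illustrative names):

#include <stddef.h>

struct short_result {
        size_t returned;        /* bytes the responder sent this round trip */
        size_t remaining;       /* bytes the requester must ask for again */
};

/* Trim the reply to whatever the client-provided resources can carry
 * and report it as a short READ_PLUS result. */
struct short_result shorten_reply(size_t requested, size_t reply_capacity)
{
        struct short_result res;

        res.returned  = requested < reply_capacity ? requested : reply_capacity;
        res.remaining = requested - res.returned;
        return res;
}

/* The client's retry would then begin at the next unreturned byte. */
size_t retry_offset(size_t orig_offset, const struct short_result *res)
{
        return orig_offset + res->returned;
}

Note that when reply_capacity is zero this degenerates to returning
nothing at all, which is the open case below.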

Somehow we have to deal with the case where the server cannot fit
any of the payload in the client-provided resources.


READ_PLUS is actually a poor fit for offloaded DDP anyway. The whole
point of offload is that the client has to do no work; the payload
arrives in its memory without any effort on its part.

I would just as soon require that, on RDMA transports, READ_PLUS
return only a hole or exactly one contiguous piece of content.


--
Chuck Lever