Re: [nfsv4] rfc5667bis open issues

Chuck Lever <chuck.lever@oracle.com> Fri, 23 September 2016 16:37 UTC

From: Chuck Lever <chuck.lever@oracle.com>
Date: Fri, 23 Sep 2016 09:37:37 -0700
To: David Noveck <davenoveck@gmail.com>
Cc: NFSv4 <nfsv4@ietf.org>

> On Sep 22, 2016, at 5:51 PM, David Noveck <davenoveck@gmail.com> wrote:
> 
> > Does this count as a dropped RPC reply, such that an NFSv4 server
> > would have to drop the connection?
> 
> I don't see any reason why it should.  Apart from RPC-over-RDMA, does this apply to other XDR decode issues on the server? After all, it is getting a reply, even if the content is essentially "I don't understand the request."
> 
> > When an NFS server returns one of these responses, does it 
> > have to enter the reply in its DRC ? 
> 
> I should hope not.  The purpose of the DRC is to prevent repeated execution of non-idempotent requests.  A request that you can't decode does not change any state on the server.

The issue is that the language allows ERR_CHUNK to be returned
after a server has processed the RPC request when a client has
not provided adequate Write list or Reply chunk resources to
convey the reply.

In that case, it makes sense for the server to have added the
request to its DRC.


> > What, if any, implications are there for an NFSv4.1 session 
> > (slot retired? ignored?)
> 
> Why does this have to be addressed in 5667bis?  I don't see why XDR decode errors need to be addressed differently in the RPC-over-RDMA case.

ERR_CHUNK is not necessarily an XDR decode error. See above.

If returning ERR_CHUNK in cases where a result has been
generated but a reply cannot be formed is not workable, the
alternative, IMO, is for the server to drop the connection
without replying.

Some text could be introduced that suggests that servers take
care to ensure there are enough reply resources before they
begin processing an RPC. That may be challenging for some
server implementations, though.
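For illustration, such a pre-check might look like the following sketch (Python; the names and the chunk representation are invented here, not taken from any draft or implementation):

```python
# Hypothetical sketch of a server-side reply-resource pre-check.
# A chunk is modeled simply as a list of segment lengths in bytes.

def chunk_capacity(chunk):
    """Total number of bytes a Write chunk or Reply chunk can receive."""
    return sum(chunk)

def reply_resources_adequate(max_reply_size, inline_threshold, reply_chunk):
    """True if the worst-case reply can be conveyed: either it fits
    inline, or the client's Reply chunk is large enough to carry it."""
    if max_reply_size <= inline_threshold:
        return True
    return (reply_chunk is not None
            and chunk_capacity(reply_chunk) >= max_reply_size)
```

A server performing a check along these lines before executing the request could return ERR_CHUNK without having changed any state, which sidesteps the DRC question entirely.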


> > The current text of rfc5666bis (Section 5.3.2) suggests that when
> > multiple Write chunks are provided for an RPC, and the responder
> > doesn't use one of them, it should use that chunk for the next
> > DDP-eligible XDR data item.
> 
> It does say that, and the treatment there is limited to the case of multiple write chunks.  The treatment for the analogous case when there is a single write chunk is addressed in 4.4.6.2.  The treatment is the same:  each DDP-eligible item is matched with the first available write chunk until either there are no more write chunks or no more DDP-eligible items in the reply.
> 
> > The problematic text is actually this part of rfc5666bis:
> 
> I don't see why this is described as "problematic".  Is there a suggestion that we might change this text to something less problematic?  I don't see how to do that.  I believe we should leave the text as it is.

However, we can say that the default behavior for RPC-over-RDMA
is to skip to the next result, but that nothing prohibits a ULB
from amending that default. I took that approach below.


> Within Version One, there is no way to tie write chunks to particular DDP-eligible items.  While one might think of that as a problem it is not one that can be addressed in Version One or in rfc5667bis.

There is a way to tie these together. Here is the text I have
adopted for rfc5667bis (revision not yet submitted). First:

 
2.2.1.  Empty Write Chunks

   Section 4.4.6.2 of [I-D.ietf-nfsv4-rfc5666bis] defines the concept of
   unused Write chunks.  An unused Write chunk is a Write chunk with
   either zero segments or where all segments in the Write chunk have
   zero length.  In this document these are referred to as "empty" Write
   chunks.  A "non-empty" Write chunk has at least one segment of non-
   zero length.

   An NFS client might wish an NFS server to return a DDP-eligible
   result inline.  If there is only one DDP-eligible result item in the
   reply, the NFS client simply specifies an empty Write list to force
   the NFS server to return that result inline.  If there are multiple
   DDP-eligible results, the NFS client specifies empty Write chunks for
   each DDP-eligible data item that it wishes to be returned inline.

   An NFS server might encounter an XDR union result where there are
   arms that have a DDP-eligible result, and arms that do not.  If the
   NFS client has provided a non-empty Write chunk that matches with a
   DDP-eligible result, but the response does not contain that result,
   the NFS server MUST return an empty Write chunk in that position in
   the Write list.


And then, in Section 4.3:
 
   The mechanism specified in Section 5.3.2 of
   [I-D.ietf-nfsv4-rfc5666bis] is applied here, with some additional
   restrictions.  In the following list, a "READ" operation refers to
   either a READ, READ_PLUS, or READLINK operation.

   o  If an NFS client does not wish to use direct placement for any
      DDP-eligible item in an NFS reply, it leaves the Write list empty.

   o  The first chunk in the Write list MUST be used by the first READ
      operation in an NFS version 4 COMPOUND procedure.  The next Write
      chunk is used by the next READ operation, and so on.

   o  If an NFS client has provided a matching non-empty Write chunk,
      then the corresponding READ operation MUST return its data by
      placing data into that chunk.

   o  If an NFS client has provided an empty matching Write chunk, then
      the corresponding READ operation MUST return its result inline.

   o  If a READ operation returns a union arm which does not contain a
      DDP-eligible result, and the NFS client has provided a matching
      non-empty Write chunk, the NFS server MUST return an empty Write
      chunk in that Write list position.

   o  If there are more READ operations than Write chunks, then any
      remaining READ operations in the COMPOUND MUST return their
      results inline.
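The matching rules above can be sketched as follows (Python; the data model -- chunks as lists of segment lengths, READ results as tagged tuples -- is invented for illustration):

```python
# Illustrative sketch of the proposed Write list matching rules.
# An "empty" chunk has no segments or only zero-length segments.
# READ results are modeled as ("data", length), ("hole", length),
# or ("error",).

def is_empty(chunk):
    return all(seg == 0 for seg in chunk)

def match_write_list(read_results, write_list):
    """For each READ result, decide how it is conveyed and which
    Write chunk (possibly empty) occupies that Write list position."""
    placements = []
    for i, result in enumerate(read_results):
        chunk = write_list[i] if i < len(write_list) else None
        kind = result[0]
        if chunk is None:
            # More READ operations than Write chunks: result goes inline.
            placements.append(("inline", None))
        elif kind != "data":
            # Union arm without a DDP-eligible result: the server
            # returns an empty Write chunk in this position.
            placements.append(("inline", []))
        elif is_empty(chunk):
            # Client provided an empty chunk to force an inline result.
            placements.append(("inline", []))
        else:
            # Non-empty matching chunk: direct data placement.
            placements.append(("direct", chunk))
    return placements
```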


This problem has not arisen before for two reasons:

- So far, NFSv4 clients do not build COMPOUNDs that contain more
DDP-eligible results than NFSv3 clients use (i.e., only one per RPC).

- So far union results are used only in error cases; and when an
operation returns an error, it is always the last one in the
COMPOUND reply.

But now we have NFSv4.2 READ_PLUS. READ_PLUS can return:

- NFS4_CONTENT_DATA, which has an opaque DDP-eligible result
- NFS4_CONTENT_HOLE, which has no DDP-eligible result
- an error status, which has no DDP-eligible result

So we now have a situation where there are operations that can follow
a result that might or might not be returned in a Write chunk.

Suppose you have a COMPOUND that looks like this:

   { SEQUENCE, PUTFH, READ_PLUS(2048), READ_PLUS(9000), GETATTR }

Using the currently proposed scheme, the client must set up two Write
chunks that can receive 9000 bytes, to handle the case where the first
READ_PLUS returns NFS4_CONTENT_HOLE. After the reply is received, the
client has to ensure that the first returned Write chunk is matched to
the second XDR result, which could be troublesome for some
implementations.

What I propose is that if the first READ_PLUS returns NFS4_CONTENT_HOLE,
the server would return an empty first Write chunk. Then the second
READ_PLUS result always lines up with the second Write chunk, which IMO
is much better for clients.
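To make the difference concrete, here is a toy comparison (Python; the representations are invented) of the rfc5666bis skip-to-next behavior and the proposed positional rule, for the COMPOUND above when the first READ_PLUS returns NFS4_CONTENT_HOLE:

```python
# Toy comparison of the two matching behaviors for
#   { SEQUENCE, PUTFH, READ_PLUS(2048), READ_PLUS(9000), GETATTR }
# when the first READ_PLUS hits NFS4_CONTENT_HOLE.

def skip_to_next(results, chunks):
    """rfc5666bis default: an unused chunk is carried forward to the
    next DDP-eligible item, so position no longer implies identity."""
    out, ci = [], 0
    for r in results:
        if r[0] == "data" and ci < len(chunks):
            out.append((r, chunks[ci]))
            ci += 1
        else:
            out.append((r, None))
    return out

def positional(results, chunks):
    """Proposed rule: chunk i always belongs to READ i; the server
    returns an empty chunk when result i is not DDP-eligible."""
    return [(r, chunks[i] if r[0] == "data" else "empty")
            for i, r in enumerate(results)]

results = [("hole", 2048), ("data", 9000)]
chunks = ["chunk0(9000)", "chunk1(9000)"]
# skip_to_next: the second result lands in chunk0 -- the client must
#               re-match chunks to results after the fact.
# positional:   the second result stays in chunk1 -- positions line up.
```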


--
Chuck Lever