Re: [nfsv4] Review of draft-ietf-nfsv4-rfc5667bis-09

Chuck Lever <chuck.lever@oracle.com> Thu, 04 May 2017 15:00 UTC

From: Chuck Lever <chuck.lever@oracle.com>
To: NFSv4 <nfsv4@ietf.org>
Date: Thu, 04 May 2017 11:00:41 -0400
Subject: Re: [nfsv4] Review of draft-ietf-nfsv4-rfc5667bis-09

> On May 3, 2017, at 2:00 AM, Tom Talpey <ttalpey@microsoft.com> wrote:
> 
>> -----Original Message-----
>> From: nfsv4 [mailto:nfsv4-bounces@ietf.org] On Behalf Of Chuck Lever
>> Sent: Tuesday, May 2, 2017 6:48 PM
>> To: NFSv4 <nfsv4@ietf.org>
>> Subject: Re: [nfsv4] Review of draft-ietf-nfsv4-rfc5667bis-09
>> 
>> 
>>> On May 1, 2017, at 5:02 PM, karen deitke <karen.deitke@oracle.com> wrote:
>>> 
>>> 
>>> 
>>> On 4/28/2017 11:32 AM, Chuck Lever wrote:
>>>>> On Apr 27, 2017, at 10:24 AM, David Noveck <davenoveck@gmail.com>
>>>>> wrote:
>>>>> 
>>>>>> Correct. But since the protocol creates the problem, the protocol
>>>>>> definition needs to say something about dealing with it.
>>>>> The protocol did not create this problem.
>>>> I respectfully disagree.
>>>> 
>>>> The transport protocol design allows arbitrarily complex Read and
>>>> Write lists. This is a common practice in protocol design to permit
>>>> wide latitude for innovation by implementers.
>>>> 
>>>> NFSv4 also allows arbitrarily complex COMPOUNDs for similar reasons,
>>>> but a) clients do not today make use of this, and b) newer minor
>>>> versions recognize that implementation limits have to be
>>>> communicated.
>>>> 
>>>> The protocol problem we are trying to address is that there is no
>>>> mechanism (either via a specified limit or via a run-time
>>>> negotiation) that allows implementations to choose limits that are
>>>> less than infinity while allowing acceptable interoperability.
>>>> 
>>>> 
>>>>> Up until about a month ago,
>>>>> the protocol allowed multiple chunks and we believed that the
>>>>> implementations did as well.
>>>> It's true that the transport protocol hasn't changed, but I think up
>>>> until a month ago, we simply didn't realize that permissive chunk
>>>> list complexity limits were an interoperability issue.
>>>> 
>>>> The Linux implementations take a lot of short cuts because they are
>>>> really advanced prototypes, not fully mature implementations of the
>>>> transport. One of the short cuts has been that they implement just
>>>> the minimum number of chunk combinations needed for most NFS
>>>> operations. They do not implement all possible chunk combinations,
>>>> nor do they support arbitrarily long chunk lists, because they never
>>>> had to.
>>>> 
>>>> I'm aware that other implementers have taken a philosophically
>>>> similar approach to shorten the time it takes to get a working client
>>>> and server, and I'm sure that will be the case for implementations in
>>>> the future. As protocol designers I don't think we should ignore this
>>>> kind of expediency.
>>>> 
>>>> In practice, for now chunk list complexity limits really aren't a
>>>> problem at all, because of a) above, except for the desire to have
>>>> servers accept a Read list that contains a Position Zero Read chunk
>>>> and a normal Read chunk at once.
>>>> 
>>>> So this is a situation that is hazardous, but might not be
>>>> encountered in practice for years. I don't want to spill a lot more
>>>> electrons or brain cells on fixing something that has minimal
>>>> consequences for the set of implementations and ULPs we have today.
>>> Agreed.  The Solaris server can currently handle an offset 0 Read chunk and
>>> another Read chunk.  What happens if we receive more than that?  It's
>>> uncertain.  Our client doesn't implement this, nor have we seen it from other
>>> clients.  That is, it's never been tested.  The same is true for more than one
>>> Write chunk.
>>> 
>>>> 
>>>> 
>>>>> Then we found out that a lot of implementations have these restrictions
>>>>> and we are trying to deal with that situation.  The protocol has
>>>>> stayed the same but we now know that some implementations have these
>>>>> restrictions.
>>>>> 
>>>>> Even though the protocol did not create this situation, this
>>>>> document is the only opportunity we have to tell clients how to deal with
>>>>> these restrictions.
>>>> Or we could give implementers a base set of chunk list capabilities
>>>> that must be observed. For all versions of NFS on RPC-over-RDMA
>>>> Version One, make the limit one normal Read chunk plus one PZRC (both
>>>> with multiple Read segments), and one Write chunk (with multiple
>>>> segments). Replace discussion of handling
>>>> NFSv4 COMPOUNDs with more than one DDP-eligible element with a few
>>>> rules that determine which single operation in a multiple READ or
>>>> WRITE COMPOUND gets to use DDP.
>>>> 
>>>> Real support for multiple chunks with NFSv4 COMPOUNDs will have to
>>>> wait until another version of RPC-over-RDMA is available.
>>>> 
>>>> Not sure what to do about segment count limits.
>>> Agreed.  Currently the Solaris server does not have a limit on the segment
>>> count; the practical limit is usually only what will fit in the receive buffer
>>> that holds the RDMA header itself.
>> 
>> Here's an expansion of S5.4.2 to include most of Tom's proposed text and a
>> description of current implementation behavior. Please don't hesitate to argue
>> in favor of anything I left out, or anything I should remove.
>> 
> 
> Thanks Chuck, great improvement. Some comments on the update:
> 
>> 5.4.2.  Complexity Considerations
>> 
>>   The RPC-over-RDMA Version One protocol does not place any limit on
>>   the number of chunks or segments that may appear in the Read or Write
>>   lists.  However, for various reasons NFS version 4 server
>>   implementations often have practical limits on the number of chunks
>>   or segments they are prepared to process in one message.
>> 
>>   These implementation limits are especially important when Kerberos
>>   integrity or privacy is in use [RFC7861].  GSS services increase the
>>   size of credential material in RPC headers, forcing more frequent use
> 
> GSS payloads don't always force this; I'd suggest "potentially requiring" or similar
> softened text.
> 
>>   of Position-Zero Read chunks and Reply chunks.  This can increase the
>>   complexity of chunk lists independent of the NFS version 4 COMPOUND
>>   being conveyed.
>> 
>>   To avoid encountering server chunk list complexity limits, NFS
>>   version 4 clients SHOULD restrict their RPC-over-RDMA Version One
>>   messages to simple combinations of chunks:
> 
> Again, "restrict" may be too strong. If the client is certain that a certain chunk list
> is acceptable, it's perfectly OK for it to send it. Suggest the following rules being
> cast as "safe" or at least "following the Internet Principles" as "conservative in
> what you send".
> 
>> 
>>   o  The Read list contains no more than one Position-Zero Read chunk
>>      and one Read chunk with a non-zero Position.
>> 
>>   o  The Write list contains no more than one chunk.
>> 
>>   o  The inline threshold restricts the number of segments that may
>>      appear in either list.
>> 
>>   NFS version 4 clients wishing to send more complex chunk lists can
>>   provide configuration interfaces to bound the complexity of NFS
>>   version 4 COMPOUNDs, limit the number of elements in scatter-gather
>>   operations, and avoid other sources of RPC-over-RDMA chunk overruns
>>   at the peer.
> 
> Good.
> 
>>   An NFS Version 4 server has some flexibility in how it indicates that
>>   an RPC-over-RDMA Version One message constructed by an NFS Version 4
>>   client is valid but cannot be processed.  Examples include:
>> 
>>   o  A problem is detected at the transport layer (i.e., during
>>      transport header processing).  The server returns an RDMA_ERROR
>>      message with the err field set to ERR_CHUNK.
>> 
>>   o  A problem is detected during XDR decoding of the request (e.g.,
>>      during re-assembly of the RPC Call message by the RPC layer).  The
>>      server returns an RPC reply with its "reply_stat" field set to
>>      MSG_ACCEPTED and its "accept_stat" field set to GARBAGE_ARGS.
>> 
>>   o  A problem is detected in the Upper Layer (i.e., by the NFS version
>>      4 implementation).  The server sends an NFS reply with a status of
> 
> The two previous bullets used "returns" while this uses "sends". Intentional?
> 
>>      NFS4ERR_RESOURCE.
>> 
>>   After receiving one of these errors, an NFS version 4 client SHOULD
>>   NOT retransmit the failing request, as the result would be the same
>>   error.  It SHOULD immediately terminate the RPC transaction
>>   associated with the XID in the reply.
> 
> Now this is interesting. ERR_CHUNK and GARBAGE_ARGS are clear, but I
> thought NFS4ERR_RESOURCE is generally retryable. So, how does the client
> determine this error is actually from the RDMA encoding, and not retry?
> This protocol should not require another protocol to change existing behavior.

I've tried to address your concerns, and I changed the segment count
limit to document what we know about the current Solaris implementations.
I liked the "be conservative in what you send" idea, but I couldn't find
a way to work it in that was not awkward.

If this text is OK, I'll submit a fresh I-D revision after WGLC ends on
Friday.


5.4.2.  Chunk List Complexity

   The RPC-over-RDMA Version One protocol does not place any limit on
   the number of chunks or segments that may appear in the Read or Write
   lists.  However, for various reasons NFS version 4 server
   implementations often have practical limits on the number of chunks
   or segments they are prepared to process in one message.

   These implementation limits are especially important when Kerberos
   integrity or privacy is in use [RFC7861].  GSS services increase the
   size of credential material in RPC headers, potentially requiring
   more frequent use of Long messages.  This can increase the complexity
   of chunk lists independent of the NFS version 4 COMPOUND being
   conveyed.
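
(As an aside, not part of the proposed text: the following sketch in plain C shows
the decision the paragraph above describes. A call whose RPC header, inflated by
GSS credential material, no longer fits together with its inline payload under the
connection's inline threshold has to be conveyed as a Long call via a Position-Zero
Read chunk. The threshold value and the structure and field names are hypothetical,
chosen only for illustration.)

/*
 * Sketch only: illustrates why larger GSS credentials push a client
 * toward Long (Position-Zero Read chunk) calls.  The threshold value
 * and the structure and field names are hypothetical; they are not
 * taken from the draft or from any implementation.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define INLINE_THRESHOLD 1024      /* assumed inline threshold, in bytes */

struct rpc_call_sketch {
    size_t rpc_header_len;         /* RPC call header, including credential */
    size_t inline_payload_len;     /* arguments not moved via DDP */
};

/* Returns true when the call must be conveyed as a Long call. */
static bool needs_long_call(const struct rpc_call_sketch *call)
{
    return call->rpc_header_len + call->inline_payload_len > INLINE_THRESHOLD;
}

int main(void)
{
    /* AUTH_SYS-sized credential: header plus payload fits inline */
    struct rpc_call_sketch small = { .rpc_header_len = 160,
                                     .inline_payload_len = 512 };
    /* krb5i/krb5p credential material inflates the header past the threshold */
    struct rpc_call_sketch gss   = { .rpc_header_len = 700,
                                     .inline_payload_len = 512 };

    printf("small call needs Long message: %d\n", needs_long_call(&small));
    printf("gss call needs Long message:   %d\n", needs_long_call(&gss));
    return 0;
}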

   To avoid encountering server chunk list complexity limits, NFS
   version 4 clients SHOULD follow the prescriptions below when
   constructing transport headers:

   o  The Read list can contain a Position-Zero Read chunk, one Read
      chunk with a non-zero Position, or both.

   o  The Write list can contain no more than one Write chunk.

   o  Any chunk can contain up to sixteen RDMA segments.

   NFS version 4 clients wishing to send more complex chunk lists can
   provide configuration interfaces to bound the complexity of NFS
   version 4 COMPOUNDs, limit the number of elements in scatter-gather
   operations, and avoid other sources of RPC-over-RDMA chunk overruns
   at the receiving peer.
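
(Another illustrative aside, not proposed text: a client that wants to stay within
the conservative limits above could run a pre-send check like the following over the
chunk lists it has marshaled. All structure layouts and names are hypothetical; the
only facts taken from the text are the limits themselves: at most one Position-Zero
Read chunk, at most one Read chunk with a non-zero Position, at most one Write
chunk, and no more than sixteen segments per chunk.)

/*
 * Sketch only: a conservative pre-send check over marshaled chunk
 * lists.  Structure layouts and the helper name are illustrative and
 * are not taken from the draft or from any implementation.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define MAX_SEGMENTS_PER_CHUNK 16

struct chunk_sketch {
    unsigned int position;      /* XDR stream offset; 0 for a PZRC */
    unsigned int nsegments;     /* RDMA segments in this chunk */
};

struct call_chunks_sketch {
    const struct chunk_sketch *read_list;
    size_t nread_chunks;
    const struct chunk_sketch *write_list;
    size_t nwrite_chunks;
};

/* Returns true when the chunk lists stay within the conservative limits. */
static bool chunk_lists_are_conservative(const struct call_chunks_sketch *c)
{
    size_t pzrc = 0, positioned = 0, i;

    for (i = 0; i < c->nread_chunks; i++) {
        if (c->read_list[i].nsegments > MAX_SEGMENTS_PER_CHUNK)
            return false;
        if (c->read_list[i].position == 0)
            pzrc++;
        else
            positioned++;
    }
    if (pzrc > 1 || positioned > 1)
        return false;

    if (c->nwrite_chunks > 1)
        return false;
    for (i = 0; i < c->nwrite_chunks; i++)
        if (c->write_list[i].nsegments > MAX_SEGMENTS_PER_CHUNK)
            return false;

    return true;
}

int main(void)
{
    /* One PZRC plus one positioned Read chunk, and one Write chunk: allowed. */
    struct chunk_sketch reads[]  = { { 0, 4 }, { 136, 2 } };
    struct chunk_sketch writes[] = { { 0, 8 } };   /* position unused for Write chunks */
    struct call_chunks_sketch call = { reads, 2, writes, 1 };

    printf("conservative: %d\n", chunk_lists_are_conservative(&call));
    return 0;
}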

   An NFS Version 4 server has some flexibility in how it indicates that
   an RPC-over-RDMA Version One message received from an NFS Version 4
   client is valid but cannot be processed.  Examples include:

   o  A problem is detected by the transport layer while parsing the
      transport header in an RPC Call message.  The server responds with
      an RDMA_ERROR message with the err field set to ERR_CHUNK.

   o  A problem is detected during XDR decoding of the RPC Call message
      while the RPC layer reassembles the call's XDR stream.  The server
      responds with an RPC reply with its "reply_stat" field set to
      MSG_ACCEPTED and its "accept_stat" field set to GARBAGE_ARGS.

   After receiving one of these errors, an NFS version 4 client SHOULD
   NOT retransmit the failing request, as the result would be the same
   error.  It SHOULD immediately terminate the RPC transaction
   associated with the XID in the reply.
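
(One more illustrative aside: the client-side behavior in the last paragraph might
look like the sketch below. The enum, the completion helper, and the error value are
hypothetical stand-ins for an implementation's RPC machinery; the point is only that
both reply classes complete the transaction identified by the XID with a permanent
error instead of re-queuing it for retransmission.)

/*
 * Sketch only: how a client might react to the two reply classes
 * above.  The enum, the completion helper, and the error value are
 * hypothetical stand-ins for an implementation's RPC machinery.
 */
#include <stdint.h>
#include <stdio.h>

enum reply_class_sketch {
    REPLY_OK,                  /* normal RPC reply */
    REPLY_RDMA_ERR_CHUNK,      /* RDMA_ERROR with err set to ERR_CHUNK */
    REPLY_GARBAGE_ARGS,        /* MSG_ACCEPTED with GARBAGE_ARGS */
};

/* Hypothetical completion helper: finish the RPC identified by xid. */
static void complete_rpc(uint32_t xid, int status)
{
    printf("xid 0x%08x completed with status %d\n", xid, status);
}

/*
 * Retransmitting would only reproduce the same failure, so both error
 * classes terminate the transaction immediately.
 */
static void handle_reply(uint32_t xid, enum reply_class_sketch cls)
{
    switch (cls) {
    case REPLY_RDMA_ERR_CHUNK:
    case REPLY_GARBAGE_ARGS:
        complete_rpc(xid, -1);  /* permanent error, do not retry */
        break;
    case REPLY_OK:
        complete_rpc(xid, 0);
        break;
    }
}

int main(void)
{
    handle_reply(0x1a2b3c4d, REPLY_RDMA_ERR_CHUNK);
    handle_reply(0x1a2b3c4e, REPLY_OK);
    return 0;
}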


--
Chuck Lever