Re: [nfsv4] I-D Action: draft-ietf-nfsv4-rfc5667bis-04.txt

> On Feb 1, 2017, at 1:52 PM, David Noveck <davenoveck@gmail.com> wrote:
> 
> > I mean the converse: An extension _may_ add ULB requirements.
> > Thus there have to be guidelines written somewhere about how
> > that is accomplished, even if they might not be immediately
> > used.
> 
> I think the best way to handle this is for me to propose changes to
> rpcrdma-version-two and let you comment on them.
> 
> 
> > > I do think we
> > > need some text in rpcrdma-version-two that explains what the extensions
> > > have to say if they propose different implementation of abstractions
> > > referred to by ULB(s).
> 
> > That's also fine.
> 
> So now I have two items on my to-do list for rpcrdma-version-two.  I'll probably do them together.

All of the above is what I had in mind, too.

> > You want to take advantage of it before these extensions are
> > finished, before some have even been reviewed and discussed
> > publicly. That is what I object to.
> 
> A couple of points:
> 	• I want to discuss how these extensions would be used by NFS and it appears to me that the simplest way of doing that is for me to publish nfsulb.
> 	• This is particularly important since the only ULBs in existence are the ones for NFS.
> 	• The extensions have been published.  It is true that they haven't had the review they should have but I can't make people review things.  I'm not proposing that they be made working group documents, let alone saying they are ready for WGLC.  They aren't ready but it would help to begin discussion of how these would be used in NFS.

I have reviewed this work, but the review became out of date after
Berlin and I haven't had a chance to revisit it to prepare it again
for posting.

> 	• Publishing nfsulb would allow Chuck to take whichever changes he wishes into rfc5667bis and defer/reject those he is uncomfortable with

So would continuing iterative review on rfc5667bis, but OK, I
can do it this way too.

> > I would rather wait until it is clear that rfc5667bis cannot
> > accommodate your goals, and focus for now on:
> 
> > 1. continued iterative review of rfc5667bis, and
> 
> I'm OK with that, but rather than continue an abstract 
> discussion of the phyletics of various DDP species, I think I can 
> publish nfsulb as an example of my approach and Chuck and I, 
> together with the rest of the group, can discuss my proposed changes
> and Chuck can decide whether they are appropriate to rfc5667bis.
> 
> > 2. composing language for rpcrdma-version-two that provides
> > guidelines for ULB updates in extensions.
> 
> I'm going to do that.
> 
> One remaining issue concerns Chuck's potential role as a co-author of
> nfsulb.  Based on his comments, I assume that he wouldn't be comfortable 
> with that and so I intend to instead credit him in the acknowledgements 
> section. If Chuck wants, we can change that.

Acknowledgement sounds fine.

> On Wed, Feb 1, 2017 at 12:01 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
> 
> > On Jan 31, 2017, at 7:59 PM, David Noveck <davenoveck@gmail.com> wrote:
> >
> > > I've attempted to make the language more generic, but I
> > > regard that as a constructive but ultimately unattainable
> > > goal because it requires predicting the future with a high
> > > degree of accuracy.
> >
> > I don't agree.
> 
> I think you do agree; as you said below, the issue is what
> are the actual completion criteria.
> 
> 
> > I think we need to use more abstraction in
> > stating the requirements than rfc5667 did.
> 
> Obviously I agree with that, since there already is more
> abstraction in rfc5667bis than there was in RFC 5667, and
> I have stated that the work isn't done yet.
> 
> 
> > The NFS specs
> > work with filesystems that were not known when the
> > specs were written.  Only when we extend the
> > abstraction, as we did with xattrs, do we have to
> > write a new extension document.
> >
> > > At some point we have to stop and say it is good enough.
> >
> > We agree on that but disagree about what that point is.
> >
> >
> > > > The concept of an
> > > > extensible version 2 transport would become unworkable in that
> > > > case.
> >
> > > IMO that's overstating it. We have a lot to learn about how to
> > > extend RPC-over-RDMA.
> >
> > > The extension rules in rpcrdma-version-two
> > > have to state how extension specifications deal with adding
> > > new ULB requirements.
> >
> > I don't think extensions need to add ULB requirements.
> 
> I mean the converse: An extension _may_ add ULB requirements.
> Thus there have to be guidelines written somewhere about how
> that is accomplished, even if they might not be immediately
> used.
> 
> 
> > I do think we
> > need some text in rpcrdma-version-two that explains what the extensions
> > have to say if they propose different implementation of abstractions
> > referred to by ULB(s).
> 
> That's also fine.
> 
> 
> > I'll draft some proposed text in this regard and run it by you.
> >
> > > Right now, we don't know what requirements
> > > even existing proposed extensions might add. For instance:
> >
> > > - Do we really believe that rfc5666bis DDP eligibility is an
> > > appropriate rule to apply, without any change, to send-based DDP?
> >
> > I believe they do.  rtrext is written using that concept which
> > you created and required that ULBs define.  That gives us a good
> > basis to make ULBs more generic.  I just want to take
> > advantage of it sooner than you do.
> 
> You want to take advantage of it before these extensions are
> finished, before some have even been reviewed and discussed
> publicly. That is what I object to.
> 
> As previously stated, I don't have a problem making rfc5667bis
> more generic. But care must be taken. It can go too far, and
> it especially isn't a good idea to push this generality based
> on protocol designs that are not mature.
> 
> For instance, as to the technical issue about whether DDP
> eligibility is appropriate for send-based DDP without change:
> I believe it isn't.
> 
> 1. The rfc5666bis definition of DDP eligibility makes sense
> if you are dealing with a mechanism that is more efficient
> with large payloads than small. Send-based DDP is just the
> opposite.
> 
> 
> 2. Re-assembling an RPC message that has been split by
> message continuation or send-based DDP is done below the XDR
> layer, since all message data is always available once Receive
> is complete, and XDR decoding is completely ordered with the
> receipt of the RPC message. With explicit RDMA, receipt of
> some arguments MAY be deferred until after XDR decoding
> begins.
> 
> In fact, the current RPC-over-RDMA protocol is designed
> around that. For example the mechanism of reduction relies on
> only whole data items being excised to ensure XDR round-up of
> variable-length data types is correct. That is because an
> implementation MAY perform RDMA Read in the XDR layer, not
> in the transport.
> 
> 
> 3. Message continuation, like a Long message, enables an RPC
> message payload to be split arbitrarily, without any regard
> to DDP eligibility or data item boundaries.
> 
> 
> 4. Send-based DDP is not an offload transfer. It involves the
> host CPUs on both ends. Thus it doesn't make sense to restrict
> the DDP eligibility of data items that require XDR marshaling
> and unmarshaling.
> 
> 
> A receiver has generic receive resources that have to
> accommodate any RPC that is received (via Send/Receive). DDP
> eligibility is designed to prepare the receiver for exactly
> one particular data item that MAY involve the ULP in the
> reconstruction of the whole RPC message.
> 
> Explicit RDMA and send-based DDP are perhaps in the same
> phylum but are truly distinct beasts.
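To make the XDR round-up point in item 2 concrete, here is a minimal Python sketch. It assumes only the XDR rule that variable-length opaque data is padded to a multiple of four bytes (RFC 4506). The helper names (xdr_pad_len, encode_opaque, reduce_opaque) are made up for illustration and are not taken from any implementation, and reduce_opaque is a deliberately simplified model of reduction rather than the rfc5666bis procedure.

```python
import struct
from typing import Tuple

XDR_UNIT = 4  # XDR encodes data in multiples of four bytes (RFC 4506)


def xdr_pad_len(length: int) -> int:
    """Return the number of zero bytes needed to round `length` up to an XDR unit."""
    return (XDR_UNIT - length % XDR_UNIT) % XDR_UNIT


def encode_opaque(data: bytes) -> bytes:
    """Encode variable-length opaque data: length word, payload, round-up padding."""
    return struct.pack(">I", len(data)) + data + b"\x00" * xdr_pad_len(len(data))


def reduce_opaque(data: bytes) -> Tuple[bytes, bytes]:
    """Simplified reduction: excise the whole payload from the inline stream.

    Returns (inline_part, chunk_part). Only the length word stays inline, so
    the inline stream remains aligned to the XDR unit and the receiver can
    work out the round-up from the item's length alone.
    """
    return struct.pack(">I", len(data)), data
```

Because the whole item is excised, the inline part stays a multiple of four bytes. A split at an arbitrary byte offset, as message continuation allows, gives no such guarantee, which is why the two mechanisms are hard to describe with one rule.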
> 
> 
> > > I don't think that's a foregone conclusion.
> >
> > Perhaps not but I believe it is the case.
> >
> > > - Can you explain here why you believe that message continuation
> > > will have an impact on the way ULBs are written?
> >
> > Right now rfc5667 says that if a message is longer than the
> > size of the receiver's buffer it must go in a reply chunk.  That
> > isn't compatible with message continuation.
> >
> > Instead, it should say that if a message is longer than the longest
> > message that a transport can handle using sends, it should be
> > transferred using explicit RDMA operations.
> 
> That is a clear, simple, and actionable review comment,
> unlike "rfc5667bis forecloses on the use of message
> continuation".
> 
> This specific issue wasn't mentioned in your previous
> comments about this I-D, so there was no way for me to
> address it before.
> 
> An alternate way to address the comment is to remove that
> kind of language from rfc5667bis, since the general
> topic of Long messages is already covered by rfc5666bis.
> I'll go have a look.
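
The wording proposed in the review comment above can be sketched as a small decision function. This is only an illustration in Python: the names Transfer, max_send_payload, and choose_transfer are hypothetical, and modeling message continuation as "inline threshold times a segment count" is an assumption made for the example (it ignores per-segment header overhead), not something any of the drafts specify.

```python
from enum import Enum


class Transfer(Enum):
    SEND = "send"
    EXPLICIT_RDMA = "explicit_rdma"


def max_send_payload(inline_threshold: int, max_segments: int = 1) -> int:
    """Longest message the transport can move using Sends alone.

    With no message continuation, max_segments is 1 and the limit is the
    receiver's buffer size. The multiplication is an illustrative assumption.
    """
    return inline_threshold * max_segments


def choose_transfer(message_len: int, send_limit: int) -> Transfer:
    """Use Sends when the message fits the transport's limit, else explicit RDMA."""
    if message_len <= send_limit:
        return Transfer.SEND
    return Transfer.EXPLICIT_RDMA
```

The point of the sketch is that the ULB would only refer to the transport's limit, and the transport alone decides what that limit is.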
> 
> 
> > That would leave it up to the transport to define the longest message
> > that can be transferred by sends and how longer messages are transferred
> > using explicit RDMA operations, which rfc5666bis already does.
> >
> > > The point being that we have a set of ULB composition rules in
> > > rfc5666bis. You seem to be adding new ones that we haven't all
> > > agreed on, and then dinging me (repeatedly) for ignoring them.
> >
> > I haven't done that.  If it seems to you that I have, it would help
> > if you specified one of these new rules.
> 
> > > I'm much more comfortable closing the book on rfc5667bis with the
> > > set of transports we have in RFCs now, and then opening another
> > > update when we have real documented transport changes to address.
> >
> > If you want to close the book on rfc5667bis, I think I can put
> > off what I was planning to do.
> >
> > > Otherwise rfc5667bis will never be finished.
> >
> > I think you are unduly pessimistic but I think we can accommodate your
> > desire to close on rfc5667bis.  I'll send separate mail on the retry issue.
> 
> It is not pessimism. It is that the targets are moving, and
> not fully documented: a recognizable pattern for creeping
> scopes and slipping delivery schedules.
> 
> 
> > > Let's design for what we have,
> >
> > OK.  rfc5667bis will address what we have, limited to protocols for which there
> > are implementations, and working group documents that are far advanced.
> >
> > Another document will address what we have in the sense that there are existing specifications
> > for the feature.  In other words it will handle version-two and rtrext.  I think it will handle a lot
> > more, although you might not agree.
> >
> > > clear the table, and move on. There's
> > > no reason to believe that will create an unending sequence of updates
> > > to the NFS ULB. There will be no more ULB updates than there are
> > > extensions, and quite likely fewer.
> >
> > Exactly.  I think we can get by with exactly one.
> >
> > So what I expect to do is to submit draft-dnoveck-nfsv4-nfsulb, illustrating my
> > approach to making the ULBs for NFS transport-generic.  Since it is based
> > on rfc5667bis, I will, with your permission, list you as a co-author.
> 
> I would rather wait until it is clear that rfc5667bis cannot
> accommodate your goals, and focus for now on:
> 
> 1. continued iterative review of rfc5667bis, and
> 
> 2. composing language for rpcrdma-version-two that provides
> guidelines for ULB updates in extensions.
> 
> 
> > On Tue, Jan 31, 2017 at 6:02 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
> >
> > > On Jan 31, 2017, at 2:54 PM, David Noveck <davenoveck@gmail.com> wrote:
> > >
> > > > Thus I'm considering removing the retry language from rfc5667bis.
> > >
> > > I don't think that should be done.  Let me explain why.
> > >
> > > > First, the mandate I recall receiving in Dallas for rfc5667bis was:
> > > >
> > > > 1. Document existing implementations
> > > > 2. Fix mistakes and omissions
> > > > 3. Extend the NFS ULB to properly cover NFSv4.1 and NFSv4.2
> > >
> > > And those have been done :-)
> > >
> > > > Since then we have added:
> > >
> > > > 4. Make it align with new language in rfc5666bis
> > >
> > > One part of which is to require that you have a reliable means of
> > > bounding reply sizes.  Allowing retry is a means to accomplish
> > > that requirement without essentially waving the problem away
> > > and guessing/hoping that it will not be bothersome.   Even if this
> > > is not a problem in practical terms, I don't see how we can present
> > > an rfc5667bis that doesn't meet the requirements in rfc5666bis.
> > >
> > > > And snuck in:
> > >
> > > I don't think there was anything sneaky here.  The "mandate"
> > > you mention was not a legislative act or executive order.
> > > It was a plan to go forward with and we should be able
> > > to make changes if necessary.
> >
> > > > 5. Make it align with the language of rpcrdma-version-two and
> > > > rpcrdma-rtrext (which I'm also struggling with, but that's a
> > > > separate topic)
> > >
> > > I think the goal is not to align it with the particulars of any
> > > particular version or extension but to make it generic, so
> > > that the transport can be changed without continually revising
> > > the ULB for new versions and extensions.
> >
> > I've attempted to make the language more generic, but I
> > regard that as a constructive but ultimately unattainable
> > goal because it requires predicting the future with a high
> > degree of accuracy.
> >
> > At some point we have to stop and say it is good enough.
> >
> >
> > > The concept of an
> > > extensible version 2 transport would become unworkable in that
> > > case.
> >
> > IMO that's overstating it. We have a lot to learn about how to
> > extend RPC-over-RDMA. The extension rules in rpcrdma-version-two
> > have to state how extension specifications deal with adding
> > new ULB requirements. Right now, we don't know what requirements
> > even existing proposed extensions might add. For instance:
> >
> > - Do we really believe that rfc5666bis DDP eligibility is an
> > appropriate rule to apply, without any change, to send-based DDP?
> > I don't think that's a foregone conclusion.
> >
> > - Can you explain here why you believe that message continuation
> > will have an impact on the way ULBs are written?
> >
> > The point being that we have a set of ULB composition rules in
> > rfc5666bis. You seem to be adding new ones that we haven't all
> > agreed on, and then dinging me (repeatedly) for ignoring them.
> >
> >
> > > As an example, the current rfc5667bis, in numerous places
> > > assumes that no message continuation feature is available.  We have
> > > suffered, for a long time, from the lack of this feature which SMB
> > > Direct has had for a long time.  Even apart from the controversial
> > > concept of send-based DDP, I don't see how we can have an rfc5667bis
> > > that essentially forecloses a message continuation extension.
> >
> > I'm much more comfortable closing the book on rfc5667bis with the
> > set of transports we have in RFCs now, and then opening another
> > update when we have real documented transport changes to address.
> > Otherwise rfc5667bis will never be finished.
> >
> > Let's design for what we have, clear the table, and move on. There's
> > no reason to believe that will create an unending sequence of updates
> > to the NFS ULB. There will be no more ULB updates than there are
> > extensions, and quite likely fewer.
> >
> >
> > > > As I see it, "retrying short Reply chunks" is a new and untried
> > > > piece of protocol. It attempts to accommodate a future,
> > > > unimplemented transport or ULP, and therefore it lies outside the
> > > > mandated scope of rfc5667bis IMO.
> > >
> > > But that position would essentially tie us to Version One forever, which,
> > > aside from the particulars of the discussion in Dallas, is not where we want
> > > to be.
> >
> > It doesn't tie anyone to anything. The WG is allowed to open
> > updates of existing RFCs as many times as they like. I believe
> > there is adequate process here to deal properly with it.
> >
> >
> > > I think Version One can accommodate retry, even though it
> > > has some weaknesses with regard to non-idempotent operations.
> >
> > ERR_CHUNK can be returned for a lot of reasons. As we learned
> > when trying to add extensibility to V1, that could make it
> > inadequate for this purpose.
> >
> > If short Reply chunk retry is mentioned in rfc5667bis, I prefer
> > that it is discussed in the following terms:
> >
> > 1. It is an optional remedy, written as implementation advice.
> >
> > 2. The choice to retry is at the discretion of the ULP on the
> > requester (not the transport)
> >
> > 3. Retry is done via a distinct RPC (fresh XID)
> >
> > 4. Short Reply chunk retry is permitted only for NFSACL GETACL
> > and NFSv4.0 GETATTR
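
The four terms listed above can be sketched as a requester-side policy check. This is a hedged Python illustration only: plan_retry, RETRY_ELIGIBLE, and next_xid are hypothetical names, and doubling the Reply chunk size on retry is an assumption chosen for the example, not something proposed in this thread.

```python
import itertools
from typing import Optional

# Term 4: the only operations for which a retry would be permitted.
RETRY_ELIGIBLE = {("NFSACL", "GETACL"), ("NFSv4.0", "GETATTR")}

_xid_counter = itertools.count(1)


def next_xid() -> int:
    """Hand out a new 32-bit XID (hypothetical allocator)."""
    return next(_xid_counter) & 0xFFFFFFFF


def plan_retry(protocol: str, operation: str, failed_xid: int,
               failed_chunk_len: int, ulp_wants_retry: bool) -> Optional[dict]:
    """Return parameters for a retry, or None if the RPC should simply fail."""
    # Terms 1 and 2: retry is optional and decided by the ULP, not the transport.
    if not ulp_wants_retry:
        return None
    if (protocol, operation) not in RETRY_ELIGIBLE:
        return None
    # Term 3: the retry is a distinct RPC with a fresh XID.
    xid = next_xid()
    while xid == failed_xid:
        xid = next_xid()
    # Assumption for illustration: double the Reply chunk for the new attempt.
    return {"xid": xid, "reply_chunk_len": failed_chunk_len * 2}
```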
> >
> >
> > > It is true that version Two will probably have better, more complete
> > > support.
> >
> > Possibly all that is necessary is an unambiguous error code.
> >
> >
> > > It may be that there is something in rfc566bis as currently written, that
> > > makes this difficult to do at this late stage.  I could live with that,
> > > but given that rfc5667bis needs to be made transport-generic, I think
> > > it should definitely allow this in future transport Versions, if those
> > > versions support it, and not foreclose it, just as it should not foreclose
> > > message continuation.
> >
> > Another way to "allow retry in future versions" is not to
> > mention it at all, which is what I proposed in the previous
> > e-mail.
> >
> >
> > > > Second, it's unnecessary in the real world:
> > >
> > > I agree that it is very unlikely to be used and that many implementations
> > > may choose not to implement it.
> > >
> > > > I'm not aware of a single instance in the field where a server has
> > > > had to reject a client's request because of a short Reply chunk.
> > > > It's certainly never occurred during my testing, but lots of other
> > > > issues have.
> > >
> > > OK, but the problem is that, without it, when and if this happens,
> > > we have a situation in which it is clear there is a bug, but it cannot be
> > > determined whether the fault is due to the requester or the responder.
> >
> > The responder reports, via a local system log, that the requester
> > is in error. The Solaris NFS/RDMA server implementation reports
> > similar errors this way all the time, in addition to sending
> > GARBAGE_ARGS replies, so I know this works.
> >
> > It can be done with V1 implementations, and it can be done even
> > if requesters do not support retrying.
> >
> >
> > > In order to allow interoperable implementations, this kind of situation
> > > must be avoided.
> >
> > > If the spec allows retry but the requester implementation does not provide it,
> > > then it has a bug and it can decide it is willing to live with a small possibility of
> > > failure.
> > >
> > > > > Third, the places where retry might be useful are all legacy
> > > > protocols:
> > >
> > > If you use the term "legacy protocols" as it is used in rfc5667bis, this is not so.
> >
> > To be clear, I meant legacy in the sense that the book is closed
> > on NFSACL and NFSv4.0.
> >
> >
> > > Also, rfc5666bis requires reliable reply estimation of all ULPs, rather than
> > > only of non-legacy ones.
> >
> > The intent of that requirement is to ensure interoperation.
> >
> > You haven't explained why "failing a request that is not formed
> > according to the rules" is not sufficient to establish that
> > implementations interoperate successfully.
> >
> > If the rfc5666bis requirement is too stern or hazy or in some
> > other way not sensible, we should fix it.
> >
> > But I don't see anything that states a ULB is prevented from
> > identifying gray areas or possible cases that can't be handled.
> >
> >
> > > On Tue, Jan 31, 2017 at 12:04 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
> > >
> > > > On Jan 31, 2017, at 7:11 AM, David Noveck <davenoveck@gmail.com> wrote:
> > > >
> > > > > Without an RPC Reply message, however, the client matches the XID in the
> > > > > ERR_CHUNK message to a previous call and that will have the matching
> > > > > SEQUENCE operation.
> > > >
> > > > Right.
> > >
> > > As I clicked Send yesterday, I blinked and "retried" became "retired".
> > > Although I'm a fan of "retired", a different term might be less likely
> > > to confuse readers in this context. Suggestions welcome.
> > >
> > > So now I read this text
> > >
> > > >    In addition, within the error response, the requester does not have
> > > >    the result of the execution of the SEQUENCE operation, which
> > > >    identifies the session, slot, and sequence id for the request which
> > > >    has failed.  The xid associated with the request, obtained from the
> > > >    rdma_xid field of the RDMA_ERROR or RDMA_MSG message, must be used to
> > > >    determine the session and slot for the request which failed, and the
> > > >    slot must be properly retired.  If this is not done, the slot could
> > > >    be rendered permanently unavailable.
> > >
> > > to mean "No short Reply chunk retries are permitted when a session is
> > > in use." I'm much more comfortable with this requirement.
> > >
> > > And, now there is one less use case for retrying a short Reply chunk.
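
The bookkeeping that the quoted draft text requires can be sketched as follows. This is a minimal Python illustration, assuming the requester records the session and slot for each outstanding XID. SlotTracker and its method names are hypothetical, and "retire" here only means "mark the slot as no longer in flight"; what must happen to the slot's sequence id is not modeled.

```python
from typing import Dict, Optional, Set, Tuple

SlotKey = Tuple[str, int]  # (session id, slot id)


class SlotTracker:
    """Remember which session slot each outstanding XID occupies."""

    def __init__(self) -> None:
        self._by_xid: Dict[int, SlotKey] = {}
        self._busy: Set[SlotKey] = set()

    def send(self, xid: int, session_id: str, slot_id: int) -> None:
        """Record that `xid` carries a SEQUENCE using this session and slot."""
        key = (session_id, slot_id)
        if key in self._busy:
            raise RuntimeError("slot already has a request in flight")
        self._busy.add(key)
        self._by_xid[xid] = key

    def on_transport_error(self, rdma_xid: int) -> Optional[SlotKey]:
        """Handle an error reply that carries no SEQUENCE result.

        The rdma_xid is the only link back to the session and slot. Without
        this lookup the slot would stay marked busy indefinitely.
        """
        key = self._by_xid.pop(rdma_xid, None)
        if key is not None:
            self._busy.discard(key)
        return key

    def is_busy(self, session_id: str, slot_id: int) -> bool:
        return (session_id, slot_id) in self._busy
```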
> > >
> > >
> > > > > That makes this rather a layering violation,
> > > >
> > > > It looks to me like this layering violation is not serious and I think it is
> > > > unavoidable.
> > >
> > > Let's consider how the layers might work together in this case.
> > >
> > > Sizing a Reply chunk is done by the ULP. If the Reply chunk turns out
> > > to be too small, it is the ULP, not the transport, that will have to
> > > replace that chunk with a larger Reply chunk.
> > >
> > > The ULP therefore must be fully aware that the previous attempt
> > > failed, and why. It also has to be capable of deciding whether or not
> > > a retry is advisable, given other factors such as responder reply
> > > caching.
> > >
> > > Since the ULP is driving the retry, given the separation of duties
> > > between a ULP and the RPC layer, it is possible that a new XID will
> > > be used for the retried operation. The responder does not have a way
> > > of recognizing the link between the failed XID and the new one.
> > >
> > > To address this, then the ULP would have to indicate what XID should
> > > be used in the retry's RPC (and RPC-over-RDMA) header.
> > >
> > >
> > > > > and perhaps a reason why
> > > > > retransmitting with a larger Reply chunk might be a cure worse than the
> > > > > disease.
> > > >
> > > > Note that the layering violation does not arise when retrying the operation
> > > > with a larger reply chunk.
> > > >
> > > > The case when this layering violation/misdemeanor occurs is when
> > > > the operation is failed and the slot needs to be made available again.  Retrying
> > > > the operation with a larger reply chunk would make this situation
> > > > less likely to occur. However, since 4.1 has a reliable means of limiting reply
> > > > size (unlike v4.0), it appears that this is beside the point for session-based
> > > > minor versions of nfsv4.
> > > >
> > > > Note that the disease here is the fact that v4.0  (and some auxiliary
> > > > protocols) does not have a reliable means of determining reply size limits.
> > > > We can't cure that disease as these protocols are unchangeable.
> > > >
> > > > Allowing retry with a larger reply chunk is not a cure but it is a treatment which
> > > > ameliorates the problem.  As far as I can see, it is a safe and effective treatment.
> > >
> > > The safe and most conservative course is to terminate the RPC, in all
> > > cases. Some ULPs might be able to tolerate retry, others might not.
> > >
> > > Each ULP has to make that decision on a case-by-case basis. The
> > > transport by itself cannot arbitrarily perform a retry, it is now
> > > clear.
> > >
> > > Thus I'm considering removing the retry language from rfc5667bis.
> > >
> > > First, the mandate I recall receiving in Dallas for rfc5667bis was:
> > >
> > > 1. Document existing implementations
> > > 2. Fix mistakes and omissions
> > > 3. Extend the NFS ULB to properly cover NFSv4.1 and NFSv4.2
> > >
> > > Since then we have added:
> > >
> > > 4. Make it align with new language in rfc5666bis
> > >
> > > And snuck in:
> > >
> > > 5. Make it align with the language of rpcrdma-version-two and
> > > rpcrdma-rtrext (which I'm also struggling with, but that's a
> > > separate topic)
> > >
> > > As I see it, "retrying short Reply chunks" is a new and untried
> > > piece of protocol. It attempts to accommodate a future,
> > > unimplemented transport or ULP, and therefore it lies outside the
> > > mandated scope of rfc5667bis IMO.
> > >
> > > Second, it's unnecessary in the real world:
> > >
> > > I'm not aware of a single instance in the field where a server has
> > > had to reject a client's request because of a short Reply chunk.
> > > It's certainly never occurred during my testing, but lots of other
> > > issues have.
> > >
> > > Third, the places where retry might be useful are all legacy
> > > protocols:
> > >
> > > As you pointed out, NFSv4.0 and NFSACL appear to be the two areas
> > > where we have concerns. The problem is largely addressed by NFSv4.1
> > > and newer minor versions, and even there, retrying is not permitted.
> > > The future does not need this new behavior, apparently.
> > >
> > > What retrying amounts to, in this context, is adding a workaround
> > > in the transport protocol for bugs in the Upper Layer Protocols and
> > > their implementations.
> > >
> > >
> > > > > I'm beginning to believe that making this situation always a permanent
> > > > > error, as rfc5666bis does, is a better protocol choice.
> > > >
> > > > I don't see it that way.  It leaves us with an rfc5666bis requirement for ULBs that
> > > > we would be unable to satisfy for a number of ULPs dealt with in rfc5667bis.
> > >
> > > This is a hole that can be closed simply by prescribing how each
> > > implementation must behave when a Reply chunk is short. That has
> > > already been done in rfc5666bis: the RPC fails. This prevents a
> > > transport deadlock and connection loss, and indicates a ULP
> > > implementation bug.
> > >
> > > Why does rfc5667bis need to go further? The consequences of a
> > > shortage are clear, and no longer catastrophic to other RPCs.
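
The responder behavior described above (fail the RPC when the Reply chunk is short, and say so locally) can be sketched in a few lines of Python. The function respond and its return tuple are hypothetical, and representing the reply as message-type strings is an illustrative simplification, not a wire format. The sketch assumes the reply is being returned through a Reply chunk.

```python
import logging
from typing import Tuple

log = logging.getLogger("rpcrdma.responder")


def respond(xid: int, reply: bytes, reply_chunk_len: int) -> Tuple[str, int, bytes]:
    """Return (message type, xid, payload) for a reply sent via a Reply chunk."""
    if len(reply) > reply_chunk_len:
        # The requester under-provisioned the chunk: fail this RPC only, and
        # leave a local record that the requester is at fault.
        log.warning("xid %#x: reply needs %d bytes, Reply chunk holds %d",
                    xid, len(reply), reply_chunk_len)
        return ("RDMA_ERROR/ERR_CHUNK", xid, b"")
    return ("RDMA_NOMSG", xid, reply)
```

Nothing here depends on the requester supporting retry, so the same logic applies to existing V1 implementations.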
> > >
> > >
> > > --
> > > Chuck Lever
> > >
> > >
> > >
> > >
> >
> > --
> > Chuck Lever
> >
> >
> >
> >
> 
> --
> Chuck Lever
> 
> 
> 
> 

--
Chuck Lever