Re: [nfsv4] New Version Notification for draft-dnoveck-nfsv4-rpcrdma-xcharext-02.txt

One additional thing that has come up is the lack of a keepalive-type
activity on ib.  We've had some discussions with the infiniband driver
folks as to where this could be done, hoping that it could be done at
the lower transport level, as it is with tcp, and not in the rpc/rdma
area.  The issue is that if the client sends a request to the server and
is waiting for a response and the server crashes, and there is no other
activity going on, we will wait forever, or until some other activity is
initiated on the connection, before we know that the connection is gone.

The suggestion was made that the client periodically issue a 0-length
write.  This would not take up a recv buffer if the other side is still
up, and it can detect that the connection is gone in this situation.
However, the ib layer folks do not want to implement this on their side,
as they don't have the qp and cqp info of our configuration.  Their
concern is, for example, that even if the 0-length write fails, it will
still consume an entry in the completion queue, which, if the queue is
full, could result in a valid, active connection being lost.  It seems
this wouldn't be the case for an inactive connection, but it's possible
that the rcq is also a system-wide shared rcq.  Because of issues like
this, the ib driver folks are recommending we do this in the rpc layer.

So, wondering, is/should this be something addressed in this version of 
rpcrdma?

Karen

On 8/23/2016 8:45 AM, David Noveck wrote:
> Thanks. I'll address these as part of 
> draft-dnoveck-nfsv4-rpcrdma-xcharext-03, which will be out in the 
> first half of September.
>
> Unless there are objections, that draft will adopt Karen's suggestion 
> to change "characteristics" to "properties".  The title of
> the document will change but the file name will stay the same.
>
> > There's no discussion of RPC-over-RDMA credit accounting in this
> > document. There needs to be some discussion of credit consumption.
>
> Right.
>
> > In particular:
>
> > Requesters will have to have posted extra receive buffers (over
> > and above credits for forward channel replies and backchannel
> > requests) to deal with XCHAR messages.
>
> I think you mean that clients will.
>
> > Likewise, responders will
> > need to post similar extra receives for this purpose.
>
> Similarly, I think servers are being referred to.
>
> > Perhaps both peers should reserve one credit, and the specification
> > should insist that these operations are always single-threaded on a
> > connection.
>
> I think single-threading might be an implementation choice but I'm
> reluctant to make it part of the protocol.
>
> > Alternately, when xcharext is merged into rpcrdma-version-two,
> > there might be a generic discussion of non-RPC-payload-bearing
> > messages that could cover this issue.
>
> I think there will need to be generic text regarding the issues, but I
> think it should be consistent with the following approach regarding
> messages related to transport.
>
>     In the case of messages that do not serve effectively as the
>     response to a previous message (i.e. ROPT_FIRSTPROP, ROPT_REQPROP,
>     ROPT_UPDPROP), it is the responsibility of the sender to ensure
>     that there is a credit available to enable sending the message,
>     just as would be the case if it were sending an RPC request.
>
>     In the case of ROPT_RESPPROP messages, it is the responsibility of
>     the sender of the original ROPT_REQPROP to post a receive to
>     receive the response, which is to be sent by the receiver of the
>     ROPT_REQPROP, without consuming a credit.  This is similar to the
>     case of sending an RPC response.
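
The credit rule proposed above can be sketched roughly as follows.  This
is only an illustration in Python: the PropSender class and its fields
are hypothetical, and the sketch deliberately says nothing about how
credits are replenished, since the proposed text does not cover that.

```python
# Message types that are not responses: the sender must hold a credit.
CREDIT_CONSUMING = frozenset({"ROPT_FIRSTPROP", "ROPT_REQPROP", "ROPT_UPDPROP"})
# Message types that act as responses: sent without consuming a credit.
CREDIT_FREE = frozenset({"ROPT_RESPPROP"})


class PropSender:
    """Hypothetical sender-side bookkeeping (illustration only)."""

    def __init__(self, credits: int) -> None:
        self.credits = credits
        self.receives_for_resp = 0   # receives posted for expected ROPT_RESPPROP

    def send(self, msg_type: str) -> bool:
        """Return True if the message may be sent now, False if it must wait."""
        if msg_type in CREDIT_FREE:
            return True              # like an RPC response: no credit needed
        if msg_type not in CREDIT_CONSUMING:
            raise ValueError(f"unknown message type: {msg_type}")
        if self.credits == 0:
            return False             # like an RPC request: wait for a credit
        self.credits -= 1
        if msg_type == "ROPT_REQPROP":
            # The requester is responsible for posting the receive
            # that will catch the eventual ROPT_RESPPROP.
            self.receives_for_resp += 1
        return True
```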
>
>
> > Overloading the term "initial"
>
> You're right.  It is overloaded.
>
> > Section 3 introduces an "initial set of transport characteristics"
> > and Section 4.1 defines "ROPT_INITXCH: Specify Initial
> > Characteristics". I think the use of "initial" means different
> > things in these two cases?
>
>
> It does.
>
> > Instead:
>
> > Section 3 could propose "initially-defined" characteristics.
>
>
> OK.
>
> > Section 4.1 could define ROPT_STARTXCH (like STARTTLS). I'm not
> > attached to that name, but it probably shouldn't be ROPT_INITXCH
> > if Section 3 defines "initial characteristics".
>
> > ROPT_CONNXCH
>
> > ROPT_FIRSTXCH
>
> > ROPT_EARLYXCH
>
> My current choice is ROPT_FIRSTPROP.
>
> > Section 4.4
>
> > The text in this section has been clarified to address previous
> > reviewer comments; thanks! There are a number of syntax and
> > grammatical errors that still need to be addressed (most often,
> > a few words are repeated in some sentences).
>
> I'll address those in -03.
>
> > The "argument structure" of ROPT_UPDXCH is:
>
> > struct optinfo_updxch {
> >     xcharval        optupdxch_now;
> >     bool            optupdxch_pendclr;
> > };
>
> > I prefer optupdxch_new instead of optupdxch_now; "_now" suggests this
> > field records a time and/or date stamp.
>
> OK.
>
> > It would help me to understand why a receiver needs to distinguish
> > between these types of notification.
>
> When I specified the possible situations in which these messages could 
> arise, I did not mean to imply that the receiver necessarily would 
> need to distinguish these notifications.  My focus was on the sender.
>
> > For instance, if pendclr is false, this could be either a rejection
> > of a pending change request,
>
> If it is a *rejection* of a pending change request, pendclr would be true.
>
> > or it could be an unsolicited change
> > notification.
>
> > How does the receiver make use of the difference?
>
> > Instead of a boolean, an enumeration of update event types would be
> > a little friendlier, and could be expressed in the same amount of
> > space (since an XDR boolean consumes 4 octets on the wire).
>
> I'm OK in principle, but it seems we are both uncertain if this is
> indeed "friendlier"
>
> > Based
> > on the discussion in Section 4.4, we have:
>
> > enum optupdxch_event_type {
> >    OPTUPD_UNSOL  = 1,
>
> pendclr = false, as there is nothing pending to clear.
>
> >    OPTUPD_MORE   = 2,
>
> pendclr = false, but there is still a pending request.
>
> >    OPTUPD_DONE   = 3,
>
> pendclr = true, and the request has completed successfully.
>
> >    OPTUPD_REJECT = 4,
>
> This sounds to me like pendclr = true, but the word "REJECT" suggests
> no change at all was made.  In this case you might also have a
> partial change.
>
> > };
>
> > But since the rdma_xid field is not used to tie change requests to
> > these change update notifications, I'm not sure why the receiver
> > needs to know that a pending request has been completed.
>
> If the receiver keeps track of pending requests, it needs to know when 
> one is no longer pending.
>
> The receiver is not required to do this, and some might choose not to,
> but the protocol should provide an implementation that does with the
> means to keep track.
>
> This does not require the xid; the peer that requested the change
> can keep track of the properties for which it has a pending requested
> change.
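
As a rough illustration of tracking pending changes per property rather
than per xid, a requester could keep something like the following.  The
PendingTracker class, its method names, and the integer property ids are
all hypothetical; this is a sketch in Python, not text from the draft.

```python
class PendingTracker:
    """Hypothetical requester-side record of pending property changes."""

    def __init__(self) -> None:
        self.pending = set()   # ids of properties with a change request outstanding
        self.values = {}       # last value reported by the peer, keyed by property id

    def request_change(self, prop_id: int) -> None:
        # Called when a change request for this property is sent.
        self.pending.add(prop_id)

    def on_update(self, prop_id: int, new_value: int, pendclr: bool) -> None:
        # Called for each update notification received from the peer.
        self.values[prop_id] = new_value
        if pendclr:
            # The peer says the pending request is finished, whatever the outcome.
            self.pending.discard(prop_id)

    def is_pending(self, prop_id: int) -> bool:
        return prop_id in self.pending
```

Keying on the property id is what makes the xid unnecessary here, at the
cost of allowing only one tracked request per property at a time.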
>
> > I think
> > REJECT might be more interesting than the difference between UNSOL,
> > DONE, and MORE.
>
> How about an enum that distinguished:
>
>   * Unsolicited
>   * Clear pending
>   * Still Pending
>
> The additional distinction among degrees of request satisfaction in
> the last two cases could be the responsibility of the receiver to
> determine.  Since he made the request and has access to the current
> value, he could determine this himself.  Alternatively, we could have
> an int with two distinct bit fields:
>
>   * One distinguishing unsolicited/clear-pending/still-pending
>   * Another regarding degrees of request satisfaction.
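
For what it's worth, the two-bit-field alternative could look something
like the sketch below (Python, illustrative only).  The field widths,
the numeric values, and the three "degrees of satisfaction" names are
assumptions made for the example; nothing here is taken from the draft.
The only firm fact used is that an XDR unsigned int is 4 octets,
big-endian, so this takes the same space on the wire as the boolean.

```python
import struct
from enum import IntEnum


class Pending(IntEnum):            # first bit field (assumed to be the low 2 bits)
    UNSOLICITED = 0
    CLEAR_PENDING = 1
    STILL_PENDING = 2


class Satisfaction(IntEnum):       # second bit field (hypothetical value names)
    NONE = 0
    PARTIAL = 1
    FULL = 2


PENDING_MASK = 0x3
SAT_SHIFT = 2
SAT_MASK = 0x3 << SAT_SHIFT


def pack_event(pending: Pending, sat: Satisfaction) -> int:
    """Combine both fields into one unsigned integer."""
    return int(pending) | (int(sat) << SAT_SHIFT)


def unpack_event(word: int):
    """Split the integer back into its two fields."""
    return Pending(word & PENDING_MASK), Satisfaction((word & SAT_MASK) >> SAT_SHIFT)


def encode_xdr(word: int) -> bytes:
    """Encode as an XDR unsigned int: 4 octets, big-endian."""
    return struct.pack(">I", word)
```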
>
>
>
> > There is a bit of a race here: a sender could send an unsolicited
> > update notification at the same time the receiver requests a change
> > of the same xchar. Could that result in a non-deterministic outcome?
>
> It shouldn't.  The point is that the property changes happen in a
> sequence which is the same for all observers (no weird relativistic
> effects!) and that the updates to the peer should happen in that same
> sequence.  The important point is that a RESPXCH indicating a
> successful change request be in the proper place in that sequence.
>
> > Would it ever be reasonable to send two or more updates
> > simultaneously for the same XCHAR?
>
> Since this is a single connection with sequenced delivery, there is
> no way to send updates simultaneously.  You can queue them for
> sending at the same time, but the delivery will reflect the order in
> which they were queued.
>
> > (Requiring single-threading here would prevent that from occurring).
>
> Given that there is no response to UPDXCH, not sure how you
> could specify single-threading.  There is no way to define when
> it would be OK to send the next.
>
> > What if the sender emits two optinfo_updxch messages: both with
> > pendclr set to false, but one with an intermediate value, and one
> > with the original value. The result on the receiver could depend on
> > the order in which these messages arrive.
>
> It would.
>
> > Possibly some text
> > regarding the ordering of these messages is needed.
>
> I can add something.
>
> > What happens if the receiver of ROPT_REQXCH drops the request?
>
> > Is there a timeout after which ROPT_REQXCH may be sent again?
>
> There is no timeout in the protocol.
>
> An implementation may choose to add one, but since this is sent
> on a reliable connection, it is hard to imagine it being worth doing.
>
> > What happens if an ROPT_RESPXCH is dropped? If ROPT_REQXCH is sent
> > again, the reply is:
>
> >  ROPT_RESPXCH with the requested value marked done ?
>
> This should result.
>
> >   ROPT_RESPXCH with a rejection (no change was done) ?
>
> It is true no change was done, but the requested value was achieved.
>
> >  ROPT_UPDXCH with the requested value and pendclr set to true ?
>
> ROPT_UPDXCH is not an alternative to ROPT_RESPXCH.  It is
> a possible additional message.
>
> > I don't see language that disallows any of these responses. Which
> > one means "I already set this value" ? Sorry if I missed that.
>
> I can add some clarification.
>
> > Assuming that both sides support ROPT_UPDXCH, can an implementation
> > use ROPT_UPDXCH exclusively instead of ROPT_INITXCH?
>
> Yes, but it is kind of bogus.  You would be relying on the default
> initial values and then changing them, which would be good in a test
> but, in real life, is asking for trouble.
>
> BTW, I've always wished there would be an RFC2119bis defining "BOGUS" and
> "BRAIN-DEAD" :-)
>
> > Assuming that both sides support ROPT_UPDXCH, may a peer change an
> > XCHAR and not send an unsolicited ROPT_UPDXCH?
>
> It may, but in most cases it would be BOGUS (or BRAIN-DEAD).
>
> Suppose you raise the receive buffer size and don't tell your peer
> that it is raised.  In that case, raising it is pretty pointless,
> since the peer can't take advantage of the bigger buffer.
>
> If you lower the receive buffer size and don't tell the peer it has
> been lowered, then he is going to continue to assume a larger size
> and break things.
>
> On Mon, Aug 22, 2016 at 2:10 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
>
>     Remarks on rpcrdma-xcharext-02.
>
>
>     - Credit accounting
>
>     There's no discussion of RPC-over-RDMA credit accounting in this
>     document. There needs to be some discussion of credit consumption.
>     In particular:
>
>     Requesters will have to have posted extra receive buffers (over
>     and above credits for forward channel replies and backchannel
>     requests) to deal with XCHAR messages. Likewise, responders will
>     need to post similar extra receives for this purpose.
>
>     Perhaps both peers should reserve one credit, and the specification
>     should insist that these operations are always single-threaded on a
>     connection.
>
>     Alternately, when xcharext is merged into rpcrdma-version-two,
>     there might be a generic discussion of non-RPC-payload-bearing
>     messages that could cover this issue.
>
>
>     - Overloading the term "initial"
>
>     Section 3 introduces an "initial set of transport characteristics"
>     and Section 4.1 defines "ROPT_INITXCH: Specify Initial
>     Characteristics". I think the use of "initial" means different
>     things in these two cases?
>
>     Instead:
>
>     Section 3 could propose "initially-defined" characteristics.
>
>     Section 4.1 could define ROPT_STARTXCH (like STARTTLS). I'm not
>     attached to that name, but it probably shouldn't be ROPT_INITXCH
>     if Section 3 defines "initial characteristics".
>
>     ROPT_CONNXCH
>
>     ROPT_FIRSTXCH
>
>     ROPT_EARLYXCH
>
>
>     - Section 4.4
>
>     The text in this section has been clarified to address previous
>     reviewer comments; thanks! There are a number of syntax and
>     grammatical errors that still need to be addressed (most often,
>     a few words are repeated in some sentences).
>
>     The "argument structure" of ROPT_UPDXCH is:
>
>     struct optinfo_updxch {
>         xcharval        optupdxch_now;
>         bool            optupdxch_pendclr;
>     };
>
>     I prefer optupdxch_new instead of optupdxch_now; "_now" suggests this
>     field records a time and/or date stamp.
>
>     It would help me to understand why a receiver needs to distinguish
>     between these types of notification.
>
>     For instance, if pendclr is false, this could be either a rejection
>     of a pending change request, or it could be an unsolicited change
>     notification. How does the receiver make use of the difference?
>
>     Instead of a boolean, an enumeration of update event types would be
>     a little friendlier, and could be expressed in the same amount of
>     space (since an XDR boolean consumes 4 octets on the wire). Based
>     on the discussion in Section 4.4, we have:
>
>     enum optupdxch_event_type {
>        OPTUPD_UNSOL  = 1,
>        OPTUPD_MORE   = 2,
>        OPTUPD_DONE   = 3,
>        OPTUPD_REJECT = 4,
>     };
>
>     But since the rdma_xid field is not used to tie change requests to
>     these change update notifications, I'm not sure why the receiver
>     needs to know that a pending request has been completed. I think
>     REJECT might be more interesting than the difference between UNSOL,
>     DONE, and MORE.
>
>     There is a bit of a race here: a sender could send an unsolicited
>     update notification at the same time the receiver requests a change
>     of the same xchar. Could that result in a non-deterministic outcome?
>
>     Would it ever be reasonable to send two or more updates
>     simultaneously for the same XCHAR? (Requiring single-threading here
>     would prevent that from occurring).
>
>     What if the sender emits two optinfo_updxch messages: both with
>     pendclr set to false, but one with an intermediate value, and one
>     with the original value. The result on the receiver could depend on
>     the order in which these messages arrive. Possibly some text
>     regarding the ordering of these messages is needed.
>
>     What happens if the receiver of ROPT_REQXCH drops the request? Is
>     there a timeout after which ROPT_REQXCH may be sent again?
>
>     What happens if an ROPT_RESPXCH is dropped? If ROPT_REQXCH is sent
>     again, the reply is :
>
>       ROPT_RESPXCH with the requested value marked done ?
>       ROPT_RESPXCH with a rejection (no change was done) ?
>       ROPT_UPDXCH with the requested value and pendclr set to true ?
>
>     I don't see language that disallows any of these responses. Which
>     one means "I already set this value" ? Sorry if I missed that.
>
>     Assuming that both sides support ROPT_UPDXCH, can an implementation
>     use ROPT_UPDXCH exclusively instead of ROPT_INITXCH?
>
>     Assuming that both sides support ROPT_UPDXCH, may a peer change an
>     XCHAR and not send an unsolicited ROPT_UPDXCH?
>
>
>     > On Aug 18, 2016, at 4:23 PM, David Noveck <davenoveck@gmail.com> wrote:
>     >
>     > This is updated and it adds some vowels (and consonants too) to
>     > the field and type names.  In particular "rq" --> "req".
>     >
>     > I'm aware that some people find "XCHAR" confusing. If someone
>     > has an idea for a replacement, please propose it on the list.  If
>     > the working group is OK with it, I'll produce a -03 incorporating it.
>     >
>     >
>     > ---------- Forwarded message ----------
>     > From: <internet-drafts@ietf.org>
>     > Date: Thu, Aug 18, 2016 at 4:16 PM
>     > Subject: New Version Notification for
>     > draft-dnoveck-nfsv4-rpcrdma-xcharext-02.txt
>     > To: David Noveck <davenoveck@gmail.com>
>     >
>     >
>     >
>     > A new version of I-D, draft-dnoveck-nfsv4-rpcrdma-xcharext-02.txt
>     > has been successfully submitted by David Noveck and posted to the
>     > IETF repository.
>     >
>     > Name:           draft-dnoveck-nfsv4-rpcrdma-xcharext
>     > Revision:       02
>     > Title:          RPC-over-RDMA Extension to Manage Transport
>     >                 Characteristics
>     > Document date:  2016-08-18
>     > Group:          Individual Submission
>     > Pages:          23
>     > URL:
>     > https://www.ietf.org/internet-drafts/draft-dnoveck-nfsv4-rpcrdma-xcharext-02.txt
>     > Status:
>     > https://datatracker.ietf.org/doc/draft-dnoveck-nfsv4-rpcrdma-xcharext/
>     > Htmlized:
>     > https://tools.ietf.org/html/draft-dnoveck-nfsv4-rpcrdma-xcharext-02
>     > Diff:
>     > https://www.ietf.org/rfcdiff?url2=draft-dnoveck-nfsv4-rpcrdma-xcharext-02
>     >
>     > Abstract:
>     >    This document specifies an extension to RPC-over-RDMA Version
>     >    Two.  The extension enables endpoints of an RPC-over-RDMA
>     >    connection to exchange information which can be used to
>     >    optimize message transfer.
>     >
>     >
>     >
>     >
>     > Please note that it may take a couple of minutes from the time
>     > of submission until the htmlized version and diff are available
>     > at tools.ietf.org.
>     >
>     > The IETF Secretariat
>     >
>     >
>     > _______________________________________________
>     > nfsv4 mailing list
>     > nfsv4@ietf.org
>     > https://www.ietf.org/mailman/listinfo/nfsv4
>
>     --
>     Chuck Lever