Re: [nfsv4] New Version Notification for draft-dnoveck-nfsv4-rpcrdma-xcharext-02.txt

Hi Karen-

> On Aug 23, 2016, at 11:56 AM, karen deitke <karen.deitke@oracle.com> wrote:
> 
> One additional thing that has come up is the lack of a keepalive-type activity on IB.  We've had some discussions with the InfiniBand driver folks as to where this could be done, hoping that it could be done at the lower transport level, as it is with TCP, and not in the RPC/RDMA area.  The issue is that if the client sends a request to the server and is waiting for a response, and the server crashes while there is no other activity going on, we will wait forever, or until some other activity is initiated on the connection, before we know that the connection is gone.
> 
> The suggestion was made that the client periodically issue a 0-length write.  This would not take up a recv buffer if the other side is still up, and can detect that the connection is gone in this situation.  However, the IB layer folks do not want to implement this on their side, as they don't have the qp and cqp info of our configuration.  Their concern is, for example, that if the 0-length write fails, it will still consume an entry in the completion queue, which, if the queue is full, could result in a valid, active connection being lost.  It seems this wouldn't be the case for an inactive connection, but it's possible that the rcq is also a system-wide shared rcq.  Because of issues like this, the IB driver folks are recommending we do this in the RPC layer.
> 
> So, wondering, is/should this be something addressed in this version of rpcrdma?

Have a look in the nfsv4@ietf.org mail archive for a thread entitled

  "Detecting loss of connection with RPC-over-RDMA"

that starts on January 26, 2016.

The zero-length Write approach was largely rejected by the WG in favor
of retaining a single spare RPC-over-RDMA credit for high-priority
requests.

However, I think the real problem is that the hardware retry count is
set to its maximum, which for some HCAs means the HCA keeps retrying
message transmission for a very long, effectively unbounded, time.
Lowering that retry maximum might achieve what you need; that approach
was quite successful on the Linux client.
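
To give a feel for why the retry count matters, here is a rough
back-of-the-envelope sketch in Python. It rests on assumptions that are
not from this thread: that the Local ACK Timeout is 4.096 us * 2**timeout,
that the retry count is a 3-bit value (0 to 7), and that the HCA makes one
initial attempt plus retry_cnt retries. Real HCAs may round or scale
these values, so treat the numbers as an order-of-magnitude estimate only.
The function names are made up for this illustration.

```python
# Rough estimate of how long a reliable-connected QP keeps retrying
# before it reports a transport error. All constants are assumptions
# stated in the lead-in, not values taken from any particular HCA.

def ack_timeout_seconds(timeout_exp: int) -> float:
    """Assumed Local ACK Timeout: 4.096 us * 2**timeout_exp."""
    if not 1 <= timeout_exp <= 31:
        raise ValueError("timeout exponent must be in 1..31")
    return 4.096e-6 * (2 ** timeout_exp)


def worst_case_detection_seconds(timeout_exp: int, retry_cnt: int) -> float:
    """One initial attempt plus retry_cnt retries, each waiting one timeout."""
    if not 0 <= retry_cnt <= 7:
        raise ValueError("retry count must be in 0..7")
    return ack_timeout_seconds(timeout_exp) * (retry_cnt + 1)


if __name__ == "__main__":
    for retries in (7, 1):
        secs = worst_case_detection_seconds(14, retries)
        print(f"retry_cnt={retries}: about {secs:.3f} s before an error")
```

Under these assumptions the detection time scales linearly with the
retry count, which is why lowering it shortens the wait.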

> Karen
> 
> On 8/23/2016 8:45 AM, David Noveck wrote:
>> Thanks. I'll address these as part of draft-dnoveck-nfsv4-rpcrdma-xcharext-03, which will be out in the first half of September.
>> 
>> Unless there are objections, that draft will adopt Karen's suggestion to change "characteristics" to "properties".  The title of
>> the document will change but the file name will stay the same.
>> 
>> > There's no discussion of RPC-over-RDMA credit accounting in this
>> > document. There needs to be some discussion of credit consumption.
>> 
>> Right.
>> 
>> > In particular:
>> 
>> > Requesters will have to have posted extra receive buffers (over
>> > and above credits for forward channel replies and backchannel
>> > requests) to deal with XCHAR messages. 
>> 
>> I think you mean that clients will.
>> 
>> > Likewise, responders will
>> > need to post similar extra receives for this purpose.
>> 
>> Similarly, I think servers are being referred to.
>> 
>> > Perhaps both peers should reserve one credit, and the specification
>> > should insist that these operations are always single-threaded on a
>> > connection.
>> 
>> I think single-threading might be an implementation choice but I'm
>> reluctant to make it part of the protocol.
>> 
>> > Alternately, when xcharext is merged into rpcrdma-version-two,
>> > there might be a generic discussion of non-RPC-payload-bearing
>> > messages that could cover this issue.
>> 
>> I think there will need to be generic text regarding the issues, but I think it should
>> be consistent with the following approach regarding messages related to 
>> transport.
>> 
>> In the case of messages that do not serve effectively as the response
>> to a previous message (i.e. ROPT_FIRSTPROP, ROPT_REQPROP,
>> ROPT_UPDPROP), it is the responsibility of the sender to ensure that 
>> there is a credit available to enable sending the message, just as would
>> be the case if it were sending an RPC request.
>> 
>> In the case of ROPT_RESPPROP messages, it is the responsibility of
>> the sender of the original ROPT_REQPROP to post a receive to 
>> receive the response, which is to be sent by the receiver of the ROPT_REQPROP,
>> without consuming a credit.   This is similar to the case of sending an RPC
>> response.
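
The accounting rule described above can be sketched as follows. This is
a minimal illustration in Python, not protocol text: the message names
come from the quoted paragraphs, while the Sender class, its fields, and
the choice not to model credit replenishment are assumptions made only
for this example.

```python
from dataclasses import dataclass

# Messages that are not responses: the sender must hold a credit,
# just as it would for an RPC request.
CREDIT_CONSUMING = {"ROPT_FIRSTPROP", "ROPT_REQPROP", "ROPT_UPDPROP"}


@dataclass
class Sender:
    credits: int               # credits currently granted by the peer
    posted_receives: int = 0   # receives posted for expected responses

    def send(self, msg: str) -> bool:
        """Return True if the message may be sent now."""
        if msg == "ROPT_RESPPROP":
            # A response: the requester already posted a receive for it,
            # so no credit is consumed.
            return True
        if msg not in CREDIT_CONSUMING:
            raise ValueError(f"unknown message type: {msg}")
        if self.credits == 0:
            return False       # must wait until a credit is available
        self.credits -= 1
        if msg == "ROPT_REQPROP":
            # The requester posts a receive for the eventual ROPT_RESPPROP.
            self.posted_receives += 1
        return True
```

For example, a sender holding one credit can send a single ROPT_REQPROP,
must then hold back an ROPT_UPDPROP, but can still answer its peer with
an ROPT_RESPPROP.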
>> 
>> 
>> > Overloading the term "initial"
>> 
>> You're right.  It is overloaded.
>> 
>> > Section 3 introduces an "initial set of transport characteristics"
>> > and Section 4.1 defines "ROPT_INITXCH: Specify Initial
>> > Characteristics". I think the use of "initial" means different
>> > things in these two cases?
>> 
>> 
>> It does.
>> 
>> > Instead:
>> 
>> > Section 3 could propose "initially-defined" characteristics.
>> 
>> 
>> OK.
>> 
>> > Section 4.1 could define ROPT_STARTXCH (like STARTTLS). I'm not
>> > attached to that name, but it probably shouldn't be ROPT_INITXCH
>> > if Section 3 defines "initial characteristics".
>> 
>> > ROPT_CONNXCH
>> 
>> > ROPT_FIRSTXCH
>> 
>> > ROPT_EARLYXCH
>> 
>> My current choice is ROPT_FIRSTPROP.
>> 
>> > Section 4.4
>> 
>> > The text in this section has been clarified to address previous
>> > reviewer comments; thanks! There are a number of syntax and
>> > grammatical errors that still need to be addressed (most often,
>> > a few words are repeated in some sentences).
>> 
>> I'll address those in -03.
>> 
>> > The "argument structure" of ROPT_UPDXCH is:
>> 
>> > struct optinfo_updxch {
>> >    xcharval        optupdxch_now;
>> >    bool            optupdxch_pendclr;
>> > };
>> 
>> > I prefer optupdxch_new instead of optupdxch_now; "_now" suggests this
>> > field records a time and/or date stamp.
>> 
>> OK.
>> 
>> > It would help me to understand why a receiver needs to distinguish
>> > between these types of notification.
>> 
>> When I specified the possible situations in which these messages could arise, I did not mean to imply that the receiver necessarily would need to distinguish these notifications.  My focus was on the sender.
>> 
>> > For instance, if pendclr is false, this could be either a rejection
>> > of a pending change request, 
>> 
>> If it is a *rejection* of a pending change request, pendclr would be true.
>> 
>> > or it could be an unsolicited change
>> > notification. 
>> 
>> > How does the receiver make use of the difference?
>> 
>> > Instead of a boolean, an enumeration of update event types would be
>> > a little friendlier, and could be expressed in the same amount of
>> > space (since an XDR boolean consumes 4 octets on the wire). 
>> 
>> I'm OK in principle, but it seems we are both uncertain if this is
>> indeed "friendlier"
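
On the size point: XDR encodes both a boolean and an enumeration as a
4-octet big-endian signed integer, so the switch costs nothing on the
wire. A small Python sketch that shows this follows; the helper names
are hypothetical.

```python
import struct


def xdr_bool(value: bool) -> bytes:
    """XDR boolean: FALSE = 0, TRUE = 1, encoded as a 4-octet integer."""
    return struct.pack(">i", 1 if value else 0)


def xdr_enum(value: int) -> bytes:
    """XDR enumeration: the numeric value, encoded as a 4-octet integer."""
    return struct.pack(">i", value)


if __name__ == "__main__":
    print(len(xdr_bool(True)), xdr_bool(True).hex())
    print(len(xdr_enum(4)), xdr_enum(4).hex())
```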
>> 
>> > Based
>> > on the discussion in Section 4.4, we have:
>> 
>> > enum optupdxch_event_type {
>> >    OPTUPD_UNSOL  = 1,
>> 
>> pendclr = false as there is nothing pending to clear.
>> 
>> >   OPTUPD_MORE   = 2,
>> 
>> pendclr = false but there is still a pending request.
>> 
>> >   OPTUPD_DONE   = 3,
>> 
>> pendclr = true and the request has completed successfully.
>> 
>> >   OPTUPD_REJECT = 4,
>> 
>> This sounds to me like pendclr = true, but the word "REJECT" suggests no change at all was made.  In this case you might also have a partial change.
>> 
>> > };
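
Putting the inline answers above together, the proposed event types map
onto the existing boolean roughly as in this Python sketch. The enum
values are the ones from the proposal; the class and function names are
hypothetical, and the REJECT row follows the reading given above, which
was itself stated tentatively.

```python
from enum import IntEnum


class OptUpdEvent(IntEnum):
    UNSOL = 1    # unsolicited change, nothing pending to clear
    MORE = 2     # change made, but the request is still pending
    DONE = 3     # request completed successfully
    REJECT = 4   # request no longer pending, possibly only partly honored


def pendclr_for(event: OptUpdEvent) -> bool:
    """Value of optupdxch_pendclr that corresponds to each event type."""
    return event in (OptUpdEvent.DONE, OptUpdEvent.REJECT)
```

Note that the mapping is many-to-one in both directions, which is the
information the boolean alone cannot carry.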
>> 
>> > But since the rdma_xid field is not used to tie change requests to
>> > these change update notifications, I'm not sure why the receiver
>> > needs to know that a pending request has been completed. 
>> 
>> If the receiver keeps track of pending requests, it needs to know when one is no longer pending.
>> 
>> The receiver is not required to do this and some might not choose to do so, but the protocol should provide
>> the means to keep track to an implementation that chooses to do so.
>> 
>> This does not require the xid, but the peer that requested the change can keep track of the properties for
>> which it has a pending requested change.
>> 
>> > I think
>> > REJECT might be more interesting than the difference between UNSOL,
>> > DONE, and MORE.
>> 
>> How about an enum that distinguished:
>> 	• Unsolicited
>> 	• Clear pending
>> 	• Still Pending
>> The additional distinction among degrees of request satisfaction in the last two
>> cases could be the responsibility of the receiver to determine. Since he made the request
>> and has access to the current value, he could determine this himself.  Alternatively,
>> we could have an int with two distinct bit fields:
>> 	• One distinguishing unsolicited/clear-pending/still-pending
>> 	• Another regarding degrees of request satisfaction.
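
The second alternative above, an int with two distinct bit fields, could
look something like this Python sketch. The bit layout, the field
widths, and the names of the satisfaction degrees are all assumptions
made for illustration; nothing in the draft fixes them.

```python
from enum import IntEnum


class Pending(IntEnum):
    UNSOLICITED = 0
    CLEAR_PENDING = 1
    STILL_PENDING = 2


class Satisfaction(IntEnum):   # hypothetical degrees of request satisfaction
    NONE = 0
    PARTIAL = 1
    FULL = 2


PENDING_MASK = 0x3   # low two bits: pending state
SAT_SHIFT = 2        # next two bits: degree of satisfaction
SAT_MASK = 0x3


def pack_event(pending: Pending, sat: Satisfaction) -> int:
    """Combine both fields into a single integer."""
    return (int(sat) << SAT_SHIFT) | int(pending)


def unpack_event(word: int) -> tuple[Pending, Satisfaction]:
    """Split the integer back into its two fields."""
    return (Pending(word & PENDING_MASK),
            Satisfaction((word >> SAT_SHIFT) & SAT_MASK))
```

A receiver that does not care about degrees of satisfaction can mask
off the upper field and look only at the pending state.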
>> 
>> 
>> > There is a bit of a race here: a sender could send an unsolicited
>> > update notification at the same time the receiver requests a change
>> > of the same xchar. Could that result in a non-deterministic outcome?
>> 
>> It shouldn't.  The point is that the property changes happen in a sequence which
>> is the same for all observers (no weird relativistic effects!) and that the updates
>> to the peer should happen in that same sequence.  The important point is that a
>> RESPXCH indicating a successful change request be in the proper place in that 
>> sequence.
>> 
>> > Would it ever be reasonable to send two or more updates
>> > simultaneously for the same XCHAR? 
>> 
>> Since this is a single connection with sequenced delivery, there is
>> no way to send updates simultaneously.  You can queue them for 
>> sending at the same time, but the delivery will reflect the order in
>> which they were queued.
>> 
>> > (Requiring single-threading here would prevent that from occurring).
>> 
>> Given that there is no response to UPDXCH, I'm not sure how you
>> could specify single-threading.  There is no way to define when 
>> it would be OK to send the next.
>> 
>> > What if the sender emits two optinfo_updxch messages: both with
>> > pendclr set to false, but one with an intermediate value, and one
>> > with the original value. The result on the receiver could depend on
>> > the order in which these messages arrive. 
>> 
>> It would.
>> 
>> > Possibly some text
>> > regarding the ordering of these messages is needed.
>> 
>> I can add something.
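
The ordering dependence is easy to see in a tiny sketch. This is a
minimal Python illustration, assuming the receiver simply adopts each
update in the order it is delivered; the function name and the sample
values are made up.

```python
def apply_updates(initial: int, updates: list[int]) -> int:
    """Apply property updates in delivery order; the last one wins."""
    value = initial
    for new_value in updates:
        value = new_value
    return value


if __name__ == "__main__":
    # Intermediate value first, then back to the original value.
    print(apply_updates(4096, [8192, 4096]))
    # Same two messages in the opposite order.
    print(apply_updates(4096, [4096, 8192]))
```

Because the connection delivers messages in the order they were queued,
the first ordering is the only one the receiver can observe if that is
how the sender queued them.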
>> 
>> > What happens if the receiver of ROPT_REQXCH drops the request? 
>> 
>> > Is there a timeout after which ROPT_REQXCH may be sent again?
>> 
>> There is no timeout in the protocol.
>> 
>> An implementation may choose to do so, but since this is sent
>> on a reliable connection, it is hard to imagine it being worth doing.
>> 
>> > What happens if an ROPT_RESPXCH is dropped? If ROPT_REQXCH is sent
>> > again, the reply is :
>> 
>> >  ROPT_RESPXCH with the requested value marked done ?
>> 
>> This should result.
>> 
>> >   ROPT_RESPXCH with a rejection (no change was done) ?
>> 
>> It is true that no change was done, but the requested value was achieved.
>> 
>> >  ROPT_UPDXCH with the requested value and pendclr set to true ?
>> 
>> ROPT_UPDXCH is not an alternative to ROPT_RESPXCH.  It is
>> a possible additional message.
>> 
>> > I don't see language that disallows any of these responses. Which
>> > one means "I already set this value" ? Sorry if I missed that.
>> 
>> I can add some clarification.
>> 
>> > Assuming that both sides support ROPT_UPDXCH, can an implementation
>> > use ROPT_UPDXCH exclusively instead of ROPT_INITXCH?
>> 
>> Yes, but it is kind of bogus.  You would be relying on the default initial values
>> and then changing them, which would be good in a test but, in real life, it is
>> asking for trouble.
>> 
>> BTW, I've always wished there would be an RFC2119bis defining "BOGUS" and
>> "BRAIN-DEAD" :-)
>> 
>> > Assuming that both sides support ROPT_UPDXCH, may a peer change an
>> > XCHAR and not send an unsolicited ROPT_UPDXCH?
>> 
>> It may but in most cases it would be BOGUS (or BRAIN-DEAD).
>> 
>> Suppose you raise the receive buffer size and don't tell your peer that it is raised.  In
>> that case, raising it is pretty pointless since the peer can't take advantage of the bigger 
>> buffer.
>> 
>> If you lower the receive buffer size and don't tell the peer it has been lowered, then he
>> is going to continue to assume a larger size and break things.
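
Both cases can be illustrated with a short Python sketch. It assumes a
simplified sender that sizes each message from the receive buffer size
it believes the peer has; the function name and the sample sizes are
hypothetical.

```python
def send_outcome(payload_len: int, believed_size: int,
                 actual_size: int) -> tuple[int, bool]:
    """Return (bytes the sender puts in one message, whether it fits).

    The sender limits itself to the size it believes the peer has posted;
    the message fits only if it does not exceed the peer's actual size.
    """
    sent = min(payload_len, believed_size)
    return sent, sent <= actual_size


if __name__ == "__main__":
    # Raised from 4096 to 8192 without notice: still limited to 4096.
    print(send_outcome(6000, believed_size=4096, actual_size=8192))
    # Lowered from 8192 to 4096 without notice: the message overruns.
    print(send_outcome(6000, believed_size=8192, actual_size=4096))
```

In the first case the larger buffer goes unused; in the second the
sender overruns the receive buffer, which is the breakage described
above.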
>> 
>> On Mon, Aug 22, 2016 at 2:10 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
>> Remarks on rpcrdma-xcharext-02.
>> 
>> 
>> - Credit accounting
>> 
>> There's no discussion of RPC-over-RDMA credit accounting in this
>> document. There needs to be some discussion of credit consumption.
>> In particular:
>> 
>> Requesters will have to have posted extra receive buffers (over
>> and above credits for forward channel replies and backchannel
>> requests) to deal with XCHAR messages. Likewise, responders will
>> need to post similar extra receives for this purpose.
>> 
>> Perhaps both peers should reserve one credit, and the specification
>> should insist that these operations are always single-threaded on a
>> connection.
>> 
>> Alternately, when xcharext is merged into rpcrdma-version-two,
>> there might be a generic discussion of non-RPC-payload-bearing
>> messages that could cover this issue.
>> 
>> 
>> - Overloading the term "initial"
>> 
>> Section 3 introduces an "initial set of transport characteristics"
>> and Section 4.1 defines "ROPT_INITXCH: Specify Initial
>> Characteristics". I think the use of "initial" means different
>> things in these two cases?
>> 
>> Instead:
>> 
>> Section 3 could propose "initially-defined" characteristics.
>> 
>> Section 4.1 could define ROPT_STARTXCH (like STARTTLS). I'm not
>> attached to that name, but it probably shouldn't be ROPT_INITXCH
>> if Section 3 defines "initial characteristics".
>> 
>> ROPT_CONNXCH
>> 
>> ROPT_FIRSTXCH
>> 
>> ROPT_EARLYXCH
>> 
>> 
>> - Section 4.4
>> 
>> The text in this section has been clarified to address previous
>> reviewer comments; thanks! There are a number of syntax and
>> grammatical errors that still need to be addressed (most often,
>> a few words are repeated in some sentences).
>> 
>> The "argument structure" of ROPT_UPDXCH is:
>> 
>> struct optinfo_updxch {
>>     xcharval        optupdxch_now;
>>     bool            optupdxch_pendclr;
>> };
>> 
>> I prefer optupdxch_new instead of optupdxch_now; "_now" suggests this
>> field records a time and/or date stamp.
>> 
>> It would help me to understand why a receiver needs to distinguish
>> between these types of notification.
>> 
>> For instance, if pendclr is false, this could be either a rejection
>> of a pending change request, or it could be an unsolicited change
>> notification. How does the receiver make use of the difference?
>> 
>> Instead of a boolean, an enumeration of update event types would be
>> a little friendlier, and could be expressed in the same amount of
>> space (since an XDR boolean consumes 4 octets on the wire). Based
>> on the discussion in Section 4.4, we have:
>> 
>> enum optupdxch_event_type {
>>    OPTUPD_UNSOL  = 1,
>>    OPTUPD_MORE   = 2,
>>    OPTUPD_DONE   = 3,
>>    OPTUPD_REJECT = 4,
>> };
>> 
>> But since the rdma_xid field is not used to tie change requests to
>> these change update notifications, I'm not sure why the receiver
>> needs to know that a pending request has been completed. I think
>> REJECT might be more interesting than the difference between UNSOL,
>> DONE, and MORE.
>> 
>> There is a bit of a race here: a sender could send an unsolicited
>> update notification at the same time the receiver requests a change
>> of the same xchar. Could that result in a non-deterministic outcome?
>> 
>> Would it ever be reasonable to send two or more updates
>> simultaneously for the same XCHAR? (Requiring single-threading here
>> would prevent that from occurring).
>> 
>> What if the sender emits two optinfo_updxch messages: both with
>> pendclr set to false, but one with an intermediate value, and one
>> with the original value. The result on the receiver could depend on
>> the order in which these messages arrive. Possibly some text
>> regarding the ordering of these messages is needed.
>> 
>> What happens if the receiver of ROPT_REQXCH drops the request? Is
>> there a timeout after which ROPT_REQXCH may be sent again?
>> 
>> What happens if an ROPT_RESPXCH is dropped? If ROPT_REQXCH is sent
>> again, the reply is :
>> 
>>   ROPT_RESPXCH with the requested value marked done ?
>>   ROPT_RESPXCH with a rejection (no change was done) ?
>>   ROPT_UPDXCH with the requested value and pendclr set to true ?
>> 
>> I don't see language that disallows any of these responses. Which
>> one means "I already set this value" ? Sorry if I missed that.
>> 
>> Assuming that both sides support ROPT_UPDXCH, can an implementation
>> use ROPT_UPDXCH exclusively instead of ROPT_INITXCH?
>> 
>> Assuming that both sides support ROPT_UPDXCH, may a peer change an
>> XCHAR and not send an unsolicited ROPT_UPDXCH?
>> 
>> 
>> > On Aug 18, 2016, at 4:23 PM, David Noveck <davenoveck@gmail.com> wrote:
>> >
>> > This is updated and it adds some vowels (and consonants too) to the field and type names.  In particular "rq" --> "req".
>> >
>> > I'm aware that some people find "XCHAR" confusing.  If someone has an idea for a replacement, please propose it on the list.  If the working group is OK with it, I'll produce a -03 incorporating it.
>> >
>> >
>> > ---------- Forwarded message ----------
>> > From: <internet-drafts@ietf.org>
>> > Date: Thu, Aug 18, 2016 at 4:16 PM
>> > Subject: New Version Notification for draft-dnoveck-nfsv4-rpcrdma-xcharext-02.txt
>> > To: David Noveck <davenoveck@gmail.com>
>> >
>> >
>> >
>> > A new version of I-D, draft-dnoveck-nfsv4-rpcrdma-xcharext-02.txt
>> > has been successfully submitted by David Noveck and posted to the
>> > IETF repository.
>> >
>> > Name:           draft-dnoveck-nfsv4-rpcrdma-xcharext
>> > Revision:       02
>> > Title:          RPC-over-RDMA Extension to Manage Transport Characteristics
>> > Document date:  2016-08-18
>> > Group:          Individual Submission
>> > Pages:          23
>> > URL:            https://www.ietf.org/internet-drafts/draft-dnoveck-nfsv4-rpcrdma-xcharext-02.txt
>> > Status:         https://datatracker.ietf.org/doc/draft-dnoveck-nfsv4-rpcrdma-xcharext/
>> > Htmlized:       https://tools.ietf.org/html/draft-dnoveck-nfsv4-rpcrdma-xcharext-02
>> > Diff:           https://www.ietf.org/rfcdiff?url2=draft-dnoveck-nfsv4-rpcrdma-xcharext-02
>> >
>> > Abstract:
>> >    This document specifies an extension to RPC-over-RDMA Version Two.
>> >    The extension enables endpoints of an RPC-over-RDMA connection to
>> >    exchange information which can be used to optimize message transfer.
>> >
>> >
>> >
>> >
>> > Please note that it may take a couple of minutes from the time of submission
>> > until the htmlized version and diff are available at tools.ietf.org.
>> >
>> > The IETF Secretariat
>> >
>> >
>> > _______________________________________________
>> > nfsv4 mailing list
>> > nfsv4@ietf.org
>> > https://www.ietf.org/mailman/listinfo/nfsv4
>> 
>> --
>> Chuck Lever

--
Chuck Lever