Re: [nfsv4] New Version Notification for draft-dnoveck-nfsv4-rpcrdma-rtissues-01.txt

Chuck Lever <chuck.lever@oracle.com> Wed, 21 September 2016 17:33 UTC

Return-Path: <chuck.lever@oracle.com>
X-Original-To: nfsv4@ietfa.amsl.com
Delivered-To: nfsv4@ietfa.amsl.com
Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id 01B6E12B071 for <nfsv4@ietfa.amsl.com>; Wed, 21 Sep 2016 10:33:39 -0700 (PDT)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -6.518
X-Spam-Level:
X-Spam-Status: No, score=-6.518 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, RCVD_IN_DNSWL_MED=-2.3, RCVD_IN_MSPIKE_H2=-0.001, RP_MATCHES_RCVD=-2.316, SPF_PASS=-0.001] autolearn=ham autolearn_force=no
Received: from mail.ietf.org ([4.31.198.44]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id 8EozlbCruBv9 for <nfsv4@ietfa.amsl.com>; Wed, 21 Sep 2016 10:33:36 -0700 (PDT)
Received: from userp1040.oracle.com (userp1040.oracle.com [156.151.31.81]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id E2DD512B05B for <nfsv4@ietf.org>; Wed, 21 Sep 2016 10:33:35 -0700 (PDT)
Received: from aserv0021.oracle.com (aserv0021.oracle.com [141.146.126.233]) by userp1040.oracle.com (Sentrion-MTA-4.3.2/Sentrion-MTA-4.3.2) with ESMTP id u8LHXXCn022622 (version=TLSv1 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK); Wed, 21 Sep 2016 17:33:34 GMT
Received: from userv0121.oracle.com (userv0121.oracle.com [156.151.31.72]) by aserv0021.oracle.com (8.13.8/8.13.8) with ESMTP id u8LHXXYv032158 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK); Wed, 21 Sep 2016 17:33:33 GMT
Received: from abhmp0008.oracle.com (abhmp0008.oracle.com [141.146.116.14]) by userv0121.oracle.com (8.13.8/8.13.8) with ESMTP id u8LHXR5o016719; Wed, 21 Sep 2016 17:33:32 GMT
Received: from [10.71.11.99] (/8.25.222.2) by default (Oracle Beehive Gateway v4.0) with ESMTP ; Wed, 21 Sep 2016 10:33:21 -0700
Content-Type: text/plain; charset="us-ascii"
Mime-Version: 1.0 (Mac OS X Mail 9.3 \(3124\))
From: Chuck Lever <chuck.lever@oracle.com>
In-Reply-To: <CADaq8jcojcqZOqB-rnM888YxfOJ1h=fW4KeRqFGjyVBbwVEQ0w@mail.gmail.com>
Date: Wed, 21 Sep 2016 10:33:21 -0700
Content-Transfer-Encoding: quoted-printable
Message-Id: <9620CF28-BD0A-42B4-9506-715C440BAC24@oracle.com>
References: <147292013637.2343.7092433187165824743.idtracker@ietfa.amsl.com> <CADaq8jeBaLLKkoSVy8kaBA9k4_6a7PLtEDMyx4zjhDX6U6q6Ow@mail.gmail.com> <234e3071-2b0e-e5a1-f5d5-91919e9388b1@oracle.com> <CADaq8jeP=FJKZAh4GEsogccuCKsoH5=-h7=ymKO1FkRqc=944Q@mail.gmail.com> <763c7255-9cf3-eece-f7e7-8454a23126a5@oracle.com> <CADaq8jf7DHRptJKMVGacH03-uwGBuyg5pxaGs5V6kHe7oZyYGA@mail.gmail.com> <d40729a0-55b4-13d9-7176-7e3ac85b36e8@oracle.com> <CADaq8jcojcqZOqB-rnM888YxfOJ1h=fW4KeRqFGjyVBbwVEQ0w@mail.gmail.com>
To: David Noveck <davenoveck@gmail.com>
X-Mailer: Apple Mail (2.3124)
X-Source-IP: aserv0021.oracle.com [141.146.126.233]
Archived-At: <https://mailarchive.ietf.org/arch/msg/nfsv4/BvT8LQMAUdh8o38EY1E8OyccJfw>
Cc: "nfsv4@ietf.org" <nfsv4@ietf.org>
Subject: Re: [nfsv4] New Version Notification for draft-dnoveck-nfsv4-rpcrdma-rtissues-01.txt
X-BeenThere: nfsv4@ietf.org
X-Mailman-Version: 2.1.17
Precedence: list
List-Id: NFSv4 Working Group <nfsv4.ietf.org>
List-Unsubscribe: <https://www.ietf.org/mailman/options/nfsv4>, <mailto:nfsv4-request@ietf.org?subject=unsubscribe>
List-Archive: <https://mailarchive.ietf.org/arch/browse/nfsv4/>
List-Post: <mailto:nfsv4@ietf.org>
List-Help: <mailto:nfsv4-request@ietf.org?subject=help>
List-Subscribe: <https://www.ietf.org/mailman/listinfo/nfsv4>, <mailto:nfsv4-request@ietf.org?subject=subscribe>
X-List-Received-Date: Wed, 21 Sep 2016 17:33:40 -0000

> On Sep 20, 2016, at 6:27 PM, David Noveck <davenoveck@gmail.com> wrote:
> 
> > I'm not seeing that this is how it works on Solaris unless I'm 
> > misunderstanding something.  
> 
> Let's just go forward based on your presentation of what Solaris does.
> 
> I think that if you are misunderstanding something, it is how RDMA WRITE works on the wire.  I think you are assuming it works the way that I thought it worked when I wrote rtissues-00.
> 
> Based on some comments I received, I switched to the different approach in rtissues-01, which I now believe to be correct.  I want to make sure my understanding is correct.  Of course rtissues-00 and rtissues-01 can both be wrong, but it is impossible for both to be correct.
> 
> >We can potentially issue multiple RDMA_WRITES for one NFS 
> > READ, but we do wait for the last RDMA_WRITE to complete, 
> 
> The question is, if you get the completion indication, when do you get it?  I (as of rtissues-01) believe that it means the send of the RDMA data has finished; it doesn't mean the data has actually been stored on the destination.  For that to happen there would have to be an additional internode round trip, which rtissues-01 says doesn't exist.  rtissues-00 supposed that it did exist, but my current understanding is that that is wrong.
> 
> > then go on to RDMA SEND the reply.  
> 
> OK.
> 
> > We need to wait for the write to confirm it completed successfully before sending the reply.  
> 
> You don't from a sequencing point of view.  Given ordered delivery, the reply cannot get there first.
> 
> > If the write fails for some reason we can't send success in the 
> > reply. 
> 
> My understanding is that if the RDMA_WRITE fails, the send will not succeed, and that you could queue the request to do the RDMA_WRITE and the send at the same time.  To me, that seems counterintuitive, but I think that is the way RC works.

This is how I believe this to work (I think it aligns with your
responses above). I know I may have said other things before, but
my understanding has improved over time.

RDMA ordering guarantees and the RPC-over-RDMA protocol design allow
an RPC-over-RDMA responder to build each RPC Reply as a single chain
of WRs:

  Write WR (Write chunk payload)
  Write WR (Write chunk payload)
  Write WR (Reply chunk payload)
   ...
  Send WR

This chain can be posted via one post_send call. Implementations I'm
aware of are not yet this efficient, but they do post Writes and Sends
in this order.
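As a toy illustration (plain Python, not the verbs API; all names here are made up for the sketch), the reply chain can be modeled as a linked list of WRs handed to the provider in one posting call, with only the final Send marked for completion:

```python
# Toy model of a responder's reply chain: several RDMA Write WRs
# followed by one Send WR, linked and posted together. This is a
# simulation sketch, not the libibverbs API.

class WR:
    def __init__(self, opcode, signaled=False):
        self.opcode = opcode        # "WRITE" or "SEND"
        self.signaled = signaled    # request a completion for this WR?
        self.next = None            # next WR in the chain

def build_reply_chain(num_write_chunks):
    """Chain one Write WR per chunk, then a signaled Send WR."""
    head = tail = None
    for _ in range(num_write_chunks):
        wr = WR("WRITE")            # unsignaled: no completion needed
        if tail:
            tail.next = wr
        else:
            head = wr
        tail = wr
    send = WR("SEND", signaled=True)  # completion frees the buffers
    if tail:
        tail.next = send
    else:
        head = send
    return head                     # hand the head to one post call

def post_send(chain):
    """The provider walks the chain strictly in order."""
    order = []
    wr = chain
    while wr:
        order.append(wr.opcode)
        wr = wr.next
    return order

print(post_send(build_reply_chain(3)))
# ['WRITE', 'WRITE', 'WRITE', 'SEND']
```

On an RC QP the provider transmits the chain in order, so the Send (and thus the Receive completion on the requester) cannot overtake the Writes.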

If the Write WRs are signaled, each completion fires when the
responder's HCA has sent the Write payload. A Write completion means
only that the responder's HCA is done with the payload buffer(s).
It says nothing about whether that payload was received by the
requester's HCA or whether the requester's HCA has placed that
payload in the requester's memory. In other words, a responder
cannot rely on Write completion to gauge the success of its Writes
on the requester.

The Send WR may also be signaled. Its completion means only that the
responder's HCA is done with the consumer's Send buffer. It also has
little relationship to requester activity (although the HCA may delay
Send completion if there are retransmits going on).

RDMA Read and Write are available only with a Reliable Connection. I
believe both HCAs rely on the RC message sequence numbers to verify
that all payloads and requests have been received in order.

The requester's HCA fires a completion when it is ready to hand the
incoming RPC Reply message to the RDMA consumer via a Receive. RDMA
ordering guarantees that all previous Write payloads are placed in
the requester's memory before that completion fires. Otherwise, the
requester's HCA flushes the Receive in order to report a problem.
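That placement-before-completion rule can be sketched with a toy in-order simulation (Python, purely illustrative; no real HCA looks like this code, it only mirrors the ordering guarantee):

```python
# Sketch of the requester-side ordering guarantee on an RC connection:
# the Receive completion for the reply is delivered only after all
# earlier Write payloads have been placed in requester memory.
# Simulation only; real placement is done silently by the HCA.

def requester_hca(messages):
    """Process in-order inbound messages; fire Receive completions."""
    memory = {}
    completions = []
    for msg in messages:
        if msg["op"] == "WRITE":
            memory[msg["addr"]] = msg["payload"]   # silent placement
        elif msg["op"] == "SEND":
            # By the ordering rule, every earlier WRITE in this
            # sequence has already been placed when this fires.
            completions.append(("RECV", msg["payload"]))
    return memory, completions

mem, cqes = requester_hca([
    {"op": "WRITE", "addr": 0x1000, "payload": b"chunk0"},
    {"op": "WRITE", "addr": 0x2000, "payload": b"chunk1"},
    {"op": "SEND",  "payload": b"rpc-reply"},
])
assert mem[0x1000] == b"chunk0" and mem[0x2000] == b"chunk1"
assert cqes == [("RECV", b"rpc-reply")]
```

Note the asymmetry with the responder side: the requester observes exactly one completion (the Receive), while all the Write placements happen without any consumer-visible event.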

A failure on the requester is communicated back to the responder via an
RDMA ACK with a Syndrome code and a disconnect request (IB). The QP is
taken out of Ready-To-Send, in-progress and pending WRs are flushed on
both ends, and queue processing stops.

The RPC Reply is lost in that case, and the content of the registered
Write chunk memory on the requester is indeterminate. The RPC client
is responsible for establishing a new connection and QP, re-registering
memory associated with the RPC request, and retransmitting that RPC
with the fresh STags.
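A rough sketch of those client recovery steps, with made-up stand-ins for the QP and the registration call (illustrative only, not any real RPC client's code):

```python
import itertools

# Toy sketch of the recovery sequence described above: establish a new
# connection/QP, re-register each pending RPC's memory (fresh STags),
# and rebuild the RPCs for retransmission. All names are illustrative.

_stag_counter = itertools.count(100)

def register(buf):
    """Stand-in for memory registration: hand out a fresh STag."""
    return next(_stag_counter)

def recover(pending_rpcs):
    """Reconnect, re-register, and prepare each RPC to be resent."""
    connection = {"state": "connected"}          # stand-in for a new QP
    retransmits = []
    for rpc in pending_rpcs:
        stags = [register(buf) for buf in rpc["buffers"]]
        retransmits.append({"xid": rpc["xid"], "stags": stags})
    return connection, retransmits

conn, resent = recover([{"xid": 1, "buffers": ["a", "b"]},
                        {"xid": 2, "buffers": ["c"]}])
assert conn["state"] == "connected"
assert resent[0]["stags"] == [100, 101]          # old STags never reused
```

The key point the sketch encodes is that the old STags are never reused: every buffer associated with an in-flight RPC gets a new registration before the retransmit.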


> On Tue, Sep 20, 2016 at 7:27 PM, karen deitke <karen.deitke@oracle.com> wrote:
> 
> 
> On 9/17/2016 4:15 AM, David Noveck wrote:
>> > where with the RDMA_WRITE we do not need to wait? (i.e. 
>> > sections 2.2 and 2.3)
>> 
>> Actually, that is why there is not an internode round trip which contributes to latency.
>> 
>> I do distinguish in this document between round-trips which do and do not contribute to latency.  For example, I mention certain acks which are part of round trips that nobody has to wait for.  
>> 
>> I believe that there is no ack for an RDMA write but I've never actually looked on the wire to see that.  I imagine that there might be certain rare cases in which some sort of separate ack were sent.  For example, if you did multiple RDMA Writes in sequence without a SEND, or there were a long delay without the response being sent, a separate ACK of the RDMA WRITE might be sent. 
>> 
>> However, in the common case, in which a SEND immediately follows the RDMA WRITE, there is no need for a separate ACK, and I believe the ACK of the SEND suffices to assure the responder RNIC that both previous operations have completed successfully.
> I'm not seeing that this is how it works on Solaris unless I'm misunderstanding something.  We can potentially issue multiple RDMA_WRITES for one NFS READ, but we do wait for the last RDMA_WRITE to complete, then go on to RDMA SEND the reply.  We need to wait for the write to confirm it completed successfully before sending the reply.  If the write fails for some reason we can't send success in the reply.  
> Karen
> 
>> 
>> On Thu, Sep 15, 2016 at 2:58 PM, karen deitke <karen.deitke@oracle.com> wrote:
>> Thanks,
>> 
>> That clears things up.  Next question, are you saying that an RDMA_WRITE is NOT an internode round trip and an RDMA_READ is an internode round trip, because we have to wait for the data from the RDMA_READ before proceeding, where with the RDMA_WRITE we do not need to wait? (i.e. sections 2.2 and 2.3)
>> 
>> Karen
>> 
>> 
>> On 9/12/2016 2:24 PM, David Noveck wrote:
>>> > I'm confused by this summarization.  
>>> 
>>> :-(.  Let's see what I can do to make this clearer.
>>> 
>>> > In the text above you indicate 3 different places where an "internode round trip" is involved, yet in the summary you only mention 2.  
>>> 
>>> The point I was trying to make was that, although there were three round-trips, only two contribute to the request latency.  In some 
>>> cases, there is a round trip because an ack is sent, but it does not add to latency, because neither the client nor the server is waiting for it.
>>> 
>>> > What is the definition of an "internode round trip?"  
>>> 
>>> Any situation in which a message is sent in one direction and, after that, another message is sent in the opposite direction.
>>> 
>>> > Also it's unclear to me what you mean by "in the context of a connected operation".
>>> 
>>> Maybe I should have said, "Because this is a reliable connected operation in which messages are acked."
>>> 
>>> > Also you mention that there are two-responder-side interrupt latencies, are you referring to the notification of the RDMA_READ 
>>> > and the send completion queue for sending the response?  
>>> 
>>> I'm referring to the notification that the request has been received and the notification that the RDMA_READ has completed.
>>> 
>>> > Does this interrupt latency come into play in the latency of the operation? 
>>> 
>>> I think the two I mentioned do.
>>> 
>>> > Once the client side gets the response it can continue, even if the server thread is still waiting for notification of a successful send correct?
>>> 
>>> Yes.
>>> 
>>> > Also are you missing the interrupt latency of the send on the client? In addition to the interrupt latency of receiving the reply?
>>> 
>>> I don't think that contributes to latency.  The request processing can continue once the request is received on the server, even if the client
>>> has not received notification of the completion of the send.
>>> 
>>> On Mon, Sep 12, 2016 at 3:52 PM, karen deitke <karen.deitke@oracle.com> wrote:
>>> Hi Dave,
>>> 
>>> I'm struggling following this below:
>>> 
>>>    o  First, the memory to be accessed remotely is registered.  This is
>>>       a local operation.
>>> 
>>>    o  Once the registration has been done, the initial send of the
>>>       request can proceed.  Since this is in the context of connected
>>>       operation, there is an internode round trip involved.  However,
>>>       the next step can proceed after the initial transmission is
>>>       received by the responder.  As a result, only the responder-bound
>>>       side of the transmission contributes to overall operation latency.
>>> 
>>>    o  The responder, after being notified of the receipt of the request,
>>>       uses RDMA READ to fetch the bulk data.  This involves an internode
>>>       round-trip latency.  After the fetch of the data, the responder
>>>       needs to be notified of the completion of the explicit RDMA
>>>       operation
>>> 
>>>    o  The responder (after performing the requested operation) sends the
>>>       response.  Again, as this is in the context of connected
>>>       operation, there is an internode round trip involved.  However,
>>>       the next step can proceed after the initial transmission is
>>>       received by the requester.
>>> 
>>>    o  The memory registered before the request was issued needs to be
>>>       deregistered, before the request is considered complete and the
>>>       sending process restarted.  When remote invalidation is not
>>>       available, the requester, after being notified of the receipt of
>>>       the response, performs a local operation to deregister the memory
>>>       in question.  Alternatively, the responder will use Send With
>>>       Invalidate and the responder's RNIC will effect the deregistration
>>>       before notifying the requester of the response which has been
>>>       received.
>>> 
>>>    To summarize, if we exclude the actual server execution of the
>>>    request, the latency consists of two internode round-trip latencies
>>>    plus two-responder-side interrupt latencies plus one requester-side
>>>    interrupt latency plus any necessary registration/de-registration
>>>    overhead.  This is in contrast to a request not using explicit RDMA
>>>    operations in which there is a single inter-node round-trip latency
>>>    and one interrupt latency on the requester and the responder.
>>> 
>>> I'm confused by this summarization.  In the text above you indicate 3 different places where an "internode round trip" is involved, yet in the summary you only mention 2.  What is the definition of an "internode round trip?"  Also it's unclear to me what you mean by "in the context of a connected operation".
>>> 
>>> Also you mention that there are two-responder-side interrupt latencies, are you referring to the notification of the RDMA_READ and the send completion queue for sending the response?  Does this interrupt latency come into play in the latency of the operation? Once the client side gets the response it can continue, even if the server thread is still waiting for notification of a successful send correct?
>>> 
>>> Also are you missing the interrupt latency of the send on the client? In addition to the interrupt latency of receiving the reply?
>>> 
>>> Karen
>>> 
>>> 
>>> _______________________________________________
>>> nfsv4 mailing list
>>> nfsv4@ietf.org
>>> https://www.ietf.org/mailman/listinfo/nfsv4
>>> 
>> 
>> 
> 
> 
> _______________________________________________
> nfsv4 mailing list
> nfsv4@ietf.org
> https://www.ietf.org/mailman/listinfo/nfsv4

--
Chuck Lever