Re: [nfsv4] Clean up for rpcrdma-version-two Read chunks

David Noveck <davenoveck@gmail.com> Tue, 05 May 2020 19:04 UTC

References: <A999AEE0-9201-4A73-AC9D-005500A32BCA@oracle.com> <CADaq8jfXo65s-nPP0eh_zwJUtZ194XQrth8f5RpmMvy_54urVA@mail.gmail.com> <97344C4C-E9A5-4230-B477-F5E2775BED85@oracle.com>
In-Reply-To: <97344C4C-E9A5-4230-B477-F5E2775BED85@oracle.com>
From: David Noveck <davenoveck@gmail.com>
Date: Tue, 05 May 2020 15:04:05 -0400
Message-ID: <CADaq8jfEKBOnQKDFvvLfd4jaKjtCZWOYZBzQb2V=REJKu_+=bA@mail.gmail.com>
To: Chuck Lever <chuck.lever@oracle.com>
Cc: NFSv4 <nfsv4@ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/nfsv4/OkTlc8JiBcW4tbMogyb45rUYXEo>
Subject: Re: [nfsv4] Clean up for rpcrdma-version-two Read chunks

> > This would require sorting by the receiver.

> It would indeed.

It would require that if they could arrive unsorted.

> The alternative is kind of disastrous.

I think a reasonable alternative is to ask the sender to send them in
order.  I understand you don't agree with that but can't see how it is
"disastrous".

> Gaps would probably contain data from a previous use of the Read sink
> buffer.

I don't see how.  If you have multiple read chunks, it is probably due to
multiple operations, each of which contains a DDP-eligible data area, since
there is no op with more than one DDP-eligible argument. In this case the
non-DDP-eligible data appears inline.

I guess you are focused on the case of a mix of PZRCs and PNZRCs, which is
kind of a special case.

> The exact contents of overlap regions would depend on the order that the
> RNIC completed the RDMA Read operations.

I think we agree about overlapping chunks being bad but only disagree about
how best to prevent them.

> Thus IMO a good quality Responder has to perform some sanity checking on
> the position values and lengths in incoming Read segments.

I agree.

> I'm not sure
> how it could avoid some sorting-like behavior to perform this check.

That's true only if it is valid to send them out of order, which I don't
think should be the case.

> It might be better to place responsibility on the sender to sort these.

I should have said "send these in sorted order".  Realistically, requesters
will send these in sorted order and no actual sort would be required.
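
To illustrate why no sort is needed when the sender is required to send
chunks in order, here is a minimal sketch of a single-pass check.  It
assumes, purely for illustration, that each Read chunk can be summarized
by a position and a total length and that it occupies message bytes
[position, position + length).  The ReadChunk type and the
check_sorted_chunks name are hypothetical, not taken from any draft or
implementation.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class ReadChunk:
    position: int  # assumed: message byte offset where the chunk starts
    length: int    # assumed: total bytes covered by the chunk's segments


def check_sorted_chunks(chunks: Iterable[ReadChunk]) -> Optional[str]:
    """Return None if chunks arrive in order and disjoint, else a reason."""
    next_free = 0  # first message byte not claimed by an earlier chunk
    for index, chunk in enumerate(chunks):
        if chunk.length <= 0:
            return f"chunk {index}: empty chunk"
        if chunk.position < next_free:
            # Covers both an out-of-order chunk and an overlapping one.
            return f"chunk {index}: out of order or overlapping"
        next_free = chunk.position + chunk.length
    return None
```

The receiver only has to remember one number between chunks, so the
check costs one comparison per chunk and rejects both overlaps and
out-of-order lists without sorting anything.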

> IMO the goal of RPC/RDMA is to reduce host processing on the Requester.

That's an important goal but I don't agree that it is *the* goal.

> Sorting on the Responder follows that paradigm more closely than the
> converse.

I think requiring them to be in sorted order avoids a lot of complexity.
I haven't seen any case where it makes sense to send them other than in
sorted order.

With regard to this goal/paradigm, all I can say is that T. S. Kuhn has
a lot to answer for :-)

> For a given position value, RPC/RDMA already requires that Read segments
> have to be in the order they appear in the reconstructed RPC message. But
> there's currently no requirement that the Read list's position values have
> to appear in monotonically increasing order. In fact I think RPC/RDMA
> permits a Requester to interleave Read segments at different positions, as
> long as they are in the Read list in the order they should be used to
> reconstruct the RPC Call.

True, but I don't see why any requester would find it necessary/helpful
to interleave read segments in this way, except perhaps in the special case
of a PZRC/PNZRC mix.
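
If a responder did choose to tolerate interleaving, one way to do it is
a stable sort keyed on the position value, since a stable sort keeps
segments that share a position in their original Read list order.  The
sketch below is only an illustration: ReadSegment, its handle field, and
group_by_position are hypothetical names, and real segments carry more
fields than shown.

```python
from typing import List, NamedTuple


class ReadSegment(NamedTuple):
    position: int  # Position value carried by the segment
    handle: int    # hypothetical stand-in for the segment's RDMA handle
    length: int    # bytes to be pulled by this segment


def group_by_position(read_list: List[ReadSegment]) -> List[ReadSegment]:
    """Gather interleaved segments by position, preserving list order."""
    # Python's sorted() is stable, so segments with equal position keep
    # the relative order in which the requester listed them.
    return sorted(read_list, key=lambda seg: seg.position)
```

For example, a list with handles 1, 2, 3, 4 at positions 0, 64, 0, 64
comes back as handles 1, 3, 2, 4.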

> I'm working on some changes to the Linux NFS/RDMA implementation that
> might perform Responder-side Read list sorting in order to deal properly
> with Read lists that contain segments with more than one Position value.
> For example, a Read list that contains Read segments with position zero
> and Read segments with a non-zero position could be re-sorted so that all
> of the segments are in byte order at position zero.

You are talking about doing something that goes beyond sorting, even though
sorting is a part of it, and it appears to me that you are sorting by
something approximating chunk position.  You are reorganizing all the read
chunks, including changing non-PZ chunks to position-zero chunks.  In
addition, the sorting is not by the position field of the chunk but by
expected position.  I can see why you might do that if you receive
something like that, but I can't see why the requester is using a PZRC to
send this, given that, in version two, you can avoid the PZRC and use
message continuation to send the request.

> This enables the Responder
> to set up the Read sink buffer pages so at Read completion the message is
> already in proper segment and byte order.

The problem with this is that when some of the non-position-zero read
chunks are to be written to file offsets that are not a multiple of the
page size, the alignment of the data in server memory is not what you would
want it to be.  I think it is better, when allocating the pages to be read
into and then page-flipped as part of the WRITEs, to be aware of the
within-file offsets of the WRITEs.
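
The arithmetic I have in mind is roughly the following.  This is a
sketch under assumptions: a 4096-byte page size, and a hypothetical
sink_layout helper that is not part of any existing implementation.  It
computes where in the first sink page the payload should start so that
it lands at the same in-page offset it will have in the file.

```python
PAGE_SIZE = 4096  # assumed page size; real systems may differ


def sink_layout(file_offset: int, length: int, page_size: int = PAGE_SIZE):
    """Return (offset within the first page, pages needed) for a payload."""
    if file_offset < 0 or length <= 0:
        raise ValueError("file_offset must be >= 0 and length must be > 0")
    # Start the payload at the same in-page offset it has in the file.
    first_byte = file_offset % page_size
    # Round up to cover the leading gap plus the payload.
    pages = (first_byte + length + page_size - 1) // page_size
    return first_byte, pages
```

An 8192-byte WRITE at file offset 512 therefore needs three sink pages
starting at in-page offset 512, rather than the two pages a
position-zero layout would suggest.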

On Tue, May 5, 2020 at 2:21 AM Chuck Lever <chuck.lever@oracle.com> wrote:

>
> > On May 4, 2020, at 8:59 PM, David Noveck <davenoveck@gmail.com> wrote:
> >
> >> On Mon, May 4, 2020, 12:13 PM Chuck Lever <chuck.lever@oracle.com> wrote:
> >>
> >>> RPC/RDMA v1 allows a position zero Read chunk to appear in an
> >>> RDMA_MSG type Call. Where does a Responder put the inline portion
> >>> of such a message?
> >>
> >> I propose that in RPC/RDMA version 2, a Responder MUST return
> >> RDMA2_ERR_BAD_XDR if a Requester sends a Read list containing a
> >> position zero Read chunk as part of header type other than
> >> RDMA2_NOMSG.
> >
> > Agree.
> >
> >
> >>> RPC/RDMA v1 does not explicitly require an RDMA_NOMSG type Call to
> >>> have a position zero Read chunk. Does such a message have gaps? Are
> >>> they zero-filled?
> >>
> >> I propose that in RPC/RDMA version 2, a Responder MUST return
> >> RDMA2_ERR_BAD_XDR if a Requester sends an RDMA2_NOMSG header type
> >> whose Read list does not include a position zero Read chunk.
> >
> > As stated, this would forbid NOMSG being used to send a long reply.
>
> Nit: rpcrdma-version-two no longer uses the term Long message, see
> Section 4.4.3.
>
>
> > I think the text to address this needs to be careful not to foreclose
> > that. Your text above uses the word "requester" assuming this is
> > sufficient but the only way a peer receiving a message could determine
> > whether it was sent by a requester or responder is by looking at the
> > message, which, in this case, does not exist.
>
> The RPC/RDMA version two header has the RDMA2_F_RESPONSE flag (Section
> 6.2.2.1) which was introduced to enable a receiver to distinguish the
> roles of the sender and receiver peers without sniffing the RPC layer
> payload.
>
> A while back I had envisioned using Read chunks in Responder-to-
> Requester messages, and even wrote an I-D about it. But now that we
> have both Reply chunks and Message chaining, it seems unnecessary to
> hold the door open for using a Read chunk for a Reply, especially
> given how arcane Read chunks are. Did you have a particular use case in
> mind?
>
> Also, I think the proposed text above would prevent the use of the
> RDMA2_NOMSG type for asynchronous credit grants (Section 4.2.1.2) so I
> ought to restate the requirement as:
>
> >> A Responder MUST return RDMA2_ERR_BAD_XDR if a Requester sends an
> >> RDMA2_NOMSG header type with a non-empty Read list that does not
> >> include a position zero Read chunk.
>
>
>
> >>> RPC/RDMA v1 does not prevent or prohibit overlapping Read chunks. Is
> >>> the correct response ERR_CHUNK?
> >>
> >> A protocol change would be needed to totally prevent the expression
> >> of overlapping Read chunks. Maybe it's a little too late to address
> >> that in RPC/RDMA version 2.
> >
> > I think you mean version 1.
>
> No, I meant version 2. My sense of the virtual room two weeks ago was
> that no-one had the stomach for major surgery on the RPC/RDMA version 2
> data structures at this point. That's why I've limited the above proposals
> to simple requirements that the Responder recognize badly formed Read
> lists and respond to them as errors.
>
> If we were to "go there," my thought about how to address the gap/overlap
> issue would be to eliminate Read chunks and structure the Read list the
> same way that the Write list is structured; ie, as a list of arrays of
> RDMA segments, but each array would have a position field.
>
> It might be interesting to use position fields in the Write list as well,
> filled in by the Responder, to help disambiguate the position of result
> data items in a Reply message.
>
> But at this point we have escaped the orbit of RPC/RDMA version one
> entirely.
>
>
> > Nobody seems up to do rfc8166bis.
> >
> >> I propose that in RPC/RDMA version 2, a Responder MUST return
> >> RDMA2_ERR_BAD_XDR if a Requester sends a Read list with chunks whose
> >> offsets and lengths result in the same message byte position
> >> appearing in more than one Read chunk.
> >
> > This would require sorting by the receiver.
>
> It would indeed. The alternative is kind of disastrous.
>
> Gaps would probably contain data from a previous use of the Read sink
> buffer.
>
> The exact contents of overlap regions would depend on the order that the
> RNIC completed the RDMA Read operations.
>
> Thus IMO a good quality Responder has to perform some sanity checking on
> the position values and lengths in incoming Read segments. I'm not sure
> how it could avoid some sorting-like behavior to perform this check.
>
>
> > It might be better to place responsibility on the sender to sort these.
>
> IMO the goal of RPC/RDMA is to reduce host processing on the Requester.
> Sorting on the Responder follows that paradigm more closely than the
> converse.
>
> For a given position value, RPC/RDMA already requires that Read segments
> have to be in the order they appear in the reconstructed RPC message. But
> there's currently no requirement that the Read list's position values have
> to appear in monotonically increasing order. In fact I think RPC/RDMA
> permits a Requester to interleave Read segments at different positions, as
> long as they are in the Read list in the order they should be used to
> reconstruct the RPC Call.
>
> I'm working on some changes to the Linux NFS/RDMA implementation that might
> perform Responder-side Read list sorting in order to deal properly with
> Read lists that contain segments with more than one Position value. For
> example, a Read list that contains Read segments with position zero and
> Read segments with a non-zero position could be re-sorted so that all of
> the segments are in byte order at position zero. This enables the Responder
> to set up the Read sink buffer pages so at Read completion the message is
> already in proper segment and byte order.
>
>
> --
> Chuck Lever
>
>
>
>
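
For reference, the two header-type requirements quoted above (the
RDMA2_NOMSG one in its restated form) could be checked along these
lines.  This is only a sketch: the string constants are hypothetical
stand-ins for the values defined by the rpcrdma-version-two XDR, and
check_read_list is an invented name.

```python
from typing import Optional, Sequence

# Hypothetical stand-ins; the real values come from the protocol's XDR.
RDMA2_MSG = "RDMA2_MSG"
RDMA2_NOMSG = "RDMA2_NOMSG"
RDMA2_ERR_BAD_XDR = "RDMA2_ERR_BAD_XDR"


def check_read_list(header_type: str, positions: Sequence[int]) -> Optional[str]:
    """Return RDMA2_ERR_BAD_XDR for a badly formed Read list, else None.

    positions holds the position value of each Read chunk in the list.
    """
    has_pzrc = 0 in positions
    # A position zero Read chunk is only acceptable with RDMA2_NOMSG.
    if header_type != RDMA2_NOMSG and has_pzrc:
        return RDMA2_ERR_BAD_XDR
    # RDMA2_NOMSG with a non-empty Read list needs a position zero chunk;
    # an empty Read list stays legal (e.g. asynchronous credit grants).
    if header_type == RDMA2_NOMSG and positions and not has_pzrc:
        return RDMA2_ERR_BAD_XDR
    return None
```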