[nfsv4] Fwd: RPC/RDMA read chunk round-up when inline content follows

David Noveck <davenoveck@gmail.com> Wed, 07 January 2015 09:58 UTC

In-Reply-To: <CADaq8jcuRP7b3Msne+Wu0O_HYWNAZnNkXjtTKo9d3uO5usE+8A@mail.gmail.com>
References: <6F552256-89A7-4101-B7A1-4EFDEEFDACEB@oracle.com> <CADaq8jfFDgWSpg6QL3pPeJ01GJ=jhUMj9cVN_vM6qKFmcnq7jA@mail.gmail.com> <F403FCE4-B173-4A4C-8EFC-767A92CEE0C0@oracle.com> <BLUPR03MB3288D2139A15F766701A225A0580@BLUPR03MB328.namprd03.prod.outlook.com> <20150106185826.GA28003@fieldses.org> <CADaq8jcuRP7b3Msne+Wu0O_HYWNAZnNkXjtTKo9d3uO5usE+8A@mail.gmail.com>
Date: Wed, 07 Jan 2015 04:58:24 -0500
Message-ID: <CADaq8jfLLK1ibo4gGXoKiaPN3FndJWBBxUqubc1L_Ek3EP4JuA@mail.gmail.com>
From: David Noveck <davenoveck@gmail.com>
To: "nfsv4@ietf.org" <nfsv4@ietf.org>
Content-Type: multipart/alternative; boundary="047d7b33d676e58803050c0cf658"
Archived-At: http://mailarchive.ietf.org/arch/msg/nfsv4/WqLgxb8L5m8SGtlL_mWDYnTJL38
Subject: [nfsv4] Fwd: RPC/RDMA read chunk round-up when inline content follows

---------- Forwarded message ----------
From: David Noveck <davenoveck@gmail.com>
Date: Tue, Jan 6, 2015 at 2:56 PM
Subject: Re: [nfsv4] RPC/RDMA read chunk round-up when inline content follows
To: "J. Bruce Fields" <bfields@fieldses.org>


> So if I understand correctly: fixing the servers will make them fail all
> writes from current clients,

If it hurts when you do that, don't do that.  It seems you can make servers
accept both old and new client behavior, even if the old behavior is a BAD IDEA.

One way to map this into RFC 2119 language is to say that servers SHOULD
reject older-form requests and MUST accept newer-form requests.
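
As a rough illustration of a server that tolerates both request forms, here
is a minimal sketch in Python. It is not taken from any existing server. The
names reassemble and xdr_pad_len are hypothetical, and it assumes that a read
chunk's "position" is the byte offset in the XDR stream at which the chunk
data belongs, and that an old-form client carries the XDR pad and the trailing
GETATTR in later segments of the same read chunk.

```python
def xdr_pad_len(length: int) -> int:
    """Number of zero bytes needed to round length up to a 4-byte XDR boundary."""
    return (4 - length % 4) % 4


def reassemble(inline: bytes, position: int, segments: list[bytes]) -> bytes:
    """Rebuild the XDR stream from inline content plus one read chunk.

    inline   -- bytes received via RDMA SEND (chunk data already removed)
    position -- assumed XDR offset at which the chunk data belongs
    segments -- data pulled via RDMA READ for that chunk, in order
    """
    chunk = b"".join(segments)
    # Round the chunk up to the next XDR position if the sender did not.
    chunk += b"\x00" * xdr_pad_len(len(chunk))
    # Any inline content at or after 'position' follows the rounded-up chunk.
    return inline[:position] + chunk + inline[position:]


if __name__ == "__main__":
    head = b"HEADHEAD"          # stand-in for PUTFH + WRITE arguments
    data = b"abcde"             # 5 bytes of file data, so 3 pad bytes needed
    getattr_op = b"GETA"        # stand-in for the trailing GETATTR

    # Older form: GETATTR rides in the read list after the padded file data.
    old_form = reassemble(head, len(head), [data + b"\x00" * 3, getattr_op])
    # Newer form: only the file data is in the read list; GETATTR is inline.
    new_form = reassemble(head + getattr_op, len(head), [data])
    print(old_form == new_form)
```

Under those assumptions the same routine yields an identical XDR stream for
either form, which is the sense in which a server can accept both.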


> and fixing the clients will make the writes fail against all current
> servers.

I'm not sure how to state this in RFC 2119 language, but I'd argue that one
should not fix the clients until servers are fixed to accept newer-form
requests.


> Isn't it a little late for that kind of change?

I think it is.

> Or are there really so
> few NFS/RDMA users that we can afford to completely break backwards
> compatibility?

I guess it depends on whether you are one of those users.

The approach described in draft-ietf-nfsv4-versioning is to make server
support for old-form and new-form requests into separate features that the
client can test for, allowing clients to be fixed to adapt to server behavior
in parallel with making servers that support both behaviors.

Not only is that quite heavyweight for this problem, but that draft is also
a long way from IESG approval and publication.


On Tue, Jan 6, 2015 at 1:58 PM, J. Bruce Fields <bfields@fieldses.org>
wrote:

> On Mon, Jan 05, 2015 at 06:50:44PM +0000, Tom Talpey wrote:
> > I hope this additional reply doesn't make the thread too convoluted.
> >
> > > -----Original Message-----
> > > From: nfsv4 [mailto:nfsv4-bounces@ietf.org] On Behalf Of Chuck Lever
> > > Sent: Thursday, January 1, 2015 1:51 PM
> > > To: David Noveck
> > > Cc: Chunli Zhang; nfsv4 list (nfsv4@ietf.org); Karen; Dai Ngo
> > > Subject: Re: [nfsv4] RPC/RDMA read chunk round-up when inline content
> > > follows
> > >
> > > Hi Dave-
> > >
> > > Sorry for the length. There's a lot to tease apart.
> > >
> > >
> > > On Jan 1, 2015, at 7:20 AM, David Noveck <davenoveck@gmail.com> wrote:
> > >
> > > > > It turns out that Linux is the only NFS client that supports RDMA
> > > > > and adds an operation (GETATTR) after WRITE in an NFSv4 compound.
> > > >
> > > > Presumably it sends the GETATTR as a separate chunk.  I suppose that
> > > > could be a performance issue.
> > >
> > > Or efficiency: extra work has to be done to register that 16-byte piece
> > > of the compound RPC and send it in a separate RDMA READ. That's extra
> > > transport bytes on the wire, and an extra round trip (albeit a very fast
> > > one).
> >
> > It's my opinion this is a bug, in both the client and the server here.
> >
> > The intention of an RDMA Chunk is that it encodes all or part of an RPC
> > argument, RPC result, or an entire RPC message. The fact that the client
> > added a compounded GETATTR to the write data's Read Chunk is wrong.
> > The fact that the server parsed the additional data after the write
> > payload as an RPC request is also wrong. Keeping these segments separate
> > is critical to the RFC5667 specification of the NFS binding to RPC/RDMA.
>
> So if I understand correctly: fixing the servers will make them fail all
> writes from current clients, and fixing the clients will make their
> writes fail against all current servers.
>
> Isn't it a little late for that kind of change?  Or are there really so
> few NFS/RDMA users that we can afford to completely break backwards
> compatibility?
>
> --b.
>
> >
> > That said, the text in RFC5667 may be unclear. I'd suggest that it needs
> > to be revisited anyway, to reflect implementation experience as well as
> > the new behaviors of NFSv4.1, v4.2 etc. The document is five years old,
> > and it reflects work done prior to that. In particular, while it was
> > published at the same time as RFC5661 (NFSv4.1), it actually doesn't make
> > any specific requirements for that minor version beyond mentioning the
> > callback channel. That, plus pNFS, layouts, CREATE, OPENATTR, ACLs, etc.
> >
> > >
> > > > > Linux sends the part of the compound before the opaque file data
> > > > > via RDMA SEND. The opaque file data is put into the read list. Then
> > > > > the GETATTR operation is put in the read list after the file data.
> > > >
> > > > > Existing NFS/RDMA servers are able to receive and process this
> > > > > request correctly.
> > > >
> > > > The interesting question is whether they would work if one switched
> > > > from the chunks-only-at-the-end approach to the alternative
> > > > chunks-only-for-WRITE approach.
> > >
> > > The current Linux NFS server and the Solaris 11 update 2 NFS server do
> > > not handle a request with additional inline content. There is a minor
> > > exception, but let's put that aside for the moment.
> > >
> > > Recall that the Linux NFS client sends { PUTFH, WRITE, GETATTR } as its
> > > NFSv4 WRITE compound. I've constructed a Linux client that sends the
> > > GETATTR as additional inline content rather than at the end of the
> > > read list.
> > >
> > > The Linux server accepts the PUTFH and WRITE, but then returns 10044
> > > (OP_ILLEGAL) for the third op, because it finds nonsense when it looks
> > > in the XDR stream for the arguments of the third op. (I do have a fix
> > > for this).
> > >
> > > I can make the converse point: sending an RPC request where only a
> > > middle argument is placed in a chunk list is clearly allowed by RFC 5666,
> > > and is not forbidden by RFC 5667. Having "position" fields for each chunk
> > > in the read list means a server MUST be prepared to accept inline content
> > > following a read list.
> >
> > I'm not sure I agree with "MUST", but I do agree that there is sufficient
> > (redundant) information in the encoding of the request for the server to
> > perform such an action.
> >
> > >
> > > Therefore current NFS/RDMA server implementations are broken. If we
> > > agree that sending the final GETATTR in the read list is allowed, then
> > > servers should accept that in addition to NFSv4 compounds where the
> > > final GETATTR is sent as additional inline content.
> > >
> > > > > First question: does the Linux client comply with RFCs 5666 and
> > > > > 5667 when it sends an NFSv4 WRITE compound in this way, or should it
> > > > > be changed?
> > > >
> > > > It clearly complies with RFC5666, which grants the sender a lot of
> > > > freedom as to how it chooses to send individual data elements.
> >
> > While freedom is the intention of RFC5666, it is (was) the intention of
> > RFC5667 to narrowly define those freedoms for the NFS protocol family.
> > There is a requirement in RFC5666 that each upper layer provide such a
> > binding, with the appropriate rules to ensure interoperability.
> >
> > > >
> > > > > The intention of RFC 5667 appears to be that the GETATTR belongs
> > > > > inline, following the front of the compound.
> > > >
> > > > I think the basic idea is that RDMA is for large data elements, such
> > > > as data to be written, but I don't think it comes out and forbids you
> > > > from sending other things.
> > >
> > > This text from section 4 seems to disallow the use of a read list for
> > > anything but opaque file data or symlink pathnames:
> > >
> > > "Similarly, a single RDMA Read list entry MAY be posted by the client to
> > > supply the opaque file data for a WRITE request or the pathname for a
> > > SYMLINK request.  The server MUST ignore any Read list for other NFS
> > > procedures, as well as additional Read list entries beyond the first in
> > > the list."
> >
> > Note however that RFC5667 section 4 is explicitly about NFS versions 2
> > and 3, and the "MUST ignore" statement is purely about simplifying the
> > nfsdirect layering to align with the (simple) requirements of NFS v2 and
> > v3, transfer-wise.
> >
> > Section 5 in turn is silent on the issue of SYMLINK, which of course is
> > implemented in NFSv4.x as the CREATE procedure. That's clearly an
> > omission.
> >
> > I agree with many of the points you make below, but in the absence of
> > clear normative statements in RFC5667, I think they simply reinforce the
> > need for updating the nfsdirect specification.
> >
> > Tom.
> >
> >
> >
> > >
> > > "The server MUST ignore" is a clear statement that placing other
> > > arguments in a read list will prevent interoperation, at least for
> > > NFSv2 and NFSv3 Direct.
> > >
> > > The "MAY" refers to an alternate mechanism of sending large RPC
> > > requests: RDMA_NOMSG, where the entire RPC message is encapsulated in
> > > a read list at position 0.
> > >
> > > > The only basis on which you might argue that this violates RFC5667
> > > > would be the following text from section 2:
> > > >
> > > > Large chunks of data,
> > > > such as the file data of an NFS WRITE request, MAY be referenced by an
> > > > RDMA Read list and be moved efficiently and directly placed by an RDMA
> > > > Read operation initiated by the server.
> > > >
> > > > One could argue that this somehow implies you "MAY NOT" transfer other
> > > > sorts of request data using RDMA read chunks.  I don't read it that
> > > > way.
> > >
> > > > > Only the file data
> > > > > payload belongs in the read list, if NFSv4 Direct is to be
> > > > > consistent with NFSv2 and NFSv3.
> > > >
> > > > I don't think NFSv4 can be consistent with previous NFS protocols
> > > > because it is different in having COMPOUND.
> > >
> > > I don't agree that NFSv4 cannot be consistent with legacy NFS. Perhaps
> > > it is not stated clearly in RFC 5667 section 5 how to make NFSv4
> > > equivalent to NFSv2/3 Direct, but observe that:
> > >
> > > 1. If you consider only WRITE operations, sending only the opaque file
> > >    data payload is allowed in all three cases.
> > >
> > > 2. If you consider NFSv2 SYMLINK, there is an argument (sattr)
> > >    following the link pathname argument. RFC 5667 says only the link
> > >    pathname may be sent in a read list. This is a clear requirement
> > >    that the middle argument is conveyed via a read list, and the
> > >    remaining arguments inline. The case where both the pathname and
> > >    the sattr argument are sent in a read list is explicitly not
> > >    allowed.
> > >
> > > The legacy-consistent way of sending an NFSv4 WRITE compound would be
> > > to put only the opaque file data in a read list, no matter what else
> > > appears in the NFSv4 COMPOUND request with the WRITE operation. (Small
> > > WRITEs of course are always sent inline).
> > >
> > > > Your choice is to be consistent with NFSv3 in one of two ways:
> > > >   * In having chunks only at the end of the request.
> > > >   * In having chunks only for WRITE data.
> > >
> > > RFC 5667 section 5 has two sentences discussing the use of read lists
> > > with NFSv4 COMPOUND requests:
> > >
> > > "The situation is similar for RDMA Read lists sent by the client and
> > > applies to the NFSv4.0 WRITE and SYMLINK procedures as for v3.
> > > Additionally, inline segments too large to fit in posted buffers MAY be
> > > transferred in special "RDMA_NOMSG" messages."
> > >
> > > Although it does suggest strongly that NFSv4 WRITE should be consistent
> > > with NFSv3 WRITE, at first glance it doesn't help us decide which of
> > > your two bullets is the correct interpretation.
> > >
> > > "The situation is similar", I believe, refers to earlier text in
> > > section 5, not to the text in section 4 that discusses WRITE/SYMLINK.
> > > Earlier, section 5 goes as far as to say:
> > >
> > > "The Write list MUST be considered only for the COMPOUND procedure.
> > > This procedure returns results from a sequence of operations.  Only the
> > > opaque file data from an NFS READ operation and the pathname from a
> > > READLINK operation MUST utilize entries from the Write list."
> > >
> > > The "Only . . . MUST" construction in the third sentence is slippery.
> > >
> > > It can be read as allowing other operations and arguments to use a
> > > write list, and as requiring READ and READLINK to use a write list in
> > > all cases, including short payloads. That reading, while literal, seems
> > > contradictory with the rest of the document.
> > >
> > > Or it can be read as allowing a write list only for READ and READLINK
> > > operations in NFSv4 compounds. That reading I believe is consistent
> > > with the rest of the document and earlier versions of the NFS Direct
> > > protocol, and would also apply to WRITE in the read list case.
> > >
> > > Also worth mentioning is that NFSv4 does not have a separate SYMLINK
> > > operation. Thus the explicit mention of an NFSv4 SYMLINK operation in
> > > section 5 is incorrect. It could be replaced with CREATE(NF4LNK). Or
> > > using a read list during symlink creation could simply be abandoned for
> > > NFSv4, in favor of sending large CREATE requests via RDMA_NOMSG.
> > >
> > > In sum, a very particular reading of section 5 suggests that only WRITE
> > > payloads in NFSv4 COMPOUND requests can be sent in a read list, but
> > > there seems to be some wiggle room. It would help me, as an implementer,
> > > if RFC 5667 section 5 could be clarified.
> > >
> > > > > Now suppose the Linux client is changed to place the GETATTR
> > > > > operation inline, following the front of the compound.
> > > >
> > > > And assuming the existing servers are OK with that?
> > >
> > > > > When the opaque file data requires round-up because its length is
> > > > > not divisible by four, should there be a pad?
> > > >
> > > > No.
> > > >
> > > > > if so, where does it belong?
> > > >
> > > > > To put it another way: Since inline data does not have a "position"
> > > > > field, should the receiver assume that inline content following the
> > > > > read list begins right at the end of the rounded-up chunk?
> > > >
> > > > No.
> > > >
> > > > > or should
> > > > > the receiver assume the inline content begins at the next XDR
> > > > > position after the end of the chunk?
> > > >
> > > > Yes.
> > > >
> > > > > If the former, that would require inserting a zero pad before the
> > > > > inline content. But that pad would cause the following inline
> > > > > content to be sent unaligned, since a zero pad is always shorter
> > > > > than 4 bytes.
> > > >
> > > > > By implication, then, the receiver MUST conclude that round-up is
> > > > > present in the case when inline data remains to be decoded (i.e., the
> > > > > following inline content always begins at the next XDR position).
> > > > > The sender MUST NOT send a pad inline in this case. Is this correct?
> > > >
> > > > Yes.
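
To make the round-up arithmetic in the exchange above concrete, here is a
small Python sketch. The function names are hypothetical. It assumes the
reading agreed above: the chunk's "position" is a byte offset in the XDR
stream, no pad is sent inline, and inline content after the chunk resumes at
the next 4-byte XDR boundary.

```python
def xdr_pad_len(length: int) -> int:
    """Zero bytes implied after 'length' bytes of opaque data (always 0..3)."""
    return (4 - length % 4) % 4


def next_inline_position(chunk_position: int, chunk_length: int) -> int:
    """XDR offset at which inline content following the chunk is decoded.

    The receiver accounts for the round-up itself; the sender does not put
    the pad in the inline portion.
    """
    return chunk_position + chunk_length + xdr_pad_len(chunk_length)


if __name__ == "__main__":
    # A WRITE payload of 4097 bytes placed at XDR offset 128:
    # the implied pad is 3 bytes, so the next op is decoded at offset 4228.
    print(xdr_pad_len(4097), next_inline_position(128, 4097))
```

Since the pad is always 0 to 3 bytes, sending it inline would leave the
following content misaligned, which is why the receiver infers it instead.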
> > >
> > > Thanks for confirming my interpretation, and for helping me to sharpen
> > > my understanding of these RFCs.
> > >
> > > > > I've read RFC 5666 section 3.7. Paragraph 4 seems most relevant,
> > > > > but it doesn't discuss the case where there is additional inline
> > > > > content following a chunk list.
> > > >
> > > > I think the following is relevant:
> > > >
> > > > Because this position will not match (else roundup would not have
> > > > occurred), the receiver decoding will fall back to inspecting the
> > > > remaining inline portion.
> > > >
> > > > This may not be clear enough.  Maybe a clarifying editorial erratum
> > > > would be justified.
> > >
> > > Let me know if you'd like me to file one.
> > >
> > > > On Tue, Dec 30, 2014 at 12:53 PM, Chuck Lever <chuck.lever@oracle.com>
> > > > wrote:
> > > > Hi-
> > > >
> > > > It turns out that Linux is the only NFS client that supports RDMA and
> > > > adds an operation (GETATTR) after WRITE in an NFSv4 compound.
> > > >
> > > > Linux sends the part of the compound before the opaque file data via
> > > > RDMA SEND. The opaque file data is put into the read list. Then the
> > > > GETATTR operation is put in the read list after the file data.
> > > >
> > > > Existing NFS/RDMA servers are able to receive and process this
> > > > request correctly.
> > > >
> > > > First question: does the Linux client comply with RFCs 5666 and 5667
> > > > when it sends an NFSv4 WRITE compound in this way, or should it be
> > > > changed?
> > > >
> > > > The intention of RFC 5667 appears to be that the GETATTR belongs
> > > > inline, following the front of the compound. Only the file data
> > > > payload belongs in the read list, if NFSv4 Direct is to be consistent
> > > > with NFSv2 and NFSv3
> > > >
> > > > Now suppose the Linux client is changed to place the GETATTR
> > > > operation inline, following the front of the compound.
> > > >
> > > > When the opaque file data requires round-up because its length is
> > > > not divisible by four, should there be a pad? If so, where does it
> > > > belong?
> > > >
> > > > To put it another way: Since inline data does not have a "position"
> > > > field, should the receiver assume that inline content following the
> > > > read list begins right at the end of the rounded-up chunk? Or should
> > > > the receiver assume the inline content begins at the next XDR
> > > > position after the end of the chunk?
> > > >
> > > > If the former, that would require inserting a zero pad before the
> > > > inline content. But that pad would cause the following inline content
> > > > to be sent unaligned, since a zero pad is always shorter than 4 bytes
> > > > long.
> > > >
> > > > By implication, then, the receiver MUST conclude that round-up is
> > > > present in the case when inline data remains to be decoded (ie, the
> > > > following inline content always begins at the next XDR position). The
> > > > sender MUST NOT send a pad inline in this case. Is this correct?
> > > >
> > > > I've read RFC 5666 section 3.7. Paragraph 4 seems most relevant, but
> > > > it doesn't discuss the case where there is additional inline content
> > > > following a chunk list.
> > > >
> > > > Thanks for reading!
> > > >
> > > > --
> > > > Chuck Lever
> > > >
> > > > _______________________________________________
> > > > nfsv4 mailing list
> > > > nfsv4@ietf.org
> > > > https://www.ietf.org/mailman/listinfo/nfsv4
> > > >
> > >
> > > --
> > > Chuck Lever
> > > chuck[dot]lever[at]oracle[dot]com
> > >
> > >
> > >
> >
>