Re: [nfsv4] pNFS file and use of RPC-over-RDMA to access an NFSv4.1 data server
Chuck Lever <chuck.lever@oracle.com> Wed, 18 April 2018 15:05 UTC
From: Chuck Lever <chuck.lever@oracle.com>
Date: Wed, 18 Apr 2018 11:05:14 -0400
Cc: NFSv4 <nfsv4@ietf.org>
To: David Noveck <davenoveck@gmail.com>
Hi Dave-

Thanks for bringing this up. I agree it should be explored. A few comments below.

> On Apr 18, 2018, at 9:51 AM, David Noveck <davenoveck@gmail.com> wrote:
>
> As far as I can determine, there has not been significant discussion of how a client which receives a pNFS file layout is to determine whether the connection to the data server is to be established using TCP or RPC-over-RDMA. As a result, the Linux NFS client, when it receives a file layout, unconditionally establishes a TCP connection to the data server. I've had to disable pNFS support on the server for my testing so that the NFS-over-RDMA data paths can get tested, but I'd prefer that we arrive at an approach which allows interoperable implementations supporting both pNFS file and RPC-over-RDMA to be developed and tested. We have found that a similar transport selection issue occurs when following NFSv4 referrals.
>
> I have discussed this issue with some people and arrived at an initial list of ways in which this choice might be made. Additions are possible, but our principal task is arriving at a compatible subset which can serve as a basis for use in testing events and eventual standardization and deployment. One critical criterion governing our choice is the expected penalty for making the wrong choice. When the penalty is high, the need for a correct decision is strong; if we could reduce the penalty, a reasonable basis for a guess could be acceptable. In this connection, it is worth noting that Linux's RoCE support for RPC-over-RDMA currently waits 60 seconds (about 10 million times the typical round-trip inter-node delay) before deciding that RPC-over-RDMA service is not available. Until and unless that is changed, we would need a quite accurate and reliable means of determining when such support is available.
Not a broad survey by any means, but I've consulted a few people about the delay in rdma_resolve_addr when detecting end-to-end RDMA capability on certain fabrics. It is not expected behavior, and implementers should study it further.

There are many possible scenarios where bridging and routing can be employed between fabrics in ways that break end-to-end RDMA capability. I don't believe either a server or a client by itself can know for certain whether an end-to-end capability exists. For example, the server can suggest the use of RDMA even when an end-to-end capability does not exist. This is similar in nature to the problem of assessing end-to-end path bandwidth when a client attempts to select the most performant network path to a server, though the consequences here are somewhat more existential. Whatever other mechanisms are put into place, I believe this issue must be resolved as well to ensure smooth interoperation on all fabric types.

> In any case, here is the list of approaches I have come up with so far. I'd like the working group to augment the list as new approaches arise and then decide on a way forward for use in future testing events and as a basis for standardization.
>
> • Deciding that, when the metadata server connection is RDMA, data connections are, by default, presumed to be RDMA. This is a basis for a guess and would not be acceptable if the penalty for a wrong guess is high.

There will certainly be cases in which RDMA support is present on the MDS but not available on one or more DSes. In those cases the client should select a TCP connection to the MDS, IMO, and the client implementation should use the same transport capability for both. One could argue that a client administrator would expect the same transport protocol to be used for accessing the MDS and any DS when a pNFS file layout is in effect.

> • Using the port specified in the layout, with 20049 indicating that RDMA is to be attempted.
> I don't see any practical issue with this, but it is kind of hacky.

I don't have a problem with this mechanism. A precedent has already been established with non-pNFS connections, and the port number is codified in standards-track RFCs.

> • Using a value of "rdma" or "rdma6" instead of "tcp" or "tcp6" in the netaddr field of the device info. I'm not sure how hard or time-consuming it would be to have a new netaddr assigned in the case in which we use existing universal address formats. We need to investigate this.

rdma and rdma6 are already IANA-assigned netids. See:

https://www.iana.org/assignments/rpc-netids/rpc-netids.xhtml

> • Expecting the client to connect, at first, using a TCP connection, then interrogating the fs_locations_info attribute and using the FSLI4TF_RDMA flag as an indication that the client is to access this fs using an RDMA-capable transport.

This seems like a reliable long-term solution, and it has the benefit of providing not one but a list of interfaces and transport types. However, it falls prey to the issue above, where both ends could support RDMA but the end-to-end network path might not.

> • Expecting the client to connect, at first, using a TCP connection, then do a CREATE_SESSION specifying CREATE_SESSION4_FLAG_CONN_RDMA; or, to address transports in which a within-connection step-up is not possible (i.e., RoCE or InfiniBand), defining a new similar create-session flag to support creation of a new connection to obtain RDMA support on the session, using a BIND_CONN_TO_SESSION on the new connection.

The ultimate issue is path discovery. Here are other ideas:

- We invent a control plane protocol that allows a network administrator to specify, or some autonomous agent to discover, the true end-to-end capabilities of the storage fabric. Clients consult this information before attempting to establish connections.

- A client should attempt to establish connections to all entries in an fs_locations result in parallel.
The server can report the availability of RDMA-enabled interfaces, and for those entries the client can try both RDMA and TCP. The first connection that completes is used, and the others are discarded as they complete or time out. This is also a good strategy for discovering end-to-end IPv6 capability.

--
Chuck Lever
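[Editor's illustration] Two of the mechanisms discussed in this thread can be sketched concretely. The following is a hypothetical user-space sketch in Python; the Linux client itself is kernel C, and every name below is invented for illustration, not taken from any implementation. It shows transport selection from an IANA-registered netid and an RFC 5665 universal address, treating port 20049 (the IANA-assigned port for NFS/RDMA) as an RDMA hint, plus the racing strategy: attempt all candidate paths in parallel and keep the first that connects.

```python
# Hypothetical sketch only; illustrative user-space Python, not the
# Linux kernel client. All names here are invented for the example.
from concurrent.futures import ThreadPoolExecutor, as_completed

# IANA-registered RPC netids -> (address family, transport)
NETID_TRANSPORT = {
    "tcp":   ("ipv4", "tcp"),
    "tcp6":  ("ipv6", "tcp"),
    "rdma":  ("ipv4", "rdma"),
    "rdma6": ("ipv6", "rdma"),
}

def parse_ipv4_uaddr(uaddr):
    """Split an RFC 5665 IPv4 universal address, "h1.h2.h3.h4.p1.p2",
    into (host, port), where port = p1 * 256 + p2."""
    parts = uaddr.split(".")
    return ".".join(parts[:4]), int(parts[4]) * 256 + int(parts[5])

def choose_transport(netid, uaddr):
    """Pick a transport for a device-info entry from its netid,
    falling back on the port-20049 hint for "tcp" entries."""
    family, transport = NETID_TRANSPORT[netid]
    host, port = parse_ipv4_uaddr(uaddr)   # IPv6 handling omitted
    if transport == "tcp" and port == 20049:
        transport = "rdma"                 # port hint: attempt RDMA
    return host, port, transport

def race_connections(candidates, connect):
    """Try every (host, port, transport) candidate in parallel and
    return (candidate, conn) for the first successful connect().
    Losing attempts are discarded as they complete or fail."""
    with ThreadPoolExecutor(max_workers=len(candidates)) as pool:
        futures = {pool.submit(connect, *c): c for c in candidates}
        for fut in as_completed(futures):
            try:
                return futures[fut], fut.result()
            except OSError:
                continue                   # this path failed; keep waiting
    raise OSError("no end-to-end path to any candidate")
```

For example, an entry advertised with netid "rdma" and uaddr "192.168.1.10.78.81" resolves to ("192.168.1.10", 20049, "rdma"), since 78 * 256 + 81 = 20049. A real client would also close losing connections cleanly, and would still need the end-to-end capability checks discussed above: a successful RDMA connect to a bridge does not guarantee RDMA all the way to the server.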