Re: [nfsv4] pNFS file and use of RPC-over-RDMA to access an NFSv4.1 data server
Chuck Lever <chuck.lever@oracle.com> Wed, 18 April 2018 15:05 UTC
From: Chuck Lever <chuck.lever@oracle.com>
Date: Wed, 18 Apr 2018 11:05:14 -0400
Cc: NFSv4 <nfsv4@ietf.org>
To: David Noveck <davenoveck@gmail.com>
Hi Dave-

Thanks for bringing this up. I agree it should be explored. A few comments below.

> On Apr 18, 2018, at 9:51 AM, David Noveck <davenoveck@gmail.com> wrote:
>
> As far as I can determine, there has not been significant discussion of how a client which receives a pNFS file layout is to determine whether the connection to the data server is to be established using TCP or RPC-over-RDMA. As a result, the Linux NFS client, when it receives a file layout, unconditionally establishes a TCP connection to the data server. I've had to disable pNFS support on the server for my testing so that the NFS-over-RDMA data paths can get tested, but I'd prefer that we arrive at an approach which allows interoperable implementations supporting both pNFS file and RPC-over-RDMA to be developed and tested. We have found that a similar transport selection issue occurs when following NFSv4 referrals.
>
> I have discussed this issue with some people and arrived at an initial list of ways in which this choice might be made. Additions are possible, but our principal task is arriving at a compatible subset which can serve as a basis for use in testing events and eventual standardization and deployment. One critical criterion governing our choice is the expected penalty for making the wrong choice. When the penalty is high, the need for a correct decision is strong; if we could reduce the penalty, a reasonable basis for a guess could be acceptable. In this connection, it is worth noting that Linux's RoCE support for RPC-over-RDMA currently waits 60 seconds (about 10 million times the typical round-trip inter-node delay) before deciding that RPC-over-RDMA service is not available. Until and unless that is changed, we would need a quite accurate and reliable means of determining when such support is available.
Not a broad survey by any means, but I've consulted a few people about the delay in rdma_resolve_addr when detecting end-to-end RDMA capability on certain fabrics. It is not expected behavior, and implementers should study it further.

There are many possible scenarios where bridging and routing can be employed between fabrics in ways that break end-to-end RDMA capability. I don't believe either a server or a client by itself can know for certain whether an end-to-end capability exists. For example, the server can suggest the use of RDMA even when an end-to-end capability does not exist. This is similar in nature to the problem of assessing end-to-end path bandwidth when a client attempts to select the most performant network path to a server, though the consequences here are somewhat more existential. Whatever other mechanisms are put into place, I believe this issue must be resolved as well to ensure smooth interoperation on all fabric types.

> In any case, here is the list of approaches I have come up with so far. I'd like the working group to augment the list as new approaches arise and then decide on a way forward for use in future testing events and as a basis for standardization.
>
> • Deciding that, when the metadata server connection is RDMA, data connections are, by default, presumed to be RDMA. This is a basis for a guess and would not be acceptable if the penalty for a wrong guess is high.

There will certainly be cases in which RDMA support is present on the MDS but not available on one or more DSes. In those cases the client should select a TCP connection to the MDS, IMO, and the client implementation should use the same transport capability for both. One could argue that a client administrator would expect the same transport protocol to be used for accessing the MDS and any DS when a pNFS file layout is in effect.

> • Using the port specified in the layout, with 20049 indicating that RDMA is to be attempted.
> I don't see any practical issue with this, but it is kind of hacky.

I don't have a problem with this mechanism. A precedent has already been established with non-pNFS connections, and the port number is codified in standards-track RFCs.

> • Using a value of "rdma" or "rdma6" instead of "tcp" or "tcp6" in the netaddr field of the device info. I'm not sure how hard or time-consuming it would be to have a new netaddr assigned in the case in which we use existing universal address formats. We need to investigate this.

rdma and rdma6 are already IANA-assigned netids. See:

https://www.iana.org/assignments/rpc-netids/rpc-netids.xhtml

> • Expecting the client to connect, at first, using a TCP connection, then interrogating the fs_locations_info attribute and using the FSLI4TF_RDMA flag as an indication that the client is to access this fs using an RDMA-capable transport.

This seems like a reliable long-term solution, and it has the benefit of providing not one but a list of interfaces and transport types. However, it falls prey to the issue above, where both ends could support RDMA but the end-to-end network path might not.

> • Expecting the client to connect, at first, using a TCP connection, then do a CREATE_SESSION specifying CREATE_SESSION4_FLAG_CONN_RDMA; or, to address transports in which a within-connection step-up is not possible (i.e., RoCE or InfiniBand), defining a new similar create-session flag to support creation of a new connection to obtain RDMA support on the session, using a BIND_CONN_TO_SESSION on the new connection.

The ultimate issue is path discovery. Here are other ideas:

- We invent a control plane protocol that allows a network administrator to specify, or some autonomous agent to discover, the true end-to-end capabilities of the storage fabric. Clients consult this information before attempting to establish connections.

- A client should attempt to establish connections to all entries in an fs_locations result in parallel.
The server can report the availability of RDMA-enabled interfaces, and for those entries the client can try both RDMA and TCP. The first connection that completes is used, and the others are discarded as they complete or time out. This is also a good strategy for discovering end-to-end IPv6 capability.

--
Chuck Lever
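[Editor's illustration] Two of the mechanisms discussed in this thread can be sketched concretely. The following is a hypothetical user-space sketch in Python; the Linux client itself is kernel C, and every name below is invented for illustration, not taken from any implementation. It shows transport selection from an IANA-registered netid and an RFC 5665 universal address, treating port 20049 (the IANA-assigned port for NFS/RDMA) as an RDMA hint, plus the racing strategy: attempt all candidate paths in parallel and keep the first that connects.

```python
# Hypothetical sketch only; illustrative user-space Python, not the
# Linux kernel client. All names here are invented for the example.
from concurrent.futures import ThreadPoolExecutor, as_completed

# IANA-registered RPC netids -> (address family, transport)
NETID_TRANSPORT = {
    "tcp":   ("ipv4", "tcp"),
    "tcp6":  ("ipv6", "tcp"),
    "rdma":  ("ipv4", "rdma"),
    "rdma6": ("ipv6", "rdma"),
}

def parse_ipv4_uaddr(uaddr):
    """Split an RFC 5665 IPv4 universal address, "h1.h2.h3.h4.p1.p2",
    into (host, port), where port = p1 * 256 + p2."""
    parts = uaddr.split(".")
    return ".".join(parts[:4]), int(parts[4]) * 256 + int(parts[5])

def choose_transport(netid, uaddr):
    """Pick a transport for a device-info entry from its netid,
    falling back on the port-20049 hint for "tcp" entries."""
    family, transport = NETID_TRANSPORT[netid]
    host, port = parse_ipv4_uaddr(uaddr)   # IPv6 handling omitted
    if transport == "tcp" and port == 20049:
        transport = "rdma"                 # port hint: attempt RDMA
    return host, port, transport

def race_connections(candidates, connect):
    """Try every (host, port, transport) candidate in parallel and
    return (candidate, conn) for the first successful connect().
    Losing attempts are discarded as they complete or fail."""
    with ThreadPoolExecutor(max_workers=len(candidates)) as pool:
        futures = {pool.submit(connect, *c): c for c in candidates}
        for fut in as_completed(futures):
            try:
                return futures[fut], fut.result()
            except OSError:
                continue                   # this path failed; keep waiting
    raise OSError("no end-to-end path to any candidate")
```

For example, an entry advertised with netid "rdma" and uaddr "192.168.1.10.78.81" resolves to ("192.168.1.10", 20049, "rdma"), since 78 * 256 + 81 = 20049. A real client would also close losing connections cleanly, and would still need the end-to-end capability checks discussed above: a successful RDMA connect to a bridge does not guarantee RDMA all the way to the server.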