[nfsv4] pNFS file and use of RPC-over-RDMA to access an NFSv4.1 data server

David Noveck <davenoveck@gmail.com> Wed, 18 April 2018 13:52 UTC

From: David Noveck <davenoveck@gmail.com>
Date: Wed, 18 Apr 2018 09:51:57 -0400
Message-ID: <CADaq8jfcEa3xz1UNwrWkcvLZAr=eHKzQKppR+Kuqbzyq8H2ueQ@mail.gmail.com>
To: NFSv4 <nfsv4@ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/nfsv4/vH9nlPTk88uQTreCyBmKY4XOVgw>
Subject: [nfsv4] pNFS file and use of RPC-over-RDMA to access an NFSv4.1 data server

As far as I can determine, there has not been significant discussion of
how a client that receives a pNFS file layout is to determine whether the
connection to the data server is to be established using TCP or
RPC-over-RDMA.  As a result, the Linux NFS client, when it receives a file
layout, unconditionally establishes a TCP connection to the data server.
I’ve had to disable pNFS support on the server for my testing, so that the
NFS-over-RDMA data paths can get tested, but I’d prefer that we arrive
at an approach that allows interoperable implementations supporting both
pNFS file layouts and RPC-over-RDMA to be developed and tested.

I have discussed this issue with some people and arrived at an initial list
of ways in which this choice might be made.  Additions are possible, but
our principal task is to arrive at a compatible subset that can serve as a
basis for testing events and for eventual standardization and
deployment.  One critical criterion governing our choice is the expected
penalty for making the wrong choice.  When this is high, there is a strong
need for a correct decision, while, if we could reduce the penalty, a
reasonable basis for a guess could be acceptable.  In this
connection, it is worth noting that Linux’s RoCE support for
RPC-over-RDMA currently waits 60 seconds (i.e. about 10 million times the
typical round-trip inter-node delay time) before deciding that
RPC-over-RDMA service is not available.  Until and unless that is changed,
we would need an accurate and reliable means of determining when such
support is available.
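
To make the penalty argument concrete, here is a rough back-of-the-envelope
sketch.  The 60-second timeout and the 10-million ratio are the figures
above; the helper name expected_penalty_s and the 1% wrong-guess rate are
purely illustrative assumptions, not measurements.

```python
# Figures taken from the paragraph above.
TIMEOUT_S = 60.0          # wait before RDMA is declared unavailable
RATIO = 10_000_000        # timeout expressed in round-trip times

# Implied typical inter-node round-trip time, in seconds.
rtt_s = TIMEOUT_S / RATIO


def expected_penalty_s(p_wrong: float, timeout_s: float = TIMEOUT_S) -> float:
    """Average delay per connection attempt if a fraction p_wrong of
    guesses wrongly pick RDMA (hypothetical helper, simple expectation)."""
    return p_wrong * timeout_s


if __name__ == "__main__":
    print(f"implied round trip: {rtt_s * 1e6:.1f} microseconds")
    # Illustrative assumption: 1 guess in 100 is wrong.
    print(f"expected penalty at 1% wrong guesses: {expected_penalty_s(0.01):.2f} s")
```

Even a small error rate costs far more than a round trip, which is why a
guess is only tolerable if the timeout itself is reduced.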

In any case, here is the list of approaches I have come up with so far.
I’d like the working group to augment the list as new approaches arise and
then decide on a way forward for use in future testing events and as a
basis for standardization.


   - Deciding that, when the metadata server connection is RDMA, data
   connections are, by default, presumed to be RDMA.  This is a basis for a
   guess and would not be acceptable if the penalty for a wrong guess is
   high.  There will certainly be cases in which RDMA support is present on
   the MDS but is not available on one or more DSs.
   - Using the port specified in the layout, with 20049 indicating that RDMA
   is to be attempted.  I don’t see any practical issue with this, but it is
   kind of hacky.
   - Using a value of “rdma” or “rdma6” instead of “tcp” or “tcp6” in
   the netaddr field of the device info.  I’m not sure how hard or
   time-consuming it would be to have a new netaddr assigned in the case in
   which we use existing universal address formats.  We need to investigate
   this.
   - Expecting the client to connect, at first, using a TCP connection and
   then to interrogate the fs_locations_info attribute, using the
   FSLI4TF_RDMA flag as an indication that the client is to access this fs
   using an RDMA-capable transport.
   - Expecting the client to connect, at first, using a TCP connection and
   then do a CREATE_SESSION specifying CREATE_SESSION4_FLAG_CONN_RDMA or, to
   address transports in which a within-connection step-up is not possible,
   i.e. RoCE or InfiniBand, defining a new, similar CREATE_SESSION flag to
   support creation of a new connection to obtain RDMA support on the
   session using a BIND_CONN_TO_SESSION on the new connection.
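
As a rough illustration of how the first three approaches could combine on
the client side, here is a minimal sketch.  It is not how the Linux client
behaves today; the names choose_transport, uaddr_port, and the
guess_from_mds switch are hypothetical, and the ordering of the checks is
only one possible policy.  It assumes the data server address arrives as a
netid plus a universal address whose last two dot-separated fields encode
the port.

```python
from enum import Enum

RDMA_PORT = 20049  # port the second approach treats as "try RDMA"


class Transport(Enum):
    TCP = "tcp"
    RDMA = "rdma"


def uaddr_port(uaddr: str) -> int:
    """Extract the port from a universal address such as
    '192.0.2.10.78.81' or '2001:db8::1.78.81' (port = p1 * 256 + p2)."""
    _, p1, p2 = uaddr.rsplit(".", 2)
    return int(p1) * 256 + int(p2)


def choose_transport(netid: str, uaddr: str, mds_is_rdma: bool,
                     guess_from_mds: bool = False) -> Transport:
    """Hypothetical client-side policy for a data server connection."""
    # Third approach: an explicit RDMA netid in the device info.
    if netid in ("rdma", "rdma6"):
        return Transport.RDMA
    # Second approach: port 20049 signals that RDMA is to be attempted.
    if uaddr_port(uaddr) == RDMA_PORT:
        return Transport.RDMA
    # First approach: guess from the MDS connection.  Off by default,
    # since a wrong guess is expensive while the timeout is long.
    if guess_from_mds and mds_is_rdma:
        return Transport.RDMA
    return Transport.TCP


if __name__ == "__main__":
    print(choose_transport("tcp", "192.0.2.10.8.1", mds_is_rdma=False))
    print(choose_transport("tcp", "192.0.2.10.78.81", mds_is_rdma=False))
    print(choose_transport("rdma6", "2001:db8::1.8.1", mds_is_rdma=False))
```

The last two approaches in the list need protocol exchanges with the data
server itself, so they are not covered by this sketch.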