Re: [nfsv4] [ New Version Notification for draft-hellwig-nfsv4-rdma-layout-00.txt]

Steve Byan's Lists <> Fri, 11 August 2017 15:01 UTC

From: Steve Byan's Lists <>
Date: Fri, 11 Aug 2017 11:00:45 -0400
To: Christoph Hellwig <>

> On Jul 19, 2017, at 4:10 AM, Christoph Hellwig <> wrote:
> On Tue, Jul 18, 2017 at 05:33:31PM -0400, Steve Byan's Lists wrote:
>> 2.3. Device Addressing and Discovery
>> I think addressing and discovery should support multipathing, to enhance
>> availability. So rather than a single netaddr4, I think the
>> struct pnfs_rdma_device_addr4 should be defined to contain a
>> multipath_list4, as in the File layout. 
> Memory registrations are bound to a protection domain, which at least
> for NFS is generally bound to a specific QP, so simply returning
> multiple addresses for interchangeable use might not be a good idea.
> It also is a very bad idea for load balancing purposes - I'd much rather
> have the MDS control explicitly which layouts go to which QP, to e.g.
> steer them to different HCAs.  (And note that nothing in the draft
> requires multiple HCAs used for the RDMA operations to even be in the
> same system.)

I think the server may not be the right place for load balancing for 
pNFS-RDMA. The pNFS-RDMA client is much more likely to know if it 
is experiencing congestion delays than even the server HCA/RNIC, much 
less the pNFS-RDMA server software, which is not even involved in the 
data transfer path.

Also, if the client is unable to establish a connection to the single address 
specified by the layout, how can it request the server to fail over to a 
redundant path? Is the server required to hand out a different path 
(assuming it has one) the next time the client requests a layout?

>> It seems better to me to model the extent using strictly file-local
>> information, i.e. the registered handle is simply that resulting from,
>> for example, a user-space server mapping the file, determining its
>> sparseness, and registering the non-sparse extents of the file. Thus
>> re_storage_offset would not be needed in struct pnfs_rdma_extent4. The
>> re_device_id, re_state, re_file_offset, re_length, and a separately
>> provided set of re_handles are sufficient. I view getting a pNFS-RDMA
>> layout as analogous to mmap’ing a local file - the layout mmap’s a
>> file into the RDMA address space of a RC queue pair.
> File mappings are very much an implementation detail.  E.g. one of the
> scenarios I want to support with this layout is indirect writes
> where the client gets a write buffer that only gets moved into the
> file itself by the layoutcommit (or RDMA FLUSH/COMMIT operation once
> standardized).
> That would work together with an NFS extension to support O_ATOMIC
> out of place updates ala:
> to provide byte level write persistent memory semantics over NFS.

I think using an RDMA RPC for atomic writes might be a better approach. 
I’m not convinced that using one-sided RDMA ops is lower latency — the 
client still has to send the layoutcommit RPC. Or if one tries to include
the atomic commit semantic in the proposed RDMA FLUSH/COMMIT,
that effectively turns it into an RPC.

Why not just send the data along with the commit, given that you need 
an RPC anyway?

Also, forcing the interface to the failure-atomic transaction to look like
copy-on-write is awkward for servers that implement it using undo or
redo logging.

It would be good to see some data on the performance of these 
approaches before we bake something into the protocol.

Best regards,

Steve Byan <>
Littleton, MA