Re: [nfsv4] [internet-drafts@ietf.org: New Version Notification for draft-hellwig-nfsv4-rdma-layout-00.txt]

Steve Byan's Lists <steve-list@byan-roper.org> Tue, 18 July 2017 21:33 UTC

From: Steve Byan's Lists <steve-list@byan-roper.org>
Date: Tue, 18 Jul 2017 17:33:31 -0400
Cc: "nfsv4@ietf.org" <nfsv4@ietf.org>
To: Christoph Hellwig <hch@lst.de>
Archived-At: <https://mailarchive.ietf.org/arch/msg/nfsv4/4erjfN0yp8P5ItO6yJYMxrdMMr4>

I have a number of comments on draft-hellwig-nfsv4-rdma-layout-00.txt.

I'm coming at this from the perspective of a user-space pNFS-RDMA client (for the RDMA layout part of the protocol, not necessarily for the NFS part) and a user-space pNFS-RDMA server (for example, an extended NFS-Ganesha that supports pNFS-RDMA).

As a result, I don't presuppose a pre-existing RDMA reliable connection between the RDMA layout client and server. This has implications for the connection establishment model. The NFS portions of the protocol, including exchanging the layout, could run over a TCP connection (either over the RDMA network or over an Ethernet network), and the client would not know the identity of the RDMA layout server until it receives the layout.

2.3. Device Addressing and Discovery

I think addressing and discovery should support multipathing to enhance availability. So rather than a single netaddr4, I think struct pnfs_rdma_device_addr4 should be defined to contain a multipath_list4, as in the File layout.
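To make the multipathing suggestion concrete, here is a rough C sketch of what a multipath device address and a trivial failover rule might look like. The netaddr4 and multipath_list4 names follow NFSv4.1; the rda_paths field and the next_path() helper are my own hypothetical names, not anything taken from the draft.

```c
#include <assert.h>
#include <stddef.h>

/* Universal address, mirroring NFSv4.1's netaddr4 (netid + uaddr strings). */
struct netaddr4 {
    const char *na_r_netid;   /* network id string */
    const char *na_r_addr;    /* universal address string */
};

/* Counted array standing in for the XDR "netaddr4 multipath_list4<>". */
struct multipath_list4 {
    size_t                 count;
    const struct netaddr4 *paths;
};

/* Hypothetical reshaping of pnfs_rdma_device_addr4: a list of
 * equivalent paths to the same server instead of a single netaddr4. */
struct pnfs_rdma_device_addr4 {
    struct multipath_list4 rda_paths;   /* hypothetical field name */
};

/* Return the path to try after the path at index `failed` has failed,
 * wrapping around the list. Returns NULL if the list is empty. */
const struct netaddr4 *
next_path(const struct pnfs_rdma_device_addr4 *dev, size_t failed)
{
    assert(dev != NULL);
    size_t n = dev->rda_paths.count;
    if (n == 0)
        return NULL;
    return &dev->rda_paths.paths[(failed + 1) % n];
}
```

A client holding such a device address could retry connection establishment on next_path() whenever the current path fails, which is the availability benefit I have in mind.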

Combined with the assumption that no pre-existing RDMA connection is required, this has major implications for the protocol.

In later sections, the draft defines the rdma layout to contain a registered memory handle. If there is no pre-existing connection, the server has to create an unconnected queue pair, register memory for the file with it, and pass the memory registration handle back to the client in the layout, along with the identity of the rdma server. Finally, it must supply the unconnected queue pair to the RDMA Communication Manager when the server accepts the client's connection request.

This dance is possible (I think; I haven't tried it) if there is only one address for the server, as the server can bind its rdma_cm_id to one RDMA device before listening. However, if the server supports multiple addresses (for multipathing), then it is not possible to pre-create the server-side queue pair, because an unbound listening rdma_cm_id doesn't have a valid ibv_context until a connection attempt is received.

Consequently, I think the rdma layout should contain only the file offset, length, and extent state. The client would then obtain the handles using a separate pNFS-RDMA protocol exchange. This is unpalatable, but I think it is necessary. Trying to fit it all into the LAYOUTGET runs into a chicken-and-egg problem for RDMA connection establishment.


2.4.  Data Structures: Extents and Extent Lists

The layout definition seems to presuppose a kernel (or at least highly privileged user-space) server, because it exposes portions of the whole persistent memory device address space via the re_storage_offset field in the extent. This means the pNFS-RDMA server must be cognizant of the file system on the device.

It seems better to me to model the extent using strictly file-local information, i.e. the registered handle is simply the one that results from, for example, a user-space server mapping the file, determining its sparseness, and registering the non-sparse extents of the file. Thus re_storage_offset would not be needed in struct pnfs_rdma_extent4. The re_device_id, re_state, re_file_offset, re_length, and a separately provided set of re_handles are sufficient. I view getting a pNFS-RDMA layout as analogous to mmap’ing a local file - the layout mmap’s a file into the RDMA address space of an RC queue pair.
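As an illustration of the file-local model, here is a minimal C sketch of an extent without re_storage_offset and of how a client would locate the target of an I/O. The struct name and the two helper functions are hypothetical; the field names are the ones from the draft that I propose keeping, and the handles are assumed to arrive separately.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical file-local extent: no re_storage_offset, no handle. */
struct pnfs_rdma_extent4_local {
    unsigned char re_device_id[16];  /* deviceid4 is 16 opaque bytes in NFSv4.1 */
    uint32_t      re_state;          /* extent state, values not modeled here */
    uint64_t      re_file_offset;    /* start of the extent within the file */
    uint64_t      re_length;         /* length of the extent in bytes */
};

/* Return the index of the extent covering file_off, or -1 if file_off
 * falls in a hole (a range the server did not register). */
long
find_extent(const struct pnfs_rdma_extent4_local *ext, size_t n,
            uint64_t file_off)
{
    for (size_t i = 0; i < n; i++) {
        if (file_off >= ext[i].re_file_offset &&
            file_off - ext[i].re_file_offset < ext[i].re_length)
            return (long)i;
    }
    return -1;
}

/* Offset within the registered region, computed purely from file
 * offsets, with no knowledge of the device address space. */
uint64_t
region_offset(const struct pnfs_rdma_extent4_local *e, uint64_t file_off)
{
    assert(e != NULL && file_off >= e->re_file_offset);
    return file_off - e->re_file_offset;
}
```

The point of the sketch is that the client never needs a device-relative address: the extent index selects a handle from the separately provided set, and the offset into that registration is just the file offset minus re_file_offset.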

Exposing portions of the persistent memory device address space seems to be motivated by a desire to enable client offload of hole filling in sparse files and of copy-on-write. However, I question whether these client offloads are very useful.

Offloaded sparse-hole-filling and copy-on-write are not available to a user-space client for its local persistent memory - the local file system has to provide for them using page-mapping tricks. User-space servers don't have access to these offloads either, unless they implement the entire file system in user space, and so again must rely on page-mapping tricks by the local file system. In either case, writing to a sparse (unallocated) page or copy-on-write of a page in a file in local persistent memory is expected to be a high-latency operation. Given that, why not just have the client send a plain old NFSv4 write to the server when it encounters a copy-on-write extent?
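A minimal C sketch of the client-side rule I am suggesting is below. The state names are hypothetical and merely modeled on the block layout's extent states (RFC 5663); I am not claiming they match the draft. Treating holes and uninitialized extents the same way as copy-on-write extents is my assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical extent states, modeled on the pNFS block layout. */
enum extent_state {
    EXT_READ_WRITE_DATA,  /* valid data, writable in place */
    EXT_READ_DATA,        /* valid data, read-only: a write needs copy-on-write */
    EXT_INVALID_DATA,     /* allocated but uninitialized */
    EXT_NONE_DATA         /* hole: nothing allocated */
};

enum write_path {
    WRITE_VIA_RDMA,       /* RDMA Write straight into the registered region */
    WRITE_VIA_NFS         /* ordinary NFSv4 WRITE through the server */
};

struct extent {
    enum extent_state state;
};

/* Only extents that are writable in place use RDMA. Everything that
 * would need allocation or copy-on-write goes to the server, which
 * lets its local file system do the page-mapping work. */
enum write_path
choose_write_path(const struct extent *e)
{
    assert(e != NULL);
    switch (e->state) {
    case EXT_READ_WRITE_DATA:
        return WRITE_VIA_RDMA;
    case EXT_READ_DATA:
    case EXT_INVALID_DATA:
    case EXT_NONE_DATA:
        return WRITE_VIA_NFS;
    }
    return WRITE_VIA_NFS;   /* unknown state: take the conservative path */
}
```

After such a write, the client would presumably need a fresh layout before it could use RDMA on that range, but that is beyond what this sketch tries to show.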

Removing client-offloaded sparse-hole-filling and copy-on-write would considerably simplify the layout, and it would make a server implementation possible in a user-space process without the server needing intimate knowledge of the underlying local persistent memory file system.


Best regards,
-Steve

-- 
Steve Byan <steve@byan-roper.org>
Littleton, MA