Re: [nfsv4] [internet-drafts@ietf.org: New Version Notification for draft-hellwig-nfsv4-rdma-layout-00.txt]

Christoph Hellwig <hch@lst.de> Wed, 19 July 2017 08:10 UTC

Return-Path: <hch@lst.de>
X-Original-To: nfsv4@ietfa.amsl.com
Delivered-To: nfsv4@ietfa.amsl.com
Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id F0A3A131C26 for <nfsv4@ietfa.amsl.com>; Wed, 19 Jul 2017 01:10:20 -0700 (PDT)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -1.9
X-Spam-Level:
X-Spam-Status: No, score=-1.9 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, RCVD_IN_DNSWL_NONE=-0.0001, RP_MATCHES_RCVD=-0.001, URIBL_BLOCKED=0.001] autolearn=ham autolearn_force=no
Received: from mail.ietf.org ([4.31.198.44]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id QIUCMR8CcxvW for <nfsv4@ietfa.amsl.com>; Wed, 19 Jul 2017 01:10:16 -0700 (PDT)
Received: from newverein.lst.de (verein.lst.de [213.95.11.211]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id 4459E131C23 for <nfsv4@ietf.org>; Wed, 19 Jul 2017 01:10:16 -0700 (PDT)
Received: by newverein.lst.de (Postfix, from userid 2407) id 00B1A68C4E; Wed, 19 Jul 2017 10:10:14 +0200 (CEST)
Date: Wed, 19 Jul 2017 10:10:14 +0200
From: Christoph Hellwig <hch@lst.de>
To: Steve Byan's Lists <steve-list@byan-roper.org>
Cc: Christoph Hellwig <hch@lst.de>, "nfsv4@ietf.org" <nfsv4@ietf.org>
Message-ID: <20170719081014.GA21642@lst.de>
References: <20170702231000.GA2564@lst.de> <26A3EDDF-200C-49F2-934B-CD9155AECE88@gmail.com> <CADaq8jdK+nSDBy6xr=VU-eWX8LMvWuZZQdy5VMKrbBrh3RPVKA@mail.gmail.com> <37B32567-5FB7-45B4-9105-77272D52CC8F@byan-roper.org>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <37B32567-5FB7-45B4-9105-77272D52CC8F@byan-roper.org>
User-Agent: Mutt/1.5.17 (2007-11-01)
Archived-At: <https://mailarchive.ietf.org/arch/msg/nfsv4/qJvXaKzDcA16r7cFTntIVY8L9h8>
Subject: Re: [nfsv4] [internet-drafts@ietf.org: New Version Notification for draft-hellwig-nfsv4-rdma-layout-00.txt]
X-BeenThere: nfsv4@ietf.org
X-Mailman-Version: 2.1.22
Precedence: list
List-Id: NFSv4 Working Group <nfsv4.ietf.org>
List-Unsubscribe: <https://www.ietf.org/mailman/options/nfsv4>, <mailto:nfsv4-request@ietf.org?subject=unsubscribe>
List-Archive: <https://mailarchive.ietf.org/arch/browse/nfsv4/>
List-Post: <mailto:nfsv4@ietf.org>
List-Help: <mailto:nfsv4-request@ietf.org?subject=help>
List-Subscribe: <https://www.ietf.org/mailman/listinfo/nfsv4>, <mailto:nfsv4-request@ietf.org?subject=subscribe>
X-List-Received-Date: Wed, 19 Jul 2017 08:10:21 -0000

[hi Steve, could you break lines after ~75 chars in your mail?
That would make it a lot more readable]

On Tue, Jul 18, 2017 at 05:33:31PM -0400, Steve Byan's Lists wrote:
> As a result, I don't presuppose that there is a pre-existing RDMA reliable
> connection between the rdma layout client and server.

I do not assume that; it's just that my prototype CM code is so bad
that I don't want to document it in its current form.

> 2.3. Device Addressing and Discovery
> 
> I think addressing and discovery should support multipathing, to enhance
> availability. So rather than a single netaddr4, I think the
> struct pnfs_rdma_device_addr4 should be defined to contain a
> multipath_list4, as in the File layout. 

Memory registrations are bound to a protection domain, which at least
for NFS is generally bound to a specific QP, so simply returning
multiple addresses for interchangeable use might not be a good idea.

It also is a very bad idea for load-balancing purposes - I'd much rather
have the MDS explicitly control which layouts go to which QP, to e.g.
steer them to different HCAs.  (And note that nothing in the draft
requires the multiple HCAs used for the RDMA operations to even be in
the same system.)
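
For comparison, the files layout (RFC 5661) expresses multipathing as a
list of equivalent addresses per data server, whereas the single-address
model discussed above keeps everything on one QP.  A rough XDR sketch -
any field names beyond the single netaddr4 mentioned in the quoted text
are illustrative, not the draft's actual definitions:

```
/* Files-layout style multipathing (RFC 5661): the client may
 * use any address in the list interchangeably. */
typedef netaddr4 multipath_list4<>;

/* Single-address RDMA device as discussed above: one address,
 * one CM connection, one QP, so all registrations stay within
 * a single protection domain.  Field name is illustrative. */
struct pnfs_rdma_device_addr4 {
        netaddr4        rda_addr;       /* single target address */
};
```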


> Consequently I think the rdma layout should contain only the file offset,
> length and extent state. The client would then obtain the handles using
> a pNFS-RDMA protocol exchange. This is unpalatable, but I think it is
> necessary. Trying to fit it all into the LAYOUTGET confronts a chicken
> or the egg problem for RDMA connection establishment.

Connection establishment is done at GETDEVICEINFO time, although that
is indeed usually triggered by the first LAYOUTGET.

But the basic idea behind the protocol is indeed that the memory
registration is generally done at LAYOUTGET time.
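
The resulting order of operations, as described above, would roughly be
(a sketch of the intended flow, not normative protocol text):

```
Client                                   MDS / server
  |---- GETDEVICEINFO ----------------->|  returns RDMA address
  |==== CM connect: QP + PD set up ====>|  (connection establishment)
  |---- LAYOUTGET --------------------->|  server registers memory
  |<--- extents: handles + offsets -----|  (e.g. FRs, MR-relative)
```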


> The layout definition seems to pre-suppose a kernel (or at least
> highly-priviledged user space) server, because it exposes portions of the
> whole persistent memory device address space via the re_storage_offset
> field in the extent. This means the pNFS-RDMA server must be cognizant
> of the file system on the device.

With RDMA memory registrations this offset is relative to the MR, similar
to the address fields in NVMeoF, SRP or iSER.  If you use the (rather
unsafe) global MR your observations above are indeed true.  But I would
recommend against such an implementation and instead use safer
registration methods at LAYOUTGET time (e.g. FRs), in which case
re_storage_offset is an offset inside the MR.
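
Using the field names that appear in the quoted mail, the extent might
be sketched in XDR roughly as follows - the types and layout here are
assumptions for illustration, not the draft's actual definitions:

```
struct pnfs_rdma_extent4 {
        deviceid4          re_device_id;      /* which device / QP */
        rdma_extent_state4 re_state;          /* extent state */
        offset4            re_file_offset;    /* offset in the file */
        length4            re_length;         /* extent length in bytes */
        uint64_t           re_storage_offset; /* offset INSIDE the MR
                                                 named by re_handle, not
                                                 a raw device address */
        uint32_t           re_handle;         /* rkey of the MR */
};
```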

And yes, I agree the naming and description need improvement; the
current text is copy-and-paste material from the SCSI layout.

> It seems better to me to model the extent using strictly file-local
> information, i.e. the registered handle is simply that resulting from,
> for example, a user-space server mapping the file, determining its
> sparseness, and registering the non-sparse extents of the file. Thus
> re_storage_offset would not be needed in struct pnfs_rdma_extent4. The
> re_device_id, re_state, re_file_offset, re_length, and a separately
> provided set of re_handles are sufficient. I view getting a pNFS-RDMA
> layout as analogous to mmap’ing a local file - the layout mmap’s a
> file into the RDMA address space of a RC queue pair.

File mappings are very much an implementation detail.  E.g. one of the
scenarios I want to support with this layout is indirect writes,
where the client gets a write buffer that only gets moved into the
file itself by the LAYOUTCOMMIT (or an RDMA FLUSH/COMMIT operation once
standardized).

That would work together with an NFS extension to support O_ATOMIC
out-of-place updates a la:

https://www.usenix.org/conference/fast15/technical-sessions/presentation/verma
https://lwn.net/Articles/715918/

to provide byte-level persistent-memory write semantics over NFS.

> Offloaded sparse-hole-filling and copy-on-write are not available to a
> user-space client for its local persistent memory - the local file
> system has to provide for them using page-mapping tricks. User-space
> servers don't have offloaded access to the offloads either, unless
> they implement the entire file system in user space, and so again must
> rely on page-mapping tricks by the local file system. In either case,
> writing to a sparse (unallocated) page or copy-on-write of a page in a
> file in local persistent memory is expected to be a high-latency
> operation. Given that, why not just have the client send a plain old
> NFSv4 write to the server when it encounters a copy-on-write extent?

You don't have to hand out a layout for this case, but at least for my
server it's a natural operation that adds no additional latency in
the write path, and very little additional latency in the commit path.

> Removing client-offloaded sparse-hole-filling and copy-on-write would
> considerably simplify the layout, and it makes server implementation
> possible in a user-space process without the server having to have
> intimate knowledge of the underlying local persistent memory file
> system.

Again, just because the protocol specifies this doesn't mean you have
to implement it.  For example, I've not seen an implementation of this
in the block and SCSI layouts so far, although I'm looking into
implementing it in the future.
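
Since the current text is copy-and-paste material from the SCSI layout
(RFC 8154), the extent states presumably mirror that model.  A sketch of
what such states might look like - names here are illustrative, modeled
on the SCSI layout, not taken from the draft:

```
/* Illustrative extent states, modeled on the SCSI layout
 * (RFC 8154).  A minimal server can restrict itself to the
 * first two and let clients fall back to plain NFSv4 READ/WRITE
 * for holes and copy-on-write, which is the "don't implement it"
 * option above. */
enum rdma_extent_state4 {
        EXT_READ_WRITE_DATA = 0, /* valid for reads and writes */
        EXT_READ_DATA       = 1, /* read-only; writes need
                                    copy-on-write handling */
        EXT_INVALID_DATA    = 2, /* unallocated: client must fill
                                    before the data is readable */
        EXT_NONE_DATA       = 3  /* hole: reads as zeroes, no
                                    storage backs it */
};
```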