Re: [nfsv4] New Version Notification for draft-talpey-rdma-commit-01.txt

Chuck Lever <chuck.lever@oracle.com> Sat, 11 April 2020 18:33 UTC

Archived-At: <https://mailarchive.ietf.org/arch/msg/nfsv4/FoFUAZscf5W6f7QVAmK_2iu0D5o>

> On Mar 12, 2020, at 4:26 PM, Tom Talpey <ttalpey=40microsoft.com@dmarc.ietf.org> wrote:
> 
> As mentioned on nfsv4 earlier, the "RDMA Push Mode" extensions draft has
> been updated and was published on Monday. There is a ten-minute session
> in the upcoming agenda where its possible adoption as an NFSv4 Work Item
> will be discussed. While we wait for those logistics, please take a look, if
> interested in the area.

In preparation for our interim meeting on April 29, where this
document is to be discussed, I've prepared the following review
of rdma-commit-01.

Thanks, Tom, for considering the nfsv4 WG as a home for this work.


General comments:

Having not been involved in the crafting of RDDP (RFC 5040), I can't
claim expertise in the protocol mechanics described in Section 3 and
later parts of the document. My domain of expertise is as a developer
and protocol architect who makes use of RDMA transports.

I have been following the development of these ideas for several years.
The work described here is thorough and detailed, and I'm aware of at
least one partial prototype implementation. It is also well noted that
Section 3.2 and several later sections appear to be incomplete. The
document remains a work in progress.

RPC/RDMA itself is not likely to employ the extensions described in this
document. "Push mode" however can be utilized in extensions to NFS that
would directly submit RDMA Read, Write, Flush, Verify, and Commit
operations in place of NFS READ, WRITE, and COMMIT procedures. Such a
construction has been proposed by:

https://tools.ietf.org/html/draft-hellwig-nfsv4-rdma-layout-00

though it predates the definition of RDMA Flush, Verify, and Commit.

The document does not indicate any parallel development of similar
operations in the InfiniBand community. It would be valuable to
understand how InfiniBand operations differ (semantically and in
their design approach) from the extensions described in this document.

In summary, this document falls squarely within the intersection of
networking and storage. IMO, that makes it appropriate for adoption by
the nfsv4 WG.


Document organization:

The information in Section 2 is usually relegated to a separate
Informative requirements document. That kind of separation might not
be practical for an individual contribution, but could be considered
for a scope of work that has a home in a particular Working Group.

An Implementation Status section would be a welcome addition to this
document.


1.1. Glossary

   Invalidate:  The removal of data from volatile intermediate locations.

Does this perfunctory definition of "invalidate" align with the use of
Invalidate in "Send With Invalidate" and other common utilizations?

Suggest a definition of "queued" and/or "posted" operations appear in the
Glossary.

Suggest a definition of "ULP" and/or "upper-layer protocol" appear in the
Glossary.


2. Problem Statement

Minor comment:

"requires that multiple copies be committed in multiple locations across
the fabric,"

Suggest instead "across the data center" or even "across cloud instances".
Note that the term "fabric" is not defined in this document, and might be
unfamiliar to the average IETF reader.

Spelling: "a persistent and/or globally visibile state,"


The problem statement introduces

traditional storage hardware, where data resides in a memory cache and is
flushed to durable storage via a host I/O operation;

and

persistent memory, where data resides in that memory and no such I/O is
necessary to make the data durable.

The document then introduces

   a new "flush to
   persistence" RDMA operation.  This operation, when invoked by a
   connected remote RDMA peer, can be used to request that previously-
   written data be moved into the persistent storage domain.

The document does not make it entirely clear that this operation is not
intended for use with traditional storage devices, which require host
I/O for data to reach durability.

The purpose of these extensions is to leverage hardware architectures
where that I/O step is no longer needed. Rather than framing these
extensions as introducing a "push mode", this document might be more
accurately described as introducing support for RDMA to persistent
memory -- an off-label use of these new operations would be pushing
data to volatile memory for further processing.
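
As a minimal sketch of the distinction drawn above, the toy model below
treats the responder as having a volatile staging area and a persistence
domain. The class and method names (Responder, rdma_write, rdma_flush,
power_loss) are hypothetical and chosen only for illustration; they are
not taken from the draft or from any RDMA API. Written data is at risk
until a flush covering it has been processed, and no host I/O step is
modeled.

```python
class Responder:
    """Toy model only: a volatile staging area plus a persistence domain."""

    def __init__(self):
        self.volatile = {}    # offset -> bytes placed but not yet durable
        self.persistent = {}  # offset -> bytes that survive power loss

    def rdma_write(self, offset, data):
        # Placement alone does not make the data durable.
        self.volatile[offset] = bytes(data)

    def rdma_flush(self, offset, length):
        # Move previously written data lying wholly inside
        # [offset, offset + length) into the persistence domain.
        end = offset + length
        for off in sorted(self.volatile):  # sorted() copies the keys
            if offset <= off and off + len(self.volatile[off]) <= end:
                self.persistent[off] = self.volatile.pop(off)

    def power_loss(self):
        # Anything still in the volatile area is gone.
        self.volatile.clear()


responder = Responder()
responder.rdma_write(0, b"abc")
responder.rdma_write(100, b"xyz")
responder.rdma_flush(0, 3)   # covers only the first write
responder.power_loss()
```

After the simulated power loss, only the flushed write at offset 0
remains; the unflushed write at offset 100 is lost.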


2.1. Requirements for RDMA Flush

Spelling: "checking and corection (ECC) memory"


2.2. Requirements for Atomic Write

"the resulting string (log) of transactions"

"string" is an overloaded term and perhaps not precise in this context.
Suggest instead "the resulting sequence of transactions".


2.3.  Requirements for RDMA Verify

   Typically, storage stacks such as filesystems and
   media approches such as SCSI T10 DIF or filesystem integrity checks
   such as ZFS provide for block- oir file-level protection of data at
   rest on storage devices.

T10 DIF and ZFS probably need references. Note the spelling errors
"approches" and "oir".


   The integrity
   check is initiated by the upper layer or application, which
   optionally computes the expected hash of a given segment of arbitrary
   size, sending the hash via an RDMA Verify operation targeting the
   RDMA segment on the responder, and the responder calculating and
   optionally verifying the hash on the indicated data, bypassing any
   volatile copies remaining in caches.

If the remote RNIC is responsible for offloading hash verification,
why wouldn't the hash computation be offloaded on the requester side as
well?
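
To make the flow in the quoted paragraph concrete, here is a minimal
sketch, assuming SHA-256 as the hash (the quoted text does not name an
algorithm). The function names compute_hash and responder_verify are
hypothetical. Note that the requester-side call to compute_hash is the
step that the quoted text leaves to the upper layer, which is what the
question above is about.

```python
import hashlib


def compute_hash(data: bytes) -> bytes:
    # Assumption: SHA-256 stands in for whatever hash is negotiated.
    return hashlib.sha256(data).digest()


def responder_verify(persistent_region: bytes, offset: int, length: int,
                     expected: bytes = None):
    """Hash the indicated segment of the durable copy.

    Returns (actual_hash, match), where match is None when the requester
    supplied no expected hash, and True or False otherwise.
    """
    actual = compute_hash(persistent_region[offset:offset + length])
    if expected is None:
        return actual, None
    return actual, actual == expected


# Requester side: the upper layer computes the expected hash in software.
payload = b"world"
expected_hash = compute_hash(payload)

# Responder side: verify against the data as it sits in persistence.
region = b"hello world"
actual_hash, matched = responder_verify(region, 6, len(payload), expected_hash)
```

In this sketch the responder's work could be offloaded to an RNIC
without the requester changing at all, while the requester still pays
for its own hash computation.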


--
Chuck Lever