Re: [nfsv4] List of possible work items for NFSv4.2

"Muntz, Daniel" <Dan.Muntz@netapp.com> Fri, 14 August 2009 00:02 UTC

Date: Thu, 13 Aug 2009 17:02:29 -0700
From: "Muntz, Daniel" <Dan.Muntz@netapp.com>
To: Mike Eisler <mre-ietf@eisler.com>, nfsv4@ietf.org
Subject: Re: [nfsv4] List of possible work items for NFSv4.2

 

> -----Original Message-----
> From: Mike Eisler [mailto:mre-ietf@eisler.com] 
> Sent: Wednesday, August 12, 2009 8:42 PM
> To: nfsv4@ietf.org
> Subject: [nfsv4] List of possible work items for NFSv4.2
> 
> 
> At the Stockholm IETF meeting, I left with an action item to 
> list and summarize possible work items for NFSv4.2 and 
> provide this within two weeks of the meeting.
> 
> Please read this list, and discuss it, and where possible 
> indicate your willingness to:
> 
> - contribute to NFSv4.2
> 
> - review NFSv4.2 proposals

Yes.

> 
> - implement an NFSv4.2 client and/or server that
>   supports one or more of these ideas.
> 
> And of course, suggest items to add to the list, and items to 
> remove from the list.
> 

I've been thinking about whether we could improve the performance of
GETATTR and/or simplify it.  It is one of the most frequently used
operations, yet one of the most complex to process.  Some
alternatives:

  1) If it can be determined that a particular attrmask is widely used,
create an operation specifically for retrieving that set of attributes.
The goal would be to improve performance by having the bulk of GETATTR
ops replaced by the new op, and that processing of the new op would be
faster than generic GETATTR handling.

  2) Split GETATTR into an op that retrieves fixed-size attributes, and
another op for variable-sized attributes.

  3) Choose an alternative to the current structure of GETATTR that
would allow rpcgen to generate XDR [en|de]code functions (e.g., an array
of attribute number/value pairs). 
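
To make (3) concrete, here is a rough sketch of a pair-based encoding
(Python for brevity; the attribute numbers and framing below are purely
illustrative, not the real NFSv4 attribute registry):

```python
import struct

# Illustrative attribute numbers -- NOT the NFSv4 registry.
ATTR_SIZE = 4    # value: 8-byte file size
ATTR_MODE = 33   # value: 4-byte mode

def encode_attrs(pairs):
    """Encode attributes as a counted array of (number, opaque value)
    pairs -- the kind of shape rpcgen can generate code for directly."""
    out = struct.pack(">I", len(pairs))              # array length
    for num, value in pairs:
        pad = (4 - len(value) % 4) % 4               # XDR 4-byte alignment
        out += struct.pack(">II", num, len(value)) + value + b"\x00" * pad
    return out

def decode_attrs(buf):
    (count,) = struct.unpack_from(">I", buf)
    off, pairs = 4, []
    for _ in range(count):
        num, vlen = struct.unpack_from(">II", buf, off)
        off += 8
        pairs.append((num, buf[off:off + vlen]))
        off += vlen + (4 - vlen % 4) % 4             # skip value plus padding
    return pairs

pairs = [(ATTR_SIZE, struct.pack(">Q", 9111)),
         (ATTR_MODE, struct.pack(">I", 0o644))]
assert decode_attrs(encode_attrs(pairs)) == pairs
```

The point is only that a flat array of tagged values round-trips through
generated XDR code with no per-attribute special casing, unlike today's
bitmap-plus-packed-values GETATTR result.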


> Work items apparently requiring a new minor version of NFSv4 
> ============================================================
> 
> Peer-to-Peer NFS
> ----------------
> 
> See
> http://tools.ietf.org/id/draft-myklebust-nfsv4-pnfs-backend-00.txt .
> 
> The proposal involves allowing pNFS clients to become data 
> servers for other pNFS clients, thus offloading the primary 
> storage array.  As presented at the Stockholm IETF meeting it 
> appears to be limited to read-only workloads based on 
> whole-file caching.  Feedback in Stockholm was that 
> read/write workloads should be considered and that perhaps 
> something along the lines of 
> http://tools.ietf.org/html/draft-eisler-nfsv4-pnfs-dedupe-00
> could be combined with the peer-to-peer NFS proposal to offer 
> sub-file caching. Note that the dedupe proposal is easily 
> extended to provide sub-file caching.
> Regardless, the concept was well received.
> 
> Given industry trends toward flash memory, and that the 
> economics of flash memory are best realized when the flash is 
> close to the application, Peer-to-Peer NFS with sub-file 
> caching is very timely.
> 
> Copy
> ----
> 
> See 
> http://tools.ietf.org/html/draft-lentini-nfsv4-server-side-copy-03 .
> 
> The proposal supports intra-NFS server and inter-NFS server 
> file copy.  This is not a new idea for the WG; it was mentioned 
> in the BOF that predated the formation of the WG over 10 
> years ago. The idea never gained traction because there were 
> no APIs on existing NFS clients to use it. This is 
> starting to change.  Hypervisors are starting to have 
> facilities (e.g.  VMware vSphere system, see
> http://www.vmware.com/files/pdf/key_features_vsphere.pdf)
> that require the ability to copy storage between storage 
> devices. In addition, there is a proposal to add a reflink() 
> ( see http://lwn.net/Articles/331576/ ) system call that 
> enables applications to leverage a file system's zero-copy 
> cloning capability.
> 
> There seems to be little controversy about adopting file 
> copy as a work item.
> 
> Note that as proposed, file copy requires a revision to 
> RPCSEC_GSS to enable a concept called "privileges".  See 
> http://tools.ietf.org/html/draft-williams-rpcsecgssv3-00 .
> While the original motivation for RPCSEC_GSSv3 was to support 
> Mandatory Access Control, the privileges concept has proven 
> to be very useful for expressing the type of delegated security 
> that inter-server file copy requires.
> 
> Hole Punching
> -------------
> 
> Regular files can contain holes: byte ranges that are zero 
> filled but do not contain allocated space. Holes are useful 
> for saving space.
> 
> Hole Punching is the act of zeroing out a byte range of a 
> file and de-allocating the storage for that byte range. This 
> is easy to do in NFSv4 today when punching from the end of the file
> downward: invoke SETATTR to reduce the file's length, then 
> invoke SETATTR to restore the file to its original size.
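
To illustrate the shrink-then-restore sequence, here is a local-file
stand-in for the NFS server, with two truncates playing the role of the
two SETATTRs (whether the restored tail is actually deallocated depends
on the underlying file system):

```python
import os
import tempfile

def punch_tail(path, new_len):
    """Zero the bytes past new_len by shrinking the file and then
    restoring its original length -- the SETATTR/SETATTR trick."""
    orig_len = os.path.getsize(path)
    os.truncate(path, new_len)    # first SETATTR: drop the tail
    os.truncate(path, orig_len)   # second SETATTR: restore length; tail reads as zeros

fd, path = tempfile.mkstemp()
os.write(fd, b"x" * 8192)
os.close(fd)
punch_tail(path, 4096)
with open(path, "rb") as f:
    data = f.read()
os.unlink(path)
assert data[:4096] == b"x" * 4096        # head untouched
assert data[4096:] == b"\x00" * 4096     # tail is now zeros (a hole)
```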
> 
> Hole punching has been proposed for NFS in the past. In the 
> 1980s a draft proposal for NFSv3 included it. However, as 
> with file copy, the lack of APIs caused the idea to die.
> 
> What has changed is the existence of hypervisors.
> A hypervisor virtualizes the storage of guest operating 
> systems. The guest thinks it is dealing with a physical 
> storage device and sends commands to deallocate storage. The 
> hypervisor intercepts these requests. If using network 
> storage, then the hypervisor needs support in the storage 
> access protocol to deallocate blocks of storage.
> I.e., it needs hole punching support.
> 
> Hole punching has had little (if not zero) discussion on the 
> NFSv4 mailing list, but it would be inconsistent to adopt 
> file copy without hole punching.
> 
> MAC Security Attribute
> ----------------------
> 
> See http://tools.ietf.org/html/draft-quigley-nfsv4-sec-label-00.
> 
> The attribute is necessary to support Mandatory Access Control.
> 
> This work has been presented several times at IETF and has 
> been discussed several times on the mailing list. One issue 
> is that there is no consensus on defining REQUIRED 
> Domain-Of-Interpretations (DOIs) to ensure interoperability, 
> nor is there consensus that such a thing is required. There 
> is arguably a
> precedent:  the mimetype attribute. The NFSv4.1 protocol does 
> not explicitly define the content of this attribute, relying 
> on the standards bodies that control mime types to define the 
> content. Mime types are similar to DOIs in that the action 
> taken by the NFS client when reading the type is not specified 
> by the NFS protocol. Nor are mandatory mime types defined. The 
> difference is that DOIs potentially have explicit 
> requirements on the NFSv4 server.
> 
> The idea of associating MAC with an
> IANA-registered named attribute has been suggested in the 
> past, but according to the I-D rejected because:
> 
> "[named attributes] lack a way to atomically set the 
> attribute on creation.  In addition, named attributes 
> themselves are file system objects which need to be 
> assigned a security attribute."
> 
> These are certainly issues. While we could imagine changing 
> the NFSv4.x protocol to allow named attributes to be created 
> atomically, more thought would have to be put into what it 
> means to set what is effectively a permission attribute on 
> the permission attribute. Assigning special semantics to one 
> particular named attribute seems to be what a RECOMMENDED or 
> REQUIRED attribute is designed to do.
> 
> It is clear (as evidenced by the energy behind
> SELinux) that on the NFS client-side, especially Linux, there 
> is strong desire to support MAC. The same level of desire 
> does not appear to exist on the NFS server-side, with several 
> storage vendors at IETF meetings indicating that in absence 
> of a customer demand, they would not be likely to support the feature.
> 
> Traffic Classification
> ----------------------
> Near the end of the feature freeze for NFSv4.1, a proposal 
> was made to specify priority channels.
> This was very controversial and quickly
> withdrawn. Nonetheless, the demand for
> classifying or tagging streams of traffic never goes away.
> 
> End-to-End Data Integrity
> -------------------------
> At various times during the NFSv4.1 standards process, the 
> topic of defining a data integrity checksum that would be kept 
> in the storage device and provided to the client when it read 
> the data was discussed. The motivation was to protect data 
> from silent corruption as it left the storage media on read, 
> or was sent from the client on write.
> 
> Various issues were raised:
> - the method by which this checksum was provided,
>   as explicit NFSv4.x operations or via
>   RPCSEC_GSS was controversial
> 
> - additional performance impact, especially if
>   the client or server was already using
>   integrity or privacy in RPCSEC_GSS (i.e. why
>   calculate two different checksums)
> 
> - no support for operations other than READ and WRITE. I.e. metadata
>   was not protected.
> 
> - how should mismatches between the alignment and
>   transfer size of the client's I/O versus the
>   server's on-media checksum be handled?
> 
> - controversy as to whether this was a significant problem
> 
> The last issue is the key one. Without consensus that there 
> is a problem to solve, this work item won't go forward.
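
The alignment issue can be made concrete: if the server keeps one
checksum per fixed on-media block, a client write smaller than or
unaligned to that block forces the server to re-read and re-checksum the
whole containing block. A sketch (the 4 KB block size and crc32 are
stand-ins, not anything from the discussion):

```python
import zlib

SERVER_BLOCK = 4096   # illustrative on-media checksum granularity

def server_write(media, checksums, offset, data):
    """Apply a client write, then recompute the checksum of every
    on-media block the write touches -- even bytes the client did
    not send must be re-read to redo the block checksum."""
    media[offset:offset + len(data)] = data
    first = offset // SERVER_BLOCK
    last = (offset + len(data) - 1) // SERVER_BLOCK
    for b in range(first, last + 1):
        block = bytes(media[b * SERVER_BLOCK:(b + 1) * SERVER_BLOCK])
        checksums[b] = zlib.crc32(block)

media, checksums = bytearray(8192), {}
server_write(media, checksums, 0, bytes(8192))        # baseline: blocks 0 and 1
before = dict(checksums)
server_write(media, checksums, 1024, b"\xff" * 1024)  # 1 KB write inside block 0
assert checksums[0] != before[0]   # whole block 0 re-checksummed
assert checksums[1] == before[1]   # block 1 untouched
```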
> 
> Umask Attribute
> ---------------
> See http://www.ietf.org/proceedings/74/slides/nfsv4-3.pdf
> 
> The proposal is to include a umask attribute that would be 
> provided with the OPEN operation during file creation.  This 
> is not an attribute that would be stored in the file but 
> instead would allow the NFSv4 client to indicate to the NFSv4 
> server what umask to apply to the file when combining the mode 
> and/or acl attributes in the OPEN arguments.
> 
> The proposal goes on to say that if there is a default ACL on 
> the file's parent directory, the server can ignore the umask. 
> Apparently the purpose here is to emulate a UNIX semantic 
> that says the mode should be used as-is when there is a 
> default ACL (but then how is the mode combined with any 
> corresponding user, group, and other ACEs in the default ACL?).
> 
> What is not explained is what problem this solves, since the 
> client could combine the umask and mode on its side, and send 
> the OPEN with a mode attribute reflecting the combination of 
> the umask and the mode argument to the open() system call.
> 
> More discussion is likely required for this proposed item.
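
For what the client-side combination amounts to, a one-liner (the
default-ACL exception modeled per the slide's description):

```python
def effective_mode(mode, umask, parent_has_default_acl):
    """Mode the client could compute locally before sending OPEN:
    clear the umask bits, except that a default ACL on the parent
    directory means the mode is used as-is (the emulated UNIX rule)."""
    return mode if parent_has_default_acl else mode & ~umask

assert effective_mode(0o666, 0o022, False) == 0o644   # usual umask clearing
assert effective_mode(0o666, 0o022, True) == 0o666    # default ACL: mode as-is
```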
> 
> Shutdown Callback
> -----------------
> See http://www.ietf.org/proceedings/74/slides/nfsv4-3.pdf
> 
> The proposal is that the server will send a callback in 
> preparation for a planned shutdown.
> 
> The client can then react as needed: inform user, unmount NFS 
> file systems etc.
> 
> One reaction not mentioned is that the client could commit 
> modified data to the server.
> 
> This functionality replaces the "rwall" ONC RPC service.
> 
> Readahead Hint
> --------------
> See http://www.ietf.org/proceedings/74/slides/nfsv4-3.pdf
> 
> Today NFS servers use heuristics to determine if a sequential 
> read pattern exists, and if so, they will schedule reads from 
> their storage devices in anticipation that by the time the 
> client sends a READ, the data will be in the server's cache.
> This has drawbacks:
> 
> - With pNFS, a given storage device has
>   difficulty detecting a read pattern, since the
>   next logical block might be on the next
>   device.
> 
> - NFS clients often have parallel threads issuing
>   read requests. The pattern of READs as received
>   by the server is not sequential.
> 
> - Detecting readahead requires a set of READs.
> 
> - For small files, the set of READs needed might
>   exceed the length of the file.
> 
> - The heuristics on the server can produce false
>   positives.
> 
> It appears the proposal would consist of a new operation that 
> would be like READ, but would not return data. Possible 
> return values might be:
> 
> - request ignored (server is too loaded)
> 
> - range is already in cache
> 
> - request in progress
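
A toy model of such an operation, just to pin down the three return
values (the status names and load threshold are invented for
illustration):

```python
# Invented status names for a data-less READ-like hint operation.
HINT_IGNORED, HINT_IN_CACHE, HINT_IN_PROGRESS = "ignored", "in_cache", "in_progress"

class ToyServer:
    def __init__(self, load, cached_ranges=()):
        self.load = load                  # 0.0 .. 1.0
        self.cached = set(cached_ranges)  # (offset, length) ranges in cache
        self.inflight = set()             # scheduled media reads

    def readahead_hint(self, offset, length):
        if self.load > 0.9:                   # too busy: drop the hint
            return HINT_IGNORED
        if (offset, length) in self.cached:   # already resident
            return HINT_IN_CACHE
        self.inflight.add((offset, length))   # schedule the media read
        return HINT_IN_PROGRESS

s = ToyServer(load=0.2, cached_ranges=[(0, 4096)])
assert s.readahead_hint(0, 4096) == HINT_IN_CACHE
assert s.readahead_hint(4096, 4096) == HINT_IN_PROGRESS
assert ToyServer(load=0.95).readahead_hint(0, 4096) == HINT_IGNORED
```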
> 
> pNFS Connectivity/Access Indication
> -----------------------------------
> See http://www.ietf.org/proceedings/75/slides/nfsv4-0.pdf, 
> slides 112-121.
> 
> The issue is that a pNFS client might be unable to reach a 
> storage device identified in a layout, due to a 
> misconfiguration in the network or on the pNFS server. 
> Ease-of-use considerations motivate a way for the pNFS client 
> to communicate the problem to the MDS.
> 
> This communication could be in the form of an extension to 
> LAYOUT_RETURN, or a new operation.
> 
> There seemed to be consensus at the Stockholm meeting that we 
> want to solve this.
> 
> Better Negotiation of Session Reply Cache Sizes
> -----------------------------------------------
> After the WG meeting in Stockholm there was discussion around 
> how to enable a replier on a session to pre-allocate the 
> necessary space needed for the reply cache without over 
> provisioning. One idea discussed was to add an operation that 
> limits the set of operations that can be used on the session. 
> For example, a client might create a session used only for 
> operations with results that are never cached, such as READ, 
> READDIR, and another session used only for operations that 
> are invariably cached, such as WRITE, RENAME, REMOVE, etc. 
> One problem with this approach is that the operation would be 
> sent after the session was created, making it too late for 
> the server to pre-allocate the optimal size for its reply cache.
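
The over-provisioning concern is easy to quantify (the slot count and
reply sizes below are made up for illustration):

```python
def reply_cache_bytes(slots, max_cached_reply):
    """Worst-case pre-allocation: every session slot might hold the
    largest reply the session's permitted operations can produce."""
    return slots * max_cached_reply

# Unrestricted session: any op's result might be cached, so size for
# a large worst case, say 32 KB (illustrative):
unrestricted = reply_cache_bytes(128, 32 * 1024)
# A session restricted to ops with small results (WRITE, RENAME,
# REMOVE, ...) might cap cached replies near 512 bytes:
restricted = reply_cache_bytes(128, 512)
assert unrestricted == 4 * 1024 * 1024
assert unrestricted // restricted == 64   # 64x less reply-cache memory
```

The catch noted above is that the server only learns which case applies
after the session (and hence the cache) already exists.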
> 
> Work items apparently not requiring a new minor version of NFSv4
> ================================================================
> 
> Metadata Striping
> -----------------
> 
> See http://www.ietf.org/proceedings/73/slides/nfsv4-3.pdf .
> 
> The proposal is to extend pNFS via a new layout type to support 
> distribution of metadata in a pNFS server. A second type of 
> MDS, the lMDS, is described. A pNFS client would be directed 
> to an lMDS via a layout returned by LAYOUTGET on the new 
> layout type. As proposed, only a new layout type is needed.
> 
> The proposal has had little discussion on the mailing list, other 
> than to clarify some points. At the Minneapolis IETF meeting, 
> it was noted that the registered algorithms used for 
> distributing metadata by file name needed to be small in 
> number if pNFS clients were going to successfully 
> interoperate with any pNFS server.
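
The interoperability point is that client and server must compute the
same name-to-lMDS placement, which is why the set of registered
algorithms must stay small. A trivial stand-in (crc32-mod-N is invented
here, not from the draft):

```python
import zlib

def lmds_for_name(name, lmds_list):
    """Pick the lMDS responsible for a file name. Any registered
    algorithm works, so long as every client and server implements
    it identically; crc32 modulo the lMDS count is a stand-in."""
    return lmds_list[zlib.crc32(name.encode()) % len(lmds_list)]

servers = ["lmds0", "lmds1", "lmds2"]
# Deterministic: independent parties agree on the placement.
assert lmds_for_name("homedir_alice", servers) == lmds_for_name("homedir_alice", servers)
assert lmds_for_name("homedir_alice", servers) in servers
```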
> 
> De-Dupe Awareness and Sub-File Caching
> --------------------------------------
> 
> See http://www.ietf.org/proceedings/73/slides/nfsv4-3.pdf .
> 
> The proposal is that NFS servers that support space 
> efficiency (i.e., data that is the same between two files is 
> stored once) provide the space efficiency maps to the NFS client. 
> The maps are encoded as bit maps, each bit corresponding to a 
> fixed sized block of a file.
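
A minimal sketch of such a map (the block size and the notion of a
shared block identifier are simplified here for illustration):

```python
BLOCK = 4096   # illustrative fixed block size, a power of 2 per the proposal

def space_efficiency_map(file_blocks, shared_ids):
    """Build the per-file bitmap: bit i is set when block i of the
    file is de-duplicated, i.e. stored once and shared."""
    bits = 0
    for i, block_id in enumerate(file_blocks):
        if block_id in shared_ids:
            bits |= 1 << i
    return bits

# A four-block file whose blocks 1 and 3 are duplicates stored once:
m = space_efficiency_map(["b0", "dup", "b2", "dup"], {"dup"})
assert m == 0b1010
```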
> 
> The proposal does not require a new minor version of NFS, but 
> instead requires 64 new pNFS layout types.
> 
> The proposal can be extended to support sub-file caching, 
> whether the file has de-duplication or not, and is a 
> candidate for marrying with the peer-to-peer NFS proposal.
> 
> At the San Francisco and Minneapolis IETF meetings, the 
> feedback on the proposals has been that block sizes and 
> alignments that are powers of 2 don't match up with all forms 
> of de-duplication and major use cases of caching.
> For example suppose file 1 is 9111 bytes long and file 2 is 
> 1000 bytes long.  At offset 111, the next 1000 bytes are 
> equal to all of file 2. File
> 1 and file 2 are de-duplicated in some storage devices.  A 
> major use case of caching that is not covered by the 
> proposal might be an HPC application whose records are each 
> aligned on 64-bit boundaries but with lengths that are not 
> powers of 2, e.g. record lengths of 104 bytes each (a 
> multiple of 64 bits, but not a power of 2).
> 
> It seems obvious how the proposal could address the HPC 
> caching use case; simply relax the requirement that block 
> sizes be powers of 2.  More thought will be needed to address 
> the unaligned de-duplication use case, at least in its most general forms.
> 
> 
> --
> Mike Eisler, Senior Technical Director, NetApp, 719 599 9026, 
> http://blogs.netapp.com/eislers_nfs_blog/
> 
> 
> 
> _______________________________________________
> nfsv4 mailing list
> nfsv4@ietf.org
> https://www.ietf.org/mailman/listinfo/nfsv4
>