[nfsv4] List of possible work items for NFSv4.2

"Mike Eisler" <mre-ietf@eisler.com> Thu, 13 August 2009 03:42 UTC

Message-ID: <fe7adea4b3fba5af3e063472b7048e41.squirrel@webmail.eisler.com>
Date: Wed, 12 Aug 2009 20:41:55 -0700
From: Mike Eisler <mre-ietf@eisler.com>
To: nfsv4@ietf.org

At the Stockholm IETF meeting, I left with an
action item to list and summarize possible work
items for NFSv4.2 and provide this within two
weeks of the meeting.

Please read this list, and discuss it, and where
possible indicate your willingness to:

- contribute to NFSv4.2

- review NFSv4.2 proposals

- implement an NFSv4.2 client and/or server that
  supports one or more of these ideas.

And of course, suggest items to add to the list,
and items to remove from the list.

Work items apparently requiring a new minor version of NFSv4
============================================================

Peer-to-Peer NFS
----------------

See
http://tools.ietf.org/id/draft-myklebust-nfsv4-pnfs-backend-00.txt .

The proposal involves allowing pNFS clients to
become data servers for other pNFS clients, thus
offloading the primary storage array.  As
presented at the Stockholm IETF meeting, it
appears to be limited to read-only workloads
based on whole-file caching.  Feedback in
Stockholm was that read/write workloads should be
considered and that perhaps something along the
lines of
http://tools.ietf.org/html/draft-eisler-nfsv4-pnfs-dedupe-00
could be combined with the peer-to-peer NFS proposal to offer
sub-file caching. Note that the dedupe proposal
is easily extended to provide sub-file caching.
Regardless, the concept was well received.

Given industry trends toward flash memory, and
that the economics of flash memory are best
realized if the flash is close to the application,
Peer-to-Peer NFS with sub-file caching is very
timely.

Copy
----

See http://tools.ietf.org/html/draft-lentini-nfsv4-server-side-copy-03 .

The proposal supports intra-NFS server and
inter-NFS server file copy.  This is not a new
idea for the WG; it was mentioned in the BOF that
predated the formation of the WG over 10 years
ago. The idea has never had merit because there
have been no APIs on existing NFS clients to use
it. This is starting to change.  Hypervisors are
starting to have facilities (e.g.  VMware vSphere
system, see
http://www.vmware.com/files/pdf/key_features_vsphere.pdf)
that require the ability to copy storage between
storage devices. In addition, there is a proposal
to add a reflink() (see
http://lwn.net/Articles/331576/ ) system call
that enables applications to leverage a file
system's zero copy cloning capability.

There seems to be little controversy about
adopting file copy as a work item.

Note that as proposed, file copy requires a
revision to RPCSEC_GSS to enable a concept called
"privileges".  See
http://tools.ietf.org/html/draft-williams-rpcsecgssv3-00 .
While the original motivation for RPCSEC_GSSv3
was to support Mandatory Access Control, the
privileges concept has proven to be very useful
to express the type of delegated security
inter-server file copy requires.

Hole Punching
-------------

Regular files can contain holes: byte ranges that
are zero filled but do not contain allocated
space. Holes are useful for saving space.

Hole Punching is the act of zeroing out a byte
range of a file, and de-allocating the storage
for that byte range. This is easy to do in NFSv4
today if hole punching from the end of the file
downward: invoke SETATTR to reduce the file's
length, then invoke SETATTR to restore the file to
its original size.
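
The two-step truncate-and-restore sequence can be
illustrated with a local-file analogy. This is a minimal
sketch, not NFS client code: it assumes a POSIX-like file
system where extending a file via truncate yields a
zero-filled tail, and whether that tail is actually left
unallocated depends on the file system. The helper name
punch_tail is hypothetical.

```python
import os

def punch_tail(path, offset):
    """Zero the range [offset, EOF) by shrinking the file, then restoring its size."""
    original_size = os.path.getsize(path)
    if offset > original_size:
        raise ValueError("offset is beyond the end of the file")
    # Step 1: analogous to SETATTR with size = offset (discard the tail).
    os.truncate(path, offset)
    # Step 2: analogous to SETATTR with size = original size.
    # The re-extended range reads back as zeros.
    os.truncate(path, original_size)
    return original_size
```

Note that this only works for a range that runs to the end
of the file; punching a hole in the middle of a file is
what would need new protocol support.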

Hole punching has been proposed for NFS in the
past. In the 1980s a draft proposal for NFSv3
included it. However, as with file copy, the lack
of APIs caused the idea to die.

What has changed is the existence of hypervisors.
A hypervisor virtualizes the storage of guest
operating systems. The guest thinks it is dealing
with a physical storage device and sends commands
to deallocate storage. The hypervisor intercepts
these requests. If using network storage, then
the hypervisor needs support in the storage
access protocol to deallocate blocks of storage.
I.e., it needs hole punching support.

Hole punching has had little (if not zero)
discussion on the NFSv4 mailing list, but it
would be inconsistent to adopt file copy without
hole punching.

MAC Security Attribute
----------------------

See http://tools.ietf.org/html/draft-quigley-nfsv4-sec-label-00.

The attribute is necessary to support Mandatory
Access Control.

This work has been presented several times at
IETF and has been discussed several times on the
mailing list. One issue is that there is no
consensus on defining REQUIRED
Domain-Of-Interpretations (DOIs) to ensure
interoperability, nor is there consensus that
such a thing is required. There is arguably a
precedent:  the mimetype attribute. The NFSv4.1
protocol does not explicitly define the content
of this attribute, relying on the standards
bodies that control mime types to define the
content. Mime types are similar to DOIs in that
the action taken by the NFS client when reading
the type is not explained by the NFS protocol. Nor
are mandatory mime types defined. The difference
is that DOIs potentially have explicit
requirements on the NFSv4 server.

The idea of associating MAC with an
IANA-registered named attribute has been
suggested in the past, but according to the I-D
rejected because:

"[named attributes] lack a way to atomically set
 the attribute on creation.  In addition, named
 attributes themselves are file system objects
 which need to be assigned a security attribute.
 "

These are certainly issues. While we could
imagine changing the NFSv4.x protocol to allow
named attributes to be created atomically, more
thought would have to be put into what it means
to set what is effectively a permission attribute
on the permission attribute. Assigning special
semantics to one particular named attribute seems
to be what a RECOMMENDED or REQUIRED attribute
is designed to do.

It is clear (as evidenced by the energy behind
SELinux) that on the NFS client-side, especially
Linux, there is strong desire to support MAC. The
same level of desire does not appear to exist on
the NFS server-side, with several storage vendors
at IETF meetings indicating that in the absence of
customer demand, they would not be likely to
support the feature.

Traffic Classification
----------------------
Near the end of the feature freeze for NFSv4.1, a
proposal was made to specify priority channels.
This was very controversial and quickly
withdrawn. Nonetheless, the demand for
classifying or tagging streams of traffic never
goes away.

End-to-End Data Integrity
-------------------------
At various times during the NFSv4.1 standards
process, the topic of defining a data integrity
checksum that would be kept in the storage device
and provided to the client when it read the data
was discussed. The motivation was to protect data
from silent corruption as it left the storage
media on read, or was sent from the client on
write.
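
The idea of a checksum that travels with the data and is
stored alongside it can be sketched as follows. This is
only an illustration under stated assumptions: CRC32 is a
stand-in (no checksum algorithm was agreed on), and the
class and function names are hypothetical rather than
proposed protocol elements.

```python
import zlib

class ChecksummedStore:
    """Toy storage device that keeps a checksum next to each block."""

    def __init__(self):
        self._blocks = {}

    def write(self, block_no, data, client_crc):
        # Verify the data was not corrupted between client and server.
        if zlib.crc32(data) != client_crc:
            raise ValueError("data corrupted on the way to the server")
        self._blocks[block_no] = (bytes(data), client_crc)

    def read(self, block_no):
        # Return the stored data together with the stored checksum.
        return self._blocks[block_no]

def verified_read(store, block_no):
    """Client-side read that detects silent corruption of stored data."""
    data, stored_crc = store.read(block_no)
    if zlib.crc32(data) != stored_crc:
        raise IOError("silent corruption detected on read")
    return data
```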

Various issues were raised:
- the method by which this checksum was provided,
  as explicit NFSv4.x operations or via
  RPCSEC_GSS was controversial

- additional performance impact, especially if
  the client or server was already using
  integrity or privacy in RPCSEC_GSS (i.e. why
  calculate two different checksums)

- no support for operations other than READ and WRITE. I.e. metadata
  was not protected.

- how should mismatches between the alignment or
  transfer size of the client's I/O and the
  server's on-media checksum be handled?

- controversy as to whether this was a significant problem

The last issue is the key one. Without consensus
that there is a problem to solve, this work item
won't go forward.

Umask Attribute
---------------
See http://www.ietf.org/proceedings/74/slides/nfsv4-3.pdf

The proposal is to include a umask attribute that
would be provided with the OPEN operation during
file creation.  This is not an attribute that
would be stored in the file but instead would
allow the NFSv4 client to indicate to the NFSv4
server what umask to apply to the file when combining
the mode and/or acl attributes in the OPEN
arguments.

The proposal goes on to say that if there is a
default ACL on the file's directory, the server
can ignore the umask.  What is not explained is
what problem this solves, since the client could
combine the umask and mode on its side, and send
the OPEN with a mode attribute reflecting the
combination of umask and the mode argument to
the open() system call.
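
The client-side combination referred to above is the usual
UNIX rule of clearing the umask bits from the requested
mode. A minimal sketch, with a hypothetical function name:

```python
def effective_create_mode(requested_mode, umask):
    """Combine the open() mode argument with the process umask, as a client could do locally."""
    # Bits set in the umask are removed from the requested permissions.
    return requested_mode & ~umask & 0o7777
```

For example, a requested mode of 0o666 with a umask of
0o022 gives 0o644, which the client could send directly as
the mode attribute in OPEN.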

The proposal does say that if there is a default
ACL on the file's parent directory, the server
can ignore the umask. Apparently the purpose here
is to emulate a UNIX semantic that says that the
mode should be used as is when there is a default
ACL (but then how is the mode combined with any
corresponding user, group, and other ACEs in the
default ACL?).

More discussion is likely required for this
proposed item.

Shutdown Callback
-----------------
See http://www.ietf.org/proceedings/74/slides/nfsv4-3.pdf

The proposal is that the server will send a
callback in preparation for a planned shutdown.

The client can then react as needed: inform the
user, unmount NFS file systems, etc.

One reaction not mentioned is that the client
could commit modified data to the server.

This functionality replaces the "rwall" ONC RPC service.

Readahead Hint
--------------
See http://www.ietf.org/proceedings/74/slides/nfsv4-3.pdf

Today NFS servers use heuristics to determine if
a sequential read pattern exists, and if so, they
will schedule reads from their storage devices in
anticipation that by the time the client sends a
READ, the data will be in the server's cache.
This has drawbacks:

- With pNFS, a given storage device has
  difficulty detecting a read pattern, since the
  next logical block might be on the next
  device.

- NFS clients often have parallel threads issuing
  read requests. The pattern of READs as received
  by the server is not sequential.

- Detecting readahead requires a set of READs.

- For small files, the set of READs needed might
  exceed the length of the file

- The heuristics on the server can produce false
  positives.
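
Several of these drawbacks can be seen with a deliberately
naive detector. The sketch below is an assumption about
what such a heuristic might look like, not a description of
any real server; the function name and the threshold of
three contiguous READs are hypothetical.

```python
def detects_sequential(reads, threshold=3):
    """Return True once `threshold` back-to-back READs are contiguous.

    `reads` is a list of (offset, length) pairs in arrival order.
    """
    run = 1
    for (prev_off, prev_len), (off, _length) in zip(reads, reads[1:]):
        if off == prev_off + prev_len:
            run += 1
            if run >= threshold:
                return True
        else:
            run = 1  # pattern broken, start counting again
    return False

BLOCK = 4096

# One thread reading blocks 0..7 in order.
in_order = [(i * BLOCK, BLOCK) for i in range(8)]

# Two threads, one reading blocks 0..3 and one reading 4..7,
# with their READs interleaved on arrival at the server.
interleaved = [(i * BLOCK, BLOCK) for i in (0, 4, 1, 5, 2, 6, 3, 7)]

# What one of two pNFS storage devices sees when blocks alternate
# between the devices: every other logical block.
striped_device = [(i * BLOCK, BLOCK) for i in (0, 2, 4, 6)]

# A small file that needs only two READs.
small_file = [(0, BLOCK), (BLOCK, BLOCK)]
```

With this detector only the in_order trace triggers
readahead; the interleaved, striped, and small-file traces
do not, even though each file is being read in full.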

It appears the proposal would consist of a new
operation that would be like READ, but would not
return data. Possible return values might be:

- request ignored (server is too loaded)

- range is already in cache

- request in progress
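
As a rough sketch of how a server might choose among these
three results, assuming a hypothetical handler, a made-up
load threshold, and a cache described as a list of
(start, end) byte ranges:

```python
from enum import Enum

class ReadaheadResult(Enum):
    IGNORED = "request ignored (server is too loaded)"
    ALREADY_CACHED = "range is already in cache"
    IN_PROGRESS = "request in progress"

def handle_readahead_hint(offset, length, cached_ranges, load, max_load=0.9):
    """Decide how to answer a readahead hint; no file data is returned."""
    if load > max_load:
        return ReadaheadResult.IGNORED
    end = offset + length
    for start, stop in cached_ranges:
        # The hint is a no-op if one cached range covers it entirely.
        if start <= offset and end <= stop:
            return ReadaheadResult.ALREADY_CACHED
    # Otherwise the server would schedule the read from storage here.
    return ReadaheadResult.IN_PROGRESS
```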

pNFS Connectivity/Access Indication
-----------------------------------
See http://www.ietf.org/proceedings/75/slides/nfsv4-0.pdf, slides 112-121.

The issue is that a pNFS client might be unable
to reach a storage device identified in a layout,
due to a misconfiguration in the network or on
the pNFS server. Ease-of-use considerations
motivate a way for the pNFS client to communicate
the problem to the MDS.

This communication could be in the form of an
extension to LAYOUTRETURN, or a new operation.

There seemed to be consensus at the Stockholm
meeting that we want to solve this.

Better Negotiation of Session Reply Cache Sizes
-----------------------------------------------
After the WG meeting in Stockholm there was
discussion around how to enable a replier on a
session to pre-allocate the necessary space
needed for the reply cache without over
provisioning. One idea discussed was to add an
operation that limits the set of operations that
can be used on the session. For example, a client
might create a session used only for operations
with results that are never cached, such as READ
and READDIR, and another session used only for
operations that are invariably cached, such as
WRITE, RENAME, REMOVE, etc. One problem with this
approach is that the operation would be sent
after the session was created, making it too late
for the server to pre-allocate the optimal size
for its reply cache.
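
The sizing argument can be made concrete with a small
calculation. This is a sketch under stated assumptions: the
function name is hypothetical and the per-operation reply
sizes are made-up illustrative numbers, not values from any
specification or implementation.

```python
# Illustrative (made-up) maximum reply sizes in bytes.
EXAMPLE_REPLY_SIZE = {
    "READ": 65536,
    "READDIR": 32768,
    "WRITE": 64,
    "RENAME": 128,
    "REMOVE": 64,
}

# Operations whose replies are cached, per the example in the text.
EXAMPLE_CACHED_OPS = {"WRITE", "RENAME", "REMOVE"}

def reply_cache_bytes(slots, allowed_ops, cached_ops, max_reply_size):
    """Space to pre-allocate: one largest cached reply per session slot."""
    cached = [max_reply_size[op] for op in allowed_ops if op in cached_ops]
    return slots * max(cached, default=0)

# A session limited to READ/READDIR needs no reply cache space.
read_session = reply_cache_bytes(
    16, {"READ", "READDIR"}, EXAMPLE_CACHED_OPS, EXAMPLE_REPLY_SIZE)

# A session limited to mutating operations needs only small entries.
write_session = reply_cache_bytes(
    16, {"WRITE", "RENAME", "REMOVE"}, EXAMPLE_CACHED_OPS, EXAMPLE_REPLY_SIZE)
```

The server can only do this calculation if it knows the
allowed set of operations at the time it allocates the
cache, which is the timing problem noted above.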

Work items apparently not requiring a new minor version of NFSv4
================================================================

Metadata Striping
-----------------

See http://www.ietf.org/proceedings/73/slides/nfsv4-3.pdf .

The proposal is to extend pNFS via a new layout type
to support distribution of metadata in a pNFS
server. A second type of MDS, the lMDS, is
described. A pNFS client would be directed to an
lMDS via a layout returned by LAYOUTGET on the
new layout type. As proposed, only a new layout
type is needed.

The proposal has had little discussion on the
mailing list, other than to clarify some points. At the
Minneapolis IETF meeting, it was noted that the
registered algorithms used for distributing
metadata by file name needed to be small in
number if pNFS clients were going to successfully
interoperate with any pNFS server.
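
The interoperability concern can be sketched as a lookup in
a small registry of name-hashing algorithms. Everything
here is an assumption for illustration: the algorithm
identifiers, the choice of CRC32 and SHA-1, and the
function name are hypothetical and do not come from the
proposal.

```python
import hashlib
import zlib

# Hypothetical registry: algorithm id -> function from name bytes to an integer.
ALGORITHMS = {
    1: lambda name: zlib.crc32(name),
    2: lambda name: int.from_bytes(hashlib.sha1(name).digest()[:4], "big"),
}

def pick_lmds(algorithm_id, file_name, lmds_count):
    """Map a file name to one of `lmds_count` metadata servers."""
    try:
        hash_fn = ALGORITHMS[algorithm_id]
    except KeyError:
        # A client lacking the server's algorithm cannot locate metadata,
        # which is why the registered set needs to stay small.
        raise NotImplementedError(
            "algorithm %d is not implemented by this client" % algorithm_id)
    return hash_fn(file_name.encode("utf-8")) % lmds_count
```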

De-Dupe Awareness and Sub-File Caching
--------------------------------------

See http://www.ietf.org/proceedings/73/slides/nfsv4-3.pdf .

The proposal is that NFS servers that support
space efficiency (i.e. data that is the same between
two files is stored once) provide the space
efficiency maps to the NFS client. The maps are
encoded as bit maps, each bit corresponding to a
fixed sized block of a file.
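
One way to picture such a map is a bitmap in which bit i is
set when fixed-size block i of a file holds the same data
as the block at the same offset in another file. This is an
assumed interpretation for illustration; the exact meaning
of each bit is defined by the proposal, and the function
name below is hypothetical.

```python
def shared_block_bitmap(file_a, file_b, block_size):
    """Return a bitmap (as bytes) of blocks of file_a that match file_b at the same offset."""
    block_count = (len(file_a) + block_size - 1) // block_size
    bitmap = bytearray((block_count + 7) // 8)
    for i in range(block_count):
        start = i * block_size
        if file_a[start:start + block_size] == file_b[start:start + block_size]:
            # Bit i lives in byte i // 8, least significant bit first.
            bitmap[i // 8] |= 1 << (i % 8)
    return bytes(bitmap)
```

A client holding this bitmap and a cached copy of the other
file could skip reading the blocks whose bits are set.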

The proposal does not require a new minor version
of NFS, but instead requires 64 new pNFS layout
types.

The proposal can be extended to support sub-file
caching, whether the file has de-duplication or
not, and is a candidate for marrying with the
peer-to-peer NFS proposal.

At the San Francisco and Minneapolis IETF
meetings, the feedback on the proposals has been
that block sizes and alignments that are powers
of 2 don't match up with all forms of
de-duplication and major use cases of caching.
For example, suppose file 1 is 9111 bytes long and
file 2 is 1000 bytes long.  At offset 111 of file
1, the next 1000 bytes are equal to all of file 2.
File 1 and file 2 are de-duplicated in some storage
devices.  A major use case of caching that is
not covered by the proposal might be an HPC
application that has records that are each aligned
on 64 bit boundaries but with lengths that are not
powers of 2, e.g. the record lengths might be 108
bytes each (a multiple of 64 bits).
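
The unaligned de-duplication example above can be
reproduced with synthetic data. The sketch below builds the
two files from a seeded pseudo-random generator (the
contents, the seed, and the 512-byte block size are
arbitrary assumptions) and compares them block by block at
aligned offsets.

```python
import random

def make_example(seed=0):
    """Build file 1 (9111 bytes) containing all of file 2 (1000 bytes) at offset 111."""
    rng = random.Random(seed)
    file2 = bytes(rng.getrandbits(8) for _ in range(1000))
    file1 = bytearray(rng.getrandbits(8) for _ in range(9111))
    file1[111:1111] = file2
    return bytes(file1), file2

def aligned_blocks(data, block_size):
    """Set of fixed-size blocks taken at offsets that are multiples of block_size."""
    return {data[i:i + block_size] for i in range(0, len(data), block_size)}

def shared_aligned_blocks(file1, file2, block_size):
    """Number of distinct aligned blocks the two files have in common."""
    return len(aligned_blocks(file1, block_size) & aligned_blocks(file2, block_size))

file1, file2 = make_example()
duplicate_offset = file1.find(file2)                        # byte-level search
aligned_matches = shared_aligned_blocks(file1, file2, 512)  # block-level view
```

A byte-level search locates the duplicate data inside
file 1, but because it starts at offset 111, a map built
from aligned power-of-2 blocks has no shared block to
report.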

It seems obvious how the proposal could address
the HPC caching use case; simply relax the
requirement that block sizes be powers of 2.  More
thought will be needed to address the unaligned
de-duplication use case, at least in its most
general forms.


-- 
Mike Eisler, Senior Technical Director, NetApp, 719 599 9026,
http://blogs.netapp.com/eislers_nfs_blog/
