[nfsv4] Directory delegations, take 2

"Noveck, Dave" <Dave.Noveck@netapp.com> Fri, 17 October 2003 23:57 UTC


So this is my attempt to update the directory delegation approach
to reflect comments I have recently received.  In some cases, I've
changed things and in some cases I've simply tried to explain things
better.  I'm not going to do extensive quoting of comments I've
received, but I hope my summaries are not misrepresenting anyone.

The first thing that has been mentioned, by David Robinson I believe,
is directory write delegation.  I'd like to explain why I've stayed
away from directory write delegation.

Directory write delegations scare the hell out of me.  There are two 
important practical issues.  First, you have the problem of an 
unresponsive client stopping things on the server from proceeding.  
It is true that this problem already exists with write delegations 
for files, but the problem is much more likely if you have a whole 
directory such that any lookup through it will cause people to wait 
for a long time, while we decide whether the client holding the write 
delegation is ever going to respond.  The second issue, which exacerbates
the first, is that the effect of revoking a write directory delegation
is liable to be extremely disconcerting to the client/user, making the
server extremely reluctant to revoke the thing, and so lengthening the
delay to other clients when there is an unresponsive client.  The problem
is that once you do a directory-modifying operation, and it succeeds at
the application level, if you have your delegation ripped away, you are
in a tough situation.  Your application syscall has succeeded, the
application may have terminated, and now you have created a file, but
other clients have seen the directory as if you hadn't.  Thus you don't
want to push the create out, and if you don't, you essentially have a
corrupt-fs/reboot situation.  It is theoretically possible to embed the
directory-modifying operations in transactions such that we have a nice
recovery, but my guess is that there aren't going to be actual clients
able to use this kind of thing safely for a long long time, if ever.

Carl Burnett made reference to write directory delegation in DFS (I think 
???*****).  I'm guessing that the issues of lack of shared semantics that 
he mentions could be worked around, if it was worth it.  But I wonder 
about the effects of communications problems, as I mentioned above, 
particularly in an internet environment.  So I would be interested in 
hearing about actual experiences with this.  Is it worth it, given that
spec-ing and implementing this is liable to be a lot of work?

There are a number of other issues that people have brought up that seem
to be inter-related:

     Delegated directory contents as READDIR or READDIR+ (or what is 
     the role of attributes?).

     Synchronous or asynchronous notification.

     Notification of changes vs. a clear-your-dnlc-for-directory model.  
     (raised by Tom Talpey in some private e-mail).

These all relate to the issue of what directory delegations are intended
to do (Oh Gosh, I need a Problem Statement :-).  So the following things are
possibilities (not mutually exclusive). I'm particularly interested in 
additions to this list, except when they complicate the design.  Come to 
think of it, I'm not interested in changes to the list :-), but I know 
that won't stop anybody.

     Enable files to be accessed without significant server interaction 
     when they exist in read-mostly directories, or rather, in directories 
     that are not being written by other clients.

     Tracking changes in specific directories for programs that display
     directories for GUI tools, without ugly polling.

     Accessing non-existent files in a non-changing directory, presumably 
     one which exists :-), or the issue of ENOENT lookups/opens mentioned 
     by Carl Burnett.

Let's first consider the issue of asynchronous vs. synchronous notification.
The motivation for asynchronous notification is that it is better from the
server's point of view: operations will not be held up due to network
problems or a client not responding quickly to a callback (or being down).
Even when everything is working OK, synchronous notification has a cost, in
that a delay equal to twice the latency to the most distant notified client
is added.
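
As a rough illustration of that cost, here is a minimal sketch. It assumes the server sends its callbacks to all delegation holders in parallel and must wait for every acknowledgement, so the slowest round trip dominates. The function name, client names, and latency figures are all hypothetical.

```python
def added_sync_delay(one_way_latency_ms):
    """Extra delay on a directory-modifying operation when the server
    waits for every delegation holder to acknowledge a callback.

    one_way_latency_ms: hypothetical map of client name -> one-way
    latency in milliseconds. Callbacks are assumed to go out in
    parallel, so only the most distant client matters.
    """
    if not one_way_latency_ms:
        return 0.0  # no delegation holders, nothing to wait for
    return 2 * max(one_way_latency_ms.values())


# Hypothetical example: one nearby client and one distant client.
latencies = {"lan-client": 0.5, "wan-client": 40.0}
print(added_sync_delay(latencies))
```

This ignores client processing time and retransmissions, so it is a lower bound rather than a prediction.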

So the issue boils down to whether asynchronous notification will do the job.  
Carl's comments have caused me to rethink the issue and I've decided that
it won't, at least for anything other than the case of the GUI tools.  So
I think I'm back in the synchronous notification camp.

Tom Talpey (private e-mail) has raised the issue of whether change
notification is worth it at all, and whether instead you just have a
recall/revocation event and let the client just get his delegation again
and refetch the modified directory contents.  This does make the feature
easier to spec (e.g. there is a tough case of sequencing notifications and
successive READDIR's when fetching a big directory, as well as a more
complicated set of callbacks to define).  However, I worry about very large
directories and the effort of refetching, when there is a modest level of
exogenous directory change.  What do other people think?  I think I'm going
to go forward trying to do the notifications, unless it turns out that the
complexities make this too difficult to do for v4.1.
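
To make the trade-off concrete, the sketch below shows a client applying individual change notifications to a cached directory, and compares that with the number of entries a full refetch would transfer. The notification format (tuples tagged "add", "remove", "rename") and the class and function names are hypothetical. Nothing here is a proposed wire format.

```python
class CachedDirectory:
    """Client-side copy of a delegated directory: name -> fileid."""

    def __init__(self, entries):
        self.entries = dict(entries)

    def apply(self, note):
        """Apply one change notification (hypothetical tuple format)."""
        kind = note[0]
        if kind == "add":
            _, name, fileid = note
            self.entries[name] = fileid
        elif kind == "remove":
            _, name = note
            self.entries.pop(name, None)
        elif kind == "rename":
            _, old, new = note
            self.entries[new] = self.entries.pop(old)
        else:
            raise ValueError(f"unknown notification kind: {kind!r}")


def refetch_cost(cache):
    """Entries transferred if the delegation is recalled and the
    whole directory is read again."""
    return len(cache.entries)


# Hypothetical large directory with a modest level of outside change.
cache = CachedDirectory({f"file{i}": i for i in range(100_000)})
changes = [
    ("add", "new.txt", 100_000),
    ("remove", "file7"),
    ("rename", "file8", "renamed"),
]
for change in changes:
    cache.apply(change)

print(len(changes), "notifications vs", refetch_cost(cache), "entries refetched")
```

The sketch sidesteps the hard part mentioned above, namely sequencing notifications against a READDIR sequence that is still in progress.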

One issue that Saadia Khan raised is the issue of directory changes made 
by the delegate himself.  I think we have to make clear that this is 
allowed and the client is presumed to know about directory changes it 
makes itself.  Doing otherwise would compromise the usefulness of 
directory delegations in the case in which a single client is modifying 
the directory.  My assumption is that it is just too difficult to do write
directory delegation, but exclusive use is still a very important case,
and we should do what we can to make read directory delegations useful in
the exclusive use environment.

This would be particularly important if we are not doing notifications 
and have to recall the delegation and re-READDIR, but even if we do 
notification, the thought of a RENAME on a high-latency link waiting 
for a high-latency callback to the client doing the rename, makes me 
kind of sick.  

The issue of not sending callbacks to the client making the change
could require stateid's in all directory-modifying operations, but
with sessions, we can simply not notify delegations associated with 
the session making the change.
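
A minimal server-side sketch of that rule, assuming each delegation is recorded against the session that obtained it. The table class, its method names, and the directory and session identifiers are hypothetical.

```python
from collections import defaultdict


class DelegationTable:
    """Server-side record of which sessions hold a read delegation on
    which directory (both identified by opaque, hypothetical ids)."""

    def __init__(self):
        self._holders = defaultdict(set)  # directory id -> session ids

    def grant(self, directory, session):
        self._holders[directory].add(session)

    def sessions_to_notify(self, directory, changing_session):
        """Sessions that need a callback for a change to `directory`.

        The session making the change is skipped: that client is
        presumed to know about its own modifications.
        """
        holders = self._holders.get(directory, set())
        return sorted(holders - {changing_session})


table = DelegationTable()
table.grant("dir-1", "session-A")
table.grant("dir-1", "session-B")
table.grant("dir-2", "session-A")

# Shared directory: only the other holder gets a callback.
print(table.sessions_to_notify("dir-1", "session-A"))
# Exclusive use: no callbacks at all, so no added latency.
print(table.sessions_to_notify("dir-2", "session-A"))
```

In the exclusive-use case the list is empty, which is exactly the situation where a RENAME should not have to wait on any callback.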

Regardless of all the IETF procedural stuff, my impression is that 
sessions are going to be in v4.1 and I don't want to waste my time 
defining new operations that won't be needed for v4.1 if sessions
are present.

Another issue that was raised is the requirement that attributes not
be changed.  There were some objections to this by David Robinson on
what I take to be architectural grounds, in that directories and
the attributes of files within them are just different sorts of 
objects.  Also, Rob Thurlow worries about the difficulty of 
implementing the callback in response to, for example, a SETATTR
on a filehandle which just happens to be in the subject directory. 

So let me first explain my basic motivation for the attribute
requirement.  The attributes I am basically concerned with are
those that have to do with access to the file: mode, owner, 
group, and acl.  Also the change attribute so that the client can
see if he has the right version of the file.  We could try to
reduce the attributes guaranteed constant to the minimum, but 
there doesn't seem to be a lot of reason to do that.  This is the
same situation as with file delegation.  Any SETATTR causes the
delegation to be recalled, even though it might be possible to
allow a few marginal attributes to be changed.  Having the client
able to assume that all attributes remain unchanged just makes 
things simpler. 

I want the directory-delegated client to be able to access
(i.e. open and read) files within the directory without needing
to contact the server.  So this is why it makes sense to impose
a similar attribute constancy requirement on directory delegation.
If you didn't, you could not determine whether a given user
could access the file and would have to contact the server
for each individual file.
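
As an illustration of what attribute constancy buys, the sketch below does a read-permission check entirely from cached attributes. It is deliberately simplified: it handles only the classic owner/group/other mode bits, ignores acls and any superuser override, and uses numeric ids where NFSv4 carries owner and group as strings. All names are hypothetical.

```python
import stat
from dataclasses import dataclass


@dataclass(frozen=True)
class CachedAttrs:
    mode: int   # permission bits, as in the mode attribute
    owner: int  # numeric uid (a simplification for this sketch)
    group: int  # numeric gid (same simplification)


def may_read(attrs, uid, gids):
    """Classic UNIX owner/group/other read check, using only cached
    attributes that the delegation guarantees have not changed."""
    if uid == attrs.owner:
        return bool(attrs.mode & stat.S_IRUSR)
    if attrs.group in gids:
        return bool(attrs.mode & stat.S_IRGRP)
    return bool(attrs.mode & stat.S_IROTH)


# Hypothetical file: rw-r----- owned by uid 1000, group 100.
attrs = CachedAttrs(mode=0o640, owner=1000, group=100)
print(may_read(attrs, uid=1000, gids={100}))  # owner
print(may_read(attrs, uid=2000, gids={100}))  # group member
print(may_read(attrs, uid=2000, gids={200}))  # everyone else
```

Without the constancy guarantee, none of these answers could be trusted without a round trip to the server for each file.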
  
Even when you have read delegations available for the individual 
files, you have to get a delegation for each one being accessed, 
and then return that delegation.  Given that clients may cache 
copies of infrequently changed files on disk, a simple way of 
validating such copies and securing access would be very 
nice indeed, especially without forcing a per-file state 
housekeeping requirement.  The number of directories you are 
going to be accessing is much smaller than the number of files,
in almost all cases.

So I'd argue that the performance benefits of this override any
architectural reservations, but that is generally the way I lean
on these things.  After all, READDIRPLUS (now READDIR) returns
attribute information together with directory information, in the
face of the same architectural disconnect for the same sorts of
pragmatic reasons.  As far as the difficulty of implementation,
I'd say "No pain, no gain" but I would be open to an option that
would allow servers that couldn't implement this to obtain all the
benefits that they could get without it.  Let me also offer the
following full disclosure.  WAFL does not have pointers in the inode
back to the enclosing directory but it has been discussed.  I'm 
pretty firm in believing that this is something that filesystems
will just have to do.  When things go wrong, for example, saying you
have a problem with inode xxx (as opposed to the file named
aaa/bbb/cc), as is part of the typical UNIX fs paradigm, is not
something that users can or should be asked to accept.

I guess it is possible to reduce the attributes to the critical 
set, if someone can make a strong case for this.  However, once 
you subtract what are basically filesystem attributes, get rid 
of atime which has to be excluded, take away unchanging attributes 
such as fileid and fsid, there isn't all that much left.  Also,
the difficulty of implementation does not seem to be reduced with 
fewer attributes.

One issue that has come up recently that we will have to resolve
for directory delegation, and appears particularly relevant to the
client looking at the acls and granting access to the individual
user processes, is the relation of credentials and state, particularly
delegation state.  I haven't followed the ongoing discussion of
this issue well enough to determine my exact position on how it
does or should affect directory delegations, although it is clearly
quite relevant.  This needs further discussion.
  
Carl also mentioned some ideas for structuring requests to get 
directory delegation.  I'm thinking a request to get a directory
delegation alone would work OK.  You can add a READDIR to the
COMPOUND.  There would have to be an option so that failure to
get the delegation would not cause an error so that you could
try for a delegation and get the directory information whether
you got the delegation or not.  Mike Eisler has suggested (in
private e-mail) the possibility that this would fit well with
OPENDIR/CLOSEDIR operations in which a delegation request was
a client option.  Since OPENDIR would allow the server to know
when the directory was open, it could make the cookie verifier
useful by enabling the server to switch the verifier only when
the directory was not open. 
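
One way to picture the "try for a delegation, but read the directory either way" request is sketched below. It models COMPOUND's usual behavior of stopping at the first failing operation, plus a per-operation flag that lets a failure pass without ending the request. That flag, the operation name REQUEST_DIR_DELEGATION, and the handler shapes are all hypothetical; the option could just as well be expressed as a successful status with a "not granted" result.

```python
def run_compound(ops):
    """Run ops in order, stopping at the first failure, as COMPOUND does.

    Each op is a (name, handler, optional) triple and handler() returns
    (ok, result). A failing op marked optional is recorded but does not
    stop the compound; that flag is the hypothetical option.
    """
    results = []
    for name, handler, optional in ops:
        ok, result = handler()
        results.append((name, ok, result))
        if not ok and not optional:
            break
    return results


def deny_delegation():
    return False, None  # server declines to delegate


def readdir():
    return True, ["a", "b"]  # stand-in for directory entries


# With the option set, READDIR still runs after the delegation is refused.
print(run_compound([
    ("REQUEST_DIR_DELEGATION", deny_delegation, True),
    ("READDIR", readdir, False),
]))

# Without it, the compound ends at the failed delegation request.
print(run_compound([
    ("REQUEST_DIR_DELEGATION", deny_delegation, False),
    ("READDIR", readdir, False),
]))
```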

Carl also mentioned the possibility of symlink delegations.  I 
don't think this is needed and it would be a lot of delegation
stateid's for the server and client to keep track of.  At least
within the nfs protocols, there is no way to change a symlink
without changing the directory.  Symlinks are not writable
objects.  You have to delete the existing one and then create a 
new one of the same name to get the effect of changing symlink 
contents, and even this would change the filehandle of the
symlink, rather than being seen as modifying an existing object.
So changing a symlink is always going to involve a directory
delegation callback in any case.  To deal with the possibility
that the local server OS has an API to modify a symlink, we merely
have to make the rule that a read directory delegation provides 
an assurance that there be no change in symlinks within the directory 
without a callback.