Re: [nfsv4] Notes regarding discussion of directory scalability issues

David Noveck <> Sat, 04 July 2020 11:09 UTC

From: David Noveck <>
Date: Sat, 04 Jul 2020 07:09:03 -0400
Message-ID: <>
To: Rick Macklem <>
Cc: Trond Myklebust <>, NFSv4 <>

On Fri, Jul 3, 2020, 1:09 PM Rick Macklem <> wrote:

> David Noveck wrote:
> >Rick Macklem  wrote:
> >>Well, I'll throw out a simple suggestion w.r.t. directory cookies.
> >
> >I'm also working on a suggestion about helping clients maintain directory
> >cookies. Slides will be presented 7/9.
> >
> >
> >>What if the cookies generated by the server were required to have the
> >>following properties:
> >
> >I don't think we can add new requirements for directory cookies.
> >Requirements for the granting or retention of directory delegations are
> >possible but I'd like to derive them from inherent aspects of the
> >existing notification scheme. Adding wholly new requirements on an
> >existing feature is not possible.
> >
> >- Monotonically increasing numeric values.
> - Without the monotonically increasing values for cookies, the client side
>   implementation becomes more difficult. To insert entries efficiently, it
>   would have to find the extant entry that a new entry follows. To do this is
>   going to require some sort of data structure that maintains the ordering of
>   the entries and allows for efficient search based on a "random" cookie.
>   --> Too difficult for client implementors --> Never gets implemented.
> >Not possible to add this requirement. Also don't see the need/value of
> >this.  In my view, the existing directory notification scheme provides
> >adequate information for subscribed clients to maintain the structure of
> >cached directories being modified, including cookie values.
> Ok, so if it cannot be a requirement, what about an optional property?
> - Add a new read-only, per-file system attribute that is a Boolean that
>   states whether or not the directory cookies generated by the server have
>   the property.

That's not a problem for me, and I feel it should be added, given that it
would enable client implementation.  It seems likely that most server
implementations would be able to provide cookies with this property.
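
To illustrate why the property would help a client, here is a minimal sketch
in Python. It assumes cookies are integers that increase in directory order
and never change for an extant entry. The class and method names are
hypothetical, not taken from any client or from RFC 5661. Under that
assumption, an added entry can be placed by binary search on its cookie alone.

```python
import bisect


class CookieOrderedDir:
    """Client-side cache of one directory, ordered by server cookie.

    Assumption (hypothetical): the server advertises cookies that are
    monotonically increasing in directory order and stable across changes.
    """

    def __init__(self):
        self._cookies = []  # server cookies, kept sorted
        self._names = []    # entry names, parallel to _cookies

    def add(self, cookie, name):
        # Binary search finds the slot; no "previous entry" lookup is needed.
        i = bisect.bisect_left(self._cookies, cookie)
        if i < len(self._cookies) and self._cookies[i] == cookie:
            self._names[i] = name  # cookie reused after an earlier removal
        else:
            self._cookies.insert(i, cookie)
            self._names.insert(i, name)

    def remove(self, cookie):
        i = bisect.bisect_left(self._cookies, cookie)
        if i < len(self._cookies) and self._cookies[i] == cookie:
            del self._cookies[i]
            del self._names[i]

    def read_from(self, cookie=0):
        """Return (cookie, name) pairs strictly after `cookie`."""
        i = bisect.bisect_right(self._cookies, cookie)
        return list(zip(self._cookies[i:], self._names[i:]))
```

A cookie saved by telldir() stays usable as a resume point even after its
entry is removed, since read_from() only needs the position that follows it.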

>   --> Then the client can choose whether or not to try and use directory
>          delegations based on that.
> >- Sparse enough that new entries can be created with numeric values
> >  between them.
> As worded, this is just a suggestion for server implementors.
> The optional property would be that extant cookies are invariant and that
> monotonically increasing ordering is maintained.
> >Again, a new requirement, with no clear justification.
> Justification, as above.
> >- Guaranteed to not change and remain valid (and referring to the same
> entry)
> >  despite additions/deletions. (Until a deletion callback is received by
> the client.)
> >
> >Should also be allowed if the directory delegation is recalled.
> Yes.
> >I'm unclear about the callbacks you are proposing. Normally, notification
> >callbacks are not sent to the client making the change. It might be
> >possible to change that in a v4.2 extension but it adds to operation
> >latency and creates sequencing difficulties ☹️
> I'll admit I had not read it in a long time and my recollection was
> similar.
> I just re-read it and it states that the client doing the
> addition/deletion gets
> the CB_NOTIFY. Here's the snippet:
>    If the client holding the delegation makes any changes in the
>    directory that cause files or sub-directories to be added or removed,
>    the server will notify that client of the resulting change(s).  If
>    the client holding the delegation is making attribute or cookie
>    verifier changes only, the server does not need to send notifications
>    to that client.  The server will send the following information for
>    each operation:

Now it's my turn to apologise.  I obviously was not reading carefully
enough when I discussed this.  I thought I had seen text saying the
contrary, but when I went back and looked for it, it wasn't there.

> I was mistaken and thought the CB_NOTIFY did not go to the client that did
> the addition/deletion.
> Since the CB_NOTIFY does go to the client doing the addition/deletion, I
> think the 4.1 CB_NOTIFY is sufficient.

My feeling is that it is barely sufficient but we will need to discuss the
possible need for further extensions at the meeting and on the list. The
reasons I'm not sanguine about this approach are:

   - It is a problem for latency, which is a concern for common operations
   like OPEN-create.
   - There are troubling synchronization problems, given that the
   notification can be received before the request response or seconds after.
   - Unless you are prepared to lock the directory on the client until the
   callback is received (not so easy to do since the callback is
   asynchronous), you face even worse synchronization problems as you
   receive multiple notifications (in order but not organized by request).
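
The ordering problem in the second point can be sketched for the simplest
case, a single create. This is only an illustration in Python with
hypothetical names. It assumes the reply to the create carries no cookie
while the add notification does. It makes the two events commute, and it
deliberately does nothing about the multiple-outstanding-request case in the
third point.

```python
class DirCache:
    """Toy cached directory that tolerates reply/notification reordering."""

    def __init__(self):
        self.entries = {}  # name -> cookie, or None while the cookie is unknown

    def on_create_reply(self, name):
        # The reply carries no cookie (assumption), so record the name
        # without one unless the notification already supplied it.
        self.entries.setdefault(name, None)

    def on_add_notification(self, name, cookie):
        # Same end state whether this runs before or after the reply.
        self.entries[name] = cookie

    def pending(self):
        """Names whose cookie is still unknown (notification not yet seen)."""
        return sorted(n for n, c in self.entries.items() if c is None)
```

Entries listed by pending() are known to exist but cannot yet be given a
position in a cookie-ordered listing.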

If we want to do this in 4.1 (e.g. Linux is OK without cookie
monotonicity), then we are going to have to fix up issues with regard to
dirent_notif_delay.  This is the minimum number of *seconds* to delay
sending the notification.  We'd have to change this so that it didn't
apply to the same-client case.  It pretty clearly is designed for the
other-client case.  I'm pretty sure we could do this in the rfc5661bis
context, along with some other cleanup in this area.
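
As a rough sketch of that change (Python, with a hypothetical function and
hypothetical client identifiers, and not current RFC 5661 behavior): the
delay would apply only when the client being notified is not the one that
made the change.

```python
def notification_delay(dirent_notif_delay, changer, subscriber):
    """Seconds a server might wait before notifying `subscriber`.

    `changer` and `subscriber` are hypothetical client identifiers. The
    same-client exemption is the change suggested above, not something
    the existing specification states.
    """
    if changer == subscriber:
        return 0  # same client: notify without the minimum delay
    return dirent_notif_delay  # other clients: keep the advertised minimum
```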

> >I considered returning the necessary info in a bunch of new ops but the
> >whole thing got waaay too complicated.  I think the best choice would be a
> >GET_NOTIFICATION op to be issued after the CREATE, OPEN, LINK, REMOVE,
> >
> As above, the client doing the addition/deletion does get the CB_NOTIFY.
> >>For example, for a simple case of a UFS file system on a server, the UFS
> >>directory consists of blocks of directory entries.
> >>- When an entry is added, it goes in a gap within the directory or grows
> a
> >>   new block at the end (if I vaguely recall this correctly;-).
> >>- When an entry is removed, the entry is erased without moving other
> >>   entries.
> >>For this example, the directory offset cookie can simply be the byte
> offset
> >>of the entry.
> >>(I haven't looked at other file systems, but hopefully offset cookies
> with the
> >> above properties can be created for most of them?)
> >
> >Possibly, but given the need for this, I'm not inclined to rely on hope.
> If the server cannot do it, the new read-only attribute would be false and
> (at least the FreeBSD client) would choose not to get directory
> delegations.
> >When an entry is added/deleted, the server issues a callback to the client
> >with the new directory cookie offset (and the directory entry for the add
> case).
> >
> >I'd prefer that the same information be returned as in the existing
> >directory notification scheme, which allows you to maintain directory
> >order and cookies, without tying one to the other.
> I was just being lazy. I hadn't read the section of RFC5661 in a long time
> and
> didn't remember the terminology. (I didn't mean to imply "new callbacks"
> were
> needed.) There was the issue w.r.t the client doing the addition/deletion
> getting a callback (which it appears it does).
> >>The client can maintain this structure in any number of ways (and it
> could be
> >>fun figuring out what works well), but a trivial version could be:
> >>- The reply to a readdir is kept as a list head (with the directory
> offset of the
> >>   first entry) and a linked list of the entries in order, with their
> cookie.
> >>  (Remember that the cookies are in the same ordering as the entries
> Well, I do this for fun. If it isn't fun, I won't be doing it. (I know
> someone
> doing this as a retirement hobby is weird, but since vendors do not
> invest in clients much...)

> >Not in my scheme.
> >
> >>- The next readdir reply creates the next head/list.
> >>
> >>readdir() just works through each list, following each head in order.
> >>telldir() returns the cookie for the entry.
> >>seekdir() just finds the correct list and then searches down that list
> for a match.
> >>
> >>The remove/add entry callbacks just insert/delete entrie(s) in the
> appropriate
> >>list.
> >>(I'd probably keep these lists in the kernel client under the VFS for
> FreeBSD,
> >> as malloc'd data structures, but that is simply an implementation
> choice.)
> >>
> >>You would require the monotonically increasing property for directory
> >>delegations to be issued. (Without that you don't know where to insert
> >>additions.)
> >>
> >The insert notification gives you enough information to do this without
> >the monotonically increasing property.
> That is true, but then why hasn't anyone implemented it?
> (Yes, I was "misusing" the term require. I'm not a spec. writer, as you
> can easily tell.)
> >>You would also require that extant directory entry cookies
> >>remain valid and unchanged when additions/deletions occur.
> >>(Note that removing an entry and then adding an entry at the same offset
> >> is allowed under POSIX telldir()/seekdir() as I understand it.)
> >
> >I think the requirement goes away when the delegation does.
> Yes, although in practice, assuming the client(s) are getting the
> notifications, I only see them returning a directory delegation after a
> closedir() and if the number of CB_NOTIFYs is large or the caching storage
> needs to be freed.
> >>This avoids any need for the client to synthesize cookies and just use
> the ones
> >>returned by the server, I think?
> >
> >Yes
> >
> >>Just a simple idea that may be worth considering?
> >>(If this has already been discussed,
> >
> >Don't think it has.
> >
> >>I apologize for not seeing it.)
> >
> >Even if it had, no apology would be necessary
> I should have re-read the appropriate sections of RFC5661 before the last
> post.
> rick
> ________________________________________
> From: nfsv4 on behalf of David Noveck
> Sent: Thursday, July 2, 2020 6:06 AM
> To: Trond Myklebust
> Cc:<>
> Subject: Re: [nfsv4] Notes regarding discussion of directory scalability
> issues
> On Tuesday, June 30, 2020, Trond Myklebust wrote:
> On Tue, 2020-06-30 at 08:40 -0400, David Noveck wrote:
> Thanks for your helpful comments.
> On Mon, Jun 29, 2020 at 2:43 PM Trond Myklebust wrote:
> Like it or not, the readdir cookie is an attribute of the directory.
> If the protocol treated them as such, then the attribute notifications
> feature could provide updates to the client.   Given that it doesn't, we
> could add a cookie update feature to the directory notification feature as a
> v4.2 extension to the protocol.  However, I'm reluctant to start work on
> the necessary protocol additions until we are sure they are needed to
> provide better directory cacheability.
> Actually, they are attributes of directory streams.   The difference is
> not all that important given that client implementations are unlikely to be
> aware of the specific stream associated with any particular request.
> However, there are a few cases in which the difference is important in
> determining whether various approaches to client handling of cookies might
> or might not work, and will be important in the discussion below:
>   *   Two requests made on different clients necessarily are made on
> distinct streams.
>   *   Two requests made on different instances of the same client (with an
> intervening restart/reboot) also have to arise on different streams.
> If I want to support the POSIX telldir() and seekdir() operations, then I
> need to ensure that when the application calls seekdir(), I return to the
> exact same cursor location in the stream that I was at when I called
> telldir().
> Agreed.
> Without a server side cookie on which to anchor my telldir() cookies,
> Every client has these available but it is not clear to me how useful such
> anchoring is.   I think the flexibility that each client has to assign
> cookies to streams it is responsible for is valuable and could be
> compromised if anchoring to the server cookies is made the focus of the
> implementation.
> then all I have is a list of filenames that can and will change every time
> a file is created, deleted or renamed.
> Clearly it will change.  However, the directory notifications feature
> makes some assumptions, currently implicit, about how the list will change.
>  Once these are made explicit, the wg could decide that server/fs pairs
> incapable of staying within these reasonable restrictions (if they are, in
> fact, reasonable), cannot support the directory notifications feature.
> Both the length and ordering of that list may change whenever the
> directory is modified,
> Clearly the length will change, but the reasonable expectation is that
> creating a file will increase the length by one and deleting one will
> decrease it by one.   I don't see the value of supporting directory
> notifications on server fs that do something else.
> With regard to ordering, I suppose the spec allows an fs to shuffle the
> directory order every time a change is made, but I'm unaware of any actual
> file systems that do this.   Do we need to support directory notifications
> for such fs's?
> touch foo; touch bar; ln foo baz; rm foo; mv baz foo
> There... Most filesystems will end up reordering 'foo' and 'bar' in the
> directory stream given the above sequence of commands. How does the client
> figure out what happened if the above sequence of commands is performed on
> the server?
> Now let's say that is a directory of a million files, and something like
> the above is made to happen regularly. How do I maintain a stable list of
> synthetic cookies on the client?
> I think you are right about there being cases in which it is impossible,
> but we either disagree or are simply talking past one another about other
> cases.
> If the caching client is making the directory changes, then I agree this
> cannot be done and you are stuck having to refetch potentially large
> directories to deal with new READDIR requests☹️
> Where we might disagree is the case in which another client is making the
> change.  In that case directory notifications would allow you to avoid
> repeated READDIR ops, whether you are providing the user synthetic or
> server-based cookies.
> My talk on directory caching will discuss the possibility of v4.2
> extensions to address the same-client directory caching issue, as well as
> possible clarifications regarding directory delegation/notification in v4.1.
> meaning that a naive implementation
> OK.   I'll plead guilty to one misdemeanor count of directory naivety.
> of synthetic cookies as an offset is not compatible with the
> telldir()/seekdir() requirements.
> It's not clear to me how this incompatibility would manifest itself.  I
> think I need to understand what would break.
> To make matters worse, the list size is for all intents and purposes
> unbounded, because there is no hard limit on the size of a directory. That
> makes it also impossible to create a cached mapping between a synthetic
> cookie and a filename; such a mapping would be unbounded both in size and
> in duration (since we don't know a priori how long the application will
> keep the directory open, or for that matter, which exact set of cookies it
> may have cached).
> Such a mapping would, in essence, be part of the cached directory.   So,
> if it is too big to keep in client memory, then it is too big to cache and
> you might as well decide not to cache it.
> I expect there is an issue that is a worry in the case in which a
> reasonably sized directory  grows over time to be too big to cache while an
> open directory stream retains some directory cookies which might be
> incompatible with the client dropping  caching of directories and switching
> to server-based cookies.😖
> I feel it is reasonable to treat this situation as one might a
> cookie-verifier failure, particularly if this is the only worrisome failure
> mode.   However, this possibility means that I would not ask clients to
> implement such local cookies. To enable that, we would have to make
> explicit the same sort of reasonableness requirement for cookie changes
> that we have already discussed for ordering changes.  RFC7530 already
> alludes to the need to avoid spurious cookie invalidations, although not in
> as explicit or strict a way as we would need to support directory
> notifications:
>    As there is no way for the client to indicate that a cookie value,
>    once received, will not be subsequently used, server implementations
>    should avoid schemes that allocate memory corresponding to a returned
>    cookie.  Such allocation can be avoided if the server bases cookie
>    values on a value such as the offset within the directory where the
>    scan is to be resumed.
>    Cookies generated by such techniques should be designed to remain
>    valid despite modification of the associated directory.  If a server
>    were to invalidate a cookie because of a directory modification,
>    READDIRs of large directories might never finish.
> So in order to make this work the client would basically have to create
> its own B-tree and persist it in storage somewhere.
> I don't see the need to make this persistent.  If the client restarts, all
> directory streams have ceased to exist and we know  a posteriori  that
> there are no outstanding directory cookies to which the client would have
> to respond.
> --
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace