Re: [nfsv4] Notes regarding discussion of directory scalabiliy issues

David Noveck <> Tue, 30 June 2020 12:40 UTC

From: David Noveck <>
Date: Tue, 30 Jun 2020 08:40:21 -0400
Message-ID: <>
To: Trond Myklebust <>
Cc: "" <>

Thanks for your helpful comments.

On Mon, Jun 29, 2020 at 2:43 PM Trond Myklebust <> wrote:

> Like it or not, the readdir cookie *is* an attribute of the directory.

If the protocol treated them as such, then the attribute notifications
feature could provide updates to the client.  Given that it doesn't, we
could add a cookie update feature to the directory notification feature as a
v4.2 extension to the protocol.  However, I'm reluctant to start work on
the necessary protocol additions until we are sure they are needed to
provide better directory cacheability.

Actually, they are attributes of directory *streams*.  The difference is
not all that important given that client implementations are unlikely to be
aware of the specific stream associated with any particular request.
However, there are a few cases in which the difference is important in
determining whether various approaches to client handling of cookies might
or might not work, and these will be important in the discussion below:

   - Two requests made on different clients necessarily are made on
   distinct streams.
   - Two requests made on different instances of the same client (with an
   intervening restart/reboot) also have to arise on different streams.

> If I want to support the POSIX telldir() and seekdir() operations,
> then I need to ensure that when the application calls seekdir(), I return
> to the exact same cursor location in the stream that I was at when I called
> telldir().


> Without a server side cookie on which to anchor my telldir() cookies,

Every client has these available, but it is not clear to me how useful such
anchoring is.  I think the flexibility that each client has to
assign cookies to streams it is responsible for is valuable and could be
compromised if anchoring to the server cookies is made the focus of the

> then all I have is a list of filenames that can and will change every time
> a file is created, deleted or renamed.

Clearly it will change.  However, the directory notifications feature makes
some assumptions, currently implicit, about how the list will change.  Once
these are made explicit, the WG could decide that server/fs pairs incapable
of staying within these restrictions (if they are, in fact, reasonable)
cannot support the directory notifications feature.

> Both the length and ordering of that list may change whenever the directory
> is modified,

Clearly the length will change, but the reasonable expectation is that
creating a file will increase the length by one and deleting one will
decrease it by one.  I don't see the value of supporting directory
notifications on server file systems that do something else.

With regard to ordering, I suppose the spec allows an fs to shuffle the
directory order every time a change is made, but I'm unaware of any actual
file systems that do this.  Do we need to support directory notifications
for such file systems?
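
To make those implicit assumptions concrete, here is a minimal sketch of a
client applying add/remove notifications to a cached listing.  The class and
method names (CachedDir, apply_add, apply_remove) are hypothetical and are
not taken from the protocol; the sketch only works if a create adds exactly
one entry, a remove deletes exactly one, and nothing else is reordered.

```python
from dataclasses import dataclass, field


@dataclass
class CachedDir:
    """Client-side copy of a directory listing, kept in server order."""
    names: list = field(default_factory=list)

    def apply_add(self, name, prev_name=None):
        # Assumption: the notification says which entry the new one follows,
        # and all existing entries keep their relative order.
        pos = 0 if prev_name is None else self.names.index(prev_name) + 1
        self.names.insert(pos, name)

    def apply_remove(self, name):
        # Assumption: a removal deletes exactly one entry and moves no others.
        self.names.remove(name)


d = CachedDir(["a", "b", "c"])
d.apply_add("x", prev_name="b")   # length grows by exactly one
d.apply_remove("a")               # length shrinks by exactly one
print(d.names)
```

A file system that reshuffled its entries on every change would leave such a
cache wrong after the first notification, which is why the restriction
matters.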

> meaning that a naive implementation

OK.  I'll plead guilty to one *misdemeanor* count of directory naivety.

> of synthetic cookies as an offset is not compatible with the
> telldir()/seekdir() requirements.

It's not clear to me how this incompatibility would manifest itself.  I
think I need to understand what would break.
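
One possible failure, sketched here under the assumption that the synthetic
cookie is simply an index into the client's current list of names (the class
OffsetCookieDir and its method are hypothetical, for illustration only): a
deletion ahead of the saved position shifts every later entry, so resuming
from the saved cookie skips an entry.

```python
class OffsetCookieDir:
    """Naive model: a cookie is just an index into the current name list."""

    def __init__(self, names):
        self.names = list(names)

    def read_from(self, cookie, count):
        # Return up to `count` (name, next_cookie) pairs starting at `cookie`.
        chunk = self.names[cookie:cookie + count]
        return [(name, cookie + i + 1) for i, name in enumerate(chunk)]


d = OffsetCookieDir(["a", "b", "c", "d"])
first = d.read_from(0, 2)          # reads "a" and "b"
saved = first[-1][1]               # telldir()-like position: next should be "c"

d.names.remove("a")                # directory modified before the saved position

resumed = d.read_from(saved, 2)    # seekdir()-like resume from the saved cookie
print([name for name, _ in resumed])
```

After the removal the saved index points at "d", so "c" is never returned.
A creation ahead of the saved position would have the opposite effect and
return an entry twice.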

> To make matters worse, the list size is for all intents and purposes
> unbounded, because there is no hard limit on the size of a directory. That
> makes it also impossible to create a cached mapping between a synthetic
> cookie and a filename; such a mapping would be unbounded both in size and
> in duration (since we don't know a priori how long the application will
> keep the directory open, or for that matter, which exact set of cookies it
> may have cached).

Such a mapping would, in essence, be part of the cached directory.  So, if
it is too big to keep in client memory, then it is too big to cache and you
might as well decide not to cache it.
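
A minimal sketch of that idea, assuming a hypothetical client cache with a
fixed entry limit (DirCache, max_entries and the method names are invented
for illustration): the cookie-to-name map is stored inside the cached
directory, so it is bounded by the same limit and disappears whenever the
directory is not cached.

```python
class DirCache:
    """Holds whole directories; the cookie map is part of each cached one."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self.dirs = {}  # directory handle -> cached listing plus cookie map

    def try_cache(self, handle, names):
        # Too big to cache: keep nothing, including any cookie map.
        if len(names) > self.max_entries:
            self.dirs.pop(handle, None)
            return False
        # Locally assigned cookies; 0 is left free to mean "start".
        cookie_to_name = {i + 1: name for i, name in enumerate(names)}
        self.dirs[handle] = {
            "entries": list(names),
            "cookie_to_name": cookie_to_name,
        }
        return True

    def lookup_cookie(self, handle, cookie):
        cached = self.dirs.get(handle)
        if cached is None:
            return None  # not cached: fall back to server cookies
        return cached["cookie_to_name"].get(cookie)


cache = DirCache(max_entries=3)
small_ok = cache.try_cache("dirA", ["a", "b", "c"])
big_ok = cache.try_cache("dirB", ["a", "b", "c", "d"])
print(small_ok, big_ok, cache.lookup_cookie("dirA", 2))
```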

I expect there is a worry about the case in which a reasonably sized
directory grows over time to be too big to cache while an open directory
stream still retains some directory cookies.  Those cookies might be
incompatible with the client dropping caching of the directory and switching
to server-based cookies. 😖
I feel it is reasonable to treat this situation as one might a
cookie-verifier failure, particularly if this is the only worrisome failure
mode.  However, this possibility means that I would not ask clients to
implement such local cookies. To enable that, we would have to make
explicit the same sort of reasonableness requirement for cookie changes
that we have already discussed for ordering changes.  RFC 7530 already
alludes to the need to avoid spurious cookie invalidations, although not in
as explicit or strict a way as we would need to support
directory notifications:

   As there is no way for the client to indicate that a cookie value,
   once received, will not be subsequently used, server implementations
   should avoid schemes that allocate memory corresponding to a returned
   cookie.  Such allocation can be avoided if the server bases cookie
   values on a value such as the offset within the directory where the
   scan is to be resumed.

   Cookies generated by such techniques should be designed to remain
   valid despite modification of the associated directory.  If a server
   were to invalidate a cookie because of a directory modification,
   READDIRs of large directories might never finish.
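
As a rough illustration of the kind of scheme that text points toward, here
is a sketch in which the cookie is a slot position within the directory and
removed entries leave an empty slot behind.  This is an assumed toy model
(SlotDir and its methods are hypothetical), not how any particular server
works; it only shows that no per-cookie memory is needed and that a cookie
can stay usable across creates and removes.

```python
class SlotDir:
    """Toy directory: entries live in slots; a removal leaves an empty slot."""

    def __init__(self):
        self.slots = []

    def create(self, name):
        # Assumption: new entries are appended, never placed in an old slot.
        self.slots.append(name)

    def remove(self, name):
        self.slots[self.slots.index(name)] = None

    def readdir(self, cookie, count):
        # A cookie is the slot position at which the scan resumes (0 = start).
        # Nothing is remembered on the server per cookie handed out.
        out = []
        pos = cookie
        while pos < len(self.slots) and len(out) < count:
            name = self.slots[pos]
            pos += 1
            if name is not None:
                out.append((name, pos))
        return out


d = SlotDir()
for n in ["a", "b", "c", "d"]:
    d.create(n)

first = d.readdir(0, 2)            # "a" and "b"
saved = first[-1][1]               # cookie to resume after "b"

d.remove("a")                      # modifications while the scan is paused
d.create("e")

rest = d.readdir(saved, 10)        # resumes at "c" despite the changes
print([name for name, _ in rest])
```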

> So in order to make this work the client would basically have to create
> its own B-tree and persist it in storage somewhere.

I don't see the need to make this persistent.  If the client restarts, all
directory streams have ceased to exist and we know *a posteriori* that
there are no outstanding directory cookies to which the client would have
to respond.

> --
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace