Re: [nfsv4] Notes regarding discussion of directory scalability issues

Trond Myklebust, Mon, 29 June 2020 18:43 UTC


Like it or not, the readdir cookie is an attribute of the directory.

If I want to support the POSIX telldir() and seekdir() operations, then I need to ensure that when the application calls seekdir(), I return to the exact same cursor location in the stream that I was at when I called telldir().

Without a server-side cookie on which to anchor my telldir() cookies, all I have is a list of filenames that can and will change every time a file is created, deleted or renamed. Both the length and ordering of that list may change whenever the directory is modified, meaning that a naive implementation of synthetic cookies as an offset is not compatible with the telldir()/seekdir() requirements.
To make matters worse, the list size is for all intents and purposes unbounded, because there is no hard limit on the size of a directory. That also makes it impossible to create a cached mapping between a synthetic cookie and a filename; such a mapping would be unbounded both in size and in duration (since we don't know a priori how long the application will keep the directory open, or for that matter, which exact set of cookies it may have cached). So in order to make this work, the client would basically have to create its own B-tree and persist it in storage somewhere. Remind me why we needed that NFS server?
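
To make the offset problem concrete, here is a minimal sketch in Python. It is a toy model only, not how any real client is implemented: the helper read_from() and the five-entry listing are hypothetical, and the "cookie" is simply an index into the client's current list of names.

```python
def read_from(entries, cookie, count):
    """Return up to `count` names starting at synthetic cookie `cookie`,
    plus the cookie to resume from. The cookie is just a list offset."""
    chunk = entries[cookie:cookie + count]
    return chunk, cookie + len(chunk)

# Hypothetical directory listing as the client last saw it.
listing = ["a", "b", "c", "d", "e"]

# The application reads two entries, then saves its position (the telldir() step).
first, saved = read_from(listing, 0, 2)

# Another process removes "a", so every later name shifts down by one slot.
listing.remove("a")

# The application returns to the saved position (the seekdir() step) and reads on.
resumed, _ = read_from(listing, saved, 2)

print(first)    # names returned before the position was saved
print(resumed)  # names returned after resuming at the saved offset
```

In this toy run the entry "c" is never returned to the application, even though it existed for the whole traversal. That is the kind of skipped entry an offset-based synthetic cookie cannot rule out.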

On Fri, 2020-06-26 at 13:54 -0400, David Noveck wrote:

On 6/22, Chuck and I held a discussion to try to resolve some issues that arose in our discussion of scalability issues for directory operations.   After our original presentations on this topic at the post-IETF107 virtual interim, it was anticipated that we would discuss those on the wg mailing list.   Since that didn't work out, Chuck and I decided to clarify and possibly resolve our differences of approach in a short meeting that worked well in the form of a phone call.

We were able to clarify but not resolve two issues.  We hope to be able to resolve these eventually, but not necessarily during the July 9th meeting.

  *   We did agree that we will need to better understand, and try to resolve issues regarding the compatibility of directory notifications and client handling of directory cookies and their caching. See Directory Delegation Issues for details.

  *   We also explored issues raised by the possible addition of ops to aid recursive directory traversals, building on Chuck's suggestion, at the earlier meeting, of possible protocol aid for "rm -r". See Ops to Add Aiding Recursive Directory Traversal for details.

Directory Delegation Issues

Implementations of directory delegations are quite limited, making it not worth investing in server-side implementations.   Useful client-side implementation would require the ability to efficiently cache large slowly changing directories.   Unfortunately, that is not currently possible, at least for the Linux client, since the creation or removal of a file from a directory requires that the entire directory (which can be very large) be refetched.  This now occurs for directory changes that the client makes itself but would presumably also apply in the case of directory notifications, making them not very useful in the important case of large slowly-changing directories.

As Chuck related to me, this requirement for directory refetching is predicated on the possibility that any change to a directory (e.g. remove, create, rename) could potentially invalidate directory entry cookies for all the cached directory entries.   While this is unlikely to happen in practice, I can see that clients might be unwilling to rely on filesystems not changing these. The thing I don't understand, and Chuck was unable to clarify for me, is why such entries need to be cached at all, and essentially treated as if they were attributes.  While the directory notification feature has the ability to propagate attribute changes to clients with delegations, there is no such ability in the case of cookie changes.  It appears that the feature was designed assuming these would not be necessary.

As a result, it appeared that one of the following would have to be done:

  1.  Clients might be modified so as not to depend on a supposed fixity of such cookies. That's clearly my favorite as I believe the client returning a cached directory could synthesize its own cookies to allow users to fetch directory information across multiple requests.  However, I'm not sure client implementers would agree and this is a matter on which consensus is important.
  2.  Provide a way in which the server could communicate that the theoretical possibility that a remove, create, or rename could cause a revision of cookies for uninvolved directory entries does not occur for a given fs. This could be done by adding a new fs-scope attribute providing information about an fs's directory entry cookie management.  The downside is that this would be v4.2-only and would require an additional RFC to make a v4.1 feature effectively usable 😞
  3.  Provide another way to exclude the possibility of the client being blind-sided by a server-side fs prone to directory cookie reassignment.  For example, any change to a directory entry cookie not being removed or renamed could require delegation recall.  Unfortunately, this is a big change to an existing feature😖

As Chuck and I finished the discussion we anticipated a long-term process aimed at securing a consensus about which of these choices the working group would adopt to make directory delegations useful.  This could start at the 7/9 meeting but the need to make sure we had the active participation of client implementers meant it wasn't a sure thing for initial discussion at the 7/9 meeting.

Lately, I've been looking at directory notifications in more detail and come to the conclusion that the way the notifications use cookies to update cached directories is really predicated on the expectation that cookies for directory entries not involved in the specific update will not change.  For example, insert and rename notifications include the cookie of the entry before the insertion point, which is not really useful if the server-side fs  is free to change cookies for entries not involved in the directory operation.
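
A minimal sketch in Python of why that expectation matters. It is a toy model, not the wire format: the function apply_add(), the cookie values, and the cached list of (cookie, name) pairs are all hypothetical. The client locates the insertion point by looking up the cookie of the preceding entry in its cache.

```python
def apply_add(cache, prev_cookie, new_cookie, new_name):
    """Insert (new_cookie, new_name) after the cached entry whose cookie is
    prev_cookie. Returns False when no cached entry carries that cookie."""
    for i, (cookie, _name) in enumerate(cache):
        if cookie == prev_cookie:
            cache.insert(i + 1, (new_cookie, new_name))
            return True
    return False

# Hypothetical cached directory: ordered (cookie, name) pairs.
cache = [(10, "a"), (20, "b"), (30, "c")]

# Cookies are stable: the notification names cookie 20 ("b") as the
# preceding entry, and the client can place the new entry correctly.
ok_stable = apply_add(cache, 20, 25, "bb")

# Assumed failure case: the server-side fs renumbered its cookies, so the
# notification refers to a cookie the client has never seen.
ok_renumbered = apply_add(cache, 200, 250, "bz")
```

In the second call the client has no way to place the new entry and would have to fall back to refetching the directory, which is the cost the notifications were meant to avoid.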

I now believe that it is possible to rework the description of this feature so that the server, by supporting directory notifications, is effectively providing an assurance that the wholesale revision of cookies as a result of directory modification operations, about which clients are concerned, cannot happen.   This could not be done as a consequence of an errata report, but it is the sort of clarification/revision that I feel could be done as a part of rfc5661bis, assuming that we can reach a working group consensus on the matter.  We intend to make changes of this scale for some REJECTED errata reports. I will take about 5-10 minutes for a presentation of this issue at the 7/9 meeting, hoping to stimulate later discussion of this issue and future directory delegation/notification implementation possibilities on the mailing list.

Ops to Add Aiding Recursive Directory Traversal

In Chuck's presentation he alluded to the possibility of the protocol giving "rm -r" more help and there are a number of useful extensions that could be defined.  As I considered other applications in which recursive directory traversal is used, there appeared to be a number of possible READDIR extensions that could be sensibly proposed and would be useful in software build workloads.

The problem with such extensions is that they are unlikely to be used unless complementary work is done to provide useful APIs to provide access to any helpful protocol extensions.  Although such work is out-of-scope for the working group, the working group has to avoid investing in protocol extensions which, realistically, will never be used.  As a result I will not present regarding possible extensions at the 7/9 meeting but may do this later if there is interest in compatible APIs for important clients.

Miscellaneous Items.

We also had occasion to discuss some other issues regarding the agenda of the  forthcoming meeting:

  *   Chuck reminded me of my previously mentioned intention to use github for the handling of the writing and review of wg documents associated with rfc5661bis (i.e. draft-ietf-nfsv4-internationalization, draft-ietf-nfsv4-security-needs, draft-ietf-nfsv4-security, draft-ietf-nfsv4-rfc5661bis, and possibly others).   Chuck wasn't clear what exactly I might need from him to help the process and it turned out, I wasn't sure either.  We agreed that I would mention the issues in my general slides on the rfc5661bis process and that I'd give Chuck an early opportunity to respond to those.
  *   Chuck pointed out the issue of the length of rfc5661 and the need to address concerns about that.  I was concerned that previous suggestions in this regard (with regard to pNFS file) might result in multiple documents that don't fit together all that well.   I agreed to look at other approaches to the issue and present those to the working group at the 7/9 meeting.






Trond Myklebust
Linux NFS client maintainer, Hammerspace