Re: [nfsv4] Notes regarding discussion of directory scalabiliy issues

Rick Macklem <rmacklem@uoguelph.ca> Sun, 05 July 2020 22:59 UTC

From: Rick Macklem <rmacklem@uoguelph.ca>
To: David Noveck <davenoveck@gmail.com>
CC: Trond Myklebust <trondmy@hammerspace.com>, NFSv4 <nfsv4@ietf.org>
Date: Sun, 05 Jul 2020 22:59:07 +0000
Message-ID: <QB1PR01MB3364B5C56FDEF69340598E8EDD680@QB1PR01MB3364.CANPRD01.PROD.OUTLOOK.COM>
References: <CADaq8jev+tUs=mrGDMnZMpfmQXL=KLwDKW5S-CbBLpL-54RJTA@mail.gmail.com> <3fc0af37d7d870eeb6ab854a75d6eeb5aae61a0d.camel@hammerspace.com> <CADaq8jcb5BLiE49SyS3wxbbDmz88GJNWeLgeh5XU4oGkdBuJmw@mail.gmail.com> <5d9b4f697ec698a7f07e8168f56826dbb52e234b.camel@hammerspace.com> <CADaq8jenX2Su1MpMGpcvMuJncoDU+JMnMmdXaED9eYXiKrnGcA@mail.gmail.com> <QB1PR01MB3364D6132B8D515B7766A606DD6A0@QB1PR01MB3364.CANPRD01.PROD.OUTLOOK.COM> <CADaq8jdSYD5n6roZ8gZh7RHS3A0SapFuPb9s2Aq05W++d32wEA@mail.gmail.com> <QB1PR01MB336438533FDE0C39CA5F2CA2DD6A0@QB1PR01MB3364.CANPRD01.PROD.OUTLOOK.COM> <CADaq8jef9EBWprfwa-O8rE+9jZZpN3JY=z4raTF6zzto=Diymg@mail.gmail.com> <QB1PR01MB3364A0CCF975E3A4415640E8DD680@QB1PR01MB3364.CANPRD01.PROD.OUTLOOK.COM>, <CADaq8jfiV5vC-=3phrcV1MwBNO1T08jsqD3QjEWSevgxM-TjSw@mail.gmail.com>
In-Reply-To: <CADaq8jfiV5vC-=3phrcV1MwBNO1T08jsqD3QjEWSevgxM-TjSw@mail.gmail.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/nfsv4/jABRaIurrG2DaKGZomcn36H3kMs>
Subject: Re: [nfsv4] Notes regarding discussion of directory scalabiliy issues

David Noveck wrote:
>Rick Macklem  wrote:
>>David Noveck wrote:
[stuff snipped]
>>>My feeling is that it is barely sufficient but we will need to discuss the possible
>>> need for further extensions at the meeting and on the list. The reasons
>>> I'm not sanguine about this approach are:
>>>
>>>  *   It is a problem for latency, which is a concern for common operations
>>> like OPEN-create.
>>>  *   There are troubling synchronization problems, given that the
>>> notification can be received before the request response or seconds after.
>>
>>Yes. Have you read the first para of RFC5661 sec. 18.39.4 lately:
>
>No.
>
>>   Directory delegations provide the benefit of improving cache
>>   consistency of namespace information.  This is done through
>>   synchronous callbacks.  A server must support synchronous callbacks
>>   in order to support directory delegations.  In addition to that,
>>   asynchronous notifications provide a way to reduce network traffic as
>>   well as improve client performance in certain conditions.
>
>Thanks for digging into this.  I appreciate your efforts to shed light on this,
> especially since I find it unlikely that they were "fun".
Kinda interesting, which can make it fun. Recently on a FreeBSD mailing list
there was a discussion about poor performance accessing a directory with
10 million entries over NFS, so at least some people care about this.

>>When I read this, I have no idea what "synchronous callbacks" refers to?
>
>My interpretation is that the server waits for the callback response before
> sending the reply. It makes sense but I'm not sure it's right.
>
>>It almost hints that certain callbacks are done synchronously and
>> then others asynchronously.
>
>It does more than hint.  It implies it strongly.
>
>>Maybe the original author was trying to address
>>this problem.
>
>Yes.
>
>>Could it be that "asynchronous notifications" was not meant
>>to be all the notifications?
>
>It could.
>
>>>  *   Unless you are prepared to lock the directory on the client until the callback
>>> is received (not so easy to do since the callback is asynchronous), you face
>>> even worse synchronization problems as you receive multiple
>>> notifications (in-order but not organized by request).
>>>
>>>If we want to do this in 4.1 (e.g. Linux is OK without cookie monotonicity),
>>> then we are going to have to fix up issues with regard to dirent_notif_delay.
>>
>>I would be interested in hearing if having monotonically increasing directory
>>offset cookies would make the Linux client implementation easier?
>>
>>>  This is the minimum number of seconds to delay sending the notification.
>>>   We'd have to change this so that it didn't apply to the same client case.
>>>  It pretty clearly is designed for the other-client case.
>>>  I'm pretty sure we could do this in the rfc5661bis context, along
>>> with some other cleanup in this area.
>>
>>Actually, my reading of RFC5661 is that dirent_notif_delay only applies to
>>attribute notifications.
(To correct this, I meant dir_notif_delay, not dirent_notif_delay.)
>
>I don't agree. See below.
>
>>See this para in pg 582:
>>   NOTIFY4_CHANGE_CHILD_ATTRS/NOTIFY4_CHANGE_DIR_ATTRS
>>      The client will use the attribute mask to inform the server of
>>     attributes for which it wants to receive notifications.  This
>>      change notification can be requested for changes to the attributes
>>      of the directory as well as changes to any file's attributes in
>>      the directory by using two separate attribute masks.  The client
>>      cannot ask for change attribute notification for a specific file.
>>      One attribute mask covers all the files in the directory.  Upon
>>      any attribute change, the server will send back the values of
>>      changed attributes.  Notifications might not make sense for some
>>      file system-wide attributes, and it is up to the server to decide
>>      which subset it wants to support.  The client can negotiate the
>>      frequency of attribute notifications by letting the server know
>>     how often it wants to be notified of an attribute change.  The
>>      server will return supported notification frequencies or an
>>      indication that no notification is permitted for directory or
>>      child attributes by setting the dir_notif_delay and
>>      dir_entry_notif_delay attributes, respectively.
>
>I think you are reading too much into the fact that this is the only place that these
> delays are explicitly mentioned.  Maybe I'm relying too much on the fact that
> delay/batching of directory entry changes makes sense in the other-client case.
Well, there is also the definition of dir_notif_delay:
5.11.1.  Attribute 56: dir_notif_delay

   The dir_notif_delay attribute is the minimum number of seconds the
   server will delay before notifying the client of a change to the
   directory's attributes.

Although Trond did say that the directory offset was an attribute, I would have
assumed "attribute" in the above sentence referred to fattr4 items.

Now, I am not sure about batched notifies, but it could be that the author
was thinking of multiple changes caused by the same compound RPC?

>>Called dir_entry_notif_delay here, just to try and confuse us;-)
>>Since attribute change notifies do not go to the client that created/removed
>>the entry, I think the above is addressed?
>>
>>That brings us back to "synchronous callbacks". Although it is not clear,
>>I might contend that the original author might have intended that the
>>NOTIFY4_ADD_ENTRY/NOTIFY4_REMOVE_ENTRY/NOTIFY4_RENAME_ENTRY
>>be done synchronously?
>>
>Might have but given the possibility that a number of clients might hold
> delegations, issues regarding unresponsive clients would have to be
> addressed to go down this path.
That is always going to be a problem. When a server receives a conflicting Open,
file delegations need to be CB_RECALL'd. All the server can do is try for a while
and then give up.

>Another possibility is that callbacks to the same client were intended to be
> synchronous.
I think this might be a desirable trait. I think the client that has done the RPC
that adds/removes a name from the directory will need to "wait" until the
add/remove notification arrives before unlocking the directory for reading.
For example, for "add", if it does not keep the directory locked,
the entry will not be seen by a readdir(), although it exists and can be Open'd.
As such, I think the delay should be minimal for this client.
I think it is somewhat less critical for the other client(s) that hold
a delegation on the directory.

This para, from sec. 10.9.2 seems relevant:
   In addition to asking for delegations, a client can also ask for
   notifications for certain events.  These events include changes to
   the directory's attributes and/or its contents.  If a client asks for
   notification for a certain event, the server will notify the client
   when that event occurs.  This will not result in the delegation being
   recalled for that client.  The notifications are asynchronous and
   provide a way of avoiding recalls in situations where a directory is
   changing enough that the pure recall model may not be effective while
   trying to allow the client to get substantial benefit.  In the
   absence of notifications, once the delegation is recalled the client
   has to refresh its directory cache; this might not be very efficient
   for very large directories.

Although it states the notifications are asynchronous, it also states the
notifications are done "when that event occurs". It does not indicate that
there may be a significant delay before doing the notification.

>I'll do my best to involve the original author in the discussion.
>
>
>>Anyhow, I agree that this needs clarification.

rick


>I considered returning the necessary info in a bunch of new ops but the whole
> thing got waaay too complicated.  I think the best choice would be a
> GET_NOTIFICATION op to be issued after the CREATE, OPEN, LINK, REMOVE,
> RENAME.
>
As above, the client doing the addition/deletion does get the CB_NOTIFY.

>>For example, for a simple case of a UFS file system on a server, the UFS
>>directory consists of blocks of directory entries.
>>- When an entry is added, it goes in a gap within the directory or grows a
>>   new block at the end (if I vaguely recall this correctly;-).
>>- When an entry is removed, the entry is erased without moving other
>>   entries.
>>For this example, the directory offset cookie can simply be the byte offset
>>of the entry.
>>(I haven't looked at other file systems, but hopefully offset cookies with the
>> above properties can be created for most of them?)
>
>Possibly, but given the need for this, I'm not inclined to rely on hope.
If the server cannot do it, the new read-only attribute would be false and
(at least the FreeBSD client) would choose not to get directory delegations.
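
To make the UFS example above concrete, here is a minimal sketch of a directory whose cookies are byte offsets. It is a toy model, not real UFS: the fixed 32-byte entry size, the class name SlotDirectory and the "strictly after" readdir argument are all assumptions made for illustration. It only shows the two properties being discussed: adds fill a gap or grow the end, and removes never move the remaining entries, so extant cookies stay valid and the listing order follows increasing cookie values.

```python
from typing import List, Optional, Tuple

ENTRY_SIZE = 32  # hypothetical fixed entry size in bytes (real UFS entries vary)


class SlotDirectory:
    """Toy UFS-like directory: one name per slot, cookie = byte offset of slot."""

    def __init__(self) -> None:
        self.slots: List[Optional[str]] = []

    def add(self, name: str) -> int:
        """Place name in the first gap, else grow at the end; return its cookie."""
        for i, existing in enumerate(self.slots):
            if existing is None:
                self.slots[i] = name
                return i * ENTRY_SIZE
        self.slots.append(name)
        return (len(self.slots) - 1) * ENTRY_SIZE

    def remove(self, name: str) -> int:
        """Erase the entry in place without moving others; return its cookie."""
        i = self.slots.index(name)
        self.slots[i] = None
        return i * ENTRY_SIZE

    def readdir(self, after: int = -1) -> List[Tuple[int, str]]:
        """Return (cookie, name) for live entries whose offset is > after."""
        return [(i * ENTRY_SIZE, n)
                for i, n in enumerate(self.slots)
                if n is not None and i * ENTRY_SIZE > after]
```

In this model a removed entry's offset can later be reused by a new name, which is the remove-then-add-at-the-same-offset case mentioned further down.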

>When an entry is added/deleted, the server issues a callback to the client
>with the new directory cookie offset (and the directory entry for the add case).
>
>I'd prefer that the same information be returned as in the existing directory
> notification scheme, which allows you to maintain directory order and cookies
> without tying one to the other.
I was just being lazy. I hadn't read the section of RFC5661 in a long time and
didn't remember the terminology. (I didn't mean to imply "new callbacks" were
needed.) There was the issue w.r.t the client doing the addition/deletion
getting a callback (which it appears it does).


>>The client can maintain this structure in any number of ways (and it could be
>>fun figuring out what works well), but a trivial version could be:
>>- The reply to a readdir is kept as a list head (with the directory offset of the
>>   first entry) and a linked list of the entries in order, with their cookie.
>>  (Remember that the cookies are in the same ordering as the entries
Well, I do this for fun. If it isn't fun, I won't be doing it. (I know someone
doing this as a retirement hobby is weird, but since vendors do not
invest in clients much...)

>Not in my scheme.
>
>>- The next readdir reply creates the next head/list.
>>
>>readdir() just works through each list, following each head in order.
>>telldir() returns the cookie for the entry.
>>seekdir() just finds the correct list and then searches down that list for a match.
>>
>>The remove/add entry callbacks just insert/delete entrie(s) in the appropriate
>>list.
>>(I'd probably keep these lists in the kernel client under the VFS for FreeBSD,
>> as malloc'd data structures, but that is simply an implementation choice.)
>>
>>You would require the monotonically increasing property for directory
>>delegations to be issued. (Without that you don't know where to insert
>>additions.)
>>
>The insert notification gives you enough information to do this without the
> monotonically increasing property.
That is true, but then why hasn't anyone implemented it?
(Yes, I was "misusing" the term require. I'm not a spec. writer, as you can easily tell.)
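
For what it is worth, the trivial client structure described in the quoted text could be sketched roughly as follows. This is only an illustration under stated assumptions: it collapses the per-READDIR-reply lists into one list sorted by cookie, it assumes the server's cookies increase monotonically in directory order, and the names CachedDirectory, notify_add, notify_remove and readdir_after are made up here (they are not CB_NOTIFY field names or FreeBSD functions).

```python
import bisect
from typing import Iterable, List, Optional, Tuple


class CachedDirectory:
    """Client-side directory cache kept sorted by server cookie."""

    def __init__(self) -> None:
        self.cookies: List[int] = []   # sorted server cookies
        self.names: List[str] = []     # names[i] belongs to cookies[i]

    def add_readdir_reply(self, entries: Iterable[Tuple[int, str]]) -> None:
        """Merge the (cookie, name) pairs from one READDIR reply."""
        for cookie, name in entries:
            self.notify_add(cookie, name)

    def notify_add(self, cookie: int, name: str) -> None:
        """Apply an add notification; the cookie alone tells us where it goes."""
        i = bisect.bisect_left(self.cookies, cookie)
        if i < len(self.cookies) and self.cookies[i] == cookie:
            self.names[i] = name       # offset reused after an earlier removal
        else:
            self.cookies.insert(i, cookie)
            self.names.insert(i, name)

    def notify_remove(self, cookie: int) -> None:
        """Apply a remove notification; other entries keep their cookies."""
        i = bisect.bisect_left(self.cookies, cookie)
        if i < len(self.cookies) and self.cookies[i] == cookie:
            del self.cookies[i]
            del self.names[i]

    def readdir_after(self, cookie: Optional[int]) -> Optional[Tuple[int, str]]:
        """Return the first entry after cookie (None means start of directory)."""
        i = 0 if cookie is None else bisect.bisect_right(self.cookies, cookie)
        if i == len(self.cookies):
            return None
        return self.cookies[i], self.names[i]
```

With this shape, telldir() can hand back the server cookie of the last entry returned and seekdir() just stores it, so readdir_after() resumes at the right place even if that particular entry has since been removed. Without the monotonic property, the insert position would have to come from the notification's own ordering information instead of a cookie comparison.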

>>You would also require that extant directory entry cookies
>>remain valid and unchanged when additions/deletions occur.
>>(Note that removing an entry and then adding an entry at the same offset
>> is allowed under POSIX telldir()/seekdir() as I understand it.)
>
>I think the requirement goes away when the delegation does.
Yes, although in practice, assuming the client(s) are getting the notifications,
I only see them returning a directory delegation after a closedir() and if the
number of CB_NOTIFYs is large or the caching storage needs to be freed.

>>This avoids any need for the client to synthesize cookies and just use the ones
>>returned by the server, I think?
>
>Yes
>
>>Just a simple idea that may be worth considering?
>>(If this has already been discussed,
>
>Don't think it has.
>
>>I apologize for not seeing it.)
>
>Even if it had, no apology would be necessary
I should have re-read the appropriate sections of RFC5661 before the last post.

rick

________________________________________
From: nfsv4 <nfsv4-bounces@ietf.org> on behalf of David Noveck <davenoveck@gmail.com>
Sent: Thursday, July 2, 2020 6:06 AM
To: Trond Myklebust
Cc: nfsv4@ietf.org
Subject: Re: [nfsv4] Notes regarding discussion of directory scalabiliy issues

On Tuesday, June 30, 2020, Trond Myklebust <trondmy@hammerspace.com> wrote:
On Tue, 2020-06-30 at 08:40 -0400, David Noveck wrote:
Thanks for your helpful comments.

On Mon, Jun 29, 2020 at 2:43 PM Trond Myklebust <trondmy@hammerspace.com> wrote:
Like it or not, the readdir cookie is an attribute of the directory.


If the protocol treated them as such, then the attribute notifications feature could provide updates to the client.  Given that it doesn't, we could add a cookie update feature to the directory notification feature as a v4.2 extension to the protocol.  However, I'm reluctant to start work on the necessary protocol additions until we are sure they are needed to provide better directory cacheability.

Actually, they are attributes of directory streams.  The difference is not all that important given that client implementations are unlikely to be aware of the specific stream associated with any particular request.  However, there are a few cases in which the difference is important in determining whether various approaches to client handling of cookies might or might not work, and will be important in the discussion below:

  *   Two requests made on different clients necessarily are made on distinct streams.
  *   Two requests made on different instances of the same client (with an intervening restart/reboot) also have to arise on different streams.

If I want to support the POSIX telldir() and seekdir() operations ( https://pubs.opengroup.org/onlinepubs/9699919799/functions/seekdir.html ), then I need to ensure that when the application calls seekdir(), I return to the exact same cursor location in the stream that I was at when I called telldir().


Agreed.

Without a server side cookie on which to anchor my telldir() cookies,


Every client has these available but it is not clear to me how useful such anchoring is.  I think the flexibility that each client has to assign cookies to streams it is responsible for is valuable and could be compromised if anchoring to the server cookies is made the focus of the implementation.

then all I have is a list of filenames that can and will change every time a file is created, deleted or renamed.


Clearly it will change.  However, the directory notifications feature makes some assumptions, currently implicit about how the list will change.   Once these are made explicit, the wg could decide that server/fs pairs incapable of staying within these reasonable restrictions (if they are, in fact, reasonable), cannot support the directory notifications feature.

Both the length and ordering of that list may change whenever the directory is modified,


Clearly the length will change, but the reasonable expectation is that creating a file will increase the length by one and deleting one will decrease it by one.   I don't see the value of supporting directory notifications on server fs that do something else.

With regard to ordering, I suppose the spec allows an fs to shuffle the directory order every time a change is made, but I'm unaware of any actual file systems that do this.  Do we need to support directory notifications for such fs's?


touch foo; touch bar; ln foo baz; rm foo; mv baz foo

There... Most filesystems will end up reordering 'foo' and 'bar' in the directory stream given the above sequence of commands. How does the client figure out what happened if the above sequence of commands is performed on the server?
Now let's say that is a directory of a million files, and something like the above is made to happen regularly. How do I maintain a stable list of synthetic cookies on the client?

I think you are right about there being cases in which it is impossible, but we either disagree or are simply talking past one another about other cases.

If the caching client is making the directory changes, then I agree this cannot be done and you are stuck having to refetch potentially large directories to deal with new READDIR requests☹️

Where we might disagree is the case in which another client is making the change.  In that case directory notifications would allow you to avoid repeated READDIR ops, whether you are providing the user synthetic or server-based cookies.

My talk on directory caching will discuss the possibility of v4.2 extensions to address the same-client directory caching issue, as well as possible clarifications regarding directory delegation/notification in v4.1.


meaning that a naive implementation


OK.   I'll plead guilty to one misdemeanor count of directory naivety.

of synthetic cookies as an offset is not compatible with the telldir()/seekdir() requirements.


It's not clear to me how this incompatibility would manifest itself.  I think I need to understand what would break.
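
One way to picture what could break is the following sketch. It is not taken from any real client: NaiveDirStream is a hypothetical stream whose telldir() value is simply an index into the client's cached list, and the reordering from [foo, bar] to [bar, foo] is assumed as the outcome of the command sequence above rather than derived from any particular file system.

```python
from typing import List, Optional


class NaiveDirStream:
    """Directory stream whose synthetic cookie is just an index into a list."""

    def __init__(self, listing: List[str]) -> None:
        self.listing = listing   # shared cached listing, updated on changes
        self.pos = 0

    def readdir(self) -> Optional[str]:
        if self.pos >= len(self.listing):
            return None
        name = self.listing[self.pos]
        self.pos += 1
        return name

    def telldir(self) -> int:
        return self.pos          # synthetic cookie = offset into the list

    def seekdir(self, cookie: int) -> None:
        self.pos = cookie


listing = ["foo", "bar"]
stream = NaiveDirStream(listing)

first = stream.readdir()        # application reads the first entry
cookie = stream.telldir()       # and remembers where it was

listing[:] = ["bar", "foo"]     # assumed reordering after the directory changes

stream.seekdir(cookie)          # application returns to the saved position
second = stream.readdir()       # and reads the "next" entry
```

Under these assumptions the application is handed foo twice and never sees bar, although bar existed the whole time. A cookie anchored to the entry itself (server-based, or a synthetic one mapped to a name) would not shift this way; an index into a list that can be reordered does.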

To make matters worse, the list size is for all intents and purposes unbounded, because there is no hard limit on the size of a directory. That makes it also impossible to create a cached mapping between a synthetic cookie and a filename; such a mapping would be unbounded both in size and in duration (since we don't know a priori how long the application will keep the directory open, or for that matter, which exact set of cookies it may have cached).


Such a mapping would, in essence, be part of the cached directory.  So, if it is too big to keep in client memory, then it is too big to cache and you might as well decide not to cache it.

I expect there is an issue that is a worry in the case in which a reasonably sized directory grows over time to be too big to cache while an open directory stream retains some directory cookies which might be incompatible with the client dropping caching of directories and switching to server-based cookies.😖
I feel it is reasonable to treat this situation as one might a cookie-verifier failure, particularly if this is the only worrisome failure mode.  However, this possibility means that I would not ask clients to implement such local cookies. To enable that, we would have to make explicit the same sort of reasonableness requirement for cookie changes that we have already discussed for ordering changes.  RFC7530 already alludes to the need to avoid spurious cookie invalidations, although not in as explicit or strict a way as we would need to support directory notifications:

   As there is no way for the client to indicate that a cookie value,
   once received, will not be subsequently used, server implementations
   should avoid schemes that allocate memory corresponding to a returned
   cookie.  Such allocation can be avoided if the server bases cookie
   values on a value such as the offset within the directory where the
   scan is to be resumed.

   Cookies generated by such techniques should be designed to remain
   valid despite modification of the associated directory.  If a server
   were to invalidate a cookie because of a directory modification,
   READDIRs of large directories might never finish.


So in order to make this work the client would basically have to create its own B-tree and persist it in storage somewhere.


I don't see the need to make this persistent.  If the client restarts, all directory streams have ceased to exist and we know a posteriori that there are no outstanding directory cookies to which the client would have to respond.

--

Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com