[nfsv4] Re: RFC: new MUTEX_BEGIN/MUTEX_END operations

Rick Macklem <rick.macklem@gmail.com> Tue, 20 August 2024 22:43 UTC

To: Jeff Layton <jlayton@poochiereds.net>
CC: NFSv4 <nfsv4@ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/nfsv4/wYa2om_SglIg1mm7ZyH0HH_560E>

On Tue, Aug 20, 2024 at 2:57 PM Jeff Layton <jlayton@poochiereds.net> wrote:

> On Tue, 2024-08-20 at 13:35 -0700, Rick Macklem wrote:
> >
> >
> > On Tue, Aug 20, 2024 at 1:29 PM Rick Macklem <rick.macklem@gmail.com> wrote:
> > >
> > >
> > > On Tue, Aug 20, 2024 at 4:49 AM Jeff Layton <jlayton@poochiereds.net> wrote:
> > > > On Thu, 2024-08-08 at 13:36 -0700, Rick Macklem wrote:
> > > > > Hi,
> > > > >
> > > > > Over the years, I've run into cases where it would be really
> > > > > nice to be able to perform multiple NFSv4 operations on a
> > > > > file without other operations done by other clients
> > > > > "gumming up the works" by changing the file's data/metadata
> > > > > between the operations in the compound.
> > > > >
> > > > > So, what do others think about an extension to NFSv4.2 that
> > > > > adds 2 new operations:
> > > > >   MUTEX_BEGIN(CFH)
> > > > >   MUTEX_END(CFH)
> > > > > Both would use the CFH as argument, and no other client would
> > > > > be allowed to perform operations on the CFH between the MUTEX_BEGIN
> > > > > and MUTEX_END.
> > > > >
> > > > > I think there would need to be a couple of properties for these:
> > > > > - There would need to be an "implicit" MUTEX_END when any operation
> > > > >   between MUTEX_BEGIN and MUTEX_END returns a status other than
> > > > >   NFS4_OK.
> > > > > - I think you would want a restriction of only one mutex for one CFH
> > > > >   at a time in a compound. Without that, there could easily be
> > > > >   deadlocks caused by other compounds acquiring mutexes on the same
> > > > >   CFHs in a different order.
> > > > > - Only one compound can hold a mutex on a given CFH at any time.
> > > > > - MUTEX_BEGIN/MUTEX_END can only be used in compounds where SEQUENCE
> > > > >   is the first operation.
> > > > > - All mutexes are discarded by a server when it crashes/recovers.
> > > > >   (Any time a client receives a NFS4ERR_STALE_CLIENTID.)
> > > > >   That way RPC retries after a server reboot should work OK, I think?
> > > > >
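The properties listed above could be modeled on the server as a table keyed by file handle. What follows is a minimal, single-threaded Python sketch under stated assumptions, not part of the proposal: `MutexTable`, `run_compound`, and the `(name, handler)` encoding of operations are hypothetical; the current FH is passed in and treated as fixed for the whole compound; and a conflicting MUTEX_BEGIN returns NFS4ERR_DELAY instead of actually blocking until the holder's MUTEX_END.

```python
from typing import Callable, Dict, List, Optional, Tuple

NFS4_OK = "NFS4_OK"
NFS4ERR_INVAL = "NFS4ERR_INVAL"
NFS4ERR_DELAY = "NFS4ERR_DELAY"  # assumption: stands in for "wait for MUTEX_END"
NFS4ERR_OP_NOT_IN_SESSION = "NFS4ERR_OP_NOT_IN_SESSION"

# (operation name, handler returning a status); MUTEX_* need no handler.
Op = Tuple[str, Optional[Callable[[], str]]]


class MutexTable:
    """Hypothetical per-FH mutexes: at most one compound holds each FH."""

    def __init__(self) -> None:
        self.owner: Dict[str, int] = {}  # FH -> id of the holding compound

    def begin(self, fh: str, compound_id: int) -> bool:
        holder = self.owner.get(fh)
        if holder is not None and holder != compound_id:
            return False
        self.owner[fh] = compound_id
        return True

    def end(self, fh: str, compound_id: int) -> None:
        if self.owner.get(fh) == compound_id:
            del self.owner[fh]

    def discard_all(self) -> None:
        """Server crash/recovery: no mutex survives a restart."""
        self.owner.clear()


def run_compound(table: MutexTable, compound_id: int, cfh: str,
                 ops: List[Op]) -> List[Tuple[str, str]]:
    """Execute ops in order, stopping at the first non-OK status."""
    results: List[Tuple[str, str]] = []
    held: Optional[str] = None             # FH whose mutex we hold, if any
    in_session = bool(ops) and ops[0][0] == "SEQUENCE"
    for name, handler in ops:
        if name == "MUTEX_BEGIN":
            if not in_session:
                status = NFS4ERR_OP_NOT_IN_SESSION
            elif held is not None:
                status = NFS4ERR_INVAL     # only one mutex at a time
            elif not table.begin(cfh, compound_id):
                status = NFS4ERR_DELAY     # another compound holds this FH
            else:
                held, status = cfh, NFS4_OK
        elif name == "MUTEX_END":
            if held is not None:
                table.end(held, compound_id)
                held = None
            status = NFS4_OK
        else:
            status = handler() if handler is not None else NFS4_OK
        results.append((name, status))
        if status != NFS4_OK:
            break
    if held is not None:
        # Implicit MUTEX_END: an op failed, or MUTEX_END never came
        # (assumption: a mutex never outlives its compound).
        table.end(held, compound_id)
    return results
```

For example, a compound of SEQUENCE, MUTEX_BEGIN, a VERIFY that fails, WRITE, MUTEX_END stops at the VERIFY and leaves the table empty, so no other compound is left waiting on the FH.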
> > > > > I am not sure what the semantics for reading data/metadata should be,
> > > > > but I was thinking that reads would be allowed by compounds from
> > > > > other clients for the CFH. If a client wanted to serialize against
> > > > > other compounds for the CFH, it could do a MUTEX_BEGIN/MUTEX_END.
> > > > >
> > > > > I see this as useful in a variety of ways:
> > > > > - The example in the previous email of:
> > > > >   MUTEX_BEGIN
> > > > >   NVERIFY acl_truform ACL_MODEL_NFS4
> > > > >   SETATTR posix_access_acl
> > > > >   MUTEX_END
> > > > > - Append writing:
> > > > >   MUTEX_BEGIN
> > > > >   VERIFY size "offset in WRITE that follows"
> > > > >   WRITE "offset" etc
> > > > >   MUTEX_END
> > > > > - A bunch of cases where NFSv4 lacks the postop_attributes
> > > > >   that were in NFSv3.
> > > > >   MUTEX_BEGIN
> > > > >   WRITE
> > > > >   GETATTR size, change,..
> > > > >   MUTEX_END
> > > > >
> > > > > So, what do others think?
> > > > > (This was obviously not possible without sessions.)
> > > > >
> > > >
> > > > It's an interesting idea. Quite a bit different than LOCK/LOCKU for
> > > > sure.
> > > >
> > > > The difficulty here will be in defining what activity this mutex
> > > > blocks. Just other MUTEX_BEGINs, or is this a mandatory lock for all
> > > > access to that filehandle?
> > > >
> > >
> > > I think that blocking only other MUTEX_BEGINs would not be particularly
> > > useful, since there will be clients doing RPCs without any MUTEX_BEGINs
> > > for a long time.
> > >
> > > My original thinking was all operations in other compounds that use
> > > the FH (either CFH or SAVEFH) would block until MUTEX_END.
> > > (I'd even like to include NFSv3 RPCs, but that might be a bridge
> > > too far?)
> > >
> > > > Assuming the latter, would READ or GETATTR
> > > > activity also be blocked? If not, then we'll have to have a list of
> > > > operations that are blocked by the mutex, and new operations will have
> > > > to have their semantics spelled out clearly vs. MUTEX_BEGIN.
> > > >
> > >
> > > I suppose blocking only the operations that modify the file (where the
> > > change attribute changes) is possible. I do think this makes it more
> > > complex to implement (a lot more difficult to implement for FreeBSD, at
> > > least) and I am not sure which would be preferred?
> > >
> > > For example, suppose a compound does:
> > > MUTEX_BEGIN
> > > SETATTR (size of 0)
> > > WRITE
> > > GETATTR (size, change,...)
> > > MUTEX_END
> > > And then another compound does:
> > > READ
> > > GETATTR (size, change,...)
> > > --> If the READ is not blocked until MUTEX_END, it might see EOF
> > >     for the file size of 0 and then see a size of non-zero for
> > >     the GETATTR.
> > >     --> Could happen now (without MUTEX_BEGIN/MUTEX_END),
> > >         so not a big deal, but it *might* be
> > >         better to block the READ in this case?
> > >
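The difference between the two blocking policies can be seen by replaying that interleaving step by step. In this hypothetical Python sketch the file is reduced to its size, `MODIFYING_OPS` is an illustrative subset rather than a complete list, the holder's own GETATTR is omitted, and "waiting" is modeled by queueing the other client's operations until MUTEX_END.

```python
from typing import List

# Illustrative subset of operations that modify a file (change attr bumps).
MODIFYING_OPS = {"WRITE", "SETATTR"}


def must_wait(op: str, policy: str, fh_is_mutexed: bool) -> bool:
    """Must `op` from another compound wait until MUTEX_END?"""
    if not fh_is_mutexed:
        return False
    if policy == "block_all":
        return True
    if policy == "block_modifying":
        return op in MODIFYING_OPS
    raise ValueError(f"unknown policy {policy!r}")


def replay(policy: str) -> List[str]:
    """Replay one fixed interleaving of the two compounds and return what
    the other client's READ and GETATTR observe."""
    schedule = [("holder", "MUTEX_BEGIN"), ("holder", "SETATTR"),
                ("other", "READ"), ("holder", "WRITE"),
                ("other", "GETATTR"), ("holder", "MUTEX_END")]
    size = 4096                      # file size before either compound
    mutexed = False
    waiting: List[str] = []          # other client's ops held back
    seen: List[str] = []

    def run_other(op: str) -> None:
        if op == "READ":
            seen.append("READ:EOF" if size == 0 else "READ:data")
        else:
            seen.append(f"GETATTR:size={size}")

    for who, op in schedule:
        if who == "holder":
            if op == "MUTEX_BEGIN":
                mutexed = True
            elif op == "SETATTR":
                size = 0             # truncate to 0
            elif op == "WRITE":
                size = 100           # write 100 bytes at offset 0
            elif op == "MUTEX_END":
                mutexed = False
                for queued in waiting:  # released ops run in their order
                    run_other(queued)
                waiting.clear()
        elif waiting or must_wait(op, policy, mutexed):
            waiting.append(op)       # keep the other compound in order
        else:
            run_other(op)
    return seen
```

With `"block_modifying"`, `replay` returns `['READ:EOF', 'GETATTR:size=100']`, the inconsistent view described above; with `"block_all"` both of the other client's operations run after MUTEX_END and it returns `['READ:data', 'GETATTR:size=100']`.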
> > > As far as implementing it, I do not know what other server
> > > setups are, but for FreeBSD, doing the "block all operations
> > > on the FH by other RPCs" is by far the easiest.
> > > - FreeBSD uses a shared/exclusive lock on vnodes (think inodes)
> > >   during each operation in a compound for CFH (and SAVEFH, if it
> > >   exists). All the MUTEX_BEGIN needs to do is acquire the exclusive
> > >   vnode lock for CFH and keep it until MUTEX_END (normally, it is
> > >   unlocked after the operation).
> > >   --> Using a shared lock on the vnode would allow READ/GETATTR etc
> > >       to be done by other compounds when between MUTEX_BEGIN/MUTEX_END
> > >       but, at least for FreeBSD, a safe upgrade to an exclusive lock
> > >       (required by operations like WRITE) cannot be done.
> > >       --> So, implementing "only block operations that modify the change
> > >           attribute" would require some other (rather odd) locking
> > >           mechanism layered on top of the vnode locking. Doable, but not
> > >           fun.
> > >
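The two vnode-lock designs can be contrasted with a toy shared/exclusive lock. This Python sketch does not use FreeBSD's real vnode-locking interfaces; `VnodeLock` and its non-blocking `try_*` methods are hypothetical, chosen so the example runs deterministically. It assumes an upgrade can only be atomic when the caller is the sole shared holder, so the fallback is to drop the shared lock and reacquire it exclusively, which lets another writer in.

```python
from typing import Optional, Set


class VnodeLock:
    """Hypothetical shared/exclusive lock with non-blocking operations."""

    def __init__(self) -> None:
        self.shared: Set[str] = set()         # compounds holding it shared
        self.exclusive: Optional[str] = None  # compound holding it exclusive

    def try_shared(self, who: str) -> bool:
        if self.exclusive is not None:
            return False
        self.shared.add(who)
        return True

    def try_exclusive(self, who: str) -> bool:
        if self.exclusive is not None or self.shared:
            return False
        self.exclusive = who
        return True

    def try_upgrade(self, who: str) -> bool:
        """Atomic shared->exclusive upgrade: only for the sole shared holder."""
        if self.shared != {who}:
            return False
        self.shared.clear()
        self.exclusive = who
        return True

    def release(self, who: str) -> None:
        if self.exclusive == who:
            self.exclusive = None
        self.shared.discard(who)


# Design 1: MUTEX_BEGIN takes the exclusive lock and keeps it until
# MUTEX_END, so even another compound's READ/GETATTR has to wait.
lock = VnodeLock()
assert lock.try_exclusive("A")    # A: MUTEX_BEGIN
assert not lock.try_shared("B")   # B: READ must wait
lock.release("A")                 # A: MUTEX_END
assert lock.try_shared("B")       # B: READ proceeds
lock.release("B")

# Design 2: MUTEX_BEGIN takes only a shared lock so READs can proceed,
# but A's WRITE then needs an upgrade.
assert lock.try_shared("A")       # A: MUTEX_BEGIN (shared)
assert lock.try_shared("B")       # B: READ is let in, as intended
assert not lock.try_upgrade("A")  # A: WRITE cannot upgrade atomically
lock.release("A")                 # non-atomic fallback: drop the lock...
lock.release("B")                 # (B's READ finishes)
assert lock.try_exclusive("C")    # ...and C's WRITE gets in ahead of A
```

In the second design, C modifies the file between A's MUTEX_BEGIN and A's WRITE, which is exactly what the mutex was meant to prevent.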
> > > Then there is the issue of avoiding deadlock...
> > > I think there are operations, like RENAME, which cannot be put between
> > > MUTEX_BEGIN/MUTEX_END without risk of deadlock (if the RENAME is using
> > > two different directories).
> > > I'll admit I have not worked through too many of these yet, but it looks
> > > like the simplest way to avoid deadlock problems is to not allow
> > > MUTEX_BEGIN to be done on directory FHs.
> > > Does this sound like too severe a restriction?
> > >
> >
> >
> > Oh, and I think operations like PUTFH, which change the CFH, need to be
> > avoided between MUTEX_BEGIN/MUTEX_END to avoid deadlock, as well.
> > If a compound does:
> > MUTEX_BEGIN (FH for foo)
> > PUTFH (FH for bar)
> > GETATTR
> > MUTEX_END
> > and another compound does:
> > MUTEX_BEGIN (FH for bar)
> > PUTFH (FH for foo)
> > GETATTR
> > MUTEX_END
> > --> They could deadlock.
> > Does not allowing PUTFH/PUTROOTFH/PUTPUBFH/RESTOREFH between MUTEX_BEGIN
> > and MUTEX_END sound like too severe a restriction?
>
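The PUTFH case is the classic lock-ordering cycle. As a hypothetical illustration (the function and the dictionaries are not from any proposal), a server could record which compound holds each FH's mutex and which FH each blocked compound is waiting for, then follow the resulting chain of waits:

```python
from typing import Dict, Optional, Set


def has_deadlock(holds: Dict[str, str], wants: Dict[str, str]) -> bool:
    """holds: FH -> compound holding that FH's mutex.
    wants: compound -> FH it is blocked waiting for.
    Follows "compound waits for the holder of the FH it wants" edges and
    returns True if any chain of waits loops back on itself."""
    for start in wants:
        visited: Set[str] = set()
        current: Optional[str] = start
        while current is not None and current not in visited:
            visited.add(current)
            fh = wants.get(current)
            current = holds.get(fh) if fh is not None else None
        if current is not None:        # revisited a compound: a cycle
            return True
    return False


# The PUTFH example: C1 holds foo and wants bar, C2 holds bar and wants foo.
holds = {"foo": "C1", "bar": "C2"}
print(has_deadlock(holds, {"C1": "bar", "C2": "foo"}))  # True
# If C1 never switches to bar, only C2 waits and C1 can finish.
print(has_deadlock(holds, {"C2": "foo"}))               # False
```

Rejecting PUTFH inside the mutex keeps the first situation from arising at all, which is simpler than detecting it at run time.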
> I don't think so, but there are other operations that can change the
> current_fh too -- OPEN, for instance. If I do a MUTEX_BEGIN on a
> directory and then do an OPEN (by name) in it, what happens to the
> mutex when the current_fh changes?
>
Yes, I think that any operation that changes CFH or uses both
CFH and SAVED_FH as arguments introduces the risk of deadlock.


> As far as deadlocks go, the simplest way to avoid them would be to just
> deny any nesting whatsoever, and only allow locking a single mutex in
> the compound.
>
If the operations above are not allowed between MUTEX_BEGIN/MUTEX_END,
and there is no nesting (I think I have mentioned that before), then I
believe deadlocks are avoided.
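The deadlock-avoidance restrictions discussed in this thread (no nesting, no MUTEX_BEGIN on a directory, and nothing inside the mutex that changes the current FH or uses both the current and saved FH) could be checked operation by operation. A minimal Python sketch under those assumptions; the operation sets are illustrative examples drawn from the thread plus a few similar NFSv4 operations, not a complete or agreed list, and `check_compound` is a hypothetical name.

```python
from typing import List, Optional, Tuple

# Illustrative, not exhaustive: ops that replace the current FH...
CHANGES_CFH = {"PUTFH", "PUTROOTFH", "PUTPUBFH", "RESTOREFH",
               "LOOKUP", "LOOKUPP", "OPEN", "CREATE"}
# ...and ops that take both the saved FH and the current FH.
USES_TWO_FHS = {"RENAME", "LINK", "COPY", "CLONE"}


def check_compound(ops: List[Tuple[str, bool]]) -> Optional[str]:
    """ops: (operation name, is the current FH a directory when it runs).
    Returns the first rule violated, or None if the compound is allowed."""
    inside = False                   # between MUTEX_BEGIN and MUTEX_END?
    for name, cfh_is_dir in ops:
        if name == "MUTEX_BEGIN":
            if inside:
                return "nested MUTEX_BEGIN"
            if cfh_is_dir:
                return "MUTEX_BEGIN on a directory"
            inside = True
        elif name == "MUTEX_END":
            inside = False
        elif inside and name in CHANGES_CFH:
            return f"{name} changes the current FH inside the mutex"
        elif inside and name in USES_TWO_FHS:
            return f"{name} uses two FHs inside the mutex"
    return None


# The PUTFH example from earlier in the thread is rejected:
print(check_compound([("SEQUENCE", False), ("PUTFH", False),
                      ("MUTEX_BEGIN", False), ("PUTFH", False),
                      ("GETATTR", False), ("MUTEX_END", False)]))
# PUTFH changes the current FH inside the mutex
```

A PUTFH before MUTEX_BEGIN is still fine, since it only selects which FH the mutex will cover.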


> We'd probably have to implement this as a per-fh thing in the Linux
> server too. One thing I've been wanting to build is a way to do a
> gated write using the change attr. This could make that possible:
>
> COMPOUND1:
> GETATTR (get the change attr)
> READ
>
> (modify the data in memory)
>
> COMPOUND2:
> MUTEX_BEGIN
> VERIFY
> WRITE
> MUTEX_END
>
> ...if the write fails, do the whole thing again.
>
> Synchronized, multi-host I/O without using file locks!
>
Yep, and you can do an append write using the same COMPOUND2,
but checking size == write-offset in the VERIFY.
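Both COMPOUND2 variants act as a compare-and-swap: VERIFY checks an attribute and the WRITE happens only if it still matches, with the mutex making the check and the write atomic with respect to other clients. This Python sketch simulates that against an in-memory file. `FakeServer`, `gated_write`, and `append` are hypothetical names; each compound method runs atomically, standing in for MUTEX_BEGIN ... MUTEX_END; and a mismatch is reported as NFS4ERR_NOT_SAME, the status VERIFY returns when the attributes differ.

```python
from typing import Callable, Optional, Tuple

NFS4_OK = "NFS4_OK"
NFS4ERR_NOT_SAME = "NFS4ERR_NOT_SAME"   # VERIFY's status on a mismatch


class FakeServer:
    """In-memory file; each compound method runs atomically."""

    def __init__(self) -> None:
        self.data = bytearray()
        self.change = 0                 # change attribute, bumped per WRITE
        # Test hook: simulates another client's compound that runs just
        # before our next COMPOUND2 acquires the mutex.
        self.hook: Optional[Callable[[], object]] = None

    def getattr_size(self) -> int:
        return len(self.data)

    def compound1(self) -> Tuple[int, bytes]:
        """GETATTR (change) + READ of the whole file."""
        return self.change, bytes(self.data)

    def compound2(self, attr: str, expected: int, offset: int,
                  buf: bytes) -> str:
        """MUTEX_BEGIN, VERIFY attr == expected, WRITE, MUTEX_END."""
        if self.hook is not None:
            hook, self.hook = self.hook, None
            hook()
        current = self.change if attr == "change" else len(self.data)
        if current != expected:
            return NFS4ERR_NOT_SAME     # VERIFY failed: implicit MUTEX_END
        end = offset + len(buf)
        if end > len(self.data):
            self.data.extend(b"\0" * (end - len(self.data)))
        self.data[offset:end] = buf
        self.change += 1
        return NFS4_OK


def gated_write(srv: FakeServer, modify: Callable[[bytes], bytes],
                max_tries: int = 10) -> int:
    """Read-modify-write gated on the change attribute; returns the number
    of attempts. Assumes `modify` never shrinks the data."""
    for attempt in range(1, max_tries + 1):
        change, data = srv.compound1()
        if srv.compound2("change", change, 0, modify(data)) == NFS4_OK:
            return attempt
    raise RuntimeError("gave up after repeated conflicts")


def append(srv: FakeServer, buf: bytes, max_tries: int = 10) -> int:
    """Append gated on size == write offset; returns the offset written."""
    for _ in range(max_tries):
        offset = srv.getattr_size()
        if srv.compound2("size", offset, offset, buf) == NFS4_OK:
            return offset
    raise RuntimeError("gave up after repeated conflicts")
```

If another client appends between our GETATTR and our COMPOUND2, the VERIFY on size fails and `append` simply retries at the new end of file, so neither write is lost.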

Another one I am interested in is...
I really like Thomas's Layout_WCC, but I do not like the fact
that it only works for NFSv3 DSs. This would allow an NFSv4
DS (assuming the DS supports MUTEX_BEGIN/MUTEX_END) to do
the same thing, I think?

rick

-- 
> Jeff Layton <jlayton@poochiereds.net>
>