From nobody Tue Apr 11 17:15:31 2023
Return-Path: <trondmy@gmail.com>
X-Original-To: nfsv4@ietfa.amsl.com
Delivered-To: nfsv4@ietfa.amsl.com
Received: from localhost (localhost [127.0.0.1])
 by ietfa.amsl.com (Postfix) with ESMTP id 46AD6C1CAB50
 for <nfsv4@ietfa.amsl.com>; Tue, 11 Apr 2023 17:15:31 -0700 (PDT)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -2.094
X-Spam-Level: 
X-Spam-Status: No, score=-2.094 tagged_above=-999 required=5
 tests=[BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1,
 DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, FREEMAIL_FROM=0.001,
 HTML_MESSAGE=0.001, RCVD_IN_DNSWL_NONE=-0.0001,
 RCVD_IN_ZEN_BLOCKED_OPENDNS=0.001, SPF_HELO_NONE=0.001,
 SPF_PASS=-0.001, URIBL_BLOCKED=0.001, URIBL_DBL_BLOCKED_OPENDNS=0.001,
 URIBL_ZEN_BLOCKED_OPENDNS=0.001] autolearn=ham autolearn_force=no
Authentication-Results: ietfa.amsl.com (amavisd-new); dkim=pass (2048-bit key)
 header.d=gmail.com
Received: from mail.ietf.org ([50.223.129.194])
 by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024)
 with ESMTP id e2crTrhz4uKN for <nfsv4@ietfa.amsl.com>;
 Tue, 11 Apr 2023 17:15:29 -0700 (PDT)
Received: from mail-yb1-xb30.google.com (mail-yb1-xb30.google.com
 [IPv6:2607:f8b0:4864:20::b30])
 (using TLSv1.3 with cipher TLS_AES_128_GCM_SHA256 (128/128 bits)
 key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256)
 (No client certificate requested)
 by ietfa.amsl.com (Postfix) with ESMTPS id D2932C19E117
 for <nfsv4@ietf.org>; Tue, 11 Apr 2023 17:15:29 -0700 (PDT)
Received: by mail-yb1-xb30.google.com with SMTP id y186so10122591yby.13
 for <nfsv4@ietf.org>; Tue, 11 Apr 2023 17:15:29 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=gmail.com; s=20221208; t=1681258527;
 h=cc:to:subject:message-id:date:from:in-reply-to:references
 :mime-version:from:to:cc:subject:date:message-id:reply-to;
 bh=lCPcPyEkGsLRkz26nX+KYxMWFySdgkG6uhAKRjoGpHw=;
 b=eM9bNLNslOC4qr0lPJIVrSqK+HZuKHixBHTUAt+p1nmvm8bdwg0GKMNsM9YrrhXAAh
 G0RHfygag+wS8uH8gcB44WkQ8G2I4HXU7rLiQc0qV68L+wBPtXhrHQLx5cGn5Ff6Vaj7
 3Bg/sm1axmoQQigOQgZCJu+Boxj+iG2IyXOQcVnMAGEjbUnkDobNgb7CIvwLulWfASfS
 N06YaBGgoU+6UO6FbfLUYfog5EbZLjVbj1wiRNAu/sHxDsKNPurQpsAxnnALLQSPvSYb
 4MgOGYkpEcmao/Uy8v44a7t5NKr3MYQCjprLQu1P1yF49DjjbTbbxIVnTXwXyyXyAug5
 3iDg==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20210112; t=1681258527;
 h=cc:to:subject:message-id:date:from:in-reply-to:references
 :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id
 :reply-to;
 bh=lCPcPyEkGsLRkz26nX+KYxMWFySdgkG6uhAKRjoGpHw=;
 b=akjAa7y+Q71d0aIvoM88gN5GAgNZmsHot2V+2muL7dIc4sUBa7bECW1J+yYXFgEQLq
 FyIV2KJO3mTwMdt9Ne1q0STRA6gNRypDVQAkCok1is3aaLU+adMvwrZUiviQKTKVyT8i
 OnxuW6jDRI/Yeo9MqZPOuv0dFxqxySxE5j014ajYQ3U7UNb3mI/fppvtNTI4mB4CukH5
 Oe09wb7hH41zYJpzBD9h1aRn+PsmVd5LBGAEsD2avXVXx+yret5OKQl9CHYCgBRvUvEl
 +65D697ZMu08SojbDW9N76zXOuO2i23Y4LZQoWlkJjux8bUybA+tZDmOUH/QAT77t6VO
 FF4g==
X-Gm-Message-State: AAQBX9fbYrA4zE9BBvRuNOdf1LWzuDXNzocsqN/PKEnzTWGnwefmZrk3
 U41/4o/RqYlJD1ITrDkKoWQTve2PdX8IPqw/WkMdYIJAlQ==
X-Google-Smtp-Source: AKy350Zw7XhJE5g9ZCkvvJSpFlbDOD7s6z7P0DRHJ6Q8Nq6Gcl6okHJzNciL98ZhZAVCEtrB68p3Dxbxv/pZ5mHKEec=
X-Received: by 2002:a25:7b03:0:b0:b8b:fca4:7454 with SMTP id
 w3-20020a257b03000000b00b8bfca47454mr758657ybc.4.1681258526555; Tue, 11 Apr
 2023 17:15:26 -0700 (PDT)
MIME-Version: 1.0
References: <168022420664.45787.1714641403814314728@ietfa.amsl.com>
 <CAM5tNy6udE6gqKHTfcTfzmf+CTqWiA9kuz__fnaqCyPPUF1+ZQ@mail.gmail.com>
 <CAABAsM6SNu1getd5GgGqp96qZtL4HMqt1mKTwLY=LaDOqdefdQ@mail.gmail.com>
 <CAM5tNy4UVT1R8jeECg+N_=tVn14B7QSNCT2WgZNzA2OG17yY-w@mail.gmail.com>
In-Reply-To: <CAM5tNy4UVT1R8jeECg+N_=tVn14B7QSNCT2WgZNzA2OG17yY-w@mail.gmail.com>
From: Trond Myklebust <trondmy@gmail.com>
Date: Tue, 11 Apr 2023 20:15:15 -0400
Message-ID: <CAABAsM6ZSiXW=0igHyOfBg+QOG3a_6AAbg8zfxMqHJ3_772gxA@mail.gmail.com>
To: Rick Macklem <rick.macklem@gmail.com>
Cc: nfsv4@ietf.org
Content-Type: multipart/alternative; boundary="0000000000004044ed05f9188170"
Archived-At: <https://mailarchive.ietf.org/arch/msg/nfsv4/M3PHvlACf2tk3sXVP1USC0rNPd0>
Subject: Re: [nfsv4] I-D Action: draft-ietf-nfsv4-layoutwcc-01.txt
X-BeenThere: nfsv4@ietf.org
X-Mailman-Version: 2.1.39
Precedence: list
List-Id: NFSv4 Working Group <nfsv4.ietf.org>
List-Unsubscribe: <https://www.ietf.org/mailman/options/nfsv4>,
 <mailto:nfsv4-request@ietf.org?subject=unsubscribe>
List-Archive: <https://mailarchive.ietf.org/arch/browse/nfsv4/>
List-Post: <mailto:nfsv4@ietf.org>
List-Help: <mailto:nfsv4-request@ietf.org?subject=help>
List-Subscribe: <https://www.ietf.org/mailman/listinfo/nfsv4>,
 <mailto:nfsv4-request@ietf.org?subject=subscribe>
X-List-Received-Date: Wed, 12 Apr 2023 00:15:31 -0000

--0000000000004044ed05f9188170
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

As I've said many times before, LAYOUTCOMMIT is not only unnecessary, but
is a terrible idea. Any O_SYNC workload is literally sunk when you require
LAYOUTCOMMIT in order to fully commit the metadata, because each write()
syscall needs to go both to the data server and the metadata server.

No proposal that attempts to expand the use of LAYOUTCOMMIT is going to be
welcome on the Linux client.

On Tue, Apr 11, 2023 at 4:14=E2=80=AFPM Rick Macklem <rick.macklem@gmail.co=
m> wrote:

> On Tue, Apr 11, 2023 at 6:40=E2=80=AFAM Trond Myklebust <trondmy@gmail.co=
m> wrote:
> >
> > LAYOUTCOMMIT sort of makes sense for block based layouts, since you do
> need to persist updates of the blocks with the metadata server. It makes
> zero sense for files and flex files layout types, since you already have
> persistence in the form of the regular COMMIT operation. While you could
> argue that LAYOUTCOMMIT can be seen as just providing hints in those case=
s,
> there is nothing in the protocol spec that allows you to conclude that. I=
t
> is not worth introducing a new operation that needs to be called every ti=
me
> that the client wants to ensure persistence, just in order to resolve a
> server reboot issue.
> >
> First, note the following sentence:
>     The LAYOUTCOMMIT operation indicates that the client has completed
>     writes using a layout obtained by a previous LAYOUTGET.
> In particular, note that it says nothing like "when data needs to be
> committed to stable storage".
> Also, I cannot find anywhere where the RFC discusses the relationship
> between LAYOUTCOMMIT
> and COMMIT.  Therefore, I do not believe any relationship should be
> assumed.
>
> Now, here's the back story:
> When I implemented the pNFS server for FreeBSD, I assumed that the
> client would do a
> LayoutCommit after a series of writes and before it expected the MDS
> to have up-to-date
> attributes for the file that was written to the DS (using a RW layout).
> --> At the time of LayoutCommit, the MDS did a Getattr on the DS(s)
> for the layout to
>      update the MDS's copy of the attributes that change when writing
> occurs.
>
> This worked fine for the FreeBSD client, which does a LayoutCommit
> whenever a
> process did a POSIX fsync or close syscall.
>
> Then I started testing using a Linux client and found out that it got
> confused by
> stale attributes acquired from the MDS after doing write(s) to the DS(s).
> Looking at the packet trace I found that, for the case where the Linux
> client
> did writes to the DS(s) with File_Sync set, the Linux client did not do a
> LayoutCommit. (It, of course, did not do a Commit, but that was what I
> expected.)
> --> I looked in the RFC for a statement that would make it clear that the
> Linux
>       client was broken and should do a LayoutCommit after these write(s)
> but...
>       I could not find one.
>        So, to work around the problem I added a tunable called
>        vfs.nfsd.pnfsgetdsattr which, when enabled, makes the pNFS server
> do a
>        Getattr against the DS(s) whenever it receives a Getattr from a
> client and
>        a RW layout has been issued for the file.
>        --> This resulted in a lot of MDS->DS Getattrs, but made the Linux
> client
>              function well against the FreeBSD pNFS server.
>
> It looks to me like the proposed Layout_WCC operation is mostly just
> trying to solve
> the problem that can be solved by requiring a pNFS client to do a
> LayoutCommit
> after a series of writes are done to DS(s) based on a RW layout and befor=
e
> it
> expects the MDS to have up-to-date attributes for the file that was
> written.
> (For my client, that seems to be whenever a POSIX close or fsync is done =
on
>   the file.)
>
> Since LayoutCommits do not happen that frequently, the overhead of doing
> a Getattr against a DS for each of them is not particularily large.
>
> Then there is the separate issue of how a client reports errors encounter=
ed
> when doing write(s) to a DS if the MDS reboots, making the layout stateid
> stale.
> I'd rather see this handled via a non-empty loca_layoutupdate, since
> that handles
> the problem for both NFSv4.1 and NFSv4.2 via a layout specific driver.
> (NFSv4.1 is deficient, since it does not have LayoutError and this could
> work
>  around that.) A new NFSv4.2 operation only fixes the issue for NFSv4.2 a=
nd
> only after it is adopted and only if it is clearly specified when the
> client must
> do the operation.
>
> It seems to me that a non-empty loca_layoutupdate is only needed for the
> rare second case, although it might be useful for the maintenance of
> up-to-date attributes on the MDS.
>
> Finally, I will note that it is not obvious how the client can provide a
> Change
> attribute to the MDS after doing Writes to a NFSv3 DS.  I think that the
> client can provide a boolean that indicates if a write to a DS for the fi=
le
> has been done since the last LayoutCommit (or Layout_WCC if you
> choose to go that way).
>
> > Secondly, as you pointed out, the repurposing of LAYOUTCOMMIT for
> propagating layout stats and layout errors requires a protocol change in
> any case. While that protocol change would be limited to the flex files
> protocol, it requires a whole new flex files v2 spec, since there is no
> procedure for amending the layout protocols in the way that exists for th=
e
> NFSv4.2 protocol itself.
>
> First, I do not consider this "repurposing". I think you have assumed
> a relationship
> between LayoutCommit and Commit that does not exist and, although the RFC
> seems
> vague w.r.t. when LayoutCommits are to be performed by a pNFS client,
> I do think that
> they were intended for this use. Among other things, they are allowed
> to be done during
> Grace using a stale layout stateid.
>
> Second, what's the difference between an amended RFC for Flexible File
> Layout and
> an RFC for a new NFSv4.2 operation? All that changes is the
> loca_layoutupdate becomes
> optionally non-empty. (If everyone is really convinced that is not
> permitted, then define
> Flexible File Layout 2 with that minimal change, to fix the "how to
> report DS errors when
> the MDS reboots.)
>
> >
> > Lastly, while LAYOUTCOMMIT could potentially carry a layout stats and
> layout error payload in the same way that LAYOUTRETURN can, that payload
> size is limited by the negotiated 'ca_maxrequestsize' session value. This
> is in part why we have dedicated LAYOUTSTATS and LAYOUTERROR operations i=
n
> the first place: so that the client can report the full information in
> multiple RPC calls when necessary, without being limited by the payload
> limitations of a single RPC call.
>
> The arguments that come before loca_layoutupdate for LayoutCommit
> are (at a glance, so I might be off a little) 68bytes vs 20bytes for
> Layout_WCC.
> I doubt 48bytes is going to make a difference. I also believe that
> up-to-date
> attributes and errors received from write(s) to DS(s) while the MDS reboo=
ts
> are what matters. I do not see stats as important for the case where the
> MDS
> has rebooted and the client cannot do a LayoutStats with the old layout
> stateid.
> (After all, LayoutStats does not even exist in NFSv4.1.)
>
> rick
>
> >
> > On Mon, Apr 10, 2023 at 7:11=E2=80=AFPM Rick Macklem <rick.macklem@gmai=
l.com>
> wrote:
> >>
> >> If Tom commented w.r.t. the use of LAYOUTCOMMIT for this,
> >> I missed it.
> >>
> >> Here are some snippets from RFC8881 Sec. 18.42 plus comments:
> >>
> >>    The LAYOUTCOMMIT operation indicates that the client has completed
> >>    writes using a layout obtained by a previous LAYOUTGET.  The client
> >>    may have only written a subset of the data range it previously
> >>    requested.  LAYOUTCOMMIT allows it to commit or discard provisional=
ly
> >>    allocated space and to update the server with a new end-of-file.
> >> Note that the arguments loca_last_write_offset, loca_time_modify and
> >> loca_layoutupdate can be used by the client to inform the MDS about
> >> file changes due to writing to the DS.
> >>
> >>    If the loca_reclaim field is set to TRUE, this indicates that the
> >>    client is attempting to commit changes to a layout after the restar=
t
> >>    of the metadata server during the metadata server's recovery grace
> >>    period (see Section 12.7.4).  This type of request may be necessary
> >>    when the client has uncommitted writes to provisionally allocated
> >>    byte-ranges of a file that were sent to the storage devices before
> >>    the restart of the metadata server.
> >> This explains how LayoutCommit can be used after an MDS reboot
> >> to update the MDS w.r.t. writes done to the DS.
> >>
> >>    The loca_last_write_offset field specifies the offset of the last
> >>    byte written by the client previous to the LAYOUTCOMMIT.  Note that
> >>    this value is never equal to the file's size (at most it is one byt=
e
> >>    less than the file's size) and MUST be less than or equal to
> >>    NFS4_MAXFILEOFF.  Also, loca_last_write_offset MUST overlap the ran=
ge
> >>    described by loca_offset and loca_length.  The metadata server may
> >>    use this information to determine whether the file's size needs to =
be
> >>    updated.  If the metadata server updates the file's size as the
> >>    result of the LAYOUTCOMMIT operation, it must return the new size
> >>    (locr_newsize.ns_size) as part of the results.
> >>
> >>    The loca_time_modify field allows the client to suggest a
> >>    modification time it would like the metadata server to set.  The
> >>    metadata server may use the suggestion or it may use the time of th=
e
> >>    LAYOUTCOMMIT operation to set the modification time.  If the metada=
ta
> >>    server uses the client-provided modification time, it should ensure
> >>    that time does not flow backwards.  If the client wants to force th=
e
> >>    metadata server to set an exact time, the client should use a SETAT=
TR
> >>    operation in a COMPOUND right after LAYOUTCOMMIT.  See Section 12.5=
.4
> >>    for more details.  If the client desires the resultant modification
> >>    time, it should construct the COMPOUND so that a GETATTR follows th=
e
> >>    LAYOUTCOMMIT.
> >> Here are details on how loca_last_write_offset and loca_time_modify
> >> are used to update the MDS.
> >>
> >> Although the Flex Files layout does not define any loca_layout_update,
> >> it seems that this could be added more easily than new NFSv4.2
> operations
> >> if additional client->MDS information is deemed necessary.
> >>
> >> Now, I do agree that:
> >>    The LAYOUTCOMMIT operation commits changes in the layout represente=
d
> >>    by the current filehandle, client ID (derived from the session ID i=
n
> >>    the preceding SEQUENCE operation), byte-range, and stateid.
> >> "commits changes in the layout" is kinda weird and does not clarify wh=
en
> >> LayoutCommit should be done by a client. I think this should be
> clarified,
> >> at least for Flex File Layout, as whenever the client has completed a
> series
> >> of writes to DS(s) using the layout and expects the MDS to return
> >> up-to-date Getattr information for the file to clients.
> >>
> >> With the above clarification (or whatever works for you), I do not see
> why
> >> new operations are needed?
> >>
> >> rick
> >>
> >> On Thu, Mar 30, 2023 at 5:57=E2=80=AFPM <internet-drafts@ietf.org> wro=
te:
> >> >
> >> >
> >> > A New Internet-Draft is available from the on-line Internet-Drafts
> >> > directories. This Internet-Draft is a work item of the Network File
> System
> >> > Version 4 (NFSV4) WG of the IETF.
> >> >
> >> >    Title           : Add LAYOUT_WCC to NFSv4.2's Flex File Layout Ty=
pe
> >> >    Authors         : Thomas Haynes
> >> >                      Trond Myklebust
> >> >    Filename        : draft-ietf-nfsv4-layoutwcc-01.txt
> >> >    Pages           : 11
> >> >    Date            : 2023-03-30
> >> >
> >> > Abstract:
> >> >    The Parallel Network File System (pNFS) Flexible File Layout allo=
ws
> >> >    for a file's metadata (MDS) and data (DS) to be on different
> servers.
> >> >    It does not provide a mechanism for the data server to update the
> >> >    metadata server of changes to the data part of the file.  The
> client
> >> >    has knowledge of such updates, but lacks the ability to update th=
e
> >> >    metadata server.  This document presents a refinement to RFC8435 =
to
> >> >    allow the client to update the metadata server to changes on the
> data
> >> >    server.
> >> >
> >> > The IETF datatracker status page for this Internet-Draft is:
> >> > https://datatracker.ietf.org/doc/draft-ietf-nfsv4-layoutwcc/
> >> >
> >> > There is also an HTML version available at:
> >> > https://www.ietf.org/archive/id/draft-ietf-nfsv4-layoutwcc-01.html
> >> >
> >> > A diff from the previous version is available at:
> >> >
> https://author-tools.ietf.org/iddiff?url2=3Ddraft-ietf-nfsv4-layoutwcc-01
> >> >
> >> > Internet-Drafts are also available by rsync at rsync.ietf.org:
> :internet-drafts
> >> >
> >> >
> >> > _______________________________________________
> >> > nfsv4 mailing list
> >> > nfsv4@ietf.org
> >> > https://www.ietf.org/mailman/listinfo/nfsv4
> >>
> >> _______________________________________________
> >> nfsv4 mailing list
> >> nfsv4@ietf.org
> >> https://www.ietf.org/mailman/listinfo/nfsv4
>

--0000000000004044ed05f9188170
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

<div dir=3D"ltr"><div>As I&#39;ve said many times before, LAYOUTCOMMIT is n=
ot only unnecessary, but is a terrible idea. Any O_SYNC workload is literal=
ly sunk when you require LAYOUTCOMMIT in order to fully commit the metadata=
, because each write() syscall needs to go both to the data server and the =
metadata server.</div><div><br></div><div>No proposal that attempts to expa=
nd the use of LAYOUTCOMMIT is going to be welcome on the Linux client.<br><=
/div></div><br><div class=3D"gmail_quote"><div dir=3D"ltr" class=3D"gmail_a=
ttr">On Tue, Apr 11, 2023 at 4:14=E2=80=AFPM Rick Macklem &lt;<a href=3D"ma=
ilto:rick.macklem@gmail.com">rick.macklem@gmail.com</a>&gt; wrote:<br></div=
><blockquote class=3D"gmail_quote" style=3D"margin:0px 0px 0px 0.8ex;border=
-left:1px solid rgb(204,204,204);padding-left:1ex">On Tue, Apr 11, 2023 at =
6:40=E2=80=AFAM Trond Myklebust &lt;<a href=3D"mailto:trondmy@gmail.com" ta=
rget=3D"_blank">trondmy@gmail.com</a>&gt; wrote:<br>
&gt;<br>
&gt; LAYOUTCOMMIT sort of makes sense for block based layouts, since you do=
 need to persist updates of the blocks with the metadata server. It makes z=
ero sense for files and flex files layout types, since you already have per=
sistence in the form of the regular COMMIT operation. While you could argue=
 that LAYOUTCOMMIT can be seen as just providing hints in those cases, ther=
e is nothing in the protocol spec that allows you to conclude that. It is n=
ot worth introducing a new operation that needs to be called every time tha=
t the client wants to ensure persistence, just in order to resolve a server=
 reboot issue.<br>
&gt;<br>
First, note the following sentence:<br>
=C2=A0 =C2=A0 The LAYOUTCOMMIT operation indicates that the client has comp=
leted<br>
=C2=A0 =C2=A0 writes using a layout obtained by a previous LAYOUTGET.<br>
In particular, note that it says nothing like &quot;when data needs to be<b=
r>
committed to stable storage&quot;.<br>
Also, I cannot find anywhere where the RFC discusses the relationship<br>
between LAYOUTCOMMIT<br>
and COMMIT.=C2=A0 Therefore, I do not believe any relationship should be as=
sumed.<br>
<br>
Now, here&#39;s the back story:<br>
When I implemented the pNFS server for FreeBSD, I assumed that the<br>
client would do a<br>
LayoutCommit after a series of writes and before it expected the MDS<br>
to have up-to-date<br>
attributes for the file that was written to the DS (using a RW layout).<br>
--&gt; At the time of LayoutCommit, the MDS did a Getattr on the DS(s)<br>
for the layout to<br>
=C2=A0 =C2=A0 =C2=A0update the MDS&#39;s copy of the attributes that change=
 when writing occurs.<br>
<br>
This worked fine for the FreeBSD client, which does a LayoutCommit whenever=
 a<br>
process did a POSIX fsync or close syscall.<br>
<br>
Then I started testing using a Linux client and found out that it got<br>
confused by<br>
stale attributes acquired from the MDS after doing write(s) to the DS(s).<b=
r>
Looking at the packet trace I found that, for the case where the Linux clie=
nt<br>
did writes to the DS(s) with File_Sync set, the Linux client did not do a<b=
r>
LayoutCommit. (It, of course, did not do a Commit, but that was what I<br>
expected.)<br>
--&gt; I looked in the RFC for a statement that would make it clear that th=
e Linux<br>
=C2=A0 =C2=A0 =C2=A0 client was broken and should do a LayoutCommit after t=
hese write(s) but...<br>
=C2=A0 =C2=A0 =C2=A0 I could not find one.<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0So, to work around the problem I added a tunable=
 called<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0vfs.nfsd.pnfsgetdsattr which, when enabled, make=
s the pNFS server do a<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0Getattr against the DS(s) whenever it receives a=
 Getattr from a<br>
client and<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0a RW layout has been issued for the file.<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0--&gt; This resulted in a lot of MDS-&gt;DS Geta=
ttrs, but made the Linux client<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0function well against the F=
reeBSD pNFS server.<br>
<br>
It looks to me like the proposed Layout_WCC operation is mostly just<br>
trying to solve<br>
the problem that can be solved by requiring a pNFS client to do a LayoutCom=
mit<br>
after a series of writes are done to DS(s) based on a RW layout and before =
it<br>
expects the MDS to have up-to-date attributes for the file that was written=
.<br>
(For my client, that seems to be whenever a POSIX close or fsync is done on=
<br>
=C2=A0 the file.)<br>
<br>
Since LayoutCommits do not happen that frequently, the overhead of doing<br=
>
a Getattr against a DS for each of them is not particularily large.<br>
<br>
Then there is the separate issue of how a client reports errors encountered=
<br>
when doing write(s) to a DS if the MDS reboots, making the layout stateid s=
tale.<br>
I&#39;d rather see this handled via a non-empty loca_layoutupdate, since<br=
>
that handles<br>
the problem for both NFSv4.1 and NFSv4.2 via a layout specific driver.<br>
(NFSv4.1 is deficient, since it does not have LayoutError and this could wo=
rk<br>
=C2=A0around that.) A new NFSv4.2 operation only fixes the issue for NFSv4.=
2 and<br>
only after it is adopted and only if it is clearly specified when the<br>
client must<br>
do the operation.<br>
<br>
It seems to me that a non-empty loca_layoutupdate is only needed for the<br=
>
rare second case, although it might be useful for the maintenance of<br>
up-to-date attributes on the MDS.<br>
<br>
Finally, I will note that it is not obvious how the client can provide a Ch=
ange<br>
attribute to the MDS after doing Writes to a NFSv3 DS.=C2=A0 I think that t=
he<br>
client can provide a boolean that indicates if a write to a DS for the file=
<br>
has been done since the last LayoutCommit (or Layout_WCC if you<br>
choose to go that way).<br>
<br>
&gt; Secondly, as you pointed out, the repurposing of LAYOUTCOMMIT for prop=
agating layout stats and layout errors requires a protocol change in any ca=
se. While that protocol change would be limited to the flex files protocol,=
 it requires a whole new flex files v2 spec, since there is no procedure fo=
r amending the layout protocols in the way that exists for the NFSv4.2 prot=
ocol itself.<br>
<br>
First, I do not consider this &quot;repurposing&quot;. I think you have ass=
umed<br>
a relationship<br>
between LayoutCommit and Commit that does not exist and, although the RFC s=
eems<br>
vague w.r.t. when LayoutCommits are to be performed by a pNFS client,<br>
I do think that<br>
they were intended for this use. Among other things, they are allowed<br>
to be done during<br>
Grace using a stale layout stateid.<br>
<br>
Second, what&#39;s the difference between an amended RFC for Flexible File<=
br>
Layout and<br>
an RFC for a new NFSv4.2 operation? All that changes is the<br>
loca_layoutupdate becomes<br>
optionally non-empty. (If everyone is really convinced that is not<br>
permitted, then define<br>
Flexible File Layout 2 with that minimal change, to fix the &quot;how to<br=
>
report DS errors when<br>
the MDS reboots.)<br>
<br>
&gt;<br>
&gt; Lastly, while LAYOUTCOMMIT could potentially carry a layout stats and =
layout error payload in the same way that LAYOUTRETURN can, that payload si=
ze is limited by the negotiated &#39;ca_maxrequestsize&#39; session value. =
This is in part why we have dedicated LAYOUTSTATS and LAYOUTERROR operation=
s in the first place: so that the client can report the full information in=
 multiple RPC calls when necessary, without being limited by the payload li=
mitations of a single RPC call.<br>
<br>
The arguments that come before loca_layoutupdate for LayoutCommit<br>
are (at a glance, so I might be off a little) 68bytes vs 20bytes for Layout=
_WCC.<br>
I doubt 48bytes is going to make a difference. I also believe that up-to-da=
te<br>
attributes and errors received from write(s) to DS(s) while the MDS reboots=
<br>
are what matters. I do not see stats as important for the case where the MD=
S<br>
has rebooted and the client cannot do a LayoutStats with the old layout sta=
teid.<br>
(After all, LayoutStats does not even exist in NFSv4.1.)<br>
<br>
rick<br>
<br>
&gt;<br>
&gt; On Mon, Apr 10, 2023 at 7:11=E2=80=AFPM Rick Macklem &lt;<a href=3D"ma=
ilto:rick.macklem@gmail.com" target=3D"_blank">rick.macklem@gmail.com</a>&g=
t; wrote:<br>
&gt;&gt;<br>
&gt;&gt; If Tom commented w.r.t. the use of LAYOUTCOMMIT for this,<br>
&gt;&gt; I missed it.<br>
&gt;&gt;<br>
&gt;&gt; Here are some snippets from RFC8881 Sec. 18.42 plus comments:<br>
&gt;&gt;<br>
&gt;&gt;=C2=A0 =C2=A0 The LAYOUTCOMMIT operation indicates that the client =
has completed<br>
&gt;&gt;=C2=A0 =C2=A0 writes using a layout obtained by a previous LAYOUTGE=
T.=C2=A0 The client<br>
&gt;&gt;=C2=A0 =C2=A0 may have only written a subset of the data range it p=
reviously<br>
&gt;&gt;=C2=A0 =C2=A0 requested.=C2=A0 LAYOUTCOMMIT allows it to commit or =
discard provisionally<br>
&gt;&gt;=C2=A0 =C2=A0 allocated space and to update the server with a new e=
nd-of-file.<br>
&gt;&gt; Note that the arguments loca_last_write_offset, loca_time_modify a=
nd<br>
&gt;&gt; loca_layoutupdate can be used by the client to inform the MDS abou=
t<br>
&gt;&gt; file changes due to writing to the DS.<br>
&gt;&gt;<br>
&gt;&gt;=C2=A0 =C2=A0 If the loca_reclaim field is set to TRUE, this indica=
tes that the<br>
&gt;&gt;=C2=A0 =C2=A0 client is attempting to commit changes to a layout af=
ter the restart<br>
&gt;&gt;=C2=A0 =C2=A0 of the metadata server during the metadata server&#39=
;s recovery grace<br>
&gt;&gt;=C2=A0 =C2=A0 period (see Section 12.7.4).=C2=A0 This type of reque=
st may be necessary<br>
&gt;&gt;=C2=A0 =C2=A0 when the client has uncommitted writes to provisional=
ly allocated<br>
&gt;&gt;=C2=A0 =C2=A0 byte-ranges of a file that were sent to the storage d=
evices before<br>
&gt;&gt;=C2=A0 =C2=A0 the restart of the metadata server.<br>
&gt;&gt; This explains how LayoutCommit can be used after an MDS reboot<br>
&gt;&gt; to update the MDS w.r.t. writes done to the DS.<br>
&gt;&gt;<br>
&gt;&gt;=C2=A0 =C2=A0 The loca_last_write_offset field specifies the offset=
 of the last<br>
&gt;&gt;=C2=A0 =C2=A0 byte written by the client previous to the LAYOUTCOMM=
IT.=C2=A0 Note that<br>
&gt;&gt;=C2=A0 =C2=A0 this value is never equal to the file&#39;s size (at =
most it is one byte<br>
&gt;&gt;=C2=A0 =C2=A0 less than the file&#39;s size) and MUST be less than =
or equal to<br>
&gt;&gt;=C2=A0 =C2=A0 NFS4_MAXFILEOFF.=C2=A0 Also, loca_last_write_offset M=
UST overlap the range<br>
&gt;&gt;=C2=A0 =C2=A0 described by loca_offset and loca_length.=C2=A0 The m=
etadata server may<br>
&gt;&gt;=C2=A0 =C2=A0 use this information to determine whether the file&#3=
9;s size needs to be<br>
&gt;&gt;=C2=A0 =C2=A0 updated.=C2=A0 If the metadata server updates the fil=
e&#39;s size as the<br>
&gt;&gt;=C2=A0 =C2=A0 result of the LAYOUTCOMMIT operation, it must return =
the new size<br>
&gt;&gt;=C2=A0 =C2=A0 (locr_newsize.ns_size) as part of the results.<br>
&gt;&gt;<br>
&gt;&gt;=C2=A0 =C2=A0 The loca_time_modify field allows the client to sugge=
st a<br>
&gt;&gt;=C2=A0 =C2=A0 modification time it would like the metadata server t=
o set.=C2=A0 The<br>
&gt;&gt;=C2=A0 =C2=A0 metadata server may use the suggestion or it may use =
the time of the<br>
&gt;&gt;=C2=A0 =C2=A0 LAYOUTCOMMIT operation to set the modification time.=
=C2=A0 If the metadata<br>
&gt;&gt;=C2=A0 =C2=A0 server uses the client-provided modification time, it=
 should ensure<br>
&gt;&gt;=C2=A0 =C2=A0 that time does not flow backwards.=C2=A0 If the clien=
t wants to force the<br>
&gt;&gt;=C2=A0 =C2=A0 metadata server to set an exact time, the client shou=
ld use a SETATTR<br>
&gt;&gt;=C2=A0 =C2=A0 operation in a COMPOUND right after LAYOUTCOMMIT.=C2=
=A0 See Section 12.5.4<br>
&gt;&gt;=C2=A0 =C2=A0 for more details.=C2=A0 If the client desires the res=
ultant modification<br>
&gt;&gt;=C2=A0 =C2=A0 time, it should construct the COMPOUND so that a GETA=
TTR follows the<br>
&gt;&gt;=C2=A0 =C2=A0 LAYOUTCOMMIT.<br>
&gt;&gt; Here are details on how loca_last_write_offset and loca_time_modif=
y<br>
&gt;&gt; are used to update the MDS.<br>
&gt;&gt;<br>
&gt;&gt; Although the Flex Files layout does not define any loca_layout_upd=
ate,<br>
&gt;&gt; it seems that this could be added more easily than new NFSv4.2 ope=
rations<br>
&gt;&gt; if additional client-&gt;MDS information is deemed necessary.<br>
&gt;&gt;<br>
&gt;&gt; Now, I do agree that:<br>
&gt;&gt;=C2=A0 =C2=A0 The LAYOUTCOMMIT operation commits changes in the lay=
out represented<br>
&gt;&gt;=C2=A0 =C2=A0 by the current filehandle, client ID (derived from th=
e session ID in<br>
&gt;&gt;=C2=A0 =C2=A0 the preceding SEQUENCE operation), byte-range, and st=
ateid.<br>
&gt;&gt; &quot;commits changes in the layout&quot; is kinda weird and does =
not clarify when<br>
&gt;&gt; LayoutCommit should be done by a client. I think this should be cl=
arified,<br>
&gt;&gt; at least for Flex File Layout, as whenever the client has complete=
d a series<br>
&gt;&gt; of writes to DS(s) using the layout and expects the MDS to return<=
br>
&gt;&gt; up-to-date Getattr information for the file to clients.<br>
&gt;&gt;<br>
&gt;&gt; With the above clarification (or whatever works for you), I do not=
 see why<br>
&gt;&gt; new operations are needed?<br>
&gt;&gt;<br>
&gt;&gt; rick<br>
&gt;&gt;<br>
&gt;&gt; On Thu, Mar 30, 2023 at 5:57=E2=80=AFPM &lt;<a href=3D"mailto:inte=
rnet-drafts@ietf.org" target=3D"_blank">internet-drafts@ietf.org</a>&gt; wr=
ote:<br>
&gt;&gt; &gt;<br>
&gt;&gt; &gt;<br>
&gt;&gt; &gt; A New Internet-Draft is available from the on-line Internet-D=
rafts<br>
&gt;&gt; &gt; directories. This Internet-Draft is a work item of the Networ=
k File System<br>
&gt;&gt; &gt; Version 4 (NFSV4) WG of the IETF.<br>
&gt;&gt; &gt;<br>
&gt;&gt; &gt;=C2=A0 =C2=A0 Title=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0: =
Add LAYOUT_WCC to NFSv4.2&#39;s Flex File Layout Type<br>
&gt;&gt; &gt;=C2=A0 =C2=A0 Authors=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0: Thoma=
s Haynes<br>
&gt;&gt; &gt;=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0=
 =C2=A0 =C2=A0 Trond Myklebust<br>
&gt;&gt; &gt;=C2=A0 =C2=A0 Filename=C2=A0 =C2=A0 =C2=A0 =C2=A0 : draft-ietf=
-nfsv4-layoutwcc-01.txt<br>
&gt;&gt; &gt;=C2=A0 =C2=A0 Pages=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0: =
11<br>
&gt;&gt; &gt;=C2=A0 =C2=A0 Date=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 : =
2023-03-30<br>
&gt;&gt; &gt;<br>
&gt;&gt; &gt; Abstract:<br>
&gt;&gt; &gt;=C2=A0 =C2=A0 The Parallel Network File System (pNFS) Flexible=
 File Layout allows<br>
&gt;&gt; &gt;=C2=A0 =C2=A0 for a file&#39;s metadata (MDS) and data (DS) to=
 be on different servers.<br>
&gt;&gt; &gt;=C2=A0 =C2=A0 It does not provide a mechanism for the data ser=
ver to update the<br>
&gt;&gt; &gt;=C2=A0 =C2=A0 metadata server of changes to the data part of t=
he file.=C2=A0 The client<br>
&gt;&gt; &gt;=C2=A0 =C2=A0 has knowledge of such updates, but lacks the abi=
lity to update the<br>
&gt;&gt; &gt;=C2=A0 =C2=A0 metadata server.=C2=A0 This document presents a =
refinement to RFC8435 to<br>
&gt;&gt; &gt;=C2=A0 =C2=A0 allow the client to update the metadata server t=
o changes on the data<br>
&gt;&gt; &gt;=C2=A0 =C2=A0 server.<br>
&gt;&gt; &gt;<br>
&gt;&gt; &gt; The IETF datatracker status page for this Internet-Draft is:<=
br>
&gt;&gt; &gt; <a href=3D"https://datatracker.ietf.org/doc/draft-ietf-nfsv4-=
layoutwcc/" rel=3D"noreferrer" target=3D"_blank">https://datatracker.ietf.o=
rg/doc/draft-ietf-nfsv4-layoutwcc/</a><br>
&gt;&gt; &gt;<br>
&gt;&gt; &gt; There is also an HTML version available at:<br>
&gt;&gt; &gt; <a href=3D"https://www.ietf.org/archive/id/draft-ietf-nfsv4-l=
ayoutwcc-01.html" rel=3D"noreferrer" target=3D"_blank">https://www.ietf.org=
/archive/id/draft-ietf-nfsv4-layoutwcc-01.html</a><br>
&gt;&gt; &gt;<br>
&gt;&gt; &gt; A diff from the previous version is available at:<br>
&gt;&gt; &gt; <a href=3D"https://author-tools.ietf.org/iddiff?url2=3Ddraft-=
ietf-nfsv4-layoutwcc-01" rel=3D"noreferrer" target=3D"_blank">https://autho=
r-tools.ietf.org/iddiff?url2=3Ddraft-ietf-nfsv4-layoutwcc-01</a><br>
&gt;&gt; &gt;<br>
&gt;&gt; &gt; Internet-Drafts are also available by rsync at rsync.ietf.org=
::internet-drafts<br>
&gt;&gt; &gt;<br>
&gt;&gt; &gt;<br>
&gt;&gt; &gt; _______________________________________________<br>
&gt;&gt; &gt; nfsv4 mailing list<br>
&gt;&gt; &gt; <a href=3D"mailto:nfsv4@ietf.org" target=3D"_blank">nfsv4@iet=
f.org</a><br>
&gt;&gt; &gt; <a href=3D"https://www.ietf.org/mailman/listinfo/nfsv4" rel=
=3D"noreferrer" target=3D"_blank">https://www.ietf.org/mailman/listinfo/nfs=
v4</a><br>
&gt;&gt;<br>
&gt;&gt; _______________________________________________<br>
&gt;&gt; nfsv4 mailing list<br>
&gt;&gt; <a href=3D"mailto:nfsv4@ietf.org" target=3D"_blank">nfsv4@ietf.org=
</a><br>
&gt;&gt; <a href=3D"https://www.ietf.org/mailman/listinfo/nfsv4" rel=3D"nor=
eferrer" target=3D"_blank">https://www.ietf.org/mailman/listinfo/nfsv4</a><=
br>
</blockquote></div>

--0000000000004044ed05f9188170--

