Re: [nfsv4] I-D Action: draft-ietf-nfsv4-layoutwcc-01.txt

Rick Macklem <> Tue, 11 April 2023 20:15 UTC

Return-Path: <>
Received: from localhost (localhost []) by (Postfix) with ESMTP id 64D8FC1CAB36 for <>; Tue, 11 Apr 2023 13:15:00 -0700 (PDT)
X-Virus-Scanned: amavisd-new at
X-Spam-Flag: NO
X-Spam-Score: -2.095
X-Spam-Status: No, score=-2.095 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, FREEMAIL_FROM=0.001, RCVD_IN_ZEN_BLOCKED_OPENDNS=0.001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001, URIBL_BLOCKED=0.001, URIBL_DBL_BLOCKED_OPENDNS=0.001, URIBL_ZEN_BLOCKED_OPENDNS=0.001] autolearn=ham autolearn_force=no
Authentication-Results: (amavisd-new); dkim=pass (2048-bit key)
Received: from ([]) by localhost ( []) (amavisd-new, port 10024) with ESMTP id IeAtKb3qEDPO for <>; Tue, 11 Apr 2023 13:14:59 -0700 (PDT)
Received: from ( [IPv6:2607:f8b0:4864:20::62e]) (using TLSv1.3 with cipher TLS_AES_128_GCM_SHA256 (128/128 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by (Postfix) with ESMTPS id 89A08C16B5CB for <>; Tue, 11 Apr 2023 13:14:59 -0700 (PDT)
Received: by with SMTP id h24so9158269plr.1 for <>; Tue, 11 Apr 2023 13:14:59 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;; s=20221208; t=1681244098; x=1683836098; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=lXAt++1MxBR3icTUlon+L9tUmhZ6HyOFvJTTiZ2S+pM=; b=o2lFjsxTdA9b/ear0RqkjmhaCScSBEzPAe8Ri+gcv+R9KcuVKAmHTVgjgT1Ra1PHs9 bJuz3zfvgNJc/HpzDwz+jgn9NCUPlyjPWNy8B9oUg5URYLrS2qqUbLX4eG3QtimqFw42 7/MFz1GgQi1boyCdZjYHfDbo6gJWSXA334wiUHFzHMG3BHH8b9VyJMQfeENm9p11wNm+ GUH1rQPHw6nB7hpih0ORSYrOKL0iKN2PgQ1WsrJdQriJaUujSR8pnCiNV5plSA7cqbQg PJvSWiPZgJ+sXJhJ9v8U1D/9yucXe61vtSqw4gRyCZ+ZTI2YXkWXg3LAoNBjpu96p5Oe WwKQ==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;; s=20210112; t=1681244098; x=1683836098; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=lXAt++1MxBR3icTUlon+L9tUmhZ6HyOFvJTTiZ2S+pM=; b=WXMcm/+4veGwPycX446ClEWTHCcq6XVyMsjod5QxBCvWA/VNtD2D+qsJU1F4ISvNf8 Z6C9D1GVTPTg1u/6ZqnH3TXwOF5AybZHNWUQC7r/ENI7FgD4EGZcCwQNa5QxOen6gCA1 6zskrGhh8FwSw6dKJBwuz27fFfJBT09BHnxmH2ps1chXyzBSIiFQkCVQM0XiFCGbTWZ9 gaVh14JDwtkObvzgv5+fPVZzmc0vMLlkBSuBnD9yPVxi/Q/cSIB9SgzVMiH/1U9dyo2K vFMMR+VSSaNoiQmEA2cefEXriV7/0rttQfdEnlAq0C7mhKUtBuQDls3lfnXM/Q0gtIg+ gQCQ==
X-Gm-Message-State: AAQBX9dL5bFk0/k7pHwNsHcc9Sq2B7/YEejQRtTo+z9NMdH0pMzlpctR 0cKKpgYH1WBfgVzmmRPyXrymG1jiKu5OaLy0SQ==
X-Google-Smtp-Source: AKy350Zk7uVZctoaZSWXdfcwOnd/NkGnotArRmFdyDsZw6RgHnGQnwlUqBEpoXHm0hOrG7ztejed+RWQ2R4L+20P7CQ=
X-Received: by 2002:a17:90a:5e4d:b0:23e:feef:38ef with SMTP id u13-20020a17090a5e4d00b0023efeef38efmr19377097pji.41.1681244098384; Tue, 11 Apr 2023 13:14:58 -0700 (PDT)
MIME-Version: 1.0
References: <> <> <>
In-Reply-To: <>
From: Rick Macklem <>
Date: Tue, 11 Apr 2023 13:14:47 -0700
Message-ID: <>
To: Trond Myklebust <>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Archived-At: <>
Subject: Re: [nfsv4] I-D Action: draft-ietf-nfsv4-layoutwcc-01.txt
X-Mailman-Version: 2.1.39
Precedence: list
List-Id: NFSv4 Working Group <>
List-Unsubscribe: <>, <>
List-Archive: <>
List-Post: <>
List-Help: <>
List-Subscribe: <>, <>
X-List-Received-Date: Tue, 11 Apr 2023 20:15:00 -0000

On Tue, Apr 11, 2023 at 6:40 AM Trond Myklebust <> wrote:
> LAYOUTCOMMIT sort of makes sense for block based layouts, since you do need to persist updates of the blocks with the metadata server. It makes zero sense for files and flex files layout types, since you already have persistence in the form of the regular COMMIT operation. While you could argue that LAYOUTCOMMIT can be seen as just providing hints in those cases, there is nothing in the protocol spec that allows you to conclude that. It is not worth introducing a new operation that needs to be called every time that the client wants to ensure persistence, just in order to resolve a server reboot issue.
First, note the following sentence:
    The LAYOUTCOMMIT operation indicates that the client has completed
    writes using a layout obtained by a previous LAYOUTGET.
In particular, note that it says nothing like "when data needs to be
committed to stable storage".
Also, I cannot find anywhere where the RFC discusses the relationship
and COMMIT.  Therefore, I do not believe any relationship should be assumed.

Now, here's the back story:
When I implemented the pNFS server for FreeBSD, I assumed that the
client would do a
LayoutCommit after a series of writes and before it expected the MDS
to have up-to-date
attributes for the file that was written to the DS (using a RW layout).
--> At the time of LayoutCommit, the MDS did a Getattr on the DS(s)
for the layout to
     update the MDS's copy of the attributes that change when writing occurs.

This worked fine for the FreeBSD client, which does a LayoutCommit whenever a
process did a POSIX fsync or close syscall.

Then I started testing using a Linux client and found out that it got
confused by
stale attributes acquired from the MDS after doing write(s) to the DS(s).
Looking at the packet trace I found that, for the case where the Linux client
did writes to the DS(s) with File_Sync set, the Linux client did not do a
LayoutCommit. (It, of course, did not do a Commit, but that was what I
--> I looked in the RFC for a statement that would make it clear that the Linux
      client was broken and should do a LayoutCommit after these write(s) but...
      I could not find one.
       So, to work around the problem I added a tunable called
       vfs.nfsd.pnfsgetdsattr which, when enabled, makes the pNFS server do a
       Getattr against the DS(s) whenever it receives a Getattr from a
client and
       a RW layout has been issued for the file.
       --> This resulted in a lot of MDS->DS Getattrs, but made the Linux client
             function well against the FreeBSD pNFS server.

It looks to me like the proposed Layout_WCC operation is mostly just
trying to solve
the problem that can be solved by requiring a pNFS client to do a LayoutCommit
after a series of writes are done to DS(s) based on a RW layout and before it
expects the MDS to have up-to-date attributes for the file that was written.
(For my client, that seems to be whenever a POSIX close or fsync is done on
  the file.)

Since LayoutCommits do not happen that frequently, the overhead of doing
a Getattr against a DS for each of them is not particularily large.

Then there is the separate issue of how a client reports errors encountered
when doing write(s) to a DS if the MDS reboots, making the layout stateid stale.
I'd rather see this handled via a non-empty loca_layoutupdate, since
that handles
the problem for both NFSv4.1 and NFSv4.2 via a layout specific driver.
(NFSv4.1 is deficient, since it does not have LayoutError and this could work
 around that.) A new NFSv4.2 operation only fixes the issue for NFSv4.2 and
only after it is adopted and only if it is clearly specified when the
client must
do the operation.

It seems to me that a non-empty loca_layoutupdate is only needed for the
rare second case, although it might be useful for the maintenance of
up-to-date attributes on the MDS.

Finally, I will note that it is not obvious how the client can provide a Change
attribute to the MDS after doing Writes to a NFSv3 DS.  I think that the
client can provide a boolean that indicates if a write to a DS for the file
has been done since the last LayoutCommit (or Layout_WCC if you
choose to go that way).

> Secondly, as you pointed out, the repurposing of LAYOUTCOMMIT for propagating layout stats and layout errors requires a protocol change in any case. While that protocol change would be limited to the flex files protocol, it requires a whole new flex files v2 spec, since there is no procedure for amending the layout protocols in the way that exists for the NFSv4.2 protocol itself.

First, I do not consider this "repurposing". I think you have assumed
a relationship
between LayoutCommit and Commit that does not exist and, although the RFC seems
vague w.r.t. when LayoutCommits are to be performed by a pNFS client,
I do think that
they were intended for this use. Among other things, they are allowed
to be done during
Grace using a stale layout stateid.

Second, what's the difference between an amended RFC for Flexible File
Layout and
an RFC for a new NFSv4.2 operation? All that changes is the
loca_layoutupdate becomes
optionally non-empty. (If everyone is really convinced that is not
permitted, then define
Flexible File Layout 2 with that minimal change, to fix the "how to
report DS errors when
the MDS reboots.)

> Lastly, while LAYOUTCOMMIT could potentially carry a layout stats and layout error payload in the same way that LAYOUTRETURN can, that payload size is limited by the negotiated 'ca_maxrequestsize' session value. This is in part why we have dedicated LAYOUTSTATS and LAYOUTERROR operations in the first place: so that the client can report the full information in multiple RPC calls when necessary, without being limited by the payload limitations of a single RPC call.

The arguments that come before loca_layoutupdate for LayoutCommit
are (at a glance, so I might be off a little) 68bytes vs 20bytes for Layout_WCC.
I doubt 48bytes is going to make a difference. I also believe that up-to-date
attributes and errors received from write(s) to DS(s) while the MDS reboots
are what matters. I do not see stats as important for the case where the MDS
has rebooted and the client cannot do a LayoutStats with the old layout stateid.
(After all, LayoutStats does not even exist in NFSv4.1.)


> On Mon, Apr 10, 2023 at 7:11 PM Rick Macklem <> wrote:
>> If Tom commented w.r.t. the use of LAYOUTCOMMIT for this,
>> I missed it.
>> Here are some snippets from RFC8881 Sec. 18.42 plus comments:
>>    The LAYOUTCOMMIT operation indicates that the client has completed
>>    writes using a layout obtained by a previous LAYOUTGET.  The client
>>    may have only written a subset of the data range it previously
>>    requested.  LAYOUTCOMMIT allows it to commit or discard provisionally
>>    allocated space and to update the server with a new end-of-file.
>> Note that the arguments loca_last_write_offset, loca_time_modify and
>> loca_layoutupdate can be used by the client to inform the MDS about
>> file changes due to writing to the DS.
>>    If the loca_reclaim field is set to TRUE, this indicates that the
>>    client is attempting to commit changes to a layout after the restart
>>    of the metadata server during the metadata server's recovery grace
>>    period (see Section 12.7.4).  This type of request may be necessary
>>    when the client has uncommitted writes to provisionally allocated
>>    byte-ranges of a file that were sent to the storage devices before
>>    the restart of the metadata server.
>> This explains how LayoutCommit can be used after an MDS reboot
>> to update the MDS w.r.t. writes done to the DS.
>>    The loca_last_write_offset field specifies the offset of the last
>>    byte written by the client previous to the LAYOUTCOMMIT.  Note that
>>    this value is never equal to the file's size (at most it is one byte
>>    less than the file's size) and MUST be less than or equal to
>>    NFS4_MAXFILEOFF.  Also, loca_last_write_offset MUST overlap the range
>>    described by loca_offset and loca_length.  The metadata server may
>>    use this information to determine whether the file's size needs to be
>>    updated.  If the metadata server updates the file's size as the
>>    result of the LAYOUTCOMMIT operation, it must return the new size
>>    (locr_newsize.ns_size) as part of the results.
>>    The loca_time_modify field allows the client to suggest a
>>    modification time it would like the metadata server to set.  The
>>    metadata server may use the suggestion or it may use the time of the
>>    LAYOUTCOMMIT operation to set the modification time.  If the metadata
>>    server uses the client-provided modification time, it should ensure
>>    that time does not flow backwards.  If the client wants to force the
>>    metadata server to set an exact time, the client should use a SETATTR
>>    operation in a COMPOUND right after LAYOUTCOMMIT.  See Section 12.5.4
>>    for more details.  If the client desires the resultant modification
>>    time, it should construct the COMPOUND so that a GETATTR follows the
>> Here are details on how loca_last_write_offset and loca_time_modify
>> are used to update the MDS.
>> Although the Flex Files layout does not define any loca_layout_update,
>> it seems that this could be added more easily than new NFSv4.2 operations
>> if additional client->MDS information is deemed necessary.
>> Now, I do agree that:
>>    The LAYOUTCOMMIT operation commits changes in the layout represented
>>    by the current filehandle, client ID (derived from the session ID in
>>    the preceding SEQUENCE operation), byte-range, and stateid.
>> "commits changes in the layout" is kinda weird and does not clarify when
>> LayoutCommit should be done by a client. I think this should be clarified,
>> at least for Flex File Layout, as whenever the client has completed a series
>> of writes to DS(s) using the layout and expects the MDS to return
>> up-to-date Getattr information for the file to clients.
>> With the above clarification (or whatever works for you), I do not see why
>> new operations are needed?
>> rick
>> On Thu, Mar 30, 2023 at 5:57 PM <> wrote:
>> >
>> >
>> > A New Internet-Draft is available from the on-line Internet-Drafts
>> > directories. This Internet-Draft is a work item of the Network File System
>> > Version 4 (NFSV4) WG of the IETF.
>> >
>> >    Title           : Add LAYOUT_WCC to NFSv4.2's Flex File Layout Type
>> >    Authors         : Thomas Haynes
>> >                      Trond Myklebust
>> >    Filename        : draft-ietf-nfsv4-layoutwcc-01.txt
>> >    Pages           : 11
>> >    Date            : 2023-03-30
>> >
>> > Abstract:
>> >    The Parallel Network File System (pNFS) Flexible File Layout allows
>> >    for a file's metadata (MDS) and data (DS) to be on different servers.
>> >    It does not provide a mechanism for the data server to update the
>> >    metadata server of changes to the data part of the file.  The client
>> >    has knowledge of such updates, but lacks the ability to update the
>> >    metadata server.  This document presents a refinement to RFC8435 to
>> >    allow the client to update the metadata server to changes on the data
>> >    server.
>> >
>> > The IETF datatracker status page for this Internet-Draft is:
>> >
>> >
>> > There is also an HTML version available at:
>> >
>> >
>> > A diff from the previous version is available at:
>> >
>> >
>> > Internet-Drafts are also available by rsync at
>> >
>> >
>> > _______________________________________________
>> > nfsv4 mailing list
>> >
>> >
>> _______________________________________________
>> nfsv4 mailing list