Re: [nfsv4] I-D Action: draft-ietf-nfsv4-layoutwcc-01.txt

Trond Myklebust <trondmy@gmail.com> Wed, 12 April 2023 00:15 UTC

Return-Path: <trondmy@gmail.com>
X-Original-To: nfsv4@ietfa.amsl.com
Delivered-To: nfsv4@ietfa.amsl.com
Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id 46AD6C1CAB50 for <nfsv4@ietfa.amsl.com>; Tue, 11 Apr 2023 17:15:31 -0700 (PDT)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -2.094
X-Spam-Level:
X-Spam-Status: No, score=-2.094 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, FREEMAIL_FROM=0.001, HTML_MESSAGE=0.001, RCVD_IN_DNSWL_NONE=-0.0001, RCVD_IN_ZEN_BLOCKED_OPENDNS=0.001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001, URIBL_BLOCKED=0.001, URIBL_DBL_BLOCKED_OPENDNS=0.001, URIBL_ZEN_BLOCKED_OPENDNS=0.001] autolearn=ham autolearn_force=no
Authentication-Results: ietfa.amsl.com (amavisd-new); dkim=pass (2048-bit key) header.d=gmail.com
Received: from mail.ietf.org ([50.223.129.194]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id e2crTrhz4uKN for <nfsv4@ietfa.amsl.com>; Tue, 11 Apr 2023 17:15:29 -0700 (PDT)
Received: from mail-yb1-xb30.google.com (mail-yb1-xb30.google.com [IPv6:2607:f8b0:4864:20::b30]) (using TLSv1.3 with cipher TLS_AES_128_GCM_SHA256 (128/128 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id D2932C19E117 for <nfsv4@ietf.org>; Tue, 11 Apr 2023 17:15:29 -0700 (PDT)
Received: by mail-yb1-xb30.google.com with SMTP id y186so10122591yby.13 for <nfsv4@ietf.org>; Tue, 11 Apr 2023 17:15:29 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20221208; t=1681258527; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=lCPcPyEkGsLRkz26nX+KYxMWFySdgkG6uhAKRjoGpHw=; b=eM9bNLNslOC4qr0lPJIVrSqK+HZuKHixBHTUAt+p1nmvm8bdwg0GKMNsM9YrrhXAAh G0RHfygag+wS8uH8gcB44WkQ8G2I4HXU7rLiQc0qV68L+wBPtXhrHQLx5cGn5Ff6Vaj7 3Bg/sm1axmoQQigOQgZCJu+Boxj+iG2IyXOQcVnMAGEjbUnkDobNgb7CIvwLulWfASfS N06YaBGgoU+6UO6FbfLUYfog5EbZLjVbj1wiRNAu/sHxDsKNPurQpsAxnnALLQSPvSYb 4MgOGYkpEcmao/Uy8v44a7t5NKr3MYQCjprLQu1P1yF49DjjbTbbxIVnTXwXyyXyAug5 3iDg==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; t=1681258527; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=lCPcPyEkGsLRkz26nX+KYxMWFySdgkG6uhAKRjoGpHw=; b=akjAa7y+Q71d0aIvoM88gN5GAgNZmsHot2V+2muL7dIc4sUBa7bECW1J+yYXFgEQLq FyIV2KJO3mTwMdt9Ne1q0STRA6gNRypDVQAkCok1is3aaLU+adMvwrZUiviQKTKVyT8i OnxuW6jDRI/Yeo9MqZPOuv0dFxqxySxE5j014ajYQ3U7UNb3mI/fppvtNTI4mB4CukH5 Oe09wb7hH41zYJpzBD9h1aRn+PsmVd5LBGAEsD2avXVXx+yret5OKQl9CHYCgBRvUvEl +65D697ZMu08SojbDW9N76zXOuO2i23Y4LZQoWlkJjux8bUybA+tZDmOUH/QAT77t6VO FF4g==
X-Gm-Message-State: AAQBX9fbYrA4zE9BBvRuNOdf1LWzuDXNzocsqN/PKEnzTWGnwefmZrk3 U41/4o/RqYlJD1ITrDkKoWQTve2PdX8IPqw/WkMdYIJAlQ==
X-Google-Smtp-Source: AKy350Zw7XhJE5g9ZCkvvJSpFlbDOD7s6z7P0DRHJ6Q8Nq6Gcl6okHJzNciL98ZhZAVCEtrB68p3Dxbxv/pZ5mHKEec=
X-Received: by 2002:a25:7b03:0:b0:b8b:fca4:7454 with SMTP id w3-20020a257b03000000b00b8bfca47454mr758657ybc.4.1681258526555; Tue, 11 Apr 2023 17:15:26 -0700 (PDT)
MIME-Version: 1.0
References: <168022420664.45787.1714641403814314728@ietfa.amsl.com> <CAM5tNy6udE6gqKHTfcTfzmf+CTqWiA9kuz__fnaqCyPPUF1+ZQ@mail.gmail.com> <CAABAsM6SNu1getd5GgGqp96qZtL4HMqt1mKTwLY=LaDOqdefdQ@mail.gmail.com> <CAM5tNy4UVT1R8jeECg+N_=tVn14B7QSNCT2WgZNzA2OG17yY-w@mail.gmail.com>
In-Reply-To: <CAM5tNy4UVT1R8jeECg+N_=tVn14B7QSNCT2WgZNzA2OG17yY-w@mail.gmail.com>
From: Trond Myklebust <trondmy@gmail.com>
Date: Tue, 11 Apr 2023 20:15:15 -0400
Message-ID: <CAABAsM6ZSiXW=0igHyOfBg+QOG3a_6AAbg8zfxMqHJ3_772gxA@mail.gmail.com>
To: Rick Macklem <rick.macklem@gmail.com>
Cc: nfsv4@ietf.org
Content-Type: multipart/alternative; boundary="0000000000004044ed05f9188170"
Archived-At: <https://mailarchive.ietf.org/arch/msg/nfsv4/M3PHvlACf2tk3sXVP1USC0rNPd0>
Subject: Re: [nfsv4] I-D Action: draft-ietf-nfsv4-layoutwcc-01.txt
X-BeenThere: nfsv4@ietf.org
X-Mailman-Version: 2.1.39
Precedence: list
List-Id: NFSv4 Working Group <nfsv4.ietf.org>
List-Unsubscribe: <https://www.ietf.org/mailman/options/nfsv4>, <mailto:nfsv4-request@ietf.org?subject=unsubscribe>
List-Archive: <https://mailarchive.ietf.org/arch/browse/nfsv4/>
List-Post: <mailto:nfsv4@ietf.org>
List-Help: <mailto:nfsv4-request@ietf.org?subject=help>
List-Subscribe: <https://www.ietf.org/mailman/listinfo/nfsv4>, <mailto:nfsv4-request@ietf.org?subject=subscribe>
X-List-Received-Date: Wed, 12 Apr 2023 00:15:31 -0000

As I've said many times before, LAYOUTCOMMIT is not only unnecessary, but
is a terrible idea. Any O_SYNC workload is literally sunk when you require
LAYOUTCOMMIT in order to fully commit the metadata, because each write()
syscall needs to go both to the data server and the metadata server.

No proposal that attempts to expand the use of LAYOUTCOMMIT is going to be
welcome on the Linux client.

On Tue, Apr 11, 2023 at 4:14 PM Rick Macklem <rick.macklem@gmail.com> wrote:

> On Tue, Apr 11, 2023 at 6:40 AM Trond Myklebust <trondmy@gmail.com> wrote:
> >
> > LAYOUTCOMMIT sort of makes sense for block based layouts, since you do
> need to persist updates of the blocks with the metadata server. It makes
> zero sense for files and flex files layout types, since you already have
> persistence in the form of the regular COMMIT operation. While you could
> argue that LAYOUTCOMMIT can be seen as just providing hints in those cases,
> there is nothing in the protocol spec that allows you to conclude that. It
> is not worth introducing a new operation that needs to be called every time
> that the client wants to ensure persistence, just in order to resolve a
> server reboot issue.
> >
> First, note the following sentence:
>     The LAYOUTCOMMIT operation indicates that the client has completed
>     writes using a layout obtained by a previous LAYOUTGET.
> In particular, note that it says nothing like "when data needs to be
> committed to stable storage".
> Also, I cannot find anywhere where the RFC discusses the relationship
> between LAYOUTCOMMIT
> and COMMIT.  Therefore, I do not believe any relationship should be
> assumed.
>
> Now, here's the back story:
> When I implemented the pNFS server for FreeBSD, I assumed that the
> client would do a
> LayoutCommit after a series of writes and before it expected the MDS
> to have up-to-date
> attributes for the file that was written to the DS (using a RW layout).
> --> At the time of LayoutCommit, the MDS did a Getattr on the DS(s)
> for the layout to
>      update the MDS's copy of the attributes that change when writing
> occurs.
>
> This worked fine for the FreeBSD client, which does a LayoutCommit
> whenever a
> process did a POSIX fsync or close syscall.
>
> Then I started testing using a Linux client and found out that it got
> confused by
> stale attributes acquired from the MDS after doing write(s) to the DS(s).
> Looking at the packet trace I found that, for the case where the Linux
> client
> did writes to the DS(s) with File_Sync set, the Linux client did not do a
> LayoutCommit. (It, of course, did not do a Commit, but that was what I
> expected.)
> --> I looked in the RFC for a statement that would make it clear that the
> Linux
>       client was broken and should do a LayoutCommit after these write(s)
> but...
>       I could not find one.
>        So, to work around the problem I added a tunable called
>        vfs.nfsd.pnfsgetdsattr which, when enabled, makes the pNFS server
> do a
>        Getattr against the DS(s) whenever it receives a Getattr from a
> client and
>        a RW layout has been issued for the file.
>        --> This resulted in a lot of MDS->DS Getattrs, but made the Linux
> client
>              function well against the FreeBSD pNFS server.
>
> It looks to me like the proposed Layout_WCC operation is mostly just
> trying to solve
> the problem that can be solved by requiring a pNFS client to do a
> LayoutCommit
> after a series of writes are done to DS(s) based on a RW layout and before
> it
> expects the MDS to have up-to-date attributes for the file that was
> written.
> (For my client, that seems to be whenever a POSIX close or fsync is done on
>   the file.)
>
> Since LayoutCommits do not happen that frequently, the overhead of doing
> a Getattr against a DS for each of them is not particularily large.
>
> Then there is the separate issue of how a client reports errors encountered
> when doing write(s) to a DS if the MDS reboots, making the layout stateid
> stale.
> I'd rather see this handled via a non-empty loca_layoutupdate, since
> that handles
> the problem for both NFSv4.1 and NFSv4.2 via a layout specific driver.
> (NFSv4.1 is deficient, since it does not have LayoutError and this could
> work
>  around that.) A new NFSv4.2 operation only fixes the issue for NFSv4.2 and
> only after it is adopted and only if it is clearly specified when the
> client must
> do the operation.
>
> It seems to me that a non-empty loca_layoutupdate is only needed for the
> rare second case, although it might be useful for the maintenance of
> up-to-date attributes on the MDS.
>
> Finally, I will note that it is not obvious how the client can provide a
> Change
> attribute to the MDS after doing Writes to a NFSv3 DS.  I think that the
> client can provide a boolean that indicates if a write to a DS for the file
> has been done since the last LayoutCommit (or Layout_WCC if you
> choose to go that way).
>
> > Secondly, as you pointed out, the repurposing of LAYOUTCOMMIT for
> propagating layout stats and layout errors requires a protocol change in
> any case. While that protocol change would be limited to the flex files
> protocol, it requires a whole new flex files v2 spec, since there is no
> procedure for amending the layout protocols in the way that exists for the
> NFSv4.2 protocol itself.
>
> First, I do not consider this "repurposing". I think you have assumed
> a relationship
> between LayoutCommit and Commit that does not exist and, although the RFC
> seems
> vague w.r.t. when LayoutCommits are to be performed by a pNFS client,
> I do think that
> they were intended for this use. Among other things, they are allowed
> to be done during
> Grace using a stale layout stateid.
>
> Second, what's the difference between an amended RFC for Flexible File
> Layout and
> an RFC for a new NFSv4.2 operation? All that changes is the
> loca_layoutupdate becomes
> optionally non-empty. (If everyone is really convinced that is not
> permitted, then define
> Flexible File Layout 2 with that minimal change, to fix the "how to
> report DS errors when
> the MDS reboots.)
>
> >
> > Lastly, while LAYOUTCOMMIT could potentially carry a layout stats and
> layout error payload in the same way that LAYOUTRETURN can, that payload
> size is limited by the negotiated 'ca_maxrequestsize' session value. This
> is in part why we have dedicated LAYOUTSTATS and LAYOUTERROR operations in
> the first place: so that the client can report the full information in
> multiple RPC calls when necessary, without being limited by the payload
> limitations of a single RPC call.
>
> The arguments that come before loca_layoutupdate for LayoutCommit
> are (at a glance, so I might be off a little) 68bytes vs 20bytes for
> Layout_WCC.
> I doubt 48bytes is going to make a difference. I also believe that
> up-to-date
> attributes and errors received from write(s) to DS(s) while the MDS reboots
> are what matters. I do not see stats as important for the case where the
> MDS
> has rebooted and the client cannot do a LayoutStats with the old layout
> stateid.
> (After all, LayoutStats does not even exist in NFSv4.1.)
>
> rick
>
> >
> > On Mon, Apr 10, 2023 at 7:11 PM Rick Macklem <rick.macklem@gmail.com>
> wrote:
> >>
> >> If Tom commented w.r.t. the use of LAYOUTCOMMIT for this,
> >> I missed it.
> >>
> >> Here are some snippets from RFC8881 Sec. 18.42 plus comments:
> >>
> >>    The LAYOUTCOMMIT operation indicates that the client has completed
> >>    writes using a layout obtained by a previous LAYOUTGET.  The client
> >>    may have only written a subset of the data range it previously
> >>    requested.  LAYOUTCOMMIT allows it to commit or discard provisionally
> >>    allocated space and to update the server with a new end-of-file.
> >> Note that the arguments loca_last_write_offset, loca_time_modify and
> >> loca_layoutupdate can be used by the client to inform the MDS about
> >> file changes due to writing to the DS.
> >>
> >>    If the loca_reclaim field is set to TRUE, this indicates that the
> >>    client is attempting to commit changes to a layout after the restart
> >>    of the metadata server during the metadata server's recovery grace
> >>    period (see Section 12.7.4).  This type of request may be necessary
> >>    when the client has uncommitted writes to provisionally allocated
> >>    byte-ranges of a file that were sent to the storage devices before
> >>    the restart of the metadata server.
> >> This explains how LayoutCommit can be used after an MDS reboot
> >> to update the MDS w.r.t. writes done to the DS.
> >>
> >>    The loca_last_write_offset field specifies the offset of the last
> >>    byte written by the client previous to the LAYOUTCOMMIT.  Note that
> >>    this value is never equal to the file's size (at most it is one byte
> >>    less than the file's size) and MUST be less than or equal to
> >>    NFS4_MAXFILEOFF.  Also, loca_last_write_offset MUST overlap the range
> >>    described by loca_offset and loca_length.  The metadata server may
> >>    use this information to determine whether the file's size needs to be
> >>    updated.  If the metadata server updates the file's size as the
> >>    result of the LAYOUTCOMMIT operation, it must return the new size
> >>    (locr_newsize.ns_size) as part of the results.
> >>
> >>    The loca_time_modify field allows the client to suggest a
> >>    modification time it would like the metadata server to set.  The
> >>    metadata server may use the suggestion or it may use the time of the
> >>    LAYOUTCOMMIT operation to set the modification time.  If the metadata
> >>    server uses the client-provided modification time, it should ensure
> >>    that time does not flow backwards.  If the client wants to force the
> >>    metadata server to set an exact time, the client should use a SETATTR
> >>    operation in a COMPOUND right after LAYOUTCOMMIT.  See Section 12.5.4
> >>    for more details.  If the client desires the resultant modification
> >>    time, it should construct the COMPOUND so that a GETATTR follows the
> >>    LAYOUTCOMMIT.
> >> Here are details on how loca_last_write_offset and loca_time_modify
> >> are used to update the MDS.
> >>
> >> Although the Flex Files layout does not define any loca_layout_update,
> >> it seems that this could be added more easily than new NFSv4.2
> operations
> >> if additional client->MDS information is deemed necessary.
> >>
> >> Now, I do agree that:
> >>    The LAYOUTCOMMIT operation commits changes in the layout represented
> >>    by the current filehandle, client ID (derived from the session ID in
> >>    the preceding SEQUENCE operation), byte-range, and stateid.
> >> "commits changes in the layout" is kinda weird and does not clarify when
> >> LayoutCommit should be done by a client. I think this should be
> clarified,
> >> at least for Flex File Layout, as whenever the client has completed a
> series
> >> of writes to DS(s) using the layout and expects the MDS to return
> >> up-to-date Getattr information for the file to clients.
> >>
> >> With the above clarification (or whatever works for you), I do not see
> why
> >> new operations are needed?
> >>
> >> rick
> >>
> >> On Thu, Mar 30, 2023 at 5:57 PM <internet-drafts@ietf.org> wrote:
> >> >
> >> >
> >> > A New Internet-Draft is available from the on-line Internet-Drafts
> >> > directories. This Internet-Draft is a work item of the Network File
> System
> >> > Version 4 (NFSV4) WG of the IETF.
> >> >
> >> >    Title           : Add LAYOUT_WCC to NFSv4.2's Flex File Layout Type
> >> >    Authors         : Thomas Haynes
> >> >                      Trond Myklebust
> >> >    Filename        : draft-ietf-nfsv4-layoutwcc-01.txt
> >> >    Pages           : 11
> >> >    Date            : 2023-03-30
> >> >
> >> > Abstract:
> >> >    The Parallel Network File System (pNFS) Flexible File Layout allows
> >> >    for a file's metadata (MDS) and data (DS) to be on different
> servers.
> >> >    It does not provide a mechanism for the data server to update the
> >> >    metadata server of changes to the data part of the file.  The
> client
> >> >    has knowledge of such updates, but lacks the ability to update the
> >> >    metadata server.  This document presents a refinement to RFC8435 to
> >> >    allow the client to update the metadata server to changes on the
> data
> >> >    server.
> >> >
> >> > The IETF datatracker status page for this Internet-Draft is:
> >> > https://datatracker.ietf.org/doc/draft-ietf-nfsv4-layoutwcc/
> >> >
> >> > There is also an HTML version available at:
> >> > https://www.ietf.org/archive/id/draft-ietf-nfsv4-layoutwcc-01.html
> >> >
> >> > A diff from the previous version is available at:
> >> >
> https://author-tools.ietf.org/iddiff?url2=draft-ietf-nfsv4-layoutwcc-01
> >> >
> >> > Internet-Drafts are also available by rsync at rsync.ietf.org:
> :internet-drafts
> >> >
> >> >
> >> > _______________________________________________
> >> > nfsv4 mailing list
> >> > nfsv4@ietf.org
> >> > https://www.ietf.org/mailman/listinfo/nfsv4
> >>
> >> _______________________________________________
> >> nfsv4 mailing list
> >> nfsv4@ietf.org
> >> https://www.ietf.org/mailman/listinfo/nfsv4
>