Re: [nfsv4] I-D Action: draft-ietf-nfsv4-layoutwcc-01.txt

David Noveck <davenoveck@gmail.com> Wed, 12 April 2023 12:20 UTC

Return-Path: <davenoveck@gmail.com>
X-Original-To: nfsv4@ietfa.amsl.com
Delivered-To: nfsv4@ietfa.amsl.com
Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id 44FA4C15170B for <nfsv4@ietfa.amsl.com>; Wed, 12 Apr 2023 05:20:39 -0700 (PDT)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -2.094
X-Spam-Level:
X-Spam-Status: No, score=-2.094 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, FREEMAIL_FROM=0.001, HTML_MESSAGE=0.001, RCVD_IN_ZEN_BLOCKED_OPENDNS=0.001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001, URIBL_BLOCKED=0.001, URIBL_DBL_BLOCKED_OPENDNS=0.001, URIBL_ZEN_BLOCKED_OPENDNS=0.001] autolearn=ham autolearn_force=no
Authentication-Results: ietfa.amsl.com (amavisd-new); dkim=pass (2048-bit key) header.d=gmail.com
Received: from mail.ietf.org ([50.223.129.194]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id 3mF-jQDVLfEY for <nfsv4@ietfa.amsl.com>; Wed, 12 Apr 2023 05:20:35 -0700 (PDT)
Received: from mail-qk1-x733.google.com (mail-qk1-x733.google.com [IPv6:2607:f8b0:4864:20::733]) (using TLSv1.3 with cipher TLS_AES_128_GCM_SHA256 (128/128 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id 5D7D6C1516F8 for <nfsv4@ietf.org>; Wed, 12 Apr 2023 05:20:35 -0700 (PDT)
Received: by mail-qk1-x733.google.com with SMTP id 201so2192417qkl.2 for <nfsv4@ietf.org>; Wed, 12 Apr 2023 05:20:35 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20221208; t=1681302034; x=1683894034; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=1zRPqOV1YUTajmudoh/Wm8Z9uTOWpX2zjNg7t11s2kU=; b=Rg4QiJdS9hyc5TPSC2chDJ7A/Og1hwYL75JntyMCqaBD2ujaCvFatePsLZINSuTUpG hwmllfulPocAWWB6/SopMm2+PfqY0mRuOzxoMErwjUuEVM5i5dITNmQ59lTthztnCN7b CPaU8mlIvuObBxZloCDx8y7DpjohMrXpzll/820c4EZXmlsL9dZQkEnU/wpMm2rS5cwY yuyvcc6HMi1rmaIRGqng/1gbD6X+0MTbulSDmZ4VYCtIqHW5pRC4TXb7+BymKM7VA1Iw QknEw0FFtxvJ1WhnucTkhA8mUypforO4ycHmjDfGQMrrXTbOJqHcM83HttSP16ClT4tj KQUQ==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; t=1681302034; x=1683894034; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=1zRPqOV1YUTajmudoh/Wm8Z9uTOWpX2zjNg7t11s2kU=; b=an4tcWlDZkEEelrgh2QwTzr1YNM5Z4icYJUsrIjOC582AZQpqfWUW4kOaCfrJzbdkS e+T9Ft2+r3fiS65RvwvmH9jEKWZctEVezgMxGxiMY5t0BhrSSfEOru24p1V6NbMU5ki8 oSBBqK32KJfAXuc4T9NMfmzjGbpfP5V/0XYhMKPBHcG2Iy9z45yfQnnDEQ1FpKvJZOJ4 t1tA+5066Hhn5spcWxU/dWrkTBxDXVcxBTJoswRSD+yXK57WiMmQw0D0AFIEXi0mI7KN wpSVcH+E78+XnVSGIvdmH7xGMfOHZ/DTax4NhoLG3y6s4eesbSW4CNVMjXKm7akEblLJ nvFQ==
X-Gm-Message-State: AAQBX9eoDsXb4+Ye8VobzHz7NweToMloSjh7KQdofmZgTBnqjF4iP+BU A89rGAeCxFvgIOdWqkmINfktMg/N4sWWSO94QfcIClF5
X-Google-Smtp-Source: AKy350Zw/dZtsSJF7SwL62QRn46Gg5Odev8aXKMtkgY3SzrNXTomv786TqNJqL7huMTc6cZAANeSrWXjwNEtgKUJTvQ=
X-Received: by 2002:a05:620a:190c:b0:746:96c2:e458 with SMTP id bj12-20020a05620a190c00b0074696c2e458mr823150qkb.4.1681302034190; Wed, 12 Apr 2023 05:20:34 -0700 (PDT)
MIME-Version: 1.0
References: <168022420664.45787.1714641403814314728@ietfa.amsl.com> <CAM5tNy6udE6gqKHTfcTfzmf+CTqWiA9kuz__fnaqCyPPUF1+ZQ@mail.gmail.com> <CAABAsM6SNu1getd5GgGqp96qZtL4HMqt1mKTwLY=LaDOqdefdQ@mail.gmail.com> <CAM5tNy4UVT1R8jeECg+N_=tVn14B7QSNCT2WgZNzA2OG17yY-w@mail.gmail.com> <CAABAsM6ZSiXW=0igHyOfBg+QOG3a_6AAbg8zfxMqHJ3_772gxA@mail.gmail.com>
In-Reply-To: <CAABAsM6ZSiXW=0igHyOfBg+QOG3a_6AAbg8zfxMqHJ3_772gxA@mail.gmail.com>
From: David Noveck <davenoveck@gmail.com>
Date: Wed, 12 Apr 2023 08:20:23 -0400
Message-ID: <CADaq8jcymU8iGAd5fdUg0dSVYEOZX85breXR3AbJWNpKoGFz2A@mail.gmail.com>
To: Trond Myklebust <trondmy@gmail.com>
Cc: Rick Macklem <rick.macklem@gmail.com>, NFSv4 <nfsv4@ietf.org>
Content-Type: multipart/alternative; boundary="0000000000008216aa05f922a219"
Archived-At: <https://mailarchive.ietf.org/arch/msg/nfsv4/6NLttYDFtY48UhFxUlNjeUeyFiA>
Subject: Re: [nfsv4] I-D Action: draft-ietf-nfsv4-layoutwcc-01.txt
X-BeenThere: nfsv4@ietf.org
X-Mailman-Version: 2.1.39
Precedence: list
List-Id: NFSv4 Working Group <nfsv4.ietf.org>
List-Unsubscribe: <https://www.ietf.org/mailman/options/nfsv4>, <mailto:nfsv4-request@ietf.org?subject=unsubscribe>
List-Archive: <https://mailarchive.ietf.org/arch/browse/nfsv4/>
List-Post: <mailto:nfsv4@ietf.org>
List-Help: <mailto:nfsv4-request@ietf.org?subject=help>
List-Subscribe: <https://www.ietf.org/mailman/listinfo/nfsv4>, <mailto:nfsv4-request@ietf.org?subject=subscribe>
X-List-Received-Date: Wed, 12 Apr 2023 12:20:39 -0000

On Tue, Apr 11, 2023, 8:15 PM Trond Myklebust <trondmy@gmail.com> wrote:

> As I've said many times before, LAYOUTCOMMIT is not only unnecessary, but
> is a terrible idea.
>

 No disagreement on my part but we will have to deal with the fact that it
is part of rfc8881.   Coming up with suitable text for rfc5661bis will be a
challenge.  Sigh!


Any O_SYNC workload is literally sunk when you require LAYOUTCOMMIT in
> order to fully commit the metadata, because each write() syscall needs to
> go both to the data server and the metadata server.
>
> No proposal that attempts to expand the use of LAYOUTCOMMIT is going to be
> welcome on the Linux client.
>
> On Tue, Apr 11, 2023 at 4:14 PM Rick Macklem <rick.macklem@gmail.com>
> wrote:
>
>> On Tue, Apr 11, 2023 at 6:40 AM Trond Myklebust <trondmy@gmail.com>
>> wrote:
>> >
>> > LAYOUTCOMMIT sort of makes sense for block based layouts, since you do
>> need to persist updates of the blocks with the metadata server. It makes
>> zero sense for files and flex files layout types, since you already have
>> persistence in the form of the regular COMMIT operation. While you could
>> argue that LAYOUTCOMMIT can be seen as just providing hints in those cases,
>> there is nothing in the protocol spec that allows you to conclude that. It
>> is not worth introducing a new operation that needs to be called every time
>> that the client wants to ensure persistence, just in order to resolve a
>> server reboot issue.
>> >
>> First, note the following sentence:
>>     The LAYOUTCOMMIT operation indicates that the client has completed
>>     writes using a layout obtained by a previous LAYOUTGET.
>> In particular, note that it says nothing like "when data needs to be
>> committed to stable storage".
>> Also, I cannot find anywhere where the RFC discusses the relationship
>> between LAYOUTCOMMIT
>> and COMMIT.  Therefore, I do not believe any relationship should be
>> assumed.
>>
>> Now, here's the back story:
>> When I implemented the pNFS server for FreeBSD, I assumed that the
>> client would do a
>> LayoutCommit after a series of writes and before it expected the MDS
>> to have up-to-date
>> attributes for the file that was written to the DS (using a RW layout).
>> --> At the time of LayoutCommit, the MDS did a Getattr on the DS(s)
>> for the layout to
>>      update the MDS's copy of the attributes that change when writing
>> occurs.
>>
>> This worked fine for the FreeBSD client, which does a LayoutCommit
>> whenever a
>> process did a POSIX fsync or close syscall.
>>
>> Then I started testing using a Linux client and found out that it got
>> confused by
>> stale attributes acquired from the MDS after doing write(s) to the DS(s).
>> Looking at the packet trace I found that, for the case where the Linux
>> client
>> did writes to the DS(s) with File_Sync set, the Linux client did not do a
>> LayoutCommit. (It, of course, did not do a Commit, but that was what I
>> expected.)
>> --> I looked in the RFC for a statement that would make it clear that the
>> Linux
>>       client was broken and should do a LayoutCommit after these write(s)
>> but...
>>       I could not find one.
>>        So, to work around the problem I added a tunable called
>>        vfs.nfsd.pnfsgetdsattr which, when enabled, makes the pNFS server
>> do a
>>        Getattr against the DS(s) whenever it receives a Getattr from a
>> client and
>>        a RW layout has been issued for the file.
>>        --> This resulted in a lot of MDS->DS Getattrs, but made the Linux
>> client
>>              function well against the FreeBSD pNFS server.
>>
>> It looks to me like the proposed Layout_WCC operation is mostly just
>> trying to solve
>> the problem that can be solved by requiring a pNFS client to do a
>> LayoutCommit
>> after a series of writes are done to DS(s) based on a RW layout and
>> before it
>> expects the MDS to have up-to-date attributes for the file that was
>> written.
>> (For my client, that seems to be whenever a POSIX close or fsync is done
>> on
>>   the file.)
>>
>> Since LayoutCommits do not happen that frequently, the overhead of doing
>> a Getattr against a DS for each of them is not particularily large.
>>
>> Then there is the separate issue of how a client reports errors
>> encountered
>> when doing write(s) to a DS if the MDS reboots, making the layout stateid
>> stale.
>> I'd rather see this handled via a non-empty loca_layoutupdate, since
>> that handles
>> the problem for both NFSv4.1 and NFSv4.2 via a layout specific driver.
>> (NFSv4.1 is deficient, since it does not have LayoutError and this could
>> work
>>  around that.) A new NFSv4.2 operation only fixes the issue for NFSv4.2
>> and
>> only after it is adopted and only if it is clearly specified when the
>> client must
>> do the operation.
>>
>> It seems to me that a non-empty loca_layoutupdate is only needed for the
>> rare second case, although it might be useful for the maintenance of
>> up-to-date attributes on the MDS.
>>
>> Finally, I will note that it is not obvious how the client can provide a
>> Change
>> attribute to the MDS after doing Writes to a NFSv3 DS.  I think that the
>> client can provide a boolean that indicates if a write to a DS for the
>> file
>> has been done since the last LayoutCommit (or Layout_WCC if you
>> choose to go that way).
>>
>> > Secondly, as you pointed out, the repurposing of LAYOUTCOMMIT for
>> propagating layout stats and layout errors requires a protocol change in
>> any case. While that protocol change would be limited to the flex files
>> protocol, it requires a whole new flex files v2 spec, since there is no
>> procedure for amending the layout protocols in the way that exists for the
>> NFSv4.2 protocol itself.
>>
>> First, I do not consider this "repurposing". I think you have assumed
>> a relationship
>> between LayoutCommit and Commit that does not exist and, although the RFC
>> seems
>> vague w.r.t. when LayoutCommits are to be performed by a pNFS client,
>> I do think that
>> they were intended for this use. Among other things, they are allowed
>> to be done during
>> Grace using a stale layout stateid.
>>
>> Second, what's the difference between an amended RFC for Flexible File
>> Layout and
>> an RFC for a new NFSv4.2 operation? All that changes is the
>> loca_layoutupdate becomes
>> optionally non-empty. (If everyone is really convinced that is not
>> permitted, then define
>> Flexible File Layout 2 with that minimal change, to fix the "how to
>> report DS errors when
>> the MDS reboots.)
>>
>> >
>> > Lastly, while LAYOUTCOMMIT could potentially carry a layout stats and
>> layout error payload in the same way that LAYOUTRETURN can, that payload
>> size is limited by the negotiated 'ca_maxrequestsize' session value. This
>> is in part why we have dedicated LAYOUTSTATS and LAYOUTERROR operations in
>> the first place: so that the client can report the full information in
>> multiple RPC calls when necessary, without being limited by the payload
>> limitations of a single RPC call.
>>
>> The arguments that come before loca_layoutupdate for LayoutCommit
>> are (at a glance, so I might be off a little) 68bytes vs 20bytes for
>> Layout_WCC.
>> I doubt 48bytes is going to make a difference. I also believe that
>> up-to-date
>> attributes and errors received from write(s) to DS(s) while the MDS
>> reboots
>> are what matters. I do not see stats as important for the case where the
>> MDS
>> has rebooted and the client cannot do a LayoutStats with the old layout
>> stateid.
>> (After all, LayoutStats does not even exist in NFSv4.1.)
>>
>> rick
>>
>> >
>> > On Mon, Apr 10, 2023 at 7:11 PM Rick Macklem <rick.macklem@gmail.com>
>> wrote:
>> >>
>> >> If Tom commented w.r.t. the use of LAYOUTCOMMIT for this,
>> >> I missed it.
>> >>
>> >> Here are some snippets from RFC8881 Sec. 18.42 plus comments:
>> >>
>> >>    The LAYOUTCOMMIT operation indicates that the client has completed
>> >>    writes using a layout obtained by a previous LAYOUTGET.  The client
>> >>    may have only written a subset of the data range it previously
>> >>    requested.  LAYOUTCOMMIT allows it to commit or discard
>> provisionally
>> >>    allocated space and to update the server with a new end-of-file.
>> >> Note that the arguments loca_last_write_offset, loca_time_modify and
>> >> loca_layoutupdate can be used by the client to inform the MDS about
>> >> file changes due to writing to the DS.
>> >>
>> >>    If the loca_reclaim field is set to TRUE, this indicates that the
>> >>    client is attempting to commit changes to a layout after the restart
>> >>    of the metadata server during the metadata server's recovery grace
>> >>    period (see Section 12.7.4).  This type of request may be necessary
>> >>    when the client has uncommitted writes to provisionally allocated
>> >>    byte-ranges of a file that were sent to the storage devices before
>> >>    the restart of the metadata server.
>> >> This explains how LayoutCommit can be used after an MDS reboot
>> >> to update the MDS w.r.t. writes done to the DS.
>> >>
>> >>    The loca_last_write_offset field specifies the offset of the last
>> >>    byte written by the client previous to the LAYOUTCOMMIT.  Note that
>> >>    this value is never equal to the file's size (at most it is one byte
>> >>    less than the file's size) and MUST be less than or equal to
>> >>    NFS4_MAXFILEOFF.  Also, loca_last_write_offset MUST overlap the
>> range
>> >>    described by loca_offset and loca_length.  The metadata server may
>> >>    use this information to determine whether the file's size needs to
>> be
>> >>    updated.  If the metadata server updates the file's size as the
>> >>    result of the LAYOUTCOMMIT operation, it must return the new size
>> >>    (locr_newsize.ns_size) as part of the results.
>> >>
>> >>    The loca_time_modify field allows the client to suggest a
>> >>    modification time it would like the metadata server to set.  The
>> >>    metadata server may use the suggestion or it may use the time of the
>> >>    LAYOUTCOMMIT operation to set the modification time.  If the
>> metadata
>> >>    server uses the client-provided modification time, it should ensure
>> >>    that time does not flow backwards.  If the client wants to force the
>> >>    metadata server to set an exact time, the client should use a
>> SETATTR
>> >>    operation in a COMPOUND right after LAYOUTCOMMIT.  See Section
>> 12.5.4
>> >>    for more details.  If the client desires the resultant modification
>> >>    time, it should construct the COMPOUND so that a GETATTR follows the
>> >>    LAYOUTCOMMIT.
>> >> Here are details on how loca_last_write_offset and loca_time_modify
>> >> are used to update the MDS.
>> >>
>> >> Although the Flex Files layout does not define any loca_layout_update,
>> >> it seems that this could be added more easily than new NFSv4.2
>> operations
>> >> if additional client->MDS information is deemed necessary.
>> >>
>> >> Now, I do agree that:
>> >>    The LAYOUTCOMMIT operation commits changes in the layout represented
>> >>    by the current filehandle, client ID (derived from the session ID in
>> >>    the preceding SEQUENCE operation), byte-range, and stateid.
>> >> "commits changes in the layout" is kinda weird and does not clarify
>> when
>> >> LayoutCommit should be done by a client. I think this should be
>> clarified,
>> >> at least for Flex File Layout, as whenever the client has completed a
>> series
>> >> of writes to DS(s) using the layout and expects the MDS to return
>> >> up-to-date Getattr information for the file to clients.
>> >>
>> >> With the above clarification (or whatever works for you), I do not see
>> why
>> >> new operations are needed?
>> >>
>> >> rick
>> >>
>> >> On Thu, Mar 30, 2023 at 5:57 PM <internet-drafts@ietf.org> wrote:
>> >> >
>> >> >
>> >> > A New Internet-Draft is available from the on-line Internet-Drafts
>> >> > directories. This Internet-Draft is a work item of the Network File
>> System
>> >> > Version 4 (NFSV4) WG of the IETF.
>> >> >
>> >> >    Title           : Add LAYOUT_WCC to NFSv4.2's Flex File Layout
>> Type
>> >> >    Authors         : Thomas Haynes
>> >> >                      Trond Myklebust
>> >> >    Filename        : draft-ietf-nfsv4-layoutwcc-01.txt
>> >> >    Pages           : 11
>> >> >    Date            : 2023-03-30
>> >> >
>> >> > Abstract:
>> >> >    The Parallel Network File System (pNFS) Flexible File Layout
>> allows
>> >> >    for a file's metadata (MDS) and data (DS) to be on different
>> servers.
>> >> >    It does not provide a mechanism for the data server to update the
>> >> >    metadata server of changes to the data part of the file.  The
>> client
>> >> >    has knowledge of such updates, but lacks the ability to update the
>> >> >    metadata server.  This document presents a refinement to RFC8435
>> to
>> >> >    allow the client to update the metadata server to changes on the
>> data
>> >> >    server.
>> >> >
>> >> > The IETF datatracker status page for this Internet-Draft is:
>> >> > https://datatracker.ietf.org/doc/draft-ietf-nfsv4-layoutwcc/
>> >> >
>> >> > There is also an HTML version available at:
>> >> > https://www.ietf.org/archive/id/draft-ietf-nfsv4-layoutwcc-01.html
>> >> >
>> >> > A diff from the previous version is available at:
>> >> >
>> https://author-tools.ietf.org/iddiff?url2=draft-ietf-nfsv4-layoutwcc-01
>> >> >
>> >> > Internet-Drafts are also available by rsync at rsync.ietf.org:
>> :internet-drafts
>> >> >
>> >> >
>> >> > _______________________________________________
>> >> > nfsv4 mailing list
>> >> > nfsv4@ietf.org
>> >> > https://www.ietf.org/mailman/listinfo/nfsv4
>> >>
>> >> _______________________________________________
>> >> nfsv4 mailing list
>> >> nfsv4@ietf.org
>> >> https://www.ietf.org/mailman/listinfo/nfsv4
>>
> _______________________________________________
> nfsv4 mailing list
> nfsv4@ietf.org
> https://www.ietf.org/mailman/listinfo/nfsv4
>