Re: [nfsv4] Progressing RFC errata for RFC 5661

Trond Myklebust <trondmy@gmail.com> Thu, 19 September 2019 14:57 UTC

From: Trond Myklebust <trondmy@gmail.com>
Date: Thu, 19 Sep 2019 10:57:17 -0400
Message-ID: <CAABAsM5OM4XjKd+se5os7dHPpKOrO8dOeu=3TOEOFrXzgDQ47g@mail.gmail.com>
To: Rick Macklem <rmacklem@uoguelph.ca>
Cc: "Mkrtchyan, Tigran" <tigran.mkrtchyan@desy.de>, Dave Noveck <davenoveck@gmail.com>, Magnus Westerlund <magnus.westerlund@ericsson.com>, NFSv4 <nfsv4@ietf.org>
Content-Type: text/plain; charset="UTF-8"
Archived-At: <https://mailarchive.ietf.org/arch/msg/nfsv4/NSee5q7QEnIRG16J5-dbtXKh-8Q>
Subject: Re: [nfsv4] Progressing RFC errata for RFC 5661

On Thu, 19 Sep 2019 at 00:07, Rick Macklem <rmacklem@uoguelph.ca> wrote:
>
> Trond Myklebust wrote:
> >Rick,
> >
> >That errata predates most of the Linux pNFS client implementation. We
> >wrote the implementation to conform to the errata.
> >
> >So no. It's not a bug. It's a deliberate design based on a decision
> >that was discussed in the IETF WG, on the mailing list
> >
> >https://mailarchive.ietf.org/arch/msg/nfsv4/_KTtO6uz-MvRoStbhPuOXWZr6yI
> This actually appears to be a discussion related to the offset and length arguments for LayoutCommit, but...
> >

See Spencer's email:
https://mailarchive.ietf.org/arch/msg/nfsv4/tj4_tBSTLMEKqf43Wg_bv1K0TeI

It may have focussed on the offset and length stuff, but it was
intended to cover all the layoutcommit issues that were being
discussed at the time.

> >and in a special session of the IETF:
> >
> >https://mailarchive.ietf.org/arch/msg/nfsv4/Rpw9XCwCARxfaU4ym5L2TauV6ao
> Ok. This was long before I got around to implementing it, so I wouldn't have
> understood the implications.
> --> I would have been interested in hearing the rationale behind not doing
>       LayoutCommit for FILE_SYNC4 writes, since it seems to me that RFC-5661
>       had gotten it right when it required them.

RFC 5661 doesn't mention anything about FILE_SYNC4 in the context of
pNFS/files. All it says is that the server is indicating that the data
+ metadata changes have been persisted. Since the MDS/DS back end
protocol is not part of the pNFS/files protocol specification, but is
required to provide strong coupling of NFSv4.1 state, etc. between the
MDS and DS, it allows the server to implement strong policies w.r.t.
data+metadata persistence. And since the only server implementation
available at the time actually did persist data + metadata when it
returned FILE_SYNC4, we needed that clarification.

Note that pNFS/flexfiles does not impose the same interpretation of
the WRITE flags, because the MDS/DS back end protocol is loosely
coupled (and is designed so that NFS can act as that back end
protocol).
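
To make the distinction concrete, here is a minimal sketch of the
client-side decision under the interpretation described above. The
helper name needs_layoutcommit and the layout-type strings are
hypothetical and not taken from any implementation; only the
stable_how4 values come from RFC 5661. Treating flexfiles as always
requiring a LAYOUTCOMMIT is a simplifying assumption for illustration.

```python
from enum import IntEnum


class StableHow4(IntEnum):
    # stable_how4 values as defined in RFC 5661
    UNSTABLE4 = 0
    DATA_SYNC4 = 1
    FILE_SYNC4 = 2


def needs_layoutcommit(layout_type: str, ds_reply: StableHow4) -> bool:
    """Hypothetical helper: must the client send LAYOUTCOMMIT to the MDS?

    layout_type is an illustrative string, "files" or "flexfiles".
    ds_reply is the stable_how4 value the DS returned for the WRITE.
    """
    if layout_type == "files":
        # Tightly coupled MDS/DS: FILE_SYNC4 from the DS is taken to mean
        # data and metadata are both persisted, so no LAYOUTCOMMIT.
        return ds_reply != StableHow4.FILE_SYNC4
    if layout_type == "flexfiles":
        # Loosely coupled back end: the DS reply says nothing about MDS
        # metadata (assumption: always commit in this sketch).
        return True
    raise ValueError(f"unknown layout type: {layout_type!r}")
```

With this reading, a files-layout WRITE that comes back UNSTABLE4 or
DATA_SYNC4 still leads to a LAYOUTCOMMIT; only the FILE_SYNC4 case
skips it.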

Also note that the requirements of POSIX when it comes to O_SYNC
write()s provide additional motivation here. O_SYNC requires both
data+metadata to be persisted when the write() system call returns.
You cannot get scale out behaviour if each O_SYNC write requires both
I/O to the DS, and then a LAYOUTCOMMIT to the MDS in order to persist
the file metadata. In fact, you can't even get basic NFSv4 protocol
consistency if you allow the file data to visibly change on the DS but
don't record that fact in the change attribute on the MDS. The generic
pNFS spec in RFC5661 section 12.5.4 is extremely oily about this
scenario and seems to put the onus on the clients for ensuring
data/metadata consistency. However, it does not at all discuss what
should happen in the case where the client does not do this, or is
incapable of doing so due to a reboot, network partition or crash,
other than to note in section 12.7.1 that some data may be lost (Yes,
but what about the metadata, Jim?).
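
The scale-out point can be illustrated with a rough operation count.
This is a back-of-the-envelope sketch only: the function osync_cost
and its fields are hypothetical, and it ignores latency, batching and
anything else a real client or server would do.

```python
def osync_cost(n_writes: int, layoutcommit_per_write: bool) -> dict:
    """Count operations for n_writes O_SYNC write() calls (illustrative).

    Each write() sends one WRITE to a DS. If a LAYOUTCOMMIT is required
    before write() may return, each call also sends one op to the MDS.
    """
    mds_ops = n_writes if layoutcommit_per_write else 0
    return {
        "ds_ops": n_writes,  # can be spread over many DSes
        "mds_ops": mds_ops,  # all land on the single MDS
        "round_trips_per_write": 2 if layoutcommit_per_write else 1,
    }


with_commit = osync_cost(1000, layoutcommit_per_write=True)
without_commit = osync_cost(1000, layoutcommit_per_write=False)
```

In the first case the MDS sees as many operations as there are
writes, no matter how many DSes are added; in the second it sees
none.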

> As I said, the FreeBSD server can handle this case, it just results in a lot of
> overhead synchronizing Size, Change, Time_Modify between MDS and DS
> whenever a RW layout is issued to a client for the file.
>
> Thanks for pointing this out, rick
>
> On Wed, 18 Sep 2019 at 12:39, Rick Macklem <rmacklem@uoguelph.ca> wrote:
> >
> > Mkrtchyan, Tigran wrote:
> > [stuff snipped]
> > >Hi Rick,
> > >
> > >here is the public link to errata
> > >
> > >https://www.rfc-editor.org/errata/eid2751
> > >
> > >Tigran.
> > Thanks Tigran.
> >
> > Ok, so now that I've read it I have to admit I think it is rewriting the RFC
> > to conform with what the Linux client does.
> >
> > I think this para. from Sec. 13.10 of RFC-5661 is clear:
> >    The NFSv4.1 protocol only provides close-to-open file data cache
> >    semantics; meaning that when the file is closed, all modified data is
> >    written to the server.  When a subsequent OPEN of the file is done,
> >    the change attribute is inspected for a difference from a cached
> >    value for the change attribute.  For the case above, this means that
> >    a LAYOUTCOMMIT will be done at close (along with the data WRITEs) and
> >    will update the file's size and change attribute.  Access from
> >    another client after that point will result in the appropriate size
> >    being returned.
> >
> > It states "will be done". It doesn't say anything about UNSTABLE4 vs FILE_SYNC4.
> > (I think most POSIX-like clients would consider the fsync(2) syscall to require
> >  the same treatment as "close" above, but that is a POSIX-specific client issue.)
> > I can see the argument that, since there is no "must" in the statement, that a
> > client can choose not to do this, but that would also imply that the client will
> > need to live with the consequences of it.
> >
> > I think the second sentence of the first para. of the errata is bogus:
> > For file layouts, WRITEs to a Data Server that return a stable_how4 value of
> > FILE_SYNC4 guarantee that data and file system metadata are on stable
> > storage.  This means that a LAYOUTCOMMIT is not needed in order to make the
> > data and metadata visible to the metadata server and other clients.
> >
> > Why?
> > The FILE_SYNC4 was returned by the DS. This would imply the DS
> > has committed data and metadata to stable storage on the DS.
> > However, I am not aware of anything in RFC-5661 that would imply that
> > the Size, Time_Modify and Change attributes or anything else must have
> > been updated or in stable storage on the MDS at this time.
> >
> > If a server does not require LayoutCommit operations for correct behaviour
> > then it can simply reply NFS4ERR_NOTSUPP (as I believe the Netapp filer
> > does) and the client then no longer needs to do them.
> >
> > If there is somewhere in RFC-5661 that it is stated that LayoutCommits are
> > not required when the DS replies FILE_SYNC4, then I missed it and there
> > is a problem with RFC-5661 that needs to be addressed.
> >
> > Otherwise, sorry, but it seems that the bug is in the Linux client implementation
> > and not RFC-5661.
> >
> > Is there a File Layout pNFS server implementation where the DSs return
> > FILE_SYNC4 that will break if the client does a LayoutCommit for this case?
> > (If so, then something may need to be done.)
> >
> > rick