Re: [nfsv4] Write-behind caching

Trond Myklebust <trond.myklebust@fys.uio.no> Wed, 27 October 2010 13:38 UTC

From: Trond Myklebust <trond.myklebust@fys.uio.no>
To: Jason Glasgow <jglasgow@aya.yale.edu>
Date: Wed, 27 Oct 2010 09:40:21 -0400
Cc: nfsv4@ietf.org
Subject: Re: [nfsv4] Write-behind caching

On Wed, 2010-10-27 at 09:05 -0400, Jason Glasgow wrote:
> With regard to the block layout, there are some specific implementation
> issues that historically made it important to flush dirty data before
> returning a layout.  Let me describe one that arose when implementing
> MPFS at EMC.  I have not followed the UNIX implementations of the pNFS
> block layout recently enough to know whether this is still a concern.
> 
> 
> Assume a page cache with page size 4K, a file system with block size
> 8K, and a file that is 8K long.  In the NFS3/MPFS implementations of
> many UNIX-like operating systems, a 2K write to the range 2K-4K would
> obtain a layout for reading and writing, then read the first 8K of the
> file into 2 pages in the kernel page cache.  It would then copy the 2K
> of data from userspace, overwriting the latter half of the first page.
> When it was time to write out the page, MPFS would write out the
> complete 8K block.  If a layout were recalled between the time the data
> was read from disk and the time it was written back, another client
> could modify the range of the file from 0K to 2K.  Unless specific care
> was taken when later flushing the entire 8K block, data corruption
> would occur.
>
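The hazard can be made concrete with a minimal Python sketch (illustrative
only; it uses the 4K page and 8K block sizes from the example above):

    PAGE = 4096                     # client page cache page size
    BLOCK = 8192                    # file system block size

    disk = bytearray(b"A" * BLOCK)  # the shared 8K block as stored on disk

    # Client 1 read-modify-write: to write 2K at offset 2K, it first reads
    # the whole 8K block into its page cache, then dirties bytes 2K-4K.
    cache1 = bytearray(disk)
    cache1[2048:PAGE] = b"1" * 2048

    # The layout is recalled and returned; before client 1 flushes, client
    # 2 writes the range 0K-2K directly to the device.
    disk[0:2048] = b"2" * 2048

    # Client 1 later flushes the entire 8K block from its now-stale cache,
    # silently undoing client 2's write: data corruption.
    disk[:] = cache1
    assert disk[0:2048] == b"A" * 2048   # client 2's bytes are gone
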
> There are two ways to avoid the problem.
> 
> 
> 1. Keep track of the byte ranges of a page (or file system block) that
> are dirty and only perform the read-modify-write cycle while the client
> holds the layout (see the sketch after this list).  This can get messy
> if a client writes every other byte of a page.
> 2. Do not return a layout containing a dirty page until that page has
> been written.
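
Option 1 can be sketched under the same scenario (the range bookkeeping is
hypothetical, not any particular client's data structures): flushing only
the byte ranges this client actually dirtied leaves the other client's
bytes intact.

    # State after client 2's write (0K-2K) and client 1's cached change.
    disk = bytearray(b"2" * 2048 + b"A" * 6144)
    cache1 = bytearray(b"A" * 2048 + b"1" * 2048 + b"A" * 4096)

    def flush_dirty_ranges(disk, cache, dirty_ranges):
        """Write back only the byte ranges this client actually dirtied."""
        for start, end in dirty_ranges:
            disk[start:end] = cache[start:end]

    flush_dirty_ranges(disk, cache1, [(2048, 4096)])
    assert disk[0:2048] == b"2" * 2048   # client 2's write survives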

3. In order to ensure you can safely do read-modify-write in the block
pNFS case, never request layout sizes that are not a multiple of the
block size.
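
One way to apply that rule, sketched with a hypothetical helper that
rounds both the offset and the length of a requested range out to block
boundaries (a real client would do something similar when sizing its
LAYOUTGET requests):

    def block_aligned_range(offset, length, block_size):
        """Round [offset, offset + length) out to whole file system
        blocks, so any read-modify-write happens under the layout."""
        start = (offset // block_size) * block_size
        end = ((offset + length + block_size - 1) // block_size) * block_size
        return start, end - start

    # The 2K write at offset 2K requests a layout for the full 8K block.
    assert block_aligned_range(2048, 2048, 8192) == (0, 8192)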

> Perhaps this sheds some light on the original motivation.

Yes, but the problem you are describing does not justify a _requirement_
that the client flush out its data, only that it consider the
implications of corner cases such as the one above.

Trond


> Regarding the block layout, I am entirely sympathetic to arguments that
> a layout recall should only wait for outstanding writes to complete,
> and should not cause the client to initiate new writes.
> 
> 
> -Jason
> 
> 
> 
> 
> On Wed, Oct 27, 2010 at 1:07 AM, Dean Hildebrand
> <seattleplus@gmail.com> wrote:
>         I remember at one time there was a thought that all dirty data
>         would have to be written to disk when the client received a
>         layoutrecall.  Once the data was written, the client would send
>         a layoutreturn.  I think this was the thinking before all the
>         timing issues and other such things cropped up.  I assume
>         someone wrote that as general advice, somehow thinking that
>         responding to a layoutrecall was more important than actually
>         achieving good write performance.
>         
>         In this light, the analogy with delegreturn makes sense if you
>         take a very specific example, but obviously not in general.
>         
>         I would vote to just cut this text, as I think it is simply
>         outdated.
>         
>         Dean
>         
>         
>         
>         On 10/26/2010 3:34 AM, david.noveck@emc.com wrote:
>                 That makes sense.  Let me take on this issue with
>                 regard to the file layout.  Are there volunteers to
>                 address it with regard to block and object?  It would
>                 be great if we could get together in Beijing, discuss
>                 this, and come to a joint conclusion to present to the
>                 working group (via email I mean).  I'm not planning to
>                 try to do this before the working group meeting.  In
>                 any case, I'm pretty sure there won't be any time
>                 during the working group meeting.
>                 
>                 -----Original Message-----
>                 From: Spencer Shepler [mailto:sshepler@microsoft.com]
>                 Sent: Monday, October 25, 2010 11:34 PM
>                 To: Noveck, David; nfsv4@ietf.org
>                 Subject: RE: [nfsv4] Write-behind caching
>                 
>                 
>                 Fair enough.  I haven't looked to see if the layout
>                 types address this specific, needed behavior.
>                 Obviously the statement you reference and the
>                 individual layout descriptions should be tied together.
>                 Again, I don't remember, but there may be
>                 layout-specific steps needed in the case of handling
>                 layoutreturns.
>                 
>                 In any case, we can handle the eventual conclusion as
>                 an erratum.
>                 
>                 Spencer
>                 
>                 
>                         -----Original Message-----
>                         From: david.noveck@emc.com
>                         [mailto:david.noveck@emc.com]
>                         Sent: Monday, October 25, 2010 8:25 PM
>                         To: Spencer Shepler; nfsv4@ietf.org
>                         Subject: RE: [nfsv4] Write-behind caching
>                         
>                         I agree that the intent was to cover a variety
>                         of layout types.
>                         
>                         I think what you are saying about the issue of
>                         different throughputs for having and not having
>                         layouts also makes sense.  It may in some way
>                         have led to the statement in RFC5661 but those
>                         statements are by no means the same.  They have
>                         different consequences.  I take it that you are
>                         saying (correctly) something like:
>                         
>                              However, write-behind implementations will
>                              generally need to bound the amount of
>                              unwritten data so that, given the bandwidth
>                              of the output path, the data can be written
>                              in a reasonable time.  Clients which have
>                              layouts should avoid keeping larger amounts
>                              to reflect a situation in which a layout
>                              provides a write path of higher bandwidth.
>                              This is because a CB_LAYOUTRECALL may be
>                              received.  The client should not delay
>                              returning the layout so as to use that
>                              higher-bandwidth path, so it is best if it
>                              assumes, in limiting the amount of data to
>                              be written, that the write bandwidth is
>                              only what is available without the layout,
>                              and that it uses this bandwidth assumption
>                              even if it does happen to have a layout.
>                         
>                         This differs from the text in RFC5661 in a few
>                         respects.
>                         
>                              First, it says that the amount of dirty
>                              data should be the same when you have the
>                              layout and when you don't, rather than
>                              simply saying it should be small when you
>                              have the layout, possibly implying that it
>                              should be smaller than when you don't have
>                              a layout.
>                         
>                              Second, the text now in RFC5661 strongly
>                              implies that when you get CB_LAYOUTRECALL,
>                              you would normally start new IO's, rather
>                              than simply drain the pending IO's and
>                              return the layout ASAP.
>                         
>                         So I don't agree that what is in RFC5661 is
>                         good implementation advice, particularly in
>                         suggesting that clients should delay the
>                         LAYOUTRETURN while doing a bunch of IO,
>                         including starting new IO's.
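
The bound proposed in the quoted text above could, for instance, be sized
from the bandwidth of the always-available MDS path; a short Python sketch
with made-up numbers:

    # Illustrative numbers only.
    mds_bandwidth = 100 * 2**20     # bytes/s the client can write via MDS
    target_drain_seconds = 2        # acceptable worst-case time to drain

    # Bound dirty data by the path that is guaranteed to exist, and use
    # the same bound whether or not a layout happens to be held.
    max_unwritten_bytes = mds_bandwidth * target_drain_seconds
    print(max_unwritten_bytes // 2**20, "MiB")   # 200 MiB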
>                         
>                         
>                         -----Original Message-----
>                         From: nfsv4-bounces@ietf.org
>                         [mailto:nfsv4-bounces@ietf.org] On Behalf Of
>                         Spencer Shepler
>                         Sent: Monday, October 25, 2010 10:07 PM
>                         To: Noveck, David; nfsv4@ietf.org
>                         Subject: Re: [nfsv4] Write-behind caching
>                         
>                         
>                         Since this description is part of the general
>                         pNFS description, the intent may have been to
>                         cover a variety of layout types.  However, I
>                         agree that the client is not guaranteed access
>                         to the layout and is fully capable of writing
>                         the data via the MDS if all else fails
>                         (inability to obtain the layout after a
>                         return); it may not be the most performant path
>                         but it should be functional.  And maybe that is
>                         the source of the statement that the client
>                         should take care in managing its dirty pages
>                         given the lack of guarantee of access to the
>                         supposed, higher-throughput path for writing
>                         data.
>                         
>                         As implementation guidance it seems okay, but
>                         not truly a requirement for correct function.
>                         
>                         Spencer
>                         
>                                 -----Original Message-----
>                                 From: nfsv4-bounces@ietf.org
>                                 [mailto:nfsv4-bounces@ietf.org] On
>                                 Behalf Of david.noveck@emc.com
>                                 Sent: Monday, October 25, 2010 6:58 PM
>                                 To: nfsv4@ietf.org
>                                 Subject: [nfsv4] Write-behind caching
>                                 
>                                 The following statement appears at the
>                                 bottom of page 292 of RFC5661.
>                                 
>                                    However, write-behind caching may
>                                    negatively affect the latency in
>                                    returning a layout in response to a
>                                    CB_LAYOUTRECALL; this is similar to
>                                    file delegations and the impact that
>                                    file data caching has on DELEGRETURN.
>                                    Client implementations SHOULD limit
>                                    the amount of unwritten data they
>                                    have outstanding at any one time in
>                                    order to prevent excessively long
>                                    responses to CB_LAYOUTRECALL.
>                                 
>                                 This does not seem to make sense to me.
>                                 
>                                 First of all, the analogy between
>                                 DELEGRETURN and
>                                 CB_LAYOUTRECALL/LAYOUTRETURN doesn't
>                                 seem to me to be correct.  In the case
>                                 of DELEGRETURN, at least if the file in
>                                 question has been closed during the
>                                 pendency of the delegation, you do need
>                                 to write all of the dirty data
>                                 associated with those previously open
>                                 files.  Normally, clients just write
>                                 all dirty data.
>                                 
>                                 LAYOUTRETURN does not have that sort of
>                                 requirement.  If it is valid to hold
>                                 the dirty data when you do have the
>                                 layout, it is just as valid to hold it
>                                 when you don't.  You could very well
>                                 return the layout and get it again
>                                 before some of those dirty blocks are
>                                 written.  Having a layout grants you
>                                 the right to do IO using a particular
>                                 means (different based on the mapping
>                                 type), but if you don't have the
>                                 layout, you still have a way to do the
>                                 writeback, and there is no particular
>                                 need to write back all the data before
>                                 returning the layout.  As mentioned
>                                 above, you may well get the layout
>                                 again before there is any need to
>                                 actually do the write-back.  You have
>                                 to wait until IO's that are in flight
>                                 are completed before you return the
>                                 layout.  However, I don't see why you
>                                 would have to or want to start new IO's
>                                 using the layout if you have received a
>                                 CB_LAYOUTRECALL.
>                                 
>                                 Am I missing something?  Is there some
>                                 valid reason for this statement?  Or
>                                 should this be dealt with via the
>                                 errata mechanism?
>                                 
>                                 What do existing clients actually do
>                                 with pending writeback data when they
>                                 get a CB_LAYOUTRECALL?  Do they start
>                                 new IO's using the layout?  If so, is
>                                 there any reason other than the
>                                 paragraph above?
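
David's DELEGRETURN/LAYOUTRETURN contrast, sketched as hypothetical client
logic (all helper names are invented for illustration):

    # Stubs standing in for real client machinery (names are invented).
    def flush_all_dirty_data(f): print("flush all dirty data for", f)
    def send_delegreturn(f):     print("DELEGRETURN", f)
    def drain_inflight_io(l):    print("wait for in-flight I/O under", l)
    def send_layoutreturn(l):    print("LAYOUTRETURN", l)

    def return_delegation(inode):
        # A file closed under a delegation must have all of its dirty
        # data written out before DELEGRETURN.
        flush_all_dirty_data(inode)
        send_delegreturn(inode)

    def handle_cb_layoutrecall(layout):
        # Per the argument above: drain only I/O already in flight, then
        # return the layout at once; remaining dirty pages can be written
        # later via the MDS or a re-acquired layout.
        drain_inflight_io(layout)
        send_layoutreturn(layout)

    handle_cb_layoutrecall("layout-1")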