Re: [nfsv4] Write-behind caching

Trond Myklebust <trond.myklebust@fys.uio.no> Wed, 27 October 2010 13:38 UTC

From: Trond Myklebust <trond.myklebust@fys.uio.no>
To: Jason Glasgow <jglasgow@aya.yale.edu>
Date: Wed, 27 Oct 2010 09:40:21 -0400
Cc: nfsv4@ietf.org
Subject: Re: [nfsv4] Write-behind caching

On Wed, 2010-10-27 at 09:05 -0400, Jason Glasgow wrote:
> With regard to the block layout, there are some specific implementation
> issues that historically made it important to flush dirty data before
> returning a layout.  Let me describe one that arose when implementing
> MPFS at EMC.  I have not followed the UNIX implementations of the pNFS
> block layout recently enough to know whether this is still a concern.
> 
> 
> Assume a page cache with page size 4K, a file system with block size
> 8K, and a file that is 8K long.  In the NFS3/MPFS implementations of
> many UNIX-like operating systems, a 2K write to the range 2K-4K would
> obtain a layout for reading and writing, then read the first 8K of the
> file into 2 pages in the kernel page cache.  It would then copy the 2K
> of data from userspace, overwriting the latter half of the first page.
> When it was time to write out the page, MPFS would write out the
> complete 8K block.  If a layout were recalled between the time the data
> was read from disk and the time it was written back, another client
> could modify the range of the file from 0K to 2K.  Unless specific care
> was taken when later flushing the entire 8K block, data corruption
> would occur.
>
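The hazard can be made concrete with a minimal Python sketch (illustrative
only; it uses the 4K page and 8K block sizes from the example above):

    PAGE = 4096                     # client page cache page size
    BLOCK = 8192                    # file system block size

    disk = bytearray(b"A" * BLOCK)  # the shared 8K block as stored on disk

    # Client 1 read-modify-write: to write 2K at offset 2K, it first reads
    # the whole 8K block into its page cache, then dirties bytes 2K-4K.
    cache1 = bytearray(disk)
    cache1[2048:PAGE] = b"1" * 2048

    # The layout is recalled and returned; before client 1 flushes, client
    # 2 writes the range 0K-2K directly to the device.
    disk[0:2048] = b"2" * 2048

    # Client 1 later flushes the entire 8K block from its now-stale cache,
    # silently undoing client 2's write: data corruption.
    disk[:] = cache1
    assert disk[0:2048] == b"A" * 2048   # client 2's bytes are gone
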
> There are two ways to avoid the problem.
> 
> 
> 1. Keep track of the byte ranges of a page (or file system block) that
> are dirty and only perform the read-modify-write cycle while the client
> holds the layout (see the sketch after this list).  This can get messy
> if a client writes every other byte of a page.
> 2. Do not return a layout containing a dirty page until that page has
> been written.
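
Option 1 can be sketched under the same scenario (the range bookkeeping is
hypothetical, not any particular client's data structures): flushing only
the byte ranges this client actually dirtied leaves the other client's
bytes intact.

    # State after client 2's write (0K-2K) and client 1's cached change.
    disk = bytearray(b"2" * 2048 + b"A" * 6144)
    cache1 = bytearray(b"A" * 2048 + b"1" * 2048 + b"A" * 4096)

    def flush_dirty_ranges(disk, cache, dirty_ranges):
        """Write back only the byte ranges this client actually dirtied."""
        for start, end in dirty_ranges:
            disk[start:end] = cache[start:end]

    flush_dirty_ranges(disk, cache1, [(2048, 4096)])
    assert disk[0:2048] == b"2" * 2048   # client 2's write survives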

3. In order to ensure you can safely do read-modify-write in the block
pNFS case, never request layout sizes that are not a multiple of the
block size.
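
One way to apply that rule, sketched with a hypothetical helper that
rounds both the offset and the length of a requested range out to block
boundaries (a real client would do something similar when sizing its
LAYOUTGET requests):

    def block_aligned_range(offset, length, block_size):
        """Round [offset, offset + length) out to whole file system
        blocks, so any read-modify-write happens under the layout."""
        start = (offset // block_size) * block_size
        end = ((offset + length + block_size - 1) // block_size) * block_size
        return start, end - start

    # The 2K write at offset 2K requests a layout for the full 8K block.
    assert block_aligned_range(2048, 2048, 8192) == (0, 8192)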

> Perhaps this sheds some light on the original motivation.

Yes, but the problem you are describing does not justify a _requirement_
that the client flush out its data, only that it consider the
implications of corner cases such as the one above.

Trond


> Regarding the block layout, I am entirely sympathetic to arguments that
> a layout recall should only wait for outstanding writes to complete,
> and should not cause the client to initiate new writes.
> 
> 
> -Jason
> 
> 
> 
> 
> On Wed, Oct 27, 2010 at 1:07 AM, Dean Hildebrand
> <seattleplus@gmail.com> wrote:
>         I remember at one time there was a thought that all dirty data
>         would have to be written to disk when the client received a
>         layoutrecall.  Once the data was written, the client would send
>         a layoutreturn.  I think this was the thinking before all the
>         timing issues and other such things cropped up.  I assume
>         someone wrote that as general advice, somehow thinking that
>         responding to a layoutrecall was more important than actually
>         achieving good write performance.
>         
>         In this light, the analogy with delegreturn makes sense if you
>         take a very specific example, but obviously not in general.
>         
>         I would vote to just cut this text, as I think it is simply
>         outdated.
>         
>         Dean
>         
>         
>         
>         On 10/26/2010 3:34 AM, david.noveck@emc.com wrote:
>                 That makes sense.  Let me take on this issue with
>                 regard to the file layout.  Are there volunteers to
>                 address it with regard to block and object?  It would
>                 be great if we could get together in Beijing, discuss
>                 this, and come to a joint conclusion to present to the
>                 working group (via email I mean).  I'm not planning to
>                 try to do this before the working group meeting.  In
>                 any case, I'm pretty sure there won't be any time
>                 during the working group meeting.
>                 
>                 -----Original Message-----
>                 From: Spencer Shepler [mailto:sshepler@microsoft.com]
>                 Sent: Monday, October 25, 2010 11:34 PM
>                 To: Noveck, David; nfsv4@ietf.org
>                 Subject: RE: [nfsv4] Write-behind caching
>                 
>                 
>                 Fair enough.  I haven't looked to see if the layout
>                 types address this specific, needed behavior.
>                 Obviously the statement you reference and the
>                 individual layout descriptions should be tied together.
>                 Again, I don't remember, but there may be
>                 layout-specific steps needed in the case of handling
>                 layoutreturns.
>                 
>                 In any case, we can handle the eventual conclusion as
>                 an erratum.
>                 
>                 Spencer
>                 
>                 
>                         -----Original Message-----
>                         From: david.noveck@emc.com
>                         [mailto:david.noveck@emc.com]
>                         Sent: Monday, October 25, 2010 8:25 PM
>                         To: Spencer Shepler; nfsv4@ietf.org
>                         Subject: RE: [nfsv4] Write-behind caching
>                         
>                         I agree that the intent was to cover a variety
>                         of layout types.
>                         
>                         I think what you are saying about the issue of
>                         different throughputs for having and not having
>                         layouts also makes sense.  It may in some way
>                         have led to the statement in RFC5661 but those
>                         statements are by no means the same.  They have
>                         different consequences.  I take it that you are
>                         saying (correctly) something like:
>                         
>                              However, write-behind implementations will
>                              generally need to bound the amount of
>                              unwritten data so that, given the bandwidth
>                              of the output path, the data can be written
>                              in a reasonable time.  Clients which have
>                              layouts should avoid keeping larger amounts
>                              to reflect a situation in which a layout
>                              provides a write path of higher bandwidth.
>                              This is because a CB_LAYOUTRECALL may be
>                              received.  The client should not delay
>                              returning the layout so as to use that
>                              higher-bandwidth path, so it is best if it
>                              assumes, in limiting the amount of data to
>                              be written, that the write bandwidth is
>                              only what is available without the layout,
>                              and that it uses this bandwidth assumption
>                              even if it does happen to have a layout.
>                         
>                         This differs from the text in RFC5661 in a few
>                         respects.
>                         
>                              First, it says that the amount of dirty
>                              data should be the same when you have the
>                              layout and when you don't, rather than
>                              simply saying it should be small when you
>                              have the layout, possibly implying that it
>                              should be smaller than when you don't have
>                              a layout.
>                         
>                              Second, the text now in RFC5661 strongly
>                              implies that when you get CB_LAYOUTRECALL,
>                              you would normally start new IO's, rather
>                              than simply drain the pending IO's and
>                              return the layout ASAP.
>                         
>                         So I don't agree that what is in RFC5661 is
>                         good implementation advice, particularly in
>                         suggesting that clients should delay the
>                         LAYOUTRETURN while doing a bunch of IO,
>                         including starting new IO's.
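
The bound proposed in the quoted text above could, for instance, be sized
from the bandwidth of the always-available MDS path; a short Python sketch
with made-up numbers:

    # Illustrative numbers only.
    mds_bandwidth = 100 * 2**20     # bytes/s the client can write via MDS
    target_drain_seconds = 2        # acceptable worst-case time to drain

    # Bound dirty data by the path that is guaranteed to exist, and use
    # the same bound whether or not a layout happens to be held.
    max_unwritten_bytes = mds_bandwidth * target_drain_seconds
    print(max_unwritten_bytes // 2**20, "MiB")   # 200 MiB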
>                         
>                         
>                         -----Original Message-----
>                         From: nfsv4-bounces@ietf.org
>                         [mailto:nfsv4-bounces@ietf.org] On Behalf Of
>                         Spencer Shepler
>                         Sent: Monday, October 25, 2010 10:07 PM
>                         To: Noveck, David; nfsv4@ietf.org
>                         Subject: Re: [nfsv4] Write-behind caching
>                         
>                         
>                         Since this description is part of the general
>                         pNFS description, the intent may have been to
>                         cover a variety of layout types.  However, I
>                         agree that the client is not guaranteed access
>                         to the layout and is fully capable of writing
>                         the data via the MDS if all else fails
>                         (inability to obtain the layout after a
>                         return); it may not be the most performant path
>                         but it should be functional.  And maybe that is
>                         the source of the statement that the client
>                         should take care in managing its dirty pages
>                         given the lack of guarantee of access to the
>                         supposed, higher-throughput path for writing
>                         data.
>                         
>                         As implementation guidance it seems okay, but
>                         not truly a requirement for correct function.
>                         
>                         Spencer
>                         
>                                 -----Original Message-----
>                                 From: nfsv4-bounces@ietf.org
>                                 [mailto:nfsv4-bounces@ietf.org] On
>                                 Behalf Of david.noveck@emc.com
>                                 Sent: Monday, October 25, 2010 6:58 PM
>                                 To: nfsv4@ietf.org
>                                 Subject: [nfsv4] Write-behind caching
>                                 
>                                 The following statement appears at the
>                                 bottom of page 292 of RFC5661.
>                                 
>                                    However, write-behind caching may
>                                    negatively affect the latency in
>                                    returning a layout in response to a
>                                    CB_LAYOUTRECALL; this is similar to
>                                    file delegations and the impact that
>                                    file data caching has on DELEGRETURN.
>                                    Client implementations SHOULD limit
>                                    the amount of unwritten data they
>                                    have outstanding at any one time in
>                                    order to prevent excessively long
>                                    responses to CB_LAYOUTRECALL.
>                                 
>                                 This does not seem to make sense to me.
>                                 
>                                 First of all, the analogy between
>                                 DELEGRETURN and
>                                 CB_LAYOUTRECALL/LAYOUTRETURN doesn't
>                                 seem to me to be correct.  In the case
>                                 of DELEGRETURN, at least if the file in
>                                 question has been closed during the
>                                 pendency of the delegation, you do need
>                                 to write all of the dirty data
>                                 associated with those previously open
>                                 files.  Normally, clients just write
>                                 all dirty data.
>                                 
>                                 LAYOUTRETURN does not have that sort of
>                                 requirement.  If it is valid to hold
>                                 the dirty data when you do have the
>                                 layout, it is just as valid to hold it
>                                 when you don't.  You could very well
>                                 return the layout and get it again
>                                 before some of those dirty blocks are
>                                 written.  Having a layout grants you
>                                 the right to do IO using a particular
>                                 means (different based on the mapping
>                                 type), but if you don't have the
>                                 layout, you still have a way to do the
>                                 writeback, and there is no particular
>                                 need to write back all the data before
>                                 returning the layout.  As mentioned
>                                 above, you may well get the layout
>                                 again before there is any need to
>                                 actually do the write-back.  You have
>                                 to wait until IO's that are in flight
>                                 are completed before you return the
>                                 layout.  However, I don't see why you
>                                 would have to or want to start new IO's
>                                 using the layout if you have received a
>                                 CB_LAYOUTRECALL.
>                                 
>                                 Am I missing something?  Is there some
>                                 valid reason for this statement?  Or
>                                 should this be dealt with via the
>                                 errata mechanism?
>                                 
>                                 What do existing clients actually do
>                                 with pending writeback data when they
>                                 get a CB_LAYOUTRECALL?  Do they start
>                                 new IO's using the layout?  If so, is
>                                 there any reason other than the
>                                 paragraph above?
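
David's DELEGRETURN/LAYOUTRETURN contrast, sketched as hypothetical client
logic (all helper names are invented for illustration):

    # Stubs standing in for real client machinery (names are invented).
    def flush_all_dirty_data(f): print("flush all dirty data for", f)
    def send_delegreturn(f):     print("DELEGRETURN", f)
    def drain_inflight_io(l):    print("wait for in-flight I/O under", l)
    def send_layoutreturn(l):    print("LAYOUTRETURN", l)

    def return_delegation(inode):
        # A file closed under a delegation must have all of its dirty
        # data written out before DELEGRETURN.
        flush_all_dirty_data(inode)
        send_delegreturn(inode)

    def handle_cb_layoutrecall(layout):
        # Per the argument above: drain only I/O already in flight, then
        # return the layout at once; remaining dirty pages can be written
        # later via the MDS or a re-acquired layout.
        drain_inflight_io(layout)
        send_layoutreturn(layout)

    handle_cb_layoutrecall("layout-1")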