Re: [nfsv4] Write-behind caching

Jason Glasgow <jglasgow@aya.yale.edu> Wed, 27 October 2010 14:26 UTC

Date: Wed, 27 Oct 2010 10:28:09 -0400
From: Jason Glasgow <jglasgow@aya.yale.edu>
To: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: nfsv4@ietf.org
Subject: Re: [nfsv4] Write-behind caching

I agree with Trond.  This does not justify a requirement. -Jason

On Wed, Oct 27, 2010 at 9:40 AM, Trond Myklebust <trond.myklebust@fys.uio.no> wrote:

>
>
> On Wed, 2010-10-27 at 09:05 -0400, Jason Glasgow wrote:
> > With regards to block layout there are some specific implementation
> > issues that historically made it important to flush dirty data before
> > returning a layout.  Let me describe one that arose when implementing
> > MPFS at EMC.  I have not followed the UNIX implementations of the pNFS
> > block layout recently enough to know if this is still a concern.
> >
> >
> > Assume a page cache with a 4K page size, a file system with an 8K
> > block size, and a file that is 8K long.  In many UNIX-like NFSv3/MPFS
> > implementations, a 2K write to the range 2K-4K would obtain a layout
> > for reading and writing and read the first 8K of the file into two
> > pages in the kernel page cache.  It would then copy the 2K of data
> > from userspace, overwriting the latter half of the first page.  When
> > it was time to write out the page, MPFS would write out the complete
> > 8K block.  If a layout were recalled between the time the data was
> > read from disk and the time it was written, another client could
> > modify the range of the file from 0K to 2K.  Unless specific care was
> > taken when later flushing the entire 8K block, data corruption would
> > occur.
> >
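As a toy illustration only (this is not client code; the 4K/8K sizes follow the example above, and everything else is invented), the lost-update sequence can be sketched in a few lines of Python:

```python
# Sketch of the read-modify-write hazard described above: one client
# caches a whole 8K block, a second client updates part of it after a
# layout recall, and the first client's full-block flush silently
# clobbers that update.

PAGE = 4 * 1024          # page cache page size (4K)
BLOCK = 2 * PAGE         # file system block size (8K)

# "Disk": one 8K block, initially all b'A'
disk = bytearray(b'A' * BLOCK)

# Client 1: obtains a layout, reads the full 8K block into its cache,
# then overwrites bytes 2K-4K (the latter half of the first page)
client1_cache = bytearray(disk)
client1_cache[2048:4096] = b'B' * 2048

# The layout is recalled; Client 2 now updates bytes 0K-2K on disk
disk[0:2048] = b'C' * 2048

# Client 1 later flushes its entire cached 8K block, unaware of
# Client 2's intervening write
disk[:] = client1_cache

# Client 2's update has been lost: bytes 0K-2K are back to stale b'A's
assert disk[0:2048] == b'A' * 2048
```

The corruption is exactly the stale prefix: the flushed block carries old data for the range the other client modified.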
> > There are two ways to avoid the problem.
> >
> > 1. Keep track of the byte ranges of a page (or file system block)
> > that are dirty, and only perform the read-modify-write cycle while
> > the client holds the layout.  This can get messy if a client writes
> > every other byte on a page.
> >
> > 2. Do not return a layout containing a dirty page until that page has
> > been written.
>
> 3. In order to ensure you can safely do read-modify-write in the block
> pNFS case, never request layout sizes that are not a multiple of the
> block size.
>
> > Perhaps this sheds some light on the original motivation.
>
> Yes, but the problem you are describing does not justify a _requirement_
> that the client flush out its data. Only that it consider the
> implications of corner cases such as the above.
>
> Trond
>
>
> > Regarding the block layout, I am entirely sympathetic to arguments
> > that a layout recall should only wait for outstanding writes to
> > complete, and should not cause the client to initiate new writes.
> >
> >
> > -Jason
> >
> >
> >
> >
> > On Wed, Oct 27, 2010 at 1:07 AM, Dean Hildebrand
> > <seattleplus@gmail.com> wrote:
> >         I remember at one time there was a thought that all dirty
> >         data would have to be written to disk when a client received
> >         a layoutrecall.  Once the data was written, the client would
> >         send a layoutreturn.  I think this was the thinking before
> >         all the timing issues and other such things cropped up.  I
> >         assume someone wrote that as general advice, somehow thinking
> >         that responding to a layoutrecall was more important than
> >         actually achieving good write performance.
> >
> >         In this light, the analogy with delegreturn makes sense if
> >         you take a very specific example, but obviously not in
> >         general.
> >
> >         I would vote to just cut this text, as I think it is simply
> >         outdated.
> >
> >         Dean
> >
> >
> >
> >         On 10/26/2010 3:34 AM, david.noveck@emc.com wrote:
> >                 That makes sense.  Let me take on this issue with
> >                 regard to the file layout.  Are there volunteers to
> >                 address it with regard to block and object?  It would
> >                 be great if we could get together in Beijing, discuss
> >                 this, and come to a joint conclusion to present to
> >                 the working group (via email, I mean).  I'm not
> >                 planning to try to do this before the working group
> >                 meeting.  In any case, I'm pretty sure there won't be
> >                 any time during the working group meeting.
> >
> >                 -----Original Message-----
> >                 From: Spencer Shepler [mailto:sshepler@microsoft.com]
> >                 Sent: Monday, October 25, 2010 11:34 PM
> >                 To: Noveck, David; nfsv4@ietf.org
> >                 Subject: RE: [nfsv4] Write-behind caching
> >
> >
> >                 Fair enough.  I haven't looked to see if the layout
> >                 types address this specific, needed behavior.
> >                 Obviously the statement you reference and the
> >                 individual layout descriptions should be tied
> >                 together.  Again, I don't remember, but there may be
> >                 layout-specific steps needed in the case of handling
> >                 layoutreturns.
> >
> >                 In any case, we can handle the eventual conclusion as
> >                 an erratum.
> >
> >                 Spencer
> >
> >
> >                         -----Original Message-----
> >                         From: david.noveck@emc.com
> >                         [mailto:david.noveck@emc.com]
> >                         Sent: Monday, October 25, 2010 8:25 PM
> >                         To: Spencer Shepler; nfsv4@ietf.org
> >                         Subject: RE: [nfsv4] Write-behind caching
> >
> >                         I agree that the intent was to cover a
> >                         variety of layout types.
> >
> >                         I think what you are saying about the issue
> >                         of different throughputs for having and not
> >                         having layouts also makes sense.  It may in
> >                         some way have led to the statement in
> >                         RFC5661, but those statements are by no means
> >                         the same.  They have different consequences.
> >                         I take it that you are saying (correctly)
> >                         something like:
> >
> >                              However, write-behind implementations
> >                              will generally need to bound the amount
> >                              of unwritten data so that, given the
> >                              bandwidth of the output path, the data
> >                              can be written in a reasonable time.
> >                              Clients which have layouts should avoid
> >                              keeping larger amounts to reflect a
> >                              situation in which a layout provides a
> >                              write path of higher bandwidth.  This is
> >                              because a CB_LAYOUTRECALL may be
> >                              received.  The client should not delay
> >                              returning the layout so as to use that
> >                              higher-bandwidth path, so it is best if
> >                              it assumes, in limiting the amount of
> >                              data to be written, that the write
> >                              bandwidth is only what is available
> >                              without the layout, and that it uses
> >                              this bandwidth assumption even if it
> >                              does happen to have a layout.
> >
> >                         This differs from the text in RFC5661 in a
> >                         few respects.
> >
> >                              First, it says that the amount of dirty
> >                              data should be the same when you have
> >                              the layout and when you don't, rather
> >                              than simply saying it should be small
> >                              when you have the layout, possibly
> >                              implying that it should be smaller than
> >                              when you don't have a layout.
> >
> >                              Second, the text now in RFC5661 strongly
> >                              implies that when you get a
> >                              CB_LAYOUTRECALL, you would normally
> >                              start new I/Os, rather than simply drain
> >                              the pending I/Os and return the layout
> >                              ASAP.
> >
> >                         So I don't agree that what is in RFC5661 is
> >                         good implementation advice, particularly in
> >                         suggesting that clients should delay the
> >                         LAYOUTRETURN while doing a bunch of I/O,
> >                         including starting new I/Os.
> >
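The bounding rule in the proposed text above amounts to simple arithmetic: cap dirty, unwritten data at what the write path that is always available (the MDS path) can flush in an acceptable time, ignoring any higher-bandwidth layout path that might be recalled. A minimal sketch, with the function name and all numbers invented for illustration:

```python
# Toy sketch of the proposed bounding rule: size the write-behind cache
# by the bandwidth available WITHOUT a layout, so a CB_LAYOUTRECALL
# never forces a long flush over the slow path.

def max_dirty_bytes(mds_bandwidth_bytes_per_s: int,
                    acceptable_flush_seconds: int) -> int:
    """Bound unwritten data by the bandwidth guaranteed without a layout."""
    return mds_bandwidth_bytes_per_s * acceptable_flush_seconds

# e.g. 50 MB/s through the MDS and a 2-second acceptable flush time
# give a 100 MB cap, even if a layout currently offers far more.
cap = max_dirty_bytes(50 * 1024 * 1024, 2)
assert cap == 100 * 1024 * 1024
```

The point of the rule is that the cap stays the same whether or not the client happens to hold a layout at the moment.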
> >
> >                         -----Original Message-----
> >                         From: nfsv4-bounces@ietf.org
> >                         [mailto:nfsv4-bounces@ietf.org] On Behalf Of
> >                         Spencer Shepler
> >                         Sent: Monday, October 25, 2010 10:07 PM
> >                         To: Noveck, David; nfsv4@ietf.org
> >                         Subject: Re: [nfsv4] Write-behind caching
> >
> >
> >                         Since this description is part of the
> >                         general pNFS description, the intent may have
> >                         been to cover a variety of layout types.
> >                         However, I agree that the client is not
> >                         guaranteed access to the layout and is fully
> >                         capable of writing the data via the MDS if
> >                         all else fails (inability to obtain the
> >                         layout after a return); it may not be the
> >                         most performant path, but it should be
> >                         functional.  And maybe that is the source of
> >                         the statement that the client should take
> >                         care in managing its dirty pages, given the
> >                         lack of a guarantee of access to the supposed
> >                         higher-throughput path for writing data.
> >
> >                         As implementation guidance it seems okay,
> >                         but it is not truly a requirement for correct
> >                         function.
> >
> >                         Spencer
> >
> >                                 -----Original Message-----
> >                                 From: nfsv4-bounces@ietf.org
> >                                 [mailto:nfsv4-bounces@ietf.org] On
> >                                 Behalf Of david.noveck@emc.com
> >                                 Sent: Monday, October 25, 2010 6:58 PM
> >                                 To: nfsv4@ietf.org
> >                                 Subject: [nfsv4] Write-behind caching
> >
> >                                 The following statement appears at
> >                                 the bottom of page 292 of RFC5661.
> >
> >                                    However, write-behind caching may
> >                                    negatively affect the latency in
> >                                    returning a layout in response to
> >                                    a CB_LAYOUTRECALL; this is similar
> >                                    to file delegations and the impact
> >                                    that file data caching has on
> >                                    DELEGRETURN.  Client
> >                                    implementations SHOULD limit the
> >                                    amount of unwritten data they have
> >                                    outstanding at any one time in
> >                                    order to prevent excessively long
> >                                    responses to CB_LAYOUTRECALL.
> >
> >                                 This does not seem to make sense to
> >                                 me.
> >
> >                                 First of all, the analogy between
> >                                 DELEGRETURN and
> >                                 CB_LAYOUTRECALL/LAYOUTRETURN doesn't
> >                                 seem to me to be correct.  In the
> >                                 case of DELEGRETURN, at least if the
> >                                 file in question has been closed
> >                                 during the pendency of the
> >                                 delegation, you do need to write all
> >                                 of the dirty data associated with
> >                                 those previously open files.
> >                                 Normally, clients just write all
> >                                 dirty data.
> >
> >                                 LAYOUTRETURN does not have that sort
> >                                 of requirement.  If it is valid to
> >                                 hold the dirty data when you do have
> >                                 the layout, it is just as valid to
> >                                 hold it when you don't.  You could
> >                                 very well return the layout and get
> >                                 it again before some of those dirty
> >                                 blocks are written.  Having a layout
> >                                 grants you the right to do I/O using
> >                                 a particular means (different based
> >                                 on the mapping type), but if you
> >                                 don't have the layout, you still have
> >                                 a way to do the writeback, and there
> >                                 is no particular need to write back
> >                                 all the data before returning the
> >                                 layout.  As mentioned above, you may
> >                                 well get the layout again before
> >                                 there is any need to actually do the
> >                                 write-back.  You have to wait until
> >                                 I/Os that are in flight are completed
> >                                 before you return the layout.
> >                                 However, I don't see why you would
> >                                 have to or want to start new I/Os
> >                                 using the layout if you have received
> >                                 a CB_LAYOUTRECALL.
> >
> >                                 Am I missing something?  Is there
> >                                 some valid reason for this statement?
> >                                 Or should this be dealt with via the
> >                                 errata mechanism?
> >
> >                                 What do existing clients actually do
> >                                 with pending writeback data when they
> >                                 get a CB_LAYOUTRECALL?  Do they start
> >                                 new I/Os using the layout?  If so, is
> >                                 there any reason other than the
> >                                 paragraph above?
> >
> >                                 _______________________________________________
> >                                 nfsv4 mailing list
> >                                 nfsv4@ietf.org
> >                                 https://www.ietf.org/mailman/listinfo/nfsv4
> >
> >
> >
> >
> >