Re: [nfsv4] Write-behind caching

Benny Halevy <bhalevy@panasas.com> Wed, 27 October 2010 16:48 UTC

Date: Wed, 27 Oct 2010 18:48:21 +0200
From: Benny Halevy <bhalevy@panasas.com>
To: sfaibish <sfaibish@emc.com>
Cc: nfsv4@ietf.org, trond.myklebust@fys.uio.no
Subject: Re: [nfsv4] Write-behind caching

On 2010-10-27 18:35, sfaibish wrote:
> On Wed, 27 Oct 2010 10:52:00 -0400, <david.noveck@emc.com> wrote:
> 
>> Trond's item 3 seems good to me, but I'm not sure I understand some
>> of the assumptions that are being made.
>>
>> So suppose you have these partially updated blocks where you are
>> writing.  It seems to me that it is OK to write in various ways under
>> the following conditions.
>>
>> 1) You have the block layout and it is not being recalled:
>>
>>      a) It is OK to do an NFS WRITE of a partial block.
>>
>>      b) It is dangerous to do a block write because of the
>>      possibility of corruption.
>>
>>      c) It would be foolish/erroneous to do an NFS WRITE of
>>      the full block including the modified part and the part
>>      read earlier.
>>
>> 2) When the block layout is being recalled:
>>
>>      a) It is still OK to do an NFS WRITE of the partial
>>      block.
>>
>>      b) It is dangerous to do a block write because of the
>>      possibility of corruption.
>>
>>      c) It would be foolish/erroneous to do an NFS WRITE of
>>      the full block including the modified part and the part
>>      read earlier.
>>
>> 3) When the block layout is not present:
>>
>>      a) It is definitely OK to do an NFS WRITE of the partial
>>      block.
>>
>>      b) Block write is not OK.
>>
>>      c) It would be foolish/erroneous to do an NFS WRITE of
>>      the full block including the modified part and the part
>>      read earlier.
>>
>> So to me the message is that when you have partial writes in your buffer
>> cache, you should keep track of them (a small bit mask in these cases) and
>> do the partial write.  When you give up your layout, you retain the
>> right to do the partial NFS WRITE.
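
(A minimal sketch, in C, of the dirty-range bookkeeping described above;
the 8K block size, 512-byte tracking granularity, and all names here are
illustrative assumptions, not anyone's actual implementation:)

    /* Per-block dirty tracking: one bit per 512-byte chunk of an 8K
     * file-system block.  All names are hypothetical. */
    #include <stddef.h>
    #include <stdint.h>
    #include <stdbool.h>

    #define BLOCK_SIZE 8192
    #define CHUNK_SIZE 512
    #define CHUNKS     (BLOCK_SIZE / CHUNK_SIZE)   /* 16 bits of mask */

    struct cached_block {
        uint64_t fileoff;               /* block-aligned file offset */
        uint16_t dirty_mask;            /* one bit per dirty chunk   */
        uint8_t  data[BLOCK_SIZE];
    };

    /* Record that bytes [off, off+len) of this block were modified. */
    static void mark_dirty(struct cached_block *b, size_t off, size_t len)
    {
        size_t first = off / CHUNK_SIZE;
        size_t last  = (off + len - 1) / CHUNK_SIZE;
        for (size_t i = first; i <= last; i++)
            b->dirty_mask |= (uint16_t)(1u << i);
    }

    /* Writing the full block is only clearly safe when every byte of
     * the block is locally dirty; otherwise issue partial NFS WRITEs
     * of just the dirty chunks, which is safe in cases 1a, 2a, and 3a. */
    static bool must_use_partial_nfs_write(const struct cached_block *b)
    {
        return b->dirty_mask != (uint16_t)((1u << CHUNKS) - 1);
    }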
>>
>> So it sounds to me there are two cases.
>>
>> If having the layout gives you exclusive access (as if you had a
>> delegation), then (2b) causes a possible delay problem, and the
>> appropriate warning (to avoid gathering up too many of those and
>> delaying recalls) applies only to partial writes.
>>
>> On the other hand, if it doesn't, then (1b) and (2b) are corruption
>> sources anyway; rather than doing (2b), the important thing is that the
>> write, when actually done, be a (3a) rather than a (3b).
>>
>> I think the fundamental point is the one others have been making, which
>> is that there is a difference between:
>>
>>      "If there are dirty blocks where specific
>>      circumstances make it advisable that they be
>>      written before returning a layout, care should
>>      be taken to avoid a significant accumulation of such
>>      blocks, which might unduly delay a pending recall
>>      of a layout.  If possible, the client should ensure
>>      that the writes can safely be done after the recall
>>      is completed, so as to avoid this sort of delay."
>>
>> And
>>
>>      "If there are dirty blocks, they have to be written
>>      before returning a layout.  Care should be taken to
>>      avoid an undue accumulation of dirty blocks since
>>      writing these before return the layout might unduly
>>      delay a pending recall of layout."
> I can see an additional problem with this statement; depending on the
> amount of dirty pages that are written before returning the layout, the
> server might get nervous (timeout pending) and either resend the
> layoutrecall or simply start fencing the I/Os to the DS, regardless of
> the amount of dirty pages accumulated at the client that the client
> expects to write before returning the layout.  So we might want to
> address this problem in the recommendation.

Note that this risk exists regardless of whether any data is flushed.
Even when only waiting on outstanding I/Os, the amount of I/O in flight
can be large, the interconnect can be slow, timeouts can occur, etc.

Benny

> 
> /Sorin
> 
>>
>> The first is more complicated and detailed, but also accurate.
>>
>>
>> -----Original Message-----
>> From: nfsv4-bounces@ietf.org [mailto:nfsv4-bounces@ietf.org] On Behalf
>> Of Trond Myklebust
>> Sent: Wednesday, October 27, 2010 9:40 AM
>> To: Jason Glasgow
>> Cc: nfsv4@ietf.org
>> Subject: Re: [nfsv4] Write-behind caching
>>
>>
>>
>> On Wed, 2010-10-27 at 09:05 -0400, Jason Glasgow wrote:
>>> With regards to block layout there are some specific implementation
>>> issues that historically made it important to flush dirty data before
>>> returning a layout.  Let me describe one that arose when implementing
>>> MPFS at EMC.  I have not followed the UNIX implementations of the pNFS
>>> block layout recently enough to know if this is still a concern.
>>>
>>>
>>> Assume a page cache with page size 4K, a file system with block size
>>> 8K, and a file that is 8K long.  In many UNIX-like operating systems'
>>> NFS3/MPFS implementations, a 2K write to the range 2K-4K would obtain
>>> a layout for reading and writing, and read the first 8K of the file
>>> into 2 pages in the kernel page cache.  It would then copy the 2K of
>>> data from userspace and overwrite the latter half of the first page.
>>> When it was time to write out the page, MPFS would write out the
>>> complete 8K block.  If a layout were recalled between the time the
>>> data was read from disk and when it was written, it is possible that
>>> another client would modify the range of the file from 0K to 2K.
>>> Unless specific care was taken when flushing the entire 8K block
>>> later, data corruption would occur.
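
(A concrete timeline of the hazard described above, under the same
assumptions -- 8K block, 4K pages, two clients A and B:)

    t0  A: LAYOUTGET (rw); reads block 0 (bytes 0-8K) into pages P0, P1
    t1  A: copies 2K from userspace over bytes 2K-4K of P0 (P0 now dirty)
    t2  server recalls the layout; A returns it
    t3  B: writes bytes 0-2K of the file
    t4  A: flushes the complete 8K block from P0/P1 as a block write
        -> B's update to bytes 0-2K is overwritten with A's stale copy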
>>>
>>> There are two ways to avoid the problem.
>>>
>>>
>>> 1. Keep track of the byte ranges of a page (or file system block)
>>> that are dirty and only perform the read-modify-write cycle while the
>>> client holds the layout.  This can get messy if a client writes every
>>> other byte on a page.
>>> 2. Do not return a layout containing a dirty page until that page has
>>> been written.
>>
>> 3. In order to ensure you can safely do read-modify-write in the block
>> pNFS case, never request layout sizes that are not a multiple of the
>> block size.
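
(A sketch of Trond's item 3 in C: round every LAYOUTGET range out to
file-system block boundaries so that a later read-modify-write never
touches data outside the layout held.  The block size is assumed to be
known to the client; the names are hypothetical:)

    #include <stdint.h>

    struct layout_range {
        uint64_t offset;
        uint64_t length;
    };

    /* Expand the intended I/O range [io_off, io_off+io_len) to a
     * block-aligned range before requesting the layout. */
    static struct layout_range
    align_layout_request(uint64_t io_off, uint64_t io_len, uint64_t blksz)
    {
        struct layout_range r;
        uint64_t end = io_off + io_len;

        r.offset = io_off - (io_off % blksz);             /* round down */
        end      = ((end + blksz - 1) / blksz) * blksz;   /* round up  */
        r.length = end - r.offset;
        return r;
    }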
>>
>>> Perhaps this sheds some light on the original motivation.
>>
>> Yes, but the problem you are describing does not justify a _requirement_
>> that the client flush out its data. Only that it consider the
>> implications of corner cases such as the above.
>>
>> Trond
>>
>>
>>> Regarding the block layout, I am entirely sympathetic to arguments
>>> that a layout recall should only wait for outstanding writes to
>>> complete, and should not cause the client to initiate new writes.
>>>
>>>
>>> -Jason
>>>
>>>
>>>
>>>
>>> On Wed, Oct 27, 2010 at 1:07 AM, Dean Hildebrand
>>> <seattleplus@gmail.com> wrote:
>>>         I remember at one time there was a thought that all dirty data
>>>         would have to be written to disk when the client receives a
>>>         layoutrecall.  Once the data was written, it would send a
>>>         layoutreturn.  I think this was the thinking before all the
>>>         timing issues and other such things cropped up.  I assume
>>>         someone wrote that as general advice, somehow thinking that
>>>         responding to a layoutrecall was more important than actually
>>>         achieving good write performance.
>>>
>>>         In this light, the analogy with delegreturn makes sense if you
>>>         take a very specific example, but obviously not in general.
>>>
>>>         I would vote to just cut this text, as I think it is simply
>>>         outdated.
>>>         Dean
>>>
>>>
>>>
>>>                 That makes sense.  Let me take on this issue with
>>>                 regard to the file layout.  Are there volunteers to
>>>                 address it with regard to block and object?  It would
>>>                 be great if we could get together in Beijing, discuss
>>>                 this, and come to a joint conclusion to present to the
>>>                 working group (via email I mean).  I'm not planning to
>>>                 try to do this before the working group meeting.  In
>>>                 any case, I'm pretty sure there won't be any time
>>>                 during the working group meeting.
>>>
>>>                 -----Original Message-----
>>>                 From: Spencer Shepler [mailto:sshepler@microsoft.com]
>>>                 Sent: Monday, October 25, 2010 11:34 PM
>>>                 To: Noveck, David; nfsv4@ietf.org
>>>                 Subject: RE: [nfsv4] Write-behind caching
>>>
>>>
>>>                 Fair enough.  I haven't looked to see if the layout
>>>                 types address this specific, needed, behavior.
>>>                 Obviously the statement you reference and the
>>>                 individual layout descriptions should be tied
>>>                 together.  Again, I don't remember, but there may be
>>>                 layout-specific steps needed in the case of handling
>>>                 layoutreturns.
>>>
>>>                 In any case, we can handle the eventual conclusion as
>>>                 an errata.
>>>
>>>                 Spencer
>>>
>>>
>>>                         -----Original Message-----
>>>                         From: david.noveck@emc.com
>>>                         [mailto:david.noveck@emc.com]
>>>                         Sent: Monday, October 25, 2010 8:25 PM
>>>                         To: Spencer Shepler; nfsv4@ietf.org
>>>                         Subject: RE: [nfsv4] Write-behind caching
>>>
>>>                         I agree that the intent was to cover a variety
>>>                         of layout types.
>>>
>>>                         I think what you are saying about the issue of
>>>                         different throughputs for having and not having
>>>                         layouts also makes sense.  It may in some way
>>>                         have led to the statement in RFC5661 but those
>>>                         statements are by no means the same.  They have
>>>                         different consequences.  I take it that you are
>>>                         saying (correctly) something like:
>>>
>>>                              However, write-behind implementations
>>>                              will generally need to bound the amount
>>>                              of unwritten data so that, given the
>>>                              bandwidth of the output path, the data
>>>                              can be written in a reasonable time.
>>>                              Clients which have layouts should avoid
>>>                              keeping larger amounts to reflect a
>>>                              situation in which a layout provides a
>>>                              write path of higher bandwidth.  This is
>>>                              because a CB_LAYOUTRECALL may be received.
>>>                              The client should not delay returning the
>>>                              layout so as to use that higher-bandwidth
>>>                              path, so it is best if it assumes, in
>>>                              limiting the amount of data to be written,
>>>                              that the write bandwidth is only what is
>>>                              available without the layout, and that it
>>>                              uses this bandwidth assumption even if it
>>>                              does happen to have a layout.
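
(A minimal sketch of the bound described in the text above: cap dirty,
unwritten data at what the non-layout (MDS) path can flush within an
acceptable recall-response time.  The figures and names are illustrative
assumptions:)

    #include <stdint.h>

    /* Use the MDS-path bandwidth even while a layout is held, since a
     * CB_LAYOUTRECALL may arrive and the higher-bandwidth layout path
     * cannot be counted on. */
    static uint64_t max_dirty_bytes(uint64_t mds_bw_bytes_per_sec,
                                    uint64_t flush_budget_secs)
    {
        return mds_bw_bytes_per_sec * flush_budget_secs;
    }

    /* Example: 100 MB/s via the MDS and a 2-second budget allow about
     * 200 MB of write-behind data, with or without a layout. */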
>>>
>>>                         This differs from the text in RFC5661 in a few
>>>                         respects.
>>>
>>>                                First, it says that the amount of dirty
>>>                         data should be the same when you have the
>>>                         layout and when you don't, rather than simply
>>>                         saying it should be small when you have the
>>>                         layout, possibly implying that it should be
>>>                         smaller than when you don't have a layout.
>>>
>>>                                Second, the text now in RFC5661 strongly
>>>                         implies that when you get CB_LAYOUTRECALL, you
>>>                         would normally start new IO's, rather than
>>>                         simply drain the pending IO's and return the
>>>                         layout ASAP.
>>>
>>>                         So I don't agree that what is in RFC5661 is
>>>                         good implementation advice, particularly in
>>>                         suggesting that clients should delay the
>>>                         LAYOUTRETURN while doing a bunch of IO,
>>>                         including starting new IO's.
>>>
>>>
>>>                         -----Original Message-----
>>>                         From: nfsv4-bounces@ietf.org
>>>                         [mailto:nfsv4-bounces@ietf.org] On Behalf Of
>>>                         Spencer Shepler
>>>                         Sent: Monday, October 25, 2010 10:07 PM
>>>                         To: Noveck, David; nfsv4@ietf.org
>>>                         Subject: Re: [nfsv4] Write-behind caching
>>>
>>>
>>>                         Since this description is part of the general
>>>                         pNFS description, the intent may have been to
>>>                         cover a variety of layout types.  However, I
>>>                         agree that the client is not guaranteed access
>>>                         to the layout and is fully capable of writing
>>>                         the data via the MDS if all else fails
>>>                         (inability to obtain the layout after a
>>>                         return); it may not be the most performant path
>>>                         but it should be functional.  And maybe that is
>>>                         the source of the statement that the client
>>>                         should take care in managing its dirty pages
>>>                         given the lack of guarantee of access to the
>>>                         supposed, higher-throughput path for writing
>>>                         data.
>>>
>>>                         As implementation guidance it seems okay, but
>>>                         it is not truly a requirement for correct
>>>                         function.
>>>
>>>                         Spencer
>>>
>>>                                 -----Original Message-----
>>>                                 From: nfsv4-bounces@ietf.org
>>>                                 [mailto:nfsv4-bounces@ietf.org] On
>>>                                 Behalf Of david.noveck@emc.com
>>>                                 Sent: Monday, October 25, 2010 6:58 PM
>>>                                 To: nfsv4@ietf.org
>>>                                 Subject: [nfsv4] Write-behind caching
>>>
>>>                                 The following statement appears at the
>>>                                 bottom of page 292 of RFC5661.
>>>
>>>                                    However, write-behind caching may
>>>                                    negatively affect the latency in
>>>                                    returning a layout in response to a
>>>                                    CB_LAYOUTRECALL; this is similar to
>>>                                    file delegations and the impact that
>>>                                    file data caching has on DELEGRETURN.
>>>                                    Client implementations SHOULD limit
>>>                                    the amount of unwritten data they
>>>                                    have outstanding at any one time in
>>>                                    order to prevent excessively long
>>>                                    responses to CB_LAYOUTRECALL.
>>>
>>>                                 This does not seem to make sense to me.
>>>
>>>                                 First of all, the analogy between
>>>                                 DELEGRETURN and
>>>                                 CB_LAYOUTRECALL/LAYOUTRETURN doesn't
>>>                                 seem to me to be correct.  In the case
>>>                                 of DELEGRETURN, at least if the file in
>>>                                 question has been closed during the
>>>                                 pendency of the delegation, you do need
>>>                                 to write all of the dirty data
>>>                                 associated with those previously open
>>>                                 files.  Normally, clients just write
>>>                                 all dirty data.
>>>
>>>                                 LAYOUTRETURN does not have that sort of
>>>                                 requirement.  If it is valid to hold
>>>                                 the dirty data when you do have the
>>>                                 layout, it is just as valid to hold it
>>>                                 when you don't.  You could very well
>>>                                 return the layout and get it again
>>>                                 before some of those dirty blocks are
>>>                                 written.  Having a layout grants you
>>>                                 the right to do IO using a particular
>>>                                 means (different based on the mapping
>>>                                 type), but if you don't have the
>>>                                 layout, you still have a way to do the
>>>                                 writeback, and there is no particular
>>>                                 need to write back all the data before
>>>                                 returning the layout.  As mentioned
>>>                                 above, you may well get the layout
>>>                                 again before there is any need to
>>>                                 actually do the write-back.  You have
>>>                                 to wait until IO's that are in flight
>>>                                 are completed before you return the
>>>                                 layout.  However, I don't see why you
>>>                                 would have to or want to start new IO's
>>>                                 using the layout if you have received a
>>>                                 CB_LAYOUTRECALL.  Am I missing
>>>                                 something?  Is there some valid reason
>>>                                 for this statement?  Or should this be
>>>                                 dealt with via the errata mechanism?
>>>
>>>                                 What do existing clients actually do
>>>                                 with pending writeback data when they
>>>                                 get a CB_LAYOUTRECALL?  Do they start
>>>                                 new IO's using the layout?  If so, is
>>>                                 there any reason other than the
>>>                                 paragraph above?
>>>
_______________________________________________
nfsv4 mailing list
nfsv4@ietf.org
https://www.ietf.org/mailman/listinfo/nfsv4