Re: [nfsv4] Write-behind caching

Jason Glasgow <jglasgow@aya.yale.edu> Wed, 27 October 2010 13:03 UTC

Date: Wed, 27 Oct 2010 09:05:22 -0400
Message-ID: <AANLkTi=gD+qr-OhJuf19miV60w9t9TbJiopNS6y4-YVA@mail.gmail.com>
From: Jason Glasgow <jglasgow@aya.yale.edu>
To: Dean Hildebrand <seattleplus@gmail.com>
Cc: nfsv4@ietf.org
Subject: Re: [nfsv4] Write-behind caching

With regard to the block layout, there are specific implementation issues
that historically made it important to flush dirty data before returning a
layout.  Let me describe one that arose when implementing MPFS at EMC.  I
have not followed the UNIX implementations of the pNFS block layout closely
enough recently to know whether this is still a concern.

Assume a page cache with a 4K page size, a file system with an 8K block
size, and a file that is 8K long.  In the NFS3/MPFS implementations of many
UNIX-like operating systems, a 2K write to the range 2K-4K would obtain a
layout for reading and writing and read the first 8K of the file into two
pages in the kernel page cache.  It would then copy the 2K of data from
userspace, overwriting the latter half of the first page.  When it was time
to write out the page, MPFS would write out the complete 8K block.  If the
layout were recalled between the time the data was read from disk and the
time it was written, another client could modify the range of the file from
0K to 2K.  Unless specific care was taken when the entire 8K block was later
flushed, data corruption would occur.
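
To make the hazard concrete, here is a minimal sketch (plain C with invented
names such as "disk" and "client_a_cache", not MPFS code) of how flushing a
whole 8K block after a recall can silently undo another client's update to
bytes 0K-2K:

    #include <stdio.h>
    #include <string.h>

    #define BLKSZ 8192

    static char disk[BLKSZ];            /* the shared 8K file system block */

    int main(void)
    {
        char client_a_cache[BLKSZ];

        memset(disk, 'X', BLKSZ);                 /* original file contents */

        /* Client A: read-modify-write for a 2K write to the range 2K-4K. */
        memcpy(client_a_cache, disk, BLKSZ);      /* read the whole block  */
        memset(client_a_cache + 2048, 'A', 2048); /* apply A's 2K of data  */

        /* The layout is recalled; client B now writes the range 0K-2K. */
        memset(disk, 'B', 2048);

        /* Client A later flushes the complete 8K block from its cache,
         * overwriting B's bytes 0K-2K with stale data. */
        memcpy(disk, client_a_cache, BLKSZ);

        printf("byte 0 on disk: '%c' (client B wrote 'B' there, but it was lost)\n",
               disk[0]);
        return 0;
    }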

There are two ways to avoid the problem.

1. Keep track of the byte ranges of a page (or file system block) that are
dirty, and only perform the read-modify-write cycle while the client holds
the layout.  This can get messy if a client writes every other byte on a
page (see the sketch after this list).
2. Do not return a layout containing a dirty page until that page has been
written.
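
For option 1, here is a minimal sketch (hypothetical types and names such as
page_dirty_map and mark_dirty, not taken from any real client) of per-page
dirty-range tracking; writeback would then send only these extents rather
than the whole block, so nothing outside them can be clobbered:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>

    #define MAX_EXTENTS 16   /* a write to every other byte would overflow this */

    struct dirty_extent { size_t start, end; };   /* [start, end) within a page */

    struct page_dirty_map {
        struct dirty_extent ext[MAX_EXTENTS];
        int nr;
    };

    /* Record a write of [start, end); merge it into the first extent it
     * overlaps or abuts.  Returns false when the map is full, in which case
     * the caller falls back to treating the whole page as dirty -- the
     * "messy" case noted above. */
    bool mark_dirty(struct page_dirty_map *m, size_t start, size_t end)
    {
        for (int i = 0; i < m->nr; i++) {
            if (start <= m->ext[i].end && end >= m->ext[i].start) {
                if (start < m->ext[i].start) m->ext[i].start = start;
                if (end   > m->ext[i].end)   m->ext[i].end   = end;
                return true;
            }
        }
        if (m->nr == MAX_EXTENTS)
            return false;
        m->ext[m->nr].start = start;
        m->ext[m->nr].end   = end;
        m->nr++;
        return true;
    }

    int main(void)
    {
        struct page_dirty_map m = { .nr = 0 };
        mark_dirty(&m, 2048, 4096);               /* the 2K write above */
        printf("%d dirty extent(s), first is [%zu, %zu)\n",
               m.nr, m.ext[0].start, m.ext[0].end);
        return 0;
    }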

Perhaps this sheds some light on the original motivation.

Regarding the block layout, I am entirely sympathetic to arguments that a
layout recall should only wait for outstanding writes to complete, and
should not cause the client to initiate new writes.
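
For what it's worth, here is a toy sketch (invented names, with a trivial
stand-in for the RPC and I/O completion machinery) of the recall handling
argued for here: stop issuing new writes via the layout, drain only what is
already in flight, and return the layout while the dirty pages stay cached
for a later flush via the MDS or a newly acquired layout:

    #include <stdbool.h>
    #include <stdio.h>

    struct layout_state {
        bool recalled;         /* set when CB_LAYOUTRECALL arrives       */
        int  inflight_writes;  /* WRITEs already issued using the layout */
        int  dirty_pages;      /* cached dirty pages not yet written     */
    };

    void handle_cb_layoutrecall(struct layout_state *lo)
    {
        lo->recalled = true;               /* no new I/O through the layout */
        while (lo->inflight_writes > 0)    /* wait only for what is already */
            lo->inflight_writes--;         /* in flight (stand-in for real  */
                                           /* completion waiting)           */
        printf("LAYOUTRETURN sent; %d dirty pages remain cached\n",
               lo->dirty_pages);
    }

    int main(void)
    {
        struct layout_state lo = { .inflight_writes = 3, .dirty_pages = 42 };
        handle_cb_layoutrecall(&lo);
        return 0;
    }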

-Jason


On Wed, Oct 27, 2010 at 1:07 AM, Dean Hildebrand <seattleplus@gmail.com> wrote:

> I remember at one time there was a thought that all dirty data would have
> to be written to disk when a client receives a layoutrecall.  Once the data
> was written, the client would send a layoutreturn.  I think this was the
> thinking before
> all the timing issues and other such things cropped up.  I assume someone
> wrote that as general advice, somehow thinking that responding to a
> layoutrecall was more important than actually achieving good write
> performance.
>
> In this light, the analogy with delegreturn makes sense if you take a very
> specific example, but obviously not in general.
>
> I would vote to just cut this text, as I think it is simply outdated.
> Dean
>
>
> On 10/26/2010 3:34 AM, david.noveck@emc.com wrote:
>
>> That makes sense.  Let me take on this issue with regard to the file
>> layout.  Are there volunteers to address it with regard to block and
>> object?  It would be great if we could get together in Beijing, discuss
>> this, and come to a joint conclusion to present to the working group
>> (via email I mean).  I'm not planning on trying to do this before the
>> working group meeting.  In any case, I'm pretty sure there won't be any
>> time during the working group meeting.
>>
>> -----Original Message-----
>> From: Spencer Shepler [mailto:sshepler@microsoft.com]
>> Sent: Monday, October 25, 2010 11:34 PM
>> To: Noveck, David; nfsv4@ietf.org
>> Subject: RE: [nfsv4] Write-behind caching
>>
>>
>> Fair enough.  I haven't looked to see if the layout types
>> address this specific, needed, behavior.  Obviously the
>> statement you reference and the individual layout descriptions
>> should be tied together.  Again, I don't remember but there
>> may be layout specific steps needed in the case of handling
>> layoutreturns.
>>
>> In any case, we can handle the eventual conclusion as an erratum.
>>
>> Spencer
>>
>>
>>> -----Original Message-----
>>> From: david.noveck@emc.com [mailto:david.noveck@emc.com]
>>> Sent: Monday, October 25, 2010 8:25 PM
>>> To: Spencer Shepler; nfsv4@ietf.org
>>> Subject: RE: [nfsv4] Write-behind caching
>>>
>>> I agree that the intent was to cover a variety of layout types.
>>>
>>> I think what you are saying about the issue of different throughputs
>>> for having and not having layouts also makes sense.  It may in some way
>>> have led to the statement in RFC5661, but those statements are by no
>>> means the same.  They have different consequences.  I take it that you
>>> are saying (correctly) something like:
>>>
>>>      However, write-behind implementations will generally need to bound
>>>      the amount of unwritten data so that, given the bandwidth of the
>>>      output path, the data can be written in a reasonable time.  Clients
>>>      which have layouts should avoid keeping larger amounts to reflect a
>>>      situation in which a layout provides a write path of higher
>>>      bandwidth.  This is because a CB_LAYOUTRECALL may be received.  The
>>>      client should not delay returning the layout so as to use that
>>>      higher-bandwidth path, so it is best if it assumes, in limiting the
>>>      amount of data to be written, that the write bandwidth is only what
>>>      is available without the layout, and that it uses this bandwidth
>>>      assumption even if it does happen to have a layout.
>>>
>>> This differs from the text in RFC5661 in a few respects.
>>>
>>>        First, it says that the amount of dirty data should be the same
>>>        when you have the layout and when you don't, rather than simply
>>>        saying it should be small when you have the layout, possibly
>>>        implying that it should be smaller than when you don't have a
>>>        layout.
>>>
>>>        Second, the text now in RFC5661 strongly implies that when you
>>>        get a CB_LAYOUTRECALL, you would normally start new IO's, rather
>>>        than simply drain the pending IO's and return the layout ASAP.
>>>
>>> So I don't agree that what is in RFC5661 is good implementation advice,
>>> particularly in suggesting that clients should delay the LAYOUTRETURN
>>> while doing a bunch of IO, including starting new IO's.
>>>
>>>
>>> -----Original Message-----
>>> From: nfsv4-bounces@ietf.org [mailto:nfsv4-bounces@ietf.org] On Behalf Of
>>> Spencer Shepler
>>> Sent: Monday, October 25, 2010 10:07 PM
>>> To: Noveck, David; nfsv4@ietf.org
>>> Subject: Re: [nfsv4] Write-behind caching
>>>
>>>
>>> Since this description is part of the general pNFS description, the
>>> intent may have been to cover a variety of layout types.  However, I
>>> agree that the client is not guaranteed access to the layout and is
>>> fully capable of writing the data via the MDS if all else fails
>>> (inability to obtain the layout after a return); it may not be the most
>>> performant path but it should be functional.  And maybe that is the
>>> source of the statement that the client should take care in managing its
>>> dirty pages given the lack of guarantee of access to the supposed,
>>> higher-throughput path for writing data.
>>>
>>> As implementation guidance it seems okay, but it is not truly a
>>> requirement for correct function.
>>>
>>> Spencer
>>>
>>>> -----Original Message-----
>>>> From: nfsv4-bounces@ietf.org [mailto:nfsv4-bounces@ietf.org] On Behalf Of
>>>> david.noveck@emc.com
>>>> Sent: Monday, October 25, 2010 6:58 PM
>>>> To: nfsv4@ietf.org
>>>> Subject: [nfsv4] Write-behind caching
>>>>
>>>> The following statement appears at the bottom of page 292 of RFC5661.
>>>>
>>>>    However, write-behind caching may negatively
>>>>    affect the latency in returning a layout in response to a
>>>>    CB_LAYOUTRECALL; this is similar to file delegations and the impact
>>>>    that file data caching has on DELEGRETURN.  Client implementations
>>>>    SHOULD limit the amount of unwritten data they have outstanding at
>>>>    any one time in order to prevent excessively long responses to
>>>>    CB_LAYOUTRECALL.
>>>>
>>>> This does not seem to make sense to me.
>>>>
>>>> First of all, the analogy between DELEGRETURN and
>>>> CB_LAYOUTRECALL/LAYOUTRETURN doesn't seem to me to be correct.  In the
>>>> case of DELEGRETURN, at least if the file in question has been closed
>>>> during the pendency of the delegation, you do need to write all of the
>>>> dirty data associated with those previously open files.  Normally,
>>>> clients just write all dirty data.
>>>>
>>>> LAYOUTRETURN does not have that sort of requirement.  If it is valid to
>>>> hold the dirty data when you do have the layout, it is just as valid to
>>>> hold it when you don't.  You could very well return the layout and get
>>>> it again before some of those dirty blocks are written.  Having a layout
>>>> grants you the right to do IO using a particular means (different based
>>>> on the mapping type), but if you don't have the layout, you still have a
>>>> way to do the writeback, and there is no particular need to write back
>>>> all the data before returning the layout.  As mentioned above, you may
>>>> well get the layout again before there is any need to actually do the
>>>> write-back.
>>>>
>>>> You have to wait until IO's that are in flight are completed before you
>>>> return the layout.  However, I don't see why you would have to or want
>>>> to start new IO's using the layout if you have received a
>>>> CB_LAYOUTRECALL.
>>>>
>>>> Am I missing something?  Is there some valid reason for this statement?
>>>> Or should this be dealt with via the errata mechanism?
>>>>
>>>> What do existing clients actually do with pending writeback data when
>>>> they get a CB_LAYOUTRECALL?  Do they start new IO's using the layout?
>>>> If so, is there any reason other than the paragraph above?
> _______________________________________________
> nfsv4 mailing list
> nfsv4@ietf.org
> https://www.ietf.org/mailman/listinfo/nfsv4
>