Re: [nfsv4] Write-behind caching

Benny Halevy <bhalevy@panasas.com> Wed, 27 October 2010 16:48 UTC

Date: Wed, 27 Oct 2010 18:48:21 +0200
From: Benny Halevy <bhalevy@panasas.com>
To: sfaibish <sfaibish@emc.com>
Cc: nfsv4@ietf.org, trond.myklebust@fys.uio.no
Subject: Re: [nfsv4] Write-behind caching

On 2010-10-27 18:35, sfaibish wrote:
> On Wed, 27 Oct 2010 10:52:00 -0400, <david.noveck@emc.com> wrote:
> 
>> Trond's item 3 seems good to me, but I'm not sure I understand some
>> of the assumptions that are being made.
>>
>> So suppose you have these partially updated blocks where you are
>> writing.  It seems to me that it is OK to write in various ways under
>> the following conditions.
>>
>> 1) You have the block layout and it is not being recalled:
>>
>>      a) It is OK to do an NFS WRITE of a partial block.
>>
>>      b) It is dangerous to do a block write because of the
>>      possibility of corruption.
>>
>>      c) It would be foolish/erroneous to do an NFS WRITE of
>>      the full block including the modified part and the part
>>      read earlier.
>>
>> 2) When the block layout is being recalled:
>>
>>      a) It is still OK to do an NFS WRITE of the partial
>>      block.
>>
>>      b) It is dangerous to do a block write because of the
>>      possibility of corruption.
>>
>>      c) It would be foolish/erroneous to do an NFS WRITE of
>>      the full block including the modified part and the part
>>      read earlier.
>>
>> 3) When the block layout is not present:
>>
>>      a) It is definitely OK to do an NFS WRITE of the partial
>>      block.
>>
>>      b) Block write is not OK.
>>
>>      c) It would be foolish/erroneous to do an NFS WRITE of
>>      the full block including the modified part and the part
>>      read earlier.
>>
>> So to me the message is that when you have partial writes in your buffer
>> cache, you should keep track of them (a small bit mask in these cases) and
>> do the partial write.  When you give up your layout, you retain the
>> right to do the partial NFS WRITE.
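
(A minimal sketch, in C, of the dirty-range bookkeeping described above;
the 8K block size, 512-byte tracking granularity, and all names here are
illustrative assumptions, not anyone's actual implementation:)

    /* Per-block dirty tracking: one bit per 512-byte chunk of an 8K
     * file-system block.  All names are hypothetical. */
    #include <stddef.h>
    #include <stdint.h>
    #include <stdbool.h>

    #define BLOCK_SIZE 8192
    #define CHUNK_SIZE 512
    #define CHUNKS     (BLOCK_SIZE / CHUNK_SIZE)   /* 16 bits of mask */

    struct cached_block {
        uint64_t fileoff;               /* block-aligned file offset */
        uint16_t dirty_mask;            /* one bit per dirty chunk   */
        uint8_t  data[BLOCK_SIZE];
    };

    /* Record that bytes [off, off+len) of this block were modified. */
    static void mark_dirty(struct cached_block *b, size_t off, size_t len)
    {
        size_t first = off / CHUNK_SIZE;
        size_t last  = (off + len - 1) / CHUNK_SIZE;
        for (size_t i = first; i <= last; i++)
            b->dirty_mask |= (uint16_t)(1u << i);
    }

    /* Writing the full block is only clearly safe when every byte of
     * the block is locally dirty; otherwise issue partial NFS WRITEs
     * of just the dirty chunks, which is safe in cases 1a, 2a, and 3a. */
    static bool must_use_partial_nfs_write(const struct cached_block *b)
    {
        return b->dirty_mask != (uint16_t)((1u << CHUNKS) - 1);
    }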
>>
>> So it sounds to me there are two cases.
>>
>> If having the layout gives you exclusive access (as if you had a
>> delegation), then (2b) causes a possible delay problem, and the
>> appropriate warning (to avoid gathering up too many of those and
>> delaying recalls) applies only to partial writes.
>>
>> On the other hand, if it doesn't, then (1b) and (2b) are corruption
>> sources anyway; rather than doing (2b), the important thing is that the
>> write, when actually done, be a (3a) rather than a (3b).
>>
>> I think the fundamental point is the one others have been making, which
>> is that there is a difference between:
>>
>>      "If there are dirty blocks where specific
>>      circumstances make it advisable that they be
>>      written before returning a layout, care should
>>      be taken to avoid a significant accumulation of such
>>      blocks, which might unduly delay a pending recall
>>      of a layout.  If possible, the client should ensure
>>      that the writes can safely be done after the recall
>>      is completed, so as to avoid this sort of delay."
>>
>> And
>>
>>      "If there are dirty blocks, they have to be written
>>      before returning a layout.  Care should be taken to
>>      avoid an undue accumulation of dirty blocks since
>>      writing these before return the layout might unduly
>>      delay a pending recall of layout."
> I can see an additional problem with this statement; depending on the
> amount of dirty pages that are written before returning the layout, the
> server might get nervous (timeout pending) and either resend the
> layoutrecall or simply start fencing the I/Os to the DS, regardless of
> the amount of dirty pages accumulated at the client that the client
> expects to write before returning the layout.  So we might want to
> address this problem in the recommendation.

Note that this risk exists regardless of whether any data is flushed.
Even when only waiting on outstanding I/Os, the amount of I/O in flight
can be large, the interconnect can be slow, timeouts can occur, etc.

Benny

> 
> /Sorin
> 
>>
>> The first is more complicated and detailed, but also accurate.
>>
>>
>> -----Original Message-----
>> From: nfsv4-bounces@ietf.org [mailto:nfsv4-bounces@ietf.org] On Behalf
>> Of Trond Myklebust
>> Sent: Wednesday, October 27, 2010 9:40 AM
>> To: Jason Glasgow
>> Cc: nfsv4@ietf.org
>> Subject: Re: [nfsv4] Write-behind caching
>>
>>
>>
>> On Wed, 2010-10-27 at 09:05 -0400, Jason Glasgow wrote:
>>> With regards to block layout there are some specific implementation
>>> issues that historically made it important to flush dirty data before
>>> returning a layout.  Let me describe one that arose when implementing
>>> MPFS at EMC.  I have not followed the UNIX implementations of the pNFS
>>> block layout recently enough to know if this is still a concern.
>>>
>>>
>>> Assume a page cache with page size 4K, a file system with block size
>>> 8K, and a file that is 8K long.  In many UNIX-like operating systems'
>>> NFS3/MPFS implementations, a 2K write to the range 2K-4K would obtain
>>> a layout for reading and writing, and read the first 8K of the file
>>> into 2 pages in the kernel page cache.  It would then copy the 2K of
>>> data from userspace and overwrite the latter half of the first page.
>>> When it was time to write out the page, MPFS would write out the
>>> complete 8K block.  If a layout were recalled between the time the
>>> data was read from disk and when it was written, it is possible that
>>> another client would modify the range of the file from 0K to 2K.
>>> Unless specific care was taken when flushing the entire 8K block
>>> later, data corruption would occur.
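
(A concrete timeline of the hazard described above, under the same
assumptions -- 8K block, 4K pages, two clients A and B:)

    t0  A: LAYOUTGET (rw); reads block 0 (bytes 0-8K) into pages P0, P1
    t1  A: copies 2K from userspace over bytes 2K-4K of P0 (P0 now dirty)
    t2  server recalls the layout; A returns it
    t3  B: writes bytes 0-2K of the file
    t4  A: flushes the complete 8K block from P0/P1 as a block write
        -> B's update to bytes 0-2K is overwritten with A's stale copy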
>>>
>>> There are two ways to avoid the problem.
>>>
>>>
>>> 1. Keep track of the byte ranges of a page (or file system block)
>>> that are dirty and only perform the read-modify-write cycle while the
>>> client holds the layout.  This can get messy if a client writes every
>>> other byte on a page.
>>> 2. Do not return a layout containing a dirty page until that page has
>>> been written.
>>
>> 3. In order to ensure you can safely do read-modify-write in the block
>> pNFS case, never request layout sizes that are not a multiple of the
>> block size.
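
(A sketch of Trond's item 3 in C: round every LAYOUTGET range out to
file-system block boundaries so that a later read-modify-write never
touches data outside the layout held.  The block size is assumed to be
known to the client; the names are hypothetical:)

    #include <stdint.h>

    struct layout_range {
        uint64_t offset;
        uint64_t length;
    };

    /* Expand the intended I/O range [io_off, io_off+io_len) to a
     * block-aligned range before requesting the layout. */
    static struct layout_range
    align_layout_request(uint64_t io_off, uint64_t io_len, uint64_t blksz)
    {
        struct layout_range r;
        uint64_t end = io_off + io_len;

        r.offset = io_off - (io_off % blksz);             /* round down */
        end      = ((end + blksz - 1) / blksz) * blksz;   /* round up  */
        r.length = end - r.offset;
        return r;
    }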
>>
>>> Perhaps this sheds some light on the original motivation.
>>
>> Yes, but the problem you are describing does not justify a _requirement_
>> that the client flush out its data. Only that it consider the
>> implications of corner cases such as the above.
>>
>> Trond
>>
>>
>>> Regarding the block layout, I am entirely sympathetic to arguments
>>> that a layout recall should only wait for outstanding writes to
>>> complete, and should not cause the client to initiate new writes.
>>>
>>>
>>> -Jason
>>>
>>>
>>>
>>>
>>> On Wed, Oct 27, 2010 at 1:07 AM, Dean Hildebrand
>>> <seattleplus@gmail.com> wrote:
>>>         I remember at one time there was a thought that all dirty data
>>>         would have to be written to disk when the client receives a
>>>         layoutrecall.  Once the data was written, it would send a
>>>         layoutreturn.  I think this was the thinking before all the
>>>         timing issues and other such things cropped up.  I assume
>>>         someone wrote that as general advice, somehow thinking that
>>>         responding to a layoutrecall was more important than actually
>>>         achieving good write performance.
>>>
>>>         In this light, the analogy with delegreturn makes sense if you
>>>         take a very specific example, but obviously not in general.
>>>
>>>         I would vote to just cut this text, as I think it is simply
>>>         outdated.
>>>         Dean
>>>
>>>
>>>
>>>                 That makes sense.  Let me take on this issue with
>>>                 regard to the file layout.  Are there volunteers to
>>>                 address it with regard to block and object?  It would
>>>                 be great if we could get together in Beijing, discuss
>>>                 this, and come to a joint conclusion to present to the
>>>                 working group (via email I mean).  I'm not planning to
>>>                 try to do this before the working group meeting.  In
>>>                 any case, I'm pretty sure there won't be any time
>>>                 during the working group meeting.
>>>
>>>                 -----Original Message-----
>>>                 From: Spencer Shepler [mailto:sshepler@microsoft.com]
>>>                 Sent: Monday, October 25, 2010 11:34 PM
>>>                 To: Noveck, David; nfsv4@ietf.org
>>>                 Subject: RE: [nfsv4] Write-behind caching
>>>
>>>
>>>                 Fair enough.  I haven't looked to see if the layout
>>>                 types address this specific, needed, behavior.
>>>                 Obviously the statement you reference and the
>>>                 individual layout descriptions should be tied
>>>                 together.  Again, I don't remember, but there may be
>>>                 layout-specific steps needed in the case of handling
>>>                 layoutreturns.
>>>
>>>                 In any case, we can handle the eventual conclusion as
>>>                 an errata.
>>>
>>>                 Spencer
>>>
>>>
>>>                         -----Original Message-----
>>>                         From: david.noveck@emc.com
>>>                         [mailto:david.noveck@emc.com]
>>>                         Sent: Monday, October 25, 2010 8:25 PM
>>>                         To: Spencer Shepler; nfsv4@ietf.org
>>>                         Subject: RE: [nfsv4] Write-behind caching
>>>
>>>                         I agree that the intent was to cover a variety
>>>                         of layout types.
>>>
>>>                         I think what you are saying about the issue of
>>>                         different throughputs for having and not having
>>>                         layouts also makes sense.  It may in some way
>>>                         have led to the statement in RFC5661 but those
>>>                         statements are by no means the same.  They have
>>>                         different consequences.  I take it that you are
>>>                         saying (correctly) something like:
>>>
>>>                              However, write-behind implementations
>>>                              will generally need to bound the amount
>>>                              of unwritten data so that, given the
>>>                              bandwidth of the output path, the data
>>>                              can be written in a reasonable time.
>>>                              Clients which have layouts should avoid
>>>                              keeping larger amounts to reflect a
>>>                              situation in which a layout provides a
>>>                              write path of higher bandwidth.  This is
>>>                              because a CB_LAYOUTRECALL may be received.
>>>                              The client should not delay returning the
>>>                              layout so as to use that higher-bandwidth
>>>                              path, so it is best if it assumes, in
>>>                              limiting the amount of data to be written,
>>>                              that the write bandwidth is only what is
>>>                              available without the layout, and that it
>>>                              uses this bandwidth assumption even if it
>>>                              does happen to have a layout.
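
(A minimal sketch of the bound described in the text above: cap dirty,
unwritten data at what the non-layout (MDS) path can flush within an
acceptable recall-response time.  The figures and names are illustrative
assumptions:)

    #include <stdint.h>

    /* Use the MDS-path bandwidth even while a layout is held, since a
     * CB_LAYOUTRECALL may arrive and the higher-bandwidth layout path
     * cannot be counted on. */
    static uint64_t max_dirty_bytes(uint64_t mds_bw_bytes_per_sec,
                                    uint64_t flush_budget_secs)
    {
        return mds_bw_bytes_per_sec * flush_budget_secs;
    }

    /* Example: 100 MB/s via the MDS and a 2-second budget allow about
     * 200 MB of write-behind data, with or without a layout. */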
>>>
>>>                         This differs from the text in RFC5661 in a few
>>>                         respects.
>>>
>>>                                First, it says that the amount of dirty
>>>                         data should be the same when you have the
>>>                         layout and when you don't, rather than simply
>>>                         saying it should be small when you have the
>>>                         layout, possibly implying that it should be
>>>                         smaller than when you don't have a layout.
>>>
>>>                                Second, the text now in RFC5661 strongly
>>>                         implies that when you get CB_LAYOUTRECALL, you
>>>                         would normally start new IO's, rather than
>>>                         simply drain the pending IO's and return the
>>>                         layout ASAP.
>>>
>>>                         So I don't agree that what is in RFC5661 is
>>>                         good implementation advice, particularly in
>>>                         suggesting that clients should delay the
>>>                         LAYOUTRETURN while doing a bunch of IO,
>>>                         including starting new IO's.
>>>
>>>
>>>                         -----Original Message-----
>>>                         From: nfsv4-bounces@ietf.org
>>>                         [mailto:nfsv4-bounces@ietf.org] On Behalf Of
>>>                         Spencer Shepler
>>>                         Sent: Monday, October 25, 2010 10:07 PM
>>>                         To: Noveck, David; nfsv4@ietf.org
>>>                         Subject: Re: [nfsv4] Write-behind caching
>>>
>>>
>>>                         Since this description is part of the general
>>>                         pNFS description, the intent may have been to
>>>                         cover a variety of layout types.  However, I
>>>                         agree that the client is not guaranteed access
>>>                         to the layout and is fully capable of writing
>>>                         the data via the MDS if all else fails
>>>                         (inability to obtain the layout after a
>>>                         return); it may not be the most performant path
>>>                         but it should be functional.  And maybe that is
>>>                         the source of the statement that the client
>>>                         should take care in managing its dirty pages
>>>                         given the lack of guarantee of access to the
>>>                         supposed, higher-throughput path for writing
>>>                         data.
>>>
>>>                         As implementation guidance it seems okay, but
>>>                         it is not truly a requirement for correct
>>>                         function.
>>>
>>>                         Spencer
>>>
>>>                                 -----Original Message-----
>>>                                 From: nfsv4-bounces@ietf.org
>>>                                 [mailto:nfsv4-bounces@ietf.org] On
>>>                                 Behalf Of david.noveck@emc.com
>>>                                 Sent: Monday, October 25, 2010 6:58 PM
>>>                                 To: nfsv4@ietf.org
>>>                                 Subject: [nfsv4] Write-behind caching
>>>
>>>                                 The following statement appears at the
>>>                                 bottom of page 292 of RFC5661.
>>>
>>>                                    However, write-behind caching may
>>>                                    negatively affect the latency in
>>>                                    returning a layout in response to a
>>>                                    CB_LAYOUTRECALL; this is similar to
>>>                                    file delegations and the impact that
>>>                                    file data caching has on DELEGRETURN.
>>>                                    Client implementations SHOULD limit
>>>                                    the amount of unwritten data they
>>>                                    have outstanding at any one time in
>>>                                    order to prevent excessively long
>>>                                    responses to CB_LAYOUTRECALL.
>>>
>>>                                 This does not seem to make sense to me.
>>>
>>>                                 First of all, the analogy between
>>>                                 DELEGRETURN and
>>>                                 CB_LAYOUTRECALL/LAYOUTRETURN doesn't
>>>                                 seem to me to be correct.  In the case
>>>                                 of DELEGRETURN, at least if the file in
>>>                                 question has been closed during the
>>>                                 pendency of the delegation, you do need
>>>                                 to write all of the dirty data
>>>                                 associated with those previously open
>>>                                 files.  Normally, clients just write
>>>                                 all dirty data.
>>>
>>>                                 LAYOUTRETURN does not have that sort of
>>>                                 requirement.  If it is valid to hold
>>>                                 the dirty data when you do have the
>>>                                 layout, it is just as valid to hold it
>>>                                 when you don't.  You could very well
>>>                                 return the layout and get it again
>>>                                 before some of those dirty blocks are
>>>                                 written.  Having a layout grants you
>>>                                 the right to do IO using a particular
>>>                                 means (different based on the mapping
>>>                                 type), but if you don't have the
>>>                                 layout, you still have a way to do the
>>>                                 writeback, and there is no particular
>>>                                 need to write back all the data before
>>>                                 returning the layout.  As mentioned
>>>                                 above, you may well get the layout
>>>                                 again before there is any need to
>>>                                 actually do the write-back.  You have
>>>                                 to wait until IO's that are in flight
>>>                                 are completed before you return the
>>>                                 layout.  However, I don't see why you
>>>                                 would have to or want to start new IO's
>>>                                 using the layout if you have received a
>>>                                 CB_LAYOUTRECALL.  Am I missing
>>>                                 something?  Is there some valid reason
>>>                                 for this statement?  Or should this be
>>>                                 dealt with via the errata mechanism?
>>>
>>>                                 What do existing clients actually do
>>>                                 with pending writeback data when they
>>>                                 get a CB_LAYOUTRECALL?  Do they start
>>>                                 new IO's using the layout?  If so, is
>>>                                 there any reason other than the
>>>                                 paragraph above?
>>>
_______________________________________________
nfsv4 mailing list
nfsv4@ietf.org
https://www.ietf.org/mailman/listinfo/nfsv4