Re: [nfsv4] Write-behind caching

sfaibish <sfaibish@emc.com> Fri, 29 October 2010 21:31 UTC

Date: Fri, 29 Oct 2010 17:32:15 -0400
To: Trond Myklebust <trond.myklebust@fys.uio.no>, david.noveck@emc.com
From: sfaibish <sfaibish@emc.com>
Organization: EMC
Content-Type: text/plain; format="flowed"; delsp="yes"; charset="iso-8859-15"
MIME-Version: 1.0
References: <BF3BB6D12298F54B89C8DCC1E4073D80028C76DB@CORPUSMX50A.corp.emc.com> <E043D9D8EE3B5743B8B174A814FD584F0D498D54@TK5EX14MBXC126.redmond.corp.microsoft.com> <BF3BB6D12298F54B89C8DCC1E4073D80028C76E0@CORPUSMX50A.corp.emc.com> <E043D9D8EE3B5743B8B174A814FD584F0D498E1D@TK5EX14MBXC126.redmond.corp.microsoft.com> <BF3BB6D12298F54B89C8DCC1E4073D80028C76EA@CORPUSMX50A.corp.emc.com> <4CC7B3AE.8000802@gmail.com> <AANLkTi=gD+qr-OhJuf19miV60w9t9TbJiopNS6y4-YVA@mail.gmail.com> <1288186821.8477.28.camel@heimdal.trondhjem.org> <BF3BB6D12298F54B89C8DCC1E4073D80028C7A3E@CORPUSMX50A.corp.emc.com> <op.vk8tpuc5unckof@usensfaibisl2e.eng.emc.com> <4CC857D5.5010104@panasas.com> <op.vk8vpbldunckof@usensfaibisl2e.eng.emc.com> <BF3BB6D12298F54B89C8DCC1E4073D80028C80AB@CORPUSMX50A.corp.emc.com> <1288373995.3701.35.camel@heimdal.trondhjem.org>
Content-Transfer-Encoding: 8bit
Message-ID: <op.vlcwr1zqunckof@usensfaibisl2e.eng.emc.com>
In-Reply-To: <1288373995.3701.35.camel@heimdal.trondhjem.org>
User-Agent: Opera Mail/9.10 (Win32)
X-EMM-MHVC: 1
Cc: bhalevy@panasas.com, nfsv4@ietf.org
Subject: Re: [nfsv4] Write-behind caching

On Fri, 29 Oct 2010 13:39:55 -0400, Trond Myklebust  
<trond.myklebust@fys.uio.no> wrote:

> On Fri, 2010-10-29 at 13:20 -0400, david.noveck@emc.com wrote:
>> There are two issues here with regard to handling of layout recall.
>>
>> One is with regard to in-flight IO.  As Benny points out, you cannot be  
>> sure that the in-flight IO can be completed in time to avoid the MDS  
>> losing patience.  That should rarely be the case though, if things are  
>> working right.  The client has to be prepared to deal with IO failures  
>> due to layout revocation.  Any IO that was in flight and failed because  
>> of layout revocation will need to be handled by being reissued to the  
>> MDS.  Is there anybody that disagrees with that?
>>
>> The second issue concerns IO not in-flight (in other words, not IO's  
>> yet but potential IO's) when the recall is received.  I just don't see  
>> that it is reasonable to start IO's using layout segments being recalled  
>> (whether for dirty buffers or anything else).  Doing IO's to the MDS is  
>> fine but there is no real need for the layout recall to specially  
>> trigger them, whether clora_changed is set or not.
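
A minimal sketch of the recall handling described in the two points above, in Python. The `Client`, `Write` and `Path` names are hypothetical and only model the state machine (no real NFS calls are made): in-flight layout I/O is drained, no new layout I/O is started after the recall, and anything that fails because of revocation is reissued through the MDS.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class Path(Enum):
    LAYOUT = auto()  # direct I/O to the data servers under the layout
    MDS = auto()     # ordinary NFS WRITE through the metadata server


@dataclass(eq=False)  # compare writes by identity, not by field values
class Write:
    offset: int
    length: int
    path: Path = Path.LAYOUT


@dataclass
class Client:
    recalled: bool = False
    returned_layout: bool = False
    in_flight: list = field(default_factory=list)

    def issue(self, w):
        # Once a recall has been seen, new I/O never uses the layout.
        w.path = Path.MDS if self.recalled else Path.LAYOUT
        self.in_flight.append(w)

    def on_layout_recall(self):
        # Do not flush dirty buffers here; just stop starting layout I/O.
        self.recalled = True
        self._maybe_return_layout()

    def complete(self, w, revoked=False):
        self.in_flight.remove(w)
        if revoked and w.path is Path.LAYOUT:
            # The layout was revoked under us: reissue the range via the MDS.
            self.recalled = True
            self.issue(Write(w.offset, w.length))
        self._maybe_return_layout()

    def _maybe_return_layout(self):
        layout_io = [w for w in self.in_flight if w.path is Path.LAYOUT]
        if self.recalled and not layout_io:
            self.returned_layout = True  # stands in for sending LAYOUTRETURN
```

In this model the layout is returned as soon as the last layout I/O completes or fails, regardless of how many dirty buffers are still cached.
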
>
> This should be _very_ rare. Any cases where 2 clients are trying to do
> conflicting I/O on the same data is likely to be either a violation of
> the NFS cache consistency rules, or a scenario where it is in any case
> more efficient to go through the MDS (e.g. writing to adjacent records
> that share the same extent).

Well, this is a different discussion: what was the reason for the recall in
the first place? This is one use case, but there could be other use cases
for the recall, and we are discussing here how to implement the protocol more
than how to solve a real problem. My 2c.

/Sorin

>
>> There apparently is a disagreement about this latter point and there  
>> are a number of pieces of the spec which assume that, you will or  
>> should start IO of dirty blocks in your buffer cache and issue it using  
>> a layout in the process of being recalled.  I can't understand the  
>> logic behind this.   It is true that waiting for in-flight IO only  
>> is not an absolute guarantee of success, but to me that doesn't argue  
>> that somehow it is OK to make the situation even worse by starting IO's  
>> using a layout that is being recalled.  There is no way the DS can tell  
>> that you didn't issue these a microsecond before you got the recall, so  
>> it should work (except for the timing issue as far as return) but I  
>> just don't see the spec suggesting that this (starting layout IO using  
>> a recalled layout) is a reasonable way for the client to work.
>
> It isn't...
>
>> It appears that there are some people who disagree about this so tell  
>> me why I'm wrong.
>>
>> -----Original Message-----
>> From: sfaibish [mailto:sfaibish@emc.com]
>> Sent: Wednesday, October 27, 2010 1:19 PM
>> To: Benny Halevy
>> Cc: Noveck, David; trond.myklebust@fys.uio.no; jglasgow@aya.yale.edu;  
>> nfsv4@ietf.org
>> Subject: Re: [nfsv4] Write-behind caching
>>
>> On Wed, 27 Oct 2010 12:48:21 -0400, Benny Halevy <bhalevy@panasas.com>
>> wrote:
>>
>> > On 2010-10-27 18:35, sfaibish wrote:
>> >> On Wed, 27 Oct 2010 10:52:00 -0400, <david.noveck@emc.com> wrote:
>> >>
>> >>> Trond's item 3 seems good to me but I'm not understanding some
>> >>> assumptions that are being made.
>> >>>
>> >>> So suppose you have these partially updated blocks where you are
>> >>> writing.  It seems to me that it is OK to write in various ways  
>> under
>> >>> the following conditions.
>> >>>
>> >>> 1) You have the block layout not recalled:
>> >>>
>> >>>      a) It is OK to do an NFS WRITE of a partial block.
>> >>>
>> >>>      b) It is dangerous to do a block write because of the
>> >>>      possibility of corruption.
>> >>>
>> >>>      c) It would be foolish/erroneous to do an NFS WRITE of
>> >>>      the full block including the modified part and the part
>> >>>      read earlier.
>> >>>
>> >>> 2) When the block layout is being recalled:
>> >>>
>> >>>      a) It is still OK to do an NFS WRITE of the partial
>> >>>      block.
>> >>>
>> >>>      b) It is dangerous to do a block write because of the
>> >>>      possibility of corruption.
>> >>>
>> >>>      c) It would be foolish/erroneous to do an NFS WRITE of
>> >>>      the full block including the modified part and the part
>> >>>      read earlier.
>> >>>
>> >>> 3) When the block layout is not present
>> >>>
>> >>>      a) It is definitely OK to do an NFS WRITE of the partial
>> >>>      block.
>> >>>
>> >>>      b) Block write is not OK.
>> >>>
>> >>>      c) It would be foolish/erroneous to do an NFS WRITE of
>> >>>      the full block including the modified part and the part
>> >>>      read earlier.
>> >>>
>> >>> So to me the message is that when you have partial writes in your
>> >>> buffer
>> >>> cache you should keep track of them (small bit mask in these cases)  
>> and
>> >>> do the partial write.  When you give up your layout, you retain the
>> >>> right to do the partial NFS WRITE.
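
The "small bit mask" idea above could look roughly like this. It is a sketch only: the 8K block size matches the example elsewhere in this thread, while the 512-byte tracking granularity and the function names are assumptions. It also assumes writes are aligned to that granularity; an unaligned write would mark bytes dirty that were never modified, so a real client would need finer tracking for those.

```python
BLOCK_SIZE = 8192              # assumed file system block size
CHUNK = 512                    # assumed tracking granularity (16 bits per block)
NCHUNKS = BLOCK_SIZE // CHUNK


def mark_dirty(mask, offset, length):
    """Return mask with the bits covering [offset, offset + length) set."""
    if offset < 0 or length <= 0 or offset + length > BLOCK_SIZE:
        raise ValueError("range must lie inside the block")
    first = offset // CHUNK
    last = (offset + length - 1) // CHUNK
    for i in range(first, last + 1):
        mask |= 1 << i
    return mask


def dirty_ranges(mask):
    """Return (offset, length) runs of dirty chunks, one partial WRITE each."""
    runs = []
    start = None
    for i in range(NCHUNKS + 1):
        dirty = i < NCHUNKS and bool((mask >> i) & 1)
        if dirty and start is None:
            start = i
        elif not dirty and start is not None:
            runs.append((start * CHUNK, (i - start) * CHUNK))
            start = None
    return runs
```

Only the ranges returned by `dirty_ranges` would be sent as NFS WRITEs, so the part of the block that was merely read is never written back.
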
>> >>>
>> >>> So it sounds to me there are two cases.
>> >>>
>> >>> If having the layout gives you exclusive access (as if you had a
>> >>> delegation), then (2b) causes a possible delay problem and the
>> >>> appropriate warning is to avoid gathering up too many of those and
>> >>> delaying recalls only applies to partial writes.
>> >>>
>> >>> On the other hand, if it doesn't, then (1b) and (2b) are corruption
>> >>> sources anyway and that rather than do (2b), the important thing is  
>> the
>> >>> write when actually done be a (3a) rather than a (3b).
>> >>>
>> >>> I think the fundamental point is the one others have been making,  
>> which
>> >>> is that there is a difference between,
>> >>>
>> >>>      "If there are dirty blocks where specific
>> >>>      circumstances, make it advisable that they be
>> >>>      written before returning a layout, care should
>> >>>      be taken to avoid a significant accumulation of such
>> >>>      blocks which might unduly delay a pending recall
>> >>>      of a layout.  If possible, the client should ensure
>> >>>      that the writes can safely be done after the recall
>> >>>      is completed, so as to avoid this sort of delay."
>> >>>
>> >>> And
>> >>>
>> >>>      "If there are dirty blocks, they have to be written
>> >>>      before returning a layout.  Care should be taken to
>> >>>      avoid an undue accumulation of dirty blocks since
>> >>>      writing these before returning the layout might unduly
>> >>>      delay a pending recall of layout."
>> >> I can see an additional problem with this statement; depending on the
>> >> amount
>> >> of dirty pages that are written before returning the layout the  
>> server
>> >> might
>> >> get nervous (timeout pending) and either resend the layoutrecall or
>> >> simply
>> >> start fencing the I/Os to DS regardless of the amount of dirty pages
>> >> accumulated at the client and that the client expects to write before
>> >> returning the layout. So, we might want to address this problem
>> >> in the recommendation.
>> >
>> > Note that this risk exists regardless of flushing any data.
>> > Even when waiting on outstanding I/Os, the amount of I/O
>> > in flight can be large, interconnect can be slow, timeouts can occur,
>> > etc.
>> My point was to recommend a balanced approach and judgement on the client  
>> side
>> to minimize this opportunity for scalability impact. I would leave it to
>> the discretion of the client implementors to find the right balance
>> for the writes not in flight. For the in-flight ones it is what it is; we
>> cannot do much, but the case Dave was talking about is the writes not in
>> flight. And I see that Dave's argument was correct; still, a recommendation
>> for client implementors might be worth adding.
>>
>> /Sorin
>>
>> >
>> > Benny
>> >
>> >>
>> >> /Sorin
>> >>
>> >>>
>> >>> With the first being more complicated and detailed but also  
>> accurate.
>> >>>
>> >>>
>> >>> -----Original Message-----
>> >>> From: nfsv4-bounces@ietf.org [mailto:nfsv4-bounces@ietf.org] On  
>> Behalf
>> >>> Of Trond Myklebust
>> >>> Sent: Wednesday, October 27, 2010 9:40 AM
>> >>> To: Jason Glasgow
>> >>> Cc: nfsv4@ietf.org
>> >>> Subject: Re: [nfsv4] Write-behind caching
>> >>>
>> >>>
>> >>>
>> >>> On Wed, 2010-10-27 at 09:05 -0400, Jason Glasgow wrote:
>> >>>> With regards to block layout there are some specific implementation
>> >>>> issues that historically made it important to flush dirty data  
>> before
>> >>>> returning a layout.  Let me describe one that arose when  
>> implementing
>> >>>> MPFS at EMC.  I have not followed the UNIX implementations of the  
>> pNFS
>> >>>> block layout recently enough to know if this is still a concern.
>> >>>>
>> >>>>
>> >>>> Assume a page cache with page size 4K, a file system with block  
>> size
>> >>>> 8K and a file that is 8K long.  In many UNIX like operating systems
>> >>>> NFS3/MPFS implementations, a 2K write to the range 2K-4K would  
>> obtain
>> >>>> a layout for reading and writing, read the first 8K of the file  
>> into 2
>> >>>> pages in the kernel page cache.  It would then copy the 2K of data
>> >>>> from userspace and overwrite the later half of the first page.   
>> When
>> >>>> it was time to write out the page, MPFS would write out the  
>> complete
>> >>>> 8K block.  If a layout were recalled between the time the data was
>> >>>> read from disk, and when it was written, it is possible that  
>> another
>> >>>> client would modify the range of the file from 0K to 2K.  Unless
>> >>>> specific care was taken when flushing the entire 8K block later,  
>> data
>> >>>> corruption would occur.
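
The corruption in this scenario can be reproduced with a small in-memory model. This is only a sketch: `storage` stands in for the shared block on disk, the two flush functions are hypothetical, and the byte values are arbitrary markers.

```python
BLOCK = 8192  # file system block size from the example above


def full_block_flush(storage, cached_block):
    """Write back the whole cached block (the read-modify-write case)."""
    storage[0:BLOCK] = cached_block


def partial_flush(storage, cached_block, offset, length):
    """Write back only the bytes the client actually modified."""
    storage[offset:offset + length] = cached_block[offset:offset + length]


def scenario(flush_full):
    storage = bytearray(b"o" * BLOCK)   # original file contents
    cache_a = bytearray(storage)        # client A reads the full 8K block
    cache_a[2048:4096] = b"A" * 2048    # A's 2K write lands in its cache
    # The layout is recalled here, and client B modifies the range 0K-2K.
    storage[0:2048] = b"B" * 2048
    if flush_full:
        full_block_flush(storage, cache_a)
    else:
        partial_flush(storage, cache_a, 2048, 2048)
    return bytes(storage)
```

With `scenario(True)` the first 2K ends up holding A's stale copy and B's update is lost; with `scenario(False)` both updates survive.
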
>> >>>>
>> >>>> There are two ways to avoid the problem.
>> >>>>
>> >>>>
>> >>>> 1. Keep track of the byte ranges of a page (or file system block)
>> >>>> that are dirty and only perform the read-modify-write cycle while  
>> the
>> >>>> client holds the layout.  This can get messy if a client writes  
>> every
>> >>>> other byte on a page.
>> >>>> 2.  Do not return a layout containing a dirty page until that page  
>> has
>> >>>> been written.
>> >>>
>> >>> 3. In order to ensure you can safely do read-modify-write in the  
>> block
>> >>> pNFS case, never request layout sizes that are not a multiple of the
>> >>> block size.
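
Option 3 amounts to rounding the requested range out to block boundaries before asking for the layout. A sketch, with a hypothetical helper name and no claim about how any existing client does it:

```python
def align_layout_request(offset, length, block_size):
    """Round [offset, offset + length) outward to whole blocks.

    Returns the (offset, length) to request so that both ends of the
    layout fall on block boundaries.
    """
    if offset < 0 or length <= 0 or block_size <= 0:
        raise ValueError("offset must be >= 0, length and block_size > 0")
    start = (offset // block_size) * block_size
    end = -(-(offset + length) // block_size) * block_size  # ceiling division
    return start, end - start
```

For the 2K write at offset 2K in the example above, with 8K blocks, this requests the whole first block: `align_layout_request(2048, 2048, 8192)` gives `(0, 8192)`.
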
>> >>>
>> >>>> Perhaps this sheds some light on the original motivation.
>> >>>
>> >>> Yes, but the problem you are describing does not justify a
>> >>> _requirement_
>> >>> that the client flush out its data. Only that it consider the
>> >>> implications of corner cases such as the above.
>> >>>
>> >>> Trond
>> >>>
>> >>>
>> >>>> Regarding the block layout, I am entirely sympathetic to arguments
>> >>>> that a layout recall should only wait for outstanding writes  
>> to complete,
>> >>>> and should not cause the client to initiate new writes.
>> >>>>
>> >>>>
>> >>>> -Jason
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>> On Wed, Oct 27, 2010 at 1:07 AM, Dean Hildebrand
>> >>>> <seattleplus@gmail.com> wrote:
>> >>>>         I remember at one time there was a thought that all dirty  
>> data
>> >>>>         would have to be written to disk when it receives a
>> >>>>         layoutrecall.  Once the data was written, it would send a
>> >>>>         layoutreturn.  I think this was the thinking before all the
>> >>>>         timing issues and other such things cropped up.  I assume
>> >>>>         someone wrote that as general advice, somehow thinking that
>> >>>>         responding to a layoutrecall was more important than  
>> actually
>> >>>>         achieving good write performance.
>> >>>>
>> >>>>         In this light, the analogy with delegreturn makes sense if  
>> you
>> >>>>         take a very specific example, but obviously not in general.
>> >>>>
>> >>>>         I would vote to just cut this text, as I think it is simply
>> >>>>         outdated.
>> >>>>         Dean
>> >>>>
>> >>>>
>> >>>>
>> >>>>         On 10/26/2010 3:34 AM, david.noveck@emc.com wrote:
>> >>>>                 That makes sense.  Let me take on this issue with
>> >>>>                 regard to the file
>> >>>>                 layout.  Are there volunteers to address it with
>> >>>>                 regard to block and
>> >>>>                 object?  It would be great if we could get  
>> together in
>> >>>>                 Beijing, discuss
>> >>>>                 this, and come to a joint conclusion to present to  
>> the
>> >>>>                 working group
>> >>>>                 (via email I mean).  I'm not planning to try to do
>> >>>>                 this before the
>> >>>>                 working group meeting.  In any case, I'm pretty  
>> sure
>> >>>>                 there won't be any
>> >>>>                 time during the working group meeting.
>> >>>>
>> >>>>                 -----Original Message-----
>> >>>>                 From: Spencer Shepler  
>> [mailto:sshepler@microsoft.com]
>> >>>>                 Sent: Monday, October 25, 2010 11:34 PM
>> >>>>                 To: Noveck, David; nfsv4@ietf.org
>> >>>>                 Subject: RE: [nfsv4] Write-behind caching
>> >>>>
>> >>>>
>> >>>>                 Fair enough.  I haven't looked to see if the layout
>> >>>>                 types
>> >>>>                 address this specific, needed, behavior.  Obviously
>> >>>>                 the
>> >>>>                 statement you reference and the individual layout
>> >>>>                 descriptions
>> >>>>                 should be tied together.  Again, I don't remember  
>> but
>> >>>>                 there
>> >>>>                 may be layout specific steps needed in the case of
>> >>>>                 handling
>> >>>>                 layoutreturns.
>> >>>>
>> >>>>                 In any case, we can handle the eventual conclusion  
>> as
>> >>>>                 an errata.
>> >>>>
>> >>>>                 Spencer
>> >>>>
>> >>>>
>> >>>>                         -----Original Message-----
>> >>>>                         From: david.noveck@emc.com
>> >>>>                         [mailto:david.noveck@emc.com]
>> >>>>                         Sent: Monday, October 25, 2010 8:25 PM
>> >>>>                         To: Spencer Shepler; nfsv4@ietf.org
>> >>>>                         Subject: RE: [nfsv4] Write-behind caching
>> >>>>
>> >>>>                         I agree that the intent was to cover a  
>> variety
>> >>>>                         of layout types.
>> >>>>
>> >>>>                         I think what you are saying about the  
>> issue of
>> >>>>                         different throughputs
>> >>>>                 for
>> >>>>                         having and not having layouts also makes
>> >>>>                         sense.  It may in some way
>> >>>>                 have
>> >>>>                         led to the statement in RFC5661 but those
>> >>>>                         statements are by no means
>> >>>>                 the
>> >>>>                         same.  They have different consequences.  I
>> >>>>                         take it that you are
>> >>>>                 saying
>> >>>>                         (correctly) something like:
>> >>>>
>> >>>>                              However, write-behind implementations
>> >>>>                         will generally need to
>> >>>>                 bound
>> >>>>                              the amount of unwritten data so that
>> >>>>                         given the bandwidth of the
>> >>>>                              output path, the data can be written  
>> in a
>> >>>>                         reasonable time.
>> >>>>                 Clients
>> >>>>                              which have layouts should avoid  
>> keeping
>> >>>>                         larger amounts to reflect
>> >>>>                 a
>> >>>>                              situation in which a layout provides a
>> >>>>                         write path of higher
>> >>>>                         bandwidth.
>> >>>>                              This is because a CB_LAYOUTRECALL may  
>> be
>> >>>>                         received.  The client
>> >>>>                              should not delay returning the layout  
>> so
>> >>>>                         as to use that higher-
>> >>>>                         bandwidth
>> >>>>                              path, so it is best if it assumes, in
>> >>>>                         limiting the amount of data
>> >>>>                              to be written, that the write  
>> bandwidth
>> >>>>                         is only what is available
>> >>>>                              without the layout, and that it uses  
>> this
>> >>>>                         bandwidth assumption
>> >>>>                 even
>> >>>>                              if it does happen to have a layout.
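
The bandwidth assumption in the proposed text above reduces to a simple bound. A sketch, where the function name, the parameters and the idea of a fixed flush-time budget are all assumptions made for illustration:

```python
def dirty_limit_bytes(mds_bytes_per_sec, max_flush_seconds, have_layout=False):
    """Upper bound on unwritten data a write-behind cache should hold.

    have_layout is deliberately ignored: the bound is computed as if only
    the path through the MDS were available, because the layout may be
    recalled before the data is written.
    """
    if mds_bytes_per_sec <= 0 or max_flush_seconds <= 0:
        raise ValueError("bandwidth and time budget must be positive")
    return int(mds_bytes_per_sec * max_flush_seconds)
```

For example, with 100 MB/s available through the MDS and a 5 second budget, the limit is 500 MB whether or not the client currently holds a layout.
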
>> >>>>
>> >>>>                         This differs from the text in RFC5661 in a  
>> few
>> >>>>                         respects.
>> >>>>
>> >>>>                                First it says that the amount of  
>> dirty
>> >>>>                         data should be the same
>> >>>>                 when
>> >>>>                                you have the layout and when you  
>> don't,
>> >>>>                         rather than simply
>> >>>>                 saying it
>> >>>>                                should be small when you have the
>> >>>>                         layout, possibly implying that
>> >>>>                 it
>> >>>>                                should be smaller than when you  
>> don't
>> >>>>                         have a layout.
>> >>>>
>> >>>>                                Second the text now in RFC5661  
>> strongly
>> >>>>                         implies that when you
>> >>>>                 get
>> >>>>                                CB_LAYOUTRECALL, you would normally
>> >>>>                         start new IO's, rather than
>> >>>>                               simply drain the pending IO's and  
>> return
>> >>>>                         the layout ASAP.
>> >>>>
>> >>>>                         So I don't agree that what is in RFC5661 is
>> >>>>                         good implementation
>> >>>>                 advice,
>> >>>>                         particularly in suggesting that clients  
>> should
>> >>>>                         delay the LAYOUTRETURN
>> >>>>                         while doing a bunch of IO, including  
>> starting
>> >>>>                         new IO's.
>> >>>>
>> >>>>
>> >>>>                         -----Original Message-----
>> >>>>                         From: nfsv4-bounces@ietf.org
>> >>>>                         [mailto:nfsv4-bounces@ietf.org] On Behalf
>> >>>>                 Of
>> >>>>                         Spencer Shepler
>> >>>>                         Sent: Monday, October 25, 2010 10:07 PM
>> >>>>                         To: Noveck, David; nfsv4@ietf.org
>> >>>>                         Subject: Re: [nfsv4] Write-behind caching
>> >>>>
>> >>>>
>> >>>>                         Since this description is part of the  
>> general
>> >>>>                         pNFS description, the
>> >>>>                 intent
>> >>>>                         may have been to cover a variety of layout
>> >>>>                         types.  However, I agree
>> >>>>                 that
>> >>>>                         the client is not guaranteed access to the
>> >>>>                         layout and is fully capable
>> >>>>                 of
>> >>>>                         writing the data via the MDS if all else  
>> fails
>> >>>>                         (inability to obtain
>> >>>>                 the
>> >>>>                         layout after a return); it may not be the  
>> most
>> >>>>                         performant path but it
>> >>>>                         should be functional.  And maybe that is  
>> the
>> >>>>                         source of the statement
>> >>>>                 that
>> >>>>                         the client should take care in managing its
>> >>>>                         dirty pages given the lack
>> >>>>                 of
>> >>>>                         guarantee of access to the supposed, higher
>> >>>>                         throughput path for
>> >>>>                 writing
>> >>>>                         data.
>> >>>>
>> >>>>                         As implementation guidance it seems okay  
>> but
>> >>>>                         truly a requirement for
>> >>>>                         correct function.
>> >>>>
>> >>>>                         Spencer
>> >>>>
>> >>>>                                 -----Original Message-----
>> >>>>                                 From: nfsv4-bounces@ietf.org
>> >>>>                                 [mailto:nfsv4-bounces@ietf.org] On
>> >>>>                 Behalf
>> >>>>                         Of
>> >>>>                                 david.noveck@emc.com
>> >>>>                                 Sent: Monday, October 25, 2010  
>> 6:58 PM
>> >>>>                                 To: nfsv4@ietf.org
>> >>>>                                 Subject: [nfsv4] Write-behind  
>> caching
>> >>>>
>> >>>>                                 The following statement appears at  
>> the
>> >>>>                                 bottom of page 292 of
>> >>>>                 RFC5661.
>> >>>>                                    However, write-behind caching  
>> may
>> >>>>                                 negatively
>> >>>>                                    affect the latency in returning  
>> a
>> >>>>                                 layout in response to a
>> >>>>                                    CB_LAYOUTRECALL; this is  
>> similar to
>> >>>>                                 file delegations and the
>> >>>>                 impact
>> >>>>                                    that file data caching has on
>> >>>>                                 DELEGRETURN.  Client
>> >>>>                 implementations
>> >>>>                                    SHOULD limit the amount of
>> >>>>                                 unwritten data they have  
>> outstanding
>> >>>>                 at
>> >>>>                                    any one time in order to prevent
>> >>>>                                 excessively long responses to
>> >>>>                                    CB_LAYOUTRECALL.
>> >>>>
>> >>>>                                 This does not seem to make sense to
>> >>>>                                 me.
>> >>>>
>> >>>>                                 First of all the analogy between
>> >>>>                                 DELEGRETURN and
>> >>>>                                 CB_LAYOUTRECALL/LAYOUTRETURN  
>> doesn't
>> >>>>                                 seem to me to be correct.  In
>> >>>>                 the
>> >>>>                                 case of DELEGRETURN, at least if  
>> the
>> >>>>                                 file in question has been
>> >>>>                 closed,
>> >>>>                                 during the pendency of the  
>> delegation,
>> >>>>                                 you do need to write all of
>> >>>>                 the
>> >>>>                                 dirty data associated with those
>> >>>>                                 previously open files.  Normally,
>> >>>>                         clients
>> >>>>                                 just write all dirty data.
>> >>>>
>> >>>>                                 LAYOUTRETURN does not have that  
>> sort
>> >>>>                                 of requirement.  If it is valid
>> >>>>                         to
>> >>>>                                 hold the dirty data when you do  
>> have
>> >>>>                                 the layout, it is just as valid
>> >>>>                         to
>> >>>>                                 hold it when you don't.  You could
>> >>>>                                 very well return the layout and
>> >>>>                 get
>> >>>>                         it
>> >>>>                                 again before some of those dirty
>> >>>>                                 blocks are written.  Having a
>> >>>>                 layout
>> >>>>                                 grants you the right to do IO  
>> using a
>> >>>>                                 particular means (different
>> >>>>                         based on
>> >>>>                                 the mapping type), but if you don't
>> >>>>                                 have the layout, you still have
>> >>>>                 a
>> >>>>                         way
>> >>>>                                 to do the writeback, and there is  
>> no
>> >>>>                                 particular need to write back
>> >>>>                 all
>> >>>>                         the
>> >>>>                                 data before returning the layout.   
>> As
>> >>>>                                 mentioned above, you may well
>> >>>>                         get
>> >>>>                                 the layout again before there is  
>> any
>> >>>>                                 need to actually do the
>> >>>>                         write-back.
>> >>>>                                 You have to wait until IO's that  
>> are
>> >>>>                                 in flight are completed before
>> >>>>                         you
>> >>>>                                 return the layout.  However, I  
>> don't
>> >>>>                                 see why you would have to or
>> >>>>                 want
>> >>>>                         to
>> >>>>                                 start new IO's using the layout if  
>> you
>> >>>>                                 have received a
>> >>>>                         CB_LAYOUTRECALL.
>> >>>>                                 Am I missing something?  Is there  
>> some
>> >>>>                                 valid reason for this
>> >>>>                         statement?
>> >>>>                                 Or should this be dealt with via  
>> the
>> >>>>                                 errata mechanism?
>> >>>>
>> >>>>                                 What do existing clients actually  
>> do
>> >>>>                                 with pending writeback data
>> >>>>                 when
>> >>>>                         they
>> >>>>                                 get a CB_LAYOUTRECALL?  Do they  
>> start
>> >>>>                                 new IO's using the layout?
>> >>>>                                 If so, is there any other reason  
>> other
>> >>>>                                 than the paragraph above?
>> >>>>
>> >>> _______________________________________________
>> >>>>                                 nfsv4 mailing list
>> >>>>                                 nfsv4@ietf.org
>> >>>>
>> >>> https://www.ietf.org/mailman/listinfo/nfsv4
>> >>>>



-- 
Best Regards

Sorin Faibish
Corporate Distinguished Engineer
Unified Storage Division
         EMC²
where information lives

Phone: 508-249-5745
Cellphone: 617-510-0422
Email : sfaibish@emc.com