Re: [nfsv4] Write-behind caching
Benny Halevy <bhalevy@panasas.com> Wed, 27 October 2010 16:48 UTC
Return-Path: <bhalevy@panasas.com>
X-Original-To: nfsv4@core3.amsl.com
Delivered-To: nfsv4@core3.amsl.com
Received: from localhost (localhost [127.0.0.1]) by core3.amsl.com (Postfix) with ESMTP id 248D93A685C for <nfsv4@core3.amsl.com>; Wed, 27 Oct 2010 09:48:09 -0700 (PDT)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -6.572
X-Spam-Level:
X-Spam-Status: No, score=-6.572 tagged_above=-999 required=5 tests=[AWL=0.027, BAYES_00=-2.599, RCVD_IN_DNSWL_MED=-4]
Received: from mail.ietf.org ([64.170.98.32]) by localhost (core3.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id xryG8DL+MHeM for <nfsv4@core3.amsl.com>; Wed, 27 Oct 2010 09:46:46 -0700 (PDT)
Received: from exprod5og111.obsmtp.com (exprod5og111.obsmtp.com [64.18.0.22]) by core3.amsl.com (Postfix) with SMTP id 8404C3A6A1F for <nfsv4@ietf.org>; Wed, 27 Oct 2010 09:46:35 -0700 (PDT)
Received: from source ([67.152.220.89]) by exprod5ob111.postini.com ([64.18.4.12]) with SMTP ID DSNKTMhX2ELxjgWW08SF2Ioibv6qeOxnTqu7@postini.com; Wed, 27 Oct 2010 09:48:33 PDT
Received: from fs1.bhalevy.com ([172.17.33.166]) by daytona.int.panasas.com with Microsoft SMTPSVC(6.0.3790.3959); Wed, 27 Oct 2010 12:48:23 -0400
Message-ID: <4CC857D5.5010104@panasas.com>
Date: Wed, 27 Oct 2010 18:48:21 +0200
From: Benny Halevy <bhalevy@panasas.com>
User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.9) Gecko/20100921 Fedora/3.1.4-1.fc13 Thunderbird/3.1.4
MIME-Version: 1.0
To: sfaibish <sfaibish@emc.com>
References: <BF3BB6D12298F54B89C8DCC1E4073D80028C76DB@CORPUSMX50A.corp.emc.com> <E043D9D8EE3B5743B8B174A814FD584F0D498D54@TK5EX14MBXC126.redmond.corp.microsoft.com> <BF3BB6D12298F54B89C8DCC1E4073D80028C76E0@CORPUSMX50A.corp.emc.com> <E043D9D8EE3B5743B8B174A814FD584F0D498E1D@TK5EX14MBXC126.redmond.corp.microsoft.com> <BF3BB6D12298F54B89C8DCC1E4073D80028C76EA@CORPUSMX50A.corp.emc.com> <4CC7B3AE.8000802@gmail.com> <AANLkTi=gD+qr-OhJuf19miV60w9t9TbJiopNS6y4-YVA@mail.gmail.com> <1288186821.8477.28.camel@heimdal.trondhjem.org> <BF3BB6D12298F54B89C8DCC1E4073D80028C7A3E@CORPUSMX50A.corp.emc.com> <op.vk8tpuc5unckof@usensfaibisl2e.eng.emc.com>
In-Reply-To: <op.vk8tpuc5unckof@usensfaibisl2e.eng.emc.com>
Content-Type: text/plain; charset="ISO-8859-15"
Content-Transfer-Encoding: 7bit
X-OriginalArrivalTime: 27 Oct 2010 16:48:23.0339 (UTC) FILETIME=[C4BE1FB0:01CB75F6]
Cc: nfsv4@ietf.org, trond.myklebust@fys.uio.no
Subject: Re: [nfsv4] Write-behind caching
X-BeenThere: nfsv4@ietf.org
X-Mailman-Version: 2.1.9
Precedence: list
List-Id: NFSv4 Working Group <nfsv4.ietf.org>
List-Unsubscribe: <https://www.ietf.org/mailman/listinfo/nfsv4>, <mailto:nfsv4-request@ietf.org?subject=unsubscribe>
List-Archive: <http://www.ietf.org/mail-archive/web/nfsv4>
List-Post: <mailto:nfsv4@ietf.org>
List-Help: <mailto:nfsv4-request@ietf.org?subject=help>
List-Subscribe: <https://www.ietf.org/mailman/listinfo/nfsv4>, <mailto:nfsv4-request@ietf.org?subject=subscribe>
X-List-Received-Date: Wed, 27 Oct 2010 16:48:10 -0000

On 2010-10-27 18:35, sfaibish wrote:
> On Wed, 27 Oct 2010 10:52:00 -0400, <david.noveck@emc.com> wrote:
>
>> Trond's item 3 seems good to me but I'm not understanding some
>> assumptions that are being made.
>>
>> So suppose you have these partially updated blocks where you are
>> writing. It seems to me that it is OK to write in various ways under
>> the following conditions.
>>
>> 1) You have the block layout, not recalled:
>>
>>    a) It is OK to do an NFS WRITE of a partial block.
>>
>>    b) It is dangerous to do a block write because of the
>>       possibility of corruption.
>>
>>    c) It would be foolish/erroneous to do an NFS WRITE of
>>       the full block including the modified part and the part
>>       read earlier.
>>
>> 2) When the block layout is being recalled:
>>
>>    a) It is still OK to do an NFS WRITE of the partial
>>       block.
>>
>>    b) It is dangerous to do a block write because of the
>>       possibility of corruption.
>>
>>    c) It would be foolish/erroneous to do an NFS WRITE of
>>       the full block including the modified part and the part
>>       read earlier.
>>
>> 3) When the block layout is not present:
>>
>>    a) It is definitely OK to do an NFS WRITE of the partial
>>       block.
>>
>>    b) Block write is not OK.
>>
>>    c) It would be foolish/erroneous to do an NFS WRITE of
>>       the full block including the modified part and the part
>>       read earlier.
>>
>> So to me the message is that when you have partial writes in your buffer
>> cache you should keep track of them (small bit mask in these cases) and
>> do the partial write. When you give up your layout, you retain the
>> right to do the partial NFS WRITE.
>>
>> So it sounds to me that there are two cases.
>>
>> If having the layout gives you exclusive access (as if you had a
>> delegation), then (2b) causes a possible delay problem, and the
>> appropriate warning, to avoid gathering up too many of those and
>> delaying recalls, applies only to partial writes.
>>
>> On the other hand, if it doesn't, then (1b) and (2b) are corruption
>> sources anyway, and rather than doing (2b), the important thing is that
>> the write, when actually done, be a (3a) rather than a (3b).
>>
>> I think the fundamental point is the one others have been making, which
>> is that there is a difference between
>>
>>    "If there are dirty blocks where specific
>>    circumstances make it advisable that they be
>>    written before returning a layout, care should
>>    be taken to avoid a significant accumulation of such
>>    blocks which might unduly delay a pending recall
>>    of a layout. If possible, the client should ensure
>>    that the writes can safely be done after the recall
>>    is completed, so as to avoid this sort of delay."
>>
>> and
>>
>>    "If there are dirty blocks, they have to be written
>>    before returning a layout. Care should be taken to
>>    avoid an undue accumulation of dirty blocks since
>>    writing these before returning the layout might unduly
>>    delay a pending recall of a layout."
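
A rough sketch of the bookkeeping Dave describes, here using one dirty byte
range per cached page instead of a bit mask to keep it short: only the bytes
the application actually modified are sent in the partial NFS WRITE, so the
flush stays safe even after the layout has been returned. All names, sizes,
and the nfs_write() callback below are invented for illustration; none of
this is taken from an existing client.

/* Illustrative only: remember exactly which bytes of a cached page are
 * dirty so that, after giving up a layout, the client can send a partial
 * NFS WRITE of just those bytes through the MDS instead of
 * read-modify-writing the surrounding file-system block. */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CACHE_PAGE_SIZE 4096

struct cached_page {
    uint64_t file_offset;        /* page-aligned offset in the file        */
    uint8_t  data[CACHE_PAGE_SIZE];
    size_t   dirty_start;        /* init to CACHE_PAGE_SIZE when clean     */
    size_t   dirty_end;          /* init to 0; one past last dirty byte    */
};

/* Copy user data into the page and grow the dirty range to cover it.
 * (A per-sector bit mask, as suggested above, would also allow several
 * non-contiguous dirty ranges; a single range keeps the sketch short.) */
static void page_write(struct cached_page *p, size_t off,
                       const void *src, size_t len)
{
    memcpy(p->data + off, src, len);
    if (p->dirty_start > off)
        p->dirty_start = off;
    if (p->dirty_end < off + len)
        p->dirty_end = off + len;
}

/* Flush only the dirty bytes; nfs_write() is a placeholder for an
 * ordinary WRITE issued through the metadata server. */
static int page_flush(struct cached_page *p,
                      int (*nfs_write)(uint64_t off, const void *buf, size_t n))
{
    int ret;

    if (p->dirty_start >= p->dirty_end)
        return 0;                               /* nothing dirty */
    ret = nfs_write(p->file_offset + p->dirty_start,
                    p->data + p->dirty_start,
                    p->dirty_end - p->dirty_start);
    p->dirty_start = CACHE_PAGE_SIZE;
    p->dirty_end   = 0;
    return ret;
}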
> I can see an additional problem with this statement; depending on the
> amount of dirty pages that are written before returning the layout, the
> server might get nervous (timeout pending) and either resend the
> layoutrecall or simply start fencing the I/Os to the DS, regardless of
> the amount of dirty pages accumulated at the client and that the client
> expects to write before returning the layout. So, we might want to
> address this problem in the recommendation.

Note that this risk exists regardless of flushing any data.
Even when waiting on outstanding I/Os, the amount of I/O in flight can be
large, the interconnect can be slow, timeouts can occur, etc.

Benny

>
> /Sorin
>
>> With the first being more complicated and detailed but also accurate.
>>
>> -----Original Message-----
>> From: nfsv4-bounces@ietf.org [mailto:nfsv4-bounces@ietf.org] On Behalf
>> Of Trond Myklebust
>> Sent: Wednesday, October 27, 2010 9:40 AM
>> To: Jason Glasgow
>> Cc: nfsv4@ietf.org
>> Subject: Re: [nfsv4] Write-behind caching
>>
>> On Wed, 2010-10-27 at 09:05 -0400, Jason Glasgow wrote:
>>> With regards to block layout there are some specific implementation
>>> issues that historically made it important to flush dirty data before
>>> returning a layout. Let me describe one that arose when implementing
>>> MPFS at EMC. I have not followed the UNIX implementations of the pNFS
>>> block layout recently enough to know if this is still a concern.
>>>
>>> Assume a page cache with page size 4K, a file system with block size
>>> 8K, and a file that is 8K long. In many UNIX-like operating systems'
>>> NFS3/MPFS implementations, a 2K write to the range 2K-4K would obtain
>>> a layout for reading and writing, then read the first 8K of the file
>>> into 2 pages in the kernel page cache. It would then copy the 2K of
>>> data from userspace and overwrite the latter half of the first page.
>>> When it was time to write out the page, MPFS would write out the
>>> complete 8K block. If a layout were recalled between the time the data
>>> was read from disk and when it was written, it is possible that
>>> another client would modify the range of the file from 0K to 2K.
>>> Unless specific care was taken when flushing the entire 8K block
>>> later, data corruption would occur.
>>>
>>> There are two ways to avoid the problem.
>>>
>>> 1. Keep track of the byte ranges of a page (or file system block)
>>> that are dirty and only perform the read-modify-write cycle while the
>>> client holds the layout. This can get messy if a client writes every
>>> other byte on a page.
>>> 2. Do not return a layout containing a dirty page until that page has
>>> been written.
>>
>> 3. In order to ensure you can safely do read-modify-write in the block
>> pNFS case, never request layout sizes that are not a multiple of the
>> block size.
>>
>>> Perhaps this sheds some light on the original motivation.
>>
>> Yes, but the problem you are describing does not justify a _requirement_
>> that the client flush out its data. Only that it consider the
>> implications of corner cases such as the above.
>>
>> Trond
>>
>>> Regarding the block layout, I am entirely sympathetic to arguments
>>> that a layout recall should only wait for outstanding writes to
>>> complete, and should not cause the client to initiate new writes.
>>>
>>> -Jason
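
Trond's item 3 can be made concrete as a simple rounding of the requested
range to file-system block boundaries before LAYOUTGET, so that any later
whole-block read-modify-write is covered by a layout the client actually
holds. The helper below is an illustrative sketch, not an existing API; the
block size would come from the server's layout hint, and the names are
made up.

/* Round a requested byte range out to file-system block boundaries
 * before asking for a layout, per Trond's item 3. */
#include <stdint.h>

struct byte_range {
    uint64_t offset;
    uint64_t length;
};

static struct byte_range align_layout_range(uint64_t off, uint64_t len,
                                            uint64_t blocksize)
{
    struct byte_range r;
    uint64_t end = off + len;

    r.offset = off - (off % blocksize);                        /* round down */
    end      = ((end + blocksize - 1) / blocksize) * blocksize; /* round up  */
    r.length = end - r.offset;
    return r;
}

/* With Jason's numbers, a 2K write at offset 2K and an 8K block size
 * becomes a layout request for the whole block [0, 8K). */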
>>>
>>> On Wed, Oct 27, 2010 at 1:07 AM, Dean Hildebrand
>>> <seattleplus@gmail.com> wrote:
>>>
>>> I remember at one time there was a thought that all dirty data
>>> would have to be written to disk when the client receives a
>>> layoutrecall. Once the data was written, it would send a
>>> layoutreturn. I think this was the thinking before all the
>>> timing issues and other such things cropped up. I assume
>>> someone wrote that as general advice, somehow thinking that
>>> responding to a layoutrecall was more important than actually
>>> achieving good write performance.
>>>
>>> In this light, the analogy with delegreturn makes sense if you
>>> take a very specific example, but obviously not in general.
>>>
>>> I would vote to just cut this text, as I think it is simply
>>> outdated.
>>>
>>> Dean
>>>
>>> On 10/26/2010 3:34 AM, david.noveck@emc.com wrote:
>>>
>>> That makes sense. Let me take on this issue with regard to the file
>>> layout. Are there volunteers to address it with regard to block and
>>> object? It would be great if we could get together in Beijing, discuss
>>> this, and come to a joint conclusion to present to the working group
>>> (via email I mean). I'm not planning to try to do this before the
>>> working group meeting. In any case, I'm pretty sure there won't be any
>>> time during the working group meeting.
>>>
>>> -----Original Message-----
>>> From: Spencer Shepler [mailto:sshepler@microsoft.com]
>>> Sent: Monday, October 25, 2010 11:34 PM
>>> To: Noveck, David; nfsv4@ietf.org
>>> Subject: RE: [nfsv4] Write-behind caching
>>>
>>> Fair enough. I haven't looked to see if the layout types
>>> address this specific, needed, behavior. Obviously the
>>> statement you reference and the individual layout descriptions
>>> should be tied together. Again, I don't remember, but there
>>> may be layout-specific steps needed in the case of handling
>>> layoutreturns.
>>>
>>> In any case, we can handle the eventual conclusion as an errata.
>>>
>>> Spencer
>>>
>>> -----Original Message-----
>>> From: david.noveck@emc.com [mailto:david.noveck@emc.com]
>>> Sent: Monday, October 25, 2010 8:25 PM
>>> To: Spencer Shepler; nfsv4@ietf.org
>>> Subject: RE: [nfsv4] Write-behind caching
>>>
>>> I agree that the intent was to cover a variety of layout types.
>>>
>>> I think what you are saying about the issue of different throughputs
>>> for having and not having layouts also makes sense. It may in some way
>>> have led to the statement in RFC5661, but those statements are by no
>>> means the same. They have different consequences. I take it that you
>>> are saying (correctly) something like:
>>>
>>>    However, write-behind implementations will generally need to bound
>>>    the amount of unwritten data so that, given the bandwidth of the
>>>    output path, the data can be written in a reasonable time. Clients
>>>    which have layouts should avoid keeping larger amounts to reflect a
>>>    situation in which a layout provides a write path of higher
>>>    bandwidth. This is because a CB_LAYOUTRECALL may be received. The
>>>    client should not delay returning the layout so as to use that
>>>    higher-bandwidth path, so it is best if it assumes, in limiting the
>>>    amount of data to be written, that the write bandwidth is only what
>>>    is available without the layout, and that it uses this bandwidth
>>>    assumption even if it does happen to have a layout.
>>>
>>> This differs from the text in RFC5661 in a few respects.
>>>
>>>    First, it says that the amount of dirty data should be the same
>>>    when you have the layout and when you don't, rather than simply
>>>    saying it should be small when you have the layout, possibly
>>>    implying that it should be smaller than when you don't have a
>>>    layout.
>>>
>>>    Second, the text now in RFC5661 strongly implies that when you get
>>>    a CB_LAYOUTRECALL, you would normally start new IO's, rather than
>>>    simply drain the pending IO's and return the layout ASAP.
>>>
>>> So I don't agree that what is in RFC5661 is good implementation advice,
>>> particularly in suggesting that clients should delay the LAYOUTRETURN
>>> while doing a bunch of IO, including starting new IO's.
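
The sizing rule in the proposed wording is straightforward arithmetic: bound
write-behind by the bandwidth the client can count on without a layout,
multiplied by an acceptable drain time. The numbers below are purely
illustrative:

/* Illustrative arithmetic only: size the dirty-data limit to the MDS
 * path, the bandwidth the client is sure to have even with no layout. */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t mds_bw_bytes_per_sec = 100ull * 1000 * 1000; /* ~100 MB/s via MDS   */
    uint64_t acceptable_flush_sec = 5;                    /* tolerable drain time */

    /* Even if the layout path could do, say, 1 GB/s, sizing the limit to
     * the MDS path means a CB_LAYOUTRECALL never tempts the client to
     * hold the layout just to drain its cache faster. */
    uint64_t max_dirty_bytes = mds_bw_bytes_per_sec * acceptable_flush_sec;

    printf("dirty-data limit: %llu bytes (~%llu MB)\n",
           (unsigned long long)max_dirty_bytes,
           (unsigned long long)(max_dirty_bytes / (1000 * 1000)));
    return 0;
}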
>>>
>>> -----Original Message-----
>>> From: nfsv4-bounces@ietf.org [mailto:nfsv4-bounces@ietf.org] On Behalf
>>> Of Spencer Shepler
>>> Sent: Monday, October 25, 2010 10:07 PM
>>> To: Noveck, David; nfsv4@ietf.org
>>> Subject: Re: [nfsv4] Write-behind caching
>>>
>>> Since this description is part of the general pNFS description, the
>>> intent may have been to cover a variety of layout types. However, I
>>> agree that the client is not guaranteed access to the layout and is
>>> fully capable of writing the data via the MDS if all else fails
>>> (inability to obtain the layout after a return); it may not be the
>>> most performant path but it should be functional. And maybe that is
>>> the source of the statement that the client should take care in
>>> managing its dirty pages given the lack of guarantee of access to the
>>> supposed, higher-throughput path for writing data.
>>>
>>> As implementation guidance it seems okay, but not truly a requirement
>>> for correct function.
>>>
>>> Spencer
>>>
>>> -----Original Message-----
>>> From: nfsv4-bounces@ietf.org [mailto:nfsv4-bounces@ietf.org] On Behalf
>>> Of david.noveck@emc.com
>>> Sent: Monday, October 25, 2010 6:58 PM
>>> To: nfsv4@ietf.org
>>> Subject: [nfsv4] Write-behind caching
>>>
>>> The following statement appears at the bottom of page 292 of RFC5661.
>>>
>>>    However, write-behind caching may negatively affect the latency in
>>>    returning a layout in response to a CB_LAYOUTRECALL; this is
>>>    similar to file delegations and the impact that file data caching
>>>    has on DELEGRETURN. Client implementations SHOULD limit the amount
>>>    of unwritten data they have outstanding at any one time in order to
>>>    prevent excessively long responses to CB_LAYOUTRECALL.
>>>
>>> This does not seem to make sense to me.
>>>
>>> First of all, the analogy between DELEGRETURN and
>>> CB_LAYOUTRECALL/LAYOUTRETURN doesn't seem to me to be correct. In the
>>> case of DELEGRETURN, at least if the file in question has been closed
>>> during the pendency of the delegation, you do need to write all of the
>>> dirty data associated with those previously open files. Normally,
>>> clients just write all dirty data.
>>>
>>> LAYOUTRETURN does not have that sort of requirement. If it is valid to
>>> hold the dirty data when you do have the layout, it is just as valid
>>> to hold it when you don't. You could very well return the layout and
>>> get it again before some of those dirty blocks are written. Having a
>>> layout grants you the right to do IO using a particular means
>>> (different based on the mapping type), but if you don't have the
>>> layout, you still have a way to do the writeback, and there is no
>>> particular need to write back all the data before returning the
>>> layout. As mentioned above, you may well get the layout again before
>>> there is any need to actually do the write-back. You have to wait
>>> until IO's that are in flight are completed before you return the
>>> layout. However, I don't see why you would have to or want to start
>>> new IO's using the layout if you have received a CB_LAYOUTRECALL.
>>>
>>> Am I missing something? Is there some valid reason for this statement?
>>> Or should this be dealt with via the errata mechanism?
>>>
>>> What do existing clients actually do with pending writeback data when
>>> they get a CB_LAYOUTRECALL? Do they start new IO's using the layout?
>>> If so, is there any other reason other than the paragraph above?
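
Put together, the behavior argued for in this thread amounts to a recall
handler along these lines: drain I/O already in flight under the layout,
return the layout promptly, and leave remaining dirty data for ordinary
writeback through the MDS (subject, for the block layout, to the
partial-write tracking discussed earlier). Every function below is a
placeholder sketched for illustration, not drawn from a real client.

/* Illustrative pseudo-handler for CB_LAYOUTRECALL. */

struct layout;   /* opaque: the recalled layout segment */
struct inode;    /* opaque: the file the layout covers  */

extern void wait_for_inflight_layout_io(struct layout *lo);
extern void mark_dirty_pages_for_mds_writeback(struct inode *ino);
extern void send_layoutreturn(struct layout *lo);

static void handle_cb_layoutrecall(struct inode *ino, struct layout *lo)
{
    /* 1. Let WRITEs and READs already issued with this layout finish;
     *    starting new ones here only delays the return.               */
    wait_for_inflight_layout_io(lo);

    /* 2. Dirty cached data does not have to be flushed now: it can be
     *    written through the MDS after the return, or under a layout
     *    obtained later.                                              */
    mark_dirty_pages_for_mds_writeback(ino);

    /* 3. Return the layout without further delay, so the server does
     *    not time out and fence I/O to the data servers.              */
    send_layoutreturn(lo);
}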
- [nfsv4] Write-behind caching david.noveck
- Re: [nfsv4] Write-behind caching Spencer Shepler
- Re: [nfsv4] Write-behind caching david.noveck
- Re: [nfsv4] Write-behind caching Spencer Shepler
- Re: [nfsv4] Write-behind caching david.noveck
- Re: [nfsv4] Write-behind caching Dean Hildebrand
- Re: [nfsv4] Write-behind caching Jason Glasgow
- Re: [nfsv4] Write-behind caching Trond Myklebust
- Re: [nfsv4] Write-behind caching Jason Glasgow
- Re: [nfsv4] Write-behind caching david.noveck
- Re: [nfsv4] Write-behind caching Benny Halevy
- Re: [nfsv4] Write-behind caching sfaibish
- Re: [nfsv4] Write-behind caching david.noveck
- Re: [nfsv4] Write-behind caching Benny Halevy
- Re: [nfsv4] Write-behind caching sfaibish
- Re: [nfsv4] Write-behind caching david.noveck
- Re: [nfsv4] Write-behind caching Trond Myklebust
- Re: [nfsv4] Write-behind caching sfaibish
- Re: [nfsv4] Write-behind caching Trond Myklebust
- Re: [nfsv4] Write-behind caching Trond Myklebust
- Re: [nfsv4] Write-behind caching david.noveck
- Re: [nfsv4] Write-behind caching Trond Myklebust
- Re: [nfsv4] Write-behind caching Thomas Haynes
- Re: [nfsv4] Write-behind caching sfaibish
- Re: [nfsv4] Write-behind caching Benny Halevy
- Re: [nfsv4] Write-behind caching Benny Halevy
- Re: [nfsv4] Write-behind caching david.noveck
- Re: [nfsv4] Write-behind caching Benny Halevy
- Re: [nfsv4] Write-behind caching david.black
- Re: [nfsv4] Write-behind caching david.noveck
- Re: [nfsv4] Write-behind caching Trond Myklebust
- Re: [nfsv4] Write-behind caching david.black
- Re: [nfsv4] Write-behind caching david.noveck
- Re: [nfsv4] Write-behind caching Benny Halevy
- Re: [nfsv4] Write-behind caching david.noveck
- Re: [nfsv4] Write-behind caching Trond Myklebust
- Re: [nfsv4] Write-behind caching david.noveck
- Re: [nfsv4] Write-behind caching Benny Halevy
- Re: [nfsv4] Write-behind caching david.noveck