Re: [nfsv4] Write-behind caching
Trond Myklebust <trond.myklebust@fys.uio.no> Wed, 27 October 2010 13:38 UTC
From: Trond Myklebust <trond.myklebust@fys.uio.no>
To: Jason Glasgow <jglasgow@aya.yale.edu>
In-Reply-To: <AANLkTi=gD+qr-OhJuf19miV60w9t9TbJiopNS6y4-YVA@mail.gmail.com>
Date: Wed, 27 Oct 2010 09:40:21 -0400
Message-ID: <1288186821.8477.28.camel@heimdal.trondhjem.org>
Cc: nfsv4@ietf.org
Subject: Re: [nfsv4] Write-behind caching
On Wed, 2010-10-27 at 09:05 -0400, Jason Glasgow wrote:
> With regard to the block layout, there are some specific implementation
> issues that historically made it important to flush dirty data before
> returning a layout. Let me describe one that arose when implementing
> MPFS at EMC. I have not followed the UNIX implementations of the pNFS
> block layout recently enough to know whether this is still a concern.
>
> Assume a page cache with page size 4K, a file system with block size
> 8K, and a file that is 8K long. In many UNIX-like NFS3/MPFS
> implementations, a 2K write to the range 2K-4K would obtain a layout
> for reading and writing, and read the first 8K of the file into two
> pages in the kernel page cache. It would then copy the 2K of data from
> userspace, overwriting the latter half of the first page. When it was
> time to write out the page, MPFS would write out the complete 8K block.
> If a layout were recalled between the time the data was read from disk
> and the time it was written, it is possible that another client would
> modify the range of the file from 0K to 2K. Unless specific care was
> taken when flushing the entire 8K block later, data corruption would
> occur.
>
> There are two ways to avoid the problem:
>
> 1. Keep track of the byte ranges of a page (or file system block) that
>    are dirty, and only perform the read-modify-write cycle while the
>    client holds the layout. This can get messy if a client writes
>    every other byte on a page.
> 2. Do not return a layout containing a dirty page until that page has
>    been written.

3. In order to ensure that you can safely do read-modify-write in the
block pNFS case, never request layout sizes that are not a multiple of
the block size.

> Perhaps this sheds some light on the original motivation.

Yes, but the problem you are describing does not justify a
_requirement_ that the client flush out its data. Only that it consider
the implications of corner cases such as the above.

Trond
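For concreteness, here is a minimal sketch of option 1 in Jason's list
above: tracking which byte ranges of a page are actually dirty, so that
at flush time only the bytes the application wrote are merged over a
freshly read copy of the file system block. The structures and names
(dirty_extent, page_desc, and so on) are hypothetical illustrations,
not taken from MPFS or from any real client:

    /* Hypothetical sketch of per-page dirty byte-range tracking. Only
     * the bytes the application actually wrote are merged into the
     * freshly read 8K block at flush time, so a concurrent change by
     * another client to the untouched 0K-2K range is preserved. */

    #include <stddef.h>
    #include <string.h>

    #define PAGE_SIZE 4096

    struct dirty_extent {
        size_t off;     /* offset of dirty region within the page */
        size_t len;     /* number of dirty bytes                  */
    };

    struct page_desc {
        unsigned char data[PAGE_SIZE];
        struct dirty_extent dirty;  /* one extent for simplicity; merging
                                       non-adjacent writes into one extent
                                       over-approximates the dirty range,
                                       so a real client would need a list
                                       (the "every other byte" mess) */
    };

    /* Record a write into the page, growing the extent to cover it. */
    static void page_mark_dirty(struct page_desc *p, size_t off, size_t len)
    {
        if (p->dirty.len == 0) {
            p->dirty.off = off;
            p->dirty.len = len;
            return;
        }
        size_t end = p->dirty.off + p->dirty.len;
        size_t new_end = (off + len > end) ? off + len : end;
        if (off < p->dirty.off)
            p->dirty.off = off;
        p->dirty.len = new_end - p->dirty.off;
    }

    /* At flush time: the caller re-reads the block's current contents
     * (via the MDS if the layout was recalled), then overlays only the
     * dirty bytes before writing the full block back out. */
    static void page_flush(struct page_desc *p, unsigned char *block,
                           size_t page_index_in_block)
    {
        unsigned char *dst = block + page_index_in_block * PAGE_SIZE;
        memcpy(dst + p->dirty.off, p->data + p->dirty.off, p->dirty.len);
        p->dirty.len = 0;   /* page is clean again */
    }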
> Regarding the block layout, I am entirely sympathetic to arguments
> that a layout recall should only wait for outstanding writes to
> complete, and should not cause the client to initiate new writes.
>
> -Jason
>
> On Wed, Oct 27, 2010 at 1:07 AM, Dean Hildebrand
> <seattleplus@gmail.com> wrote:
> > I remember that at one time there was a thought that all dirty data
> > would have to be written to disk when the client received a
> > layoutrecall. Once the data was written, it would send a
> > layoutreturn. I think this was the thinking before all the timing
> > issues and other such things cropped up. I assume someone wrote that
> > as general advice, somehow thinking that responding to a
> > layoutrecall was more important than actually achieving good write
> > performance.
> >
> > In this light, the analogy with delegreturn makes sense if you take
> > a very specific example, but obviously not in general.
> >
> > I would vote to just cut this text, as I think it is simply outdated.
> > Dean
> >
> > On 10/26/2010 3:34 AM, david.noveck@emc.com wrote:
> > > That makes sense. Let me take on this issue with regard to the
> > > file layout. Are there volunteers to address it with regard to
> > > block and object? It would be great if we could get together in
> > > Beijing, discuss this, and come to a joint conclusion to present
> > > to the working group (via email, I mean). I'm not planning to try
> > > to do this before the working group meeting. In any case, I'm
> > > pretty sure there won't be any time during the working group
> > > meeting.
> > >
> > > -----Original Message-----
> > > From: Spencer Shepler [mailto:sshepler@microsoft.com]
> > > Sent: Monday, October 25, 2010 11:34 PM
> > > To: Noveck, David; nfsv4@ietf.org
> > > Subject: RE: [nfsv4] Write-behind caching
> > >
> > > Fair enough. I haven't looked to see if the layout types address
> > > this specific, needed behavior. Obviously the statement you
> > > reference and the individual layout descriptions should be tied
> > > together. Again, I don't remember, but there may be
> > > layout-specific steps needed in the case of handling
> > > layoutreturns.
> > >
> > > In any case, we can handle the eventual conclusion as an erratum.
> > >
> > > Spencer
> > >
> > > -----Original Message-----
> > > From: david.noveck@emc.com [mailto:david.noveck@emc.com]
> > > Sent: Monday, October 25, 2010 8:25 PM
> > > To: Spencer Shepler; nfsv4@ietf.org
> > > Subject: RE: [nfsv4] Write-behind caching
> > >
> > > I agree that the intent was to cover a variety of layout types.
> > >
> > > I think what you are saying about the issue of different
> > > throughputs for having and not having layouts also makes sense.
> > > It may in some way have led to the statement in RFC5661, but those
> > > statements are by no means the same. They have different
> > > consequences. I take it that you are saying (correctly) something
> > > like:
> > >
> > >     However, write-behind implementations will generally need to
> > >     bound the amount of unwritten data so that, given the
> > >     bandwidth of the output path, the data can be written in a
> > >     reasonable time. Clients which have layouts should avoid
> > >     keeping larger amounts to reflect a situation in which a
> > >     layout provides a write path of higher bandwidth. This is
> > >     because a CB_LAYOUTRECALL may be received. The client should
> > >     not delay returning the layout so as to use that
> > >     higher-bandwidth path, so it is best if it assumes, in
> > >     limiting the amount of data to be written, that the write
> > >     bandwidth is only what is available without the layout, and
> > >     that it uses this bandwidth assumption even if it does happen
> > >     to have a layout.
> > >
> > > This differs from the text in RFC5661 in a few respects.
> > >
> > > First, it says that the amount of dirty data should be the same
> > > when you have the layout and when you don't, rather than simply
> > > saying it should be small when you have the layout, possibly
> > > implying that it should be smaller than when you don't have a
> > > layout.
> > >
> > > Second, the text now in RFC5661 strongly implies that when you get
> > > a CB_LAYOUTRECALL, you would normally start new I/Os, rather than
> > > simply drain the pending I/Os and return the layout ASAP.
> > >
> > > So I don't agree that what is in RFC5661 is good implementation
> > > advice, particularly in suggesting that clients should delay the
> > > LAYOUTRETURN while doing a bunch of I/O, including starting new
> > > I/Os.
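The replacement text proposed above reduces to a simple policy: cap the
amount of unwritten data at (write bandwidth available without a
layout) x (acceptable flush time), whether or not a layout is currently
held. A hypothetical sketch of that bound in C; the names and numbers
(MDS_WRITE_BW, the two-second target) are illustrative assumptions, not
taken from RFC 5661 or from any client:

    /* Hypothetical sketch of the dirty-data bound described above:
     * size the write-behind cache for the bandwidth available WITHOUT
     * a layout (i.e. writing through the MDS), so a CB_LAYOUTRECALL
     * never leaves more dirty data than the slow path can drain in a
     * reasonable time. Numbers are illustrative only. */

    #include <stdint.h>

    #define MDS_WRITE_BW   (100u << 20)  /* assumed MDS path: 100 MiB/s  */
    #define MAX_FLUSH_SECS 2u            /* acceptable drain-time target */

    /* Upper bound on unwritten data, independent of layout state. */
    static uint64_t dirty_limit(void)
    {
        return (uint64_t)MDS_WRITE_BW * MAX_FLUSH_SECS;   /* 200 MiB */
    }

    /* Called before accepting more dirty data from the application.
     * Note the limit does NOT grow when a layout happens to be held,
     * per the proposed text above. */
    static int may_dirty(uint64_t dirty_bytes, uint64_t new_bytes,
                         int have_layout)
    {
        (void)have_layout;   /* deliberately ignored */
        return dirty_bytes + new_bytes <= dirty_limit();
    }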
> > > -----Original Message-----
> > > From: nfsv4-bounces@ietf.org [mailto:nfsv4-bounces@ietf.org] On
> > > Behalf Of Spencer Shepler
> > > Sent: Monday, October 25, 2010 10:07 PM
> > > To: Noveck, David; nfsv4@ietf.org
> > > Subject: Re: [nfsv4] Write-behind caching
> > >
> > > Since this description is part of the general pNFS description,
> > > the intent may have been to cover a variety of layout types.
> > > However, I agree that the client is not guaranteed access to the
> > > layout and is fully capable of writing the data via the MDS if all
> > > else fails (inability to obtain the layout after a return); it may
> > > not be the most performant path, but it should be functional. And
> > > maybe that is the source of the statement that the client should
> > > take care in managing its dirty pages, given the lack of any
> > > guarantee of access to the supposed higher-throughput path for
> > > writing data.
> > >
> > > As implementation guidance it seems okay, but it is not truly a
> > > requirement for correct function.
> > >
> > > Spencer
> > >
> > > -----Original Message-----
> > > From: nfsv4-bounces@ietf.org [mailto:nfsv4-bounces@ietf.org] On
> > > Behalf Of david.noveck@emc.com
> > > Sent: Monday, October 25, 2010 6:58 PM
> > > To: nfsv4@ietf.org
> > > Subject: [nfsv4] Write-behind caching
> > >
> > > The following statement appears at the bottom of page 292 of
> > > RFC5661:
> > >
> > >     However, write-behind caching may negatively affect the
> > >     latency in returning a layout in response to a
> > >     CB_LAYOUTRECALL; this is similar to file delegations and the
> > >     impact that file data caching has on DELEGRETURN. Client
> > >     implementations SHOULD limit the amount of unwritten data
> > >     they have outstanding at any one time in order to prevent
> > >     excessively long responses to CB_LAYOUTRECALL.
> > >
> > > This does not seem to make sense to me.
> > >
> > > First of all, the analogy between DELEGRETURN and
> > > CB_LAYOUTRECALL/LAYOUTRETURN does not seem to me to be correct.
> > > In the case of DELEGRETURN, at least if the file in question was
> > > closed during the pendency of the delegation, you do need to
> > > write all of the dirty data associated with those previously open
> > > files. Normally, clients just write all dirty data.
> > >
> > > LAYOUTRETURN does not have that sort of requirement. If it is
> > > valid to hold the dirty data when you do have the layout, it is
> > > just as valid to hold it when you don't. You could very well
> > > return the layout and get it again before some of those dirty
> > > blocks are written. Having a layout grants you the right to do
> > > I/O using a particular means (different based on the mapping
> > > type), but if you don't have the layout, you still have a way to
> > > do the writeback, and there is no particular need to write back
> > > all the data before returning the layout. As mentioned above,
> > > you may well get the layout again before there is any need to
> > > actually do the write-back. You have to wait until I/Os that are
> > > in flight have completed before you return the layout. However,
> > > I don't see why you would have to, or want to, start new I/Os
> > > using the layout if you have received a CB_LAYOUTRECALL.
> > >
> > > Am I missing something? Is there some valid reason for this
> > > statement? Or should this be dealt with via the errata mechanism?
> > >
> > > What do existing clients actually do with pending writeback data
> > > when they get a CB_LAYOUTRECALL? Do they start new I/Os using the
> > > layout? If so, is there any reason other than the paragraph above?
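The drain-and-return behavior argued for in this thread — stop issuing
new I/O under a recalled layout, wait only for I/O already in flight,
then send LAYOUTRETURN — can be stated compactly. A hypothetical sketch
of such a recall handler; the names (layout_state, in_flight,
send_layoutreturn, and so on) are illustrative and do not correspond to
any particular client's internals:

    /* Hypothetical sketch of a CB_LAYOUTRECALL handler following the
     * drain-and-return approach: refuse new I/O under the recalled
     * layout, wait for in-flight I/O, then send LAYOUTRETURN.
     * Remaining dirty data is written later, either through the MDS
     * or under a newly acquired layout. */

    #include <pthread.h>
    #include <stdbool.h>

    struct layout_state {
        pthread_mutex_t lock;
        pthread_cond_t  idle;        /* signalled when in_flight hits 0 */
        unsigned        in_flight;   /* I/Os currently using the layout */
        bool            recalled;    /* set once CB_LAYOUTRECALL arrives */
    };

    /* Stub for the actual LAYOUTRETURN RPC (hypothetical helper). */
    static void send_layoutreturn(struct layout_state *lo)
    {
        (void)lo;   /* a real client would issue LAYOUTRETURN here */
    }

    /* Gate for new I/O: refuse the layout path once a recall is
     * pending; the caller then falls back to writing via the MDS. */
    static bool layout_try_start_io(struct layout_state *lo)
    {
        bool ok;
        pthread_mutex_lock(&lo->lock);
        ok = !lo->recalled;
        if (ok)
            lo->in_flight++;
        pthread_mutex_unlock(&lo->lock);
        return ok;
    }

    static void layout_end_io(struct layout_state *lo)
    {
        pthread_mutex_lock(&lo->lock);
        if (--lo->in_flight == 0)
            pthread_cond_broadcast(&lo->idle);
        pthread_mutex_unlock(&lo->lock);
    }

    /* CB_LAYOUTRECALL: drain in-flight I/O, then return the layout.
     * No new writes are started on its account. */
    static void layout_handle_recall(struct layout_state *lo)
    {
        pthread_mutex_lock(&lo->lock);
        lo->recalled = true;
        while (lo->in_flight > 0)
            pthread_cond_wait(&lo->idle, &lo->lock);
        pthread_mutex_unlock(&lo->lock);
        send_layoutreturn(lo);
    }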