Re: [nfsv4] Write-behind caching
Jason Glasgow <jglasgow@aya.yale.edu> Wed, 27 October 2010 14:26 UTC
Date: Wed, 27 Oct 2010 10:28:09 -0400
Message-ID: <AANLkTikVwAaBq961xT+YUpU881zAgHsnyqK2n_CFiHMx@mail.gmail.com>
From: Jason Glasgow <jglasgow@aya.yale.edu>
To: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: nfsv4@ietf.org
Subject: Re: [nfsv4] Write-behind caching
I agree with Trond.  This does not justify a requirement.

-Jason

On Wed, Oct 27, 2010 at 9:40 AM, Trond Myklebust
<trond.myklebust@fys.uio.no> wrote:
> On Wed, 2010-10-27 at 09:05 -0400, Jason Glasgow wrote:
> > With regard to the block layout, there are some specific
> > implementation issues that historically made it important to flush
> > dirty data before returning a layout.  Let me describe one that arose
> > when implementing MPFS at EMC.  I have not followed the UNIX
> > implementations of the pNFS block layout recently enough to know
> > whether this is still a concern.
> >
> > Assume a page cache with a 4K page size, a file system with an 8K
> > block size, and a file that is 8K long.  In many UNIX-like operating
> > systems' NFS3/MPFS implementations, a 2K write to the range 2K-4K
> > would obtain a layout for reading and writing, and read the first 8K
> > of the file into 2 pages in the kernel page cache.  It would then
> > copy the 2K of data from userspace, overwriting the latter half of
> > the first page.  When it was time to write out the page, MPFS would
> > write out the complete 8K block.  If a layout were recalled between
> > the time the data was read from disk and the time it was written, it
> > is possible that another client would modify the range of the file
> > from 0K to 2K.  Unless specific care was taken when flushing the
> > entire 8K block later, data corruption would occur.
> >
> > There are two ways to avoid the problem:
> >
> > 1. Keep track of the byte ranges of a page (or file system block)
> > that are dirty, and only perform the read-modify-write cycle while
> > the client holds the layout.  This can get messy if a client writes
> > every other byte on a page.
> >
> > 2. Do not return a layout containing a dirty page until that page
> > has been written.
>
> 3. In order to ensure you can safely do read-modify-write in the block
> pNFS case, never request layout sizes that are not a multiple of the
> block size.
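Option 1 above (tracking dirty sub-ranges of a block) can be sketched roughly as follows. This is an illustrative toy, not code from MPFS or any real NFS client; the class and method names are invented. The point is that a flush performed after a layout recall writes back only the bytes this client actually dirtied, so it cannot clobber a range (e.g. 0K-2K) that another client modified after the recall.

```python
BLOCK_SIZE = 8192  # the 8K file-system block from the example above

class DirtyRangeTracker:
    """Tracks dirty [start, end) byte ranges within one block (hypothetical)."""

    def __init__(self):
        self.ranges = []  # sorted, non-overlapping (start, end) pairs

    def mark_dirty(self, start, end):
        """Record a write, merging overlapping or touching dirty ranges."""
        merged = []
        for s, e in self.ranges:
            if e < start or s > end:          # disjoint: keep as-is
                merged.append((s, e))
            else:                             # overlap or touch: absorb
                start, end = min(s, start), max(e, end)
        merged.append((start, end))
        self.ranges = sorted(merged)

    def flush_plan(self):
        """Byte ranges that must be written back.  Everything else in the
        block is clean and must NOT be overwritten, since another client
        may have modified it after our layout was recalled."""
        return list(self.ranges)

tracker = DirtyRangeTracker()
tracker.mark_dirty(2048, 4096)   # the 2K write at offset 2K from the example
tracker.mark_dirty(4096, 5000)   # an adjacent write merges with it
print(tracker.flush_plan())      # [(2048, 5000)]
```

The "messy" case Jason mentions is visible here: a client writing every other byte would make `self.ranges` grow to thousands of entries per block, which is why option 2 (flush before returning the layout) was attractive in practice.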
> > Perhaps this sheds some light on the original motivation.
>
> Yes, but the problem you are describing does not justify a
> _requirement_ that the client flush out its data, only that it
> consider the implications of corner cases such as the above.
>
> Trond
>
> > Regarding the block layout, I am entirely sympathetic to arguments
> > that a layout recall should only wait for outstanding writes to
> > complete, and should not cause the client to initiate new writes.
> >
> > -Jason
> >
> > On Wed, Oct 27, 2010 at 1:07 AM, Dean Hildebrand
> > <seattleplus@gmail.com> wrote:
> >     I remember at one time there was a thought that all dirty data
> >     would have to be written to disk when the client received a
> >     layoutrecall.  Once the data was written, it would send a
> >     layoutreturn.  I think this was the thinking before all the
> >     timing issues and other such things cropped up.  I assume
> >     someone wrote that as general advice, somehow thinking that
> >     responding to a layoutrecall was more important than actually
> >     achieving good write performance.
> >
> >     In this light, the analogy with delegreturn makes sense if you
> >     take a very specific example, but obviously not in general.
> >
> >     I would vote to just cut this text, as I think it is simply
> >     outdated.
> >     Dean
> >
> >     On 10/26/2010 3:34 AM, david.noveck@emc.com wrote:
> >         That makes sense.  Let me take on this issue with regard to
> >         the file layout.  Are there volunteers to address it with
> >         regard to block and object?  It would be great if we could
> >         get together in Beijing, discuss this, and come to a joint
> >         conclusion to present to the working group (via email, I
> >         mean).  I'm not planning to try to do this before the
> >         working group meeting.  In any case, I'm pretty sure there
> >         won't be any time during the working group meeting.
> >         -----Original Message-----
> >         From: Spencer Shepler [mailto:sshepler@microsoft.com]
> >         Sent: Monday, October 25, 2010 11:34 PM
> >         To: Noveck, David; nfsv4@ietf.org
> >         Subject: RE: [nfsv4] Write-behind caching
> >
> >         Fair enough.  I haven't looked to see if the layout types
> >         address this specific, needed behavior.  Obviously the
> >         statement you reference and the individual layout
> >         descriptions should be tied together.  Again, I don't
> >         remember, but there may be layout-specific steps needed in
> >         the case of handling layoutreturns.
> >
> >         In any case, we can handle the eventual conclusion as an
> >         errata.
> >
> >         Spencer
> >
> >         -----Original Message-----
> >         From: david.noveck@emc.com [mailto:david.noveck@emc.com]
> >         Sent: Monday, October 25, 2010 8:25 PM
> >         To: Spencer Shepler; nfsv4@ietf.org
> >         Subject: RE: [nfsv4] Write-behind caching
> >
> >         I agree that the intent was to cover a variety of layout
> >         types.
> >
> >         I think what you are saying about the issue of different
> >         throughputs for having and not having layouts also makes
> >         sense.  It may in some way have led to the statement in
> >         RFC5661, but those statements are by no means the same.
> >         They have different consequences.  I take it that you are
> >         saying (correctly) something like:
> >
> >             However, write-behind implementations will generally
> >             need to bound the amount of unwritten data so that,
> >             given the bandwidth of the output path, the data can be
> >             written in a reasonable time.  Clients which have
> >             layouts should avoid keeping larger amounts to reflect
> >             a situation in which a layout provides a write path of
> >             higher bandwidth.  This is because a CB_LAYOUTRECALL
> >             may be received.
> >             The client should not delay returning the layout so as
> >             to use that higher-bandwidth path, so it is best if it
> >             assumes, in limiting the amount of data to be written,
> >             that the write bandwidth is only what is available
> >             without the layout, and that it uses this bandwidth
> >             assumption even if it does happen to have a layout.
> >
> >         This differs from the text in RFC5661 in a few respects.
> >
> >         First, it says that the amount of dirty data should be the
> >         same when you have the layout and when you don't, rather
> >         than simply saying it should be small when you have the
> >         layout, possibly implying that it should be smaller than
> >         when you don't have a layout.
> >
> >         Second, the text now in RFC5661 strongly implies that when
> >         you get a CB_LAYOUTRECALL, you would normally start new
> >         IO's, rather than simply drain the pending IO's and return
> >         the layout ASAP.
> >
> >         So I don't agree that what is in RFC5661 is good
> >         implementation advice, particularly in suggesting that
> >         clients should delay the LAYOUTRETURN while doing a bunch
> >         of IO, including starting new IO's.
> >
> >         -----Original Message-----
> >         From: nfsv4-bounces@ietf.org [mailto:nfsv4-bounces@ietf.org]
> >         On Behalf Of Spencer Shepler
> >         Sent: Monday, October 25, 2010 10:07 PM
> >         To: Noveck, David; nfsv4@ietf.org
> >         Subject: Re: [nfsv4] Write-behind caching
> >
> >         Since this description is part of the general pNFS
> >         description, the intent may have been to cover a variety of
> >         layout types.  However, I agree that the client is not
> >         guaranteed access to the layout and is fully capable of
> >         writing the data via the MDS if all else fails (inability
> >         to obtain the layout after a return); it may not be the
> >         most performant path, but it should be functional.
> >         And maybe that is the source of the statement that the
> >         client should take care in managing its dirty pages, given
> >         the lack of a guarantee of access to the supposed
> >         higher-throughput path for writing data.
> >
> >         As implementation guidance it seems okay, but it is not
> >         truly a requirement for correct function.
> >
> >         Spencer
> >
> >         -----Original Message-----
> >         From: nfsv4-bounces@ietf.org [mailto:nfsv4-bounces@ietf.org]
> >         On Behalf Of david.noveck@emc.com
> >         Sent: Monday, October 25, 2010 6:58 PM
> >         To: nfsv4@ietf.org
> >         Subject: [nfsv4] Write-behind caching
> >
> >         The following statement appears at the bottom of page 292
> >         of RFC5661:
> >
> >             However, write-behind caching may negatively affect the
> >             latency in returning a layout in response to a
> >             CB_LAYOUTRECALL; this is similar to file delegations
> >             and the impact that file data caching has on
> >             DELEGRETURN.  Client implementations SHOULD limit the
> >             amount of unwritten data they have outstanding at any
> >             one time in order to prevent excessively long responses
> >             to CB_LAYOUTRECALL.
> >
> >         This does not seem to make sense to me.
> >
> >         First of all, the analogy between DELEGRETURN and
> >         CB_LAYOUTRECALL/LAYOUTRETURN doesn't seem to me to be
> >         correct.  In the case of DELEGRETURN, at least if the file
> >         in question has been closed during the pendency of the
> >         delegation, you do need to write all of the dirty data
> >         associated with those previously open files.  Normally,
> >         clients just write all dirty data.
> >
> >         LAYOUTRETURN does not have that sort of requirement.  If it
> >         is valid to hold the dirty data when you do have the
> >         layout, it is just as valid to hold it when you don't.  You
> >         could very well return the layout and get it again before
> >         some of those dirty blocks are written.
> >         Having a layout grants you the right to do IO using a
> >         particular means (different based on the mapping type), but
> >         if you don't have the layout, you still have a way to do
> >         the writeback, and there is no particular need to write
> >         back all the data before returning the layout.  As
> >         mentioned above, you may well get the layout again before
> >         there is any need to actually do the write-back.
> >
> >         You have to wait until IO's that are in flight are
> >         completed before you return the layout.  However, I don't
> >         see why you would have to, or want to, start new IO's using
> >         the layout if you have received a CB_LAYOUTRECALL.  Am I
> >         missing something?  Is there some valid reason for this
> >         statement?  Or should this be dealt with via the errata
> >         mechanism?
> >
> >         What do existing clients actually do with pending writeback
> >         data when they get a CB_LAYOUTRECALL?  Do they start new
> >         IO's using the layout?  If so, is there any other reason,
> >         other than the paragraph above?

_______________________________________________
nfsv4 mailing list
nfsv4@ietf.org
https://www.ietf.org/mailman/listinfo/nfsv4
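The recall handling that Dave and Trond argue for in the thread above (drain in-flight I/O, return the layout promptly, and do not start new writes under the recalled layout) can be sketched as follows. This is a toy model, not code from any real pNFS client; the class, method, and field names are all invented for illustration.

```python
# Hypothetical model of a client's state during CB_LAYOUTRECALL handling.
# Key property from the thread: dirty cached data does NOT need to be
# flushed before LAYOUTRETURN -- it can be written later via the MDS or
# under a re-acquired layout.

class LayoutState:
    def __init__(self):
        self.have_layout = True
        self.in_flight = []   # writes already issued under the layout
        self.dirty = []       # cached dirty data not yet issued

    def handle_cb_layoutrecall(self):
        log = []
        # 1. Wait only for I/Os that are already in flight; starting new
        #    ones here would just delay the LAYOUTRETURN.
        for io in self.in_flight:
            log.append(f"wait for in-flight write {io}")
        self.in_flight.clear()
        # 2. Return the layout ASAP.  Note that self.dirty is untouched.
        self.have_layout = False
        log.append("LAYOUTRETURN")
        return log

    def writeback_later(self):
        # 3. Dirty data is flushed on the client's normal schedule, via
        #    the MDS (or a newly acquired layout) -- no layout required.
        path = "layout" if self.have_layout else "MDS"
        issued = [f"write {d} via {path}" for d in self.dirty]
        self.dirty.clear()
        return issued

client = LayoutState()
client.in_flight = ["io-1"]
client.dirty = ["block-7"]
print(client.handle_cb_layoutrecall())  # drains io-1, returns the layout
print(client.writeback_later())         # dirty block goes out via the MDS
```

Under this model the bound on `dirty` is set by the non-layout (MDS) write bandwidth, as in Dave's proposed replacement text, so recall latency depends only on the in-flight I/O, never on the size of the write-behind cache.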