Re: [nfsv4] Write-behind caching
Benny Halevy <bhalevy@panasas.com> Wed, 03 November 2010 07:51 UTC
Message-ID: <4CD1146A.6070605@panasas.com>
Date: Wed, 03 Nov 2010 09:51:06 +0200
From: Benny Halevy <bhalevy@panasas.com>
User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.12) Gecko/20101027 Fedora/3.1.6-1.fc13 Thunderbird/3.1.6
MIME-Version: 1.0
To: sfaibish <sfaibish@emc.com>
References: <BF3BB6D12298F54B89C8DCC1E4073D80028C76DB@CORPUSMX50A.corp.emc.com> <E043D9D8EE3B5743B8B174A814FD584F0D498E1D@TK5EX14MBXC126.redmond.corp.microsoft.com> <BF3BB6D12298F54B89C8DCC1E4073D80028C76EA@CORPUSMX50A.corp.emc.com> <4CC7B3AE.8000802@gmail.com> <AANLkTi=gD+qr-OhJuf19miV60w9t9TbJiopNS6y4-YVA@mail.gmail.com> <1288186821.8477.28.camel@heimdal.trondhjem.org> <BF3BB6D12298F54B89C8DCC1E4073D80028C7A3E@CORPUSMX50A.corp.emc.com> <op.vk8tpuc5unckof@usensfaibisl2e.eng.emc.com> <4CC857D5.5010104@panasas.com> <op.vk8vpbldunckof@usensfaibisl2e.eng.emc.com> <BF3BB6D12298F54B89C8DCC1E4073D80028C80AB@CORPUSMX50A.corp.emc.com> <1288373995.3701.35.camel@heimdal.trondhjem.org> <op.vlcwr1zqunckof@usensfaibisl2e.eng.emc.com> <1288388933.3701.47.camel@heimdal.trondhjem.org> <1288389823.3701.59.camel@heimdal.trondhjem.org> <BF3BB6D12298F54B89C8DCC1E4073D80029446BC@CORPUSMX50A.corp.emc.com> <1288707482.2925.44.camel@heimdal.trondhjem.org> <op.vljy4pqaunckof@usensfaibisl2e.eng.emc.com>
In-Reply-To: <op.vljy4pqaunckof@usensfaibisl2e.eng.emc.com>
Content-Type: text/plain; charset="ISO-8859-15"
Content-Transfer-Encoding: 7bit
X-OriginalArrivalTime: 03 Nov 2010 07:51:08.0817 (UTC) FILETIME=[E05CA810:01CB7B2B]
Cc: nfsv4@ietf.org, Trond Myklebust <trond.myklebust@fys.uio.no>
Subject: Re: [nfsv4] Write-behind caching

On 2010-11-02 19:06, sfaibish wrote:
> On Tue, 02 Nov 2010 10:18:02 -0400, Trond Myklebust
> <trond.myklebust@fys.uio.no> wrote:
>
>> Hi Dave,
>>
>> So, while I largely agree with your points 1-6, I'd like to add
>>
>> 0) Layouts are not a tool for enforcing cache consistency!
>>
>> While I agree that doing safe read-modify-write in the block case is an
>> important feature, I don't see any agreement anywhere in RFC5661 that we
>> should be providing stronger caching semantics than we used to provide
>> prior to adding pNFS to the protocol. I have no intention of allowing a
>> Linux client implementation that provides such stronger semantics until
>> we write that sort of thing into the spec and provide for similar
>> stronger semantics in the non-pNFS case.
>>
>> With that in mind, I have the following comments:
>>
>>   * I see no reason to write data back when the server recalls the
>>     layout. While I see that you could argue (1) implies that you
>>     should try to write stuff while you still hold a layout, the
>>     spec says that clora_changed==FALSE implies you can get that
>>     layout back later. In the case where clora_changed==TRUE, you
>>     might expect the file would be unavailable for longer, but the
>>     spec says you shouldn't write stuff back in that case...
>>   * While this may lead to layout bouncing between clients and/or
>>     the server, the clients do have the option of detecting this,
>>     and choosing write through MDS to improve efficiency. Grabbing
>>     the layout, and blocking others from accessing the data while
>>     you write is not a scalable solution even if you do believe
>>     there is a valid scenario for this behaviour.
>>   * Basically, it comes down to the fact that I want to write back
>>     data when my memory management heuristics require it, so that I
>>     cache data as long as possible. I see no reason why server
>>     mechanics should dictate when I should stop caching (unless we
>>     are talking about a true cache consistency mechanism).
> OK.
> Now I think I understand your point and we might still require some
> changes in the interpretation and perhaps some language in the 5661. But
> I have a basic question about fencing. Do we think there is any
> possibility of data corruption when the DS fences the I/Os very fast
> after the layoutrecall.

I don't think we would let the spec out if we thought so :)

> If we can find such possibility we probably need to mention this in the
> protocol and recommend how to prevent such a case. For example MDS sends
> a layoutrecall and immediately (implementation decision of the server)
> it forces the fencing on the DS while waiting for the return or after
> receiving ack from the client for the layoutrecall. (I might be out of
> order here but I just want to be sure this is not the case).

So? The DS may also die at any point so the client must hold on to dirty
data until they are written to stable storage. If the data can't be
written to the DS for any reason the client can ask for a new layout and
retry or it can always write the data to the MDS. layoutrecall should not
cause the client to drop the data on the floor and lose it.

That said, the client implementation should tolerate the errors it sees
when fenced off from the DS and be able to retry. Otherwise, I can see an
issue with the application getting an error on write, fsync, or close
which may be semantically correct but certainly undesirable.

Benny

>
> /Sorin
>
>>
>> So my choices for Q1 and Q2 are still (A) and (A).
>>
>> Cheers
>>   Trond
>>
>> On Mon, 2010-11-01 at 13:49 -0400, david.noveck@emc.com wrote:
>>> I think that we want to address this issue without using the words
>>> "unrealistic" or "optimal". Things that you think are unrealistic
>>> sometimes, in some sorts of environments, turn out to be common. Trying
>>> to decide what approaches are optimal is also troubling. In different
>>> situations, different approaches may be better or worse.
>>> The protocol needs to define the rules that the client and server have
>>> to obey and they may make choices that result in results from optimal
>>> to pessimal. We can make suggestions on doing things better but in
>>> unusual situations the performance considerations may be different.
>>> The point is we have to be clear when the client can and can't do A
>>> and similarly for B and sometimes the choice of A or B is simply up to
>>> the client.
>>>
>>> So I'm going to put down the following numbered propositions and we'll
>>> see where people disagree with me. Please be specific. I'm going to
>>> assume that anything you don't argue with numerically below the point
>>> of disagreement is something we can agree on.
>>>
>>> 1) Normally, when a client is writing data covered by a layout, it may
>>> write using the layout or to the MDS, but unless there is a particular
>>> reason (e.g. slow or inconsistent response using the layout), it SHOULD
>>> write using the layout.
>>>
>>> 2) When a layout is recalled, the protocol itself does not nor should it
>>> require that dirty blocks in the cache be written before returning the
>>> layout. If a client chooses to do writes using the recalled layout, it
>>> is doing so as an attempt to improve performance, given its judgment of
>>> the relative performance of IO using the layout and IO through the MDS.
>>>
>>> 3) Particularly in the case in which clora_changed is 0, clients MAY
>>> choose to take advantage of the higher-performance layout path to write
>>> that data, while it is available. However, since doing that delays the
>>> return of the layout, it is possible that by delaying the return of the
>>> layout, performance of others waiting for the layout may be reduced.
>>>
>>> 4) When writing of dirty blocks is done using a layout being recalled,
>>> the possibility exists that the layout will be revoked before all the
>>> blocks are successfully written.
>>> The client MUST be prepared to rewrite those dirty blocks whose layout
>>> writes failed to the MDS in such cases.
>>>
>>> 5) Clients that want to write dirty blocks associated with recalled
>>> layouts MAY choose to restrict the size of the set of dirty blocks they
>>> keep in order to make it relatively unlikely that the layout will be
>>> revoked during recall. On the other hand, for applications in which
>>> having a large set of dirty blocks in the cache reduces the IO actually
>>> done, such restriction may result in poorer performance, even though the
>>> specific IO path used is more performant.
>>>
>>> 6) Note that if a large set of dirty blocks can be kept by the client
>>> when a layout is not held, it should be possible to keep a set of dirty
>>> blocks at least that size when a layout is held. Even if the client
>>> should choose to write those blocks as part of the layout recall, any
>>> that it is not able to write in an appropriate time will be a subset of
>>> an amount which, by hypothesis, can be appropriately held when the only
>>> means of writing them is to the MDS.
>>>
>>> Another way of looking at this is that we have the following questions
>>> which I'm going to present as multiple choice. I have missed a few
>>> choices but,
>>>
>>> Q1) When a recall of a layout occurs what do you do about dirty blocks?
>>>    A) Nothing. The IO's to write them are like any other IO and
>>>       you don't do IO using layout under recall.
>>>    B) You should write all dirty blocks as part of the recall if
>>>       clora_changed is 0.
>>>    C) You should do (A), but have the option of doing (B) but you
>>>       are responsible for the consequences.
>>>
>>> Q2) How many dirty blocks should you keep covered by a layout?
>>>    A) As many as you want. It doesn't matter.
>>>    B) A small number so that you can be sure that they can be
>>>       written as part of layout recall (Assuming Q1=B).
>>>    C) If there is a limit, it must be at least as great as the limit
>>>       that would be in effect if there is no layout present, since
>>>       that number is OK, once the layout does go back.
>>>
>>> There are pieces of the spec that are assuming (A) as the answer to
>>> these and pieces assuming (B).
>>>
>>> So I guess I was arguing before that the answers to Q1 and Q2 should be
>>> (A).
>>>
>>> My understanding is that Benny is arguing for (B) as the answer to Q1
>>> and Q2.
>>>
>>> So I'm now willing to compromise slightly and answer (C) to both of
>>> those, but I think that still leaves me and Benny quite a ways apart.
>>>
>>> I'm not sure what Trond's answer is, but I'd be interested in
>>> understanding his view in terms of (1)-(6) and Q1 and Q2.
>>>
>>>
>>> -----Original Message-----
>>> From: Trond Myklebust [mailto:trond.myklebust@fys.uio.no]
>>> Sent: Friday, October 29, 2010 6:04 PM
>>> To: faibish, sorin
>>> Cc: Noveck, David; bhalevy@panasas.com; jglasgow@aya.yale.edu;
>>> nfsv4@ietf.org
>>> Subject: Re: [nfsv4] Write-behind caching
>>>
>>> On Fri, 2010-10-29 at 17:48 -0400, Trond Myklebust wrote:
>>>> On Fri, 2010-10-29 at 17:32 -0400, sfaibish wrote:
>>>>> On Fri, 29 Oct 2010 13:39:55 -0400, Trond Myklebust
>>>>> <trond.myklebust@fys.uio.no> wrote:
>>>>>
>>>>>> On Fri, 2010-10-29 at 13:20 -0400, david.noveck@emc.com wrote:
>>>>>>> There are two issues here with regard to handling of layout recall.
>>>>>>>
>>>>>>> One is with regard to in-flight IO. As Benny points out, you cannot
>>>>>>> be sure that the in-flight IO can be completed in time to avoid the
>>>>>>> MDS losing patience. That should rarely be the case though, if
>>>>>>> things are working right. The client has to be prepared to deal
>>>>>>> with IO failures due to layout revocation. Any IO that was in
>>>>>>> flight and failed because of layout revocation will need to be
>>>>>>> handled by being reissued to the MDS. Is there anybody that
>>>>>>> disagrees with that?
>>>>>>>
>>>>>>> The second issue concerns IO not in-flight (in other words, not
>>>>>>> IO's yet but potential IO's) when the recall is received. I just
>>>>>>> don't see that it is reasonable to start IO's using layout segments
>>>>>>> being recalled (whether for dirty buffers or anything else). Doing
>>>>>>> IO's to the MDS is fine but there is no real need for the layout
>>>>>>> recall to specially trigger them, whether clora_changed is set or
>>>>>>> not.
>>>>>>
>>>>>> This should be _very_ rare. Any cases where 2 clients are trying to
>>>>>> do conflicting I/O on the same data is likely to be either a
>>>>>> violation of the NFS cache consistency rules, or a scenario where it
>>>>>> is in any case more efficient to go through the MDS (e.g. writing to
>>>>>> adjacent records that share the same extent).
>>>>> Well this is a different discussion: what was the reason for the
>>>>> recall in the first place. This is one use case but there could be
>>>>> other use cases for the recall and we discuss here how to implement
>>>>> the protocol more than how to solve a real problem. My 2c
>>>>
>>>> I strongly disagree. If this is an unrealistic scenario, then we don't
>>>> have to care about devising an optimal strategy for it. The 'there
>>>> could be other usecases' scenario needs to be fleshed out before we
>>>> can deal with it.
>>>
>>> To clarify a bit what I mean: we MUST devise optimal strategies for
>>> realistic and useful scenarios. It is entirely OPTIONAL to devise
>>> optimal strategies for unrealistic ones.
>>>
>>> If writing back all data before returning the layout causes protocol
>>> issues because the server cannot distinguish between a bad client and
>>> one that is waiting for I/O to complete, then my argument is that we're
>>> in the second case: we don't have to optimise for it, and so it is safe
>>> for the server to assume 'bad client'...
>>>
>>> Trond
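
For illustration only, here is a minimal sketch of the client behaviour argued for in the reply above and in proposition 4 of the quoted message: dirty data stays cached until some path reports it stable, a failed write through the layout (fencing, revocation, or a dead DS) stops further use of that layout, and the affected blocks are reissued to the MDS. Every name in it (WritebackCache, DirtyBlock, WriteResult, write_ds, write_mds) is hypothetical and not taken from RFC 5661 or from any client implementation. Asking for a new layout and retrying through the DS is left out to keep it short.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class WriteResult(Enum):
    STABLE = auto()  # data reached stable storage
    FAILED = auto()  # I/O rejected or lost (fenced, layout revoked, DS down)


@dataclass
class DirtyBlock:
    offset: int
    data: bytes


@dataclass
class WritebackCache:
    """Hypothetical client-side cache of dirty blocks, keyed by offset."""
    dirty: dict = field(default_factory=dict)

    def add(self, block):
        self.dirty[block.offset] = block

    def flush(self, write_ds, write_mds, have_layout):
        """Try to write every dirty block; return whether the layout is
        still considered usable afterwards.

        write_ds / write_mds are caller-supplied callables (assumptions)
        that take a DirtyBlock and return a WriteResult.
        """
        for offset in sorted(self.dirty):  # sorted() copies the keys
            block = self.dirty[offset]
            result = WriteResult.FAILED
            if have_layout:
                result = write_ds(block)
                if result is WriteResult.FAILED:
                    # Fenced off or DS gone: stop using this layout.
                    have_layout = False
            if result is not WriteResult.STABLE:
                # Reissue the same data through the MDS.
                result = write_mds(block)
            if result is WriteResult.STABLE:
                del self.dirty[offset]
            # Otherwise the block stays dirty and is retried on a later
            # flush; it is never dropped on the floor.
        return have_layout
```

A caller would invoke flush() whenever its own memory-management heuristics decide to write back, which is independent of whether a layoutrecall has arrived.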