Re: [nfsv4] Write-behind caching
<david.black@emc.com> Fri, 05 November 2010 14:08 UTC
From: david.black@emc.com
To: david.noveck@emc.com
Date: Fri, 05 Nov 2010 10:07:53 -0400
Message-ID: <7C4DFCE962635144B8FAE8CA11D0BF1E03D59C159C@MX14A.corp.emc.com>
In-Reply-To: <BF3BB6D12298F54B89C8DCC1E4073D8002944F1E@CORPUSMX50A.corp.emc.com>
Cc: nfsv4@ietf.org
Subject: Re: [nfsv4] Write-behind caching
Dave,

> But the existing text suggests that this very situation, draining the dirty data using the layout,
> should (or might) result in less aggressive caching. The fact that you are doing it in response to a
> recall means that you have a higher drain rate times a limited time, rather than a more limited drain
> rate times an unlimited time.

I agree that there's a problem with the existing text. Keep in mind that for a system in which the
metadata server is limited in what it can do, the "more limited drain rate" might be *much* smaller
than what's possible via the data servers, and the client might not be interested in an "unlimited"
exposure in terms of time to drain the cache.

Benny and I are both telling you not to forbid a client from choosing to do writebacks to the data
servers in response to a layout recall.

> I think the question is about how you consider the case "if you cannot get the layout back". If a
> client is allowed to do writes more efficiently despite the recall, why is it not allowed to do reads
> more efficiently and similarly delay the recall?

Writes contain dirty data. Reads don't. There is a difference :-) :-).

> I've proposed that this choice (the one about the write) be made the prerogative of the mapping type
> specifically. For pNFS file, I would think that the normal assumption when a layout is being recalled
> is that this is part of some sort of restriping and that you will get it back and it is better to
> return the layout as soon as you can.

I would think that we shouldn't be telling people how to implement systems in this fashion. I guess
I can live with the file layout forbidding this if the file layout implementers think it's useful.
OTOH, Benny and I are telling you that forbidding this is the wrong approach for both the block and
object layouts. (A rough sketch of the kind of client policy I have in mind appears at the bottom of
this message, below the quoted thread.)

Thanks,
--David

> -----Original Message-----
> From: Noveck, David
> Sent: Thursday, November 04, 2010 8:47 PM
> To: Black, David; bhalevy@panasas.com
> Cc: nfsv4@ietf.org
> Subject: RE: [nfsv4] Write-behind caching
>
> > The multiple pNFS data servers can provide much higher
> > throughput than the MDS, and hence it should be valid
> > for a client to cache writes more aggressively when it
> > has a layout because it can drain its dirty cache faster.
>
> But the existing text suggests that this very situation, draining the dirty data using the layout,
> should (or might) result in less aggressive caching. The fact that you are doing it in response to a
> recall means that you have a higher drain rate times a limited time, rather than a more limited drain
> rate times an unlimited time.
>
> I think the question is about how you consider the case "if you cannot get the layout back". If a
> client is allowed to do writes more efficiently despite the recall, why is it not allowed to do reads
> more efficiently and similarly delay the recall? It seems that the same performance considerations
> would apply. Would you want the client to be able to read through the area in the layout in order to
> use it effectively? I wouldn't think so.
>
> I've proposed that this choice (the one about the write) be made the prerogative of the mapping type
> specifically. For pNFS file, I would think that the normal assumption when a layout is being recalled
> is that this is part of some sort of restriping and that you will get it back and it is better to
> return the layout as soon as you can.
>
> -----Original Message-----
> From: nfsv4-bounces@ietf.org [mailto:nfsv4-bounces@ietf.org] On Behalf Of david.black@emc.com
> Sent: Thursday, November 04, 2010 12:34 PM
> To: bhalevy@panasas.com
> Cc: nfsv4@ietf.org
> Subject: Re: [nfsv4] Write-behind caching
>
> > > 4*) Clients SHOULD write all dirty data covered by the recalled
> > >     layout before returning it.
> > >
> > > It may be that you can write faster this way, but it also means that the server may wait a while
> > > to get the layout back and this may delay other clients. There is the further problem that it
> > > means that your set of dirty blocks can be much smaller than it would be otherwise and this can
> > > hurt performance. I don't think that should be a valid choice.
> >
> > We can live without it, as long as there is no hard requirement for the client to
> > not flush any dirty data upon CB_LAYOUTRECALL (option 1 above).
>
> I basically agree with Benny.
>
> The multiple pNFS data servers can provide much higher throughput than the MDS, and hence it should
> be valid for a client to cache writes more aggressively when it has a layout, because it can drain
> its dirty cache faster. Such a client may want to proactively reduce its amount of dirty cached data
> in response to a layout recall, with the goal of providing appropriate behavior if it cannot get the
> layout back. For that reason, a prohibition on clients initiating writes in response to a recall
> would be a problem (i.e., option 1's prohibition is not a good idea).
>
> That leaves 2) and 3), which seem to be shades of the same concept:
>
> > > 2) Say you MAY write some dirty data on layouts being recalled,
> > >    but you should limit this attempt to optimize use of layouts
> > >    to avoid unduly delaying layout recalls.
> > >
> > > 3) Say clients MAY write large amounts of dirty data and servers
> > >    will generally accommodate them in using pNFS to do IO this
> > >    way.
>
> I think the "avoid unduly delaying" point is important, which suggests that the "large amounts" of
> dirty data writes in 3) would only be appropriate when the client has sufficiently high throughput
> access to the data servers to write "large amounts" of data without "unduly delaying" the recall.
>
> Thanks,
> --David
>
> > -----Original Message-----
> > From: nfsv4-bounces@ietf.org [mailto:nfsv4-bounces@ietf.org] On Behalf Of Benny Halevy
> > Sent: Wednesday, November 03, 2010 11:13 AM
> > To: Noveck, David
> > Cc: nfsv4@ietf.org; trond.myklebust@fys.uio.no
> > Subject: Re: [nfsv4] Write-behind caching
> >
> > On 2010-11-03 15:31, david.noveck@emc.com wrote:
> > > 12.5.5 states that the server "MUST wait one lease period before taking further action", so I
> > > don't think it is allowed to fence the client immediately.
> > >
> > > I think there is some confusion/error that starts with the last paragraph of 12.5.5:
> > >
> > >    Although pNFS does not alter the file data caching capabilities of
> > >    clients, or their semantics, it recognizes that some clients may
> > >    perform more aggressive write-behind caching to optimize the benefits
> > >    provided by pNFS.
> > >
> > > If you are doing write-behind caching, the primary thing that is going to decide whether you
> > > should actually write the dirty block is the probability that it will be modified again. If
> > > that is at all likely, then writing it immediately, just to get the "benefits provided by
> > > pNFS", may not be a good idea.
> > > And if the probability of the block being further modified had already reached a low level,
> > > then you probably should have started writing it before the CB_LAYOUTRECALL. It may be that
> > > there are some blocks whose probability is just on the edge, and the CB_LAYOUTRECALL pushed
> > > them into the gee-it-would-be-better-to-write-these-now category. But that is not what is being
> > > talked about here.
> > >
> > > Note that it talks about "more aggressive write-behind caching" and then later talks about
> > > having less dirty data in this case. I think this needs to be rethought.
> > >
> > >    However, write-behind caching may negatively
> > >    affect the latency in returning a layout in response to a
> > >    CB_LAYOUTRECALL;
> > >
> > > Here it seems to assume not just that CB_LAYOUTRECALL makes it more desirable to write some
> > > dirty blocks using the recalled layout, but that all dirty data is being written (or at least
> > > that which is covered by the recall).
> > >
> > >    this is similar to file delegations and the impact
> > >    that file data caching has on DELEGRETURN.
> > >
> > > But that is a very bad analogy. For delegations, there's a semantic reason you have to write
> > > all the dirty data before returning the delegation.
> > >
> > >    Client implementations
> > >    SHOULD limit the amount of unwritten data they have outstanding at
> > >    any one time in order to prevent excessively long responses to
> > >    CB_LAYOUTRECALL.
> > >
> > > Again, the assumption is not that somebody is writing some amount of data to take advantage of
> > > a layout going away, but that clients in general are writing every single dirty block. As an
> > > example, take the case of the partial block written sequentially. That's a dirty block you
> > > would never write as a result of a LAYOUTRECALL. There's probably no benefit in writing it
> > > using the layout, no matter how efficient the pNFS mapping type is. You are probably going to
> > > have to write it again anyway.
> > >
> > > For some environments, limiting the amount of unwritten data may hurt performance more than
> > > writing the dirty data to the MDS. If I can write X bytes of dirty blocks to the MDS (if I
> > > didn't have a layout), why should I keep less than X bytes of dirty blocks if I have a layout
> > > which is supposedly helping me write more efficiently (and as part of "more aggressive
> > > write-behind caching")? If anything, I should be able to have more dirty data.
> > >
> > > Note that clora_changed can tell the client not to write the dirty data, but the client has no
> > > way of predicting what clora_changed will be, so it would seem that clients have to limit the
> > > amount of dirty data even if they have a server which is never going to ask them to write it as
> > > part of layout recall.
> > >
> > >    Once a layout is recalled, a server MUST wait one
> > >    lease period before taking further action. As soon as a lease period
> > >    has passed, the server may choose to fence the client's access to the
> > >    storage devices if the server perceives the client has taken too long
> > >    to return a layout. However, just as in the case of data delegation
> > >    and DELEGRETURN, the server may choose to wait, given that the client
> > >    is showing forward progress on its way to returning the layout.
> > >
> > > Again, these situations are different. A client which is doing this is issuing new IO's using
> > > recalled layouts.
> > > I don't have any objection if a server wants to allow this, but I don't think treating layouts
> > > in the same way as delegations should be encouraged.
> > >
> > >    This
> > >    forward progress can take the form of successful interaction with the
> > >    storage devices or of sub-portions of the layout being returned by
> > >    the client. The server can also limit exposure to these problems by
> > >    limiting the byte-ranges initially provided in the layouts and thus
> > >    the amount of outstanding modified data.
> > >
> > > That adds a lot of complexity to the server for no good reason. If you start by telling the
> > > client to write every single dirty block covered by a layout recall before returning the
> > > layout, then you are going to run into problems like this.
> > >
> > > I think there are a number of choices:
> > >
> > > 1) Say you MUST NOT do IO on layouts being recalled, in which
> > >    case none of this problem arises. I take it this is what
> > >    Trond is arguing for.
> > >
> > > 2) Say you MAY write some dirty data on layouts being recalled,
> > >    but you should limit this attempt to optimize use of layouts
> > >    to avoid unduly delaying layout recalls.
> > >
> > > 3) Say clients MAY write large amounts of dirty data and servers
> > >    will generally accommodate them in using pNFS to do IO this
> > >    way.
> > >
> > > Maybe the right approach is to have whichever of these is to be in effect be chosen on a
> > > per-mapping-type basis, perhaps based on clora_changed.
> >
> > I think this is the right approach, as the block and object layout types may use topologies for
> > which flushing some data to fill, e.g., an allocated block on disk or a RAID stripe, makes sense.
> >
> > > I think the real problem is the suggestion that there is some reason that a client has to write
> > > every single dirty block within the scope of the CB_LAYOUTRECALL, i.e. that this is analogous
> > > to DELEGRETURN.
> > >
> > > 4*) Clients SHOULD write all dirty data covered by the recalled
> > >     layout before returning it.
> > >
> > > It may be that you can write faster this way, but it also means that the server may wait a
> > > while to get the layout back and this may delay other clients. There is the further problem
> > > that it means that your set of dirty blocks can be much smaller than it would be otherwise and
> > > this can hurt performance. I don't think that should be a valid choice.
> >
> > We can live without it, as long as there is no hard requirement for the client to not flush any
> > dirty data upon CB_LAYOUTRECALL (option 1 above).
> >
> > Benny
> >
> > > -----Original Message-----
> > > From: sfaibish [mailto:sfaibish@emc.com]
> > > Sent: Tuesday, November 02, 2010 1:06 PM
> > > To: Trond Myklebust; Noveck, David
> > > Cc: bhalevy@panasas.com; jglasgow@aya.yale.edu; nfsv4@ietf.org
> > > Subject: Re: [nfsv4] Write-behind caching
> > >
> > > On Tue, 02 Nov 2010 10:18:02 -0400, Trond Myklebust <trond.myklebust@fys.uio.no> wrote:
> > >
> > >> Hi Dave,
> > >>
> > >> So, while I largely agree with your points 1-6, I'd like to add
> > >>
> > >> 0) Layouts are not a tool for enforcing cache consistency!
> > >>
> > >> While I agree that doing safe read-modify-write in the block case is an important feature, I
> > >> don't see any agreement anywhere in RFC 5661 that we should be providing stronger caching
> > >> semantics than we used to provide prior to adding pNFS to the protocol.
> > >> I have no intention of allowing a Linux client implementation that provides such stronger
> > >> semantics until we write that sort of thing into the spec and provide for similar stronger
> > >> semantics in the non-pNFS case.
> > >>
> > >> With that in mind, I have the following comments:
> > >>
> > >>      * I see no reason to write data back when the server recalls the
> > >>        layout. While I see that you could argue (1) implies that you
> > >>        should try to write stuff while you still hold a layout, the
> > >>        spec says that clora_changed==FALSE implies you can get that
> > >>        layout back later. In the case where clora_changed==TRUE, you
> > >>        might expect the file would be unavailable for longer, but the
> > >>        spec says you shouldn't write stuff back in that case...
> > >>      * While this may lead to the layout bouncing between clients and/or
> > >>        the server, the clients do have the option of detecting this,
> > >>        and choosing to write through the MDS to improve efficiency.
> > >>        Grabbing the layout, and blocking others from accessing the data
> > >>        while you write, is not a scalable solution even if you do
> > >>        believe there is a valid scenario for this behaviour.
> > >>      * Basically, it comes down to the fact that I want to write back
> > >>        data when my memory management heuristics require it, so that I
> > >>        cache data as long as possible. I see no reason why server
> > >>        mechanics should dictate when I should stop caching (unless we
> > >>        are talking about a true cache consistency mechanism).
> > > OK. Now I think I understand your point, and we might still require some changes in the
> > > interpretation and perhaps some language in 5661. But I have a basic question about fencing.
> > > Do we think there is any possibility of data corruption when the DS fences the I/Os very soon
> > > after the layoutrecall? If we can find such a possibility, we probably need to mention this in
> > > the protocol and recommend how to prevent such a case. For example, the MDS sends a
> > > layoutrecall and immediately (an implementation decision of the server) forces the fencing on
> > > the DS while waiting for the return, or after receiving an ack from the client for the
> > > layoutrecall. (I might be out of order here, but I just want to be sure this is not the case.)
> > >
> > > /Sorin
> > >
> > >> So my choices for Q1 and Q2 are still (A) and (A).
> > >>
> > >> Cheers
> > >>   Trond
> > >>
> > >> On Mon, 2010-11-01 at 13:49 -0400, david.noveck@emc.com wrote:
> > >>> I think that I want to address this issue without using the words "unrealistic" or
> > >>> "optimal". Things that you think are unrealistic sometimes, in some sorts of environments,
> > >>> turn out to be common. Trying to decide which approaches are optimal is also troubling. In
> > >>> different situations, different approaches may be better or worse. The protocol needs to
> > >>> define the rules that the client and server have to obey, and they may make choices that
> > >>> produce results anywhere from optimal to pessimal. We can make suggestions on doing things
> > >>> better, but in unusual situations the performance considerations may be different. The point
> > >>> is we have to be clear when the client can and can't do A, and similarly for B, and sometimes
> > >>> the choice of A or B is simply up to the client.
> > >>>
> > >>> So I'm going to put down the following numbered propositions and we'll see where people
> > >>> disagree with me. Please be specific.
> > >>> I'm going to assume that anything you don't argue with numerically below the point of
> > >>> disagreement is something we can agree on.
> > >>>
> > >>> 1) Normally, when a client is writing data covered by a layout, it may write using the layout
> > >>> or to the MDS, but unless there is a particular reason (e.g. slow or inconsistent response
> > >>> using the layout), it SHOULD write using the layout.
> > >>>
> > >>> 2) When a layout is recalled, the protocol itself does not, nor should it, require that dirty
> > >>> blocks in the cache be written before returning the layout. If a client chooses to do writes
> > >>> using the recalled layout, it is doing so as an attempt to improve performance, given its
> > >>> judgment of the relative performance of IO using the layout and IO through the MDS.
> > >>>
> > >>> 3) Particularly in the case in which clora_changed is 0, clients MAY choose to take advantage
> > >>> of the higher-performance layout path to write that data while it is available. However,
> > >>> since doing that delays the return of the layout, it is possible that the performance of
> > >>> others waiting for the layout may be reduced.
> > >>>
> > >>> 4) When writing of dirty blocks is done using a layout being recalled, the possibility exists
> > >>> that the layout will be revoked before all the blocks are successfully written. In such cases
> > >>> the client MUST be prepared to rewrite to the MDS those dirty blocks whose layout writes
> > >>> failed.
> > >>>
> > >>> 5) Clients that want to write dirty blocks associated with recalled layouts MAY choose to
> > >>> restrict the size of the set of dirty blocks they keep, in order to make it relatively
> > >>> unlikely that the layout will be revoked during recall. On the other hand, for applications
> > >>> in which having a large set of dirty blocks in the cache reduces the IO actually done, such a
> > >>> restriction may result in poorer performance, even though the specific IO path used is more
> > >>> performant.
> > >>>
> > >>> 6) Note that if a large set of dirty blocks can be kept by the client when a layout is not
> > >>> held, it should be possible to keep a set at least that size when a layout is held. Even if
> > >>> the client should choose to write those blocks as part of the layout recall, any that it is
> > >>> not able to write in an appropriate time will be a subset of an amount which, by hypothesis,
> > >>> can be appropriately held when the only means of writing them is to the MDS.
> > >>>
> > >>> Another way of looking at this is that we have the following questions, which I'm going to
> > >>> present as multiple choice. I may have missed a few choices, but:
> > >>>
> > >>> Q1) When a recall of a layout occurs, what do you do about dirty blocks?
> > >>>    A) Nothing. The IO's to write them are like any other IO and
> > >>>       you don't do IO using a layout under recall.
> > >>>    B) You should write all dirty blocks as part of the recall if
> > >>>       clora_changed is 0.
> > >>>    C) You should do (A), but have the option of doing (B), but you
> > >>>       are responsible for the consequences.
> > >>>
> > >>> Q2) How many dirty blocks should you keep covered by a layout?
> > >>>    A) As many as you want. It doesn't matter.
> > >>>    B) A small number, so that you can be sure that they can be
> > >>>       written as part of layout recall (assuming Q1=B).
> > >>>    C) If there is a limit, it must be at least as great as the limit
> > >>>       that would be in effect if there is no layout present, since
> > >>>       that number is OK once the layout does go back.
> > >>>
> > >>> There are pieces of the spec that are assuming (A) as the answer to these, and pieces
> > >>> assuming (B).
> > >>>
> > >>> So I guess I was arguing before that the answers to Q1 and Q2 should be (A).
> > >>>
> > >>> My understanding is that Benny is arguing for (B) as the answer to Q1 and Q2.
> > >>>
> > >>> So I'm now willing to compromise slightly and answer (C) to both of those, but I think that
> > >>> still leaves me and Benny quite a ways apart.
> > >>>
> > >>> I'm not sure what Trond's answer is, but I'd be interested in understanding his view in terms
> > >>> of (1)-(6) and Q1 and Q2.
> > >>>
> > >>> -----Original Message-----
> > >>> From: Trond Myklebust [mailto:trond.myklebust@fys.uio.no]
> > >>> Sent: Friday, October 29, 2010 6:04 PM
> > >>> To: faibish, sorin
> > >>> Cc: Noveck, David; bhalevy@panasas.com; jglasgow@aya.yale.edu; nfsv4@ietf.org
> > >>> Subject: Re: [nfsv4] Write-behind caching
> > >>>
> > >>> On Fri, 2010-10-29 at 17:48 -0400, Trond Myklebust wrote:
> > >>>> On Fri, 2010-10-29 at 17:32 -0400, sfaibish wrote:
> > >>>>> On Fri, 29 Oct 2010 13:39:55 -0400, Trond Myklebust <trond.myklebust@fys.uio.no> wrote:
> > >>>>>
> > >>>>>> On Fri, 2010-10-29 at 13:20 -0400, david.noveck@emc.com wrote:
> > >>>>>>> There are two issues here with regard to handling of layout recall.
> > >>>>>>>
> > >>>>>>> One is with regard to in-flight IO. As Benny points out, you cannot be sure that the
> > >>>>>>> in-flight IO can be completed in time to avoid the MDS losing patience. That should
> > >>>>>>> rarely be the case, though, if things are working right. The client has to be prepared
> > >>>>>>> to deal with IO failures due to layout revocation. Any IO that was in flight and failed
> > >>>>>>> because of layout revocation will need to be handled by being reissued to the MDS. Is
> > >>>>>>> there anybody that disagrees with that?
> > >>>>>>>
> > >>>>>>> The second issue concerns IO not in flight (in other words, not IO's yet but potential
> > >>>>>>> IO's) when the recall is received. I just don't see that it is reasonable to start IO's
> > >>>>>>> using layout segments being recalled (whether for dirty buffers or anything else). Doing
> > >>>>>>> IO's to the MDS is fine, but there is no real need for the layout recall to specially
> > >>>>>>> trigger them, whether clora_changed is set or not.
> > >>>>>>
> > >>>>>> This should be _very_ rare. Any case where 2 clients are trying to do conflicting I/O on
> > >>>>>> the same data is likely to be either a violation of the NFS cache consistency rules, or a
> > >>>>>> scenario where it is in any case more efficient to go through the MDS (e.g. writing to
> > >>>>>> adjacent records that share the same extent).
> > >>>>> Well, this is a different discussion: what was the reason for the recall in the first
> > >>>>> place? This is one use case, but there could be other use cases for the recall, and we
> > >>>>> discuss here how to implement the protocol more than how to solve a real problem. My 2c.
> > >>>>
> > >>>> I strongly disagree.
> > >>>> If this is an unrealistic scenario, then we don't have to care about devising an optimal
> > >>>> strategy for it. The 'there could be other use cases' scenario needs to be fleshed out
> > >>>> before we can deal with it.
> > >>>
> > >>> To clarify a bit what I mean: we MUST devise optimal strategies for realistic and useful
> > >>> scenarios. It is entirely OPTIONAL to devise optimal strategies for unrealistic ones.
> > >>>
> > >>> If writing back all data before returning the layout causes protocol issues because the
> > >>> server cannot distinguish between a bad client and one that is waiting for I/O to complete,
> > >>> then my argument is that we're in the second case: we don't have to optimise for it, and so
> > >>> it is safe for the server to assume 'bad client'...
> > >>>
> > >>> Trond
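
Purely as an illustration of the client-side choice argued about above (this is not proposed spec
text, not RFC 5661 wording, and not any existing implementation), here is a minimal C sketch of a
recall policy roughly along the lines of option 2. Every identifier and number in it is hypothetical:
the throughput figures, the half-lease time budget, and the 2x "clearly faster" threshold are
arbitrary stand-ins for whatever heuristic a real client would use.

    /*
     * Illustrative sketch only -- not RFC 5661 text and not any real client's
     * code.  All names, numbers, and thresholds below are hypothetical; they
     * exist only to make the trade-off discussed in this thread concrete.
     */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    enum recall_action {
        RETURN_NOW,     /* return the layout at once; drain dirty data via the MDS later */
        FLUSH_VIA_DS,   /* drain dirty data through the data servers, then return */
        FLUSH_PARTIAL   /* drain what fits in the time budget, return, finish via the MDS */
    };

    struct recall_ctx {
        uint64_t dirty_bytes;        /* dirty data covered by the recalled layout */
        uint64_t ds_bytes_per_sec;   /* observed aggregate throughput to the data servers */
        uint64_t mds_bytes_per_sec;  /* observed throughput writing through the MDS */
        double   lease_period_sec;   /* the server MUST wait this long before fencing */
        bool     clora_changed;      /* CB_LAYOUTRECALL indicated the mapping is changing */
    };

    /*
     * Decide what to do with dirty data when a layout covering it is recalled.
     * Roughly "option 2": the client MAY write some dirty data using the
     * recalled layout, but should avoid unduly delaying the return.
     */
    static enum recall_action decide(const struct recall_ctx *c)
    {
        /* If the mapping is changing, writes through the old layout may have
         * to be redone anyway, so just return it and drain through the MDS. */
        if (c->clora_changed || c->ds_bytes_per_sec == 0)
            return RETURN_NOW;

        /* If the DS path is not clearly faster than the MDS path, holding up
         * the recall buys nothing.  (The 2x factor is arbitrary.) */
        if (c->ds_bytes_per_sec < 2 * c->mds_bytes_per_sec)
            return RETURN_NOW;

        /* Spend at most half the lease period draining, as a safety margin
         * against being fenced.  (The 1/2 is arbitrary too.) */
        double budget_sec = c->lease_period_sec / 2.0;
        double drain_sec  = (double)c->dirty_bytes / (double)c->ds_bytes_per_sec;

        return (drain_sec <= budget_sec) ? FLUSH_VIA_DS : FLUSH_PARTIAL;
    }

    int main(void)
    {
        /* Example: plenty of DS bandwidth relative to the dirty data and the
         * lease period, so the client chooses to drain via the DSes first. */
        struct recall_ctx c = {
            .dirty_bytes       = 512ull << 20,   /* 512 MiB dirty under the layout */
            .ds_bytes_per_sec  = 800ull << 20,   /* ~800 MiB/s across the data servers */
            .mds_bytes_per_sec = 100ull << 20,   /* ~100 MiB/s through the MDS */
            .lease_period_sec  = 90.0,
            .clora_changed     = false,
        };

        static const char *const names[] = { "RETURN_NOW", "FLUSH_VIA_DS", "FLUSH_PARTIAL" };
        printf("recall action: %s\n", names[decide(&c)]);

        /* Whichever path is chosen, any write that fails because the layout
         * was revoked mid-flush must simply be reissued through the MDS. */
        return 0;
    }

The only point of the sketch is that the decision can be made locally by the client from things it
already knows (how much dirty data the recall covers, observed DS and MDS throughput, the lease
period, and clora_changed), and that whatever it chooses, writes that fail because the layout is
revoked mid-flush simply get reissued through the MDS, per point 4 in Dave's list. That is why a
blanket prohibition in the spec seems unnecessary.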