Re: [nfsv4] Write-behind caching

<david.black@emc.com> Fri, 05 November 2010 14:08 UTC

From: david.black@emc.com
To: david.noveck@emc.com
Date: Fri, 05 Nov 2010 10:07:53 -0400
Cc: nfsv4@ietf.org
Subject: Re: [nfsv4] Write-behind caching

Dave,

> But the existing text suggests that this very situation, draining the dirty data using the layout,
> should (or might) result in less aggressive caching.  The fact that you are doing it in response to a
> recall means that you have a higher drain rate times a limited time, rather than a more limited drain
> rate times an unlimited time.

I agree that there's a problem with the existing text.  Keep in mind that for a system in which the metadata server is limited in what it can do, the "more limited drain rate" might be *much* smaller than what's possible via the data servers, and the client might not be interested in an "unlimited" exposure in terms of time to drain the cache.  Benny and I are both telling you not to forbid a client from choosing to do writebacks to the data servers in response to a layout recall.
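
To make that tradeoff concrete, here is a minimal sketch of the kind of decision a client could make when a layout covering dirty data is recalled. All names, numbers, and helpers are invented for illustration; nothing here is from RFC 5661 or any particular client.

    /*
     * Illustration only: a client weighing whether to drain dirty data
     * through the data servers before returning a recalled layout, or to
     * return the layout promptly and write back through the MDS instead.
     * Every name and constant here is hypothetical.
     */
    #include <stdbool.h>
    #include <stdio.h>

    struct recall_ctx {
        bool   clora_changed;        /* per the thread: if set, don't write via the recalled layout */
        double recall_budget_secs;   /* rough time before the server might fence */
        double dirty_bytes;          /* dirty data covered by the recalled range */
        double ds_bandwidth;         /* estimated aggregate data-server throughput */
        double mds_bandwidth;        /* estimated throughput through the MDS */
    };

    static bool drain_via_layout(const struct recall_ctx *c)
    {
        if (c->clora_changed)
            return false;                                 /* don't write back via this layout */
        double t_ds = c->dirty_bytes / c->ds_bandwidth;   /* high rate, limited time */
        return t_ds < c->recall_budget_secs / 2;          /* leave slack for the return itself */
    }

    int main(void)
    {
        struct recall_ctx c = {
            .clora_changed      = false,
            .recall_budget_secs = 90.0,      /* say, one lease period     */
            .dirty_bytes        = 512e6,     /* 512 MB of dirty data      */
            .ds_bandwidth       = 2e9,       /* ~2 GB/s via data servers  */
            .mds_bandwidth      = 1e8,       /* ~100 MB/s via the MDS     */
        };

        printf("drain via DS: %.1f s, via MDS: %.1f s\n",
               c.dirty_bytes / c.ds_bandwidth,
               c.dirty_bytes / c.mds_bandwidth);
        puts(drain_via_layout(&c)
             ? "flush dirty range via data servers, then LAYOUTRETURN"
             : "LAYOUTRETURN now; write remaining dirty data through the MDS");
        return 0;
    }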

> I think the question is about how you consider the case "if you cannot get the layout back".  If a
> client is allowed to do writes more efficiently despite the recall, why is it not allowed to do reads
> more efficiently and similarly delay the recall?

Writes contain dirty data.  Reads don't.  There is a difference :-) :-) .

> I've proposed that this choice (the one about the write) be made the prerogative of the mapping type
> specifically.  For pNFS file, I would think that the normal assumption when a layout is being recalled
> is that this is part of some sort of restriping and that you will get it back and it is better to
> return the layout as soon as you can.

I would think that we shouldn't be telling people how to implement systems in this fashion.  I guess I can live with the file layout forbidding this if the file layout implementers think it's useful.  OTOH, Benny and I are telling you that forbidding this is the wrong approach for both the block and object layouts.

Thanks,
--David

> -----Original Message-----
> From: Noveck, David
> Sent: Thursday, November 04, 2010 8:47 PM
> To: Black, David; bhalevy@panasas.com
> Cc: nfsv4@ietf.org
> Subject: RE: [nfsv4] Write-behind caching
>
> > The multiple pNFS data servers can provide much higher
> > throughput than the MDS, and hence it should be valid
> > for a client to cache writes more aggressively when it
> > has a layout because it can drain its dirty cache faster.
>
> But the existing text suggests that this very situation, draining the dirty data using the layout,
> should (or might) result in less aggressive caching.  The fact that you are doing it in response to a
> recall means that you have a higher drain rate times a limited time, rather than a more limited drain
> rate times an unlimited time.
>
> I think the question is about how you consider the case "if you cannot get the layout back".  If a
> client is allowed to do writes more efficiently despite the recall, why is it not allowed to do reads
> more efficiently and similarly delay the recall?  It seems that the same performance considerations
> would apply.  Would you want the client to be able to read through the area in the layout in order to
> use it effectively?  I wouldn't think so.
>
> I've proposed that this choice (the one about the write) be made the prerogative of the mapping type
> specifically.  For pNFS file, I would think that the normal assumption when a layout is being recalled
> is that this is part of some sort of restriping and that you will get it back and it is better to
> return the layout as soon as you can.
>
>
>
> -----Original Message-----
> From: nfsv4-bounces@ietf.org [mailto:nfsv4-bounces@ietf.org] On Behalf Of david.black@emc.com
> Sent: Thursday, November 04, 2010 12:34 PM
> To: bhalevy@panasas.com
> Cc: nfsv4@ietf.org
> Subject: Re: [nfsv4] Write-behind caching
>
> > >    4*) Clients SHOULD write all dirty data covered by the recalled
> > >        layout before returning it.
> > >
> > > It may be that you can write faster this way, but it also means that the server may wait a while to
> > get the layout back and this may delay other clients.  There is the further problem that it means
> > that your set of dirty blocks can be much smaller than it would be otherwise and this can hurt
> > performance.  I don't think that should be a valid choice.
> >
> > We can live without it, as long as there is no hard requirement for the client to
> > not flush any dirty data upon CB_LAYOUTRECALL (option 1 above).
>
> I basically agree with Benny.
>
> The multiple pNFS data servers can provide much higher throughput than the MDS, and hence it should be
> valid for a client to cache writes more aggressively when it has a layout because it can drain its
> dirty cache faster.  Such a client may want to pro-actively reduce its amount of dirty cached data in
> response to a layout recall, with the goal of providing appropriate behavior if it cannot get the layout
> back.  For that reason, a prohibition on clients initiating writes in response to a recall would be a
> problem (i.e., option 1's prohibition is not a good idea).
>
> That leaves 2) and 3) which seem to be shades of the same concept:
>
> > >    2) Say you MAY write some dirty data on layouts being recalled
> > >       but you should limit this attempt to optimize use of layouts
> > >       to avoid unduly delaying layout recalls.
> > >
> > >    3) Say clients MAY write large amounts of dirty data and server
> > >       will generally accommodate them in using pNFS to do IO this
> > >       way.
>
> I think the "avoid unduly delaying" point is important, which suggests that the "large amounts" of
> dirty data writes in 3) would only be appropriate when the client has sufficiently high throughput
> access to the data servers to write "large amounts" of data without "unduly delaying" the recall.
>
> Thanks,
> --David
>
> > -----Original Message-----
> > From: nfsv4-bounces@ietf.org [mailto:nfsv4-bounces@ietf.org] On Behalf Of Benny Halevy
> > Sent: Wednesday, November 03, 2010 11:13 AM
> > To: Noveck, David
> > Cc: nfsv4@ietf.org; trond.myklebust@fys.uio.no
> > Subject: Re: [nfsv4] Write-behind caching
> >
> > On 2010-11-03 15:31, david.noveck@emc.com wrote:
> > > 12.5.5 states that the server "MUST wait one lease period before taking further action" so I don't
> > think it is allowed to fence the client immediately.
> > >
> > > I think there is some confusion/error that starts with the last paragraph of 12.5.5.
> > >
> > >    Although pNFS does not alter the file data caching capabilities of
> > >    clients, or their semantics, it recognizes that some clients may
> > >    perform more aggressive write-behind caching to optimize the benefits
> > >    provided by pNFS.
> > >
> > > If you are doing write-behind caching, the primary thing that is going to decide whether you
> > should actually write the dirty block is the probability that it will be modified again.  If that is
> > at all likely, then writing it immediately, just to get the "benefits provided by pNFS" may not be a
> > good idea.  And if the probabilities of the block being further modified had already reached a low
> > level, then you probably should have started writing it, before the CB_LAYOUTRECALL.  It may be that
> > there are some blocks whose probability is just on the edge, and the CB_LAYOUTRECALL pushed them
> > into the gee-it-would-be-better-to-write-these-now category.  But that is not what is being talked about
> > here.
> > >
> > > Note that it talks about "more aggressive write-behind caching" and then later talks about having
> > less dirty data in this case.  I think this needs to be rethought.
> > >
> > >    However, write-behind caching may negatively
> > >    affect the latency in returning a layout in response to a
> > >    CB_LAYOUTRECALL;
> > >
> > > Here it seems to assume not just that CB_LAYOUTRECALL makes it more desirable to write some
> > dirty blocks using the recalled layout, but that all dirty data is being written (or at least that
> > which is covered by the recall).
> > >
> > >    this is similar to file delegations and the impact
> > >    that file data caching has on DELEGRETURN.
> > >
> > > But that is a very bad analogy.  For delegations, there's a semantic reason you have to write all
> > the dirty data before returning the delegations.
> > >
> > >    Client implementations
> > >    SHOULD limit the amount of unwritten data they have outstanding at
> > >    any one time in order to prevent excessively long responses to
> > >    CB_LAYOUTRECALL.
> > >
> > > Again the assumption is not that somebody is writing some amount of data to take advantage of a
> > layout going away but that clients in general are writing every single dirty block.  As an example,
> > take the case of the partial block written sequentially.  That's a dirty block you would never write
> > as a result of a LAYOUTRECALL.  There's probably no benefit in writing it using the layout no matter
> > how efficient the pNFS mapping type is.  You are probably going to have to write it again anyway.
> > >
> > > For some environments, limiting the amount of unwritten data may hurt performance more than
> > writing the dirty data to the MDS.  If I can write X bytes of dirty blocks to the MDS (if I didn't
> > have a layout), why should I keep less than X bytes of dirty blocks if I have a layout which is
> > supposedly helping me write more efficiently (and as part of "more aggressive write-behind
> > caching")?  If anything, I should be able to have more dirty data.
> > >
> > > Note that clora_changed can tell the client not to write the dirty data, but the client has no way
> > of predicting what clora_changed will be, so it would seem that it has to limit the amount of
> > dirty data, even if it has a server which is never going to ask it to write that data as part of
> > a layout recall.
> > >
> > >    Once a layout is recalled, a server MUST wait one
> > >    lease period before taking further action.  As soon as a lease period
> > >    has passed, the server may choose to fence the client's access to the
> > >    storage devices if the server perceives the client has taken too long
> > >    to return a layout.  However, just as in the case of data delegation
> > >    and DELEGRETURN, the server may choose to wait, given that the client
> > >    is showing forward progress on its way to returning the layout.
> > >
> > > Again, these situations are different.  A client which is doing this is issuing new IO's using
> > recalled layouts.  I don't have any objection if a server wants to allow this but I don't think
> > treating layouts in the same way as delegations should be encouraged.
> > >
> > >    This
> > >    forward progress can take the form of successful interaction with the
> > >    storage devices or of sub-portions of the layout being returned by
> > >    the client.  The server can also limit exposure to these problems by
> > >    limiting the byte-ranges initially provided in the layouts and thus
> > >    the amount of outstanding modified data.
> > >
> > > That adds a lot of complexity to the server for no good reason.  If you start by telling the client
> > to write every single dirty block covered by a layout recall before returning the layout, then you
> > are going to run into problems like this.
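
The rule in the quoted 12.5.5 text (wait at least one lease period, then fence only if the client shows no forward progress) reduces to a small server-side check. The sketch below is illustrative only; the types and field names are invented.

    /*
     * Illustration only: the server-side rule quoted above from 12.5.5.
     * Types and field names are hypothetical, not from any implementation.
     */
    #include <stdbool.h>
    #include <stdio.h>

    struct recall_progress {
        double seconds_since_recall;
        double lease_period_secs;
        bool   io_seen_on_data_servers;   /* successful interaction with the DSes */
        bool   subrange_returned;         /* partial LAYOUTRETURNs received */
    };

    static bool server_may_fence(const struct recall_progress *p)
    {
        if (p->seconds_since_recall < p->lease_period_secs)
            return false;                 /* MUST wait one lease period first */
        /* Forward progress: the server MAY keep waiting rather than fence. */
        if (p->io_seen_on_data_servers || p->subrange_returned)
            return false;
        return true;                      /* fence the client's access to the storage devices */
    }

    int main(void)
    {
        struct recall_progress p = { 120.0, 90.0, false, false };
        printf("fence now? %s\n", server_may_fence(&p) ? "yes" : "no");
        return 0;
    }
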
> > >
> > > I think there are a number of choices:
> > >
> > >    1) Say you MUST NOT do IO on layouts being recalled, in which
> > >       case none of this problem arises.  I take it this is what
> > >       Trond is arguing for.
> > >
> > >    2) Say you MAY write some dirty data on layouts being recalled
> > >       but you should limit this attempt to optimize use of layouts
> > >       to avoid unduly delaying layout recalls.
> > >
> > >    3) Say clients MAY write large amounts of dirty data and server
> > >       will generally accommodate them in using pNFS to do IO this
> > >       way.
> > >
> > > Maybe the right approach is to have whichever of these is to be in effect be chosen on a per-
> > mapping-type basis, perhaps based on clora_changed.
> >
> > I think this is the right approach, as the block and object layout types may
> > use topologies for which flushing some data to fill, e.g., an allocated block
> > on disk or a RAID stripe, makes sense.
> >
> > >
> > > I think the real problem is the suggestion that there is some reason that a client has to write
> > every single dirty block within the scope of the CB_LAYOUTRECALL, i.e. that this is analogous to
> > DELEGRETURN.
> > >
> > >    4*) Clients SHOULD write all dirty data covered by the recalled
> > >        layout before returning it.
> > >
> > > It may be that you can write faster this way, but it also means that the server may wait a while to
> > get the layout back and this may delay other clients.  There is the further problem that it means
> > that your set of dirty blocks can be much smaller than it would be otherwise and this can hurt
> > performance.  I don't think that should be a valid choice.
> >
> > We can live without it, as long as there is no hard requirement for the client to
> > not flush any dirty data upon CB_LAYOUTRECALL (option 1 above).
> >
> > Benny
> >
> > >
> > >
> > >
> > > -----Original Message-----
> > > From: sfaibish [mailto:sfaibish@emc.com]
> > > Sent: Tuesday, November 02, 2010 1:06 PM
> > > To: Trond Myklebust; Noveck, David
> > > Cc: bhalevy@panasas.com; jglasgow@aya.yale.edu; nfsv4@ietf.org
> > > Subject: Re: [nfsv4] Write-behind caching
> > >
> > > On Tue, 02 Nov 2010 10:18:02 -0400, Trond Myklebust
> > > <trond.myklebust@fys.uio.no> wrote:
> > >
> > >> Hi Dave,
> > >>
> > >> So, while I largely agree with your points 1-6, I'd like to add
> > >>
> > >> 0) Layouts are not a tool for enforcing cache consistency!
> > >>
> > >> While I agree that doing safe read-modify-write in the block case is an
> > >> important feature, I don't see any agreement anywhere in RFC5661 that we
> > >> should be providing stronger caching semantics than we used to provide
> > >> prior to adding pNFS to the protocol. I have no intention of allowing a
> > >> Linux client implementation that provides such stronger semantics until
> > >> we write that sort of thing into the spec and provide for similar
> > >> stronger semantics in the non-pNFS case.
> > >>
> > >> With that in mind, I have the following comments:
> > >>
> > >>       * I see no reason to write data back when the server recalls the
> > >>         layout. While I see that you could argue (1) implies that you
> > >>         should try to write stuff while you still hold a layout, the
> > >>         spec says that clora_changed==FALSE implies you can get that
> > >>         layout back later. In the case where clora_changed==TRUE, you
> > >>         might expect the file would be unavailable for longer, but the
> > >>         spec says you shouldn't write stuff back in that case...
> > >>       * While this may lead to layout bouncing between clients and/or
> > >>         the server, the clients do have the option of detecting this,
> > >>         and choosing to write through the MDS to improve efficiency. Grabbing
> > >>         the layout, and blocking others from accessing the data while
> > >>         you write is not a scalable solution even if you do believe
> > >>         there is a valid scenario for this behaviour.
> > >>       * Basically, it comes down to the fact that I want to write back
> > >>         data when my memory management heuristics require it, so that I
> > >>         cache data as long as possible. I see no reason why server
> > >>         mechanics should dictate when I should stop caching (unless we
> > >>         are talking about a true cache consistency mechanism).
> > > OK. Now I think I understand your point, and we might still require some
> > > changes in the interpretation and perhaps some language in RFC 5661. But
> > > I have a basic question about fencing. Do we think there is any
> > > possibility of data corruption when the DS fences the I/Os very fast
> > > after the layoutrecall? If we can find such a possibility, we probably
> > > need to mention this in the protocol and recommend how to prevent such a
> > > case. For example, the MDS sends a layoutrecall and immediately (an
> > > implementation decision of the server) forces the fencing on the DS while
> > > waiting for the return, or after receiving an ack from the client for the
> > > layoutrecall. (I might be out of order here, but I just want to be sure
> > > this is not the case.)
> > >
> > > /Sorin
> > >
> > >
> > >
> > >>
> > >> So my choices for Q1 and Q2 are still (A) and (A).
> > >>
> > >> Cheers
> > >>   Trond
> > >>
> > >> On Mon, 2010-11-01 at 13:49 -0400, david.noveck@emc.com wrote:
> > >>> I think that I want to address this issue without using the words
> > >>> "unrealistic" or "optimal".  Things that you think are unrealistic
> > >>> sometimes, in some sorts of environments, turn out to be common.  Trying
> > >>> to decide what approaches are optimal is also troubling.  In different
> > >>> situations, different approaches may be better or worse.  The protocol
> > >>> needs to define the rules that the client and server have to obey and
> > >>> they may make choices that produce results ranging from optimal to pessimal.
> > >>> We can make suggestions on doing things better but in unusual situations
> > >>> the performance considerations may be different.  The point is we have
> > >>> to be clear when the client can and can't do A and similarly for B and
> > >>> sometimes the choice of A or B is simply up to the client.
> > >>>
> > >>> So I'm going to put down the following numbered propositions and we'll
> > >>> see where people disagree with me.  Please be specific.  I'm going to
> > >>> assume that anything you don't argue with numerically below the point of
> > >>> disagreement is something we can agree on.
> > >>>
> > >>> 1) Normally, when a client is writing data covered by a layout, it may
> > >>> write using the layout or to the MDS, but unless there is a particular
> > >>> reason (e.g. slow or inconsistent response using the layout), it SHOULD
> > >>> write using the layout.
> > >>>
> > >>> 2) When a layout is recalled, the protocol itself does not, nor should it,
> > >>> require that dirty blocks in the cache be written before returning the
> > >>> layout.  If a client chooses to do writes using the recalled layout, it
> > >>> is doing so as an attempt to improve performance, given its judgment of
> > >>> the relative performance of IO using the layout and IO through the MDS.
> > >>>
> > >>> 3) Particularly in the case in which clora_changed is 0, clients MAY
> > >>> choose to take advantage of the higher-performance layout path to write
> > >>> that data while it is available.  However, since doing that delays the
> > >>> return of the layout, it is possible that the performance of others
> > >>> waiting for the layout may be reduced.
> > >>>
> > >>> 4) When writing of dirty blocks is done using a layout being recalled,
> > >>> the possibility exists that the layout will be revoked before all the
> > >>> blocks are successfully written.  The client MUST be prepared to rewrite
> > >>> to the MDS those dirty blocks whose layout writes failed in such cases.
> > >>>
> > >>> 5) Clients that want to write dirty blocks associated with recalled
> > >>> layouts MAY choose to restrict the size of the set of dirty blocks they
> > >>> keep in order to make it relatively unlikely that the layout will be
> > >>> revoked during recall.  On the other hand, for applications in which
> > >>> having a large set of dirty blocks in the cache reduces the IO actually
> > >>> done, such restriction may result in poorer performance, even though the
> > >>> specific IO path used is more performant.
> > >>>
> > >>> 6) Note that if a large set of dirty blocks can be kept by the client
> > >>> when a layout is not held, it should be possible to keep a set of dirty
> > >>> blocks at least that size when a layout is held.  Even if
> > >>> the client should choose to write those blocks as part of the layout
> > >>> recall, any that it is not able to write in an appropriate time will be
> > >>> a subset of an amount which, by hypothesis, can be appropriately held
> > >>> when the only means of writing them is to the MDS.
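
Proposition 4 amounts to a simple fallback rule on the client. The sketch below is illustrative only; write_via_layout() and write_via_mds() are invented placeholders, not real interfaces.

    /*
     * Illustration only of proposition 4: any dirty block whose write through
     * the recalled layout fails (e.g. the layout was revoked or the client was
     * fenced) is reissued through the MDS.  The helpers are stubs.
     */
    #include <stdbool.h>
    #include <stdio.h>

    struct dirty_block { long offset; long length; };

    static bool write_via_layout(const struct dirty_block *b)
    {
        (void)b;
        return false;                 /* pretend the layout was revoked mid-flush */
    }

    static bool write_via_mds(const struct dirty_block *b)
    {
        (void)b;
        return true;                  /* the MDS path remains available */
    }

    static void flush_dirty_range(struct dirty_block *blocks, int n)
    {
        for (int i = 0; i < n; i++) {
            if (write_via_layout(&blocks[i]))
                continue;
            /* Layout write failed: fall back to writing through the MDS. */
            if (!write_via_mds(&blocks[i]))
                fprintf(stderr, "block at offset %ld still dirty; retry later\n",
                        blocks[i].offset);
        }
    }

    int main(void)
    {
        struct dirty_block blocks[] = { { 0, 4096 }, { 8192, 4096 } };
        flush_dirty_range(blocks, 2);
        puts("dirty blocks written via the layout where possible, else via the MDS");
        return 0;
    }
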
> > >>>
> > >>> Another way of looking at this is that we have the following questions
> > >>> which I'm going to present as multiple choice.  I have missed a few
> > >>> choices but,
> > >>>
> > >>> Q1) When a recall of a layout occurs, what do you do about dirty blocks?
> > >>>     A) Nothing.  The IO's to write them are like any other IO and
> > >>>        you don't do IO using layout under recall.
> > >>>     B) You should write all dirty blocks as part of the recall if
> > >>>        clora_changed is 0.
> > >>>     C) You should do (A) but have the option of doing (B), though you
> > >>>        are responsible for the consequences.
> > >>>
> > >>> Q2) How many dirty blocks should you keep covered by a layout?
> > >>>     A) As many as you want.  It doesn't matter.
> > >>>     B) A small number so that you can be sure that they can be
> > >>>        written as part of layout recall (Assuming Q1=B).
> > >>>     C) If there is a limit, it must be at least as great as the limit
> > >>>        that would be in effect if there is no layout present, since
> > >>>        that number is OK, once the layout does go back.
> > >>>
> > >>> There are pieces of the spec that assume (A) as the answer to these
> > >>> questions and pieces that assume (B).
> > >>>
> > >>> So I guess I was arguing before that the answers to Q1 and Q2 should be
> > >>> (A).
> > >>>
> > >>> My understanding is that Benny is arguing for (B) as the answer to Q1
> > >>> and Q2.
> > >>>
> > >>> So I'm now willing to compromise slightly and answer (C) to both of
> > >>> those, but I think that still leaves me and Benny quite a ways apart.
> > >>>
> > >>> I'm not sure what Trond's answer is, but I'd be interested in understanding
> > >>> his view in terms of (1)-(6) and Q1 and Q2.
> > >>>
> > >>>
> > >>> -----Original Message-----
> > >>> From: Trond Myklebust [mailto:trond.myklebust@fys.uio.no]
> > >>> Sent: Friday, October 29, 2010 6:04 PM
> > >>> To: faibish, sorin
> > >>> Cc: Noveck, David; bhalevy@panasas.com; jglasgow@aya.yale.edu;
> > >>> nfsv4@ietf.org
> > >>> Subject: Re: [nfsv4] Write-behind caching
> > >>>
> > >>> On Fri, 2010-10-29 at 17:48 -0400, Trond Myklebust wrote:
> > >>>> On Fri, 2010-10-29 at 17:32 -0400, sfaibish wrote:
> > >>>>> On Fri, 29 Oct 2010 13:39:55 -0400, Trond Myklebust
> > >>>>> <trond.myklebust@fys.uio.no> wrote:
> > >>>>>
> > >>>>>> On Fri, 2010-10-29 at 13:20 -0400, david.noveck@emc.com wrote:
> > >>>>>>> There are two issues here with regard to handling of layout
> > >>> recall.
> > >>>>>>>
> > >>>>>>> One is with regard to in-flight IO.  As Benny points out, you
> > >>> cannot be
> > >>>>>>> sure that the in-flight IO can be completed in time to avoid the
> > >>> MDS
> > >>>>>>> losing patience.  That should rarely be the case though, if
> > >>> things are
> > >>>>>>> working right.  The client has to be prepared to deal with IO
> > >>> failures
> > >>>>>>> due to layout revocation.  Any IO that was in flight and failed
> > >>> because
> > >>>>>>> of layout revocation will need to be handled by being reissued to
> > >>> the
> > >>>>>>> MDS.  Is there anybody that disagrees with that?
> > >>>>>>>
> > >>>>>>> The second issue concerns IO not in-flight (in other words, not
> > >>> IO's
> > >>>>>>> yet but potential IO's) when the recall is received.  I just
> > >>> don't see
> > >>>>>>> that it is reasonable to start IO's using layout segments being
> > >>> recalled
> > >>>>>>> (whether for dirty buffers or anything else).  Doing IO's to the
> > >>> MDS is
> > >>>>>>> fine but there is no real need for the layout recall to specially
> > >>>
> > >>>>>>> trigger them, whether clora_changed is set or not.
> > >>>>>>
> > >>>>>> This should be _very_ rare. Any cases where 2 clients are trying
> > >>> to do
> > >>>>>> conflicting I/O on the same data is likely to be either a
> > >>> violation of
> > >>>>>> the NFS cache consistency rules, or a scenario where it is in any
> > >>> case
> > >>>>>> more efficient to go through the MDS (e.g. writing to adjacent
> > >>> records
> > >>>>>> that share the same extent).
> > >>>>> Well this is a different discussion: what was the reason for the
> > >>> recall in
> > >>>>> the first place. This is one usecase but there could be other
> > >>> usecases
> > >>>>> for the recall and we discuss here how to implement the protocol more
> > >>> than
> > >>>>> how to solve a real problem. My 2c
> > >>>>
> > >>>> I strongly disagree. If this is an unrealistic scenario, then we don't
> > >>>> have to care about devising an optimal strategy for it. The 'there
> > >>> could
> > >>>> be other usecases' scenario needs to be fleshed out before we can deal
> > >>>> with it.
> > >>>
> > >>> To clarify a bit what I mean: we MUST devise optimal strategies for
> > >>> realistic and useful scenarios. It is entirely OPTIONAL to devise
> > >>> optimal strategies for unrealistic ones.
> > >>>
> > >>> If writing back all data before returning the layout causes protocol
> > >>> issues because the server cannot distinguish between a bad client and
> > >>> one that is waiting for I/O to complete, then my argument is that we're
> > >>> in the second case: we don't have to optimise for it, and so it is safe
> > >>> for the server to assume 'bad client'...
> > >>>
> > >>>    Trond
> > >>>
> > >>>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >
> > >
> > >