Re: [nfsv4] Write-behind caching

<david.noveck@emc.com> Fri, 05 November 2010 00:46 UTC

Date: Thu, 04 Nov 2010 20:46:46 -0400
From: david.noveck@emc.com
To: david.black@emc.com, bhalevy@panasas.com
Cc: nfsv4@ietf.org
Subject: Re: [nfsv4] Write-behind caching

> The multiple pNFS data servers can provide much higher 
> throughput than the MDS, and hence it should be valid 
> for a client to cache writes more aggressively when it 
> has a layout because it can drain its dirty cache faster.

But the existing text suggests that this very situation, draining the
dirty data using the layout, should (or might) result in less aggressive
caching.  The fact that you are doing it in response to a recall means
that you get a higher drain rate for a limited time, rather than a lower
drain rate for an unlimited time.
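
To make that arithmetic concrete, here is a tiny illustration (every
number in it is hypothetical, not from the spec or from any measured
system):

    /* Hypothetical "drain rate times time" illustration only. */
    #include <stdio.h>

    int main(void)
    {
        double ds_rate  = 1000.0; /* MB/s through the data servers (assumed) */
        double mds_rate = 100.0;  /* MB/s through the MDS (assumed)          */
        double window   = 90.0;   /* seconds until the recall must be honored
                                     (assumed lease period)                  */

        /* Draining through the recalled layout moves a lot of data, but
         * only within a bounded window ...                                  */
        printf("via layout, bounded window: %.0f MB\n", ds_rate * window);

        /* ... while the MDS path is slower but has no deadline at all.      */
        printf("via MDS: %.0f MB/s with no deadline\n", mds_rate);
        return 0;
    }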

I think the question is about how you consider the case "if you cannot
get the layout back".  If a client is allowed to do writes more
efficiently despite the recall, why is it not allowed to do reads more
efficiently and similarly delay the recall?  It seems that the same
performance considerations would apply.  Would you want the client to be
able to read through the area covered by the layout in order to use it
effectively?  I wouldn't think so.

I've proposed that this choice (the one about the write) be made the
prerogative of the mapping type specifically.  For the pNFS file layout
type, I would think the normal assumption when a layout is being
recalled is that the recall is part of some sort of restriping, that you
will get the layout back, and that it is better to return the layout as
soon as you can.
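
As a rough sketch of what that per-mapping-type prerogative could look
like on the client side (the policy hook, the enum, and the treatment of
clora_changed here are illustrative assumptions, not text from RFC 5661;
only the LAYOUT4_* values are taken from the spec):

    /* Hypothetical client-side policy hook, for illustration only. */

    enum layouttype4 {                 /* values per RFC 5661 */
        LAYOUT4_NFSV4_1_FILES = 1,
        LAYOUT4_OSD2_OBJECTS  = 2,
        LAYOUT4_BLOCK_VOLUME  = 3
    };

    enum flush_on_recall {
        FLUSH_NONE,    /* return the layout at once; dirty data goes to the MDS */
        FLUSH_LIMITED, /* write some dirty data via the layout, bounded in time */
        FLUSH_ALL      /* write everything covered by the recalled range first  */
    };

    static enum flush_on_recall
    recall_policy(enum layouttype4 type, int clora_changed)
    {
        if (clora_changed)
            return FLUSH_NONE;  /* per the clora_changed discussion quoted below */

        switch (type) {
        case LAYOUT4_NFSV4_1_FILES:
            /* Likely restriping; return quickly and expect the layout back. */
            return FLUSH_NONE;
        case LAYOUT4_BLOCK_VOLUME:
        case LAYOUT4_OSD2_OBJECTS:
            /* Filling an allocated block or a RAID stripe before returning
             * can be worthwhile, as Benny notes below.                      */
            return FLUSH_LIMITED;
        default:
            return FLUSH_NONE;
        }
    }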

  

-----Original Message-----
From: nfsv4-bounces@ietf.org [mailto:nfsv4-bounces@ietf.org] On Behalf
Of david.black@emc.com
Sent: Thursday, November 04, 2010 12:34 PM
To: bhalevy@panasas.com
Cc: nfsv4@ietf.org
Subject: Re: [nfsv4] Write-behind caching

> >    4*) Clients SHOULD write all dirty data covered by the recalled
> >        layout before return it.
> >
> > It may be that you can write faster this way, but it also mean that
> > the server may wait a while to get the layout back and this may delay
> > other clients.  There is the further problem that it means that your
> > set of dirty blocks can be much smaller than it would be otherwise
> > and this can hurt performance.  I don't think that should be a valid
> > choice.
>
> We can live without it, as long there is no hard requirement for the
> client to not flush any dirty data upon CB_LAYOUTRECALL (option 1
> above).

I basically agree with Benny.

The multiple pNFS data servers can provide much higher throughput than
the MDS, and hence it should be valid for a client to cache writes more
aggressively when it has a layout because it can drain its dirty cache
faster.  Such a client may want to pro-actively reduce its amount of
dirty cached data in response to a layout recall, with the goal of
providing appropriate behavior if it cannot get the layout back.  For
that reason, a prohibition on clients initiating writes in response to a
recall would be a problem (i.e., option 1's prohibition is not a good
idea).
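
As a sketch of that "pro-actively reduce" idea (the structure, fields,
and ratios below are invented for illustration; nothing here is from the
spec):

    /* Hypothetical reaction to CB_LAYOUTRECALL: instead of being forbidden
     * to write, the client tightens its own write-behind limit so that
     * background writeback starts draining, whichever path (layout or MDS)
     * the writes end up using.                                             */
    #include <stdint.h>

    struct pnfs_write_cache {
        uint64_t dirty_bytes;   /* currently dirty, unwritten data */
        uint64_t dirty_limit;   /* normal write-behind limit       */
    };

    static void
    on_layout_recall(struct pnfs_write_cache *c, int clora_changed)
    {
        /* If the layout may not come back in the same form, aim lower so
         * the client still behaves reasonably with only the MDS path.    */
        c->dirty_limit = clora_changed ? c->dirty_limit / 4
                                       : c->dirty_limit / 2;
        /* ... then kick background writeback against the new limit ...   */
    }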

That leaves 2) and 3) which seem to be shades of the same concept:

> >    2) Say you MAY write some dirty data on layouts being recalled
> >       but you should limit this attempt to optimize use of layouts
> >       to avoid unduly delaying layout recalls.
> >
> >    3) Say clients MAY write large amounts of dirty data and server
> >       will generally accommodate them in using pNFS to do IO this
> >       way.

I think the "avoid unduly delaying" point is important, which suggests
that the "large amounts" of dirty data writes in 3) would only be
appropriate when the client has sufficiently high throughput access to
the data servers to write "large amounts" of data without "unduly
delaying" the recall.

Thanks,
--David

> -----Original Message-----
> From: nfsv4-bounces@ietf.org [mailto:nfsv4-bounces@ietf.org] On Behalf
Of Benny Halevy
> Sent: Wednesday, November 03, 2010 11:13 AM
> To: Noveck, David
> Cc: nfsv4@ietf.org; trond.myklebust@fys.uio.no
> Subject: Re: [nfsv4] Write-behind caching
>
> On 2010-11-03 15:31, david.noveck@emc.com wrote:
> > 12.5.5 states that the server "MUST wait one lease period before
taking further action" so I don't
> think it is allowed to fence the client immediately.
> >
> > I think there some confusion/error that starts with the last
paragraph of 12.5.5.
> >
> >    Although pNFS does not alter the file data caching capabilities
of
> >    clients, or their semantics, it recognizes that some clients may
> >    perform more aggressive write-behind caching to optimize the
benefits
> >    provided by pNFS.
> >
> > It you are doing write-behind caching, the primary thing that is
going to decide whether you
> should actually write the dirty block is the probability that it will
be modified again.  If that is
> at all likely, then writing it immediately, just to get the "benefits
provided by pNFS" may not be a
> good idea.  And if the probabilities of the block being further
modified had already reached a low
> level, then you probably should have started writing it, before the
CB_LAYOUTRECALL.  It may be that
> there are some blocks whose probability is just on the edge, and the
CB_LAYOUTRECALL pushed them
> into gee-it-would-be-better-to-write-these-now category.  But that is
not what is being talked about
> here.
> >
> > Note that it talks about "more aggressive write-behind caching" and
then later talks about having
> less dirty data in this case.  I think this needs to be rethought.
> >
> >    However, write-behind caching may negatively
> >    affect the latency in returning a layout in response to a
> >    CB_LAYOUTRECALL;
> >
> > Here it seems to assume not that CB_LAYOUTRECALL makes it more
desirable to write not just some
> dirty blocks using the recalled layout but that all dirty data is
being written (or at least that
> which covered by the recall).
> >
> >    this is similar to file delegations and the impact
> >    that file data caching has on DELEGRETURN.
> >
> > But that is a very bad analogy.  For delegations, there's a semantic
reason you have to write all
> the dirty data before returning the delegations.
> >
> >    Client implementations
> >    SHOULD limit the amount of unwritten data they have outstanding
at
> >    any one time in order to prevent excessively long responses to
> >    CB_LAYOUTRECALL.
> >
> > Again the assumption is not that somebody is writing some amount of
data to take advantage of a
> layout going away but that clients in general are writing every single
dirty block.  As an example,
> take the case of the partial block written sequentially.  That's a
dirty block you would never write
> as a result of a LAYOUTRECALL.  There's probably no benefit in writing
it using the layout no matter
> how efficient the pNFS mapping type is.  You are probably going to
have to write it again anyway.
> >
> > For some environments, limiting the amount of unwritten data, may
hurt performance more than
> writing the dirty data to the MDS.  If I can write X bytes of dirty
blocks to the MDS (if I didn't
> have a layout), why should I keep less than X bytes of dirty blocks if
I have a layout which is
> supposedly helping me write more efficiently (and as part of "more
aggressive write-behind
> caching").  If anything I should be able to have more dirty data.
> >
> > Note that clora_changed can tell the client not to write the dirty
data, but the client has no way
> of predicting what clora_changed will be, so it would seem that they
have to limit the amount of
> dirty data, even if they have a server which is never going to ask
them to write it as part of
> layout recall.
> >
> >    Once a layout is recalled, a server MUST wait one
> >    lease period before taking further action.  As soon as a lease
period
> >    has passed, the server may choose to fence the client's access to
the
> >    storage devices if the server perceives the client has taken too
long
> >    to return a layout.  However, just as in the case of data
delegation
> >    and DELEGRETURN, the server may choose to wait, given that the
client
> >    is showing forward progress on its way to returning the layout.
> >
> > Again, these situations are different.  A client which is doing this
is issuing new IO's using
> recalled layouts.  I don't have any objection if a server wants to
allow this but I don't think
> treating layouts in the same way as delegations should be encouraged.
> >
> >    This
> >    forward progress can take the form of successful interaction with
the
> >    storage devices or of sub-portions of the layout being returned
by
> >    the client.  The server can also limit exposure to these problems
by
> >    limiting the byte-ranges initially provided in the layouts and
thus
> >    the amount of outstanding modified data.
> >
> > That adds a lot complexity to the server for no good reason.  If you
start by telling the client
> to write every single dirty block covered by a layout recall before
returning the layout, then you
> are going to run into problems like this.
> >
> > I think there are a number of choices:
> >
> >    1) Say you MUST NOT do IO on layouts being recalled, in which
> >       case none of this problem arises.  I take it this is what
> >       Trond is arguing for.
> >
> >    2) Say you MAY write some dirty data on layouts being recalled
> >       but you should limit this attempt to optimize use of layouts
> >       to avoid unduly delaying layout recalls.
> >
> >    3) Say clients MAY write large amounts of dirty data and server
> >       will generally accommodate them in using pNFS to do IO this
> >       way.
> >
> > Maybe the right approach is to have whichever of these is to be in
effect be chosen on a per-
> mapping-type basis, perhaps based on clora_changed
>
> I think this is the right approach as the blocks and objects layout
types may
> use topologies for which flushing some data to fill, e.g. an allocated
block
> on disk or a RAID stripe makes sense.
>
> >
> > I think the real problem is the suggestion that there is some reason
that a client has to write
> every single dirty block within the scope of the CB_LAYOUTRECALL, i.e.
that this is analogous to
> DELEGRETURN.
> >
> >    4*) Clients SHOULD write all dirty data covered by the recalled
> >        layout before return it.
> >
> > It may be that you can write faster this way, but it also mean that
the server may wait a while to
> get the layout back and this may delay other clients.  There is the
further problem that it means
> that your set of dirty blocks can be much smaller than it would be
otherwise and this can hurt
> performance.  I don't think that should be a valid choice.
>
> We can live without it, as long there is no hard requirement for the
client to
> not flush any dirty data upon CB_LAYOUTRECALL (option 1 above).
>
> Benny
>
> >
> >
> >
> > -----Original Message-----
> > From: sfaibish [mailto:sfaibish@emc.com]
> > Sent: Tuesday, November 02, 2010 1:06 PM
> > To: Trond Myklebust; Noveck, David
> > Cc: bhalevy@panasas.com; jglasgow@aya.yale.edu; nfsv4@ietf.org
> > Subject: Re: [nfsv4] Write-behind caching
> >
> > On Tue, 02 Nov 2010 10:18:02 -0400, Trond Myklebust
> > <trond.myklebust@fys.uio.no> wrote:
> >
> >> Hi Dave,
> >>
> >> So, while I largely agree with your points 1-6, I'd like to add
> >>
> >> 0) Layouts are not a tool for enforcing cache consistency!
> >>
> >> While I agree that doing safe read-modify-write in the block case
is an
> >> important feature, I don't see any agreement anywhere in RFC5661
that we
> >> should be providing stronger caching semantics than we used to
provide
> >> prior to adding pNFS to the protocol. I have no intention of
allowing a
> >> Linux client implementation that provides such stronger semantics
until
> >> we write that sort of thing into the spec and provide for similar
> >> stronger semantics in the non-pNFS case.
> >>
> >> With that in mind, I have the following comments:
> >>
> >>       * I see no reason to write data back when the server recalls
the
> >>         layout. While I see that you could argue (1) implies that
you
> >>         should try to write stuff while you still hold a layout,
the
> >>         spec says that clora_changed==FALSE implies you can get
that
> >>         layout back later. In the case where clora_changed==TRUE,
you
> >>         might expect the file would be unavailable for longer, but
the
> >>         spec says you shouldn't write stuff back in that case...
> >>       * While this may lead to layout bouncing between clients
and/or
> >>         the server, the clients do have the option of detecting
this,
> >>         and choosing write through MDS to improve efficiency.
Grabbing
> >>         the layout, and blocking others from accessing the data
while
> >>         you write is not a scalable solution even if you do believe
> >>         there is a valid scenario for this behaviour.
> >>       * Basically, it comes down to the fact that I want to write
back
> >>         data when my memory management heuristics require it, so
that I
> >>         cache data as long as possible. I see no reason why server
> >>         mechanics should dictate when I should stop caching (unless
we
> >>         are talking about a true cache consistency mechanism).
> > OK. Now I think I understand your point and we might still require
some
> > changes in the interpretation and perhaps some language in the 5661.
But
> > I have a basic question about fencing. Do we think there is any
possibility
> > of data corruption when the DS fence the I/Os very fast after the
> > layoutrecall.
> > If we can find such possibility we probably need to mention this in
the
> > protocol
> > and recommend how to prevent such a case. For example MDS sends a
> > layoutrecall
> > and immediately (implementation decision of the server) it force the
> > fencing
> > on the DS while waiting for the return or after receiving ack from
the
> > client
> > for the layoutrecall. (I might be out of order here but I just want
to be
> > sure
> > this is not the case).
> >
> > /Sorin
> >
> >
> >
> >>
> >> So my choices for Q1 and Q2 are still (A) and (A).
> >>
> >> Cheers
> >>   Trond
> >>
> >> On Mon, 2010-11-01 at 13:49 -0400, david.noveck@emc.com wrote:
> >>> I think that want to address this issue without using the words
> >>> "unrealistic" or "optimal".  Things that you think are unrealistic
> >>> sometimes, in some sorts of environments, turn out to be common.
Trying
> >>> to decide what approaches are optimal are also troubling.  In
different
> >>> situations, different approaches may be better or worse.  The
protocol
> >>> needs to define the rules that the client and server have to obey
and
> >>> they may make choices that result in results from optimal to
pessimal.
> >>> We can make suggestions on doing things better but in unusual
situations
> >>> the performance considerations may be different.  The point is we
have
> >>> to be clear when the client can and can't do A and similarly for B
and
> >>> sometimes the choice of A or B is simply up to the client.
> >>>
> >>> So I'm going to put down the following numbered propositions and
we'll
> >>> see where people disagree  with me.  Please be specific.  I'm
going to
> >>> assume the anything you don't argue with numerically below the
point of
> >>> disagreement is something we can agree on.
> >>>
> >>> 1) Normally, when a client is writing data covered by a layout, it
may
> >>> write using the layout or to the MDS, but unless there is a
particular
> >>> reason (e.g. slow or inconsistent response using the layout), it
SHOULD
> >>> write using the layout.
> >>>
> >>> 2) When a layout is recalled, the protocol itself does not nor
should it
> >>> require that dirty blocks in the cache be written before returning
the
> >>> layout.  If a client chooses to do writes using the recalled
layout, it
> >>> is doing so as an attempt to improve performance, given its
judgment of
> >>> the relative performance of IO using the layout and IO through the
MDS.
> >>>
> >>> 3) Particularly in the case in which clora_changed is 0, clients
MAY
> >>> choose to take advantage of the higher-performance layout path to
write
> >>> that data, while it is available.  However, since doing that
delays the
> >>> return of the layout, it is possible that by delaying the return
of the
> >>> layout, performance of others waiting for the layout may be
reduced.
> >>>
> >>> 4) When writing of dirty blocks is done using a layout being
recalled,
> >>> the possibility exists that the layout will be revoked before all
the
> >>> blocks are successfully written.  The client MUST be prepared to
rewrite
> >>> those dirty blocks whose layouts write failed to the MDS in such
cases.
> >>>
> >>> 5) Clients that want to write dirty blocks associated with
recalled
> >>> layouts MAY choose to restrict the size of the set of dirty blocks
they
> >>> keep in order to make it relatively unlikely that the layout will
be
> >>> revoked during recall.  On the other hand, for applications, in
which
> >>> having a large set of dirty blocks in the cache reduces the IO
actually
> >>> done, such restriction may result in poorer performance, even
though the
> >>> specific IO path used is more performant.
> >>>
> >>> 6) Note that if a large set of dirty blocks can be kept by the
client
> >>> when a layout is not held, it should be possible to keep a set
that at
> >>> least that size a set of dirty blocks when a layout is held.  Even
if
> >>> the client should choose to write those blocks as part of the
layout
> >>> recall, any that it is not able to write in an appropriate time,
will be
> >>> a subset of an amount which, by hypothesis, can be appropriately
held
> >>> when the only means of writing them is to the MDS.
> >>>
> >>> Another way of looking at this is that we have the following
questions
> >>> which I'm going to present as multiple choice.  I have missed a
few
> >>> choices but,
> >>>
> >>> Q1) When a recall of a layout occurs what do you about dirty
blocks?
> >>>     A) Nothing.  The IO's to write them are like any other IO and
> >>>        you don't do IO using layout under recall.
> >>>     B) You should write all dirty blocks as part of the recall if
> >>>        clora_changed is 0.
> >>>     C) You should do (A), but have the option of doing (B) but you
> >>>        are responsible for the consequences.
> >>>
> >>> Q2) How many dirty blocks should you keep covered by a layout?
> >>>     A) As many as you want.  It doesn't matter.
> >>>     B) A small number so that you can be sure that they can be
> >>>        written as part of layout recall (Assuming Q1=B).
> >>>     C) If there is a limit, it must be at least as great as the
limit
> >>>        that would be in effect if there is no layout present,
since
> >>>        that number is OK, once the layout does go back.
> >>>
> >>> There are pieces of the spec that are assuming (A) and the answer
to
> >>> these and pieces assuming (B).
> >>>
> >>> So I guess I was arguing before that the answers to Q1 and Q2
should be
> >>> (A).
> >>>
> >>> My understanding is that Benny is arguing for (B) as the answer to
Q1
> >>> and Q2.
> >>>
> >>> So I'm now willing to compromise slightly and answer (C) to both
of
> >>> those, but I think that still leaves me and Benny quite a ways
apart.
> >>>
> >>> I'm not sure what Trond's answer is, but I'd interested in
understanding
> >>> his view in terms of (1)-(6) and Q1 and Q2.
> >>>
> >>>
> >>> -----Original Message-----
> >>> From: Trond Myklebust [mailto:trond.myklebust@fys.uio.no]
> >>> Sent: Friday, October 29, 2010 6:04 PM
> >>> To: faibish, sorin
> >>> Cc: Noveck, David; bhalevy@panasas.com; jglasgow@aya.yale.edu;
> >>> nfsv4@ietf.org
> >>> Subject: Re: [nfsv4] Write-behind caching
> >>>
> >>> On Fri, 2010-10-29 at 17:48 -0400, Trond Myklebust wrote:
> >>>> On Fri, 2010-10-29 at 17:32 -0400, sfaibish wrote:
> >>>>> On Fri, 29 Oct 2010 13:39:55 -0400, Trond Myklebust
> >>>>> <trond.myklebust@fys.uio.no> wrote:
> >>>>>
> >>>>>> On Fri, 2010-10-29 at 13:20 -0400, david.noveck@emc.com wrote:
> >>>>>>> There are two issues here with regard to handling of layout
> >>> recall.
> >>>>>>>
> >>>>>>> One is with regard to in-flight IO.  As Benny points out, you
> >>> cannot be
> >>>>>>> sure that the in-flight IO can be completed in time to avoid
the
> >>> MDS
> >>>>>>> losing patience.  That should rarely be the case though, if
> >>> things are
> >>>>>>> working right.  The client has to be prepared to deal with IO
> >>> failures
> >>>>>>> due to layout revocation.  Any IO that was in flight and
failed
> >>> because
> >>>>>>> of layout revocation will need to be handled by being reissued
to
> >>> the
> >>>>>>> MDS.  Is there anybody that disagrees with that?
> >>>>>>>
> >>>>>>> The second issue concerns IO not in-flight (in other words,
not
> >>> IO's
> >>>>>>> yet but potential IO's) when the recall is received.  I just
> >>> don't see
> >>>>>>> that it reasonable to start IO's using layout segments being
> >>> recalled
> >>>>>>> (whether for dirty buffers or anything else).  Doing IO's to
the
> >>> MDS is
> >>>>>>> fine but there is no real need for the layout recall to
specially
> >>>
> >>>>>>> trigger them, whether clora_changed is set or not.
> >>>>>>
> >>>>>> This should be _very_ rare. Any cases where 2 clients are
trying
> >>> to do
> >>>>>> conflicting I/O on the same data is likely to be either a
> >>> violation of
> >>>>>> the NFS cache consistency rules, or a scenario where it is in
any
> >>> case
> >>>>>> more efficient to go through the MDS (e.g. writing to adjacent
> >>> records
> >>>>>> that share the same extent).
> >>>>> Well this is a different discussion: what was the reason for the
> >>> recall in
> >>>>> the first place. This is one usecase but there could be other
> >>> usecases
> >>>>> for the recall and we discuss here how to implement the protcol
more
> >>> than
> >>>>> how to solve a real problem. My 2c
> >>>>
> >>>> I strongly disagree. If this is an unrealistic scenario, then we
don't
> >>>> have to care about devising an optimal strategy for it. The
'there
> >>> could
> >>>> be other usecases' scenario needs to be fleshed out before we can
deal
> >>>> with it.
> >>>
> >>> To clarify a bit what I mean: we MUST devise optimal strategies
for
> >>> realistic and useful scenarios. It is entirely OPTIONAL to devise
> >>> optimal strategies for unrealistic ones.
> >>>
> >>> If writing back all data before returning the layout causes
protocol
> >>> issues because the server cannot distinguish between a bad client
and
> >>> one that is waiting for I/O to complete, then my argument is that
we're
> >>> in the second case: we don't have to optimise for it, and so it is
safe
> >>> for the server to assume 'bad client'...
> >>>
> >>>    Trond
> >>>
> >>>
> >>
> >>
> >>
> >>
> >>
> >>
> >
> >
> >
