Re: [nfsv4] Write-behind caching

<david.noveck@emc.com> Fri, 12 November 2010 06:19 UTC

Return-Path: <david.noveck@emc.com>
X-Original-To: nfsv4@core3.amsl.com
Delivered-To: nfsv4@core3.amsl.com
Received: from localhost (localhost [127.0.0.1]) by core3.amsl.com (Postfix) with ESMTP id E140C3A68C8 for <nfsv4@core3.amsl.com>; Thu, 11 Nov 2010 22:19:56 -0800 (PST)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -6.599
X-Spam-Level:
X-Spam-Status: No, score=-6.599 tagged_above=-999 required=5 tests=[BAYES_00=-2.599, RCVD_IN_DNSWL_MED=-4]
Received: from mail.ietf.org ([64.170.98.32]) by localhost (core3.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id CCQFDcMB5MQk for <nfsv4@core3.amsl.com>; Thu, 11 Nov 2010 22:19:46 -0800 (PST)
Received: from mexforward.lss.emc.com (mexforward.lss.emc.com [128.222.32.20]) by core3.amsl.com (Postfix) with ESMTP id 0C6063A67AD for <nfsv4@ietf.org>; Thu, 11 Nov 2010 22:19:45 -0800 (PST)
Received: from hop04-l1d11-si04.isus.emc.com (HOP04-L1D11-SI04.isus.emc.com [10.254.111.24]) by mexforward.lss.emc.com (Switch-3.4.3/Switch-3.4.3) with ESMTP id oAC6KFwd020083 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Fri, 12 Nov 2010 01:20:15 -0500
Received: from mailhub.lss.emc.com (mailhub.lss.emc.com [10.254.222.226]) by hop04-l1d11-si04.isus.emc.com (RSA Interceptor); Fri, 12 Nov 2010 01:20:12 -0500
Received: from corpussmtp5.corp.emc.com (corpussmtp5.corp.emc.com [128.221.166.229]) by mailhub.lss.emc.com (Switch-3.4.3/Switch-3.4.3) with ESMTP id oAC6JgRF007758; Fri, 12 Nov 2010 01:19:42 -0500
Received: from CORPUSMX50A.corp.emc.com ([128.221.62.39]) by corpussmtp5.corp.emc.com with Microsoft SMTPSVC(6.0.3790.4675); Fri, 12 Nov 2010 01:19:41 -0500
X-MimeOLE: Produced By Microsoft Exchange V6.5
Content-class: urn:content-classes:message
MIME-Version: 1.0
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: quoted-printable
Date: Fri, 12 Nov 2010 01:19:37 -0500
Message-ID: <BF3BB6D12298F54B89C8DCC1E4073D80029D1EE1@CORPUSMX50A.corp.emc.com>
In-Reply-To: <4CDBFF18.9030306@panasas.com>
X-MS-Has-Attach:
X-MS-TNEF-Correlator:
Thread-Topic: [nfsv4] Write-behind caching
Thread-Index: AcuBrtjs8gFFvb7XQfmJF7TszAxcrAAfAZww
References: <BF3BB6D12298F54B89C8DCC1E4073D80028C76DB@CORPUSMX50A.corp.emc.com> <op.vlcwr1zqunckof@usensfaibisl2e.eng.emc.com> <1288388933.3701.47.camel@heimdal.trondhjem.org> <1288389823.3701.59.camel@heimdal.trondhjem.org> <BF3BB6D12298F54B89C8DCC1E4073D80029446BC@CORPUSMX50A.corp.emc.com> <1288707482.2925.44.camel@heimdal.trondhjem.org> <op.vljy4pqaunckof@usensfaibisl2e.eng.emc.com> <BF3BB6D12298F54B89C8DCC1E4073D8002944A75@CORPUSMX50A.corp.emc.com> <4CD17BFE.8000707@panasas.com> <7C4DFCE962635144B8FAE8CA11D0BF1E03D58B97E2@MX14A.corp.emc.com> <BF3BB6D12298F54B89C8DCC1E4073D8002944F1E@CORPUSMX50A.corp.emc.com> <7C4DFCE962635144B8FAE8CA11D0BF1E03D59C159C@MX14A.corp.emc.com> <BF3BB6D12298F54B89C8DCC1E4073D8002945151@CORPUSMX50A.corp.emc.com> <4CDA9CDD.6010406@panasas.com> <BF3BB6D12298F54B89C8DCC1E4073D80029D1CFC@CORPUSMX50A.corp.emc.com> <1289448155.2001.25.camel@heimdal.trondhjem.org> <BF3BB6D12298F54B89C8DCC1E4073D80029D1D25@CORPUSMX50A.corp.emc.com> <4CDBFF18.9030306@panasas.com>
From: david.noveck@emc.com
To: bhalevy@panasas.com
X-OriginalArrivalTime: 12 Nov 2010 06:19:41.0479 (UTC) FILETIME=[975F1370:01CB8231]
X-EMM-MHVC: 1
X-EMM-MFVC: 1
Cc: Trond.Myklebust@netapp.com, nfsv4@ietf.org
Subject: Re: [nfsv4] Write-behind caching
X-BeenThere: nfsv4@ietf.org
X-Mailman-Version: 2.1.9
Precedence: list
List-Id: NFSv4 Working Group <nfsv4.ietf.org>
List-Unsubscribe: <https://www.ietf.org/mailman/listinfo/nfsv4>, <mailto:nfsv4-request@ietf.org?subject=unsubscribe>
List-Archive: <http://www.ietf.org/mail-archive/web/nfsv4>
List-Post: <mailto:nfsv4@ietf.org>
List-Help: <mailto:nfsv4-request@ietf.org?subject=help>
List-Subscribe: <https://www.ietf.org/mailman/listinfo/nfsv4>, <mailto:nfsv4-request@ietf.org?subject=subscribe>
X-List-Received-Date: Fri, 12 Nov 2010 06:19:57 -0000

I don't see that you are amortizing the cost of the layout recall.  If
you don't have to write the data with the recall, then the cost of the
layout recall is simply kept low.  The issue then is reducing the cost
of writing the dirty data, and amortizing the cost of doing that.

> This is good, but it seems to be going a bit outside the scope 
> of the protocol definition.  I'd rather define the generic framework 
> and let the implementations make their choices within it without 
> having to constrain both the client and the server differently 
> per layout-type.  This will be an interoperability nightmare.

I don't see the interoperability nightmare.  Each layout type would have
clear rules (in the spec or at least errata) for the client and server,
and the client and server have to know and agree on what the layout type
is.

The problem is that if you don't have separate per-layout-type rules,
then you have to have one set of rules for all layout types, and I'm not
sure we can do that and satisfy everybody.  If we can, that would be
great.  

The problem is that if you let the client and server implementations
make their choices in this regard, you have a difficulty.  If the rules
are tight, you may run into the issue of doing something for one layout
type that doesn't make sense for others.  Of particular concern is pNFS
file, where server and client implementations would like to treat the
layout as a unit and not mess with layout ranges.  Other layout types,
with good reason, take a different approach to this issue and may have
different trade-offs.  Another issue where layout characteristics have a
major effect on choices in this regard is that of write layouts, which
can easily be granted to multiple clients simultaneously.

The real issue is that if, as a result of this and similar issues, you
create a loose set of rules for the client and server, with each of them
making their own independent choices, that is when you have an
interoperability nightmare.  As you describe how you would like client
and server to interact, you have to realize that if the rules are loose,
that vision may not be realized, as the client and server each make
their own choices on the various MAY's and SHOULD's.  That's why I feel
it is most likely that having per-layout-type rules will enable you to
have sufficiently tight rules for each layout type to avoid
interoperability issues, without forcing a layout type into patterns
inappropriate for it.

> The generic rules as I see them are:

> The client SHOULD [MUST?] return the layout in 
> response to CB_LAYOUTRECALL within the specified 
> lease period.

SHOULD or MUST?

Given what you say below about showing forward progress, you are clearly
not saying "MUST".  In fact, I don't see that you are really saying
"SHOULD", since you indicate that the server should be constructed to
work with someone not doing this, and in fact not doing it for a reason:
to optimize IO in a specific environment.

> The server MAY revoke the layout after one lease 
> period and fence off the client from the respective 
> storage devices if the client has not done so.

But if you look below, it sure seems like you are assuming that the
server should not do this.  

> The client MUST NOT return the layout while I/O 
> requests started using that area of the layout 
> are in flight.  Therefore the client SHOULD limit 
> its use of the layout so as to make sure that the 
> recalled layout will be reliably returned within 
> the specific period (such as the lease period) 
> under normal hardware conditions.

"the specific period (such as the lease period)"?

How is that different (and why) from simply "the lease period"?  How are
the client and the server to agree on this "specific period."

> In case clora_changed is true, the client SHOULD 
> NOT initiate new I/O operations before returning the 
> layout.  Otherwise, the client MAY initiate new I/O 
> operations in the recalled range, e.g., in the interest 
> of a more efficient handling of IO associated with data 
> caching, particularly the writing of dirty blocks generated
> as part of a policy of aggressive write-behind caching.  The 
> policy of doing so is outside the scope of this specification.
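
Read literally, that rule amounts to something like the following sketch
of a client recall handler (plain C; the types and helper names --
recall_args, flush_dirty_via_layout, and so on -- are made up purely for
illustration, not taken from the spec or any implementation):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical, simplified view of the CB_LAYOUTRECALL arguments. */
    struct recall_args {
        bool     clora_changed;   /* server expects the layout to change */
        uint64_t offset;
        uint64_t length;
    };

    /* Stubs standing in for the real client machinery. */
    static void wait_for_inflight_io(uint64_t off, uint64_t len)
    {
        printf("drain in-flight I/O in [%llu, +%llu)\n",
               (unsigned long long)off, (unsigned long long)len);
    }

    static void flush_dirty_via_layout(uint64_t off, uint64_t len)
    {
        printf("write dirty blocks in [%llu, +%llu) using the layout\n",
               (unsigned long long)off, (unsigned long long)len);
    }

    static void send_layoutreturn(uint64_t off, uint64_t len)
    {
        printf("LAYOUTRETURN [%llu, +%llu)\n",
               (unsigned long long)off, (unsigned long long)len);
    }

    static void handle_layoutrecall(const struct recall_args *a)
    {
        /* MUST NOT return while I/O using the recalled area is in flight. */
        wait_for_inflight_io(a->offset, a->length);

        if (!a->clora_changed) {
            /* MAY start new I/O, e.g. write-behind data, before returning. */
            flush_dirty_via_layout(a->offset, a->length);
        }
        /* If clora_changed is true: SHOULD NOT start new I/O; just return. */

        send_layoutreturn(a->offset, a->length);
    }

    int main(void)
    {
        struct recall_args a = { false, 0, 1u << 20 };
        handle_layoutrecall(&a);
        return 0;
    }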

I think you are tying yourself in knots here.

It says above (and presumably within this specification) that the layout
SHOULD be returned within the lease period (or the specific period), yet
here the client MAY start new I/O that can push the return well past
that period.

> For example, the layout type, the particular file layout, 
> and possibly implementation-specific tunable parameters 
> may determine the client's behavior.

I thought you said I was creating an interoperability problem by having
rules per layout type.  Now you are having separate rules based on
layout type and many other things; they aren't in the spec but rather
in some sort of definition of implementation-specific parameters.
That's the recipe for an interoperability nightmare.

> In the case above, the client SHOULD show forward 
> progress by returning sub-ranges of the recalled range 
> as I/O operations in the returned range complete.
> This may allow the server to start using the returned 
> ranges early on.  In case the client does not return 
> the recalled layout within one lease period it MUST 
> send LAYOUTRETURN for the range no longer in use.

> The server SHOULD NOT revoke the layout while the 
> client is showing forward progress in returning the 
> recalled layout.
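
To make the forward-progress idea concrete, here is a rough sketch of a
client returning the recalled range piecewise as its I/O drains.  As
before, this is only an illustration; the subrange list and the helpers
(wait_for_io, send_layoutreturn) are hypothetical, not anything defined
by the protocol documents:

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical sub-range of a recalled layout with pending I/O. */
    struct subrange {
        uint64_t offset;
        uint64_t length;
        int      inflight;   /* outstanding I/O requests against this range */
    };

    /* Stubs for the real I/O and RPC machinery. */
    static void wait_for_io(struct subrange *r)
    {
        r->inflight = 0;      /* block until the range has no I/O in flight */
    }

    static void send_layoutreturn(uint64_t off, uint64_t len)
    {
        printf("LAYOUTRETURN [%llu, +%llu)\n",
               (unsigned long long)off, (unsigned long long)len);
    }

    /*
     * Return the recalled range piecewise: each sub-range goes back as
     * soon as its in-flight I/O completes, so the server sees at least
     * one partial LAYOUTRETURN per lease period rather than one big
     * return at the very end.
     */
    static void return_recalled_layout(struct subrange *ranges, int n)
    {
        for (int i = 0; i < n; i++) {
            wait_for_io(&ranges[i]);
            send_layoutreturn(ranges[i].offset, ranges[i].length);
        }
    }

    int main(void)
    {
        struct subrange r[2] = {
            { 0,        1u << 20, 3 },
            { 1u << 20, 1u << 20, 1 },
        };
        return_recalled_layout(r, 2);
        return 0;
    }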

I think the basic problem is that you are trying to accommodate multiple
inconsistent rules:

    The client should return the layout immediately
    or at least within the lease period and the server 
    will revoke it if it doesn't.

    The client will write all the data and do partial
    returns, at least one per lease period, and the 
    server will accept this sort of progress for an
    unbounded period of time.

You are pretending you have one set of rules, with various SHOULD's, to
try to jam them together.  I think it is just simpler to allow the
mapping types to each have the rules they want and need.

One thing I note is that there is no longer any mention of the server
being responsible for limiting the set of dirty data by limiting the
layouts it grants.  And that's a positive thing.
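
For reference, the server side of that bargain -- wait one lease period,
then revoke and fence unless the client is visibly making progress --
reduces to something like the sketch below.  The progress test and the
fencing call are stand-ins; what counts as "progress" is exactly the
loose part being argued about:

    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical per-recall state kept by the metadata server. */
    struct recall_state {
        unsigned leases_elapsed;       /* lease periods since CB_LAYOUTRECALL */
        bool     fully_returned;       /* whole recalled range is back */
        bool     partial_return_seen;  /* a sub-range came back this period */
        bool     ds_io_seen;           /* data-server I/O observed this period */
    };

    static void fence_client(void)  { printf("fence client at the storage devices\n"); }
    static void revoke_layout(void) { printf("revoke the recalled layout\n"); }

    /* Called once per lease period while the recall is outstanding. */
    static void check_recall(struct recall_state *s)
    {
        if (s->fully_returned)
            return;

        if (s->leases_elapsed < 1)
            return;     /* MUST wait one lease period before acting */

        /* "Forward progress" -- partial returns or data-server I/O --
           buys the client another lease period. */
        if (s->partial_return_seen || s->ds_io_seen) {
            s->partial_return_seen = false;
            s->ds_io_seen = false;
            return;
        }

        revoke_layout();
        fence_client();
    }

    int main(void)
    {
        struct recall_state s = { 1, false, true, false };
        check_recall(&s);   /* progress seen: no revoke */
        check_recall(&s);   /* no progress this time: revoke and fence */
        return 0;
    }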


-----Original Message-----
From: Benny Halevy [mailto:bhalevy@panasas.com] 
Sent: Thursday, November 11, 2010 9:35 AM
To: Noveck, David
Cc: Trond.Myklebust@netapp.com; nfsv4@ietf.org
Subject: Re: [nfsv4] Write-behind caching

On 2010-11-11 09:39, david.noveck@emc.com wrote:
>> The problem here from my perspective is the more general issue that
we
>> have normative text that prescribes a particular cache consistency
>> policy for use with pNFS. 
> 
> My understanding is that David Black's and Benny Halevy's position is
> that there isn't such text and that the text I'm referring to is not
> saying that all of the cached data has to be written before returning
> the layout.  As I understand their position, they are saying that
> writing the data with the recalled layout before it is returned is
> merely a performance-based choice that a client may make in the
> interests of its own performance, 

Correct. It's not about cache consistency, but rather amortizing
the cost of layoutrecall.

> although I think you may find that the
> efficiency of other clients may be hurt by such a client policy.

Sacrificing some latency (by reducing the context-switch overhead) can
pay off in aggregate throughput, pretty much like doing read-ahead on
the disk platter.  Using this example, you want to tune your read-ahead
size dynamically in response to the disk queue length, but would you
interrupt a read-ahead on every read request that enters the queue?
The latter choice may render your I/O pattern completely random.

> 
> The text certainly seemed to me to go farther than simply saying one
> might want to write some of the data with the layout being recalled,
> even starting a considerable amount of new IO's.  As I noted, the
> mention of delegations, where we do have a true cache consistency
issue
> suggests that the text is either a normative description of a cache
> consistency policy with pNFS or something that can too easily be
> confused with such.
> 
> The first question is whether we agree that there is no cache
> consistency issue.  In this case dirty data within the area where the
> layout is recalled may be written after the recall, either through the
> MDS or after regetting a layout for the area in question.  

Agreed.

> If we do
> agree on this, we have the issue of how to clean up and clarify the
text
> so that we have something that is not an interoperability nightmare.
> That last is not trivial but I think we can work the issue out.
> 
> So assuming that we do agree on this, how about the following for the
> final paragraph of 12.5.5?
> 
>    The specific rules governing the process by which a client should
>    proceed to return recalled layouts are specified by text specific
>    to the layout type for the recalled layout.  Clients may not return
>    the layout while I/O requests started using that area of the layout
>    are in flight.  In addition, layout types may allow use of the
>    recalled layout in the interests of a more efficient handling of IO
>    associated with data caching, particularly the writing of dirty
>    blocks generated as part of a policy of aggressive write-behind
>    caching.  Such rules will define:
>    
>    o  Whether the generation of new IO's to write dirty data is
>       expected, or whether the low-latency return of the recalled
>       layout is of primary significance.
> 
>    o  Whether the client should limit its use of the layout so as to
>       make sure that the recalled layout will be reliably returned
>       within a specific period (such as the lease period) under normal
>       hardware conditions (or whether it should make maximum use of 
>       the layout to efficiently write the dirty data the primary
>       consideration).
> 
>    o  In the case above in which maximum use of the layout is the 
>       primary consideration, whether it is the job of the client to
>       limit the amount of dirty data (and start writing before too
>       much dirty data accumulates) or whether it is the job of the 
>       server to limit the dirty data, by limiting the layout ranges
>       granted for writing.
>    
>    o  Whether the server SHOULD or MUST accept progress toward
>       returning the layout when the recall goes beyond the lease
>       period, and what measures should be accepted in this regard
>       (e.g., partial layout returns, data server IO's).
> 
>    The layout type may make some choices listed above conditional on 
>    the value of clora_changed in the CB_LAYOUTRECALL4args structure.
>    Whether the choices selected are REQUIRED or RECOMMENDED is also
>    to be decided by the mapping type.

This is good, but it seems to be going a bit outside the scope of the
protocol definition.  I'd rather define the generic framework and let
the implementations make their choices within it without having to
constrain both the client and the server differently per layout-type.
This will be an interoperability nightmare.

The generic rules as I see them are:

The client SHOULD [MUST?] return the layout in response to
CB_LAYOUTRECALL within the specified lease period.

The server MAY revoke the layout after one lease period and fence off
the client from the respective storage devices if the client has not
done so.

The client MUST NOT return the layout while I/O requests started using
that area of the layout are in flight.  Therefore the client SHOULD
limit its use of the layout so as to make sure that the recalled layout
will be reliably returned within the specific period (such as the lease
period) under normal hardware conditions.

In case clora_changed is true, the client SHOULD NOT initiate new I/O
operations before returning the layout.  Otherwise, the client MAY
initiate new I/O operations in the recalled range, e.g., in the interest
of a more efficient handling of IO associated with data caching,
particularly the writing of dirty blocks generated as part of a policy
of aggressive write-behind caching.  The policy of doing so is outside
the scope of this specification.  For example, the layout type, the
particular file layout, and possibly implementation-specific tunable
parameters may determine the client's behavior.

In the case above, the client SHOULD show forward progress by returning
sub-ranges of the recalled range as I/O operations in the returned range
complete.  This may allow the server to start using the returned ranges
early on.  In case the client does not return the recalled layout within
one lease period it MUST send LAYOUTRETURN for the range no longer in
use.

The server SHOULD NOT revoke the layout while the client is showing
forward progress in returning the recalled layout.

[I think it is reasonable to require the client to fully return the
recalled layout
within one lease period and simplify the "layout lease extension" case,
but I'd still leave the recommendation to show forward progress within
the lease
period (which may be pretty long) to provide for getting a pipeline
going]

Benny

>    
> Within this framework, what is in the spec now is very hand-wavy and
> essentially says:
> 
>    For all mapping types the rules are:
> 
>    o  Whether the generation of new IO's to write dirty data is  
>       expected or the prompt return of layout is of primary 
>       significance is a choice to be made independently by each
>       client.
> 
>    o  Whether the client should limit its use of the layout so as to
>       make sure that the recalled layout will be reliably returned
>       within a specific period (such as the lease period) under normal
>       hardware conditions (or whether it should make maximum use of 
>       the layout to efficiently write the dirty data the primary
>       consideration) is a choice to be made independently by each
>       client.
> 
>    o  In the case above in which maximum use of the layout is the 
>       primary consideration, whether it is the job of the client to
>       limit the amount of dirty data (and start writing before too
>       much dirty data accumulates) or whether it is the job of the 
>       server to limit the dirty data, by limiting the layout ranges
>       granted for writing is a choice to be made independently by 
>       the client and server, leaving open the possibility that neither
>       will accept responsibility.
>    
>    o  There is no specification that the server SHOULD or MUST accept 
>       progress toward returning the layout when recall goes beyond 
>       a single lease period.  Instead, each server may independently 
>       decide to do so and may choose on its own what measures should
>       be accepted in this regard (e.g. partial layout returns, data
>       server IO's).
> 
> -----Original Message-----
> From: nfsv4-bounces@ietf.org [mailto:nfsv4-bounces@ietf.org] On Behalf
> Of Trond Myklebust
> Sent: Wednesday, November 10, 2010 11:03 PM
> To: Noveck, David
> Cc: bhalevy@panasas.com; nfsv4@ietf.org
> Subject: Re: [nfsv4] Write-behind caching
> 
> 
> The problem here from my perspective is the more general issue that we
> have normative text that prescribes a particular cache consistency
> policy for use with pNFS. That not only makes it unnecessarily harder
to
> switch between non-pNFS and pNFS mode, but it also adds unnecessary
> complexity for no good reason.
> The fact that there may be contention for a layout range is not
telling
> you that you need to change caching policy; it is telling you that you
> need to add locking, and be more careful about what ranges you ask
for.
> 
> I'd be quite happy with a section describing what to do when conflicts
> occur in the presence of locks due to clients requesting overly large
> layout ranges. That seems to be missing.
> However any discussion of cache consistency policy changes when using
> layouts should not be normative and should therefore really have been
> weeded out when we did the pNFS review.
> 
> Trond
> 
> 
> On Wed, 2010-11-10 at 21:16 -0500, david.noveck@emc.com wrote:
>>> Maybe the spec is not coherent about this, 
>>
>> Exactly.  That's the problem.
>>
>>> but we specifically say "may" in section 20.3, not "must".
>>
>> Section 12.5.5 doesn't say "must" either, but it seems that it
assumes
>> something pretty close.
>>
>>>  Although pNFS does not alter the file data caching capabilities of
>>>  clients, or their semantics, it recognizes that some clients may
>>>  perform more aggressive write-behind caching to optimize the
> benefits
>>>  provided by pNFS.  However, write-behind caching may negatively
>>>  affect the latency in returning a layout in response to a
>>>  CB_LAYOUTRECALL; this is similar to file delegations and the impact
>>>  that file data caching has on DELEGRETURN.  
>>
>> The word "MUST" or "must" is not used but those would apply to the
> case
>> of DELEGRETURN.  That definitely gives the impression that you are
>> claiming that someone has to write all the dirty data before
returning
>> the layout.  If everybody accepts that that is not the case, then we
> can
>> move on. 
>>
>>>  Client implementations
>>>  SHOULD limit the amount of unwritten data they have outstanding at
>>>  any one time in order to prevent excessively long responses to
>>>  CB_LAYOUTRECALL.  
>>
>> First it says "[pNFS] clients may perform more aggressive
write-behind
>> caching" but then says they "SHOULD limit the amount of unwritten
> data".
>> This is contradictory.  More aggressive write-behind caching means
> you
>> keep more unwritten data but then it says you should limit it,
>> presumably to a quantity that is less than non-PNFS clients can have.
>> This is where this is getting incoherent.
>>
>> But the big thing is that "SHOULD limit" only makes sense if the
> client
>> "will" or "SHOULD" write the dirty data before returning the layout.
> If
>> it was only "may", you would say something like:
>>
>>     Client implementations SHOULD limit the amount of unwritten 
>>     data they write using a pending recalled layout.
>>
>> The fact that it doesn't, suggests to me that this paragraph is going
>> beyond saying the client "may" write the data with the recalled
layout
>> and assuming that it will in fact do so, as is also suggested by the
>> DELEGRETURN reference.  If it were assumed clients could indeed
really
>> not write this dirty data when recalled, it would be important that
>> those clients were not hurt by those that did.
>>
>>>  Once a layout is recalled, a server MUST wait one
>>>  lease period before taking further action.  As soon as a lease
> period
>>>  has passed, the server may choose to fence the client's access to
> the
>>>  storage devices if the server perceives the client has taken too
> long
>>>  to return a layout.  However, just as in the case of data
> delegation
>>>  and DELEGRETURN, the server may choose to wait, given that the
> client
>>>  is showing forward progress on its way to returning the layout.  
>>
>> Now we are creating interoperability problems.  The client "may" do
> all
>> this IO and the server  "may" choose to wait.  Now we have four
>> possibilities to test and some of them, such as that the client is
>> writing a lot of data and the server is not choosing to wait, are not
>> very pleasant.  I think the problem is that if you say the client may
> do
>> something, you are essentially saying that the server has to deal
with
>> it.
>>
>>>  This
>>>  forward progress can take the form of successful interaction with
> the
>>>  storage devices or of sub-portions of the layout being returned by
>>>  the client.  
>>
>> Does that mean the server can choose that either or neither of those
>> will delay the revoke, or must it allow either?
>>
>>>  The server can also limit exposure to these problems by
>>>  limiting the byte-ranges initially provided in the layouts and thus
>>>  the amount of outstanding modified data.
>>
>> Wonder what happened to the "more aggressive write-behind" that we
>> started with? :-)
>>
>> Also, if this were really a "may", then there would be a concern
about
>> clients who did not do this.
>>
>> In other words, if there were clients that wanted to cache large
> amounts
>> of dirty data and not write it when recalled, i.e. wanted more
>> aggressive write-behind caching, then the server, by not allowing
>> them large byte-ranges, would be undercutting them, on behalf of the
>> clients who "may" be delaying the layout return.
>>
>> Despite the specific modal auxiliaries, the whole structure of this
>> paragraph is written assuming that the normal behavior is one in
which
>> all of the data is written using the layout, and there are problems
> with
>> that, especially for pNFS file which doesn't want to be managing
> layout
>> byte-ranges in an attempt to limit client dirty data that a client
> "may"
>> decide to write before returning the layout.
>>  
>>
>> -----Original Message-----
>> From: Benny Halevy [mailto:bhalevy@panasas.com] 
>> Sent: Wednesday, November 10, 2010 8:24 AM
>> To: Noveck, David
>> Cc: Black, David; nfsv4@ietf.org
>> Subject: Re: [nfsv4] Write-behind caching
>>
>> On 2010-11-07 06:38, david.noveck@emc.com wrote:
>>>> I agree that there's a problem with the existing text. 
>>>
>>>> Benny and I are both telling you not to forbid a 
>>>> client from choosing to do writebacks to the data 
>>>> servers in response to a layout recall.
>>>
>>> That's OK with me.  I can live with this this being a client choice,
>> but
>>> what we have right now, assumes that the client always must write
> all
>> of
>>
>> Maybe the spec is not coherent about this, but we specifically say
> "may"
>> in
>> section 20.3, not "must".
>>
>>    In processing the layout recall request, the client also varies
its
>>    behavior based on the value of the clora_changed field.  This
field
>>    is used by the server to provide additional context for the reason
>>    why the layout is being recalled.  A FALSE value for clora_changed
>>    indicates that no change in the layout is expected and the client
> may
>>
> ^^^
>>    write modified data to the storage devices involved;
>> maybe this may should be capitalized.
>>
>> We give the client a way to keep the server happy while possibly
>> flushing
>> large amounts of data in section 12.5.5.1.:
>>
>>    Note, the full recalled layout range need not be returned as
>>    part of a single operation, but may be returned in portions.  This
>>    allows the client to stage the flushing of dirty data and commits
> and
>>    returns of layouts.  Also, it indicates to the metadata server
that
>>    the client is making progress.
>>
>>
>>> the dirty data covered by the recalled area.  It further says that
>>> because of that the client must limit the amount of dirty data it
> can
>>> have.  In other words, it suggests it has to be prepared for the
>> server
>>> to recall all of the layout and be able to write all of that data
>> using
>>> the layout being recalled, without the server getting impatient.  It
>>> sounds to me like your desire for efficiency in doing the writing,
> may
>>> lead to poorer performance if it unduly limits the size of dirty
> data
>> in
>>> the cache.  
>>>
>>> If you or Benny can suggest what this should say that is compatible
>> with
>>> block and object, I'm pretty sure we can work out something that
> make
>>> sense for all mapping types.  If not, we can make this something
>>> specified per-mapping-type, which would be more of a drag since it
>>> involves interlocked errata for multiple RFC's.
>>
>> I agree.  But let's try to locate the incoherent sections
>> and clarify whatever needs clarification.
>>
>> Benny
>>
>>>  
>>>
>>> -----Original Message-----
>>> From: Black, David 
>>> Sent: Friday, November 05, 2010 10:08 AM
>>> To: Noveck, David
>>> Cc: nfsv4@ietf.org
>>> Subject: RE: [nfsv4] Write-behind caching
>>>
>>> Dave,
>>>
>>>> But the existing text suggests that this very situation, draining
> the
>>> dirty data using the layout,
>>>> should (or might) result in less aggressive caching.  The fact that
>>> you are doing it in response to a
>>>> recall means that you have a higher drain rate times a limited
> time,
>>> rather than a more limited drain
>>>> rate times an unlimited time.
>>>
>>> I agree that there's a problem with the existing text.  Keep in mind
>>> that for a system in which the metadata server is limited in what it
>> can
>>> do, the "more limited drain rate" might be *much* smaller than
> what's
>>> possible via the data servers, and the client might not be
> interested
>> in
>>> an "unlimited" exposure in terms of time to drain the cache.  Benny
>> and
>>> I are both telling you not to forbid a client from choosing to do
>>> writebacks to the data servers in response to a layout recall.
>>>
>>>> I think the question is about how you consider the case "if you
>> cannot
>>> get the layout back".  If a
>>>> client is allowed to do writes more efficiently despite the recall,
>>> why is not allowed to do reads
>>>> more efficiently and similarly delay the recall?
>>>
>>> Writes contain dirty data.  Reads don't.  There is a difference :-)
>> :-)
>>> .
>>>
>>>> I've proposed that this choice (the one about the write) be made
> the
>>> prerogative of the mapping type
>>>> specifically.  For pNFS file, I would think that the normal
>> assumption
>>> when a layout is being recalled
>>>> is that this is part of some sort of restriping and that you will
> get
>>> it back and it is better to
>>>> return the layout as soon as you can.
>>>
>>> I would think that we shouldn't be telling people how to implement
>>> systems in this fashion.  I guess I can live with the file layout
>>> forbidding this if the file layout implementers think it's useful.
>>> OTOH, Benny and I are telling you that forbidding this is the wrong
>>> approach for both the block and object layouts.
>>>
>>> Thanks,
>>> --David
>>>
>>>> -----Original Message-----
>>>> From: Noveck, David
>>>> Sent: Thursday, November 04, 2010 8:47 PM
>>>> To: Black, David; bhalevy@panasas.com
>>>> Cc: nfsv4@ietf.org
>>>> Subject: RE: [nfsv4] Write-behind caching
>>>>
>>>>> The multiple pNFS data servers can provide much higher
>>>>> throughput than the MDS, and hence it should be valid
>>>>> for a client to cache writes more aggressively when it
>>>>> has a layout because it can drain its dirty cache faster.
>>>>
>>>> But the existing text suggests that this very situation, draining
> the
>>> dirty data using the layout,
>>>> should (or might) result in less aggressive caching.  The fact that
>>> you are doing it in response to a
>>>> recall means that you have a higher drain rate times a limited
> time,
>>> rather than a more limited drain
>>>> rate times an unlimited time.
>>>>
>>>> I think the question is about how you consider the case "if you
>> cannot
>>> get the layout back".  If a
>>>> client is allowed to do writes more efficiently despite the recall,
>>> why is not allowed to do reads
>>>> more efficiently and similarly delay the recall?  It seems that the
>>> same performance considerations
>>>> would apply.  Would you want the client to be able to read through
>> the
>>> area in the layout in order to
>>>> use it effectively?  I wouldn't think so.
>>>>
>>>> I've proposed that this choice (the one about the write) be made
> the
>>> prerogative of the mapping type
>>>> specifically.  For pNFS file, I would think that the normal
>> assumption
>>> when a layout is being recalled
>>>> is that this is part of some sort of restriping and that you will
> get
>>> it back and it is better to
>>>> return the layout as soon as you can.
>>>>
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: nfsv4-bounces@ietf.org [mailto:nfsv4-bounces@ietf.org] On
>> Behalf
>>> Of david.black@emc.com
>>>> Sent: Thursday, November 04, 2010 12:34 PM
>>>> To: bhalevy@panasas.com
>>>> Cc: nfsv4@ietf.org
>>>> Subject: Re: [nfsv4] Write-behind caching
>>>>
>>>>>>    4*) Clients SHOULD write all dirty data covered by the
> recalled
>>>>>>        layout before return it.
>>>>>>
>>>>>> It may be that you can write faster this way, but it also mean
>>> that the server may wait a while to
>>>>> get the layout back and this may delay other clients.  There is
> the
>>> further problem that it means
>>>>> that your set of dirty blocks can be much smaller than it would be
>>> otherwise and this can hurt
>>>>> performance.  I don't think that should be a valid choice.
>>>>>
>>>>> We can live without it, as long there is no hard requirement for
> the
>>> client to
>>>>> not flush any dirty data upon CB_LAYOUTRECALL (option 1 above).
>>>>
>>>> I basically agree with Benny.
>>>>
>>>> The multiple pNFS data servers can provide much higher throughput
>> than
>>> the MDS, and hence it should be
>>>> valid for a client to cache writes more aggressively when it has a
>>> layout because it can drain its
>>>> dirty cache faster.  Such a client may want to pro-actively reduce
>> its
>>> amount of dirty cached data in
>>>> response to a layout recall, with the goal of provide appropriate
>>> behavior if it cannot get the layout
>>>> back.  For that reason, a prohibition on clients initiating writes
> in
>>> response to a recall would be a
>>>> problem (i.e., option 1's prohibition is not a good idea).
>>>>
>>>> That leaves 2) and 3) which seem to be shades of the same concept:
>>>>
>>>>>>    2) Say you MAY write some dirty data on layouts being recalled
>>>>>>       but you should limit this attempt to optimize use of
> layouts
>>>>>>       to avoid unduly delaying layout recalls.
>>>>>>
>>>>>>    3) Say clients MAY write large amounts of dirty data and
> server
>>>>>>       will generally accommodate them in using pNFS to do IO this
>>>>>>       way.
>>>>
>>>> I think the "avoid unduly delaying" point is important, which
>> suggests
>>> that the "large amounts" of
>>>> dirty data writes in 3) would only be appropriate when the client
> has
>>> sufficiently high throughput
>>>> access to the data servers to write "large amounts" of data without
>>> "unduly delaying" the recall.
>>>>
>>>> Thanks,
>>>> --David
>>>>
>>>>> -----Original Message-----
>>>>> From: nfsv4-bounces@ietf.org [mailto:nfsv4-bounces@ietf.org] On
>>> Behalf Of Benny Halevy
>>>>> Sent: Wednesday, November 03, 2010 11:13 AM
>>>>> To: Noveck, David
>>>>> Cc: nfsv4@ietf.org; trond.myklebust@fys.uio.no
>>>>> Subject: Re: [nfsv4] Write-behind caching
>>>>>
>>>>> On 2010-11-03 15:31, david.noveck@emc.com wrote:
>>>>>> 12.5.5 states that the server "MUST wait one lease period before
>>> taking further action" so I don't
>>>>> think it is allowed to fence the client immediately.
>>>>>>
>>>>>> I think there some confusion/error that starts with the last
>>> paragraph of 12.5.5.
>>>>>>
>>>>>>    Although pNFS does not alter the file data caching
> capabilities
>>> of
>>>>>>    clients, or their semantics, it recognizes that some clients
>>> may
>>>>>>    perform more aggressive write-behind caching to optimize the
>>> benefits
>>>>>>    provided by pNFS.
>>>>>>
>>>>>> It you are doing write-behind caching, the primary thing that is
>>> going to decide whether you
>>>>> should actually write the dirty block is the probability that it
>>> will be modified again.  If that is
>>>>> at all likely, then writing it immediately, just to get the
>>> "benefits provided by pNFS" may not be a
>>>>> good idea.  And if the probabilities of the block being further
>>> modified had already reached a low
>>>>> level, then you probably should have started writing it, before
> the
>>> CB_LAYOUTRECALL.  It may be that
>>>>> there are some blocks whose probability is just on the edge, and
> the
>>> CB_LAYOUTRECALL pushed them
>>>>> into gee-it-would-be-better-to-write-these-now category.  But that
>>> is not what is being talked about
>>>>> here.
>>>>>>
>>>>>> Note that it talks about "more aggressive write-behind caching"
>>> and then later talks about having
>>>>> less dirty data in this case.  I think this needs to be rethought.
>>>>>>
>>>>>>    However, write-behind caching may negatively
>>>>>>    affect the latency in returning a layout in response to a
>>>>>>    CB_LAYOUTRECALL;
>>>>>>
>>>>>> Here it seems to assume not that CB_LAYOUTRECALL makes it more
>>> desirable to write not just some
>>>>> dirty blocks using the recalled layout but that all dirty data is
>>> being written (or at least that
>>>>> which covered by the recall).
>>>>>>
>>>>>>    this is similar to file delegations and the impact
>>>>>>    that file data caching has on DELEGRETURN.
>>>>>>
>>>>>> But that is a very bad analogy.  For delegations, there's a
>>> semantic reason you have to write all
>>>>> the dirty data before returning the delegations.
>>>>>>
>>>>>>    Client implementations
>>>>>>    SHOULD limit the amount of unwritten data they have
> outstanding
>>> at
>>>>>>    any one time in order to prevent excessively long responses to
>>>>>>    CB_LAYOUTRECALL.
>>>>>>
>>>>>> Again the assumption is not that somebody is writing some amount
>>> of data to take advantage of a
>>>>> layout going away but that clients in general are writing every
>>> single dirty block.  As an example,
>>>>> take the case of the partial block written sequentially.  That's a
>>> dirty block you would never write
>>>>> as a result of a LAYOUTRECALL.  There's probably no benefit in
>>> writing it using the layout no matter
>>>>> how efficient the pNFS mapping type is.  You are probably going to
>>> have to write it again anyway.
>>>>>>
>>>>>> For some environments, limiting the amount of unwritten data, may
>>> hurt performance more than
>>>>> writing the dirty data to the MDS.  If I can write X bytes of
> dirty
>>> blocks to the MDS (if I didn't
>>>>> have a layout), why should I keep less than X bytes of dirty
> blocks
>>> if I have a layout which is
>>>>> supposedly helping me write more efficiently (and as part of "more
>>> aggressive write-behind
>>>>> caching").  If anything I should be able to have more dirty data.
>>>>>>
>>>>>> Note that clora_changed can tell the client not to write the
> dirty
>>> data, but the client has no way
>>>>> of predicting what clora_changed will be, so it would seem that
> they
>>> have to limit the amount of
>>>>> dirty data, even if they have a server which is never going to ask
>>> them to write it as part of
>>>>> layout recall.
>>>>>>
>>>>>>    Once a layout is recalled, a server MUST wait one
>>>>>>    lease period before taking further action.  As soon as a lease
>>> period
>>>>>>    has passed, the server may choose to fence the client's access
>>> to the
>>>>>>    storage devices if the server perceives the client has taken
>>> too long
>>>>>>    to return a layout.  However, just as in the case of data
>>> delegation
>>>>>>    and DELEGRETURN, the server may choose to wait, given that the
>>> client
>>>>>>    is showing forward progress on its way to returning the
> layout.
>>>>>>
>>>>>> Again, these situations are different.  A client which is doing
>>> this is issuing new IO's using
>>>>> recalled layouts.  I don't have any objection if a server wants to
>>> allow this but I don't think
>>>>> treating layouts in the same way as delegations should be
>>> encouraged.
>>>>>>
>>>>>>    This
>>>>>>    forward progress can take the form of successful interaction
>>> with the
>>>>>>    storage devices or of sub-portions of the layout being
> returned
>>> by
>>>>>>    the client.  The server can also limit exposure to these
>>> problems by
>>>>>>    limiting the byte-ranges initially provided in the layouts and
>>> thus
>>>>>>    the amount of outstanding modified data.
>>>>>>
>>>>>> That adds a lot complexity to the server for no good reason.  If
>>> you start by telling the client
>>>>> to write every single dirty block covered by a layout recall
> before
>>> returning the layout, then you
>>>>> are going to run into problems like this.
>>>>>>
>>>>>> I think there are a number of choices:
>>>>>>
>>>>>>    1) Say you MUST NOT do IO on layouts being recalled, in which
>>>>>>       case none of this problem arises.  I take it this is what
>>>>>>       Trond is arguing for.
>>>>>>
>>>>>>    2) Say you MAY write some dirty data on layouts being recalled
>>>>>>       but you should limit this attempt to optimize use of
> layouts
>>>>>>       to avoid unduly delaying layout recalls.
>>>>>>
>>>>>>    3) Say clients MAY write large amounts of dirty data and
> server
>>>>>>       will generally accommodate them in using pNFS to do IO this
>>>>>>       way.
>>>>>>
>>>>>> Maybe the right approach is to have whichever of these is to be
> in
>>> effect be chosen on a per-
>>>>> mapping-type basis, perhaps based on clora_changed
>>>>>
>>>>> I think this is the right approach as the blocks and objects
> layout
>>> types may
>>>>> use topologies for which flushing some data to fill, e.g. an
>>> allocated block
>>>>> on disk or a RAID stripe makes sense.
>>>>>
>>>>>>
>>>>>> I think the real problem is the suggestion that there is some
>>> reason that a client has to write
>>>>> every single dirty block within the scope of the CB_LAYOUTRECALL,
>>> i.e. that this is analogous to
>>>>> DELEGRETURN.
>>>>>>
>>>>>>    4*) Clients SHOULD write all dirty data covered by the
> recalled
>>>>>>        layout before return it.
>>>>>>
>>>>>> It may be that you can write faster this way, but it also mean
>>> that the server may wait a while to
>>>>> get the layout back and this may delay other clients.  There is
> the
>>> further problem that it means
>>>>> that your set of dirty blocks can be much smaller than it would be
>>> otherwise and this can hurt
>>>>> performance.  I don't think that should be a valid choice.
>>>>>
>>>>> We can live without it, as long there is no hard requirement for
> the
>>> client to
>>>>> not flush any dirty data upon CB_LAYOUTRECALL (option 1 above).
>>>>>
>>>>> Benny
>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: sfaibish [mailto:sfaibish@emc.com]
>>>>>> Sent: Tuesday, November 02, 2010 1:06 PM
>>>>>> To: Trond Myklebust; Noveck, David
>>>>>> Cc: bhalevy@panasas.com; jglasgow@aya.yale.edu; nfsv4@ietf.org
>>>>>> Subject: Re: [nfsv4] Write-behind caching
>>>>>>
>>>>>> On Tue, 02 Nov 2010 10:18:02 -0400, Trond Myklebust
>>>>>> <trond.myklebust@fys.uio.no> wrote:
>>>>>>
>>>>>>> Hi Dave,
>>>>>>>
>>>>>>> So, while I largely agree with your points 1-6, I'd like to add
>>>>>>>
>>>>>>> 0) Layouts are not a tool for enforcing cache consistency!
>>>>>>>
>>>>>>> While I agree that doing safe read-modify-write in the block
> case
>>> is an
>>>>>>> important feature, I don't see any agreement anywhere in RFC5661
>>> that we
>>>>>>> should be providing stronger caching semantics than we used to
>>> provide
>>>>>>> prior to adding pNFS to the protocol. I have no intention of
>>> allowing a
>>>>>>> Linux client implementation that provides such stronger
> semantics
>>> until
>>>>>>> we write that sort of thing into the spec and provide for
> similar
>>>>>>> stronger semantics in the non-pNFS case.
>>>>>>>
>>>>>>> With that in mind, I have the following comments:
>>>>>>>
>>>>>>>       * I see no reason to write data back when the server
>>> recalls the
>>>>>>>         layout. While I see that you could argue (1) implies
> that
>>> you
>>>>>>>         should try to write stuff while you still hold a layout,
>>> the
>>>>>>>         spec says that clora_changed==FALSE implies you can get
>>> that
>>>>>>>         layout back later. In the case where
> clora_changed==TRUE,
>>> you
>>>>>>>         might expect the file would be unavailable for longer,
>>> but the
>>>>>>>         spec says you shouldn't write stuff back in that case...
>>>>>>>       * While this may lead to layout bouncing between clients
>>> and/or
>>>>>>>         the server, the clients do have the option of detecting
>>> this,
>>>>>>>         and choosing write through MDS to improve efficiency.
>>> Grabbing
>>>>>>>         the layout, and blocking others from accessing the data
>>> while
>>>>>>>         you write is not a scalable solution even if you do
>>> believe
>>>>>>>         there is a valid scenario for this behaviour.
>>>>>>>       * Basically, it comes down to the fact that I want to
> write
>>> back
>>>>>>>         data when my memory management heuristics require it, so
>>> that I
>>>>>>>         cache data as long as possible. I see no reason why
>>> server
>>>>>>>         mechanics should dictate when I should stop caching
>>> (unless we
>>>>>>>         are talking about a true cache consistency mechanism).
>>>>>> OK. Now I think I understand your point and we might still
> require
>>> some
>>>>>> changes in the interpretation and perhaps some language in the
>>> 5661. But
>>>>>> I have a basic question about fencing. Do we think there is any
>>> possibility
>>>>>> of data corruption when the DS fence the I/Os very fast after the
>>>>>> layoutrecall.
>>>>>> If we can find such possibility we probably need to mention this
>>> in the
>>>>>> protocol
>>>>>> and recommend how to prevent such a case. For example MDS sends a
>>>>>> layoutrecall
>>>>>> and immediately (implementation decision of the server) it force
>>> the
>>>>>> fencing
>>>>>> on the DS while waiting for the return or after receiving ack
> from
>>> the
>>>>>> client
>>>>>> for the layoutrecall. (I might be out of order here but I just
>>> want to be
>>>>>> sure
>>>>>> this is not the case).
>>>>>>
>>>>>> /Sorin
>>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> So my choices for Q1 and Q2 are still (A) and (A).
>>>>>>>
>>>>>>> Cheers
>>>>>>>   Trond
>>>>>>>
>>>>>>> On Mon, 2010-11-01 at 13:49 -0400, david.noveck@emc.com wrote:
>>>>>>>> I think that want to address this issue without using the words
>>>>>>>> "unrealistic" or "optimal".  Things that you think are
>>> unrealistic
>>>>>>>> sometimes, in some sorts of environments, turn out to be
> common.
>>> Trying
>>>>>>>> to decide what approaches are optimal are also troubling.  In
>>> different
>>>>>>>> situations, different approaches may be better or worse.  The
>>> protocol
>>>>>>>> needs to define the rules that the client and server have to
>>> obey and
>>>>>>>> they may make choices that result in results from optimal to
>>> pessimal.
>>>>>>>> We can make suggestions on doing things better but in unusual
>>> situations
>>>>>>>> the performance considerations may be different.  The point is
>>> we have
>>>>>>>> to be clear when the client can and can't do A and similarly
> for
>>> B and
>>>>>>>> sometimes the choice of A or B is simply up to the client.
>>>>>>>>
>>>>>>>> So I'm going to put down the following numbered propositions
> and
>>> we'll
>>>>>>>> see where people disagree  with me.  Please be specific.  I'm
>>> going to
>>>>>>>> assume the anything you don't argue with numerically below the
>>> point of
>>>>>>>> disagreement is something we can agree on.
>>>>>>>>
>>>>>>>> 1) Normally, when a client is writing data covered by a layout,
>>> it may
>>>>>>>> write using the layout or to the MDS, but unless there is a
>>> particular
>>>>>>>> reason (e.g. slow or inconsistent response using the layout),
> it
>>> SHOULD
>>>>>>>> write using the layout.
>>>>>>>>
>>>>>>>> 2) When a layout is recalled, the protocol itself does not nor
>>> should it
>>>>>>>> require that dirty blocks in the cache be written before
>>> returning the
>>>>>>>> layout.  If a client chooses to do writes using the recalled
>>> layout, it
>>>>>>>> is doing so as an attempt to improve performance, given its
>>> judgment of
>>>>>>>> the relative performance of IO using the layout and IO through
>>> the MDS.
>>>>>>>>
>>>>>>>> 3) Particularly in the case in which clora_changed is 0,
> clients
>>> MAY
>>>>>>>> choose to take advantage of the higher-performance layout path
>>> to write
>>>>>>>> that data, while it is available.  However, since doing that
>>> delays the
>>>>>>>> return of the layout, it is possible that by delaying the
> return
>>> of the
>>>>>>>> layout, performance of others waiting for the layout may be
>>> reduced.
>>>>>>>>
>>>>>>>> 4) When writing of dirty blocks is done using a layout being
>>> recalled,
>>>>>>>> the possibility exists that the layout will be revoked before
>>> all the
>>>>>>>> blocks are successfully written.  The client MUST be prepared
> to
>>> rewrite
>>>>>>>> those dirty blocks whose layouts write failed to the MDS in
> such
>>> cases.
>>>>>>>>
>>>>>>>> 5) Clients that want to write dirty blocks associated with
>>> recalled
>>>>>>>> layouts MAY choose to restrict the size of the set of dirty
>>> blocks they
>>>>>>>> keep in order to make it relatively unlikely that the layout
>>> will be
>>>>>>>> revoked during recall.  On the other hand, for applications, in
>>> which
>>>>>>>> having a large set of dirty blocks in the cache reduces the IO
>>> actually
>>>>>>>> done, such restriction may result in poorer performance, even
>>> though the
>>>>>>>> specific IO path used is more performant.
>>>>>>>>
>>>>>>>> 6) Note that if a large set of dirty blocks can be kept by the
>>> client
>>>>>>>> when a layout is not held, it should be possible to keep a set
>>> that at
>>>>>>>> least that size a set of dirty blocks when a layout is held.
>>> Even if
>>>>>>>> the client should choose to write those blocks as part of the
>>> layout
>>>>>>>> recall, any that it is not able to write in an appropriate
> time,
>>> will be
>>>>>>>> a subset of an amount which, by hypothesis, can be
> appropriately
>>> held
>>>>>>>> when the only means of writing them is to the MDS.
>>>>>>>>
>>>>>>>> Another way of looking at this is that we have the following
>>> questions
>>>>>>>> which I'm going to present as multiple choice.  I have missed a
>>> few
>>>>>>>> choices but,
>>>>>>>>
>>>>>>>> Q1) When a recall of a layout occurs what do you about dirty
>>> blocks?
>>>>>>>>     A) Nothing.  The IO's to write them are like any other IO
>>> and
>>>>>>>>        you don't do IO using layout under recall.
>>>>>>>>     B) You should write all dirty blocks as part of the recall
>>> if
>>>>>>>>        clora_changed is 0.
>>>>>>>>     C) You should do (A), but have the option of doing (B) but
>>> you
>>>>>>>>        are responsible for the consequences.
>>>>>>>>
>>>>>>>> Q2) How many dirty blocks should you keep covered by a layout?
>>>>>>>>     A) As many as you want.  It doesn't matter.
>>>>>>>>     B) A small number so that you can be sure that they can be
>>>>>>>>        written as part of layout recall (Assuming Q1=B).
>>>>>>>>     C) If there is a limit, it must be at least as great as the
>>> limit
>>>>>>>>        that would be in effect if there is no layout present,
>>> since
>>>>>>>>        that number is OK, once the layout does go back.
>>>>>>>>
>>>>>>>> There are pieces of the spec that are assuming (A) and the
>>> answer to
>>>>>>>> these and pieces assuming (B).
>>>>>>>>
>>>>>>>> So I guess I was arguing before that the answers to Q1 and Q2
>>> should be
>>>>>>>> (A).
>>>>>>>>
>>>>>>>> My understanding is that Benny is arguing for (B) as the answer
>>> to Q1
>>>>>>>> and Q2.
>>>>>>>>
>>>>>>>> So I'm now willing to compromise slightly and answer (C) to
> both
>>> of
>>>>>>>> those, but I think that still leaves me and Benny quite a ways
>>> apart.
>>>>>>>>
>>>>>>>> I'm not sure what Trond's answer is, but I'd interested in
>>> understanding
>>>>>>>> his view in terms of (1)-(6) and Q1 and Q2.
>>>>>>>>
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Trond Myklebust [mailto:trond.myklebust@fys.uio.no]
>>>>>>>> Sent: Friday, October 29, 2010 6:04 PM
>>>>>>>> To: faibish, sorin
>>>>>>>> Cc: Noveck, David; bhalevy@panasas.com; jglasgow@aya.yale.edu;
>>>>>>>> nfsv4@ietf.org
>>>>>>>> Subject: Re: [nfsv4] Write-behind caching
>>>>>>>>
>>>>>>>> On Fri, 2010-10-29 at 17:48 -0400, Trond Myklebust wrote:
>>>>>>>>> On Fri, 2010-10-29 at 17:32 -0400, sfaibish wrote:
>>>>>>>>>> On Fri, 29 Oct 2010 13:39:55 -0400, Trond Myklebust
>>>>>>>>>> <trond.myklebust@fys.uio.no> wrote:
>>>>>>>>>>
>>>>>>>>>>> On Fri, 2010-10-29 at 13:20 -0400, david.noveck@emc.com
>>> wrote:
>>>>>>>>>>>> There are two issues here with regard to handling of layout
>>>>>>>> recall.
>>>>>>>>>>>>
>>>>>>>>>>>> One is with regard to in-flight IO.  As Benny points out,
>>> you
>>>>>>>> cannot be
>>>>>>>>>>>> sure that the in-flight IO can be completed in time to
> avoid
>>> the
>>>>>>>> MDS
>>>>>>>>>>>> losing patience.  That should rarely be the case though, if
>>>>>>>> things are
>>>>>>>>>>>> working right.  The client has to be prepared to deal with
>>> IO
>>>>>>>> failures
>>>>>>>>>>>> due to layout revocation.  Any IO that was in flight and
>>> failed
>>>>>>>> because
>>>>>>>>>>>> of layout revocation will need to be handled by being
>>> reissued to
>>>>>>>> the
>>>>>>>>>>>> MDS.  Is there anybody that disagrees with that?
>>>>>>>>>>>>
>>>>>>>>>>>> The second issue concerns IO not in-flight (in other words,
>>> not
>>>>>>>> IO's
>>>>>>>>>>>> yet but potential IO's) when the recall is received.  I
> just
>>>>>>>> don't see
>>>>>>>>>>>> that it reasonable to start IO's using layout segments
> being
>>>>>>>> recalled
>>>>>>>>>>>> (whether for dirty buffers or anything else).  Doing IO's
> to
>>> the
>>>>>>>> MDS is
>>>>>>>>>>>> fine but there is no real need for the layout recall to
>>> specially
>>>>>>>>
>>>>>>>>>>>> trigger them, whether clora_changed is set or not.
>>>>>>>>>>>
>>>>>>>>>>> This should be _very_ rare. Any cases where 2 clients are
>>> trying
>>>>>>>> to do
>>>>>>>>>>> conflicting I/O on the same data is likely to be either a
>>>>>>>> violation of
>>>>>>>>>>> the NFS cache consistency rules, or a scenario where it is
> in
>>> any
>>>>>>>> case
>>>>>>>>>>> more efficient to go through the MDS (e.g. writing to
>>> adjacent
>>>>>>>> records
>>>>>>>>>>> that share the same extent).
>>>>>>>>>> Well this is a different discussion: what was the reason for
>>> the
>>>>>>>> recall in
>>>>>>>>>> the first place. This is one usecase but there could be other
>>>>>>>> usecases
>>>>>>>>>> for the recall and we discuss here how to implement the
>>> protcol more
>>>>>>>> than
>>>>>>>>>> how to solve a real problem. My 2c
>>>>>>>>>
>>>>>>>>> I strongly disagree. If this is an unrealistic scenario, then
>>> we don't
>>>>>>>>> have to care about devising an optimal strategy for it. The
>>> 'there
>>>>>>>> could
>>>>>>>>> be other usecases' scenario needs to be fleshed out before we
>>> can deal
>>>>>>>>> with it.
>>>>>>>>
>>>>>>>> To clarify a bit what I mean: we MUST devise optimal strategies
>>> for
>>>>>>>> realistic and useful scenarios. It is entirely OPTIONAL to
>>> devise
>>>>>>>> optimal strategies for unrealistic ones.
>>>>>>>>
>>>>>>>> If writing back all data before returning the layout causes
>>> protocol
>>>>>>>> issues because the server cannot distinguish between a bad
>>> client and
>>>>>>>> one that is waiting for I/O to complete, then my argument is
>>> that we're
>>>>>>>> in the second case: we don't have to optimise for it, and so it
>>> is safe
>>>>>>>> for the server to assume 'bad client'...
>>>>>>>>
>>>>>>>>    Trond
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>> _______________________________________________
>>>>> nfsv4 mailing list
>>>>> nfsv4@ietf.org
>>>>> https://www.ietf.org/mailman/listinfo/nfsv4
>>>>
>>>> _______________________________________________
>>>> nfsv4 mailing list
>>>> nfsv4@ietf.org
>>>> https://www.ietf.org/mailman/listinfo/nfsv4
>>>
>>> _______________________________________________
>>> nfsv4 mailing list
>>> nfsv4@ietf.org
>>> https://www.ietf.org/mailman/listinfo/nfsv4
>>
>> _______________________________________________
>> nfsv4 mailing list
>> nfsv4@ietf.org
>> https://www.ietf.org/mailman/listinfo/nfsv4
> 
> 
> _______________________________________________
> nfsv4 mailing list
> nfsv4@ietf.org
> https://www.ietf.org/mailman/listinfo/nfsv4
> 
> _______________________________________________
> nfsv4 mailing list
> nfsv4@ietf.org
> https://www.ietf.org/mailman/listinfo/nfsv4