Re: [nfsv4] Write-behind caching

<david.noveck@emc.com> Mon, 01 November 2010 17:49 UTC

Date: Mon, 01 Nov 2010 13:49:13 -0400
Message-ID: <BF3BB6D12298F54B89C8DCC1E4073D80029446BC@CORPUSMX50A.corp.emc.com>
In-Reply-To: <1288389823.3701.59.camel@heimdal.trondhjem.org>
From: david.noveck@emc.com
To: trond.myklebust@fys.uio.no, sfaibish@popimap.lss.emc.com
Cc: bhalevy@panasas.com, nfsv4@ietf.org
Subject: Re: [nfsv4] Write-behind caching

I think that we want to address this issue without using the words
"unrealistic" or "optimal".  Things that you think are unrealistic
sometimes, in some sorts of environments, turn out to be common.  Trying
to decide what approaches are optimal is also troubling.  In different
situations, different approaches may be better or worse.  The protocol
needs to define the rules that the client and server have to obey, and
within those rules they may make choices whose results range from
optimal to pessimal.  We can make suggestions on doing things better,
but in unusual situations the performance considerations may be
different.  The point is we have to be clear when the client can and
can't do A, and similarly for B, and sometimes the choice of A or B is
simply up to the client.

So I'm going to put down the following numbered propositions and we'll
see where people disagree with me.  Please be specific.  I'm going to
assume that any proposition numbered below your first point of
disagreement is something we can agree on.

1) Normally, when a client is writing data covered by a layout, it may
write using the layout or to the MDS, but unless there is a particular
reason (e.g. slow or inconsistent response using the layout), it SHOULD
write using the layout.

2) When a layout is recalled, the protocol itself does not, nor should it,
require that dirty blocks in the cache be written before returning the
layout.  If a client chooses to do writes using the recalled layout, it
is doing so as an attempt to improve performance, given its judgment of
the relative performance of IO using the layout and IO through the MDS.

3) Particularly in the case in which clora_changed is 0, clients MAY
choose to take advantage of the higher-performance layout path to write
dirty data while that path is available.  However, since doing that
delays the return of the layout, the performance of others waiting for
the layout may be reduced.

4) When writing of dirty blocks is done using a layout being recalled,
the possibility exists that the layout will be revoked before all the
blocks are successfully written.  In such cases, the client MUST be
prepared to rewrite to the MDS those dirty blocks whose layout writes
failed.

5) Clients that want to write dirty blocks associated with recalled
layouts MAY choose to restrict the size of the set of dirty blocks they
keep in order to make it relatively unlikely that the layout will be
revoked during recall.  On the other hand, for applications in which
having a large set of dirty blocks in the cache reduces the IO actually
done, such a restriction may result in poorer performance, even though
the specific IO path used is more performant.

6) Note that if a large set of dirty blocks can be kept by the client
when a layout is not held, it should be possible to keep a set of dirty
blocks at least that large when a layout is held.  Even if the client
should choose to write those blocks as part of the layout recall, any
that it is not able to write in an appropriate time will be a subset of
an amount which, by hypothesis, can be appropriately held when the only
means of writing them is to the MDS.
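
As a rough illustration of (3) and (4), here is a minimal Python sketch
of a client that chooses to write dirty blocks through a layout under
recall and falls back to the MDS on revocation.  It is not taken from
the spec or from any implementation: `LayoutRevoked`,
`write_back_during_recall`, and the two writer callbacks are
hypothetical names, and the loop is sequential where a real client
would have many IOs in flight.

```python
class LayoutRevoked(Exception):
    """Hypothetical error: the layout was revoked while a write was attempted."""


def write_back_during_recall(dirty_blocks, write_via_layout, write_via_mds):
    """Try the layout path first; rewrite to the MDS whatever it could not take.

    write_via_layout and write_via_mds are caller-supplied callables
    (assumptions for this sketch); each takes one block.
    Returns (blocks written via the layout, blocks written via the MDS).
    """
    via_layout, via_mds = [], []
    layout_usable = True
    for block in dirty_blocks:
        if layout_usable:
            try:
                write_via_layout(block)
                via_layout.append(block)
                continue
            except LayoutRevoked:
                # Per (4): the layout is gone, so stop using it and
                # fall through to rewrite this block to the MDS.
                layout_usable = False
        write_via_mds(block)
        via_mds.append(block)
    return via_layout, via_mds
```

Whether a client runs anything like this at all is its own choice under
(2); a client that never writes through a recalled layout would just
leave the blocks dirty and write them to the MDS like any other IO.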

Another way of looking at this is that we have the following questions,
which I'm going to present as multiple choice.  I have probably missed a
few choices, but:

Q1) When a recall of a layout occurs, what do you do about dirty blocks?
    A) Nothing.  The IO's to write them are like any other IO and
       you don't do IO using a layout under recall.
    B) You should write all dirty blocks as part of the recall if
       clora_changed is 0.
    C) You should do (A) but have the option of doing (B), in which
       case you are responsible for the consequences.

Q2) How many dirty blocks should you keep covered by a layout?
    A) As many as you want.  It doesn't matter.
    B) A small number so that you can be sure that they can be
       written as part of layout recall (Assuming Q1=B).
    C) If there is a limit, it must be at least as great as the limit 
       that would be in effect if there is no layout present, since
       that number is OK, once the layout does go back.
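
To make the difference between (B) and (C) in Q2 concrete, a small
Python sketch follows.  All the names are assumptions for illustration;
in particular, `recall_grace_sec` stands for whatever estimate the
client has of how long the server will wait before revoking, and taking
the maximum is just one simple way to satisfy the "at least as great"
condition in (C).

```python
def dirty_limits_with_layout(layout_bytes_per_sec, recall_grace_sec,
                             no_layout_limit_bytes):
    """Return (limit under Q2=B, limit under Q2=C), both in bytes."""
    # Q2=B: keep only what could be flushed through the layout
    # before the server is expected to revoke it.
    limit_b = int(layout_bytes_per_sec * recall_grace_sec)
    # Q2=C: never go below the limit that is already acceptable
    # when no layout is held.
    limit_c = max(limit_b, no_layout_limit_bytes)
    return limit_b, limit_c
```

With a slow layout path or a short grace period, (B) can force the
limit below what the client would happily hold with no layout at all,
which is the situation proposition (6) argues against.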

There are pieces of the spec that assume (A) as the answer to these
questions, and pieces that assume (B).

So I guess I was arguing before that the answers to Q1 and Q2 should be
(A).

My understanding is that Benny is arguing for (B) as the answer to Q1
and Q2.

So I'm now willing to compromise slightly and answer (C) to both of
those, but I think that still leaves me and Benny quite a ways apart.

I'm not sure what Trond's answer is, but I'd be interested in
understanding his view in terms of (1)-(6) and Q1 and Q2.


-----Original Message-----
From: Trond Myklebust [mailto:trond.myklebust@fys.uio.no] 
Sent: Friday, October 29, 2010 6:04 PM
To: faibish, sorin
Cc: Noveck, David; bhalevy@panasas.com; jglasgow@aya.yale.edu; nfsv4@ietf.org
Subject: Re: [nfsv4] Write-behind caching

On Fri, 2010-10-29 at 17:48 -0400, Trond Myklebust wrote:
> On Fri, 2010-10-29 at 17:32 -0400, sfaibish wrote:
> > On Fri, 29 Oct 2010 13:39:55 -0400, Trond Myklebust  
> > <trond.myklebust@fys.uio.no> wrote:
> > 
> > > On Fri, 2010-10-29 at 13:20 -0400, david.noveck@emc.com wrote:
> > >> There are two issues here with regard to handling of layout recall.
> > >>
> > >> One is with regard to in-flight IO.  As Benny points out, you cannot be
> > >> sure that the in-flight IO can be completed in time to avoid the MDS
> > >> losing patience.  That should rarely be the case though, if things are
> > >> working right.  The client has to be prepared to deal with IO failures
> > >> due to layout revocation.  Any IO that was in flight and failed because
> > >> of layout revocation will need to be handled by being reissued to the
> > >> MDS.  Is there anybody that disagrees with that?
> > >>
> > >> The second issue concerns IO not in-flight (in other words, not IO's
> > >> yet but potential IO's) when the recall is received.  I just don't see
> > >> that it is reasonable to start IO's using layout segments being recalled
> > >> (whether for dirty buffers or anything else).  Doing IO's to the MDS is
> > >> fine but there is no real need for the layout recall to specially
> > >> trigger them, whether clora_changed is set or not.
> > >
> > > This should be _very_ rare. Any cases where 2 clients are trying to do
> > > conflicting I/O on the same data is likely to be either a violation of
> > > the NFS cache consistency rules, or a scenario where it is in any case
> > > more efficient to go through the MDS (e.g. writing to adjacent records
> > > that share the same extent).
> > Well this is a different discussion: what was the reason for the recall in
> > the first place. This is one usecase but there could be other usecases
> > for the recall and we discuss here how to implement the protocol more than
> > how to solve a real problem. My 2c
> 
> I strongly disagree. If this is an unrealistic scenario, then we don't
> have to care about devising an optimal strategy for it. The 'there could
> be other usecases' scenario needs to be fleshed out before we can deal
> with it.

To clarify a bit what I mean: we MUST devise optimal strategies for
realistic and useful scenarios. It is entirely OPTIONAL to devise
optimal strategies for unrealistic ones.

If writing back all data before returning the layout causes protocol
issues because the server cannot distinguish between a bad client and
one that is waiting for I/O to complete, then my argument is that we're
in the second case: we don't have to optimise for it, and so it is safe
for the server to assume 'bad client'...

   Trond