Re: [nfsv4] Preventing an NFSv4.1 client from destroying a migrated lease after TSM

Chuck Lever <chuck.lever@oracle.com> Wed, 19 April 2017 16:16 UTC

From: Chuck Lever <chuck.lever@oracle.com>
In-Reply-To: <CADaq8jeuPQfUNM-QH2jEKuMRSjsEwWQU_0ZBVJH0KDU7-vipmQ@mail.gmail.com>
Date: Wed, 19 Apr 2017 12:09:15 -0400
Cc: NFSv4 <nfsv4@ietf.org>
Message-Id: <0C02F2F9-A202-45A7-844B-55CA6D308CCC@oracle.com>
References: <ED0D48EE-4618-4E07-B97F-8320C77CF1EC@oracle.com> <CADaq8jfHAF6+2AfuRGr2a=D0FX96YAVu=gGJTqGwSvhXCbntfg@mail.gmail.com> <BADB9283-3B08-49DD-A2E2-5C0C20054C45@oracle.com> <CADaq8jfhxcGqJd0U40JkyT9eT+hj24dovMmqFUtkSic_b08jDg@mail.gmail.com> <213DBA50-FB48-4B48-A177-22C2B6879D49@oracle.com> <CADaq8jeuPQfUNM-QH2jEKuMRSjsEwWQU_0ZBVJH0KDU7-vipmQ@mail.gmail.com>
To: David Noveck <davenoveck@gmail.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/nfsv4/D30WINMYWGOURbbQKLUE_AEHwT4>
Subject: Re: [nfsv4] Preventing an NFSv4.1 client from destroying a migrated lease after TSM

> On Apr 18, 2017, at 3:30 PM, David Noveck <davenoveck@gmail.com> wrote:
> 
> To avoid burying the lead, I'll respond waaay out of order:
> 
> > In sum, here are some options we've considered:
> 
> > 1a. destination asserts CONF_R, but client uses the
> > returned contrived slot sequence anyway
> 
> Clearly, migration-issues-13 needs to take account of this issue.
> I believe 1a is the best choice and will explain why below.
> 
> > We've explored this mechanism a little more, and now we seem
> > to be hamstrung on the first paragraph of RFC 5661, Section
> > 18.35.3:
> 
> Like much of RFCs 5661 and 7530, the paragraph in question
> was written without any consideration of issues related to 
> transparent state migration.  Sigh!
> 
> I think I have not been as explicit about this as I should have been
> but, as we have started to address a number of migration-related issues for
> NFSv4.1, it appears that we will need a standards-track document,
> updating RFC5661, paralleling RFC7931 for NFSv4.0.
> 
> RFC7931 had a replacement SETCLIENTID description and it now 
> appears that the new document will need a replacement for the
> EXCHANGE_ID  description.   Coincidence? ....  or something 
> else? :-)
> 
> >   The client uses the EXCHANGE_ID operation to register a particular
> >   client owner with the server.  
> 
> However, when the client_owner has already been registered
> by other means (e.g. transparent state migration), the client may still
> use EXCHANGE_ID to obtain the clientid assigned previously.
> 
> >  The client ID returned from this
> >   operation will be necessary for requests that create state on the
> >   server and will serve as a parent object to sessions created by the
> >   client.  
> 
> OK.
> 
> >   In order to confirm the client ID it must first be used,
> >   along with the returned eir_sequenceid, as arguments to
> >   CREATE_SESSION.  
> 
> In situations in which the registration of the client_owner has not occurred
> previously ...
> 
> > If the flag EXCHGID4_FLAG_CONFIRMED_R is set in the
> > result, eir_flags, then eir_sequenceid MUST be ignored, as it has no
> >  relevancy.
> 
> It is not central to my argument but I believe that the "MUST" is not in
> accord with RFC2119.  The point here is that if a CREATE_SESSION
> has been done, which this text assumes is the case, that has already
> established the proper sequence id to use.  This is explaining how the
> protocol works and not establishing an interoperability REQUIREMENT.
> 
> Anyway, I would go on:
> 
> If the flag EXCHGID4_FLAG_CONFIRMED_R is set in the
> result, eir_flags, then it is an indication that the registration
> of the client_owner has already occurred and that a further
> CREATE_SESSION is not needed to confirm it.  Of course,
> subsequent CREATE_SESSIONs may be needed for other
> reasons.
> 
> The value eir_sequenceid is used to establish an initial
> sequence value associated with the clientid returned.  In cases
> in which a CREATE_SESSION has already been done, there
> is no need for this value, since sequencing of such requests has
> already been established; the client has no need for this
> value and will ignore it.
> 
> > The problem is the normative requirement in the last
> > sentence. 
> 
> I agree that this REQUIREMENT is a problem.
> 
> > If CONFIRMED_R is set, a client is required
> > to ignore the returned contrived slot sequence number.
> 
> According to RFC5661, yes.
> 
> > [ This is likely the reason that clients react to this
> > flag by purging the lease and starting over ].
> 
> I believe that the reason is that if the client thinks/knows a
> clientid is unconfirmed and he finds out that the server thinks
> otherwise, he figures something is really messed up, in which
> case punting like this is to be preferred to a panic :-(
> 
> > Our first thought was to use the sequence number in the
> > migrated lease (in other words, the sequence number that
> > was established on the source server). A client knows
> > that sequence number by looking at the lease it has with
> > the source server.
> 
> In this case, you are moving over the single-slot 
> quasi-session associated with the client.
> 
> > However, there's no guarantee that the client will not
> > perform additional CREATE_SESSION operations against
> > the source server between the time the source server
> > copies the migrating lease and the client starts trunking
> > discovery for the destination server.
> 
> True.  The client can hack around this by starting with 
> the last sequence he has and backing up  if it is too
> high.  I'd rather not go there, though.
> 
> > Another thought was to use a well-known constant, like
> > zero, when creating the first session on the destination
> > server.
> 
> That is probably the best option if you are committed to respecting
> the language of the current RFC5661 requirement.  However,
> you would still need to revise the EXCHANGE_ID description
> to provide a coherent description, in any case.
> 
> > We've established that trunking discovery works correctly
> > when the destination server does not assert CONFIRMED_R.
> 
> :-)
> 
> > This is because the client may use the contrived slot
> > sequence number provided by the destination server.
> 
> It's because he does use that slot sequence number.  To me, that is
> an additional reason he should be allowed to do so.
> 
> > If the destination server does not assert CONFIRMED_R,
> > then how does the client determine whether Transparent
> > State Migration has occurred? Can it simply start using
> > the open and lock state it has, and deal with BAD_STATEID
> > if the servers did not perform TSM?
> 
> I think that may be possible.  However, it is ugly.  If you have another way
> to find out that state was transferred, you can look at the status bits from SEQUENCE,
> and only worry about lost stateids when there is an indication that some
> state was lost.
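
A rough sketch of that status-bit check, in Python for brevity.
This is only an illustration, not client code. The flag values are
the sr_status_flags bits as I read them in RFC 5661, Section 18.46,
and should be checked against the XDR. The mask and the function
name needs_stateid_audit are made up here, and which bits a client
should treat as "state was lost" is an assumption on my part.

```python
# sr_status_flags bits from the SEQUENCE reply (RFC 5661, Section 18.46).
SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED = 0x00000008
SEQ4_STATUS_EXPIRED_SOME_STATE_REVOKED = 0x00000010
SEQ4_STATUS_ADMIN_STATE_REVOKED = 0x00000020
SEQ4_STATUS_RECALLABLE_STATE_REVOKED = 0x00000040

# Hypothetical mask: the bits this sketch treats as "some state is gone".
STATE_LOSS_MASK = (
    SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED
    | SEQ4_STATUS_EXPIRED_SOME_STATE_REVOKED
    | SEQ4_STATUS_ADMIN_STATE_REVOKED
    | SEQ4_STATUS_RECALLABLE_STATE_REVOKED
)


def needs_stateid_audit(sr_status_flags: int) -> bool:
    """Return True if the SEQUENCE reply hints that stateids were lost.

    Only when this returns True would the client go looking for
    stateids that now draw BAD_STATEID on the destination server.
    """
    return bool(sr_status_flags & STATE_LOSS_MASK)
```

With this, a client that already knows state was transferred only
walks its open and lock state when needs_stateid_audit() is true for
a SEQUENCE reply from the destination server.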
> 
> > Confirmation of the lease means more than just that it
> > is present on the queried server
> 
> Confirmation means that the client is not uncertain about
> whether it exists.  Confirmation by means of CREATE_SESSION
> is one way to do that, by making sure that the EXCHANGE_ID was
> not a random retry, since abandoned.
> 
> > it also means there is
> > now a cached CREATE_SESSION reply, and a known good
> > contrived slot sequence number. 
> 
> In many situations, that is a reasonable inference but it doesn't
> change the meaning of "confirmation".
> 
> > Neither of those is true
> > for a migrated lease when the client's sessions were not
> > also migrated.
> 
> Yes, but I'm not certain that both of these are true in the case in which the client's
> sessions are migrated.
> 
> > Thus we should consider that a migrated lease is really
> > not confirmed at all, 
> 
> I consider it confirmed, but confirmed in a different manner.
> 
> > but in some intermediate state.
> 
> If it were an intermediate state, there would be some way to get into
> what you call the confirmed state (and I would call the
> confirmed-by-CREATE_SESSION state).  You could do that by doing
> a CREATE_SESSION if you knew the sequenceid but I don't see the
> point in explicating the fine structure of the various confirmed clientid
> sub-states.
> 
> > A more ideal solution would be to create an additional
> > EXCHGID4_FLAG that signifies the existence of a migrated
> > lease for the querying client.
> 
> If you were starting today (or had a time machine) you might do it that way.
> 
> 
> > In sum, here are some options we've considered:
> 
> > 1a. destination asserts CONF_R, but client uses the
> > returned contrived slot sequence anyway
> 
> 
> My first choice.

I believe that is what I have prototyped in the Linux
client, currently, and it is known to work as well as
a non-migration EXCHANGE_ID / CREATE_SESSION.

I don't object to this as long as the WG has blessed
this mechanism and documents it in an "NFSv4.1
migration update" as mentioned above.
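
To make 1a concrete, here is a minimal sketch, in Python for
brevity, of how a client might pick the sequence ID for its first
CREATE_SESSION against the destination server. This is not the
Linux client code. Only EXCHGID4_FLAG_CONFIRMED_R and eir_sequenceid
come from RFC 5661; the function name, the known_seqid argument
(the sequence the client already holds for this server, if any) and
the option_1a switch are hypothetical.

```python
from typing import Optional

EXCHGID4_FLAG_CONFIRMED_R = 0x80000000  # eir_flags bit, RFC 5661


def create_session_seqid(eir_flags: int,
                         eir_sequenceid: int,
                         known_seqid: Optional[int],
                         option_1a: bool) -> Optional[int]:
    """Pick the csa_sequence for the next CREATE_SESSION.

    Returns None when the client has no usable sequence ID and has
    to fall back (today: purge the lease and start over).
    """
    confirmed = bool(eir_flags & EXCHGID4_FLAG_CONFIRMED_R)
    if not confirmed:
        # Ordinary case: the client ID is unconfirmed, so the
        # returned eir_sequenceid is the one to use.
        return eir_sequenceid
    if known_seqid is not None:
        # The client confirmed this client ID on this server itself
        # and already tracks the sequence.
        return known_seqid
    if option_1a:
        # Migrated lease: CONFIRMED_R is set, but the client never
        # did a CREATE_SESSION here. Use eir_sequenceid anyway.
        return eir_sequenceid
    # Strict reading of Section 18.35.3: eir_sequenceid is ignored.
    return None
```

For a migrated lease (CONFIRMED_R set, no known_seqid) the strict
reading leaves the client with nothing to use, which is the case
that ends in a purged lease; with option_1a the client proceeds as
in a non-migration EXCHANGE_ID / CREATE_SESSION.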


> > 1b. destination asserts CONF_R, and client uses a fixed
> > constant starting slot sequence
> 
> My second choice.
> 
> > 2. destination does not assert CONF_R, and client is
> > prepared for BAD_STATEID if TSM did not occur
> 
> a real drag.
> 
> > 3. destination asserts a new EXCHGID4_FLAG that signifies
> > a TSM has made the client's lease available on that
> > server
> 
> Unfortunately, NFSv4.1 is not an extensible minor version.
> 
> > 4. NFSv4.1 TSM cannot occur unless the client's sessions
> > are also migrated
> 
> 
> > Are there others?
> 
> There are others but none come to mind right now.
> 
> On Mon, Apr 17, 2017 at 2:45 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
> 
> > On Mar 9, 2017, at 2:00 PM, David Noveck <davenoveck@gmail.com> wrote:
> >
> > > I'm not certain what you mean by "handled ... as it
> > > would have been in 4.0". Server trunking discovery in
> > > NFSv4.0 requires an elaborate dance. In NFSv4.1,
> > > trunking discovery can be done by a single operation
> > > (EXCHANGE_ID).
> >
> > What I meant was that, apart from the greater
> > calories expended in v4.0 case, the result should have
> > been the same, i.e. that there is no trunking.
> >
> > > I'm speculating that if the server hadn't asserted
> > > CONFIRMED_R, this client could have recovered
> > > correctly from the migration event. Is that what you
> > > mean?
> >
> > It wasn't what I meant, but, now that you say it, I think is
> > probably right.
> 
> We've explored this mechanism a little more, and now we seem
> to be hamstrung on the first paragraph of RFC 5661, Section
> 18.35.3:
> 
>    The client uses the EXCHANGE_ID operation to register a particular
>    client owner with the server.  The client ID returned from this
>    operation will be necessary for requests that create state on the
>    server and will serve as a parent object to sessions created by the
>    client.  In order to confirm the client ID it must first be used,
>    along with the returned eir_sequenceid, as arguments to
>    CREATE_SESSION.  If the flag EXCHGID4_FLAG_CONFIRMED_R is set in the
>    result, eir_flags, then eir_sequenceid MUST be ignored, as it has no
>    relevancy.
> 
> The problem is the normative requirement in the last
> sentence. If CONFIRMED_R is set, a client is required
> to ignore the returned contrived slot sequence number.
> [ This is likely the reason that clients react to this
> flag by purging the lease and starting over ].
> 
> Our first thought was to use the sequence number in the
> migrated lease (in other words, the sequence number that
> was established on the source server). A client knows
> that sequence number by looking at the lease it has with
> the source server.
> 
> However, there's no guarantee that the client will not
> perform additional CREATE_SESSION operations against
> the source server between the time the source server
> copies the migrating lease and the client starts trunking
> discovery for the destination server.
> 
> Another thought was to use a well-known constant, like
> zero, when creating the first session on the destination
> server.
> 
> What sequence number should be used for the first
> CREATE_SESSION with the destination server? If it is
> the copied sequence number, what can prevent mutation
> of that sequence number while migration of that lease
> is not yet complete?
> 
> We've established that trunking discovery works correctly
> when the destination server does not assert CONFIRMED_R.
> This is because the client may use the contrived slot
> sequence number provided by the destination server.
> 
> If the destination server does not assert CONFIRMED_R,
> then how does the client determine whether Transparent
> State Migration has occurred? Can it simply start using
> the open and lock state it has, and deal with BAD_STATEID
> if the servers did not perform TSM?
> 
> 
> Confirmation of the lease means more than just that it
> is present on the queried server: it also means there is
> now a cached CREATE_SESSION reply, and a known good
> contrived slot sequence number. Neither of those is true
> for a migrated lease when the client's sessions were not
> also migrated.
> 
> Thus we should consider that a migrated lease is really
> not confirmed at all, but in some intermediate state.
> A more ideal solution would be to create an additional
> EXCHGID4_FLAG that signifies the existence of a migrated
> lease for the querying client.
> 
> 
> In sum, here are some options we've considered:
> 
> 1a. destination asserts CONF_R, but client uses the
> returned contrived slot sequence anyway
> 
> 1b. destination asserts CONF_R, and client uses a fixed
> constant starting slot sequence
> 
> 2. destination does not assert CONF_R, and client is
> prepared for BAD_STATEID if TSM did not occur
> 
> 3. destination asserts a new EXCHGID4_FLAG that signifies
> a TSM has made the client's lease available on that
> server
> 
> 4. NFSv4.1 TSM cannot occur unless the client's sessions
> are also migrated
> 
> Are there others?
> 
> --
> Chuck Lever
> 
> 
> 
> 

--
Chuck Lever