Re: [nfsv4] Preventing an NFSv4.1 client from destroying a migrated lease after TSM

Chuck Lever <chuck.lever@oracle.com> Mon, 17 April 2017 18:45 UTC

From: Chuck Lever <chuck.lever@oracle.com>
In-Reply-To: <CADaq8jfhxcGqJd0U40JkyT9eT+hj24dovMmqFUtkSic_b08jDg@mail.gmail.com>
Date: Mon, 17 Apr 2017 14:45:40 -0400
Cc: NFSv4 <nfsv4@ietf.org>
Message-Id: <213DBA50-FB48-4B48-A177-22C2B6879D49@oracle.com>
References: <ED0D48EE-4618-4E07-B97F-8320C77CF1EC@oracle.com> <CADaq8jfHAF6+2AfuRGr2a=D0FX96YAVu=gGJTqGwSvhXCbntfg@mail.gmail.com> <BADB9283-3B08-49DD-A2E2-5C0C20054C45@oracle.com> <CADaq8jfhxcGqJd0U40JkyT9eT+hj24dovMmqFUtkSic_b08jDg@mail.gmail.com>
To: David Noveck <davenoveck@gmail.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/nfsv4/KD5wTICJt5CDWWWJMTcZ_as42tQ>
Subject: Re: [nfsv4] Preventing an NFSv4.1 client from destroying a migrated lease after TSM

> On Mar 9, 2017, at 2:00 PM, David Noveck <davenoveck@gmail.com> wrote:
> 
> > I'm not certain what you mean by "handled ... as it
> > would have been in 4.0". Server trunking discovery in
> > NFSv4.0 requires an elaborate dance. In NFSv4.1,
> > trunking discovery can be done by a single operation
> > (EXCHANGE_ID).
> 
> What I meant was that, apart from the greater
> calories expended in the v4.0 case, the result should have
> been the same, i.e. that there is no trunking.
> 
> > I'm speculating that if the server hadn't asserted
> > CONFIRMED_R, this client could have recovered
> > correctly from the migration event. Is that what you
> > mean?
> 
> It wasn't what I meant, but, now that you say it, I think it is
> probably right.

We've explored this mechanism a little more, and now we seem
to be hamstrung by the first paragraph of RFC 5661, Section
18.35.3:

   The client uses the EXCHANGE_ID operation to register a particular
   client owner with the server.  The client ID returned from this
   operation will be necessary for requests that create state on the
   server and will serve as a parent object to sessions created by the
   client.  In order to confirm the client ID it must first be used,
   along with the returned eir_sequenceid, as arguments to
   CREATE_SESSION.  If the flag EXCHGID4_FLAG_CONFIRMED_R is set in the
   result, eir_flags, then eir_sequenceid MUST be ignored, as it has no
   relevancy.
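
For illustration, the client-side rule in that paragraph can be sketched
as below. This is a minimal sketch, not any client's actual code: the
EXCHGID4_FLAG_CONFIRMED_R value is taken from the RFC 5661 XDR, while the
function name create_session_seq and its None convention are hypothetical.

```python
# Sketch of the eir_sequenceid rule in RFC 5661, Section 18.35.3.
# EXCHGID4_FLAG_CONFIRMED_R comes from the RFC 5661 XDR; the function
# name and its None return convention are hypothetical.
from typing import Optional

EXCHGID4_FLAG_CONFIRMED_R = 0x80000000


def create_session_seq(eir_flags: int, eir_sequenceid: int) -> Optional[int]:
    """Return the csa_sequence for the next CREATE_SESSION, or None when
    the client is required to ignore eir_sequenceid."""
    if eir_flags & EXCHGID4_FLAG_CONFIRMED_R:
        # Confirmed client ID: the returned sequence ID "has no relevancy",
        # so the server has given the client no usable starting value.
        return None
    # Unconfirmed client ID: confirm it by passing eir_sequenceid
    # to CREATE_SESSION.
    return eir_sequenceid


if __name__ == "__main__":
    print(create_session_seq(0, 17))                          # 17
    print(create_session_seq(EXCHGID4_FLAG_CONFIRMED_R, 17))  # None
```

The None branch is where a client holding a migrated lease ends up when
the destination asserts CONFIRMED_R: it has a lease but no server-supplied
sequence number to start from.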

The problem is the normative requirement in the last
sentence. If CONFIRMED_R is set, a client is required
to ignore the returned contrived slot sequence number.
[ This is likely the reason that clients react to this
flag by purging the lease and starting over ].

Our first thought was to use the sequence number in the
migrated lease (in other words, the sequence number that
was established on the source server). A client knows
that sequence number by looking at the lease it has with
the source server.

However, there's no guarantee that the client will not
perform additional CREATE_SESSION operations against
the source server between the time the source server
copies the migrating lease and the client starts trunking
discovery for the destination server.
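
The race can be illustrated with a simplified model of the server's
CREATE_SESSION sequence check from RFC 5661, Section 18.36.4. If
csa_sequence equals the recorded value, the request is a replay. If it is
one greater, it is a new request. Otherwise the server returns
NFS4ERR_SEQ_MISORDERED. The class and variable names are hypothetical, and
the model assumes that the destination copies the source's record at the
moment the lease migrates.

```python
# Simplified model of the CREATE_SESSION sequence check
# (RFC 5661, Section 18.36.4). Names are hypothetical; the destination is
# assumed to copy the source's record when the lease migrates.
from dataclasses import dataclass
from enum import Enum

SEQ_MOD = 2**32  # sequenceid4 is an unsigned 32-bit value that wraps


class CsResult(Enum):
    NEW = "new request"
    REPLAY = "replay of cached reply"
    MISORDERED = "NFS4ERR_SEQ_MISORDERED"


@dataclass
class ClientIdRecord:
    last_seq: int  # csa_sequence of the last CREATE_SESSION executed

    def create_session(self, csa_sequence: int) -> CsResult:
        if csa_sequence == self.last_seq:
            return CsResult.REPLAY
        if csa_sequence == (self.last_seq + 1) % SEQ_MOD:
            self.last_seq = csa_sequence
            return CsResult.NEW
        return CsResult.MISORDERED


source = ClientIdRecord(last_seq=5)
client_last_seq = 5

# TSM: the destination copies the lease, including its sequence ID.
destination = ClientIdRecord(last_seq=source.last_seq)

# Before trunking discovery, the client creates another session on the source.
client_last_seq += 1
source_result = source.create_session(client_last_seq)

# The client then continues from the sequence ID it knows from the source.
dest_result = destination.create_session((client_last_seq + 1) % SEQ_MOD)

print(source_result.value)  # new request
print(dest_result.value)    # NFS4ERR_SEQ_MISORDERED
```

In this model the destination still expects 6, a value the client cannot
derive from its current view of the source lease.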

Another thought was to use a well-known constant, like
zero, when creating the first session on the destination
server.

What sequence number should be used for the first
CREATE_SESSION with the destination server? If it is
the copied sequence number, what can prevent mutation
of that sequence number while migration of that lease
is not yet complete?

We've established that trunking discovery works correctly
when the destination server does not assert CONFIRMED_R.
This is because the client may use the contrived slot
sequence number provided by the destination server.

If the destination server does not assert CONFIRMED_R,
then how does the client determine whether Transparent
State Migration has occurred? Can it simply start using
the open and lock state it has, and deal with BAD_STATEID
if the servers did not perform TSM?

Confirmation of the lease means more than just that it
is present on the queried server: it also means there is
now a cached CREATE_SESSION reply, and a known good
contrived slot sequence number. Neither of those is true
for a migrated lease when the client's sessions were not
also migrated.

Thus we should consider a migrated lease to be not really
confirmed at all, but in some intermediate state.
A better solution would be to create an additional
EXCHGID4_FLAG that signifies the existence of a migrated
lease for the querying client.

In sum, here are some options we've considered:

1a. destination asserts CONF_R, but client uses the
returned contrived slot sequence anyway

1b. destination asserts CONF_R, and client uses a fixed
constant starting slot sequence

2. destination does not assert CONF_R, and client is
prepared for BAD_STATEID if TSM did not occur

3. destination asserts a new EXCHGID4_FLAG that signifies
that TSM has made the client's lease available on that
server

4. NFSv4.1 TSM cannot occur unless the client's sessions
are also migrated
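
As a rough illustration of option 3, a client might classify an
EXCHANGE_ID reply as in the sketch below. EXCHGID4_FLAG_CONFIRMED_R comes
from RFC 5661. EXCHGID4_FLAG_MIGRATED_R and its bit value are hypothetical
and exist in no RFC. The sketch assumes the destination asserts the new
flag instead of CONFIRMED_R, and it checks the new flag first, so a server
that sets both is still treated as holding a migrated lease.

```python
# Sketch of client handling under option 3. EXCHGID4_FLAG_CONFIRMED_R is
# from RFC 5661; EXCHGID4_FLAG_MIGRATED_R and its bit value are
# hypothetical and are not defined by RFC 5661.
from enum import Enum

EXCHGID4_FLAG_CONFIRMED_R = 0x80000000
EXCHGID4_FLAG_MIGRATED_R = 0x20000000  # hypothetical flag, arbitrary bit


class Action(Enum):
    REUSE_LEASE = "confirmed lease: trunked server, keep existing sessions"
    ADOPT_MIGRATED = "migrated lease: CREATE_SESSION with eir_sequenceid, keep state"
    NEW_LEASE = "no lease: CREATE_SESSION with eir_sequenceid, re-establish state"


def classify(eir_flags: int) -> Action:
    if eir_flags & EXCHGID4_FLAG_MIGRATED_R:
        # The lease arrived by TSM: it exists, but it has no cached
        # CREATE_SESSION reply yet, so eir_sequenceid is still meaningful.
        return Action.ADOPT_MIGRATED
    if eir_flags & EXCHGID4_FLAG_CONFIRMED_R:
        # Ordinary confirmed lease: eir_sequenceid must be ignored.
        return Action.REUSE_LEASE
    return Action.NEW_LEASE
```

The point of the extra flag is that the client no longer has to infer
TSM from whether its stateids are rejected (option 2), and it keeps a
usable starting sequence number that CONFIRMED_R would otherwise take
away.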

Are there others?

--
Chuck Lever
