Re: [nfsv4] Preventing an NFSv4.1 client from destroying a migrated lease after TSM

> One of Oracle's engineers, Xuan Qi, has been experimenting
> with NFSv4.1 Transparent State Migration. The
> implementations under test are a Solaris-based prototype
> NFS server, and a recent vintage migration-capable Linux
> NFS client. Xuan observed the following:

> + While a simple workload is running, there is an FS
>  migration event from server A to server B. Transparent
>   State Migration is done as part of this migration. The
>   session that had been established between the client and
>   server A is NOT migrated.

> + The client has never interacted with server B before the
>   migration event.

In that case, there is no possibility that the clientid used to
access server A will be merged with one used to access
server B.  So it is likely that the clientid used to access
server A, together with the stateids used to access the
migrated fs, are now usable on B.

> After the client retrieves fs_locations
> from server A, it begins trunking discovery with server B.

I assume that it compares the server owners it got
from each server. Correct?
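
To be concrete about the comparison I have in mind, here is a rough
sketch. The names ServerIdentity and trunking_relation are mine, not
anything from the client under test, and the sketch leaves out the
clientid comparison that RFC 5661 also calls for.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerIdentity:
    """Hypothetical holder for the identity fields of an EXCHANGE_ID reply."""
    scope: bytes      # eir_server_scope
    major_id: bytes   # eir_server_owner.so_major_id
    minor_id: int     # eir_server_owner.so_minor_id


def trunking_relation(a: ServerIdentity, b: ServerIdentity) -> str:
    """Classify two replies by comparing server scope and server owner."""
    # Major ids are only comparable when the scopes match.
    if a.scope != b.scope or a.major_id != b.major_id:
        return "distinct servers"
    if a.minor_id == b.minor_id:
        return "session trunking possible"
    return "client ID trunking only"
```

With the same scope but different server owners, as reported in this
test, the result is "distinct servers", so no trunking applies.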

> + In the initial EXCHANGE_ID request with server B, the
>  client presents the same client owner and boot verifier
>  it previously presented to server A.

Makes sense.

> + In the initial EXCHANGE_ID reply, server B reports the
>  presence of the migrated lease by returning the same
>  clientID the client got from server A and asserting
>   EXCHGID4_FLAG_CONFIRMED_R.

> + The client thus believes this lease is an old one it
>  forgot to clean up.

I don't see why it should believe that.  It is aware of the
migration from A to B and thus the possibility that
a clientid has been migrated.  The fact that it is the
same id makes it even more likely that the lease is
migrated, although a coincidence is possible.
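
A rough sketch of the decision I would expect the client to make when
it sees a confirmed lease in the EXCHANGE_ID reply. The function name,
its parameters and the returned labels are illustrative assumptions;
only the flag value comes from RFC 5661.

```python
EXCHGID4_FLAG_CONFIRMED_R = 0x80000000  # flag value from RFC 5661


def classify_lease(reply_flags, reply_clientid,
                   migrating_from_clientid=None):
    """Decide how to treat the lease reported by the destination server.

    migrating_from_clientid is the clientid held on the source server
    when the client knows a migration is in progress, otherwise None.
    """
    if not reply_flags & EXCHGID4_FLAG_CONFIRMED_R:
        return "new lease"
    if migrating_from_clientid is None:
        # No migration known: a confirmed record may really be stale.
        return "possibly stale lease"
    if reply_clientid == migrating_from_clientid:
        # Same id as on the source makes migration even more likely.
        return "likely migrated lease"
    return "possibly migrated lease"
```

The point is that a confirmed record seen during a known migration
should not fall into the stale-lease path, so the client would not
purge it with a new boot verifier.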

> It purges the migrated lease by
>  sending a second EXCHANGE_ID with the same client owner
>  but an arbitrary boot verifier.

OUCH!! That sounds to me like a client bug.

> Server B returns a
>   second distinct client ID.

Which is basically ignored, I take it.

> + The client issues a CREATE_SESSION with the second
>  client ID. This client ID and session is not used for
>  subsequent operations.

OK.

> + The client issues a DESTROY_SESSION and DESTROY_CLIENTID
>  to server A, using the original session and client ID.
>  How polite!

Would it still do that if there was another non-migrated fs on A using
the same session?  I hope not.

> + The client then establishes a fresh lease with server
>  B by sending a third EXCHANGE_ID with the same client
>  owner and the original boot verifier. Server B returns
>  a third distinct client ID.

This clientid will have no stateids associated with it.

> The client then uses this
>   new client ID for a CREATE_SESSION. This session is used
>   to issue RECLAIM_COMPLETE with no OPEN state recovery.

Not clear why it is doing this, but it is not relevant.  The
damage was done when the existing migrated lock state
was thrown away.

> + The first WRITE operation to server B fails with
>   NFS4ERR_BAD_STATEID.

Not surprising, since the session is associated with a client for
which there are no valid stateids.

I don't understand why it thinks that stateid was valid, since it
previously invalidated it by doing an EXCHANGE_ID with a new boot verifier.

> The client performs no-grace
> stateid recovery, and the workload continues normally.

Sigh!

> + Both servers report the same server scope, but different
>  server owners. The scope string is
>
>     "Solaris NFSv4.1 Server Scope"

> RFC 5661 and draft-ietf-nfsv4-migration-issues are not
> clear about how servers and clients are supposed to
> interact after Transparent State Migration.

Probably not.  The basic assumption is that there needs
to be an update to bring rfc5661 into line with rfc7931, but
that hasn't been done yet.

With regard to this specific situation, it doesn't seem that anything
4.1-specific has gone on.  In particular, I haven't noticed any particular
consequences due to sessions or server name or server scope.  There has
been an instance of RECLAIM_COMPLETE, but by that point the damage had
been done. It looks to me that if this had been handled essentially as it
would have been in 4.0, things would have worked better. Maybe I've
missed something.

>  For instance:

> 1. Since the client has had no previous contact with
> server B, should the migrated lease be considered
> confirmed or not confirmed on server B?

I'd say that if it is confirmed on A and migrated to B,
then it should be confirmed on B as well.

> 2. How should the client determine whether or not
> Transparent State Migration has occurred?

I think one should follow the approach in rfc7931.  If the stateids
are valid on the new server, then transparent state
migration has occurred.
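
Roughly, and only as a sketch: the helper names here are mine, and
is_valid_on_destination stands in for whatever check the client really
uses (the result of a normal operation using the stateid, or
TEST_STATEID in 4.1).

```python
def split_migrated_state(stateids, is_valid_on_destination):
    """Partition stateids into those the new server accepts and the rest."""
    migrated, lost = [], []
    for stateid in stateids:
        if is_valid_on_destination(stateid):
            migrated.append(stateid)
        else:
            lost.append(stateid)
    return migrated, lost


def tsm_occurred(stateids, is_valid_on_destination):
    """Treat TSM as having occurred only if every stateid is still valid."""
    migrated, lost = split_migrated_state(stateids, is_valid_on_destination)
    return bool(migrated) and not lost
```

Anything that ends up in the lost list would be the only state that
needs recovery on the new server.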

> 3. If the session was migrated (which is not the case
> in the current example), how should the client determine
> it must perform BIND_CONN_TO_SESSION instead of
> CREATE_SESSION? (I guess just try BC2S first?)

I think your guess is right.
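
Something along these lines, as a sketch only. The two callables are
hypothetical stand-ins for the client's RPC paths; the error value is
the one defined in RFC 5661.

```python
NFS4_OK = 0
NFS4ERR_BADSESSION = 10052  # value from RFC 5661


def attach_to_destination(session_id, client_id,
                          bind_conn_to_session, create_session):
    """Try BIND_CONN_TO_SESSION first; fall back to CREATE_SESSION.

    bind_conn_to_session(session_id) returns an NFS status code.
    create_session(client_id) returns the id of the new session.
    """
    status = bind_conn_to_session(session_id)
    if status == NFS4_OK:
        # The session was migrated along with the rest of the state.
        return ("bound", session_id)
    if status == NFS4ERR_BADSESSION:
        # The destination does not know the session, so make a new one.
        return ("created", create_session(client_id))
    raise RuntimeError(f"unexpected BIND_CONN_TO_SESSION status {status}")
```

In the case Xuan tested, where the session was not migrated, this would
take the CREATE_SESSION branch while keeping the migrated clientid.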

> 4. If TSM is detected, the client performs no state
> recovery. Should it avoid sending
> RECLAIM_COMPLETE in this case?

I don't see why it should send RECLAIM_COMPLETE
in a situation in which there is no indication that
state was lost.  Do we understand why it was sent in this case?

On Wed, Mar 8, 2017 at 4:19 PM, Chuck Lever <chuck.lever@oracle.com> wrote:

> One of Oracle's engineers, Xuan Qi, has been experimenting
> with NFSv4.1 Transparent State Migration. The
> implementations under test are a Solaris-based prototype
> NFS server, and a recent vintage migration-capable Linux
> NFS client. Xuan observed the following:
>
> + While a simple workload is running, there is an FS
>   migration event from server A to server B. Transparent
>   State Migration is done as part of this migration. The
>   session that had been established between the client and
>   server A is NOT migrated.
>
> + The client has never interacted with server B before the
>   migration event. After the client retrieves fs_locations
>   from server A, it begins trunking discovery with server B.
>
> + In the initial EXCHANGE_ID request with server B, the
>   client presents the same client owner and boot verifier
>   it previously presented to server A.
>
> + In the initial EXCHANGE_ID reply, server B reports the
>   presence of the migrated lease by returning the same
>   clientID the client got from server A and asserting
>   EXCHGID4_FLAG_CONFIRMED_R.
>
> + The client thus believes this lease is an old one it
>   forgot to clean up. It purges the migrated lease by
>   sending a second EXCHANGE_ID with the same client owner
>   but an arbitrary boot verifier. Server B returns a
>   second distinct client ID.
>
> + The client issues a CREATE_SESSION with the second
>   client ID. This client ID and session is not used for
>   subsequent operations.
>
> + The client issues a DESTROY_SESSION and DESTROY_CLIENTID
>   to server A, using the original session and client ID.
>   How polite!
>
> + The client then establishes a fresh lease with server
>   B by sending a third EXCHANGE_ID with the same client
>   owner and the original boot verifier. Server B returns
>   a third distinct client ID. The client then uses this
>   new client ID for a CREATE_SESSION. This session is used
>   to issue RECLAIM_COMPLETE with no OPEN state recovery.
>
> + The first WRITE operation to server B fails with
>   NFS4ERR_BAD_STATEID. The client performs no-grace
>   stateid recovery, and the workload continues normally.
>
> + Both servers report the same server scope, but different
>   server owners. The scope string is
>
>     "Solaris NFSv4.1 Server Scope"
>
> RFC 5661 and draft-ietf-nfsv4-migration-issues are not
> clear about how servers and clients are supposed to
> interact after Transparent State Migration. For instance:
>
> 1. Since the client has had no previous contact with
> server B, should the migrated lease be considered
> confirmed or not confirmed on server B?
>
> 2. How should the client determine whether or not
> Transparent State Migration has occurred?
>
> 3. If the session was migrated (which is not the case
> in the current example), how should the client determine
> it must perform BIND_CONN_TO_SESSION instead of
> CREATE_SESSION? (I guess just try BC2S first?)
>
> 4. If TSM is detected, the client performs no state
> recovery. Should it avoid sending RECLAIM_COMPLETE in
> this case?
>
>
> --
> Chuck Lever
>