Re: [nfsv4] Preventing an NFSv4.1 client from destroying a migrated lease after TSM

David Noveck <davenoveck@gmail.com> Thu, 09 March 2017 10:52 UTC

In-Reply-To: <ED0D48EE-4618-4E07-B97F-8320C77CF1EC@oracle.com>
References: <ED0D48EE-4618-4E07-B97F-8320C77CF1EC@oracle.com>
From: David Noveck <davenoveck@gmail.com>
Date: Thu, 09 Mar 2017 05:52:40 -0500
Message-ID: <CADaq8jfHAF6+2AfuRGr2a=D0FX96YAVu=gGJTqGwSvhXCbntfg@mail.gmail.com>
To: Chuck Lever <chuck.lever@oracle.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/nfsv4/vlMpL9ekEyuBIb4qFYhADTPCAQU>
Cc: NFSv4 <nfsv4@ietf.org>
Subject: Re: [nfsv4] Preventing an NFSv4.1 client from destroying a migrated lease after TSM

> One of Oracle's engineers, Xuan Qi, has been experimenting
> with NFSv4.1 Transparent State Migration. The
> implementations under test are a Solaris-based prototype
> NFS server, and a recent vintage migration-capable Linux
> NFS client. Xuan observed the following:

> + While a simple workload is running, there is an FS
>  migration event from server A to server B. Transparent
>   State Migration is done as part of this migration. The
>   session that had been established between the client and
>   server A is NOT migrated.

> + The client has never interacted with server B before the
>   migration event.

In that case, there is no possibility that the clientid used to
access server A will be merged with one used to access
server B.  So it is likely that the clientid used to access
server A, together with the stateids used to access the
migrated fs, are now usable on B.

> After the client retrieves fs_locations
> from server A, it begins trunking discovery with server B.

I assume that it compares the server owners it got
from each server. Correct?

> + In the initial EXCHANGE_ID request with server B, the
>  client presents the same client owner and boot verifier
>  it previously presented to server A.

Makes sense.

> + In the initial EXCHANGE_ID reply, server B reports the
>  presence of the migrated lease by returning the same
>  clientID the client got from server A and asserting
>   EXCHGID4_FLAG_CONFIRMED_R.

> + The client thus believes this lease is an old one it
>  forgot to clean up.

I don't see why it should believe that.  It is aware of the
migration from A to B and thus the possibility that
a clientid has been migrated.  The fact that it is the
same id makes it even more likely that the lease is
migrated, although a coincidence is possible.
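
As a rough sketch of that reasoning, a client could classify the
EXCHANGE_ID reply along the following lines. This is only an
illustration under stated assumptions, not what the Linux client does:
the function, type, and enum names are hypothetical, and only the
EXCHGID4_FLAG_CONFIRMED_R value is taken from RFC 5661.

```python
from dataclasses import dataclass
from enum import Enum

EXCHGID4_FLAG_CONFIRMED_R = 0x80000000  # flag value defined in RFC 5661


class LeaseKind(Enum):
    # All names below are hypothetical labels for this sketch.
    NEW = "new"                                # not confirmed: go on to CREATE_SESSION
    MIGRATED_LIKELY = "migrated_likely"        # confirmed, same client ID as on the source
    MIGRATED_POSSIBLE = "migrated_possible"    # confirmed during a migration, different ID
    UNEXPECTED_CONFIRMED = "unexpected"        # confirmed, but no migration is in progress


@dataclass
class ExchangeIdReply:
    clientid: int
    flags: int


def classify_lease(reply: ExchangeIdReply, migrating: bool,
                   source_clientid: int) -> LeaseKind:
    """Decide how to treat the lease reported by the destination server."""
    confirmed = bool(reply.flags & EXCHGID4_FLAG_CONFIRMED_R)
    if not confirmed:
        return LeaseKind.NEW
    if migrating:
        # The client knows a migration is under way, so a confirmed lease
        # may well be a migrated one. A matching client ID makes that more
        # likely, although a coincidence is possible.
        if reply.clientid == source_clientid:
            return LeaseKind.MIGRATED_LIKELY
        return LeaseKind.MIGRATED_POSSIBLE
    return LeaseKind.UNEXPECTED_CONFIRMED
```

The point of the sketch is that neither migrated case leads to purging
the lease with a new boot verifier.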

> It purges the migrated lease by
>  sending a second EXCHANGE_ID with the same client owner
>  but an arbitrary boot verifier.

OUCH!! That sounds to me like a client bug.

> Server B returns a
>  second distinct client ID.

Which is basically ignored, I take it.

> + The client issues a CREATE_SESSION with the second
>  client ID. This client ID and session is not used for
>  subsequent operations.

OK.

> + The client issues a DESTROY_SESSION and DESTROY_CLIENTID
>  to server A, using the original session and client ID.
>  How polite!

Would it still do that if there was another non-migrated fs on A using
the same session?  I hope not.

> + The client then establishes a fresh lease with server
>  B by sending a third EXCHANGE_ID with the same client
>  owner and the original boot verifier. Server B returns
>  a third distinct client ID.

This clientid will have no stateids associated with it.

> The client then uses this
>   new client ID for a CREATE_SESSION. This session is used
>   to issue RECLAIM_COMPLETE with no OPEN state recovery.

Not clear why it is doing this, but it is not relevant.  The
damage was done when the existing migrated lock state
was thrown away.

> + The first WRITE operation to server B fails with
>  NFS4ERR_BAD_STATEID.

Not surprising, since the session is associated with a client for
which there are no valid stateids.

I don't understand why it thinks that stateid was valid, since it
previously invalidated it by doing an EXCHANGE_ID with a new boot
verifier.

> The client performs no-grace
> stateid recovery, and the workload continues normally.

Sigh!

> + Both servers report the same server scope, but different
>  server owners. The scope string is
>
>     "Solaris NFSv4.1 Server Scope"

> RFC 5661 and draft-ietf-nfsv4-migration-issues are not
> clear about how servers and clients are supposed to
> interact after Transparent State Migration.

Probably not.  The basic assumption is that there needs
to be an update to bring rfc5661 into line with rfc7931, but
that hasn't been done yet.

With regard to this specific situation, it doesn't seem that anything
4.1-specific has gone on.  In particular, I haven't noticed any
particular consequences due to sessions or server name or server scope.
There has been an instance of RECLAIM_COMPLETE but by that point, the
damage had been done. It looks to me that if this had been handled
essentially as it would have been in 4.0, things would have worked
better. Maybe I've missed something.

>  For instance:

> 1. Since the client has had no previous contact with
> server B, should the migrated lease be considered
> confirmed or not confirmed on server B?

I'd say that if it is confirmed on A and migrated to B,
then it should be confirmed on B as well.

> 2. How should the client determine whether or not
> Transparent State Migration has occurred?

I think one should follow the approach in rfc7931.  If the stateids
are valid on the new server, then transparent state
migration has occurred.
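
A minimal sketch of that test, assuming the client has some way to ask
whether a stateid is accepted by the new server. The callback
is_valid_on_new_server is hypothetical; in 4.1 it might be built on
TEST_STATEID, or simply on the result of the first operation that uses
the stateid.

```python
from typing import Callable, Iterable


def tsm_occurred(stateids: Iterable[bytes],
                 is_valid_on_new_server: Callable[[bytes], bool]) -> bool:
    """Return True if every stateid held for the migrated fs is accepted.

    Assumption for this sketch: with no stateids at all there is nothing
    to conclude, so the function returns False and the caller falls back
    to its normal recovery path.
    """
    ids = list(stateids)
    return bool(ids) and all(is_valid_on_new_server(s) for s in ids)
```

Treating a single rejected stateid as "no TSM" is a conservative choice
made here for simplicity; a real client might instead recover only the
rejected state.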

> 3. If the session was migrated (which is not the case
> in the current example), how should the client determine
> it must perform BIND_CONN_TO_SESSION instead of
> CREATE_SESSION? (I guess just try BC2S first?)

I think your guess is right.
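
The "try BC2S first" idea could look roughly like this. It is a sketch
only: the two callables stand in for whatever sends
BIND_CONN_TO_SESSION and CREATE_SESSION, the BadSession exception is a
hypothetical wrapper, and it assumes the destination server answers
NFS4ERR_BADSESSION when the session was not migrated.

```python
from typing import Callable, Tuple


class BadSession(Exception):
    """Hypothetical stand-in for an NFS4ERR_BADSESSION reply."""


def establish_session(bind_conn_to_session: Callable[[bytes], None],
                      create_session: Callable[[], bytes],
                      old_sessionid: bytes) -> Tuple[bytes, bool]:
    """Return (session ID to use, whether the old session was migrated)."""
    try:
        # If the session moved with the lease, binding the new
        # connection to it succeeds and no new session is needed.
        bind_conn_to_session(old_sessionid)
        return old_sessionid, True
    except BadSession:
        # The session is unknown on the destination: create a fresh one.
        return create_session(), False
```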

> 4. If TSM is detected, the client performs no state
> recovery. Should it avoid sending
> RECLAIM_COMPLETE in this case?

I don't see why it should send RECLAIM_COMPLETE
in a situation in which there is no indication that
state was lost.  Do we understand why it was sent in this case?

On Wed, Mar 8, 2017 at 4:19 PM, Chuck Lever <chuck.lever@oracle.com> wrote:

> One of Oracle's engineers, Xuan Qi, has been experimenting
> with NFSv4.1 Transparent State Migration. The
> implementations under test are a Solaris-based prototype
> NFS server, and a recent vintage migration-capable Linux
> NFS client. Xuan observed the following:
>
> + While a simple workload is running, there is an FS
>   migration event from server A to server B. Transparent
>   State Migration is done as part of this migration. The
>   session that had been established between the client and
>   server A is NOT migrated.
>
> + The client has never interacted with server B before the
>   migration event. After the client retrieves fs_locations
>   from server A, it begins trunking discovery with server B.
>
> + In the initial EXCHANGE_ID request with server B, the
>   client presents the same client owner and boot verifier
>   it previously presented to server A.
>
> + In the initial EXCHANGE_ID reply, server B reports the
>   presence of the migrated lease by returning the same
>   clientID the client got from server A and asserting
>   EXCHGID4_FLAG_CONFIRMED_R.
>
> + The client thus believes this lease is an old one it
>   forgot to clean up. It purges the migrated lease by
>   sending a second EXCHANGE_ID with the same client owner
>   but an arbitrary boot verifier. Server B returns a
>   second distinct client ID.
>
> + The client issues a CREATE_SESSION with the second
>   client ID. This client ID and session is not used for
>   subsequent operations.
>
> + The client issues a DESTROY_SESSION and DESTROY_CLIENTID
>   to server A, using the original session and client ID.
>   How polite!
>
> + The client then establishes a fresh lease with server
>   B by sending a third EXCHANGE_ID with the same client
>   owner and the original boot verifier. Server B returns
>   a third distinct client ID. The client then uses this
>   new client ID for a CREATE_SESSION. This session is used
>   to issue RECLAIM_COMPLETE with no OPEN state recovery.
>
> + The first WRITE operation to server B fails with
>   NFS4ERR_BAD_STATEID. The client performs no-grace
>   stateid recovery, and the workload continues normally.
>
> + Both servers report the same server scope, but different
>   server owners. The scope string is
>
>     "Solaris NFSv4.1 Server Scope"
>
> RFC 5661 and draft-ietf-nfsv4-migration-issues are not
> clear about how servers and clients are supposed to
> interact after Transparent State Migration. For instance:
>
> 1. Since the client has had no previous contact with
> server B, should the migrated lease be considered
> confirmed or not confirmed on server B?
>
> 2. How should the client determine whether or not
> Transparent State Migration has occurred?
>
> 3. If the session was migrated (which is not the case
> in the current example), how should the client determine
> it must perform BIND_CONN_TO_SESSION instead of
> CREATE_SESSION? (I guess just try BC2S first?)
>
> 4. If TSM is detected, the client performs no state
> recovery. Should it avoid sending RECLAIM_COMPLETE in
> this case?
>
>
> --
> Chuck Lever