Re: [nfsv4] Preventing an NFSv4.1 client from destroying a migrated lease after TSM

David Noveck <davenoveck@gmail.com> Tue, 18 April 2017 19:30 UTC

In-Reply-To: <213DBA50-FB48-4B48-A177-22C2B6879D49@oracle.com>
References: <ED0D48EE-4618-4E07-B97F-8320C77CF1EC@oracle.com> <CADaq8jfHAF6+2AfuRGr2a=D0FX96YAVu=gGJTqGwSvhXCbntfg@mail.gmail.com> <BADB9283-3B08-49DD-A2E2-5C0C20054C45@oracle.com> <CADaq8jfhxcGqJd0U40JkyT9eT+hj24dovMmqFUtkSic_b08jDg@mail.gmail.com> <213DBA50-FB48-4B48-A177-22C2B6879D49@oracle.com>
From: David Noveck <davenoveck@gmail.com>
Date: Tue, 18 Apr 2017 15:30:37 -0400
Message-ID: <CADaq8jeuPQfUNM-QH2jEKuMRSjsEwWQU_0ZBVJH0KDU7-vipmQ@mail.gmail.com>
To: Chuck Lever <chuck.lever@oracle.com>
Cc: NFSv4 <nfsv4@ietf.org>
Content-Type: multipart/alternative; boundary="001a113ed55a464dab054d75f136"
Archived-At: <https://mailarchive.ietf.org/arch/msg/nfsv4/88d4SAnTO6kdM904zc8fHO47GvE>
Subject: Re: [nfsv4] Preventing an NFSv4.1 client from destroying a migrated lease after TSM

To avoid burying the lead, I'll respond waaay out of order:

> In sum, here are some options we've considered:

> 1a. destination asserts CONF_R, but client uses the
> returned contrived slot sequence anyway

Clearly, migration-issues-13 needs to take account of this issue.
I believe 1a is the best choice and will explain why below.

> We've explored this mechanism a little more, and now we seem
> to be hamstrung on the first paragraph of RFC 5661, Section
> 18.35.3:

Like much of RFCs 5661 and 7530, the paragraph in question
was written without any consideration of issues related to
transparent state migration.  Sigh!

I think I have not been as explicit about this as I should have been,
but as we have started to address a number of migration-related issues for
NFSv4.1, it appears that we will need a standards-track document,
updating RFC5661, paralleling RFC7931 for NFSv4.0.

RFC7931 had a replacement SETCLIENTID description and it now
appears that the new document will need a replacement for the
EXCHANGE_ID description.  Coincidence? ...  or something
else? :-)

>   The client uses the EXCHANGE_ID operation to register a particular
>   client owner with the server.

However, when the client_owner has already been registered
by other means (e.g. transparent state migration), the client may still
use EXCHANGE_ID to obtain the clientid assigned previously.

>  The client ID returned from this
>   operation will be necessary for requests that create state on the
>   server and will serve as a parent object to sessions created by the
>   client.

OK.

>   In order to confirm the client ID it must first be used,
>   along with the returned eir_sequenceid, as arguments to
>   CREATE_SESSION.

In situations in which the registration of the client_owner has not occurred
previously ...

> If the flag EXCHGID4_FLAG_CONFIRMED_R is set in the
> result, eir_flags, then eir_sequenceid MUST be ignored, as it has no
>  relevancy.

It is not central to my argument, but I believe that the "MUST" is not in
accord with RFC2119.  The point here is that if a CREATE_SESSION
has been done, which this text assumes is the case, that has already
established the proper sequence id to use.  This is explaining how the
protocol works and not establishing an interoperability REQUIREMENT.

Anyway, I would go on:

If the flag EXCHGID4_FLAG_CONFIRMED_R is set in the
result, eir_flags, then it is an indication that the registration
of the client_owner has already occurred and that a further
CREATE_SESSION is not needed to confirm it.  Of course,
subsequent CREATE_SESSIONs may be needed for other
reasons.

The value eir_sequenceid is used to establish an initial
sequence value associated with the clientid returned.  In cases
in which a CREATE_SESSION has already been done, there
is no need for this value, since sequencing of such requests has
already been established, and the client will ignore it.
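
To make the intended client behavior concrete, here is a minimal Python
sketch of how a client might pick the sequence id for its next
CREATE_SESSION under the text proposed above (and under option 1a).  The
flag value is the one I recall from the RFC 5661 XDR; ExchangeIdResult,
choose_create_session_seqid and known_seqid are hypothetical names made up
for the illustration, not taken from any implementation.

```python
from dataclasses import dataclass
from typing import Optional

# Flag value as recalled from the RFC 5661 XDR; verify before relying on it.
EXCHGID4_FLAG_CONFIRMED_R = 0x80000000


@dataclass
class ExchangeIdResult:
    """Hypothetical subset of the EXCHANGE_ID result fields."""
    eir_clientid: int
    eir_sequenceid: int
    eir_flags: int


def choose_create_session_seqid(res: ExchangeIdResult,
                                known_seqid: Optional[int]) -> int:
    """Pick the sequence id for the next CREATE_SESSION.

    known_seqid is the value the client already established with this
    server through an earlier CREATE_SESSION, or None if it never did one
    there (for example, because the lease arrived by migration).
    """
    confirmed = bool(res.eir_flags & EXCHGID4_FLAG_CONFIRMED_R)
    if confirmed and known_seqid is not None:
        # Sequencing was already established; eir_sequenceid is not needed.
        return known_seqid
    # Either the client ID is unconfirmed, or it was confirmed by other
    # means and the client has no sequence of its own: use the returned one.
    return res.eir_sequenceid
```

The only change from current client behavior would be the last branch: a
client that sees CONFIRMED_R without having done a CREATE_SESSION on that
server uses eir_sequenceid rather than purging the lease.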

> The problem is the normative requirement in the last
> sentence.

I agree that this REQUIREMENT is a problem.

> If CONFIRMED_R is set, a client is required
> to ignore the returned contrived slot sequence number.

According to RFC5661, yes.

> [ This is likely the reason that clients react to this
> flag by purging the lease and starting over ].

I believe that the reason is that if the client thinks/knows a
clientid is unconfirmed and he finds out that the server thinks
otherwise, he figures something is *really* messed up, in which
case punting like this is to be preferred to a panic :-(

> Our first thought was to use the sequence number in the
> migrated lease (in other words, the sequence number that
> was established on the source server). A client knows
> that sequence number by looking at the lease it has with
> the source server.

In this case, you are moving over the single-slot
quasi-session associated with the client.

> However, there's no guarantee that the client will not
> perform additional CREATE_SESSION operations against
> the source server between the time the source server
> copies the migrating lease and the client starts trunking
> discovery for the destination server.

True.  The client can hack around this by starting with
the last sequence he has and backing up if it is too
high.  I'd rather not go there, though.
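
For what it's worth, the hack would look roughly like the following Python
sketch.  Everything here is an assumption for illustration: the fake
destination only compares against one expected value, the error code is
the one I recall from RFC 5661, and replay-cache handling and sequence
wraparound are ignored.  make_fake_destination and
probe_create_session_seqid are hypothetical names.

```python
from typing import Callable, List, Optional

NFS4_OK = 0
# Error code as recalled from RFC 5661; verify before relying on it.
NFS4ERR_SEQ_MISORDERED = 10063


def make_fake_destination(expected_seqid: int,
                          attempts: List[int]) -> Callable[[int], int]:
    """Stand-in for CREATE_SESSION on the destination (illustration only)."""
    def create_session(seqid: int) -> int:
        attempts.append(seqid)
        if seqid == expected_seqid:
            return NFS4_OK
        return NFS4ERR_SEQ_MISORDERED
    return create_session


def probe_create_session_seqid(create_session: Callable[[int], int],
                               last_known_seqid: int,
                               max_steps: int = 16) -> Optional[int]:
    """Start at the last sequence id used with the source and back up."""
    for step in range(max_steps + 1):
        seqid = last_known_seqid - step
        if seqid < 0:
            break  # wraparound is deliberately not handled in this sketch
        status = create_session(seqid)
        if status == NFS4_OK:
            return seqid
        if status != NFS4ERR_SEQ_MISORDERED:
            raise RuntimeError(f"unexpected status {status}")
    return None
```

Each miss costs a round trip, and the client has no principled bound on
how far to back up, which is a good part of why I'd rather not go there.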

> Another thought was to use a well-known constant, like
> zero, when creating the first session on the destination
> server.

That is probably the best option if you are committed to respecting
the language of the current RFC5661 requirement.  However,
you would still need to revise the EXCHANGE_ID description
to make it coherent, in any case.

> We've established that trunking discovery works correctly
> when the destination server does not assert CONFIRMED_R.

:-)

> This is because the client may use the contrived slot
> sequence number provided by the destination server.

It's because he does use that slot sequence number.  To me, that is
an additional reason he should be allowed to do so.

> If the destination server does not assert CONFIRMED_R,
> then how does the client determine whether Transparent
> State Migration has occurred? Can it simply start using
> the open and lock state it has, and deal with BAD_STATEID
> if the servers did not perform TSM?

I think that may be possible.  However, it is ugly.  If you have another
way to find out that state was transferred, you can look at the status bits
from SEQUENCE, and only worry about lost stateids when there is an
indication that some state was lost.
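
A minimal Python sketch of that check follows.  The flag values are the
SEQ4_STATUS_* bits as I recall them from the RFC 5661 XDR, and the choice
of which bits count as "some state was lost" is my assumption, not
something the RFC or this thread pins down.  state_possibly_lost is a
hypothetical helper name.

```python
# SEQ4_STATUS_* values as recalled from the RFC 5661 XDR; verify them.
SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED = 0x00000008
SEQ4_STATUS_EXPIRED_SOME_STATE_REVOKED = 0x00000010
SEQ4_STATUS_ADMIN_STATE_REVOKED = 0x00000020
SEQ4_STATUS_RECALLABLE_STATE_REVOKED = 0x00000040

# Assumption: these are the bits a client treats as "state may be gone".
STATE_LOSS_MASK = (SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED
                   | SEQ4_STATUS_EXPIRED_SOME_STATE_REVOKED
                   | SEQ4_STATUS_ADMIN_STATE_REVOKED
                   | SEQ4_STATUS_RECALLABLE_STATE_REVOKED)


def state_possibly_lost(sr_status_flags: int) -> bool:
    """Return True if the SEQUENCE reply hints that some state was lost."""
    return bool(sr_status_flags & STATE_LOSS_MASK)
```

A client would only go looking for bad stateids when this returns True,
rather than waiting to trip over BAD_STATEID on each open or lock.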

> Confirmation of the lease means more than just that it
> is present on the queried server

Confirmation means that the client is not uncertain about
whether it exists.  Confirmation by means of CREATE_SESSION
is one way to do that, by making sure that the EXCHANGE_ID was
not a random retry, since abandoned.

> it also means there is
> now a cached CREATE_SESSION reply, and a known good
> contrived slot sequence number.

In many situations, that is a reasonable inference but it doesn't
change the meaning of "confirmation".

> Neither of those is true
> for a migrated lease when the client's sessions were not
> also migrated.

Yes, but I'm not certain that both of these are true in the case in which
the client's sessions are migrated.

> Thus we should consider that a migrated lease is really
> not confirmed at all,

I consider it confirmed, but confirmed in a different manner.

> but in some intermediate state.

If it were an intermediate state, there would be some way to get into
what you call the confirmed state (and I would call the
confirmed-by-CREATE_SESSION state).  You could do that by doing
a CREATE_SESSION if you knew the sequenceid, but I don't see the
point in explicating the fine structure of the various confirmed clientid
sub-states.

> A more ideal solution would be to create an additional
> EXCHID4_FLAG that signifies the existence of a migrated
> lease for the querying client.

If you were starting today (or had a time machine) you might do it that way.


> In sum, here are some options we've considered:

> 1a. destination asserts CONF_R, but client uses the
> returned contrived slot sequence anyway


My first choice.

> 1b. destination asserts CONF_R, and client uses a fixed
> constant starting slot sequence

My second choice.

> 2. destination does not assert CONF_R, and client is
> prepared for BAD_STATEID if TSM did not occur

A real drag.

> 3. destination asserts a new EXCHID4_FLAG that signifies
> a TSM has made the client's lease available on that
> server

Unfortunately, NFSv4.1 is not an extensible minor version.

> 4. NFSv4.1 TSM cannot occur unless the client's sessions
> are also migrated


> Are there others?

There are others but none come to mind right now.

On Mon, Apr 17, 2017 at 2:45 PM, Chuck Lever <chuck.lever@oracle.com> wrote:

>
> > On Mar 9, 2017, at 2:00 PM, David Noveck <davenoveck@gmail.com> wrote:
> >
> > > I'm not certain what you mean by "handled ... as it
> > > would have been in 4.0". Server trunking discovery in
> > > NFSv4.0 requires an elaborate dance. In NFSv4.1,
> > > trunking discovery can be done by a single operation
> > > (EXCHANGE_ID).
> >
> > What I meant was that, apart from the greater
> > calories expended in v4.0 case, the result should have
> > been the same, i.e. that there is no trunking.
> >
> > > I'm speculating that if the server hadn't asserted
> > > CONFIRMED_R, this client could have recovered
> > > correctly from the migration event. Is that what you
> > > mean?
> >
> > It wasn't what I meant, but, now that you say it, I think is
> > probably right.
>
> We've explored this mechanism a little more, and now we seem
> to be hamstrung on the first paragraph of RFC 5661, Section
> 18.35.3:
>
>    The client uses the EXCHANGE_ID operation to register a particular
>    client owner with the server.  The client ID returned from this
>    operation will be necessary for requests that create state on the
>    server and will serve as a parent object to sessions created by the
>    client.  In order to confirm the client ID it must first be used,
>    along with the returned eir_sequenceid, as arguments to
>    CREATE_SESSION.  If the flag EXCHGID4_FLAG_CONFIRMED_R is set in the
>    result, eir_flags, then eir_sequenceid MUST be ignored, as it has no
>    relevancy.
>
> The problem is the normative requirement in the last
> sentence. If CONFIRMED_R is set, a client is required
> to ignore the returned contrived slot sequence number.
> [ This is likely the reason that clients react to this
> flag by purging the lease and starting over ].
>
> Our first thought was to use the sequence number in the
> migrated lease (in other words, the sequence number that
> was established on the source server). A client knows
> that sequence number by looking at the lease it has with
> the source server.
>
> However, there's no guarantee that the client will not
> perform additional CREATE_SESSION operations against
> the source server between the time the source server
> copies the migrating lease and the client starts trunking
> discovery for the destination server.
>
> Another thought was to use a well-known constant, like
> zero, when creating the first session on the destination
> server.
>
> What sequence number should be used for the first
> CREATE_SESSSION with the destination server? If it is
> the copied sequence number, what can prevent mutation
> of that sequence number while migration of that lease
> is not yet complete?
>
> We've established that trunking discovery works correctly
> when the destination server does not assert CONFIRMED_R.
> This is because the client may use the contrived slot
> sequence number provided by the destination server.
>
> If the destination server does not assert CONFIRMED_R,
> then how does the client determine whether Transparent
> State Migration has occurred? Can it simply start using
> the open and lock state it has, and deal with BAD_STATEID
> if the servers did not perform TSM?
>
>
> Confirmation of the lease means more than just that it
> is present on the queried server: it also means there is
> now a cached CREATE_SESSION reply, and a known good
> contrived slot sequence number. Neither of those is true
> for a migrated lease when the client's sessions were not
> also migrated.
>
> Thus we should consider that a migrated lease is really
> not confirmed at all, but in some intermediate state.
> A more ideal solution would be to create an additional
> EXCHID4_FLAG that signifies the existence of a migrated
> lease for the querying client.
>
>
> In sum, here are some options we've considered:
>
> 1a. destination asserts CONF_R, but client uses the
> returned contrived slot sequence anyway
>
> 1b. destination asserts CONF_R, and client uses a fixed
> constant starting slot sequence
>
> 2. destination does not assert CONF_R, and client is
> prepared for BAD_STATEID if TSM did not occur
>
> 3. destination asserts a new EXCHID4_FLAG that signifies
> a TSM has made the client's lease available on that
> server
>
> 4. NFSv4.1 TSM cannot occur unless the client's sessions
> are also migrated
>
> Are there others?
>
> --
> Chuck Lever
>
>
>
>