Re: [nfsv4] Draft is updated following your comments, Thanks // Reply: draft-mzhang-nfsv4-sequence-id-calibration-01

Rick Macklem <rick.macklem@gmail.com> Fri, 21 April 2023 14:28 UTC

From: Rick Macklem <rick.macklem@gmail.com>
Date: Fri, 21 Apr 2023 07:28:13 -0700
Message-ID: <CAM5tNy7Rnv_K=JPGRkabCiMdHn=_tmC=RgpaUcZwJj+QNyjqiQ@mail.gmail.com>
To: "yangjing (U)" <yangjing8@huawei.com>
Cc: "nfsv4@ietf.org" <nfsv4@ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/nfsv4/RgC1aAKJCpBDYJrTiw6JrBVYogY>

On Thu, Apr 20, 2023 at 8:08 PM yangjing (U) <yangjing8@huawei.com> wrote:
>
> Hi Rick, thanks for your comments, and sorry for the late response ^_^
>
>
> On Mon, Apr 10, 2023 at 2:20 PM Rick Macklem <rick.macklem@gmail.com> wrote:
> >
> > On Mon, Apr 10, 2023 at 1:58 PM Rick Macklem <rick.macklem@gmail.com> wrote:
> > >
> > > On Mon, Apr 10, 2023 at 1:15 PM Rick Macklem <rick.macklem@gmail.com> wrote:
> > > >
> > > > On Sun, Apr 9, 2023 at 11:39 PM yangjing (U)
> > > > <yangjing8=40huawei.com@dmarc.ietf.org> wrote:
> > > > >
> > > > > The draft is updated as
> > > > > https://www.ietf.org/archive/id/draft-mzhang-nfsv4-sequence-id-calibration-02.txt
> > > > Here are a few generic comments about the draft (I know nothing
> > > > about detailed formatting, etc).
> > > >
> > > > 1 - It is my understanding that new operations can only be added
> > > > to minor version 2 and not
> > > >      minor version 1.
>
> I agree with you, if new operations can only be added to minor version 2.
>
> > > > 2 - I largely agree with what Tom Talpey said in his comments. I
> > > > think you need to figure out
> > > >      how/when the sequenceid is getting messed up.
> > > >      Note that it is my understanding that the main purpose of
> > > > sessions is to achieve
> > > >      "exactly once RPC semantics" (the fact that the reply cache
> > > > is bounded in size is a nice
> > > >      added feature, but not the principal purpose).
> > > >      As such, a sequenceid being misordered implies that "exactly
> > > > once" semantics could
> > > >      already have been broken by a software bug in the client.
> > > >
> > > >      You note that the misordered sequenceid could be caused by
> > > > "network partitioning".
> > > >      In a correct implementation that should not be the case.
> > > > After a network partition heals,
> > > >      the client should re-send all outstanding RPCs with the same
> > > > Session/slot/sequenceid
> > > >      as was used in the original RPC request message. These should
> > > > succeed (no misordered
> > > >      error reply) using cached replies, as required.
> > > >      --> If a client's recovery from network partitioning is not
> > > > doing this, then it needs to be
> > > >            fixed.
> > > >      --> As above, if a misordered sequenceid error reply is due
> > > > to a client bug, then that
> > > >            must be fixed, since "exactly once" semantics may already be broken.
> > > > The common client bug that results in a misordered sequenceid
> > > > error reply happens when an in-progress RPC gets interrupted
> > > > before the RPC reply is processed.
>
> You are right. Thanks for your analysis of when a misordered sequenceid happens. But this draft does not care how and when it happens, the same as the protocol specification.
> We focus on how to deal with this error. If this error never happened, it would not have been defined, right? We did find that misordered sequenceids happen and I/O performance is affected
> because the session is re-established, but we are not sure why and how it happens, because no packets were captured at that moment.
>
> > > > This is not a case
> > > > where your suggested extension should be used, since the
> > > > in-progress RPC may still be "in-progress". Although I am not sure
> > > > there is a 100% correct way for a client to handle the need to
> > > > interrupt an in-progress RPC, marking the session slot as broken
> > > > and no longer using it is the best approach I am aware of. (I
> > > > suspect you might be referring to the Linux client and I cannot
> > > > comment on how it handles interruption via signals or timeouts for
> > > > session slots, since I am not a Linux client implementor.)
>
> Yes, when we found that I/O performance was affected, we checked the Linux client code and found that the client will drop the session; then we checked the NFS protocol specification....
> I don't think marking the session slot as broken is the best way. What comes next?
> First: leaving the slot broken. On one hand, with more and more slots broken, the concurrency
> of the session declines. On the other hand, is a new mechanism needed to maintain and notify slot status between client and server?
> Second: recovering the broken slot. That is what this draft is trying to do, I think.
But you cannot resynchronize the broken slot safely unless the RPC that
caused the sequenceid to become non-synchronized is an idempotent one
(one that can be done more than once on the server safely).
This is difficult (maybe impossible) to determine when the reply to a
subsequent RPC using the same session/slot is NFS4ERR_SEQ_MISORDERED.

To do so for all cases where an RPC reply of NFS4ERR_SEQ_MISORDERED
is received could result in a non-idempotent RPC being executed multiple
times on the server. This is a serious breakage and essentially defeats the
main purpose of using sessions.

You are correct that, once a slot is marked bad, performance degradation
is possible. However, NFS4ERR_SEQ_MISORDERED failures should be
rare (some might say that this error should never occur for correctly
implemented clients) and, as such, having several bad slots on a session
should also be a rare occurrence.
--> After some percentage of the slots are marked bad, the client will
     need to do a CreateSession and replace the session that has the bad
     slots with a new one.
     What percentage is a design tradeoff. It sounds like Linux chose
     "> 0%" (or the first bad slot, if you prefer). For FreeBSD, I chose
     "100%" (or all slots bad, if you prefer).
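[Editor's note: as a rough illustration of this tradeoff, here is a toy model in Python. It is not code from either client (both are kernel C); the function name and threshold parameter are invented for illustration.]

```python
# Toy model of the "how many bad slots before CreateSession" tradeoff.
# Linux effectively replaces the session at the first bad slot (> 0%);
# FreeBSD waits until every slot on the session is bad (100%).

def should_replace_session(num_slots, bad_slots, threshold):
    """Return True once the fraction of bad slots reaches the threshold."""
    return len(bad_slots) / num_slots >= threshold

# "Linux-like" policy: the first bad slot triggers a CreateSession.
assert should_replace_session(16, {3}, threshold=1 / 16)
# "FreeBSD-like" policy: only a fully bad session is replaced.
assert not should_replace_session(16, {3}, threshold=1.0)
assert should_replace_session(16, set(range(16)), threshold=1.0)
```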
>
> > > >
> > > > It sounds like migration might be an exception to the above, but
> > > > it should be addressed specifically (as Tom Talpey notes).
>
> For the migration scenario, we will analyze it later.
>
> > > >
> > > > 3 - If your extension were used by a client, how would the client
> > > > know whether or not
> > > >      there are still RPC(s) in-progress that have been assigned
> > > > that session/slot/sequenceid?
> > > An additional comment:
> > > - If the client knows that the outstanding RPC request on the
> > >   session/slot/sequenceid that got a misordered error reply
> > >   is one where "exactly once" semantics is not required,
> > >   then I can see that the client implementor might be able to
> > >    safely resynchronize the sequenceid.
> >Oops again. It would be the outstanding RPC request for which no reply has been received that causes the sequenceid to be misordered and that is the RPC that cannot require "exactly once" semantics.
>
> >rick
> > >   To do this, I do not think your extensions are needed.
> > >    If the client issues an RPC that consists of only the
> > >    Sequence operation repeatedly, with the sequenceid incremented
> > >    by one each attempt after waiting for a reply to the previous
> > >    attempt, it should get NFS_OK within a few attempts.
>
> The misordered sequenceid is not known. If the cached sequenceid is N, any value other than N and N+1 will be identified as misordered. The range of the sequenceid is (1, 2^32 - 1).
> So, I don't think attempting repeatedly is a better choice.
Well, here is the basic algorithm the client would use to maintain a slot's
sequenceid. (Note there should never be more than one RPC on the fly
for a given session/slot.)
- When an RPC reply is received from the server with NFS_OK for the
  Sequence operation, advance the sequenceid for the session/slot by 1.
As such, I do not see how a sequenceid can ever be out by more than 1
unless there is a coding bug. For the case of a coding bug, fix the bug.
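[Editor's note: the bookkeeping described above can be sketched as a toy state machine. This is a simplification for illustration, not kernel client code; the class and method names are invented. The error value 10063 is NFS4ERR_SEQ_MISORDERED from RFC 5661.]

```python
NFS_OK = 0
NFS4ERR_SEQ_MISORDERED = 10063  # RFC 5661 error value

class Slot:
    """Toy model of a session slot: at most one RPC in flight, and the
    client's sequenceid advances by exactly 1 only on an NFS_OK reply."""
    def __init__(self):
        self.seqid = 1        # first request on a slot uses sequenceid 1
        self.in_flight = False

    def send(self):
        assert not self.in_flight, "only one RPC in flight per slot"
        self.in_flight = True
        return self.seqid     # every retry of this RPC reuses this value

    def reply(self, status):
        self.in_flight = False
        if status == NFS_OK:
            self.seqid += 1   # the only way the client's seqid advances

slot = Slot()
assert slot.send() == 1
slot.reply(NFS_OK)
assert slot.send() == 2       # advanced by exactly 1, never more
```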

For example:
- Let's assume that the client is configured (not a good situation, but
  possibly unavoidable) so that a timeout or POSIX signal results in
  the client not waiting "forever" for an RPC reply.
- Let's also assume that the client becomes network partitioned from the server.

When this is the case, it is possible that a few (even many) RPCs will
be sent to the server with the same session/slot/sequenceid, one after
another, each timing out before an RPC reply is received. But note that
they all will have the same sequenceid.
--> When this happens, it is possible that one of these RPC requests
     will make it to the server, be processed, and advance the sequenceid
     by one.
     --> All subsequent RPCs with the same session/slot/sequenceid
         processed at the server will get a NFS4ERR_SEQ_MISORDERED reply.

Once the network partition is healed, the client will see a
NFS4ERR_SEQ_MISORDERED reply (plus possibly other delayed
replies for the same session/slot/sequenceid), but the sequenceid will only
be out by 1 compared with what the server's sequenceid is for the same
session/slot.
--> This implies that the first attempt of an RPC that consists of only
     the Sequence operation with sa_sequenceid set to N + 1
     (where N is what the client is currently using as the sequenceid for
      the session/slot) will succeed.
     --> Whether a subsequent attempt with the sequenceid set to N + 2
           is justified is debatable, but certainly doing more than that does
           not seem reasonable to me.
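[Editor's note: the resynchronization probe described above could look roughly like this. It is a hedged sketch, not client code; `send_sequence_only` is an invented stand-in for issuing a bare Sequence-only RPC.]

```python
NFS_OK = 0
NFS4ERR_SEQ_MISORDERED = 10063  # RFC 5661 error value

def resync_slot(client_seqid, send_sequence_only, max_probes=2):
    """Probe with Sequence-only RPCs, starting at N + 1. Per the analysis
    above, the server should normally be exactly one ahead, so the first
    probe is expected to succeed; going much beyond N + 2 is unreasonable.
    Returns the next usable sequenceid, or None (do a CreateSession)."""
    for i in range(1, max_probes + 1):
        candidate = client_seqid + i
        if send_sequence_only(candidate) == NFS_OK:
            return candidate + 1   # the next request uses the one after
    return None  # serious breakage: replace the session instead

# Toy server whose slot sequenceid is exactly one ahead of the client's.
server_expects = 6
def fake_server(sa_sequenceid):
    return NFS_OK if sa_sequenceid == server_expects else NFS4ERR_SEQ_MISORDERED

assert resync_slot(5, fake_server) == 7   # first probe (N + 1) succeeds
```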

If, due to some serious breakage, the above does not work, I would
say a CreateSession to replace the now badly broken session is in
order.

Note again that "this re-synchronization can only be safely done if the
first RPC for which no reply was received is an idempotent one", and that
is going to be difficult to determine when the NFS4ERR_SEQ_MISORDERED is
received for a subsequent RPC that uses the same session/slot.
--> If the client notes, when it "gives up on waiting for an RPC reply",
      that the slot is bad, and it sees that the sa_cachethis argument in
      the Sequence operation for the outstanding RPC request was set false
      (indicating it is an idempotent RPC), then I believe it could safely
      do a re-synchronization of the slot at that point in time.
      --> I chose to not do that and just mark the slot bad, but I can
           see an argument for doing so, since the RPC is known to be
           idempotent.
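[Editor's note: the give-up-time decision sketched above could be modeled as follows. This is toy code; `sa_cachethis` mirrors the Sequence argument, but the helper names and the dict-based slot are invented.]

```python
def on_give_up(slot, sa_cachethis, resync):
    """When the client stops waiting for a reply on this slot:
    if sa_cachethis was false, the outstanding RPC was idempotent and a
    sequenceid re-synchronization is safe; otherwise the only safe
    choice is to mark the slot bad and stop using it."""
    if not sa_cachethis:
        resync(slot)          # safe: re-execution cannot do harm
    else:
        slot["bad"] = True    # FreeBSD's choice: mark bad, never resync

slot = {"bad": False}
on_give_up(slot, sa_cachethis=True, resync=lambda s: None)
assert slot["bad"] is True    # non-idempotent RPC: slot retired
```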

rick

>
> > >    Then the correct sequenceid to be used for the session/slot is known.
> > Oops, "known" is probably too strong a word here. Since the cause of
> > the misordered sequenceid is not known, I think there is a very low
> > probability that an RPC using the session/slot is still in flight and
> > will change the sequenceid when processed on the server, causing
> > another misordered sequenceid error reply.
> >
>
> I agree with you on "there may still be RPC(s) in-progress that have been assigned that session/slot/sequenceid" when the misordered reply reaches the client.
> I will update the draft to specify that the client should wait for the replies to all in-flight requests, then query the correct sequenceid.
>
>
> > rick
> >
> > >
> > > rick
> > >
> > > >
> > > > rick
> > > >
>
>
> Thank you Rick
>
> Best Regards
> Jing