[nfsv4] The nfs-ganesha server keeps responding with NFS4ERR_SEQ_MISORDERED, and the client's business is stuck and cannot be recovered.

飞虎 郑 <zhengfeihu1@outlook.com> Tue, 23 May 2023 23:03 UTC

From: 飞虎 郑 <zhengfeihu1@outlook.com>
To: "nfsv4@ietf.org" <nfsv4@ietf.org>
Date: Tue, 23 May 2023 23:03:06 +0000
Message-ID: <SEZPR02MB57587890A46C9E0F8821EFE3E4409@SEZPR02MB5758.apcprd02.prod.outlook.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/nfsv4/v2BNn52EAsTmmzli0awPZFnmjss>

Issue:
The nfs-ganesha server keeps responding with NFS4ERR_SEQ_MISORDERED, and the client's workload hangs and cannot recover.


Background:
We use nfs-ganesha as the server and the Linux kernel NFS client.
We ran the following test:
The NFS server is a three-node cluster, and the client reaches it through a virtual IP. After mounting the export, the client read and wrote files on the NFS server with vdbench.
On one NFS server node, a script simulated a memory fault: memory usage keeps growing, and the node is hard-rebooted once usage exceeds a configured threshold. When the node reboots, the virtual IP migrates to another node. After the reboot, the client's I/O throughput stayed at 0, and any access to the mount point hung as well. A Wireshark capture showed the client repeatedly sending compounds such as SEQUENCE, PUTFH, WRITE, GETATTR (and other request types) to the server, while the server answered every one with NFS4ERR_SEQ_MISORDERED. We have run this memory fault test many times, but the problem occurred only in this one run.
Client kernel version: 4.18.0-193.14.2.el8_2.x86_64 (mockbuild@10-75-9-128).


Analysis:
To check whether an NFS4ERR_SEQ_MISORDERED reply from the server is enough to make the client's throughput drop to 0, I modified the ganesha code. At the point where the server stores the sequence_id for a given session_id and slot_id, I added 10 to the cached sequence_id to simulate a sudden jump in the request sequence. The jump is applied only once; the sequence_id values of other slot_ids are not changed afterwards.

After the jump, the client's throughput dropped to 0 and did not recover. With ganesha's logging enabled dynamically and a Wireshark capture running, I could see ganesha replying NFS4ERR_SEQ_MISORDERED to the client. The client then sent a GETATTR compound (SEQUENCE, PUTFH, GETATTR) on a new slot_id, which the server processed normally and answered. At the same time, the client kept sending a LOOKUP compound (SEQUENCE, PUTFH, LOOKUP, GETFH, GETATTR) on the old slot_id with the old sequence number, and ganesha replied NFS4ERR_SEQ_MISORDERED each time because the SEQUENCE sequence number did not match what it expected.

So it appears that the client keeps reusing the sequence number that belongs to the old slot_id, the server keeps replying NFS4ERR_SEQ_MISORDERED, and the client's throughput stays at 0. I noticed that when the sequence numbers get out of order, there is no interface for the client to query the server's current sequence number. If the server's sequence number has jumped, the client cannot learn the new value, so retrying indefinitely seems pointless.

I also modified the ganesha code so that, if the sequence number on a slot stays out of order for a period of time, ganesha replies NFS4ERR_BADSESSION. With that change, or if the sequence number returns to normal, the client's workload recovers and completes.
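
To make the failure mode concrete, here is a minimal sketch of the per-slot sequence check and the one-time jump, as I understand the NFSv4.1 slot rules. It is not ganesha's code: the names Slot and check_sequence are hypothetical, replies are plain strings, and the client is modelled as simply retrying the same sequence number, which is what the capture showed.

```python
from dataclasses import dataclass

SEQID_MODULUS = 2**32  # sequence IDs are 32-bit and wrap around


@dataclass
class Slot:
    """Hypothetical model of one server-side session slot."""
    cached_seqid: int = 0
    # Assumed initial cached reply for a slot that has not executed anything yet.
    cached_reply: str = "NFS4ERR_SEQ_MISORDERED"


def check_sequence(slot: Slot, seqid: int) -> str:
    """Classify a SEQUENCE request against the slot's cached sequence ID."""
    expected = (slot.cached_seqid + 1) % SEQID_MODULUS
    if seqid == expected:
        # New request: execute it and remember the reply for retransmissions.
        slot.cached_seqid = seqid
        slot.cached_reply = "NFS4_OK"
        return "NFS4_OK"
    if seqid == slot.cached_seqid:
        # Retransmission of the last request: answer from the reply cache.
        return slot.cached_reply
    # Anything else is out of order.
    return "NFS4ERR_SEQ_MISORDERED"


# Normal traffic on one slot.
slot = Slot()
client_seqid = 0
normal_replies = []
for _ in range(3):
    client_seqid += 1
    normal_replies.append(check_sequence(slot, client_seqid))

# One-time jump of the server's cached value, as in my modified build.
slot.cached_seqid = (slot.cached_seqid + 10) % SEQID_MODULUS

# The client moves to its next sequence number and keeps retrying it.
client_seqid += 1
stuck_replies = [check_sequence(slot, client_seqid) for _ in range(5)]

# A different slot is unaffected by the jump.
other_slot_reply = check_sequence(Slot(), 1)
```

In this model the retries can never succeed: the client sends 4 while the server expects 14, and nothing in the error reply tells the client what the server expects.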

If nfs-ganesha replies with NFS4ERR_BADSESSION once the sequence numbers on a slot are persistently out of order, does that deviate from the protocol, and is it an appropriate fix? I have also reported this issue to the nfs-ganesha community: https://github.com/nfs-ganesha/nfs-ganesha/issues/941
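
For discussion, the behaviour I am asking about could be sketched roughly as follows. This is only an illustration under assumptions, not the actual ganesha patch: the names are hypothetical, and I use a count of consecutive misordered requests (MISORDER_LIMIT) where my change uses a time period, because a count is easier to show in a few lines.

```python
from dataclasses import dataclass

SEQID_MODULUS = 2**32  # sequence IDs are 32-bit and wrap around
MISORDER_LIMIT = 3     # hypothetical threshold, chosen only for illustration


@dataclass
class Slot:
    """Hypothetical model of one server-side session slot."""
    cached_seqid: int = 0
    misordered_in_a_row: int = 0


def check_sequence_with_escalation(slot: Slot, seqid: int) -> str:
    """Like the normal slot check, but give up on the session after
    MISORDER_LIMIT consecutive misordered requests on the same slot."""
    expected = (slot.cached_seqid + 1) % SEQID_MODULUS
    if seqid == expected:
        slot.cached_seqid = seqid
        slot.misordered_in_a_row = 0
        return "NFS4_OK"
    if seqid == slot.cached_seqid:
        slot.misordered_in_a_row = 0
        return "REPLAY"  # would be answered from the reply cache
    slot.misordered_in_a_row += 1
    if slot.misordered_in_a_row >= MISORDER_LIMIT:
        # Escalate so the client stops retrying and sets up a new session.
        return "NFS4ERR_BADSESSION"
    return "NFS4ERR_SEQ_MISORDERED"


# Server cache jumped to 13 while the client still retries sequence number 4.
slot = Slot(cached_seqid=13)
replies = [check_sequence_with_escalation(slot, 4) for _ in range(4)]
```

Here the first two retries still get NFS4ERR_SEQ_MISORDERED and the later ones get NFS4ERR_BADSESSION. In my test, that reply is what let the client recover, presumably because it then creates a new session with fresh slots.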
