[bess] [Shepherding AD review] review of draft-ietf-bess-evpn-fast-df-recovery-08

"Gunter van de Velde (Nokia)" <gunter.van_de_velde@nokia.com> Thu, 30 May 2024 11:27 UTC

From: "Gunter van de Velde (Nokia)" <gunter.van_de_velde@nokia.com>
To: "draft-ietf-bess-evpn-fast-df-recovery@ietf.org" <draft-ietf-bess-evpn-fast-df-recovery@ietf.org>
Date: Thu, 30 May 2024 11:26:49 +0000
CC: 'BESS' <bess@ietf.org>
Subject: [bess] [Shepherding AD review] review of draft-ietf-bess-evpn-fast-df-recovery-08
List-Id: BGP-Enabled ServiceS working group discussion list <bess.ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/bess/X9MQFa7DutT8xbk92VUYAcUptY4>

# Gunter Van de Velde, RTG AD, comments for draft-ietf-bess-evpn-fast-df-recovery-08

Hi All,

Please find here a shepherding AD review of draft-ietf-bess-evpn-fast-df-recovery-08.

I'm sorry it took a bit of time to get started on this draft.

I've begun reviewing this document before we kick off the IETF Last Call process. Once we address these points, we can move forward with the document through the IESG chain.

A big thank you to Adrian Farrel for his RTG-DIR review of the -07 version, which helped improve the document to its -08 version, and to Matthew Bocci for the shepherd's write-up (4 July 2022).

In my review, I've noted some final observations while going through the document. For better readability, I've suggested some paragraph edits.

One thing I noticed is that there's not much RFC 2119-based normative language used. Maybe the authors can take another look and add or update the RFC 2119 text where needed.

You can find my review notes below.

#GENERIC COMMENTS
#================

88	   Virtualization Overlay (NVO) and DC inte)rconnect (DCI) services, and

Typo with the ")"

100	   multihomed Ethernet Segment.  This DF election is achieved
101	   independent of the number of EVPN Instances (EVIs) associated with
102	   that Ethernet Segment and it is performed via simple signaling
103	   between the recovered node and each of the other nodes in the
104	   multihomed group.

I believe that the word 'simple' is rather subjective. It may be better to replace it with a construct using 'straightforward'. Possible rewrite:

"This Designated Forwarder (DF) election is conducted independently of the number of EVPN Instances (EVIs) associated with the Ethernet Segment and is executed through straightforward signaling between the recovered node and each of the other nodes in the multihomed group.
"

105	   This document updates the state machine described in Section 2.1 of

It would be better to be more explicit about what is updated:

"This document updates the DF Election Finite State Machine (FSM) described in Section 2.1 of"

131	   In EVPN technology, multiple Provider Edge (PE) devices have the
132	   ability to encap and decap data belonging to the same VLAN.  In

Expand "encap" and "decap" for better readability.

131	   In EVPN technology, multiple Provider Edge (PE) devices have the
132	   ability to encap and decap data belonging to the same VLAN.  In
133	   certain situations, this may cause L2 duplicates and even loops if
134	   there is a momentary overlap of forwarding roles between two or more
135	   PE devices, leading to broadcast storms.

Possible readability rewrite:
"In EVPN technology, multiple Provider Edge (PE) devices possess the capability to encapsulate and decapsulate data associated with the same VLAN. Under certain conditions, this may result in Layer 2 duplicates and potential loops if there is a temporary overlap in forwarding roles among two or more PE devices, consequently leading to broadcast storms.
"

137	   EVPN [RFC7432] currently uses timer based synchronization among PE
138	   devices in a redundancy group that can result in duplications (and
139	   even loops) because of multiple DFs if the timer is too short or
140	   packets being dropped if the timer is too long.

[RFC7432] specifies the use of a timer rather than merely using one. Hence, more explicit text to document this property:

"EVPN [RFC7432] currently specifies timer-based synchronization among PE devices within a redundancy group. This approach can lead to duplications and potential loops due to multiple Designated Forwarders (DFs) if the timer interval is too short, or to packet drops if the timer interval is too long."

142	   Using split-horizon filtering (Section 8.3 of [RFC7432]) can prevent
143	   loops (but not duplicates).  However, if there are overlapping DFs in
144	   two different sites at the same time for the same VLAN, the site
145	   identifier will be different upon the packet re-entering the Ethernet
146	   Segment and hence the split-horizon check will fail, leading to L2
147	   loops.

The grammatical construct and the use of "()" are odd. Potential rewrite to correct this, assuming I have preserved the described issue correctly:

"Employing split-horizon filtering, as described in Section 8.3 of [RFC7432], can prevent loops but does not address duplicates. However, if there are overlapping Designated Forwarders (DFs) at two different sites simultaneously for the same VLAN, the site identifier will differ when the packet re-enters the Ethernet Segment. Consequently, the split-horizon check will fail, resulting in Layer 2 loops.
"

149	   The updated DF procedures in [RFC8584] use the well known Highest
150	   Random Weight (HRW) algorithm to avoid reshuffling of VLANs among PE
151	   devices in the redundancy group upon failure/recovery.  This reduces
152	   the impact to VLANs not assigned to the failed/recovered ports and
153	   eliminates loops or duplicates at failure/recovery events.

Is there a reference that can be used for the well known HRW algorithm? 
What about the following rewrite proposal for readability:

"The updated Designated Forwarder (DF) procedures outlined in [RFC8584] utilize the well-known Highest Random Weight (HRW) algorithm to prevent the reshuffling of VLANs among PE devices within the redundancy group during failure or recovery events. This approach minimizes the impact on VLANs not assigned to the failed or recovered ports and eliminates the occurrence of loops or duplicates during such events.
"

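For readers unfamiliar with HRW (also known as rendezvous hashing), here is a minimal sketch of the idea. It is not the exact procedure of [RFC8584]: the hash function (SHA-256), the PE names, and the VLAN range are assumptions chosen for illustration. Each VLAN goes to the PE with the highest weight, so removing a PE only moves the VLANs that PE owned.

```python
import hashlib

def weight(pe: str, vlan: int) -> int:
    """Pseudo-random weight for a (PE, VLAN) pair; SHA-256 is an illustrative choice."""
    digest = hashlib.sha256(f"{pe}/{vlan}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def elect_df(pes: list[str], vlan: int) -> str:
    """Return the PE with the highest weight for this VLAN."""
    return max(pes, key=lambda pe: weight(pe, vlan))

pes = ["PE1", "PE2", "PE3"]            # hypothetical redundancy group
vlans = range(1, 101)                  # hypothetical VLAN range
before = {v: elect_df(pes, v) for v in vlans}
after = {v: elect_df(["PE1", "PE3"], v) for v in vlans}   # PE2 fails

# Only VLANs that were assigned to PE2 change owner; all others stay put.
moved = [v for v in vlans if before[v] != after[v]]
assert all(before[v] == "PE2" for v in moved)
```
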
179	   a given VLAN is possible.  Duplication of DF roles may eventually
180	   lead to duplication of traffic as well as L2 loops.

In the previous text the word 'overlap' was used, whereas here the phrase 'Duplication of DF roles' is used.

195	   *  Complicated handshake signamling mechanisms and state machines are
196	      avoided in favor of a simple uni-directional signaling approach.

s/Complicated/Complex/
s/signamling/signaling/

198	   *  The solution is backwards-compatible (see Section 4), by PEs
199	      simply discarding the unrecognized new BGP Extended Community.

I think that "The solution" is a rather opaque description. Maybe we should explicitly mention that this concerns the fast DF recovery solution. I only note this here, at the first occurrence, but the more explicit text can be used in multiple locations within the draft.

What about:
"The fast DF recovery solution maintains backwards compatibility (see Section 4) by ensuring that PEs discard any unrecognized new BGP Extended Community."

201	   *  Existing DF Election algorithms are supported.

s/are/remain/

232	   Upon receipt of that new BGP Extended Community, partner PEs can
233	   determine the service carving time of the newly insterted PE.  The
234	   notion of skew is introduced to eliminate any potential duplicate
235	   traffic or loops.  The receiving partner PEs add a skew (default =
236	   -10ms) to the Service Carving Time to enforce this.  The previously
237	   inserted PE(s) must carve first, followed shortly (skew) by the newly
238	   insterted PE.

As a non-native English speaker, I was thrown off guard by the word 'skew'.
Maybe a small explanation would be helpful. What about the following:

"Upon receipt of the new BGP Extended Community, partner PEs can determine the service carving time of the newly inserted PE. To eliminate any potential for duplicate traffic or loops, the concept of skew is introduced: a small time delay added to the service carving process to ensure a controlled and orderly transition when multiple Provider Edge (PE) devices are involved. The receiving partner PEs add a skew (default = -10ms) to the service carving time to enforce this mechanism. This ensures that the previously inserted PEs complete their carving process first, followed shortly thereafter (by the specified skew) by the newly inserted PE.
"

240	   To summarize, all peering PEs carve almost simultaneously at the time
241	   announced by the newly added/recovered PE.  The newly inserted PE
242	   initiates the SCT, and carves immediately on its local timer expiry.
243	   The previously inserted PE(s) receiving Ethernet Segment route (RT-4)
244	   with a SCT BGP extended community, carve shortly before Service
245	   Carving Time.

This text confuses me somewhat. The term "to carve" generally means to cut or shape something from a larger piece, often with precision and care. Hence I was a bit surprised to see it used here.

May I assume that, in the context of these network operations and specifically within EVPN (Ethernet VPN) and MPLS (Multiprotocol Label Switching) environments, "to carve" refers to the process of determining and establishing roles or responsibilities for forwarding traffic among Provider Edge (PE) devices?

If so, maybe such a definition should be stated explicitly somewhere in the draft.

266	   [RFC5905].  As the current NTP era value is not exchanged, a local
267	   clock which is "synchronized" but to the wrong era is outside of the
268	   scope of this document.

What is the era value?
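
For context, [RFC5905] timestamps count seconds from 1 January 1900 in a 32-bit field, which wraps every 2^32 seconds; the era number counts those wraps. A minimal sketch of the relationship follows. The function name `ntp_era_and_offset` is illustrative and not taken from the draft.

```python
NTP_UNIX_OFFSET = 2_208_988_800   # seconds from 1900-01-01 to 1970-01-01 (UTC)

def ntp_era_and_offset(unix_seconds: int) -> tuple[int, int]:
    """Split a Unix time into (NTP era, 32-bit seconds within that era)."""
    return divmod(unix_seconds + NTP_UNIX_OFFSET, 2**32)

# Two instants exactly one era apart carry the same 32-bit seconds value,
# so a receiver cannot tell them apart unless it already knows the era.
era0, secs0 = ntp_era_and_offset(1_717_068_409)             # 30 May 2024
era1, secs1 = ntp_era_and_offset(1_717_068_409 + 2**32)
assert (era0, era1) == (0, 1) and secs0 == secs1
```

Because only the 32-bit seconds value is carried, the era has to be known out of band, which is why the draft puts a wrong-era clock out of scope.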

257	                        1                   2                   3
258	    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
259	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
260	   | Type = 0x06   | Sub-Type(0x0F)|      Timestamp Seconds        ~
261	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
262	   ~  Timestamp Seconds            | Timestamp Fractional Seconds  |
263	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

A figure number/caption is missing.

269	   The 64-bit timestamp of NTP consists of a 32-bit part for seconds and
270	   a 32-bit part for fractional second:

According to [RFC5905], there appear to be 32-bit, 64-bit, and 128-bit timestamp formats:
https://datatracker.ietf.org/doc/html/rfc5905#section-6
Should the description not align with all of these?

274	   *  Timestamp Fractional Seconds: the high order 16 bits of the NTP
275	      fractional seconds are encoded in this field.  The use of a 16-bit
276	      fractional seconds yields adequate precision of 15 microseconds
277	      (2^-16 s).

I assume that the low-order 16 bits are taken to be '0'? Maybe that should be explicitly called out.
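
To make the layout concrete, here is a minimal sketch of packing and unpacking the 8-byte extended community as drawn above (Type 0x06, Sub-Type 0x0F, 32-bit seconds, 16-bit fractional seconds). The function names are illustrative, and treating the dropped low-order 16 fraction bits as zero on receipt is an assumption, since the draft does not say so.

```python
import struct

def encode_sct(ntp_seconds: int, ntp_fraction32: int) -> bytes:
    """Pack Type 0x06, Sub-Type 0x0F, 32-bit seconds, high-order 16 fraction bits."""
    return struct.pack("!BBIH", 0x06, 0x0F,
                       ntp_seconds & 0xFFFFFFFF,
                       (ntp_fraction32 >> 16) & 0xFFFF)

def decode_sct(data: bytes) -> tuple[int, int]:
    """Return (seconds, 32-bit fraction); low-order 16 bits read as zero (assumption)."""
    ext_type, sub_type, seconds, frac16 = struct.unpack("!BBIH", data)
    if (ext_type, sub_type) != (0x06, 0x0F):
        raise ValueError("not a Service Carving Time extended community")
    return seconds, frac16 << 16

wire = encode_sct(0x12345678, 0xABCDEF01)
assert wire.hex() == "060f12345678abcd"
assert decode_sct(wire) == (0x12345678, 0xABCD0000)
```

The round trip loses the low 16 fraction bits, which is the 2^-16 s (about 15 microseconds) resolution the draft mentions.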

296	   This capability is used in conjunction with the agreed upon DF Type
297	   (DF Election Type).  For example if all the PEs in the Ethernet
298	   Segment indicate having Time Synchronization capability and are
299	   requesting the DF type to be HRW, then the HRW algorithm is used in
300	   conjunction with this capability.

Readability rewrite:
"This capability is utilized in conjunction with the agreed-upon Designated Forwarder (DF) Type (DF Election Type). For instance, if all the PE devices in the Ethernet Segment indicate possessing Time Synchronization capability and request the DF Type to be Highest Random Weight (HRW), then the HRW algorithm is employed in conjunction with this capability.
"

Note: what happens if one of the involved PEs does not support the Time Synchronization capability?

309	   The peering PE's FSM in DF_DONE which receives a RECV_ES transitions
310	   to DF_CALC.  Because of the SCT carried in the Ethernet-Segment
311	   update, the output of the DF_CALC and transition back into DF_DONE
312	   are delayed, as are accompanying forwarding updates to DF/NDF state.

This process is not so easy to follow. I assume that all of these are states of the FSM?
Would the following be a correct rewrite for readability?

"Upon receiving a RECV_ES message, the peering PE's Finite State Machine (FSM) transitions from the DF_DONE (indicating the DF election process was complete) state to the DF_CALC (indicating that a new DF calculation is needed) state. Due to the Service Carving Time (SCT) included in the Ethernet-Segment update, the completion of the DF_CALC state and the subsequent transition back to the DF_DONE state are delayed. This delay ensures proper synchronization and prevents conflicts. Consequently, the accompanying forwarding updates to the Designated Forwarder (DF) and Non-Designated Forwarder (NDF) states are also deferred.
"

314	   The corresponding actions when transitions are performed or states
315	   are entered/exited is modified as follows:
316
317	   9.  DF_CALC on CALCULATED: Mark the election result for the VLAN or
318	       Bundle.
319
320	       9.1  Where SCT timestamp is present on the RECV_ES event of
321	            Action 11, wait until the time indicated by the SCT before
322	            continuing to 9.2.
323
324	       9.2  Assume a DF/NDF for the local PE for the VLAN or VLAN
325	            Bundle, and transition to DF_DONE.

What about the following procedure text for clarity:

"
The corresponding actions when transitions are performed or states are entered/exited are modified as follows:

9. DF_CALC on CALCULATED: Mark the election result for the VLAN or VLAN Bundle.

9.1. If an SCT timestamp is present during the RECV_ES event of Action 11, wait until the time indicated by the SCT before proceeding to step 9.2.

9.2. Assume the role of DF or NDF for the local PE concerning the VLAN or VLAN Bundle, and transition to the DF_DONE state.

This revised approach ensures proper timing and synchronization in the DF election process, avoiding conflicts and ensuring accurate forwarding updates.
"

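The modified Action 9 can be sketched as below. This is a simplified illustration, not the FSM of [RFC8584]: the function name and return shape are hypothetical, and treating an SCT that has already passed as a zero wait is an assumption the draft does not state.

```python
from typing import Optional

def action_9(now: float, sct: Optional[float], elected_df: bool) -> tuple[float, str, str]:
    """Return (wait_seconds, role, next_state) for DF_CALC on CALCULATED."""
    # 9.1: wait until the SCT only when one was carried on the RECV_ES event.
    wait = max(0.0, sct - now) if sct is not None else 0.0
    # 9.2: assume DF or NDF for the local PE, then transition to DF_DONE.
    role = "DF" if elected_df else "NDF"
    return wait, role, "DF_DONE"

assert action_9(100.0, 103.0, True) == (3.0, "DF", "DF_DONE")
assert action_9(100.0, None, False) == (0.0, "NDF", "DF_DONE")
```
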
329	   Let's take Figure 1 as an example where initially PE2 had failed and
330	   PE1 had taken over.  This example shows the problem with the
331	   DF-Election mechanism in Section 8.5 of [RFC7432], using the value of
332	   the timer configured for all PEs on the Ethernet Segment.

To bring the text closer to Proposed Standard style, what about this text for readability:

"Consider Figure 1 as an example, where initially PE2 has failed and PE1 has taken over. This scenario illustrates the problem with the DF-Election mechanism described in Section 8.5 of [RFC7432], specifically in the context of the timer value configured for all PEs on the Ethernet Segment.
"

334	   Based on Section 8.5 of [RFC7432] and using the default 3 second
335	   timer in step 2:
337	   1.  Initial state: PE1 is in steady-state, PE2 is recovering
339	   2.  PE2 recovers at (absolute) time t=99
341	   3.  PE2 advertises RT-4 (sent at t=100) to partner PE1
343	   4.  PE2 starts a 3 second timer to allow the reception of RT-4 from
344	       other PE nodes
346	   5.  PE1 carves immediately on RT-4 reception, i.e. t=100 + minimal
347	       BGP propagation delay
349	   6.  PE2 carves at time t=103
350
351	   [RFC7432] aims of favouring traffic being dropped over duplicate
352	   traffic.  With the above procedure, traffic drops will occur as part
353	   of each PE recovery sequence since PE1 has transitioned some VLANs to
354	   Non-Designated-Forwarder (NDF) immediately upon reception.
355	   The timer value (default = 3 seconds) has a direct effect on the
356	   duration of the packets being dropped.  A shorter (especially zero)
357	   timer may, however, result in duplicate traffic or traffic loops.

What about:

"Procedure Based on Section 8.5 of [RFC7432] with Default 3-Second Timer:
1. Initial State: PE1 is in a steady state, and PE2 is recovering.
2. Recovery: PE2 recovers at an absolute time of t=99.
3. Advertisement: PE2 advertises RT-4, sent at t=100, to partner PE1.
4. Timer Start: PE2 starts a 3-second timer to allow the reception of RT-4 from other PE nodes.
5. Immediate Carving: PE1 carves immediately upon RT-4 reception, i.e., t=100 plus minimal BGP propagation delay.
6. Delayed Carving: PE2 carves at time t=103.

[RFC7432] favors traffic drops over duplicate traffic. With the above procedure, traffic drops will occur as part of each PE recovery sequence since PE1 transitions some VLANs to Non-Designated Forwarder (NDF) immediately upon RT-4 reception. The timer value (default = 3 seconds) directly affects the duration of the packet drops. A shorter (or zero) timer may result in duplicate traffic or traffic loops.
"

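The arithmetic behind the drop window in this example can be sketched as follows. Times are in milliseconds, and the 50 ms BGP propagation delay is an assumed value; the draft only says "minimal".

```python
def rfc7432_drop_window_ms(rt4_sent_ms: int, bgp_delay_ms: int, timer_ms: int) -> int:
    """Time during which neither PE forwards the VLANs moving from PE1 to PE2."""
    pe1_carves = rt4_sent_ms + bgp_delay_ms   # PE1 carves on RT-4 reception
    pe2_carves = rt4_sent_ms + timer_ms       # PE2 carves when its timer expires
    return pe2_carves - pe1_carves

# RT-4 sent at t=100 s, assumed 50 ms propagation, default 3 s timer.
window = rfc7432_drop_window_ms(100_000, 50, 3_000)
assert window == 2_950
```

With these numbers the affected VLANs are unserved for almost the entire timer duration.
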
359	   Based on the Service Carving Time (SCT) approach:
361	   1.  Initial state: PE1 is in steady-state, PE2 is recovering
363	   2.  PE2 recovers at (absolute) time t=99
365	   3.  PE2 advertises RT-4 (sent at t=100) with target SCT value t=103
366	       to partner PE1
368	   4.  PE2 starts a 3 second timer to allow the reception of RT-4 from
369	       other PE nodes
371	   5.  PE1 starts service carving timer, with remaining time until t=103
373	   6.  Both PE1 and PE2 carve at (absolute) time t=103
374	   In fact, PE1 should carve slightly before PE2 (skew) to maintain the
375	   preference of minimal loss over duplicate traffic.  The previously
376	   inserted PE2 that is recovering performs both transitions DF to NDF
377	   and NDF to DF per VLANs at the timer's expiry.  Since the goal is to
378	   prevent duplicates, the original PE1, which received the SCT will
379	   apply:
381	   *  DF to NDF transition at t=SCT minus skew, where both PEs are NDF
382	      for 'skew' amount of time
384	   *  NDF to DF transition at t=SCT
385
386	   It is this split-behaviour which ensures a good transition of DF role
387	   with contained amount of loss.
388
389	   Using SCT approach, the negative effect of the timer to allow the
390	   reception of RT-4 from other PE nodes is mitigated.  Furthermore, the
391	   BGP Ethernet Segment route (RT-4) transmission delay (from PE2 to
392	   PE1) becomes a non-issue.  The use of SCT approach remedies the
393	   problem associated with this timer: the 3 second timer window is
394	   shortened to the order of milliseconds.

What about the following textblobs for readability:

"Procedure Based on the Service Carving Time (SCT) Approach:
1. Initial State: PE1 is in a steady state, and PE2 is recovering.
2. Recovery: PE2 recovers at an absolute time of t=99.
3. Advertisement: PE2 advertises RT-4, sent at t=100, with a target SCT value of t=103 to partner PE1.
4. Timer Start: PE2 starts a 3-second timer to allow the reception of RT-4 from other PE nodes.
5. Service Carving Timer: PE1 starts the service carving timer, with the remaining time until t=103.
6. Simultaneous Carving: Both PE1 and PE2 carve at an absolute time of t=103.

To maintain the preference for minimal loss over duplicate traffic, PE1 should carve slightly before PE2 (with skew). The recovering PE2 performs both DF to NDF and NDF to DF transitions per VLAN at the timer's expiry. The original PE1, which received the SCT, applies the following:

* DF to NDF Transition: At t=SCT minus skew, where both PEs are NDF for the skew duration.
* NDF to DF Transition: At t=SCT.

This split-behavior ensures a smooth DF role transition with minimal loss.

Using the SCT approach, the negative effect of the timer to allow the reception of RT-4 from other PE nodes is mitigated. Furthermore, the BGP Ethernet Segment route (RT-4) transmission delay (from PE2 to PE1) becomes a non-issue. The SCT approach shortens the 3-second timer window to the order of milliseconds, addressing the associated problems.
"

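The split behaviour described above can be sketched as a small schedule. Times are in milliseconds; `skew_ms` is taken as the magnitude of the default -10 ms skew, and the dictionary layout is purely illustrative.

```python
def sct_schedule_ms(sct_ms: int, skew_ms: int = 10) -> dict[str, dict[str, int]]:
    """Transition times per PE; skew_ms is the magnitude of the default -10 ms skew."""
    return {
        # PE1 received the SCT: it gives up DF roles one skew early.
        "PE1": {"DF_to_NDF": sct_ms - skew_ms, "NDF_to_DF": sct_ms},
        # PE2 is the recovering PE: it performs both transitions at the SCT.
        "PE2": {"DF_to_NDF": sct_ms, "NDF_to_DF": sct_ms},
    }

plan = sct_schedule_ms(103_000)
# VLANs moving PE1 -> PE2 have no DF between PE1 giving up and PE2 taking over.
gap_ms = plan["PE2"]["NDF_to_DF"] - plan["PE1"]["DF_to_NDF"]
assert gap_ms == 10
```

Compared with the timer-only procedure, the unserved interval shrinks from roughly the timer value to the skew.
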
396	3.1.  Concurrent Recoveries

This section seems to be missing RFC 2119 language on how nodes need to behave with respect to the procedures outlined in this document.

402	   Election.  A similar situation arises in staggered recovering PEs,
403	   when a second PE recovers at rougly a first PE's advertised SCT
404	   expiry, and with its own new SCT-2 outside of the initial SCT window.

The word staggered is oddly used. What about the following:

"A similar situation arises in sequentially recovering PEs, when a second PE recovers approximately at the time of the first PE's advertised SCT expiry, and with its own new SCT-2 outside of the initial SCT window."

406	   In the case of multiple outstanding DF elections, one requested by
407	   each of the recovering PEs, the SCTs must simply be time-ordered and
408	   all PEs execute only a single DF Election at the service carving time
409	   corresponding to the largest received timestamp value.  The DF
410	   Election will involve all the active PEs in a single DF Election
411	   update.

In a similarly edited writing style:
"In the case of multiple concurrent DF elections, each initiated by one of the recovering PEs, the SCTs must be ordered chronologically. All PEs shall execute only a single DF Election at the service carving time corresponding to the latest received timestamp value. This DF Election will involve all active PEs in a unified DF Election update.
"

However, it may require some formal RFC 2119 language to make sure that implementations behave according to this procedure.

413	   Example:
415	   1.  Initial state: PE1 is in steady-state, all services elected at
416	       PE1.
418	   2.  PE2 recovers at time t=100, advertises RT-4 with target SCT value
419	       t=103 to partners (PE1)
421	   3.  PE2 starts a 3 second timer to allow the reception of RT-4 from
422	       other PE nodes
424	   4.  PE1 starts service carving timer, with remaining time until t=103
426	   5.  PE3 recovers at time t=102, advertises RT-4 with target SCT value
427	       t=105 to partners (PE1, PE2)
429	   6.  PE3 starts a 3 second timer to allow the reception of RT-4 from
430	       other PE nodes
432	   7.  PE2 cancels the running timer, starts service carving timer with
433	       remaining time until t=105
435	   8.  PE1 updates service carving timer, with remaining time until
436	       t=105
438	   9.  PE1, PE2 and PE3 carve at (absolute) time t=105

Possible rewrite of the example:
1. Initial State: PE1 is in a steady state, with all services elected at PE1.
2. Recovery of PE2: PE2 recovers at time t=100 and advertises RT-4 with a target SCT value of t=103 to its partners (PE1).
3. Timer Initiation by PE2: PE2 starts a 3-second timer to allow the reception of RT-4 from other PE nodes.
4. Timer Initiation by PE1: PE1 starts the service carving timer, with the remaining time until t=103.
5. Recovery of PE3: PE3 recovers at time t=102 and advertises RT-4 with a target SCT value of t=105 to its partners (PE1, PE2).
6. Timer Initiation by PE3: PE3 starts a 3-second timer to allow the reception of RT-4 from other PE nodes.
7. Timer Update by PE2: PE2 cancels the running timer and starts the service carving timer with the remaining time until t=105.
8. Timer Update by PE1: PE1 updates its service carving timer, with the remaining time until t=105.
9. Service Carving: PE1, PE2, and PE3 perform service carving at the absolute time of t=105.
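
The rule that all PEs run a single DF election at the largest received SCT can be sketched as follows, replaying the two recovery events from the example. The function name and the event list are illustrative only.

```python
def single_carve_time(received_scts: list[int]) -> int:
    """All PEs run one DF election at the largest advertised SCT."""
    return max(received_scts)

scts: list[int] = []
for event_time, sct in [(100, 103), (102, 105)]:   # PE2 recovers, then PE3
    scts.append(sct)
    target = single_carve_time(scts)
    # Each new SCT re-arms the service carving timer toward the latest target.
    print(f"t={event_time}: carve at t={target}, {target - event_time}s remaining")

assert single_carve_time(scts) == 105
```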

446	4.  Backwards Compatibility
447
448	   Per redundancy group, for the DF election procedures to be globally
449	   convergent and unanimous, it is necessary that all the participating
450	   PEs agree on the DF Election algorithm to be used.  It is, however,
451	   possible that some PEs continue to use the existing modulo-based DF
452	   election and do not rely on the new SCT BGP extended community.  PEs
453	   running a baseline DF election mechanism will simply discard the new
454	   SCT BGP extended community as unrecognized.
455
456	   A PE can indicate its willingness to support clock-synched carving by
457	   signaling the new 'T' DF Election Capability as well as including the
458	   new Service Carving Time BGP extended community along with the
459	   Ethernet Segment Route (Type-4).  In the case where one or more PEs
460	   attached to the Ethernet Segment do not signal T=1, all PEs in the
461	   Ethernet Segment SHALL revert back to the [RFC7432] timer approach.
462	   This is especially important in the context of the VLAN shuffling
463	   with more than 2 PEs.

I am not sure what the modulo-based DF election is. Is that the [RFC7432] procedure? I believe this is the first time it is mentioned in this draft.

What about the following rewrite proposal for readability? Please add a reference for the modulo-based election:

"For the DF election procedures to achieve global convergence and unanimity within a redundancy group, it is essential that all participating PEs agree on the DF election algorithm to be employed. However, it is possible that some PEs may continue to use the existing modulo-based DF election algorithm and not utilize the new Service Carving Time (SCT) BGP extended community. PEs that operate using the baseline DF election mechanism will simply discard the new SCT BGP extended community as unrecognized.

A PE can indicate its willingness to support clock-synchronized carving by signaling the new 'T' DF Election Capability and including the new SCT BGP extended community along with the Ethernet Segment Route (Type-4). If one or more PEs attached to the Ethernet Segment do not signal T=1, then all PEs in the Ethernet Segment SHALL revert to the timer-based approach as specified in [RFC7432]. This reversion is particularly crucial in preventing VLAN shuffling when more than two PEs are involved."
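
The fallback rule in the second paragraph reduces to an all-or-nothing check, sketched below. The function name, the mode strings, and the dictionary of per-PE 'T' bits are hypothetical.

```python
def carving_mode(t_bits: dict[str, bool]) -> str:
    """Return 'SCT' only if every PE on the Ethernet Segment signals T=1."""
    return "SCT" if all(t_bits.values()) else "RFC7432-timer"

assert carving_mode({"PE1": True, "PE2": True, "PE3": True}) == "SCT"
# A single PE without T=1 makes the whole Ethernet Segment revert.
assert carving_mode({"PE1": True, "PE2": False, "PE3": True}) == "RFC7432-timer"
```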

465	5.  Security Considerations

It is not entirely clear or spelled out what an implementation should do when the SCT is far in the future. Maybe make this more explicit in the text as a normative requirement using RFC 2119 language.