[bess] Iotdir telechat review of draft-ietf-bess-evpn-fast-df-recovery-09

Toerless Eckert via Datatracker <noreply@ietf.org> Wed, 14 August 2024 16:26 UTC

CC: bess@ietf.org, draft-ietf-bess-evpn-fast-df-recovery.all@ietf.org, last-call@ietf.org, evyncke@cisco.com
Reply-To: Toerless Eckert <tte@cs.fau.de>
Subject: [bess] Iotdir telechat review of draft-ietf-bess-evpn-fast-df-recovery-09
List-Id: BGP-Enabled ServiceS working group discussion list <bess.ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/bess/UjIN-99OGwYQFVl-s7PFrYAKwI4>

Reviewer: Toerless Eckert
Review result: On the Right Track

Summary:

The purpose of the document is to extend the BGP message signaling and local
router procedures for failover of "Designated Forwarders" for pseudowires using
calculated future timestamps and expecting clock synchronization across the
forwarders, so that after receipt of the BGP message, the switchover can be
handled autonomously by every node as synchronously as desired and allowed for
by the clock synchronization method used.

I am the assigned IOTDIR reviewer.  I found the document well written and easy
to read, except for some typos, other nits and some gaps in the logical
description.

(unfortunately ?) I find the approach of the draft very useful, and i always
wished we would have been able to build this in other IETF protocol domains (IP
multicast), so i happen to have a range of technical concerns and suggestions
primarily around the completeness of the document's methods and detailed
specifications, which i hope will be helpful to improve the quality of the
text and the usefulness of the solution.

The following is a list of G.i general comments followed by the commented
idnits version of the draft.

Thank you very much for the work!
    Toerless Eckert

General comments:

G.1 minor: Why IOTdir review ?

I am a bit puzzled why this draft was given to IOTdir for early review. Neither
the draft nor the RFCs it references mention IoT. And the mentioned pseudowire
use-cases are all around data centers. So i wonder what specific IoT feedback
the authors/WG are looking for. If there actually is a specific type of
use-case for IoT with this technology, then it would be great to mention it.

G.2 minor/suggestion: HRW has known problems

HRW was popularized and (in)validated in deployments of PIM-SM since 1995 and
hence RFC2362, way before HRW1998 was written, but of course not credited in
RFC8584. I would nevertheless like to point out that the IP Multicast community
in the IETF had some run-ins with operators over the decades who were
disappointed by its unequal distribution in the face of specific, typical sets
of parameters such as consecutive or close to each other router-IDs. Of course,
the parameters used in EVPN are different, and i have not tried to validate if
or how such deployment specific anomalies would or could equally apply to the
EVPN version, but i would strongly suggest to be aware that HRW is by far not a
well randomizing algorithm, especially with respect to the order of the input
parameters. HRW is now probably 30 years old, and EVPN may want to look into
newer, and supposedly better algorithms such as MurmurHash (which was a
recommendation from a math geek colleague even 15 years ago - and other
proposals in the IETF are picking up on it too).

G.3 minor/question: Please consider adding ordered shutdown support

If my understanding of RFC7432/RFC8584 and this draft is correct, the
interruption in case of an ordered shutdown of a DF is as large as that of an
unexpected shutdown/service interruption (without the detection of interruption
of course). I think this is not necessary.

I think it would be great if this draft could add support for the synchronized
switchover in case of ordered shutdown of a DF, because such procedures likely
account for a large share of the outages in daily operations of larger
networks.

For example, the new extended community could have a flag indicating such an
ordered shutdown so that the indicated SCT will trigger synchronized failover
to the BDF (Backup DF). And only after the failover has happened would the
primary DF withdraw the route and finish the shutdown operation.

G.4 major: analysis of actual failover behavior

The mechanism of this draft seems to aspire, through synchronized switchover, to
achieve a switchover interruption in the order of 10 msec (the skew default
value). I am worried that in the face of a large number of failovers (because
of a large number of VLAN/ES services), the interruption becomes larger
and that it will be inconsistent across different services.

The way i imagine the failover to operate (from similar failovers in other
technologies like multicast), a router may fairly quickly be able to generate
the SCT carrying routes, so there can be a burst of SCT routes all with the
same SCT. When those SCTs then actually expire both on the sending and receiving
router, the speed at which they are added/deleted in hardware-forwarding will
depend on the performance of updating hardware forwarding registers, which may
be inconsistent across different routers. It is also not clear to me if the BGP
infrastructure or other factors can or cannot introduce any reordering. But if
for example we have a thousand routes that need to be updated, and one router
can update 1000 routes/sec and the other can update 2000 routes/sec, then the
first will be done after one second, the other after half a second - no
reordering assumed.

So it would be very helpful to have some idea about the maximum imaginable
scalability required and likely min/max performances to vet the impact of this
candidate issue.

There is of course a way to overcome this issue, which is to generate SCTs that
take the performance of (de)installation of hardware forwarding entries into
account, for example by assuming some floor performance and generating SCTs for
such a burst of service routes with timestamps increasing such that when they
are executed, the update rate stays under that performance floor. That is: have
a difference of e.g.: 4 msec between each route, resulting in no more than
250 SCT updates/second.
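
The staggering idea above can be sketched in a few lines of Python. This is
only an illustration under stated assumptions, not something the draft
specifies: the function name staggered_scts and the parameter
floor_updates_per_sec are made up for this example, and times are plain
seconds held as exact fractions.

```python
from fractions import Fraction

def staggered_scts(base_sct, route_count, floor_updates_per_sec):
    """Return one SCT (in seconds) per route, spaced 1/floor apart.

    Hypothetical helper: base_sct is the SCT of the first route, and
    floor_updates_per_sec is the assumed worst-case hardware update rate.
    """
    spacing = Fraction(1, floor_updates_per_sec)
    return [base_sct + i * spacing for i in range(route_count)]

# 1000 routes against an assumed floor of 250 updates/second.
scts = staggered_scts(Fraction(0), 1000, 250)
spacing_ms = (scts[1] - scts[0]) * 1000   # spacing between two routes, in msec
total_span_s = scts[-1] - scts[0]         # time from first to last switchover
```

With these numbers the spacing is 4 msec, and the last of the 1000 routes
switches over just under 4 seconds after the first, so the configured interval
before the first SCT would have to leave room for that span.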

In any case, it would be great if the great target goal of this draft - less
than 10 msec interruption - would not be invalidated by such real-world
performance impacts, if it actually is easy to overcome them with a bit of
additional text in the draft.

G.5 major: Behavior upon non-synchronization.

I think the draft should do more due-diligence in its text for various
conditions of incorrect time synchronization between devices. Let us first
agree on the conditions and general direction, and then i am happy to propose
text if it makes sense to the WG.

a) A router can and then should validate the state of synchronization of its
clock (in NTP for example this is typically possible via some management API,
not sure if there is already a YANG model). When restarting, the router may
find that its clock is not yet synchronized to the necessary degree of
accuracy. Minimum required synchronization accuracy should be configurable,
default maybe 3 msec. In this case the router would wait until the
synchronization is sufficient, up to a maximum time period (configurable,
default maybe 30 seconds). If synchronization is still not sufficient then, the
router falls back to behaving like a router that does not support this draft -
and upgrades later on if and when synchronization is successful.
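
The decision logic of suggestion a) could look roughly like the following
Python sketch. Everything named here is hypothetical: df_election_mode, the
mode strings, and the idea that the clock offset is available as a number of
milliseconds from some local management API are assumptions for illustration.
The defaults of 3 msec and 30 seconds are the ones suggested above.

```python
def df_election_mode(offset_ms, waited_s, max_offset_ms=3.0, max_wait_s=30.0):
    """Decide how a restarting router should take part in DF election.

    offset_ms: estimated clock offset reported by the local clock
               synchronization client (assumed to be available).
    waited_s:  seconds spent waiting for synchronization since restart.
    """
    if abs(offset_ms) <= max_offset_ms:
        return "sct"      # synchronized well enough: use SCT procedures
    if waited_s < max_wait_s:
        return "wait"     # keep waiting for the clock to converge
    return "legacy"       # give up for now: behave like a non-SCT router

# A router that is 40 msec off after 5 seconds keeps waiting; after 30
# seconds it falls back, and it upgrades once the offset is small enough.
modes = [df_election_mode(40.0, 5.0),
         df_election_mode(40.0, 30.0),
         df_election_mode(1.5, 45.0)]
```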

b) A router which is aware that it is correctly synchronized receives an
SCT update from another router which did not correctly recognize its own
synchronization failure (e.g.: it does not have the API to validate that its
local clock is synchronized).  This condition might warrant a flag bit in the
route updates, if feasible.

To discover and work around this condition, routers will perform a plausibility
check on received SCT timestamps, e.g.: validate that the received timestamp is
within a reasonable window around the local (synchronized) clock at the time of
reception of the SCT carrying route: at least one second from current clock, at
most the configured interval (default 3 seconds), plus extensions, such as some
seconds if concern G.4 is taken into account. If the received SCT is out of
bounds, then the receiving router would raise some error condition and perform
some fallback failover, e.g.: within 3 seconds from reception (to avoid that
failover would happen at an inappropriately late time in the future, or
immediately when the SCT is in the past).
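
A minimal Python sketch of such a plausibility check follows. The function
name effective_sct and all parameter names are invented for this example;
the window of "at least 1 second, at most 3 seconds plus extension" and the
3 second fallback are the example values from the text above, not values
taken from the draft. Times are seconds on the receiver's synchronized clock.

```python
def effective_sct(received_sct, now, min_lead=1.0, max_lead=3.0,
                  extension=0.0, fallback_delay=3.0):
    """Return (sct_to_use, plausible) for a received SCT.

    received_sct and now are absolute times in seconds on the local clock.
    If the SCT is not between now + min_lead and now + max_lead + extension,
    it is treated as implausible and replaced by now + fallback_delay.
    """
    lead = received_sct - now
    if min_lead <= lead <= max_lead + extension:
        return received_sct, True
    # Out of bounds: the caller would also raise an error condition here.
    return now + fallback_delay, False

in_window = effective_sct(102.0, 100.0)   # 2 seconds ahead: accepted
in_past = effective_sct(99.0, 100.0)      # already expired: fallback
far_future = effective_sct(500.0, 100.0)  # far too late: fallback
```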

G.6 minor: some suggested NTP operational text

The following is proposed text for an operational considerations section on NTP
clock synchronization. From G.5 it includes only suggestion a), but it also
covers other aspects crucial for successful deployment.

----

While the use of a synchronized clock between the participating routers makes
the solution itself very simple and accurate, it does introduce a new,
potentially large and complex dependency on the clock synchronization
mechanism used. Because of the use of NTP timestamps, it is not possible to
build really lightweight and autonomously operating clock synchronization
systems. Instead, one will likely need to create an operational dependency
on a clock source that automatically handles complexities, specifically
leap seconds, which includes satellite clock sources (Beidou, Galileo,
GLONASS or GPS), or terrestrial ones (DCF77, WWVB, MSF or JJY). If this
dependency is operationally already established for other purposes, then the
mechanism of this document does not add incremental requirements, except maybe
for the required accuracy. Otherwise the requirements to operate the clock
synchronization need to be analyzed.

For the mechanism of this document to provide the desired benefit,
synchronization to within a few milliseconds (5) or less is required, so that
the skew is sufficient to separate the break DF times from the make DF times.
This should in general not be a problem to achieve with minimal NTPv4
installations that are aware of common pitfalls as follows.

When a router restarts, initial synchronization to other NTP server(s) is sped
up if the router has a local battery backed RTC clock from which it can derive
a starting time, as well as the capability to step the clock to quickly
synchronize to the other NTP server(s).

If either is not possible, synchronization may take more than a few seconds
after reboot and it may be desirable to delay bringing up DF functionality
until the desired accuracy of clock synchronization is achieved.

Synchronization across WAN links can be subject to asymmetric latency, which
can be as high as some msec, such as for pseudowires across transcontinental
connectivity between backup DCs. Clock synchronization protocols cannot
automatically figure out such asymmetric propagation latencies. If deployments
with such asymmetric latencies are required, the clock synchronization protocol
needs to have options to learn about such asymmetries, such as through
configuration.

G.7 minor: make before break instead of break before make

I think that it would make sense to define skew as configurable and explicitly
point to the option of making it positive so as to achieve "make before break"
functionality, E.g.: making the recovering router become DF slightly before the
withdrawing router.

I can think of several types of customer services that can better deal with
duplicates than with even short term losses. And unless i am overlooking some
looping issues in the broadcast domains (which i likely may), the only reason
to do break before make is IMHO services where the simultaneous sending will
result in overload. But whenever a service has a low rate of actual user
traffic, most applications will prefer a few duplicates over a few lost packets.
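
To illustrate the effect of the sign of the skew, here is a small Python
sketch. It assumes the behavior quoted from the draft further below: the
recovering PE carves at the SCT, and the previously inserted PEs carve at
SCT plus skew. The function name handover_window and the use of integer
milliseconds are choices made only for this example.

```python
def handover_window(sct_ms, skew_ms=-10):
    """Return (old_df_stop_ms, new_df_start_ms, overlap_ms).

    overlap_ms > 0 means both PEs forward for that long (make before break,
    possible duplicates); overlap_ms < 0 means nobody forwards for that long
    (break before make, possible loss).
    """
    old_df_stop = sct_ms + skew_ms   # previously inserted PE carves here
    new_df_start = sct_ms            # recovering PE carves at the SCT
    return old_df_stop, new_df_start, old_df_stop - new_df_start

break_before_make = handover_window(1000)        # default skew of -10 msec
make_before_break = handover_window(1000, 5)     # hypothetical positive skew
```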

--

The following is idnits output, used to have line numbers. Issues/discussions
from the review have no line numbers.
------

draft-ietf-bess-evpn-fast-df-recovery-09.txt:

  Showing Errors (**), Flaws (~~), Warnings (==), and Comments (--).
  Errors MUST be fixed before draft submission.  Flaws SHOULD be fixed before
  draft submission.

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info)
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Running in submission checking mode -- *not* checking nits according to
  https://www.ietf.org/id-info/checklist .
  ----------------------------------------------------------------------------

     No nits found.
--------------------------------------------------------------------------------

2       BESS Working Group                                     P. Brissette, Ed.
3       Internet-Draft                                                A. Sajassi
4       Updates: 8584 (if approved)                                   LA. Burdet
5       Intended status: Standards Track                                   Cisco
6       Expires: 9 January 2025                                         J. Drake
7                                                                    Independent
8                                                                     J. Rabadan
9                                                                          Nokia
10                                                                   8 July 2024

12                Fast Recovery for EVPN Designated Forwarder Election

13                      draft-ietf-bess-evpn-fast-df-recovery-09

15      Abstract

17         The Ethernet Virtual Private Network (EVPN) solution provides
18         Designated Forwarder (DF) election procedures for multihomed Ethernet
19         Segments.  These procedures have been enhanced further by applying
20         Highest Random Weight (HRW) algorithm for Designated Forwarder
21         election in order to avoid unnecessary DF status changes upon a
22         failure.  This document improves these procedures by providing a fast
23         Designated Forwarder election upon recovery of the failed link or
24         node associated with the multihomed Ethernet Segment.  This document
25         updates Section 2.1 of [RFC8584] by optionally introducing delays
26         between some of the events therein.

28         The solution is independent of the number of EVPN Instances (EVIs)
29         associated with that Ethernet Segment and it is performed via a
30         simple signaling between the recovered node and each of the other
31         nodes in the multihoming group.

33      Status of This Memo

35         This Internet-Draft is submitted in full conformance with the
36         provisions of BCP 78 and BCP 79.

38         Internet-Drafts are working documents of the Internet Engineering
39         Task Force (IETF).  Note that other groups may also distribute
40         working documents as Internet-Drafts.  The list of current Internet-
41         Drafts is at https://datatracker.ietf.org/drafts/current/.

43         Internet-Drafts are draft documents valid for a maximum of six months
44         and may be updated, replaced, or obsoleted by other documents at any
45         time.  It is inappropriate to use Internet-Drafts as reference
46         material or to cite them other than as "work in progress."

48         This Internet-Draft will expire on 9 January 2025.

50      Copyright Notice

52         Copyright (c) 2024 IETF Trust and the persons identified as the
53         document authors.  All rights reserved.

55         This document is subject to BCP 78 and the IETF Trust's Legal
56         Provisions Relating to IETF Documents (https://trustee.ietf.org/
57         license-info) in effect on the date of publication of this document.
58         Please review these documents carefully, as they describe your rights
59         and restrictions with respect to this document.  Code Components
60         extracted from this document must include Revised BSD License text as
61         described in Section 4.e of the Trust Legal Provisions and are
62         provided without warranty as described in the Revised BSD License.

64      Table of Contents

66         1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   2
67           1.1.  Requirements Language . . . . . . . . . . . . . . . . . .   3
68           1.2.  Terminology . . . . . . . . . . . . . . . . . . . . . . .   3
69           1.3.  Challenges with Existing Mechanism  . . . . . . . . . . .   3
70           1.4.  Design Principles for a Solution  . . . . . . . . . . . .   5
71         2.  DF Election Synchronization Solution  . . . . . . . . . . . .   5
72           2.1.  BGP Encoding  . . . . . . . . . . . . . . . . . . . . . .   6
73           2.2.  Updates to RFC8584  . . . . . . . . . . . . . . . . . . .   7
74         3.  Synchronization Scenarios . . . . . . . . . . . . . . . . . .   8
75           3.1.  Concurrent Recoveries . . . . . . . . . . . . . . . . . .  10
76         4.  Backwards Compatibility . . . . . . . . . . . . . . . . . . .  11
77         5.  Security Considerations . . . . . . . . . . . . . . . . . . .  11
78         6.  IANA Considerations . . . . . . . . . . . . . . . . . . . . .  12
79         7.  Normative References  . . . . . . . . . . . . . . . . . . . .  12
80         Appendix A.  Contributors . . . . . . . . . . . . . . . . . . . .  13
81         Appendix B.  Acknowledgements . . . . . . . . . . . . . . . . . .  13
82         Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . .  14

84      1.  Introduction

86         The Ethernet Virtual Private Network (EVPN) solution [RFC7432] is
87         becoming pervasive in data center (DC) applications for Network
88         Virtualization Overlay (NVO) and DC interconnect (DCI) services, and
89         in service provider (SP) applications for next generation virtual
90         private LAN services.

nit: If there is any IoT use, please mention

nit: "pervasive" is a bold statement. I do not know enough to support or doubt
it, but if there was any reference you could add to support the claim, then it
would make it stronger. Else maybe tone it down ("widely used")...

92         [RFC7432] describes Designated Frowarder (DF) election procedures for
                                           ^ typo

93         multihomed Ethernet Segments.  These procedures are enhanced further
94         in [RFC8584] by applying the Highest Random Weight (HRW) algorithm

nit:

please add the HRW1998 reference as used in RFC8584 as reference for the
term HRW and include it here.

95         for DF election in order to avoid unnecessary DF status changes upon
96         a link or node failure associated with the multihomed Ethernet
97         Segment.  This document makes further improvements to the DF election

nit: insert paragraph break before "This" (background -> contribution).

98         procedures in [RFC8584] by providing an option for a fast DF election
99         upon recovery of the failed link or node associated with the
100        multihomed Ethernet Segment.  This DF election is achieved
101        independent of the number of EVPN Instances (EVIs) associated with
102        that Ethernet Segment and it is performed via straightforward
103        signaling between the recovered node and each of the other nodes in
104        the multihomed group.
105        This document updates the DF Election Finite State Machine (FSM)
106        described in Section 2.1 of [RFC8584], by optionally introducing
107        delays between some events, as further detailed in Section 2.2.  The
108        solution is based on a simple one-way signaling mechanism.

110     1.1.  Requirements Language

112        The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
113        "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
114        "OPTIONAL" in this document are to be interpreted as described in BCP
115        14 [RFC2119] [RFC8174] when, and only when, they appear in all
116        capitals, as shown here.

118     1.2.  Terminology

120        PE:  Provider Edge device.

122        Designated Forwarder (DF):  A PE that is currently forwarding
123           (encapsulating/decapsulating) traffic for a given VLAN in and out
124           of a site.

126        EVI:  An EVPN instance spanning the Provider Edge (PE) devices
127           participating in that EVPN.

129     1.3.  Challenges with Existing Mechanism

131        In EVPN technology, multiple Provider Edge (PE) devices have the
132        ability to encapsulate and decapsulate data belonging to the same
133        VLAN.  Under certain conditions, this may cause Layer2 duplicates and
134        potential loops if there is a momentary overlap in forwarding roles
135        between two or more PE devices, consequently leading to broadcast
136        storms.

138        EVPN [RFC7432] currently specifies timer-based synchronization among
139        PE devices within a redundancy group.  This approach can lead to
140        duplications and potential loops due to multiple Designated
141        Forwarders (DFs) if the timer interval is too short, or to packet
142        drops if the timer interval is too long.

144        Split-horizon filtering, as described in Section 8.3 of [RFC7432],
145        can prevent loops but does not address duplicates.  However, if there
146        are overlapping Designated Forwarders (DFs) of two different sites
147        simultaneously for the same VLAN, the site identifier will differ
148        when the packet re-enters the Ethernet Segment.  Consequently, the
149        split-horizon check will fail, resulting in Layer 2 loops.

minor:

 i cannot find a description of this setup and problem in [RFC7432],
and the description in the paragraph above is quite terse, so i am not
sure that i could come up with a fitting example from scratch. I think it would
thus be useful to provide a topology with an appropriate example of this
condition and explain the problem based on that topology example.

151        The updated Designated Forwarder (DF) procedures outlined in
152        [RFC8584] use the well-known Highest Random Weight (HRW) algorithm to
153        prevent the reshuffling of VLANs among PE devices within the
154        redundancy group during failure or recovery events.  This approach
155        minimizes the impact on VLANs not assigned to the failed or recovered
156        ports and eliminates the occurrence of loops or duplicates during
157        such events.

159        However, upon PE insertion or a port being newly added to a
160        multihomed Ethernet Segment, HRW also cannot help as a transfer of DF
161        role to the new port must occur while the old DF is still active.

163                                          +---------+
164                       +-------------+    |         |
165                       |             |    |         |
166                     / |    PE1      |----|         |   +-------------+
167                    /  |             |    |  MPLS/  |   |             |---CE3
168                   /   +-------------+    |  VxLAN/ |   |     PE3     |
169              CE1 -                       |  Cloud  |   |             |
170                   \   +-------------+    |         |---|             |
171                    \  |             |    |         |   +-------------+
172                     \ |     PE2     |----|         |
173                       |             |    |         |
174                       +-------------+    |         |
175                                          +---------+

177                       Figure 1: CE1 multihomed to PE1 and PE2.

179        In Figure 1, when PE2 is inserted in the Ethernet Segment or its
180        CE1-facing interface recovered, PE1 will transfer the DF role of some
181        VLANs to PE2 to achieve load balancing.  However, because there is no
182        handshake mechanism between PE1 and PE2, overlapping of DF roles for
183        a given VLAN is possible which leads to duplication of traffic as
184        well as Layer 2 loops.

186        Current EVPN specifications [RFC7432] and [RFC8584] rely on a timer-
187        based approach for transferring the DF role to the newly inserted
188        device.  This can cause the following issues:

190        *  Loops/Duplicates if the timer value is too short
191        *  Prolonged Traffic Blackholing if the timer value is too long

193     1.4.  Design Principles for a Solution

195        The clock-synchronization solution for fast DF recovery presented in
196        this document follows several design principles and presents
197        multiples advantages, namely:

199        *  Complex handshake signaling mechanisms and state machines are
200           avoided in favor of a simple uni-directional signaling approach.

202        *  The fast DF recovery solution maintains backwards-compatibility
203           (see Section 4) by ensuring that PEs any unrecognized new BGP
204           Extended Community.

206        *  Existing DF Election algorithms remain supported.

208        *  The fast DF recovery solution is independent of any BGP delays in
209           propagation of Ethernet Segment routes (Route Type 4)

minor:

This claim is unclear to me. There is an overall maximum for the propagation
latency plus processing time of "just" a few seconds with the default SCT
calculation, right ? And that is communicated "in conjunction with" the
Ethernet Segment routes according to your below explanation. So there is
a maximum propagation limit. And likely some serialization, timing
dependencies.... ??!!

211        *  The fast DF recovery solution is agnostic of the actual time
212           synchronization mechanism used, and normalizes to NTP for EVPN
213           signalling only.

XXX

215     2.  DF Election Synchronization Solution

217        The fast DF recovery solution relies on the concept of common clock
218        alignment between partner PEs participating in a common Ethernet
219        Segment i.e. PE1 and PE2 in Figure 1.  The main idea is to have all
220        peering PEs of that Ethernet Segment perform DF election, and apply
221        the result at the same pre-announced time.

223        The DF Election procedure, as described in [RFC7432] and as
224        optionally signalled in [RFC8584], is applied.  All PEs attached to a
225        given Ethernet Segment are clock-synchronized using a networking
226        protocol for clock synchronization (e.g., NTP, PTP).  When a new PE
227        is inserted in an Ethernet Segment or a failed PE device of the
228        Ethernet Segment recovers, that PE communicates to peering partners
229        the current time plus the value of the timer for partner discovery
230        from step 2 in Section 8.5 of [RFC7432].  This constitutes an "end
231        time" or "absolute time" as seen from the local PE.  That absolute
232        time is called the "Service Carving Time" (SCT).

234        A new BGP Extended Community, the Service Carving Timestamp is
235        advertised along with the Ethernet Segment route (RT-4) to
236        communicate the Service Carving Time to other partners.

238        Upon receipt of the new BGP Extended Community, partner PEs can
239        determine the service carving time of the newly insterted PE.  To
240        eliminate any potential for duplicate traffic or loops, the concept
241        of skew is introduced: a small time offset to ensure a controlled and
242        orderly transition when multiple Provider Edge (PE) devices are
243        involved.  The receiving partner PEs add a skew (default = -10ms) to
244        the Service Carving Time to enforce this mechanism.  The previously
245        inserted PE(s) must perform service carving first, followed shortly
246        by the newly insterted PE, after the specified skew delay.

248        To summarize, all peering PEs perform service carving almost
249        simultaneously at the time announced by the newly added/recovered PE.
250        The newly inserted PE initiates the SCT, and triggers service carving
251        immediately on its local timer expiry.  The previously inserted PE(s)
252        receiving Ethernet Segment route (RT-4) with a SCT BGP extended
253        community, perform service carving shortly before Service Carving
254        Time.

256     2.1.  BGP Encoding

258        A new BGP extended community is defined to communicate the Service
259        Carving Timestamp for each Ethernet Segment.

261        A new transitive extended community where the Type field is 0x06, and
262        the Sub-Type is 0x0F is advertised along with the Ethernet Segment
263        route.  The expected Service Carving Time is encoded as an 8-octet
264        value as follows:

266                             1                   2                   3
267         0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
268        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
269        | Type = 0x06   | Sub-Type(0x0F)|      Timestamp Seconds        ~
270        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
271        ~  Timestamp Seconds            | Timestamp Fractional Seconds  |
272        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

274                            Figure 2: Service Carving Time

276        The timestamp exchanged uses the NTP prime epoch of January 1, 1900
277        [RFC5905] and the 64-bit NTP Timestamp Format.  The NTP Era value is
278        not exchanged and Era 0 is assumed as of the writing of this
279        document.  A DF Election operation occurring exactly at the Era
280        transition boundary some time in 2036 is outside of the scope of this
281        document.

major:

This description effectively only supports the protocol until the end of Era 0,
because it neither describes what to do during the switchover to Era N+1, nor
how to operate afterwards without encoding the Era. This makes
the protocol useful (without another RFC) for less than 12 years. That is IMHO
insufficient.

One simple solution would be to state that the Era is not included in the
encoding, but that a plausibility check is made on received timestamps. If a
timestamp is completely out of range for the receiving router's current Era, but
within range for Era-1 or Era+1, then the timestamp is adjusted to use that
Era.
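
As a rough illustration of such a plausibility check, here is a minimal Python
sketch. It assumes the receiver knows its own Era and picks whichever of Era-1,
Era, or Era+1 places the received 32-bit seconds value closest to its local
time. The names resolve_era and ERA_SECONDS are hypothetical; they appear
neither in the draft nor in RFC 5905.

```python
ERA_SECONDS = 2**32  # one NTP era spans 2^32 seconds (roughly 136 years)


def resolve_era(rx_seconds: int, local_era: int, local_seconds: int) -> int:
    """Return era-extended seconds for a received 32-bit NTP seconds field.

    Assumption (not from the draft): the sender's clock is within half an
    era of the receiver's, so the nearest candidate is the intended one.
    """
    local_abs = local_era * ERA_SECONDS + local_seconds
    # Interpret the received value in the previous, current, and next era.
    candidates = [
        (local_era + delta) * ERA_SECONDS + rx_seconds for delta in (-1, 0, 1)
    ]
    # Keep the interpretation closest to the receiver's own notion of time.
    return min(candidates, key=lambda t: abs(t - local_abs))
```

With this approach, a receiver two seconds before the end of Era 0 that gets a
seconds value of 1 would interpret it as belonging to Era 1 rather than as a
timestamp from the year 1900.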

In another solution option, you can encode the Era by carving space from the SCT
encoding as follows:

IMHO, it is unnecessary to encode the fractional seconds with 16 bits.
The accuracy of the signalled timestamp does NOT impact the synchronized
accuracy of the execution of DF switchover. It only impacts the granularity of
timestamps that can be generated. If you signaled only the top 8 bits of
the fractional seconds, then you could still trigger a synchronized switchover
at intervals of about 4 msec (2^-8 s), which IMHO is more than necessary. And
the switchover could still be synchronized to an arbitrarily better accuracy,
such as 1 usec, if the clock synchronization between the routers is that good.
Practically speaking, NTP clock synchronization may often be just 1 msec
accurate anyhow.

Even if you consider my thoughts from concern G.4 above, and want to assign
different timestamps for every Ethernet Segment (especially with a large number
of Ethernet Segments), then an interval of 4 msec would likely be more than
sufficient granularity.

So with just an 8-bit fractional seconds encoding, you have 8 spare bits in the
encoding that you can use for the Era and other features (in the future).

282        The 64-bit NTP Timestamp Format consists of a 32-bit part for Seconds
283        and a 32-bit part for Fraction, which are encoded in the Service
284        Carving Time as follows:

286        *  Timestamp Seconds: 32-bit NTP seconds are encoded in this field.

288        *  Timestamp Fractional Seconds: the high order 16 bits of the NTP
289           'Fraction' field are encoded in this field.

291        When rebuilding a 64-bit NTP Timestamp Format using the values from a
292        received SCT BGP extended community, the lower order 16 bits of the
293        Fractional field are set to 0.  The use of a 16-bit fractional
294        seconds yields adequate precision of 15 microseconds (2^-16 s).
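
For illustration, the encoding described in lines 282-294 can be sketched in
Python as follows. This is a minimal sketch, not text from the draft: the
function names encode_sct and decode_sct are hypothetical, and the octet layout
is read off Figure 2 (Type, Sub-Type, 32-bit seconds, high-order 16 bits of the
fraction).

```python
import struct
from typing import Tuple

SCT_TYPE = 0x06      # Type field from Figure 2
SCT_SUBTYPE = 0x0F   # Sub-Type field from Figure 2


def encode_sct(ntp_seconds: int, ntp_fraction: int) -> bytes:
    """Pack a 64-bit NTP timestamp into the 8-octet SCT extended community."""
    # Only the high-order 16 bits of the 32-bit Fraction are carried.
    frac16 = (ntp_fraction >> 16) & 0xFFFF
    return struct.pack("!BBIH", SCT_TYPE, SCT_SUBTYPE,
                       ntp_seconds & 0xFFFFFFFF, frac16)


def decode_sct(data: bytes) -> Tuple[int, int]:
    """Rebuild (seconds, 32-bit fraction); the low 16 fraction bits become 0."""
    typ, sub, seconds, frac16 = struct.unpack("!BBIH", data)
    if (typ, sub) != (SCT_TYPE, SCT_SUBTYPE):
        raise ValueError("not an SCT extended community")
    return seconds, frac16 << 16
```

Truncating the fraction to 16 bits means a decoded timestamp can be earlier than
the original by up to 2^-16 s, about 15.26 microseconds.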

296        This document introduces a new flag called "T" (for Time
297        Synchronization) to the bitmap field of the DF Election Extended
298        Community defined in [RFC8584].

300                             1                   2                   3
301         0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
302        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
303        | Type = 0x06   | Sub-Type(0x06)| RSV |  DF Alg | |A| |T|       ~
304        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
305        ~     Bitmap    |            Reserved = 0                       |
306        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

308                       Figure 3: DF Election Extended Community

310        *  Bit 3: Time Synchronization (corresponds to Bit 27 of the DF
311           Election Extended Community).  When set to 1, it indicates the
312           desire to use Time Synchronization capability with the rest of the
313           PEs in the Ethernet Segment.

nit:

"Bit 3" is a confusing definition because the "DF Election Extended Community"
field is only mentioned in the prior paragraph and is not shown under this name
in the figure.

I would suggest replacing Figure 3 with Figure 4 from RFC 8584, which does
show "Bitmap", then following it with Figure 5 from RFC 8584 with "T" added,
and then following with the "Bit 3" bullet point.
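
To make the bit numbering concrete, here is a minimal Python sketch of how a
receiver might test the T flag. It assumes the layout in Figure 3 (Type,
Sub-Type, one octet of RSV plus DF Alg, then a 16-bit Bitmap whose bit 0 is the
most significant bit). The helper names has_time_sync and use_time_sync are
hypothetical.

```python
import struct

DF_ELECTION_TYPE = 0x06
DF_ELECTION_SUBTYPE = 0x06
# Bitmap bit 3, counting from the most significant bit of the 16-bit Bitmap.
# The Bitmap starts at bit 24 of the community, so this is overall bit 27.
T_BIT_MASK = 0x8000 >> 3  # 0x1000


def has_time_sync(ext_comm: bytes) -> bool:
    """Return True if the T flag is set in a DF Election Extended Community."""
    if len(ext_comm) != 8:
        raise ValueError("extended community must be 8 octets")
    typ, sub, _rsv_alg, bitmap = struct.unpack("!BBBH", ext_comm[:5])
    if (typ, sub) != (DF_ELECTION_TYPE, DF_ELECTION_SUBTYPE):
        raise ValueError("not a DF Election Extended Community")
    return bool(bitmap & T_BIT_MASK)


def use_time_sync(peer_comms) -> bool:
    """Time Synchronization applies only if every PE on the ES signals T=1."""
    comms = list(peer_comms)
    return bool(comms) and all(has_time_sync(c) for c in comms)
```

If any PE on the Ethernet Segment does not set T, use_time_sync returns False,
which matches the requirement in lines 319-324 that such a PE MUST NOT delay
the election.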

315        This capability is utilized in conjunction with the agreed-upon DF
316        Election Type.  For instance, if all the PE devices in the Ethernet
317        Segment indicate possessing Time Synchronization capability and
                            ^^^^^^^^^^

nit:

"the desire to use the" (to be consistent with the definition of T in line 312).

318        request the DF Election Type to be Highest Random Weight (HRW), then
319        the HRW algorithm is edused in conjunction with this capability.  A
                                ^^^^^^

nit: deduced ?

320        PE which does not support the procedures set out in this document, or
321        receives a route from another PE in which th capability is not set
                                                       ^
nit: "e"

322        MUST NOT delay Designated Forwarder election as this could lead to
323        duplicate traffic in some instances (overlapping Designated
324        Forwarders).

326     2.2.  Updates to RFC8584

328        This document introduces an additional delay to the events and
329        transitions defined for the default DF election algorithm FSM in
330        Section 2.1 of [RFC8584] without changing the FSM state or event
331        definitions themselves.

333        Upon receiving a RECV_ES message, the peering PE's Finite State

nit:

RFC8584 uses the term "RCVD_ES" for an event, and does not use the term
"RECV_ES" for a message. Unless there is good reason to introduce new
(inconsistent/duplicate) terminology, please change to the RFC8584 terminology
of an RCVD_ES event. The same applies further below (line 350).

334        Machine (FSM) transitions from the DF_DONE (indicating the DF
335        election process was complete) state to the DF_CALC (indicating that
336        a new DF calculation is needed) state . Due to the Service Carving
337        Time (SCT) included in the Ethernet-Segment update, the completion of
338        the DF_CALC state and the subsequent transition back to the DF_DONE
339        state are delayed.  This delay ensures proper synchronization and
340        prevents conflicts.  Consequently, the accompanying forwarding
341        updates to the Designated Forwarder (DF) and Non-Designated Forwarder
342        (NDF) states are also deferred.

344        The corresponding actions when transitions are performed or states
345        are entered/exited are modified as follows:

nit:

Suggest to rewrite to the following, to be more precise:

Item 9. in RFC8584, Section 2.1, List "Corresponding actions when transitions
are performed or states are entered/exited" is changed as follows:

347        9.  DF_CALC on CALCULATED: Mark the election result for the VLAN or
348            VLAN Bundle.

350            9.1  If an SCT timestamp is present during the RECV_ES event of
351                 Action 11, wait until the time indicated by the SCT before
352                 proceeding to step 9.2.

354            9.2  Assume the role of DF or NDF for the local PE concerning the
355                 VLAN or VLAN Bundle, and transition to the DF_DONE state.

357        This revised approach ensures proper timing and synchronization in
358        the DF election process, avoiding conflicts and ensuring accurate
359        forwarding updates

minor:

a) Given that this is the normative text, I am worried that the "skew" variable
is not mentioned. Please insert accordingly.

b) 9.1 does not seem to cover the SCT delay that needs to be performed (equally,
except for skew) by the newly inserted PE. 9.1 only mentions the condition of
RECV_ES, which to me does not sound like the newly inserted PE.

minor:

I am somewhat irritated that neither RFC8584 nor this draft has any text in the
state machine section to indicate when/how ES routes are generated. This would
help IMHO especially in this new draft, because that is the time when the
timestamp is taken and the SCT is calculated and inserted into the ES route, and
I guess that this also starts the process leading to the CALCULATED event on the
newly inserted router.

361     3.  Synchronization Scenarios

363        Consider Figure 1 as an example, where initially PE2 has failed and
364        PE1 has taken over.  This scenario illustrates the problem with the
365        DF-Election mechanism described in Section 8.5 of [RFC7432],
366        specifically in the context of the timer value configured for all PEs
367        on the Ethernet Segment.

369        Procedure based on Section 8.5 of [RFC7432] with the default 3 second
370        timer in step 2:

372        1.  Initial state: PE1 is in a steady-state and PE2 is recovering

374        2.  Recovery: PE2 recovers at an absolute time of t=99.

376        3.  Advertisement: PE2 advertises RT-4, sent at t=100, to partner
377            PE1.

379        4.  Timer Start: PE2 starts a 3 second timer to allow the reception
380            of RT-4 from other PE nodes.

382        5.  Immediate carving: PE1 performs service carving immediately upon
383            RT-4 reception, i.e.  t=100 plus some BGP propagation delay.

385        6.  Delayed Carving: PE2 performs service carving at time t=103

387        [RFC7432] favors traffic drops over duplicate traffic.  With the
388        above procedure, traffic drops will occur as part of each PE recovery
389        sequence since PE1 transitions some VLANs to Non-Designated Forwarder
390        (NDF) immediately upon RT-4 reception.
391        The timer value (default = 3 seconds) directly affects the duration
392        of the packet drops.  A shorter (or zero) timer may result in
393        duplicate traffic or traffic loops.

395        Procedure based on the Service Carving Time (SCT) approach:

397        1.  Initial state: PE1 is in a steady state, and PE2 is recovering

399        2.  Recovery: PE2 recovers at an absolute time of t=99.

401        3.  Advertisement: PE2 advertises RT-4, sent at t=100, with a target
402            SCT value of t=103 to partner PE1.

404        4.  Timer Start: PE2 starts a 3 second timer to allow the reception
405            of RT-4 from other PE nodes.

minor:

IMHO, this is not a 3 second timer, but a timer with a deadline of t=103, which
is at most 3 seconds, depending on whether step 4. happens exactly at
t=100 or somewhat later. Practically, it would always be later. IMHO, it would
be good to emphasize this crucial benefit of the new mechanism. It may be
necessary to insert some additional processing delay (between steps 3 and 4)
into the section 8.5 example and this example to show the difference.

407        5.  Service Carving Timer: PE1 starts the service carving timer, with
408            the remaining time until t=103

410        6.  Simultaneous Carving: Both PE1 and PE2 carve at an absolute time
411            of t=103

413        To maintain the preference for minimal loss over duplicate traffic,
414        PE1 should carve slightly before PE2 (with skew).  The recovering PE2
415        performs both DF to NDF and NDF to DF transitions per VLAN at the
416        timer's expiry.  The original PE1, which received the SCT, applies
417        the following:

419        *  DF to NDF Transition(s): at t=SCT minus skew, where both PEs are
420           NDF for the skew duration.

422        *  NDF to DF Transition(s): at t=SCT

minor:

In line 238, the draft says "Upon receipt of the new BGP Extended Community" ...
skew is being applied. The text above (line 419) instead defines the application
of skew upon determination of the state transition. It may be that in all cases
where the BGP Extended Community is received, there is only ever a DF
to NDF transition (but no NDF to DF transition, staying at NDF), but it is
still not ideal to have two inconsistent definitions of when skew is applied.

Technically I think the DF to NDF transition case is more sound than the
"receipt of the BGP extended community", aka: fix text around line 238 ?!

424        This split-behavior ensures a smooth DF role transition with minimal
425        loss.
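
The split behavior in lines 413-422 can be sketched as a small scheduling
helper. This is a minimal Python illustration under stated assumptions: the
function name carving_times is hypothetical, the skew value is a free parameter
(its value is not given in this part of the draft), and an SCT that is already
in the past is clamped to "now", following the Security Considerations text
that no timer is started in that case.

```python
from typing import Tuple


def carving_times(sct: float, skew: float, now: float,
                  is_recovering_pe: bool) -> Tuple[float, float]:
    """Return (df_to_ndf_time, ndf_to_df_time) as absolute times.

    The recovering PE performs both transitions at the SCT. A PE that
    received the SCT gives up DF slightly earlier (SCT minus skew), so
    both PEs are NDF for the skew duration rather than both being DF.
    """
    ndf_to_df = max(sct, now)          # never schedule in the past
    if is_recovering_pe:
        df_to_ndf = ndf_to_df
    else:
        df_to_ndf = max(sct - skew, now)
    return df_to_ndf, ndf_to_df
```

Because both values are absolute deadlines rather than durations, the BGP
propagation delay of the RT-4 only shortens the remaining wait; it does not
shift the carving time.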

427        Using the SCT approach, the negative effect of the timer to allow the
428        reception of RT-4 from other PE nodes is mitigated.  Furthermore, the
429        BGP Ethernet Segment route (RT-4) transmission delay (from PE2 to
430        PE1) becomes a non-issue.  The SCT approach shortens the 3-second
431        timer window to the order of milliseconds.

433     3.1.  Concurrent Recoveries

435        In the eventuality 2 or more PEs in a peering Ethernet Segment group
436        are recovering concurrently or roughly the same time, each will
437        advertise a Service Carving Timestamp.  This SCT value would
438        correspond to what each recovering PE considers the "end time" for DF
439        Election.  A similar situation arises in sequentially recovering PEs,
440        when a second PE recovers approximately at the time of the first PE's
441        advertised SCT expiry, and with its own new SCT-2 outside of the
442        initial SCT window.

444        In the case of multiple concurrent DF elections, each initiated by
445        one of the recovering PEs, the SCTs must be ordered chronologically.
446        All PEs shall execute only a single DF Election at the service
447        carving time corresponding to the largest (latest) received timestamp
448        value.  This DF Election will involve all active PEs in a unified DF
449        Election update.

nit:

I think the wording in 444-449 is misleading/incomplete. The latest SCT timestamp
is not the top criterion; if I understand the intent correctly, each "later"
PEi also needs to be considered a better (best) DF than the prior PE,
right? Aka: in your example below (line 451ff),

    PE1 is DF
    When PE1 receives RT-4 from PE2, PE1 will redo DF calculation and
    consider PE2 to be the DF winner
    When PE1 later receives RT-4 from PE3, PE1 will redo DF calculation
    and now consider PE3 to be the DF winner. And only because PE3 is the
    DF winner, will PE1 now also cancel the SCT for PE2.

If on the other hand, the DF HRW for PE3 were lower than that of PE2,
then PE1 would of course redo the DF election, but given that PE3 does not
show in the result, this AFAIK should also mean that the SCT from PE3 should
have no impact.

Yes/No ?

In any case it would be useful to improve the description to make this clearer.
Especially if/when i misunderstood it.

451        Example:

453        1.  Initial State: PE1 is in a steady state, with services elected at
454            PE1.

456        2.  Recovery of PE2: PE2 recovers at time t=100 and advertises RT-4
457            with a target SCT value of t=103 to its partners (PE1)

459        3.  Timer Initiation by PE2: PE2 starts a 3 second timer to allow the
460            reception of RT-4 from other PE nodes.

462        4.  Timer Initiation by PE1: PE1 starts the service carving timer,
463            with the remaining time until t=103.

465        5.  Recovery of PE3: PE3 recovers at time t=102 and advertises RT-4
466            with a target SCT value of t=105 to its partners (PE1, PE2).

468        6.  Timer Initiation by PE3: PE3 starts a 3 second timer to allow the
469            reception of RT-4 from other PE nodes

471        7.  Timer Update by PE2: PE2 cancels the running timer and starts the
472            service carving timer with the remaining time until t=105.

474        8.  Timer Update by PE1: PE1 updates its service carving timer, with
475            the remaining time until t=105.

477        9.  Service Carving: PE1, PE2, and PE3 perform service carving at the
478            absolute time of t=105.
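
The timer updates in this example can be sketched as follows. This is a minimal
Python illustration of the rule as written in lines 444-449 (carve once, at the
latest received SCT); it does not address the question above of whether the SCT
of a PE that does not win the election should count. The class name
CarvingTimer and its methods are hypothetical.

```python
from typing import Optional


class CarvingTimer:
    """Tracks a single service carving deadline per Ethernet Segment."""

    def __init__(self) -> None:
        self.deadline: Optional[float] = None

    def on_sct(self, sct: float) -> None:
        # Keep only the largest (latest) SCT seen, local or received.
        if self.deadline is None or sct > self.deadline:
            self.deadline = sct

    def remaining(self, now: float) -> float:
        # Time left until carving; 0 if the deadline has already passed.
        if self.deadline is None:
            raise RuntimeError("no SCT known yet")
        return max(0.0, self.deadline - now)


# PE1 in the example: learns SCT=103 from PE2, then SCT=105 from PE3.
pe1 = CarvingTimer()
pe1.on_sct(103.0)
pe1.on_sct(105.0)
```

In this sketch an SCT that arrives later but carries an earlier timestamp
leaves the deadline unchanged, so all PEs converge on t=105 regardless of the
order in which the RT-4 routes arrive.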

480        In the eventuality a PE in a Ethernet Segment group recovers during
481        the discovery window specified in Section 8.5 of [RFC7432], and does
482        not support or advertise the T-bit, then all PEs in the current
483        peering sequence SHALL immediately revert to the default [RFC7432]
484        behavior.

486     4.  Backwards Compatibility

488        For the DF election procedures to achieve global convergence and
489        unanimity within a redundancy group, it is essential that all
490        participating PEs agree on the DF election algorithm to be employed.
491        However, it is possible that some PEs may continue to use the
492        existing modulo-based DF election algorithm from [RFC7432] and not
493        utilize the new Service Carving Time (SCT) BGP extended community.
494        PEs that operate using the baseline DF election mechanism will simply
495        discard the new SCT BGP extended community as unrecognized.
496        [RFC7432] and do not rely on the new SCT BGP extended community.

498        A PE can indicate its willingness to support clock-synchronized
499        carving by signaling the new 'T' DF Election Capability and including
500        the new SCT BGP extended community along with the Ethernet Segment
501        Route (Type-4).  If one or more PEs attached to the Ethernet Segment
502        do not signal T=1, then all PEs in the Ethernet Segment SHALL revert
503        to the timer-based approach as specified in [RFC7432].  This
504        reversion is particularly crucial in preventing VLAN shuffling when
505        more than two PEs are involved.

507     5.  Security Considerations

509        The mechanisms in this document use EVPN control plane as defined in
510        [RFC7432].  Security considerations described in [RFC7432] are
511        equally applicable.

513        For the new SCT Extended Community, attack vectors may be setting the
514        value to zero, to a value in the past or to large times in the
515        future.  The procedures in this document address implicitly what
516        occurs with a carving time in the past, as this would be a naturally
517        occurring event with a large BGP propagation delay: the receiving PE
518        SHALL treat the DF Election at the peer as having occurred already,
519        and proceed without starting any timer to futher delay service
520        carving.  For timestamp values in the future, a rogue PE may be
521        advertising a value inconsistent with its local behavior.  This is no
522        different than a rogue PE setting all its DF Election results
523        inconstently to its peers using (or ignoring adherence to) the
524        procedures from [RFC7432], and the result would similarly be
525        duplicate or dropped traffic.  It is left to implementations to
526        decide what consists an "unreasonably large" SCT value.

528        This document uses MPLS and IP-based tunnel technologies to support
529        data plane transport.  Security considerations described in [RFC7432]
530        and in [RFC8365] are equally applicable.

532     6.  IANA Considerations

534        IANA maintains the "EVPN Extended Community Sub-Types" registry set
535        up by [RFC7153].  IANA is requested to confirm the First Come First
536        Served assignment as follows:

538           Sub-Type Value   Name                        Reference       Date
539           --------------   -------------------------   -------------   ----
540                 0x0F       Service Carving Timestamp   This document   TBD

542        IANA should replace the field TBD with the date of publicaton of this
543        document as an RFC.

545        IANA maintains the "DF Election Capabilities" registry set up by
546        [RFC8584].  IANA is requested to make the following assignment from
547        this registry:

549            Bit         Name                         Reference        Date
550            ----        ----------------             -------------    ----
551            3           Time Synchronization         This document    TBD

553        IANA should replace the field TBD with the date of publicaton of this
554        document as an RFC.

556     7.  Normative References

558        [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
559                   Requirement Levels", BCP 14, RFC 2119,
560                   DOI 10.17487/RFC2119, March 1997,
561                   <https://www.rfc-editor.org/info/rfc2119>.

563        [RFC5905]  Mills, D., Martin, J., Ed., Burbank, J., and W. Kasch,
564                   "Network Time Protocol Version 4: Protocol and Algorithms
565                   Specification", RFC 5905, DOI 10.17487/RFC5905, June 2010,
566                   <https://www.rfc-editor.org/info/rfc5905>.

568        [RFC7153]  Rosen, E. and Y. Rekhter, "IANA Registries for BGP
569                   Extended Communities", RFC 7153, DOI 10.17487/RFC7153,
570                   March 2014, <https://www.rfc-editor.org/info/rfc7153>.

572        [RFC7432]  Sajassi, A., Ed., Aggarwal, R., Bitar, N., Isaac, A.,
573                   Uttaro, J., Drake, J., and W. Henderickx, "BGP MPLS-Based
574                   Ethernet VPN", RFC 7432, DOI 10.17487/RFC7432, February
575                   2015, <https://www.rfc-editor.org/info/rfc7432>.

577        [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
578                   2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
579                   May 2017, <https://www.rfc-editor.org/info/rfc8174>.

581        [RFC8365]  Sajassi, A., Ed., Drake, J., Ed., Bitar, N., Shekhar, R.,
582                   Uttaro, J., and W. Henderickx, "A Network Virtualization
583                   Overlay Solution Using Ethernet VPN (EVPN)", RFC 8365,
584                   DOI 10.17487/RFC8365, March 2018,
585                   <https://www.rfc-editor.org/info/rfc8365>.

587        [RFC8584]  Rabadan, J., Ed., Mohanty, S., Ed., Sajassi, A., Drake,
588                   J., Nagaraj, K., and S. Sathappan, "Framework for Ethernet
589                   VPN Designated Forwarder Election Extensibility",
590                   RFC 8584, DOI 10.17487/RFC8584, April 2019,
591                   <https://www.rfc-editor.org/info/rfc8584>.

593     Appendix A.  Contributors

595        In addition to the authors listed on the front page, the following
596        co-authors have also contributed substantially to this document:

598        Gaurav Badoni
599        Cisco

601        Email: gbadoni@cisco.com

603        Dhananjaya Rao
604        Cisco

606        Email: dhrao@cisco.com

608     Appendix B.  Acknowledgements

610        Authors would like to acknowledge helpful comments and contributions
611        of Satya Mohanty and Bharath Vasudevan.  Also thank you to Anoop
612        Ghanwani and Gunter van de Velde for their thorough review with
613        valuable comments and corrections.

615     Authors' Addresses

617        Patrice Brissette (editor)
618        Cisco
619        Email: pbrisset@cisco.com

621        Ali Sajassi
622        Cisco
623        Email: sajassi@cisco.com

625        Luc Andre Burdet
626        Cisco
627        Email: lburdet@cisco.com

629        John Drake
630        Independent
631        Email: je_drake@yahoo.com

633        Jorge Rabadan
634        Nokia
635        Email: jorge.rabadan@nokia.com

EOF