[bess] Re: Iotdir telechat review of draft-ietf-bess-evpn-fast-df-recovery-09

Toerless Eckert <tte@cs.fau.de> Wed, 21 August 2024 15:50 UTC

Date: Wed, 21 Aug 2024 17:50:27 +0200
From: Toerless Eckert <tte@cs.fau.de>
To: Luc André Burdet <laburdet.ietf@gmail.com>
CC: "iot-directorate@ietf.org" <iot-directorate@ietf.org>, "bess@ietf.org" <bess@ietf.org>, "draft-ietf-bess-evpn-fast-df-recovery.all@ietf.org" <draft-ietf-bess-evpn-fast-df-recovery.all@ietf.org>, "last-call@ietf.org" <last-call@ietf.org>, "evyncke@cisco.com" <evyncke@cisco.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/bess/t4fRcDdZMfgiq5QBsySk6WOO6ww>

Thanks, Luc, inline

On Mon, Aug 19, 2024 at 08:24:05PM +0000, Luc André Burdet wrote:
> Hi Toerless,
> 
> Thank you for the detailed review. I have updated the inline-comments for -10 which will be posted soon. For the itemized questions please see below. Thanks !

Not quite sure i parse this correctly... I'll wait for -10 and then check what you changed,
and/or wait for further replies from you... ?

> G.1 and G.2 : I will leave that question for a wider scope, this document merely updates existing RFCs -> and the reference to HRW is ‘en passant’ as an improvement which happened over time (perfect or not...)

G.1: As an IOTdir member, i am primarily curious as to if/where pseudowires are used
     in IoT, but happy to get educated in the hallway if the WG feels it's unnecessary to
     give an example like datacenter in the text. I just thought that given how old pseudowires
     are, IoT use cases may have come later, and this would be a nice opportunity to add
     a bit of marketing about them.

G.2: Sure, this was primarily meant as a warning to the WG. Took many years of deployments
     in IP Multicast before there were customers who complained. Not that we ever helped
     them either ;-))

> G.3 is a very interesting proposal actually, for orderly ‘removal’ from network (maintenance operations).   I will give this some more thought with co-authors to see how to incorporate this, thanks for the valuable suggestion !

Sure, let me know if i can help. Graceful shutdown IMHO seems to be one of those
systematic gaps of IETF RTG work, so it's always nice to take opportunities to do
something about it.

> G.4 please note this draft currently addresses “controlled recovery” only, not “controlled failures” (as in G.3).  While technically accurate, in reality interface recovery is very rarely the “same millisecond” or close thereto.
> In practice, even if unlatched all together, recovering interfaces will also have some time gaps between them.  One way to address this concern is to provide for a non-default (configured) skew to account for hw programming speed(s).  More pertinent, though, is that this draft allows for larger non-default peering values (larger than the 3s from the base RFC), and interfaces that are known to program slowly or that have a large number of subinterfaces or hosts to program can easily avail of a larger peering timer specific to the conditions of that ES. The SCT represents the wall-clock time of this base-RFC peering timer at the recovering PE.

Sorry if any of my text could be misread as discussing "controlled failures". That was not my
intention. I understand the issues of that case and didn't mean to imply application of this
draft's technology to it.

What i am worried about is the missing text to outline in more detail how the controlled recovery
needs to work (be implemented) to allow for "sub 10msec" ("simultaneous") controlled recovery
in the face of routers of different speeds and a large number of routes. Especially so as to avoid
naive implementations doing what i outlined (quickly marking a large number of routes in the
control plane with the same SCT and then seeing the actual SCT failover times diverge due to
differences in HW programming speed).

For example, something like "Implementations SHOULD know, through appropriate configuration
parameters, the lowest available SCT-based route-change rate among their redundant peers and
spread SCT times according to this rate, so that the SCT times remain actually executable under
a large number of routes".
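
A minimal sketch of such spreading, assuming the slowest peer's route-change rate is known from configuration. The names spread_sct and peer_min_rate are hypothetical and do not come from the draft; the 250 updates/second figure is the one from the original review (4 msec between routes).

```python
from datetime import datetime, timedelta, timezone


def spread_sct(base_sct, route_count, peer_min_rate):
    """Return one SCT per route, spaced so that no more than
    peer_min_rate routes fall due per second.

    peer_min_rate is an assumed, configured value: the lowest
    route-change rate (routes/second) among the redundant peers.
    """
    if peer_min_rate <= 0:
        raise ValueError("peer_min_rate must be positive")
    step = timedelta(seconds=1.0 / peer_min_rate)
    return [base_sct + i * step for i in range(route_count)]


# Example: a burst of 1000 routes, slowest peer handles 250 routes/second.
base = datetime(2024, 8, 21, 15, 50, 30, tzinfo=timezone.utc)
scts = spread_sct(base, route_count=1000, peer_min_rate=250)
print(scts[1] - scts[0])   # spacing between consecutive routes
print(scts[-1] - scts[0])  # total spread across the burst
```

The trade-off is visible in the second value: the last route of the burst recovers several seconds after the first, but each individual switchover stays executable at its advertised time.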

Also: I am unclear whether routes can be reordered between sending and receiving nodes, e.g.:
because of a route reflector (RR). Do you know ?

> G.5/G.6 the variant (a) is the one I am aware of implemented by vendors: wait for NTP sync before proceeding to many or most operations in control plane, incl this peering of ethernet-segments.  If NTP sync becomes an issue (on router first-reload for example) delays are usually added prior to inserting the router into the network (advertising routes).  In short, NTP sync often becomes a gate to some operations -> I could add some text with a stronger link to clock-sync before including the SCT extended community ?

I always prefer explicit operational standards as opposed to just leaving it up to the imagination
of the implementers what to do. E.g.: Implementations SHOULD have a configurable
timer tmaxwait_scn. This is the maximum time that a router, after recovering and being able to
bring up pseudowire segments, will wait for clock synchronization to be established with sufficient
accuracy to guarantee the desired simultaneous SCN operation. This timer has no impact on loss of
clock synchronization during normal router operation. It is assumed that that condition can be
avoided through appropriate redundancy in clock synchronization configuration.
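
A rough sketch of such a gate, under stated assumptions: get_offset_ms is a hypothetical callback returning the current clock offset in milliseconds (or None if unknown), since how to query the synchronization state is platform specific. The 30 second and 3 msec values are the defaults floated in the original review, not values from the draft.

```python
import time


def wait_for_clock_sync(get_offset_ms, max_wait_s, required_accuracy_ms,
                        poll_s=0.5, now=time.monotonic, sleep=time.sleep):
    """Block until the clock offset is within required_accuracy_ms,
    or until max_wait_s (the tmaxwait-style timer) has elapsed.

    Returns True if synchronized in time, False otherwise, in which
    case the caller would fall back to the non-SCT behavior.
    """
    deadline = now() + max_wait_s
    while True:
        offset = get_offset_ms()
        if offset is not None and abs(offset) <= required_accuracy_ms:
            return True
        if now() >= deadline:
            return False
        sleep(poll_s)


# Example with canned offsets instead of a real NTP query:
# unknown, then 40 msec off, then 2.5 msec off.
offsets = iter([None, 40.0, 2.5])
synced = wait_for_clock_sync(lambda: next(offsets), max_wait_s=30,
                             required_accuracy_ms=3, sleep=lambda s: None)
print(synced)
```

The gate only runs at startup, matching the statement above that the timer has no impact on loss of synchronization during normal operation.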

Of course, if solutions like this are already documented in another place that you could point to,
that would suffice too, but i am not aware of such operational RFCs (or other SDO specs).

I'd also like to renew the suggestion of b): add text about a plausibility check of received SCT
timestamps and appropriate mitigation if the plausibility check fails. Else this could lead to
inexplicably long recovery if the SCT is too far into the future.
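
As an illustration only, such a check could look like the sketch below. The function name, the return labels and the bounds are assumptions made for this example: max_lead_s mirrors the 3 second peering timer mentioned in this thread, and slack_s stands for any extra allowance (such as SCT spreading).

```python
def check_sct(sct, now, max_lead_s=3.0, slack_s=0.0):
    """Classify a received SCT against the local synchronized clock.

    sct and now are seconds on the same timescale. Returns:
      "ok"      - SCT is in the future and within the allowed window
      "past"    - SCT already expired; carve immediately
      "too_far" - SCT implausibly far ahead; use a fallback timer
    """
    lead = sct - now
    if lead < 0:
        return "past"
    if lead > max_lead_s + slack_s:
        return "too_far"
    return "ok"


print(check_sct(sct=1002.5, now=1000.0))  # within the 3 second window
print(check_sct(sct=999.0, now=1000.0))   # already expired
print(check_sct(sct=1010.0, now=1000.0))  # implausibly far in the future
```

A receiver would then raise an error condition for anything other than "ok" and apply whatever fallback the specification chooses.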

Likewise the other operational text i provided.

> G.7 this was always poorly written – I have updated to “subtracting a positive value” throughout - but the “break before make” is actually on purpose.  On recovery you do not want 2 interfaces in DF mode, that will create duplicates, loops etc.

I actually did have experiences with customers who did prefer duplicates during failover
(IP Multicast). That is just a matter of whether or not the app and the path bandwidths can
handle them.

I do not understand looping in redundant pseudowire setups well enough to have an argument about them.

Cheers
    Toerless

> Regards,
> Luc André
> 
> Luc André Burdet |  Cisco  |  laburdet.ietf@gmail.com  |  Tel: +1 613 254 4814
> 
> 
> From: Toerless Eckert via Datatracker <noreply@ietf.org>
> Date: Wednesday, August 14, 2024 at 12:27
> To: iot-directorate@ietf.org <iot-directorate@ietf.org>
> Cc: bess@ietf.org <bess@ietf.org>, draft-ietf-bess-evpn-fast-df-recovery.all@ietf.org <draft-ietf-bess-evpn-fast-df-recovery.all@ietf.org>, last-call@ietf.org <last-call@ietf.org>, evyncke@cisco.com <evyncke@cisco.com>
> Subject: [bess] Iotdir telechat review of draft-ietf-bess-evpn-fast-df-recovery-09
> Reviewer: Toerless Eckert
> Review result: On the Right Track
> 
> Reviewer: Toerless Eckert
> 
> Summary:
> 
> The purpose of the document is to extend the BGP message signaling and local
> router procedures for failover of "Designated Forwarders" for pseudowires using
> calculated future timestamps and expecting clock synchronization across the
> forwarders, so that after receipt of the BGP message, the switchover can be
> handled autonomously by every node as synchronously as desired and allowed for
> by the clock synchronization method used.
> 
> Review result: On The Right Track
> 
> I am the assigned IOTDIR reviewer.  I found the document well written and easy
> to read, except for some typos, other nits and some gaps in the logical description.
> 
> (unfortunately ?) I find the approach of the draft very useful, and i always
> wished we would have been able to build this in other IETF protocol domains (IP
> multicast), so i happen to have a range of technical concerns and suggestions
> primarily around the completeness of the document's methods and detail
> specifications, which i hope will be helpful to improve on the quality of the
> text and usefulness of the solution.
> 
> The following is a list of G.i general comments followed by the commented
> idnits version of the draft.
> 
> Thank you very much for the work!
>     Toerless Eckert
> 
> General comments:
> 
> G.1 minor: Why IOTdir review ?
> 
> I am a bit puzzled why this draft was given to IOTdir for early review. Neither
> the draft nor the RFCs it references mention IoT. And the mentioned pseudowire
> use-cases are all around DataCenter. So i wonder what specific IoT feedback the
> authors/WG are looking for. If there actually is a specific type of use-case for
> IoT with this technology, then it would be great to mention it.
> 
> G.2 minor/suggestion: HRW has known problems
> 
> HRW was popularized and (in)validated in deployments of PIM-SM since 1995 and
> hence RFC2362, way before HRW1998 was written, but of course this is not credited in
> RFC8584. I would nevertheless like to point out that the IP Multicast community
> in the IETF had some run-ins with operators over the decades who were
> disappointed by its non-equal distribution in the face of specific typical sets of
> parameters such as consecutive or close to each other router-IDs. Of course,
> the parameters used in EVPN are different, and i have not tried to validate if
> or how such deployment specific anomalies would or could equally apply to the
> EVPN version, but i would strongly suggest to be aware that HRW is by far not a
> well randomizing algorithm, especially for the order of the input parameters.
> HRW is now probably 30 years old, and maybe EVPN wants to look into newer,
> and supposedly better algorithms such as MurmurHash (which was a recommendation
> from a math geek colleague even 15 years ago - and other proposals in the IETF
> are picking up on it too).
> 
> G.3 minor/question: Please consider adding ordered shutdown support
> 
> If my understanding of RFC7432/RFC8584 and this draft is correct, the
> interruption in case of an ordered shutdown of a DF is as large as that of an
> unexpected shutdown/service interruption (without the detection of interruption
> of course). I think this is not necessary.
> 
> I think it would be great if this draft could add support for the synchronized
> switchover in case of ordered shutdown of a DF because such procedures
> likely account for a large number of outages in daily operations of larger
> networks.
> 
> For example, the new extended community could have a flag indication of such an
> ordered shutdown so that the indicated SCT will trigger synchronized failover
> to the BDF (Backup DF). And only after the failover has happened would the
> primary DF send out the NLRI withdraw route and finish the shutdown operation.
> 
> G.4 major: analysis of actual failover behavior
> 
> The mechanism of this draft seems to aspire through synchronized switchover to
> achieve a switchover interruption in the order of 10 msec (the skew default
> value). I am worried that in the face of a large number of failovers (because
> of a large number of VLAN/ES services), that the interruption becomes larger
> and that it will be inconsistent across different services.
> 
> The way i imagine the failover to operate (from similar failovers in other
> technologies like multicast), a router may fairly quickly be able to generate
> the SCT carrying routes, so there can be a burst of SCT routes all with the
> same SCT. When those SCT then actually expire both on the sending and receiving
> router, the speed at which they are added/deleted in hardware-forwarding will
> depend on the performance of updating hardware forwarding registers. Which may
> be inconsistent across different routers. It is also not clear to me if the BGP
> infrastructure or other factors can or can not introduce any reordering. But if
> for example we have a thousand routes that need to be updated, and one router can
> update 1000 routes/sec and the other can update 2000 routes/sec, then one will
> be done after half a second, the other after one second - no reordering assumed.
> 
> So it would be very helpful to have some idea about the maximum imaginable
> scalability required and likely min/max performances to vet the impact of this
> candidate issue.
> 
> There is of course a way to overcome this issue, which is to generate SCT that
> take the performance of (de)installation of hardware forwarding entries into
> account, for example by assuming some floor performance and generate SCT for
> such burst of service routes with timestamps increasing such that when they
> will be executed, they will stay under such a performance floor. I.e.: Have a
> difference of e.g.: 4msec between each route, resulting in no more than
> 250 SCT updates/second.
> 
> In any case, it would be great if the target goal of this draft - less
> than 10 msec interruption - would not be invalidated by such real-world
> performance impacts if it actually is easy to overcome them with a bit of
> additional text in the draft.
> 
> G.5 major: Behavior upon non-synchronization.
> 
> I think the draft should do more due-diligence in its text for various
> conditions of non-correct time synchronization between devices. Let us first agree
> on the conditions and general direction, and then i am happy to propose text if
> it makes sense to the WG.
> 
> a) A router can and then should validate the state of synchronization of its
> clock (in NTP for example this is typically possible via some management API,
> not sure if there is already a YANG model). When restarting, the router may find
> that its clock is not yet synchronized to the necessary degree of accuracy. The minimum
> required synchronization accuracy should be configurable, default maybe 3 msec. In this
> case the router would wait until the synchronization is sufficient, up to a
> maximum time period (configurable, default maybe 30 seconds). If
> synchronization is still not sufficient then, revert to behaving as a non-draft-compliant
> router - and upgrade later on if and when synchronization is successful.
> 
> b) A router which is aware that it is correctly synchronized is receiving an
> SCT update from another router which did not correctly recognize its own
> synchronization failure (e.g.: does not have the API to validate its local
> clock being synchronized).  This condition might warrant a flag bit in the
> route updates, if feasible.
> 
> To discover and work around this condition, routers will perform a plausibility
> check on received SCT timestamps, e.g.: validate that the received timestamp is
> within a reasonable window around the local (synchronized) clock at the time of
> reception of the SCT carrying route: at least one second from current clock, at
> most the configured interval (default 3 seconds), plus extensions, such as some
> seconds if concern G.4 is taken into account. If the received SCT is out of
> bounds, then the receiving router would raise some error condition and perform
> some fallback failover, e.g.: within 3 seconds from reception (to avoid that
> failover would happen at an inappropriately long time in the future), or
> immediately, when SCT is in the past.
> 
> G.6 minor: some suggested NTP operational text
> 
> The following is proposed text for an NTP clock synchronization operational
> considerations section. It includes only G.5 suggestion a), but also other
> aspects crucial for successful deployment.
> 
> ----
> 
> While the use of a synchronized clock between the participating routers makes
> the solution itself very simple and accurate, it does introduce a new
> potentially large and complex dependency against the clock synchronization
> mechanism used. Because of the use of NTP timestamps, it is not possible to
> build really lightweight and autonomously operating clock synchronization
> systems. Instead, one will likely need to create an operational dependency
> against a clock source with automated inclusion of complexities specifically
> the leap seconds, which includes satellite clock sources (Beidou, Galileo,
> GLONASS or GPS), or terrestrial (DCF77, WWVB, MSF or JJY). If this dependency
> is operationally already established for other purposes, then the mechanism of
> this document does not provide incremental requirements except maybe for the
> required accuracy. Otherwise the requirements to operate the clock
> synchronization need to be analyzed.
> 
> For the mechanism of this document to provide the desired benefit,
> synchronization of a few milliseconds (5) or less is required, so that the skew
> is sufficient to separate the break DF times from the make DF times. This
> should in general not be a problem to achieve with minimal NTPv4 installations
> that are aware of common pitfalls as follows.
> 
> When a router restarts, initial synchronization to other NTP server(s) is sped
> up if the router has a local battery backed RTC clock from which it can
> derive a starting time as well as the capability to step the clock to quickly
> synchronize to the other NTP server(s).
> 
> If either is not possible, synchronization may take more than a few seconds
> after reboot and it may be desirable to delay bringing up DF functionality
> until the desired accuracy of clock synchronization is achieved.
> 
> Synchronization across WAN links can be subject to asymmetric latency, which
> can be as high as some msec, such as for pseudowires across transcontinental
> connectivity between backup DCs. Clock synchronization protocols can not
> automatically figure out such asymmetric propagation latencies. If deployments
> with such asymmetric latencies are required, the clock synchronization protocol
> needs to have options to learn about such asymmetries, such as through
> configuration.
> 
> G.7 minor: make before break instead of break before make
> 
> I think that it would make sense to define skew as configurable and explicitly
> point to the option of making it positive so as to achieve "make before break"
> functionality, E.g.: making the recovering router become DF slightly before the
> withdrawing router.
> 
> I can think of several types of customer services that can better deal with
> duplicates than with even short term losses. And unless i am overlooking some
> looping issues in the broadcast domains (which i likely may), the only reason
> to do break before make is IMHO services where the simultaneous sending will
> result in overload. But whenever a service has a low rate of actual user
> traffic, most applications will prefer a few duplicates over a few lost packets.
> 
> --
> 
> The following is idnits output to have line numbers. issues/discussions from
> the review have no line numbers.
> ------
> 
> draft-ietf-bess-evpn-fast-df-recovery-09.txt:
> 
>   Showing Errors (**), Flaws (~~), Warnings (==), and Comments (--).
>   Errors MUST be fixed before draft submission.  Flaws SHOULD be fixed before
>   draft submission.
> 
>   Checking boilerplate required by RFC 5378 and the IETF Trust (see
>   https://trustee.ietf.org/license-info)
>   ----------------------------------------------------------------------------
> 
>      No issues found here.
> 
>   Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
>   ----------------------------------------------------------------------------
> 
>      No issues found here.
> 
>   Running in submission checking mode -- *not* checking nits according to
>   https://www.ietf.org/id-info/checklist .
>   ----------------------------------------------------------------------------
> 
>      No nits found.
> --------------------------------------------------------------------------------
> 
> 2       BESS Working Group                                     P. Brissette, Ed.
> 3       Internet-Draft                                                A. Sajassi
> 4       Updates: 8584 (if approved)                                   LA. Burdet
> 5       Intended status: Standards Track                                   Cisco
> 6       Expires: 9 January 2025                                         J. Drake
> 7                                                                    Independent
> 8                                                                     J. Rabadan
> 9                                                                          Nokia
> 10                                                                   8 July 2024
> 
> 12                Fast Recovery for EVPN Designated Forwarder Election
> 
> 13                      draft-ietf-bess-evpn-fast-df-recovery-09
> 
> 15      Abstract
> 
> 17         The Ethernet Virtual Private Network (EVPN) solution provides
> 18         Designated Forwarder (DF) election procedures for multihomed Ethernet
> 19         Segments.  These procedures have been enhanced further by applying
> 20         Highest Random Weight (HRW) algorithm for Designated Forwarder
> 21         election in order to avoid unnecessary DF status changes upon a
> 22         failure.  This document improves these procedures by providing a fast
> 23         Designated Forwarder election upon recovery of the failed link or
> 24         node associated with the multihomed Ethernet Segment.  This document
> 25         updates Section 2.1 of [RFC8584] by optionally introducing delays
> 26         between some of the events therein.
> 
> 28         The solution is independent of the number of EVPN Instances (EVIs)
> 29         associated with that Ethernet Segment and it is performed via a
> 30         simple signaling between the recovered node and each of the other
> 31         nodes in the multihoming group.
> 
> 33      Status of This Memo
> 
> 35         This Internet-Draft is submitted in full conformance with the
> 36         provisions of BCP 78 and BCP 79.
> 
> 38         Internet-Drafts are working documents of the Internet Engineering
> 39         Task Force (IETF).  Note that other groups may also distribute
> 40         working documents as Internet-Drafts.  The list of current Internet-
> 41         Drafts is at https://datatracker.ietf.org/drafts/current/.
> 
> 43         Internet-Drafts are draft documents valid for a maximum of six months
> 44         and may be updated, replaced, or obsoleted by other documents at any
> 45         time.  It is inappropriate to use Internet-Drafts as reference
> 46         material or to cite them other than as "work in progress."
> 
> 48         This Internet-Draft will expire on 9 January 2025.
> 
> 50      Copyright Notice
> 
> 52         Copyright (c) 2024 IETF Trust and the persons identified as the
> 53         document authors.  All rights reserved.
> 
> 55         This document is subject to BCP 78 and the IETF Trust's Legal
> 56         Provisions Relating to IETF Documents (https://trustee.ietf.org/
> 57         license-info) in effect on the date of publication of this document.
> 58         Please review these documents carefully, as they describe your rights
> 59         and restrictions with respect to this document.  Code Components
> 60         extracted from this document must include Revised BSD License text as
> 61         described in Section 4.e of the Trust Legal Provisions and are
> 62         provided without warranty as described in the Revised BSD License.
> 
> 64      Table of Contents
> 
> 66         1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   2
> 67           1.1.  Requirements Language . . . . . . . . . . . . . . . . . .   3
> 68           1.2.  Terminology . . . . . . . . . . . . . . . . . . . . . . .   3
> 69           1.3.  Challenges with Existing Mechanism  . . . . . . . . . . .   3
> 70           1.4.  Design Principles for a Solution  . . . . . . . . . . . .   5
> 71         2.  DF Election Synchronization Solution  . . . . . . . . . . . .   5
> 72           2.1.  BGP Encoding  . . . . . . . . . . . . . . . . . . . . . .   6
> 73           2.2.  Updates to RFC8584  . . . . . . . . . . . . . . . . . . .   7
> 74         3.  Synchronization Scenarios . . . . . . . . . . . . . . . . . .   8
> 75           3.1.  Concurrent Recoveries . . . . . . . . . . . . . . . . . .  10
> 76         4.  Backwards Compatibility . . . . . . . . . . . . . . . . . . .  11
> 77         5.  Security Considerations . . . . . . . . . . . . . . . . . . .  11
> 78         6.  IANA Considerations . . . . . . . . . . . . . . . . . . . . .  12
> 79         7.  Normative References  . . . . . . . . . . . . . . . . . . . .  12
> 80         Appendix A.  Contributors . . . . . . . . . . . . . . . . . . . .  13
> 81         Appendix B.  Acknowledgements . . . . . . . . . . . . . . . . . .  13
> 82         Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . .  14
> 
> 84      1.  Introduction
> 
> 86         The Ethernet Virtual Private Network (EVPN) solution [RFC7432] is
> 87         becoming pervasive in data center (DC) applications for Network
> 88         Virtualization Overlay (NVO) and DC interconnect (DCI) services, and
> 89         in service provider (SP) applications for next generation virtual
> 90         private LAN services.
> 
> nit: If there is any IoT use, please mention
> 
> nit: "pervasive" is a bold statement. I do not know enough to support or doubt
> it, but if there was any reference you could add to support the claim, then it
> would make it stronger. Else maybe tone it down ("widely used")...
> 
> 92         [RFC7432] describes Designated Frowarder (DF) election procedures for
>                                            ^ typo
> 
> 93         multihomed Ethernet Segments.  These procedures are enhanced further
> 94         in [RFC8584] by applying the Highest Random Weight (HRW) algorithm
> 
> nit:
> 
> please add the HRW1998 reference as used in RFC8584 as reference for the
> term HRW and include it here.
> 
> 95         for DF election in order to avoid unnecessary DF status changes upon
> 96         a link or node failure associated with the multihomed Ethernet
> 97         Segment.  This document makes further improvements to the DF election
> 
> nit: insert paragraph break before "This" (background -> contribution).
> 
> 98         procedures in [RFC8584] by providing an option for a fast DF election
> 99         upon recovery of the failed link or node associated with the
> 100        multihomed Ethernet Segment.  This DF election is achieved
> 101        independent of the number of EVPN Instances (EVIs) associated with
> 102        that Ethernet Segment and it is performed via straightforward
> 103        signaling between the recovered node and each of the other nodes in
> 104        the multihomed group.
> 105        This document updates the DF Election Finite State Machine (FSM)
> 106        described in Section 2.1 of [RFC8584], by optionally introducing
> 107        delays between some events, as further detailed in Section 2.2.  The
> 108        solution is based on a simple one-way signaling mechanism.
> 
> 110     1.1.  Requirements Language
> 
> 112        The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
> 113        "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
> 114        "OPTIONAL" in this document are to be interpreted as described in BCP
> 115        14 [RFC2119] [RFC8174] when, and only when, they appear in all
> 116        capitals, as shown here.
> 
> 118     1.2.  Terminology
> 
> 120        PE:  Provider Edge device.
> 
> 122        Designated Forwarder (DF):  A PE that is currently forwarding
> 123           (encapsulating/decapsulating) traffic for a given VLAN in and out
> 124           of a site.
> 
> 126        EVI:  An EVPN instance spanning the Provider Edge (PE) devices
> 127           participating in that EVPN.
> 
> 129     1.3.  Challenges with Existing Mechanism
> 
> 131        In EVPN technology, multiple Provider Edge (PE) devices have the
> 132        ability to encapsulate and decapsulate data belonging to the same
> 133        VLAN.  Under certain conditions, this may cause Layer2 duplicates and
> 134        potential loops if there is a momentary overlap in forwarding roles
> 135        between two or more PE devices, consequently leading to broadcast
> 136        storms.
> 
> 138        EVPN [RFC7432] currently specifies timer-based synchronization among
> 139        PE devices within a redundancy group.  This approach can lead to
> 140        duplications and potential loops due to multiple Designated
> 141        Forwarders (DFs) if the timer interval is too short, or to packet
> 142        drops if the timer interval is too long.
> 
> 144        Split-horizon filtering, as described in Section 8.3 of [RFC7432],
> 145        can prevent loops but does not address duplicates.  However, if there
> 146        are overlapping Designated Forwarders (DFs) of two different sites
> 147        simultaneously for the same VLAN, the site identifier will differ
> 148        when the packet re-enters the Ethernet Segment.  Consequently, the
> 149        split-horizon check will fail, resulting in Layer 2 loops.
> 
> minor:
> 
>  i cannot find a description of this setup and problem in [RFC7432],
> and the description in the paragraph above is so terse that i am not
> sure i could make up a fitting example from scratch. I think it would
> thus be useful to provide a topology with an appropriate example of this
> condition and explain the problem based on that topology example.
> 
> 151        The updated Designated Forwarder (DF) procedures outlined in
> 152        [RFC8584] use the well-known Highest Random Weight (HRW) algorithm to
> 153        prevent the reshuffling of VLANs among PE devices within the
> 154        redundancy group during failure or recovery events.  This approach
> 155        minimizes the impact on VLANs not assigned to the failed or recovered
> 156        ports and eliminates the occurrence of loops or duplicates during
> 157        such events.
> 
> 159        However, upon PE insertion or a port being newly added to a
> 160        multihomed Ethernet Segment, HRW also cannot help as a transfer of DF
> 161        role to the new port must occur while the old DF is still active.
> 
> 163                                          +---------+
> 164                       +-------------+    |         |
> 165                       |             |    |         |
> 166                     / |    PE1      |----|         |   +-------------+
> 167                    /  |             |    |  MPLS/  |   |             |---CE3
> 168                   /   +-------------+    |  VxLAN/ |   |     PE3     |
> 169              CE1 -                       |  Cloud  |   |             |
> 170                   \   +-------------+    |         |---|             |
> 171                    \  |             |    |         |   +-------------+
> 172                     \ |     PE2     |----|         |
> 173                       |             |    |         |
> 174                       +-------------+    |         |
> 175                                          +---------+
> 
> 177                       Figure 1: CE1 multihomed to PE1 and PE2.
> 
> 179        In Figure 1, when PE2 is inserted in the Ethernet Segment or its
> 180        CE1-facing interface recovered, PE1 will transfer the DF role of some
> 181        VLANs to PE2 to achieve load balancing.  However, because there is no
> 182        handshake mechanism between PE1 and PE2, overlapping of DF roles for
> 183        a given VLAN is possible which leads to duplication of traffic as
> 184        well as Layer 2 loops.
> 
> 186        Current EVPN specifications [RFC7432] and [RFC8584] rely on a timer-
> 187        based approach for transferring the DF role to the newly inserted
> 188        device.  This can cause the following issues:
> 
> 190        *  Loops/Duplicates if the timer value is too short
> 191        *  Prolonged Traffic Blackholing if the timer value is too long
> 
> 193     1.4.  Design Principles for a Solution
> 
> 195        The clock-synchronization solution for fast DF recovery presented in
> 196        this document follows several design principles and presents
> 197        multiples advantages, namely:
> 
> 199        *  Complex handshake signaling mechanisms and state machines are
> 200           avoided in favor of a simple uni-directional signaling approach.
> 
> 202        *  The fast DF recovery solution maintains backwards-compatibility
> 203           (see Section 4) by ensuring that PEs any unrecognized new BGP
> 204           Extended Community.
> 
> 206        *  Existing DF Election algorithms remain supported.
> 
> 208        *  The fast DF recovery solution is independent of any BGP delays in
> 209           propagation of Ethernet Segment routes (Route Type 4)
> 
> minor:
> 
> This claim is unclear to me. There is an overall maximum for the propagation
> latency plus processing time of "just" a few seconds with the default SCT
> calculation, right ? And that is communicated "in conjunction with" the
> Ethernet Segment routes according to your below explanation. So there is
> a maximum propagation limit. And likely some serialization, timing
> dependencies.... ??!!
> 
> 211        *  The fast DF recovery solution is agnostic of the actual time
> 212           synchronization mechanism used, and normalizes to NTP for EVPN
> 213           signalling only.
> 
> XXX
> 
> 215     2.  DF Election Synchronization Solution
> 
> 217        The fast DF recovery solution relies on the concept of common clock
> 218        alignment between partner PEs participating in a common Ethernet
> 219        Segment i.e. PE1 and PE2 in Figure 1.  The main idea is to have all
> 220        peering PEs of that Ethernet Segment perform DF election, and apply
> 221        the result at the same pre-announced time.
> 
> 223        The DF Election procedure, as described in [RFC7432] and as
> 224        optionally signalled in [RFC8584], is applied.  All PEs attached to a
> 225        given Ethernet Segment are clock-synchronized using a networking
> 226        protocol for clock synchronization (e.g., NTP, PTP).  When a new PE
> 227        is inserted in an Ethernet Segment or a failed PE device of the
> 228        Ethernet Segment recovers, that PE communicates to peering partners
> 229        the current time plus the value of the timer for partner discovery
> 230        from step 2 in Section 8.5 of [RFC7432].  This constitutes an "end
> 231        time" or "absolute time" as seen from the local PE.  That absolute
> 232        time is called the "Service Carving Time" (SCT).
> 
> 234        A new BGP Extended Community, the Service Carving Timestamp is
> 235        advertised along with the Ethernet Segment route (RT-4) to
> 236        communicate the Service Carving Time to other partners.
> 
> 238        Upon receipt of the new BGP Extended Community, partner PEs can
> 239        determine the service carving time of the newly insterted PE.  To
> 240        eliminate any potential for duplicate traffic or loops, the concept
> 241        of skew is introduced: a small time offset to ensure a controlled and
> 242        orderly transition when multiple Provider Edge (PE) devices are
> 243        involved.  The receiving partner PEs add a skew (default = -10ms) to
> 244        the Service Carving Time to enforce this mechanism.  The previously
> 245        inserted PE(s) must perform service carving first, followed shortly
> 246        by the newly insterted PE, after the specified skew delay.
> 
> 248        To summarize, all peering PEs perform service carving almost
> 249        simultaneously at the time announced by the newly added/recovered PE.
> 250        The newly inserted PE initiates the SCT, and triggers service carving
> 251        immediately on its local timer expiry.  The previously inserted PE(s)
> 252        receiving Ethernet Segment route (RT-4) with a SCT BGP extended
> 253        community, perform service carving shortly before Service Carving
> 254        Time.
> 
> 256     2.1.  BGP Encoding
> 
> 258        A new BGP extended community is defined to communicate the Service
> 259        Carving Timestamp for each Ethernet Segment.
> 
> 261        A new transitive extended community where the Type field is 0x06, and
> 262        the Sub-Type is 0x0F is advertised along with the Ethernet Segment
> 263        route.  The expected Service Carving Time is encoded as an 8-octet
> 264        value as follows:
> 
> 266                             1                   2                   3
> 267         0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
> 268        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> 269        | Type = 0x06   | Sub-Type(0x0F)|      Timestamp Seconds        ~
> 270        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> 271        ~  Timestamp Seconds            | Timestamp Fractional Seconds  |
> 272        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> 
> 274                            Figure 2: Service Carving Time
> 
> 276        The timestamp exchanged uses the NTP prime epoch of January 1, 1900
> 277        [RFC5905] and the 64-bit NTP Timestamp Format.  The NTP Era value is
> 278        not exchanged and Era 0 is assumed as of the writing of this
> 279        document.  A DF Election operation occurring exactly at the Era
> 280        transition boundary some time in 2036 is outside of the scope of this
> 281        document.
> 
> major:
> 
> This description effectively only supports the protocol until the end of Era 0,
> because it not only fails to describe what to do during switchover to Era N+1,
> but also does not describe how to operate without encoding the Era. This makes
> the protocol useful (without another RFC) for less than 12 years. That is IMHO
> insufficient.
> 
> One simple solution would be to describe that the Era is not included in the
> encoding, but that a plausibility check is made on received timestamps. If it
> is completely out of range with the receiving router's current Era, but within
> range with Era-1 or Era+1, then the timestamp is accordingly adjusted to use that
> Era.
> 
> In another solution option, you can encode the Era by carving space from the SCT
> encoding as follows:
> 
> IMHO, it is unnecessary to encode the fractional seconds with 16 bits.
> The accuracy of the signalled timestamp does NOT impact the synchronized
> accuracy of the execution of DF switchover. It only impacts the granularity of
> timestamps that can be generated. If you signaled only the top 8 bits of
> the fractional seconds, then you could still trigger a synchronized switchover
> at intervals of 4 msec, which IMHO is more than necessary. And the switchover
> could still be synchronized to an arbitrarily better accuracy, such as 1 usec, if
> just the clock synchronization between the routers is that good. Practically
> speaking, NTP clock synchronization may often be just 1 msec accurate anyhow.
> 
> Even if you consider my thoughts from above concern G.4, and want to assign
> different timestamps for every Ethernet Segment (especially with large number
> of ethernet segments), then an interval of 4 msec would likely be more than
> sufficient granularity.
> 
> So with just an 8-bit fractional second encoding, you have 8 bits spare in the
> encoding that you can use for Era and other features (in the future).
> 
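The plausibility check suggested above could look roughly like the following Python sketch. It assumes the receiver keeps its own clock as unbounded seconds since the NTP prime epoch, and it picks whichever of Era-1, Era, or Era+1 places the received 32-bit seconds value closest to local time. The names `resolve_era` and `fraction_step_ms` are hypothetical and come from neither the draft nor RFC 5905. The second helper only checks the granularity arithmetic for 8 versus 16 fractional bits.

```python
ERA_SECONDS = 2 ** 32  # one NTP era spans 2^32 seconds


def resolve_era(rx_seconds: int, local_ntp_seconds: int) -> int:
    """Return era-qualified NTP seconds for a received 32-bit seconds value.

    local_ntp_seconds is the receiver's clock in seconds since the NTP
    prime epoch, NOT truncated to 32 bits (assumption of this sketch).
    """
    local_era = local_ntp_seconds // ERA_SECONDS
    # Candidate interpretations: previous, current, and next era.
    candidates = [
        (local_era + delta) * ERA_SECONDS + rx_seconds for delta in (-1, 0, 1)
    ]
    # Keep the interpretation closest to the receiver's current time.
    return min(candidates, key=lambda t: abs(t - local_ntp_seconds))


def fraction_step_ms(bits: int) -> float:
    """Granularity, in milliseconds, of a fraction field with `bits` bits."""
    return 1000.0 / (1 << bits)
```

With 8 fractional bits the step is 1/256 s, roughly 3.9 msec, which is the "4 msec" figure used above; with 16 bits it is roughly 15 usec.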
> 282        The 64-bit NTP Timestamp Format consists of a 32-bit part for Seconds
> 283        and a 32-bit part for Fraction, which are encoded in the Service
> 284        Carving Time as follows:
> 
> 286        *  Timestamp Seconds: 32-bit NTP seconds are encoded in this field.
> 
> 288        *  Timestamp Fractional Seconds: the high order 16 bits of the NTP
> 289           'Fraction' field are encoded in this field.
> 
> 291        When rebuilding a 64-bit NTP Timestamp Format using the values from a
> 292        received SCT BGP extended community, the lower order 16 bits of the
> 293        Fractional field are set to 0.  The use of a 16-bit fractional
> 294        seconds yields adequate precision of 15 microseconds (2^-16 s).
> 
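To make the field layout concrete, here is a minimal Python sketch of packing and unpacking the 8-octet value as the quoted text describes it: Type 0x06, Sub-Type 0x0F, 32-bit seconds, and the high-order 16 bits of the NTP fraction. The function names are hypothetical, and the sketch ignores everything else a BGP implementation would need.

```python
import struct
from typing import Tuple

SCT_TYPE = 0x06
SCT_SUB_TYPE = 0x0F


def encode_sct(ntp_seconds: int, ntp_fraction: int) -> bytes:
    """Pack an SCT extended community from a 64-bit NTP timestamp's parts."""
    frac16 = (ntp_fraction >> 16) & 0xFFFF  # keep only the high-order 16 bits
    return struct.pack(
        "!BBIH", SCT_TYPE, SCT_SUB_TYPE, ntp_seconds & 0xFFFFFFFF, frac16
    )


def decode_sct(community: bytes) -> Tuple[int, int]:
    """Return (seconds, 32-bit fraction) with the low 16 fraction bits zeroed."""
    type_, sub_type, seconds, frac16 = struct.unpack("!BBIH", community)
    if (type_, sub_type) != (SCT_TYPE, SCT_SUB_TYPE):
        raise ValueError("not a Service Carving Timestamp extended community")
    return seconds, frac16 << 16
```

A round trip loses only the low-order 16 fraction bits, i.e. less than 2^-16 s.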
> 296        This document introduces a new flag called "T" (for Time
> 297        Synchronization) to the bitmap field of the DF Election Extended
> 298        Community defined in [RFC8584].
> 
> 300                             1                   2                   3
> 301         0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
> 302        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> 303        | Type = 0x06   | Sub-Type(0x06)| RSV |  DF Alg | |A| |T|       ~
> 304        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> 305        ~     Bitmap    |            Reserved = 0                       |
> 306        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> 
> 308                       Figure 3: DF Election Extended Community
> 
> 310        *  Bit 3: Time Synchronization (corresponds to Bit 27 of the DF
> 311           Election Extended Community).  When set to 1, it indicates the
> 312           desire to use Time Synchronization capability with the rest of the
> 313           PEs in the Ethernet Segment.
> 
> nit:
> 
> "Bit 3" is a confusing definition because the "DF Election Extended Community"
> field is only mentioned in the prior paragraph and not shown with this name
> in the picture.
> 
> I would suggest to replace picture 3 with Figure 4 from rfc8584 - which does
> show "Bitmap", and then follow it with Figure 5 from rfc8584 with "T" added,
> and then follow with the "Bit 3" bullet point.
> 
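The relation between "Bit 3" and "Bit 27" can be checked with a few lines of Python. This sketch assumes the layout i read from RFC 8584: an 8-bit Type, 8-bit Sub-Type, 3-bit RSV, 5-bit DF Alg, then a 2-octet Bitmap whose bit 0 is the most significant bit. The constant and function names are hypothetical.

```python
BITMAP_OFFSET = 8 + 8 + 3 + 5  # Type + Sub-Type + RSV + DF Alg = 24 bits
BITMAP_WIDTH = 16              # Bitmap is assumed to be 2 octets wide

A_BIT = 1  # AC-DF capability, as drawn in the quoted figure
T_BIT = 3  # Time Synchronization, as requested by the draft


def community_bit(bitmap_bit: int) -> int:
    """Bit position counted from the start of the extended community."""
    return BITMAP_OFFSET + bitmap_bit


def bitmap_mask(bitmap_bit: int) -> int:
    """Mask for a bitmap bit when the Bitmap is read as a 16-bit integer."""
    return 1 << (BITMAP_WIDTH - 1 - bitmap_bit)


def wants_time_sync(bitmap: int) -> bool:
    """True when the T bit is set in a received 16-bit Bitmap value."""
    return bool(bitmap & bitmap_mask(T_BIT))
```

Under these assumptions the T bit is community bit 27 and mask 0x1000 within the Bitmap.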
> 315        This capability is utilized in conjunction with the agreed-upon DF
> 316        Election Type.  For instance, if all the PE devices in the Ethernet
> 317        Segment indicate possessing Time Synchronization capability and
>                             ^^^^^^^^^^
> 
> nit:
> 
> "the desire to use the" (to be consistent with the definition of T in line 312).
> 
> 318        request the DF Election Type to be Highest Random Weight (HRW), then
> 319        the HRW algorithm is edused in conjunction with this capability.  A
>                                 ^^^^^^
> 
> nit: deduced ?
> 
> 320        PE which does not support the procedures set out in this document, or
> 321        receives a route from another PE in which th capability is not set
>                                                        ^
> nit: "e"
> 
> 322        MUST NOT delay Designated Forwarder election as this could lead to
> 323        duplicate traffic in some instances (overlapping Designated
> 324        Forwarders).
> 
> 326     2.2.  Updates to RFC8584
> 
> 328        This document introduces an additional delay to the events and
> 329        transitions defined for the default DF election algorithm FSM in
> 330        Section 2.1 of [RFC8584] without changing the FSM state or event
> 331        definitions themselves.
> 
> 333        Upon receiving a RECV_ES message, the peering PE's Finite State
> 
> nit:
> 
> RFC8584 uses the term "RCVD_ES" for an event, and does not use the term
> "RECV_ES" for a message. Unless there is good reason to introduce new
> (inconsistent/duplicate) terminology, pls. change to terminology RCVD_ES event.
> Also further below (line 350).
> 
> 334        Machine (FSM) transitions from the DF_DONE (indicating the DF
> 335        election process was complete) state to the DF_CALC (indicating that
> 336        a new DF calculation is needed) state . Due to the Service Carving
> 337        Time (SCT) included in the Ethernet-Segment update, the completion of
> 338        the DF_CALC state and the subsequent transition back to the DF_DONE
> 339        state are delayed.  This delay ensures proper synchronization and
> 340        prevents conflicts.  Consequently, the accompanying forwarding
> 341        updates to the Designated Forwarder (DF) and Non-Designated Forwarder
> 342        (NDF) states are also deferred.
> 
> 344        The corresponding actions when transitions are performed or states
> 345        are entered/exited are modified as follows:
> 
> nit:
> 
> Suggest to rewrite to the following, to be more precise:
> 
> Item 9. in RFC8584, Section 2.1, List "Corresponding actions when transitions
> are performed or states are entered/exited" is changed as follows:
> 
> 347        9.  DF_CALC on CALCULATED: Mark the election result for the VLAN or
> 348            VLAN Bundle.
> 
> 350            9.1  If an SCT timestamp is present during the RECV_ES event of
> 351                 Action 11, wait until the time indicated by the SCT before
> 352                 proceeding to step 9.2.
> 
> 354            9.2  Assume the role of DF or NDF for the local PE concerning the
> 355                 VLAN or VLAN Bundle, and transition to the DF_DONE state.
> 
> 357        This revised approach ensures proper timing and synchronization in
> 358        the DF election process, avoiding conflicts and ensuring accurate
> 359        forwarding updates
> 
> minor:
> 
> a) Given how this is the normative text, i am worried that the "skew" variable
> is not mentioned. Please insert accordingly.
> 
> b) 9.1 does not seem to cover the SCT delay that needs to be performed (equally,
> except for skew) by the newly inserted PE. 9.1 only mentions the condition of
> RECV_ES, which to me does not sound like the newly inserted PE.
> 
> minor:
> 
> I am somewhat irritated that neither RFC8584 nor this draft has any text in the
> state machinery section to indicate when/how ES routes are generated. This would
> help IMHO especially in this new draft, because it is the time when the
> timestamp is taken, SCT calculated and inserted into the ES route, and i guess
> that this also starts the process leading to the CALCULATED event on the newly
> inserted router.
> 
> 361     3.  Synchronization Scenarios
> 
> 363        Consider Figure 1 as an example, where initially PE2 has failed and
> 364        PE1 has taken over.  This scenario illustrates the problem with the
> 365        DF-Election mechanism described in Section 8.5 of [RFC7432],
> 366        specifically in the context of the timer value configured for all PEs
> 367        on the Ethernet Segment.
> 
> 369        Procedure based on Section 8.5 of [RFC7432] with the default 3 second
> 370        timer in step 2:
> 
> 372        1.  Initial state: PE1 is in a steady-state and PE2 is recovering
> 
> 374        2.  Recovery: PE2 recovers at an absolute time of t=99.
> 
> 376        3.  Advertisement: PE2 advertises RT-4, sent at t=100, to partner
> 377            PE1.
> 
> 379        4.  Timer Start: PE2 starts a 3 second timer to allow the reception
> 380            of RT-4 from other PE nodes.
> 
> 382        5.  Immediate carving: PE1 performs service carving immediately upon
> 383            RT-4 reception, i.e.  t=100 plus some BGP propagation delay.
> 
> 385        6.  Delayed Carving: PE2 performs service carving at time t=103
> 
> 387        [RFC7432] favors traffic drops over duplicate traffic.  With the
> 388        above procedure, traffic drops will occur as part of each PE recovery
> 389        sequence since PE1 transitions some VLANs to Non-Designated Forwarder
> 390        (NDF) immediately upon RT-4 reception.
> 391        The timer value (default = 3 seconds) directly affects the duration
> 392        of the packet drops.  A shorter (or zero) timer may result in
> 393        duplicate traffic or traffic loops.
> 
> 395        Procedure based on the Service Carving Time (SCT) approach:
> 
> 397        1.  Initial state: PE1 is in a steady state, and PE2 is recovering
> 
> 399        2.  Recovery: PE2 recovers at an absolute time of t=99.
> 
> 401        3.  Advertisement: PE2 advertises RT-4, sent at t=100, with a target
> 402            SCT value of t=103 to partner PE1.
> 
> 404        4.  Timer Start: PE2 starts a 3 second timer to allow the reception
> 405            of RT-4 from other PE nodes.
> 
> minor:
> 
> IMHO, this is not a 3 second timer, but a timer with a deadline of t=103. Which
> is only at most 3 seconds, depending on whether step 4. happens exactly at
> t=100 or somewhat later. Practically, it would always be later. IMHO, it would
> be good to emphasize this crucial benefit of the new mechanism. Maybe need
> to insert some addtl. processing delay into the section 8.5 example vs. this
> example to show this difference (delay between steps 3 and 4).
> 
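The difference between an interval timer and a deadline timer can be shown with a small Python sketch. It assumes times are plain seconds on a shared clock, and that a PE whose deadline has already passed carves right away (the quoted Security Considerations describe a past carving time as "already occurred"). The function names and the 0.25 s processing delay are hypothetical.

```python
def carve_time_fixed(timer_start: float, interval: float = 3.0) -> float:
    """Section 8.5 style: carve `interval` seconds after the timer starts."""
    return timer_start + interval


def remaining_wait(sct: float, now: float) -> float:
    """SCT style: wait only for the time left until the announced deadline."""
    return max(0.0, sct - now)


def carve_time_deadline(sct: float, timer_start: float) -> float:
    """SCT style: carve at the deadline, or at once if it already passed."""
    return timer_start + remaining_wait(sct, timer_start)
```

If PE2 sends RT-4 at t=100 with SCT t=103 but only starts its timer at t=100.25, the interval timer fires at t=103.25 while the deadline timer still fires at t=103, so the processing delay between steps 3 and 4 no longer shifts the carving time.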
> 407        5.  Service Carving Timer: PE1 starts the service carving timer, with
> 408            the remaining time until t=103
> 
> 410        6.  Simultaneous Carving: Both PE1 and PE2 carve at an absolute time
> 411            of t=103
> 
> 413        To maintain the preference for minimal loss over duplicate traffic,
> 414        PE1 should carve slightly before PE2 (with skew).  The recovering PE2
> 415        performs both DF to NDF and NDF to DF transitions per VLAN at the
> 416        timer's expiry.  The original PE1, which received the SCT, applies
> 417        the following:
> 
> 419        *  DF to NDF Transition(s): at t=SCT minus skew, where both PEs are
> 420           NDF for the skew duration.
> 
> 422        *  NDF to DF Transition(s): at t=SCT
> 
> minor:
> 
> In line 238, the draft says "Upon receipt of the new BGP Extended Community" ...
> skew is being applied. Above text (line 419) instead defines application of
> skew upon determination of the state transition. It may be that in all cases
> where the BGP Extended Community is received, there is always only at most a DF
> to NDF transition (but no NDF to DF transition, staying at NDF), but it still
> is not ideal to have two inconsistent definitions of when skew is being applied.
> 
> Technically i think the DF to NDF transition case is more sound than the
> "receipt of the BGP extended community", aka: fix text around line 238 ?!
> 
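A per-transition definition of skew, as argued for above, might be sketched in Python like this. It assumes integer milliseconds on a shared clock and treats skew as a positive 10 msec magnitude that is subtracted (the draft writes "default = -10ms" in one place and "SCT minus skew" in another, so the sign convention here is an assumption). `Transition` and `apply_time_ms` are hypothetical names.

```python
from enum import Enum


class Transition(Enum):
    DF_TO_NDF = "df_to_ndf"
    NDF_TO_DF = "ndf_to_df"


DEFAULT_SKEW_MS = 10  # assumed magnitude of the default skew


def apply_time_ms(
    sct_ms: int,
    transition: Transition,
    is_sct_originator: bool,
    skew_ms: int = DEFAULT_SKEW_MS,
) -> int:
    """Time at which a PE applies one per-VLAN DF state transition."""
    # Only a PE that received the SCT gives up the DF role early; the
    # recovering PE, and any NDF to DF transition, use the SCT itself.
    if not is_sct_originator and transition is Transition.DF_TO_NDF:
        return sct_ms - skew_ms
    return sct_ms
```

With SCT at t=103 s, the receiving PE goes DF to NDF at 102.990 s and everything else happens at 103.000 s, so both PEs are NDF for the 10 msec skew window.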
> 424        This split-behavior ensures a smooth DF role transition with minimal
> 425        loss.
> 
> 427        Using the SCT approach, the negative effect of the timer to allow the
> 428        reception of RT-4 from other PE nodes is mitigated.  Furthermore, the
> 429        BGP Ethernet Segment route (RT-4) transmission delay (from PE2 to
> 430        PE1) becomes a non-issue.  The SCT approach shortens the 3-second
> 431        timer window to the order of milliseconds.
> 
> 433     3.1.  Concurrent Recoveries
> 
> 435        In the eventuality 2 or more PEs in a peering Ethernet Segment group
> 436        are recovering concurrently or roughly the same time, each will
> 437        advertise a Service Carving Timestamp.  This SCT value would
> 438        correspond to what each recovering PE considers the "end time" for DF
> 439        Election.  A similar situation arises in sequentially recovering PEs,
> 440        when a second PE recovers approximately at the time of the first PE's
> 441        advertised SCT expiry, and with its own new SCT-2 outside of the
> 442        initial SCT window.
> 
> 444        In the case of multiple concurrent DF elections, each initiated by
> 445        one of the recovering PEs, the SCTs must be ordered chronologically.
> 446        All PEs shall execute only a single DF Election at the service
> 447        carving time corresponding to the largest (latest) received timestamp
> 448        value.  This DF Election will involve all active PEs in a unified DF
> 449        Election update.
> 
> nit:
> 
> I think the wording 444-449 is misleading/incomplete. The latest SCT timestamp
> is not the top criterion, but if i understand the intent correctly, each "later"
> PEi also needs to be considered to be a better (best) DF than the prior PE,
> right ? Aka: In your below example (line 451ff),
> 
>     PE1 is DF
>     When PE1 receives RT-4 from PE2, PE1 will redo DF calculation and
>     consider PE2 to be the DF winner
>     When PE1 later receives RT-4 from PE3, PE1 will redo DF calculation
>     and now consider PE3 to be the DF winner. And only because PE3 is the
>     DF winner, will PE1 now also cancel the SCT for PE2.
> 
> If on the other hand, the DF HRW for PE3 would be lower than that of PE2,
> then PE1 would of course redo the DF election but given how PE3 does not
> show the result, this AFAIK should also mean that the SCT from PE3 should have
> no impact.
> 
> Yes/No ?
> 
> In any case it would be useful to improve the description to make this clearer.
> Especially if/when i misunderstood it.
> 
> 451        Example:
> 
> 453        1.  Initial State: PE1 is in a steady state, with services elected at
> 454            PE1.
> 
> 456        2.  Recovery of PE2: PE2 recovers at time t=100 and advertises RT-4
> 457            with a target SCT value of t=103 to its partners (PE1)
> 
> 459        3.  Timer Initiation by PE2: PE2 starts a 3 second timer to allow the
> 460            reception of RT-4 from other PE nodes.
> 
> 462        4.  Timer Initiation by PE1: PE1 starts the service carving timer,
> 463            with the remaining time until t=103.
> 
> 465        5.  Recovery of PE3: PE3 recovers at time t=102 and advertises RT-4
> 466            with a target SCT value of t=105 to its partners (PE1, PE2).
> 
> 468        6.  Timer Initiation by PE3: PE3 starts a 3 second timer to allow the
> 469            reception of RT-4 from other PE nodes
> 
> 471        7.  Timer Update by PE2: PE2 cancels the running timer and starts the
> 472            service carving timer with the remaining time until t=105.
> 
> 474        8.  Timer Update by PE1: PE1 updates its service carving timer, with
> 475            the remaining time until t=105.
> 
> 477        9.  Service Carving: PE1, PE2, and PE3 perform service carving at the
> 478            absolute time of t=105.
> 
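The timer updates in the example above follow the "largest (latest) received timestamp" rule from lines 444-449 as written. A minimal Python sketch of just that rule is below; it does not settle the question raised in the nit about whether an SCT from a PE that does not become DF should count. `CarvingTimer` is a hypothetical name, and times are plain seconds on a shared clock.

```python
from typing import Optional


class CarvingTimer:
    """Tracks the single pending service carving deadline on one PE."""

    def __init__(self, own_sct: Optional[float] = None) -> None:
        # A recovering PE starts with its own advertised SCT; a PE in
        # steady state starts with no pending deadline.
        self.deadline = own_sct

    def on_sct(self, sct: float) -> float:
        """Process a received SCT and return the deadline now in force."""
        if self.deadline is None or sct > self.deadline:
            self.deadline = sct  # later SCT replaces the running timer
        return self.deadline


# Replay of the quoted example: PE2 announces t=103, PE3 announces t=105.
pe1 = CarvingTimer()
pe2 = CarvingTimer(own_sct=103.0)
pe3 = CarvingTimer(own_sct=105.0)
pe1.on_sct(103.0)
pe1.on_sct(105.0)
pe2.on_sct(105.0)
pe3.on_sct(103.0)  # an earlier SCT does not pull the deadline back
```

All three PEs end up with the same deadline of t=105, matching step 9 of the example.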
> 480        In the eventuality a PE in a Ethernet Segment group recovers during
> 481        the discovery window specified in Section 8.5 of [RFC7432], and does
> 482        not support or advertise the T-bit, then all PEs in the current
> 483        peering sequence SHALL immediately revert to the default [RFC7432]
> 484        behavior.
> 
> 486     4.  Backwards Compatibility
> 
> 488        For the DF election procedures to achieve global convergence and
> 489        unanimity within a redundancy group, it is essential that all
> 490        participating PEs agree on the DF election algorithm to be employed.
> 491        However, it is possible that some PEs may continue to use the
> 492        existing modulo-based DF election algorithm from [RFC7432] and not
> 493        utilize the new Service Carving Time (SCT) BGP extended community.
> 494        PEs that operate using the baseline DF election mechanism will simply
> 495        discard the new SCT BGP extended community as unrecognized.
> 496        [RFC7432] and do not rely on the new SCT BGP extended community.
> 
> 498        A PE can indicate its willingness to support clock-synchronized
> 499        carving by signaling the new 'T' DF Election Capability and including
> 500        the new SCT BGP extended community along with the Ethernet Segment
> 501        Route (Type-4).  If one or more PEs attached to the Ethernet Segment
> 502        do not signal T=1, then all PEs in the Ethernet Segment SHALL revert
> 503        to the timer-based approach as specified in [RFC7432].  This
> 504        reversion is particularly crucial in preventing VLAN shuffling when
> 505        more than two PEs are involved.
> 
> 507     5.  Security Considerations
> 
> 509        The mechanisms in this document use EVPN control plane as defined in
> 510        [RFC7432].  Security considerations described in [RFC7432] are
> 511        equally applicable.
> 
> 513        For the new SCT Extended Community, attack vectors may be setting the
> 514        value to zero, to a value in the past or to large times in the
> 515        future.  The procedures in this document address implicitly what
> 516        occurs with a carving time in the past, as this would be a naturally
> 517        occurring event with a large BGP propagation delay: the receiving PE
> 518        SHALL treat the DF Election at the peer as having occurred already,
> 519        and proceed without starting any timer to futher delay service
> 520        carving.  For timestamp values in the future, a rogue PE may be
> 521        advertising a value inconsistent with its local behavior.  This is no
> 522        different than a rogue PE setting all its DF Election results
> 523        inconstently to its peers using (or ignoring adherence to) the
> 524        procedures from [RFC7432], and the result would similarly be
> 525        duplicate or dropped traffic.  It is left to implementations to
> 526        decide what consists an "unreasonably large" SCT value.
> 
> 528        This document uses MPLS and IP-based tunnel technologies to support
> 529        data plane transport.  Security considerations described in [RFC7432]
> 530        and in [RFC8365] are equally applicable.
> 
> 532     6.  IANA Considerations
> 
> 534        IANA maintains the "EVPN Extended Community Sub-Types" registry set
> 535        up by [RFC7153].  IANA is requested to confirm the First Come First
> 536        Served assignment as follows:
> 
> 538           Sub-Type Value   Name                        Reference       Date
> 539           --------------   -------------------------   -------------   ----
> 540                 0x0F       Service Carving Timestamp   This document   TBD
> 
> 542        IANA should replace the field TBD with the date of publicaton of this
> 543        document as an RFC.
> 
> 545        IANA maintains the "DF Election Capabilities" registry set up by
> 546        [RFC8584].  IANA is requested to make the following assignment from
> 547        this registry:
> 
> 549            Bit         Name                         Reference        Date
> 550            ----        ----------------             -------------    ----
> 551            3           Time Synchronization         This document    TBD
> 
> 553        IANA should replace the field TBD with the date of publicaton of this
> 554        document as an RFC.
> 
> 556     7.  Normative References
> 
> 558        [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
> 559                   Requirement Levels", BCP 14, RFC 2119,
> 560                   DOI 10.17487/RFC2119, March 1997,
> 561                   <https://www.rfc-editor.org/info/rfc2119>.
> 
> 563        [RFC5905]  Mills, D., Martin, J., Ed., Burbank, J., and W. Kasch,
> 564                   "Network Time Protocol Version 4: Protocol and Algorithms
> 565                   Specification", RFC 5905, DOI 10.17487/RFC5905, June 2010,
> 566                   <https://www.rfc-editor.org/info/rfc5905>.
> 
> 568        [RFC7153]  Rosen, E. and Y. Rekhter, "IANA Registries for BGP
> 569                   Extended Communities", RFC 7153, DOI 10.17487/RFC7153,
> 570                   March 2014, <https://www.rfc-editor.org/info/rfc7153>.
> 
> 572        [RFC7432]  Sajassi, A., Ed., Aggarwal, R., Bitar, N., Isaac, A.,
> 573                   Uttaro, J., Drake, J., and W. Henderickx, "BGP MPLS-Based
> 574                   Ethernet VPN", RFC 7432, DOI 10.17487/RFC7432, February
> 575                   2015, <https://www.rfc-editor.org/info/rfc7432>.
> 
> 577        [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
> 578                   2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
> 579                   May 2017, <https://www.rfc-editor.org/info/rfc8174>.
> 
> 581        [RFC8365]  Sajassi, A., Ed., Drake, J., Ed., Bitar, N., Shekhar, R.,
> 582                   Uttaro, J., and W. Henderickx, "A Network Virtualization
> 583                   Overlay Solution Using Ethernet VPN (EVPN)", RFC 8365,
> 584                   DOI 10.17487/RFC8365, March 2018,
> 585                   <https://www.rfc-editor.org/info/rfc8365>.
> 
> 587        [RFC8584]  Rabadan, J., Ed., Mohanty, S., Ed., Sajassi, A., Drake,
> 588                   J., Nagaraj, K., and S. Sathappan, "Framework for Ethernet
> 589                   VPN Designated Forwarder Election Extensibility",
> 590                   RFC 8584, DOI 10.17487/RFC8584, April 2019,
> 591                   <https://www.rfc-editor.org/info/rfc8584>.
> 
> 593     Appendix A.  Contributors
> 
> 595        In addition to the authors listed on the front page, the following
> 596        co-authors have also contributed substantially to this document:
> 
> 598        Gaurav Badoni
> 599        Cisco
> 
> 601        Email: gbadoni@cisco.com
> 
> 603        Dhananjaya Rao
> 604        Cisco
> 
> 606        Email: dhrao@cisco.com
> 
> 608     Appendix B.  Acknowledgements
> 
> 610        Authors would like to acknowledge helpful comments and contributions
> 611        of Satya Mohanty and Bharath Vasudevan.  Also thank you to Anoop
> 612        Ghanwani and Gunter van de Velde for their thorough review with
> 613        valuable comments and corrections.
> 
> 615     Authors' Addresses
> 
> 617        Patrice Brissette (editor)
> 618        Cisco
> 619        Email: pbrisset@cisco.com
> 
> 621        Ali Sajassi
> 622        Cisco
> 623        Email: sajassi@cisco.com
> 
> 625        Luc Andre Burdet
> 626        Cisco
> 627        Email: lburdet@cisco.com
> 
> 629        John Drake
> 630        Independent
> 631        Email: je_drake@yahoo.com
> 
> 633        Jorge Rabadan
> 634        Nokia
> 635        Email: jorge.rabadan@nokia.com
> 
> EOF

-- 
---
tte@cs.fau.de