<?xml version="1.0" encoding="US-ASCII"?>

<!DOCTYPE rfc [
  <!ENTITY nbsp    "&#160;">
  <!ENTITY zwsp   "&#8203;">
  <!ENTITY nbhy   "&#8209;">
  <!ENTITY wj     "&#8288;">
]>

<!-- used by XSLT processors -->
<?xml-stylesheet type='text/xsl' href='http://xml.resource.org/authoring/rfc2629.xslt'?>
<!-- For a complete list and description of processing instructions (PIs), 
    please see http://xml.resource.org/authoring/README.html. -->

<?rfc strict="yes" ?>
<!-- give errors regarding ID-nits and DTD validation -->
<!-- control the table of contents (ToC) -->
<?rfc toc="yes"?>
<!-- generate a ToC -->
<?rfc tocdepth="4"?>
<!-- the number of levels of subsections in ToC. default: 3 -->
<!-- control references -->
<?rfc symrefs="yes"?>
<!-- use symbolic references tags, i.e, [RFC2119] instead of [1] -->
<?rfc sortrefs="yes" ?>
<!-- sort the reference entries alphabetically -->
<!-- control vertical white space 
    (using these PIs as follows is recommended by the RFC Editor) -->
<?rfc compact="yes" ?>
<!-- do not start each main section on a new page -->
<?rfc subcompact="no" ?>
<!-- keep one blank line between list items -->
<!-- end of list of popular I-D processing instructions -->
<rfc category="std"
     xmlns:xi="http://www.w3.org/2001/XInclude"
     docName="draft-ietf-bess-evpn-fast-df-recovery-10"
     updates="8584"
     consensus="true"
     submissionType="IETF"
     ipr="trust200902">

 <!-- ***** FRONT MATTER ***** -->

 <front>
   <!-- The abbreviated title is used in the page header - it is only necessary if the 
        full title is longer than 39 characters -->
   <title abbrev="Fast Recovery for EVPN DF-Election">Fast Recovery for EVPN Designated Forwarder Election</title>

   <!-- add 'role="editor"' below for the editors if appropriate -->

   <!-- Another author who claims to be an editor -->
  <author fullname="Patrice Brissette" initials="P." surname="Brissette" role="editor">
     <organization>Cisco</organization>
     <address>
       <email>pbrisset@cisco.com</email>
     </address>
   </author>

   <author fullname="Ali Sajassi" initials="A." surname="Sajassi">
     <organization>Cisco</organization>
     <address>
       <email>sajassi@cisco.com</email>
     </address>
   </author>

  <author fullname="Luc Andre Burdet" initials="LA." surname="Burdet">
     <organization>Cisco</organization>
     <address>
       <email>lburdet@cisco.com</email>
     </address>
   </author>

  <author fullname="John Drake" initials="J." surname="Drake">
     <organization>Independent</organization>
     <address>
       <email>je_drake@yahoo.com</email>
     </address>
   </author>

  <author fullname="Jorge Rabadan" initials="J." surname="Rabadan">
     <organization>Nokia</organization>
     <address>
       <email>jorge.rabadan@nokia.com</email>
     </address>
   </author>

   <date year="2024" />

   <!-- Meta-data Declarations -->
   <area>General</area>
   <workgroup>BESS Working Group</workgroup>

   <!-- WG name at the upperleft corner of the doc,
        IETF is fine for individual submissions. 
        If this element is not present, the default is "Network Working Group",
        which is used by the RFC Editor as a nod to the history of the IETF. -->

   <keyword>EVPN</keyword>
   <keyword>Designated Forwarder</keyword>
   <keyword>Convergence</keyword>
   <keyword>Recovery</keyword>

   <abstract>
     <t>The Ethernet Virtual Private Network (EVPN) solution in RFC 7432 provides
     Designated Forwarder (DF) election procedures for multihomed Ethernet Segments. These
     procedures have been enhanced further by applying the Highest
     Random Weight (HRW) algorithm for Designated Forwarder election
     to avoid unnecessary DF status changes upon a failure.
     This document improves these procedures by providing a fast Designated Forwarder 
     election upon recovery of the failed link or node associated
     with the multihomed Ethernet Segment.
     This document updates RFC 8584 by optionally introducing delays between
     some of the events therein.</t>
     <t>The solution is independent of the number of EVPN Instances (EVIs) associated with that Ethernet
     Segment, and it is performed via simple signaling in BGP between the
     recovered node and each of the other nodes in the multihoming group.</t>
   </abstract>
  
 </front>

 <middle>
   <section anchor="intro" title="Introduction">
     <t>The Ethernet Virtual Private Network (EVPN) solution <xref target="RFC7432"/> is
     widely used in data center (DC) applications for Network
     Virtualization Overlay (NVO) and DC interconnect (DCI) services, and
     in service provider (SP) applications for next generation virtual
     private LAN services.</t>
     
     <t><xref target="RFC7432"/> describes Designated Forwarder (DF) election procedures for
     multihomed Ethernet Segments. These procedures are enhanced further in
     <xref target="RFC8584"/> by applying the Highest Random Weight algorithm for DF
     election in order to avoid unnecessary DF status changes upon a link
     or node failure associated with the multihomed Ethernet Segment.</t>
     <t>This document makes further improvements to the DF election procedures in
     <xref target="RFC8584"/> by providing an option for a fast DF election upon
     recovery of the failed link or node associated with the multihomed
     Ethernet Segment. This DF election is achieved independent of the number
     of EVPN Instances (EVIs) associated with that Ethernet Segment and it is performed via
     straightforward signaling in BGP between the recovered node and each of the other nodes
     in the multihomed Ethernet Segment redundancy group.<br/>
     This document updates the DF Election Finite State Machine (FSM) described in <relref target="RFC8584" section="2.1"/>,
     by optionally introducing delays between some events, as further detailed in <xref target="fsm_8584"/>.
     The solution is based on a simple one-way signaling mechanism.</t>

   <section title="Requirements Language">
    <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   BCP 14 <xref target="RFC2119"/> <xref target="RFC8174"/> when, and only when, they appear in all
   capitals, as shown here.</t>
    </section>
     
     <section anchor="terminology" title="Terminology">
        <t>
         <dl>
           <dt>PE:</dt><dd>Provider Edge device.</dd>
           <dt>Designated Forwarder (DF):</dt><dd>A PE that is currently forwarding
           (encapsulating/decapsulating) traffic for a given VLAN in and out of
           a site.</dd>
           <dt>NDF:</dt><dd>Non-Designated Forwarder, a PE that is currently blocking traffic (see
           DF above).</dd>
           <dt>EVI:</dt><dd>An EVPN instance spanning the Provider Edge (PE) devices
      participating in that EVPN.</dd>
           <dt>HRW:</dt><dd>Highest Random Weight algorithm, <xref target="HRW98"/> </dd>
           <dt>Service carving:</dt><dd>DF Election is also referred to as "service carving" in <xref
           target="RFC7432"/></dd>
           <dt>SCT:</dt><dd>Service Carving Time, defined in this document as the time at
           which all nodes participating in an Ethernet Segment perform DF Election.</dd>
         </dl>
	</t>
     </section>

   <section anchor="challenges" title="Challenges with Existing Mechanism">
        <t>In EVPN technology, multiple Provider Edge (PE) devices have the ability to encapsulate
        and decapsulate data belonging to the same VLAN. Under certain conditions, this
        may cause duplicated Ethernet packets and potential loops if there is a momentary
        overlap in forwarding roles between two or more PE devices, potentially also leading
        to broadcast storms of frames forwarded back into the VLAN.</t>

        <t>EVPN <xref target="RFC7432"/> currently specifies timer-based synchronization among PE
        devices within an Ethernet Segment redundancy group. This approach can lead to duplicates and potential
        loops due to multiple Designated Forwarders (DFs) if the timer interval is too short,
        or to packet drops if the timer interval is too long.</t>

        <t>Split-horizon filtering, as described in <relref target="RFC7432" section="8.3"/>,
        can prevent loops but does not address duplicates.
        However, if there are overlapping Designated Forwarders of two
        different sites simultaneously for the same VLAN, the site identifier will differ when the
        packet re-enters the Ethernet Segment. Consequently, the split-horizon check will fail,
        resulting in layer-2 loops.</t>

        <t>The updated DF procedures outlined in <xref target="RFC8584"/>
        use the well-known
        Highest Random Weight&nbsp;(HRW) algorithm to prevent the reshuffling of VLANs among
        PE devices within the Ethernet Segment redundancy group during failure or recovery events. This
        approach minimizes the impact on VLANs not assigned to the failed or recovered ports
        and eliminates the occurrence of loops or duplicates during such events.</t>

        <t>However, upon PE insertion or a port being newly added to a multihomed Ethernet Segment,
        HRW cannot help either, as a transfer of the DF role to the new port must occur
        while the old DF is still active.</t>

        <figure anchor="topology" title="CE1 multihomed to PE1 and PE2.">
         <artwork><![CDATA[
                                  +---------+
               +-------------+    |         |
               |             |    |         |
             / |    PE1      |----|         |   +-------------+
            /  |             |    |  MPLS/  |   |             |---CE3
           /   +-------------+    |  VxLAN/ |   |     PE3     |
      CE1 -                       |  Cloud  |   |             |
           \   +-------------+    |         |---|             |
            \  |             |    |         |   +-------------+
             \ |     PE2     |----|         |
               |             |    |         |
               +-------------+    |         |
                                  +---------+
    ]]>
	</artwork></figure>

        <t>In <xref target="topology"/>, when PE2 is inserted in the Ethernet Segment or its
        CE1-facing interface recovers, PE1 will transfer
        the DF role of some VLANs to PE2 to achieve load balancing. However,
        because there is no handshake mechanism between PE1 and PE2,
        overlapping of DF roles for a given VLAN is possible, which leads to duplication of
        traffic as well as layer-2 loops.</t>

        <t>Current EVPN specifications <xref target="RFC7432"/> and <xref target="RFC8584"/>
        rely on a timer-based approach for transferring the DF role to the newly inserted device.
        This can cause the following issues:

        <ul>
            <li>Loops/Duplicates if the timer value is too short</li>
            <li>Prolonged Traffic Blackholing if the timer value is too long</li>
        </ul>
        </t>
   </section>

         
      <section anchor="advantages" title="Design Principles for a Solution">

        <t>The clock-synchronization solution for fast DF recovery presented in this document
        follows several design principles and offers
        multiple advantages, namely:
        <ul>
         <li>Complex handshake signaling mechanisms and state machines are
            avoided in favor of a simple uni-directional signaling approach.</li>
          <li>The fast DF recovery solution maintains backwards-compatibility (see <xref
          target="ntpcompat"/>) by ensuring that PEs reject any unrecognized new BGP EVPN Extended Community.</li>
          <li>Existing DF Election algorithms remain supported.</li>
          <li>The fast DF recovery solution is independent of any BGP delays in propagation of Ethernet Segment
          routes (Route Type 4)</li>
          <li>The fast DF recovery solution is agnostic of the actual time synchronization mechanism
          used; however, an NTP-based representation of time is used for EVPN signaling.</li>
        </ul>
        </t>

      </section>

   </section>


   <section anchor="sync" title="DF Election Synchronization Solution">

      <t>The fast DF recovery solution relies on the concept of common clock alignment between partner PEs participating
      in a common Ethernet Segment, i.e., PE1 and PE2 in <xref target="topology"/>. The main idea is to have all peering PEs of that
      Ethernet Segment perform DF election, and apply the result at the same pre-announced time. </t>
      
      <t>The DF Election procedure, as described in <xref target="RFC7432"/> and as optionally
      signaled in <xref target="RFC8584"/>, is applied.
      All PEs attached to a given Ethernet Segment are clock-synchronized
      using a networking protocol for clock synchronization (e.g., NTP, PTP).
      Whenever possible, recovery activities for failed PEs SHOULD NOT be initiated until after the clock
      synchronization operations have converged to benefit from this document's fast DF recovery
      procedures.
      When a new PE is inserted in an Ethernet Segment or a failed PE of the Ethernet
      Segment recovers, that PE communicates to peering partners the current time plus the value of
      the timer for partner discovery from step 2 in <relref target="RFC7432" section="8.5"/>.
      This constitutes an "end time" or "absolute time" as seen from the local PE.
      That absolute time is called the "Service Carving Time" (SCT).</t>
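      <t>The computation of the SCT on the recovering PE can be sketched as follows.
      This is a minimal, non-normative illustration in Python. The names service_carving_time,
      NTP_UNIX_OFFSET, and DEFAULT_PEERING_TIMER are assumptions of this sketch, and the local
      clock is assumed to be already synchronized (e.g., by NTP or PTP).</t>
      <figure><artwork><![CDATA[
```python
import time

# Seconds between the NTP prime epoch (1900-01-01) and the
# Unix epoch (1970-01-01).
NTP_UNIX_OFFSET = 2_208_988_800

# Default partner-discovery timer (RFC 7432, Section 8.5, step 2).
DEFAULT_PEERING_TIMER = 3.0


def service_carving_time(now_unix=None,
                         peering_timer=DEFAULT_PEERING_TIMER):
    """Return the SCT in seconds since the NTP prime epoch.

    now_unix is the local time in Unix seconds; when omitted, the
    system clock is read and assumed to be synchronized.
    """
    if now_unix is None:
        now_unix = time.time()
    # SCT = current time + peering timer, expressed on the NTP scale.
    return now_unix + NTP_UNIX_OFFSET + peering_timer
```
]]></artwork></figure>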

      <t>A new BGP EVPN Extended Community, the Service Carving Timestamp, is advertised along with
      the Ethernet Segment Route Type 4 (RT-4) and communicates the Service Carving Time to other
      partners to ensure an orderly transfer of forwarding duties.</t>

      <t>Upon receipt of the new BGP EVPN Extended Community, partner PEs can determine the service carving time
      of the newly inserted PE. To eliminate any potential for duplicate traffic or loops, the
      concept of skew is introduced: a small time offset to ensure a controlled and orderly
      transition when multiple Provider Edge (PE) devices are involved. The receiving partner PEs
      subtract a skew
      (default = 10ms) from the Service Carving Time to enforce this mechanism.
      The previously inserted PE(s) must perform service carving first for DF to NDF transitions, followed shortly by the NDF
      to DF transitions on both PEs, after the specified skew delay. On the recovering PE, all services are already in NDF state and no
      skew for DF to NDF transitions is required.</t>

      
      <t>To summarize, all peering PEs perform service carving almost simultaneously at the time
      announced by the newly added/recovered PE. The newly inserted PE initiates the SCT
      and triggers service carving immediately on its local timer expiry. The previously inserted PE(s) that receive an Ethernet Segment route (RT-4) with an SCT BGP extended community
      perform service carving shortly before the Service Carving Time for DF to NDF transitions, and at the
      Service Carving Time for NDF to DF transitions.</t>

      <section anchor="ntpencoding" title="BGP Encoding">
        <t>A new BGP extended community is defined to communicate the
        Service Carving Timestamp for each Ethernet Segment.</t>

        <t>A new transitive extended community, where the Type field is 0x06 and
        the Sub-Type is 0x0F, is advertised along with the Ethernet
        Segment route. The extended community is 8 octets long; the expected
        Service Carving Time occupies the 6 octets following the Type and
        Sub-Type fields, as follows:
	
        <figure title="Service Carving Time"><artwork><![CDATA[
                     1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Type = 0x06   | Sub-Type(0x0F)|      Timestamp Seconds        ~
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
~  Timestamp Seconds            | Timestamp Fractional Seconds  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
            ]]>
        </artwork></figure>
        </t>

        <t> 
        The timestamp exchanged uses the NTP prime epoch of January 1, 1900 <xref target="RFC5905"/>
        and an adapted form of the 64-bit NTP Timestamp Format. The NTP Era value is not exchanged and participating
        PEs may consider the timestamps to be in the same Era as their local value.
        A DF Election operation occurring exactly at the Era transition boundary some time in 2036 is outside of the scope of this document.<br/>
        The 64-bit NTP Timestamp Format consists of a 32-bit part for Seconds and a 32-bit
        part for Fraction, which are encoded in the Service Carving Time as follows:
        <ul>
        <li>Timestamp Seconds: 32-bit NTP seconds are encoded in this field.</li>
        <li>Timestamp Fractional Seconds: the high order 16 bits of the NTP 'Fraction' field are encoded in this
        field.</li>
        </ul>
        </t>

        <t>When rebuilding a 64-bit NTP Timestamp Format using the values from a received SCT BGP extended community, the lower order 16 bits of the
        Fractional field are set to 0. The use of a 16-bit fractional seconds value yields adequate precision of 15 microseconds
        (2^-16 s).</t>
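        <t>The encoding and the rebuilding of the timestamp can be sketched as follows.
        This is a non-normative Python illustration; the function names encode_sct and
        decode_sct are assumptions of this sketch. It packs the Type, the Sub-Type, the
        32-bit seconds, and the high-order 16 bits of the fraction in network byte order,
        and sets the low-order 16 bits of the fraction to 0 when decoding.</t>
        <figure><artwork><![CDATA[
```python
import struct

EVPN_TYPE = 0x06      # Type field of the extended community
SCT_SUB_TYPE = 0x0F   # Sub-Type: Service Carving Timestamp


def encode_sct(ntp_seconds, ntp_fraction):
    """Pack a 64-bit NTP timestamp into the 8-octet community.

    Only the high-order 16 bits of the 32-bit fraction are kept.
    """
    return struct.pack("!BBIH", EVPN_TYPE, SCT_SUB_TYPE,
                       ntp_seconds & 0xFFFFFFFF,
                       (ntp_fraction >> 16) & 0xFFFF)


def decode_sct(community):
    """Return (seconds, fraction) of the rebuilt NTP timestamp.

    The low-order 16 bits of the fraction are set to 0.
    """
    ext_type, sub_type, seconds, frac16 = struct.unpack(
        "!BBIH", community)
    if (ext_type, sub_type) != (EVPN_TYPE, SCT_SUB_TYPE):
        raise ValueError("not a Service Carving Timestamp")
    return seconds, frac16 << 16
```
]]></artwork></figure>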

        <t>This document introduces a new flag called Time
        Synchronization indicated by "T" in the DF Election Capabilities registry defined in <xref
        target="RFC8584"/> for use in DF Election Extended Community. 
	
        <figure title="DF Election Extended Community"><artwork><![CDATA[
                     1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Type = 0x06   | Sub-Type(0x06)| RSV |  DF Alg |    Bitmap     ~
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
~     Bitmap    |            Reserved                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
            ]]>
        </artwork></figure>

        <figure title="DF Election Capabilities"><artwork><![CDATA[
                     1         1
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| |A| |T|                       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
            ]]>
        </artwork></figure>
        </t>

        <t>
        <ul>
        <li>Bit 3: Time Synchronization (corresponds to Bit 27 of the DF Election Extended
         Community). When set
        to 1, it indicates the desire to use Time Synchronization capability
        with the rest of the PEs in the Ethernet Segment.</li>
        </ul>
        </t>
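        <t>Setting and testing the T bit can be sketched as follows. This is a non-normative
        Python illustration that treats the DF Election Capabilities bitmap as a 16-bit integer
        with bit 0 as the most significant bit, as in the figure above. The helper names
        capability_mask, set_time_sync, and has_time_sync are assumptions of this sketch.</t>
        <figure><artwork><![CDATA[
```python
BITMAP_WIDTH = 16   # the capabilities bitmap is 2 octets long
T_BIT = 3           # Time Synchronization


def capability_mask(bit):
    """Mask for a capability bit, counting bit 0 as the MSB."""
    return 1 << (BITMAP_WIDTH - 1 - bit)


def set_time_sync(bitmap):
    """Return the bitmap with the T bit set to 1."""
    return bitmap | capability_mask(T_BIT)


def has_time_sync(bitmap):
    """True if the T bit is set in the bitmap."""
    return bool(bitmap & capability_mask(T_BIT))
```
]]></artwork></figure>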

        <t>
        This capability is utilized in conjunction with the agreed-upon DF Election Type.
        For instance, if all the PE devices in the Ethernet Segment indicate the desire to use the
        Time Synchronization capability and request the DF Election Type to be Highest Random Weight (HRW),
        then the HRW algorithm is used in conjunction with this capability. A PE that does not
        support the procedures set out in this document, or that receives a route from another PE in
        which the capability is not set, MUST NOT delay Designated Forwarder election, as this could
        lead to duplicate traffic in some instances (overlapping Designated Forwarders).</t>

      </section>

      <section anchor="fsm_8584" title="Updates to RFC8584">
        <t>This document introduces an additional delay to the events and
        transitions defined for the default DF election algorithm FSM in
        <relref target="RFC8584" section="2.1"/> without changing the FSM state or event definitions
        themselves.</t>
 
        <t>Upon receiving a RCVD_ES message, the peering PE's Finite State Machine (FSM) transitions
        from the DF_DONE (indicating the DF election process was complete) state to the DF_CALC
        (indicating that a new DF calculation is needed) state. Due to the Service Carving Time
        (SCT) included in the Ethernet-Segment update, the completion of the DF_CALC state and the
        subsequent transition back to the DF_DONE state are delayed. This delay ensures proper
        synchronization and prevents conflicts. Consequently, the accompanying forwarding updates to
        the Designated Forwarder (DF) and Non-Designated Forwarder (NDF) states are also deferred.</t>

        <t>Item 9 in the list "Corresponding actions when transitions
are performed or states are entered/exited" in <relref target="RFC8584" section="2.1"/> is changed as follows:</t>
        <ol start="9">
        <li>DF_CALC on CALCULATED: Mark the election result for the VLAN or
        VLAN Bundle.
        <ol type="9.%d">
        <li>If an SCT timestamp is present during the RCVD_ES event of Action 11, wait until the
        time indicated by the SCT minus skew before proceeding to step 9.3.</li>
        <li>If an SCT timestamp is present during the RCVD_ES event of Action 11, wait until the
        time indicated by the SCT before proceeding to step 9.4.</li>
        <li>Assume the role of NDF for the local PE concerning the VLAN or VLAN Bundle, and transition to the DF_DONE state.</li>
        <li>Assume the role of DF for the local PE concerning the VLAN or VLAN Bundle, and transition to the DF_DONE state.</li>
        </ol>
        </li>
         </ol>
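        <t>The added wait in steps 9.1 and 9.2 can be sketched as follows for a PE that received
        the SCT. This is a non-normative Python illustration; the function name action_9_delay
        and its parameters are assumptions of this sketch. Times are seconds on the common
        synchronized clock, and an SCT that is already in the past results in no wait.</t>
        <figure><artwork><![CDATA[
```python
DEFAULT_SKEW = 0.010  # seconds (10 ms default skew)


def action_9_delay(new_role, now, sct=None, skew=DEFAULT_SKEW):
    """Seconds to wait before assuming new_role ("DF" or "NDF").

    sct is None when no SCT accompanied the RCVD_ES event.
    """
    if new_role not in ("DF", "NDF"):
        raise ValueError("role must be 'DF' or 'NDF'")
    if sct is None:
        return 0.0  # no SCT: apply the result without added delay
    # Step 9.1: NDF is assumed at SCT minus skew.
    # Step 9.2: DF is assumed at SCT.
    deadline = sct - skew if new_role == "NDF" else sct
    # An SCT in the past means the wait is already over.
    return max(0.0, deadline - now)
```
]]></artwork></figure>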

        <t>This revised approach ensures proper timing and synchronization in the DF election
        process, avoiding conflicts and ensuring accurate forwarding updates.</t>
        </section>


      </section>


      <section anchor="example" title="Synchronization Scenarios">

        <t>Consider <xref target="topology"/> as an example, where initially PE2 has failed and PE1 has taken over.
        This scenario illustrates the problem with the DF-Election mechanism described in <relref target="RFC7432" section="8.5"/>,
        specifically in the context of the timer value configured for all PEs on the Ethernet
        Segment.</t>

        <t>Procedure based on <relref target="RFC7432" section="8.5"/> with the default 3 second timer in step 2:
        <ol>
          <li>Initial state: PE1 is in a steady-state and PE2 is recovering.</li>
          <li>Recovery: PE2 recovers at an absolute time of t=99.</li>
          <li>Advertisement: PE2 advertises RT-4, sent at t=100, to partner PE1.</li>
          <li>Timer Start: PE2 starts a 3 second timer to allow the reception of RT-4 from other PE
          nodes.</li>
          <li>Immediate carving: PE1 performs service carving immediately upon RT-4 reception, i.e., t=100 plus some BGP propagation delay.</li>
          <li>Delayed Carving: PE2 performs service carving at time t=103.</li>
        </ol>
        </t>
            
        <t><xref target="RFC7432"/> favors traffic drops over duplicate traffic.
	    With the above procedure, traffic drops will occur as part of each PE recovery sequence
        since PE1 transitions some VLANs to Non-Designated Forwarder (NDF) immediately upon RT-4
        reception.<br/>
        The timer value (default = 3 seconds) directly affects the duration of the packet
        drops. A shorter (or zero) timer may result in duplicate traffic or traffic loops.</t>


        <t>Procedure based on the Service Carving Time (SCT) approach:
        <ol>
          <li>Initial state: PE1 is in a steady state, and PE2 is recovering.</li>
          <li>Recovery: PE2 recovers at an absolute time of t=99.</li>
          <li>Timer Start: PE2 starts at t=100 a 3 second timer to allow the reception of RT-4 from other PE
          nodes.</li>
          <li>Advertisement: PE2 advertises RT-4, sent at t=100, with a target SCT value of t=103 to
          partner PE1.</li>
          <li>Service Carving Timer: PE1 starts the service carving timer, with the remaining time
          until t=103.</li>
          <li>Simultaneous Carving: Both PE1 and PE2 carve at an absolute time of t=103.</li>
        </ol>
        </t>

        <t>
        To maintain the preference for minimal loss over duplicate traffic, PE1 SHOULD carve
        slightly before PE2 (with skew). The recovering PE2 performs both DF to NDF and NDF to DF
        transitions per VLAN at the timer's expiry. The original PE1, which received the SCT, applies the following:
	  <ul>
	    <li>DF to NDF Transition(s): at t=SCT minus skew, where both PEs are NDF for the skew duration.</li>
	    <li>NDF to DF Transition(s): at t=SCT.</li>
	  </ul>
        This split-behavior ensures a smooth DF role transition with minimal loss.
	</t>
	
        <t>Using the SCT approach, the negative effect of the timer to allow the reception of
        Ethernet Segment RT-4 from other PE nodes is mitigated. Furthermore, the BGP
        transmission delay (from PE2 to PE1) of the ES RT-4 becomes a non-issue. The SCT approach shortens the
        3-second timer window to the order of milliseconds.</t>
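        <t>The difference between the two procedures can be illustrated with the timeline
        values used above. The following non-normative Python sketch computes, for a VLAN whose
        DF role moves from PE1 to PE2, the interval during which neither PE forwards. The
        function names and the BGP propagation delay of 50 ms used in the example are
        assumptions of this sketch, not values taken from any specification.</t>
        <figure><artwork><![CDATA[
```python
def drop_window_rfc7432(rt4_sent, bgp_delay, peering_timer):
    """Timer-based procedure of RFC 7432, Section 8.5.

    PE1 goes NDF upon RT-4 reception; PE2 goes DF at timer
    expiry. Assumes bgp_delay is smaller than peering_timer.
    """
    pe1_ndf = rt4_sent + bgp_delay
    pe2_df = rt4_sent + peering_timer
    return pe2_df - pe1_ndf


def drop_window_sct(rt4_sent, peering_timer, skew):
    """SCT procedure: PE1 goes NDF at SCT - skew, PE2 DF at SCT."""
    sct = rt4_sent + peering_timer
    pe1_ndf = sct - skew
    pe2_df = sct
    return pe2_df - pe1_ndf


# RT-4 sent at t=100, 3-second timer, hypothetical 50 ms BGP delay.
legacy = drop_window_rfc7432(100.0, 0.05, 3.0)   # about 2.95 s
synced = drop_window_sct(100.0, 3.0, 0.010)      # about 0.010 s
```
]]></artwork></figure>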

        <t>The peering timer is a configurable value, with 3 seconds as the default.
        Configuring a timer value of 0, or so small as to expire during propagation of the BGP
        routes, is outside the scope of this document.
        In practice, the SCT approach presented in this document encourages the use of
        larger peering timer values to overcome any BGP route propagation delays.</t>
	
        <section anchor="concurrent" title="Concurrent Recoveries">
        <t>In the event that two or more PEs in a peering Ethernet Segment group recover
        concurrently or at roughly the same time, each will advertise a Service Carving Timestamp.
        This SCT value corresponds to what each recovering PE considers the "end time" for DF
        Election. A similar situation arises with sequentially recovering PEs, when a second PE
        recovers at approximately the time of the first PE's advertised SCT expiry, and with its own
        new SCT-2 outside of the initial SCT window.</t>
        
        <t>In the case of multiple concurrent DF elections, each initiated by one of the recovering
        PEs, the SCTs must be ordered chronologically. All PEs shall execute only a single DF
        Election at the service carving time corresponding to the largest (latest) received timestamp value.
        This DF Election will involve all active PEs in a unified DF Election update.</t>

        <t>Example:
        <ol>
          <li>Initial State: PE1 is in a steady state, with services elected at PE1.</li>
          <li>Recovery of PE2: PE2 recovers at time t=100 and advertises RT-4 with a target SCT
          value of t=103 to its partners (PE1).</li>
          <li>Timer Initiation by PE2: PE2 starts a 3 second timer to allow the reception of RT-4
          from other PE nodes.</li>
          <li>Timer Initiation by PE1: PE1 starts the service carving timer, with the remaining time
          until t=103.</li>
          <li>Recovery of PE3: PE3 recovers at time t=102 and advertises RT-4 with a target SCT
          value of t=105 to its partners (PE1, PE2).</li>
          <li>Timer Initiation by PE3: PE3 starts a 3 second timer to allow the reception of RT-4
          from other PE nodes.</li>

          <li>Timer Update by PE2: PE2 cancels the running timer and starts the service carving
          timer with the remaining time until t=105.</li>
          <li>Timer Update by PE1: PE1 updates its service carving timer, with the remaining time
          until t=105.</li>
          <li>Service Carving: PE1, PE2, and PE3 perform service carving at the absolute time of t=105.</li>
        </ol>
        </t>
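        <t>The timer handling in this example can be sketched as follows. This is a
        non-normative Python illustration in which each PE keeps a single pending carving time
        and re-arms its timer only when a later SCT is learned. The class name CarvingTimer
        and its members are assumptions of this sketch.</t>
        <figure><artwork><![CDATA[
```python
class CarvingTimer:
    """Tracks the single pending service carving time on one PE."""

    def __init__(self):
        self.pending_sct = None

    def on_sct(self, sct):
        """Record an SCT, whether local or received.

        Returns True if the timer was armed or re-armed, i.e.,
        the SCT is later than the one currently pending.
        """
        if self.pending_sct is None or sct > self.pending_sct:
            self.pending_sct = sct
            return True
        return False


# PE1 in the example: SCT of PE2 (t=103), then SCT of PE3 (t=105).
pe1 = CarvingTimer()
pe1.on_sct(103.0)
pe1.on_sct(105.0)   # pe1.pending_sct is now 105.0
```
]]></artwork></figure>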

        <t>In the event that a PE in an Ethernet Segment group recovers during the discovery window
        specified in <relref target="RFC7432" section="8.5"/>, and does not support or advertise the
        T-bit, then all PEs in the current peering sequence SHALL immediately revert to the default
        <xref target="RFC7432"/> behavior.</t>

        </section>
        </section>

        <section anchor="ntpcompat" title="Backwards Compatibility">
          <t>For the DF election procedures to achieve global convergence and unanimity within a
          redundancy group, it is essential that all participating PEs agree on the DF election
          algorithm to be employed. However, it is possible that some PEs may continue to use the
          existing modulo-based DF election algorithm from <xref target="RFC7432"/> and not utilize the new Service Carving Time
          (SCT) BGP extended community. PEs that operate using the baseline DF election mechanism
          will simply discard the new SCT BGP extended community as unrecognized.</t>
	  
          <t>A PE can indicate its willingness to support clock-synchronized carving by signaling
          the new 'T' DF Election Capability and including the new SCT BGP extended community along
          with the Ethernet Segment Route Type 4. If one or more PEs attached to the Ethernet
          Segment do not signal T=1, then all PEs in the Ethernet Segment SHALL revert to the
          timer-based approach as specified in <xref target="RFC7432"/>. This reversion is particularly crucial in
          preventing VLAN shuffling when more than two PEs are involved.</t>

          <t>In the event a new or extra RT-4 is received without the new 'T' DF Election
          Capability in the midst of an ongoing DF Election sequence, all SCT-based delays are
          cancelled and the DF Election immediately applied as specified in <xref
          target="RFC7432"/>, as if no SCT had been previously exchanged.</t>
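          <t>The fallback decision can be sketched as follows. This is a non-normative Python
          illustration in which each peer is represented by the 16-bit DF Election Capabilities
          bitmap from its RT-4. The function name use_sct_delays, its parameters, and the
          T_MASK constant are assumptions of this sketch.</t>
          <figure><artwork><![CDATA[
```python
T_MASK = 1 << (15 - 3)  # bit 3 of the bitmap, bit 0 being the MSB


def use_sct_delays(peer_bitmaps, local_supports_sct=True):
    """True only if the local PE and every peer signal T=1.

    peer_bitmaps holds one capabilities bitmap per peer RT-4.
    A single RT-4 without the T bit cancels SCT-based delays.
    """
    return local_supports_sct and all(
        bool(bitmap & T_MASK) for bitmap in peer_bitmaps)
```
]]></artwork></figure>
          <t>When this function returns False, the DF Election result is applied as specified
          in <xref target="RFC7432"/>, without any SCT-based delay.</t>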

      </section>

      <section anchor="security" title="Security Considerations">
        <t>The mechanisms in this document use the EVPN control plane as defined in
        <xref target="RFC7432"/>. Security considerations described in
        <xref target="RFC7432"/> are equally applicable.</t>

        <t>For the new SCT Extended Community, attack vectors may be setting the value to zero, to a
        value in the past or to large times in the future. The procedures in this document address
        implicitly what occurs with a carving time in the past, as this would be a naturally
        occurring event with a large BGP propagation delay: the receiving PE SHALL treat
        the DF Election at the peer as having occurred already, and proceed without starting any
        timer to further delay service carving. For timestamp values in the future, a rogue PE may advertise a value
        inconsistent with its local behavior. This is no different from a rogue PE whose
        DF Election results are inconsistent with those of its
        peers, whether it misapplies or ignores the procedures from <xref target="RFC7432"/>, and
        the result would similarly be duplicate or dropped traffic. It is left to implementations to
        decide what constitutes an "unreasonably large" SCT value.</t>

        <t>This document uses MPLS and IP-based tunnel technologies to support data plane transport.
        Security considerations described in <xref target="RFC7432"/> and in <xref target="RFC8365"/> are equally applicable.</t>      
      </section>

      <section anchor="IANA" title="IANA Considerations">

        <t>IANA maintains the "EVPN Extended Community Sub-Types" registry set
       up by <xref target='RFC7153'/>.  IANA is requested to confirm the First Come First
       Served assignment as follows:
        <figure><artwork><![CDATA[
   Sub-Type Value   Name                        Reference
   --------------   -------------------------   -------------
         0x0F       Service Carving Timestamp   This document
        ]]></artwork></figure>
</t>

        <t>IANA maintains the "DF Election Capabilities" registry set up by
        <xref target="RFC8584"/>. IANA is requested to make the following assignment from
   this registry:

        <figure><artwork><![CDATA[
    Bit         Name                         Reference
    ----        ----------------             -------------
    3           Time Synchronization         This document
        ]]></artwork></figure>

        </t>
      </section>
    </middle>

 <!--  *****BACK MATTER ***** -->

<back>
    <!-- References split into informative and normative -->
    <references title="Normative References">
        <xi:include href="https://www.rfc-editor.org/refs/bibxml/reference.RFC.2119.xml"/>
        <xi:include href="https://www.rfc-editor.org/refs/bibxml/reference.RFC.8174.xml"/>
        <xi:include href="https://www.rfc-editor.org/refs/bibxml/reference.RFC.7153.xml"/>
        <xi:include href="https://www.rfc-editor.org/refs/bibxml/reference.RFC.7432.xml"/>
        <xi:include href="https://www.rfc-editor.org/refs/bibxml/reference.RFC.8365.xml"/>
        <xi:include href="https://www.rfc-editor.org/refs/bibxml/reference.RFC.8584.xml"/>
        <xi:include href="https://www.rfc-editor.org/refs/bibxml/reference.RFC.5905.xml"/>
    </references>
    <!-- References split into informative and normative -->
    <references title="Informative References">
        <reference anchor="HRW98" target="https://www.microsoft.com/en-us/research/wp-content/uploads/2017/02/HRW98.pdf">
          <front>
            <title>Using Name-Based Mappings to Increase Hit Rates</title>
            <author initials="D" surname="Thaler">
              <organization/>
            </author>
            <author initials="C" surname="Ravishankar">
              <organization/>
            </author>
            <date year="1998"/>
          </front>
        </reference>
    </references>


    <section anchor="contributors" title="Contributors">
    <t>In addition to the authors listed on the front page, the following co-authors
    have also contributed substantially to this document:</t>
  
    <t>Gaurav Badoni<br/>Cisco</t>
    <t>Email: gbadoni@cisco.com</t>

    <t>Dhananjaya Rao<br/>Cisco</t>
    <t>Email: dhrao@cisco.com</t>
    </section>

    <section anchor="acknowledgements" title="Acknowledgements">
        <t>The authors would like to acknowledge the helpful comments
        and contributions of Satya Mohanty and Bharath Vasudevan.
        Thanks also to Anoop Ghanwani and Gunter van de Velde for their thorough reviews with valuable comments and
        corrections.</t>
    </section>

</back>
</rfc>

