<?xml version="1.0" encoding="US-ASCII"?>
<!DOCTYPE rfc SYSTEM "rfc2629.dtd">
<?rfc toc="yes"?>
<!-- You want a table of contents -->
<?rfc symrefs="yes"?>
<!-- Use symbolic labels for references -->
<?rfc sortrefs="yes"?>
<!-- This sorts the references -->
<?rfc iprnotified="no" ?>
<!-- Change to "yes" if someone has disclosed IPR for the draft -->
<?rfc compact="yes"?>
<!-- This defines the specific filename and version number of your draft (and inserts the appropriate IETF boilerplate -->
<rfc category="std" docName="draft-ji-ccwg-distributed-lossless-mechanism-00"
     ipr="trust200902">
  <front>
    <title abbrev="Congestion Control for Distributed AIDC Lossless Network">A
    Congestion Control Mechanism for Distributed AIDC Lossless
    Networks</title>

    <author fullname="Siwei Ji" initials="S." surname="Ji">
      <organization>China Telecom</organization>

      <address>
        <postal>
          <street>Beiqijia Town, Changping District</street>

          <city>Beijing, 102209</city>

          <country>China</country>
        </postal>

        <email>jisw@chinatelecom.cn</email>
      </address>
    </author>

    <author fullname="Cong Li" initials="C." surname="Li">
      <organization>China Telecom</organization>

      <address>
        <postal>
          <street>Beiqijia Town, Changping District</street>

          <city>Beijing, 102209</city>

          <country>China</country>
        </postal>

        <email>licong@chinatelecom.cn</email>
      </address>
    </author>

    <author fullname="Keyi Zhu" initials="K." surname="Zhu">
      <organization>Huawei Technologies</organization>

      <address>
        <postal>
          <street>Huawei Campus, No.156 Beiqing Road</street>

          <city>Beijing, 100095</city>

          <country>China</country>
        </postal>

        <email>zhukeyi@huawei.com</email>
      </address>
    </author>

    <date day="21" month="October" year="2024"/>

    <area>Web and Internet Transport</area>

    <workgroup>Congestion Control Working Group</workgroup>

    <keyword>congestion control</keyword>

    <keyword>lossless network</keyword>

    <keyword>RDMA</keyword>

    <abstract>
      <t>This document proposes a congestion control mechanism for
      distributed AI data center (AIDC) lossless networks. The mechanism
      addresses the decline in model training performance caused by
      congestion and packet loss on long-distance links when large models
      are trained across multiple data centers within a region. This
      document also describes a practice scenario in which the mechanism
      has been tested.</t>
    </abstract>
  </front>

  <middle>
    <section anchor="intro" title="Introduction">
      <t>With the rapid development of big data and artificial intelligence
      (AI) technology, AI solutions represented by large models are being
      adopted across many industries, and the demand for computing power
      keeps increasing. A large-scale GPU cluster is a prerequisite for
      large model training. However, when deploying a cluster with 10,000
      or even 100,000 GPUs, the computing power that a single intelligent
      data center (DC) can host is limited by the space, power supply, and
      heat dissipation capacity of the equipment room. To solve this
      problem, multiple intelligent DCs within a region can be
      interconnected into one large virtual intelligent computing cluster.
      Collaborative computing among these DCs is realized through a
      distributed AIDC lossless network (also known as RDMA remote, where
      RDMA stands for Remote Direct Memory Access), which meets the demand
      for high computing power.</t>

      <t>However, building a larger-scale intelligent computing cluster
      from multiple intelligent DCs raises several challenges. For
      example, RDMA remote generates traffic flows that traverse
      long-distance links. If congestion occurs on such a link,
      traditional congestion control mechanisms such as PFC and ECN may
      become ineffective because the congestion feedback takes longer to
      reach the transmitting node. In the meantime, the buffers of the
      network devices can be exhausted, which eventually leads to packet
      loss.</t>

      <t>To address congestion and packet loss when DCs are interconnected
      across long distances, this document proposes a congestion control
      mechanism that alleviates network congestion by shortening the
      congestion feedback time and by adjusting the flow rate of the
      transmitting node according to the congestion degree.</t>

      <section title="Requirements Language">
        <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
        "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
        "OPTIONAL" in this document are to be interpreted as described in BCP
        14 <xref target="RFC2119"/> <xref target="RFC8174"/> when, and only
        when, they appear in all capitals, as shown here.</t>
      </section>

      <section title="Terminology">
        <t>The following terms are used in this document:</t>

        <t>RDMA remote: The interconnection of multiple intelligent DCs
        within a region into a large virtual intelligent computing cluster,
        which enables collaborative computing among those DCs.</t>

        <t>PFC (Priority-based Flow Control): A hop-by-hop flow control
        mechanism that operates per priority, allowing multiple types of
        traffic flows to share an Ethernet link without affecting each
        other.</t>

        <t>ECN (Explicit Congestion Notification): A mechanism in which a
        congested network device marks packets instead of dropping them
        <xref target="RFC3168"/>. In RoCE (RDMA over Converged Ethernet)
        networks, the receiving node responds to marked packets by sending
        CNPs to the transmitting node, which then reduces its flow rate,
        achieving end-to-end congestion management.</t>

        <t>CNP (Congestion Notification Packet): A packet sent to the
        transmitting node to signal that a traffic flow has experienced
        congestion.</t>
      </section>
    </section>

    <section title="Congestion Control Mechanism">
      <section title="Congestion Control Principle">
        <t>At present, the most widely used congestion control mechanism in
        RoCE networks is ECN. When congestion occurs on a network device,
        the device marks packets with an ECN congestion indication and
        forwards them to the receiving node. The receiving node then sends
        CNPs to the transmitting node, which reduces its transmitting rate
        and thus alleviates network congestion. However, in distributed
        AIDC lossless networks, training large models cooperatively across
        multiple DCs generates long-distance data transmission. If
        congestion occurs on the long-distance link, the CNPs generated by
        the traditional ECN mechanism travel a longer feedback path, so
        the flow rate of the transmitting node may not be reduced in time.
        The result is packet loss, which eventually degrades the training
        performance of the large models. To meet the lossless requirements
        of distributed AIDC networks, this document proposes a congestion
        control mechanism that transfers the "congestion point" from the
        long-distance link to the network device closest to the
        transmitting node, so that congestion over long distances is
        handled with low latency.</t>
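
        <t>As a rough illustration of why feedback delay matters, the
        following Python sketch estimates how much data a transmitting node
        emits while a congestion signal is still crossing a long-distance
        span. The signal speed of about 200,000 km/s in optical fiber is a
        common approximation. The 400 Gbit/s port rate and all function
        names are assumptions made for this example and are not part of the
        mechanism; the 120 km distance is the one used in the test
        environment described later in this document.</t>

        <t><figure>
            <artwork><![CDATA[
```python
# Illustrative estimate only. The link rate is an assumed value.
FIBER_KM_PER_S = 200_000.0  # approximate signal speed in optical fiber


def one_way_delay_s(distance_km: float) -> float:
    """Propagation delay of one span crossing, without queuing delay."""
    return distance_km / FIBER_KM_PER_S


def bytes_sent_during(delay_s: float, link_rate_bps: float) -> float:
    """Bytes a sender at line rate emits while feedback is in transit."""
    return delay_s * link_rate_bps / 8.0


if __name__ == "__main__":
    distance_km = 120.0     # inter-cluster distance in the test setup
    link_rate_bps = 400e9   # assumption: one 400 Gbit/s port
    for crossings in (1, 2):
        delay = crossings * one_way_delay_s(distance_km)
        in_flight = bytes_sent_during(delay, link_rate_bps)
        print(f"{crossings} crossing(s): {delay * 1e6:.0f} us, "
              f"{in_flight / 1e6:.0f} MB sent before feedback arrives")
```
]]></artwork>
          </figure></t>

        <t>Under these assumptions, one crossing of a 120 km span takes
        about 0.6 ms, during which a single 400 Gbit/s port emits roughly
        30 MB. Data sent in this interval has to be absorbed by device
        buffers if packet loss is to be avoided, which is why shortening
        the feedback path is the focus of the mechanism.</t>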
      </section>

      <section title="Congestion Control Process">
        <t>Figure 1 shows the process of the congestion control mechanism.
        H1 and H2 are the transmitting node and the receiving node,
        respectively. R11 is the next-hop device closest to the
        transmitting node (known as the proximal device), and R12 is the
        device at the other end of the long-distance link. The distance
        between R11 and R12 is on the order of a hundred kilometers or
        more (120 km in Figure 1).</t>

        <t><figure>
            <artwork><![CDATA[                              
               1.notification message
               <-------------------                                                        
+-------+     +------+  120km +------+     +-------+  
|  H1   #-----#  R11 #--------#  R12 #-----#   H2  |
+-------+     +------+        +------+     +-------+
     2.flow-control 
     protocol packets  
   <--------------
 
Figure 1: The Process of Congestion Control Mechanism]]></artwork>
          </figure></t>

        <t><list style="symbols">
            <t>First, each device monitors the network state, including the
            queue accumulation and buffer usage of each port, to determine
            whether congestion occurs on the link.</t>

            <t>If congestion occurs on the link and the congested device
            (R12) is not the proximal device (R11) of the transmitting
            node, R12 sends a notification message to R11. The notification
            message contains information such as the number of the
            congested port, its queue depth, and its buffer usage.</t>

            <t>R11 determines the congestion degree of the congested device
            from the content of the notification message, calculates the
            number of CNPs or other flow-control protocol packets that need
            to be sent, and sends them to H1. The flow-control protocol
            packets carry information about the congested traffic
            flows.</t>

            <t>After receiving the flow-control protocol packets, H1
            reduces the transmitting rate of the corresponding congested
            traffic flows to alleviate the congestion on the network
            devices.</t>
          </list></t>
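
        <t>The steps above can be sketched in Python as follows. This is a
        minimal illustration under stated assumptions, not a definition of
        the mechanism: this document does not specify thresholds, message
        formats, or formulas, so every constant, field name, and function
        below is hypothetical. In particular, the sketch assumes that the
        notification message lists the identifiers of the congested flows,
        that the congestion degree is the larger of the normalized queue
        depth and the buffer usage, that the number of CNPs grows linearly
        with that degree, and that H1 applies a fixed multiplicative rate
        decrease per CNP.</t>

        <t><figure>
            <artwork><![CDATA[
```python
import math
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence, Tuple

# Every constant, field name, and formula below is an assumption.
QUEUE_THRESHOLD_BYTES = 1_000_000  # queue depth that signals congestion
BUFFER_THRESHOLD = 0.8             # buffer usage that signals congestion
MAX_QUEUE_BYTES = 4_000_000        # per-port queue capacity
MAX_CNPS_PER_FLOW = 8              # cap on CNPs per notification
RATE_DECREASE_FACTOR = 0.9         # rate multiplier applied per CNP


@dataclass(frozen=True)
class Notification:
    """Message sent by R12 (congested) to R11 (proximal)."""
    port: int
    queue_depth_bytes: int
    buffer_usage: float            # fraction of the port buffer in use
    flow_ids: Tuple[str, ...]      # assumption: congested flows listed


def detect_congestion(port: int, queue_depth_bytes: int,
                      buffer_usage: float,
                      flow_ids: Sequence[str]) -> Optional[Notification]:
    """On R12: build a notification if the port is congested."""
    congested = (queue_depth_bytes >= QUEUE_THRESHOLD_BYTES
                 or buffer_usage >= BUFFER_THRESHOLD)
    if not congested:
        return None
    return Notification(port, queue_depth_bytes, buffer_usage,
                        tuple(flow_ids))


def congestion_degree(note: Notification) -> float:
    """Map the reported state to a degree in the range 0.0 to 1.0."""
    queue_fill = min(1.0, note.queue_depth_bytes / MAX_QUEUE_BYTES)
    return max(queue_fill, min(1.0, note.buffer_usage))


def cnps_to_send(note: Notification) -> List[str]:
    """On R11: one list entry (the target flow id) per CNP to send."""
    degree = congestion_degree(note)
    count = max(1, math.ceil(degree * MAX_CNPS_PER_FLOW))
    return [flow for flow in note.flow_ids for _ in range(count)]


def apply_cnps(rates_bps: Dict[str, float],
               cnps: Sequence[str]) -> Dict[str, float]:
    """On H1: reduce the rate of each flow once per received CNP."""
    new_rates = dict(rates_bps)
    for flow in cnps:
        if flow in new_rates:
            new_rates[flow] *= RATE_DECREASE_FACTOR
    return new_rates
```
]]></artwork>
          </figure></t>

        <t>With these assumed constants, a notification that reports a
        queue depth of 2,000,000 bytes and a buffer usage of 0.5 yields a
        congestion degree of 0.5, so R11 would send four CNPs for each
        listed flow, and only the rates of those flows would be reduced at
        H1. A real implementation would choose its own mapping from
        congestion degree to packet count.</t>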

        <t>The traffic of large model training is periodic: if a flow is
        congested in the first training period, it is expected to be
        congested again in every subsequent period. Therefore, in this
        mechanism the network devices record information about forwarded
        packets in flow table entries, including which flows are congested.
        When a congested flow recurs in a later period, R11 sends CNPs or
        other flow-control protocol packets directly to H1 based on the
        learned flow table entries to control the transmitting rate, and
        the remote congested device (R12) no longer needs to send
        notification messages. In this way, once the congestion information
        of the entire network has been obtained in the first training
        period, the traffic flows can remain lossless in the remaining
        periods.</t>
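
        <t>A minimal Python sketch of such a flow table on the proximal
        device is given below. It is only one possible shape for the idea:
        the flow key (source, destination, and a queue pair number), the
        entry fields, and the class and method names are all hypothetical,
        since this document does not define the format of a flow table
        entry.</t>

        <t><figure>
            <artwork><![CDATA[
```python
from dataclasses import dataclass
from typing import Dict, Tuple

# Assumed flow key: (source, destination, destination queue pair).
FlowKey = Tuple[str, str, int]


@dataclass
class FlowEntry:
    congested: bool = False   # learned from a remote notification
    cnp_count: int = 0        # CNPs to send when the flow recurs


class ProximalFlowTable:
    """Hypothetical flow table kept by the proximal device (R11)."""

    def __init__(self) -> None:
        self._entries: Dict[FlowKey, FlowEntry] = {}

    def record_forwarded(self, key: FlowKey) -> None:
        """Create an entry the first time a flow is forwarded."""
        self._entries.setdefault(key, FlowEntry())

    def learn_congestion(self, key: FlowKey, cnp_count: int) -> None:
        """First period: a notification from R12 named this flow."""
        entry = self._entries.setdefault(key, FlowEntry())
        entry.congested = True
        entry.cnp_count = cnp_count

    def cnps_for(self, key: FlowKey) -> int:
        """Later periods: CNPs to send at once, or 0 if none learned."""
        entry = self._entries.get(key)
        if entry is None or not entry.congested:
            return 0
        return entry.cnp_count


if __name__ == "__main__":
    table = ProximalFlowTable()
    flow = ("H1", "H2", 17)           # example values only
    table.record_forwarded(flow)
    print(table.cnps_for(flow))       # nothing learned yet
    table.learn_congestion(flow, 4)   # notification in first period
    print(table.cnps_for(flow))       # used directly in later periods
```
]]></artwork>
          </figure></t>

        <t>In this sketch, the lookup in a later period needs no message
        from R12, which mirrors the behavior described above. How entries
        age out or are corrected when the traffic pattern changes is left
        open here.</t>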
      </section>
    </section>

    <section title="Practice Scenario">
      <t>Lossless interconnection technology for distributed AIDC networks
      has been an active research topic in recent years. The congestion
      control mechanism proposed in this document has been applied in a
      test environment on a live network.</t>

      <t>Figure 2 and Figure 3 show the test environment, which consists of
      two AI training clusters with 512 GPUs each. The distance between
      cluster A and cluster B is 120 km. The spine switches of the two
      clusters are interconnected through wavelength division multiplexing
      equipment with a capacity of 25.6 Tbit/s, so that the two clusters
      can collaboratively train large models with billions of
      parameters.</t>

      <t><figure>
          <artwork><![CDATA[             +-------------+                +-------------+
             |    Spine1   |                |    Spine2   |
             +-+---+--+--+-+                +--+---+--+-+-+
              /    |  |  |                     |   |  |  |
             /     |  |  |                     |   |  |  |  
            /   +--+--+--+---------------------+   |  +  +
           /   /   |  |  |   +---------------------+ /    \
          /   /    |  |  +---|-----------------+----/----+ \     
         /   /     +  +------|----------+          /      \ \
        /   /       \        |          |         /        \ \    
+------+---++      +-+-------+-+      +-+-------+-+      +--+-+------+      
|   leaf1   |      |   leaf2   |      |   leaf3   | .... |   leaf16  |
+--+----+---+      +--+----+---+      +--+----+---+      +--+----+---+
   |    |             |    |             |    |             |    |
   H1...H4           H5...H8            H9...H12           H61...H64
                                        
                     Figure 2:   Cluster A]]></artwork>
        </figure></t>

      <t><figure>
          <artwork><![CDATA[             +-------------+                +-------------+
             |    Spine3   |                |    Spine4   |
             +-+---+--+--+-+                +--+---+--+-+-+
              /    |  |  |                     |   |  |  |
             /     |  |  |                     |   |  |  |  
            /   +--+--+--+---------------------+   |  +  +
           /   /   |  |  |   +---------------------+ /    \
          /   /    |  |  +---|-----------------+----/----+ \     
         /   /     +  +------|----------+          /      \ \
        /   /       \        |          |         /        \ \    
+------+---++      +-+-------+-+      +-+-------+-+      +--+-+------+      
|   leaf17  |      |   leaf18  |      |   leaf19  | .... |   leaf32  |
+--+----+---+      +--+----+---+      +--+----+---+      +--+----+---+
   |    |             |    |             |    |             |    |
  H65...H68           H69...H72         H73...H76          H125...H128
                                       
                      Figure 3:   Cluster B]]></artwork>
        </figure></t>

      <t>The experimental results show that, with the same number of GPUs,
      the training performance of the distributed intelligent DCs reaches
      over 90% of that of a single centralized intelligent DC. This result
      supports the feasibility of the distributed AIDC lossless network
      scheme and of the proposed congestion control mechanism.</t>
    </section>

    <section title="Conclusion">
      <t>Building distributed AI training clusters across multiple data
      centers is an important research direction for AIDC lossless
      networks. The congestion control mechanism proposed in this document
      addresses congestion and packet loss in long-distance DC
      interconnection by shortening the congestion feedback time and by
      adjusting the flow rate of the transmitting node according to the
      congestion degree. It thereby supports the construction of
      distributed AIDC lossless networks.</t>
    </section>

    <section anchor="security" title="Security Considerations">
      <t>There is no additional security risk introduced by this design.</t>
    </section>

    <section title="IANA Considerations">
      <t>This document has no IANA actions.</t>
    </section>

  </middle>

  <back>
    <references title="Normative References">
      <?rfc include='reference.RFC.2119'?>

      <?rfc include="reference.RFC.8174"?>

      <?rfc include='reference.RFC.3168'?>
    </references>

    <references title="Informative References">
      <?rfc include='reference.I-D.huang-rtgwg-wan-lossless-uc'?>

      <?rfc include='reference.I-D.hcl-rtgwg-ai-network-problem'?>

      <?rfc include='reference.I-D.he-huang-rtgwg-wan-lossless-framework'?>
    </references>
  </back>
</rfc>
