A congestion control mechanism based on distributed AIDC lossless network

With the rapid development of big data and artificial intelligence (AI) technology, AI solutions represented by large models have penetrated various industries, and the demand for computing power keeps increasing. A large-scale GPU cluster is a prerequisite for large model training. However, when deploying a cluster with 10,000 or even 100,000 GPUs, the computing power of a single intelligent data center (DC) is limited by insufficient space, power, and heat dissipation capacity in the computer room. To solve this problem, multiple intelligent DCs within a region can be interconnected into a large virtual intelligent computing cluster, which realizes collaborative computing among the DCs through a distributed AIDC lossless network (also known as RDMA remote) and thereby meets the demand for high computing power.

However, building a larger-scale intelligent computing cluster from multiple intelligent DCs raises many challenges. For example, RDMA remote generates traffic flows across long distances. If congestion occurs on a long-distance link, traditional congestion control mechanisms such as PFC and ECN may become ineffective because the congestion feedback time is longer, so the buffers of network devices are exhausted and packets are eventually lost. To solve the problems of congestion and packet loss when interconnecting DCs across long distances, this document proposes a congestion control mechanism that alleviates network congestion by shortening the congestion feedback time and adjusting the flow rate of the transmitting node based on the congestion degree.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 when, and only when, they appear in all capitals, as shown here.

The following terms are used in this document:

RDMA remote: The interconnection of multiple intelligent DCs within a region into a large virtual intelligent computing cluster, realizing collaborative computing among the DCs.

PFC (Priority-based Flow Control): A mechanism that provides hop-by-hop, priority-based flow control, enabling multiple types of traffic flows to run on Ethernet links without affecting each other.

ECN (Explicit Congestion Notification): A congestion control mechanism that reduces the flow rate of the transmitting node by sending CNPs from the receiving node to the transmitting node, achieving end-to-end congestion management.

CNP: Congestion Notification Packet.

At present, the most widely used congestion control mechanism in RoCE networks is ECN. When congestion occurs on a network device, the device forwards packets carrying an ECN mark to the receiving node, and the receiving node then sends CNPs to the transmitting node to notify it to reduce its transmitting rate, thus alleviating network congestion. However, in distributed AIDC lossless networks, training large models cooperatively across multiple DCs generates long-distance data transmission. If congestion occurs on a long-distance link, the CNPs generated by the traditional ECN mechanism travel a longer feedback path, so the flow rate of the transmitting node may not be reduced in time. This results in packet loss and eventually degrades the training performance of the large models. To meet the lossless requirements of distributed AIDC networks, this document proposes a congestion control mechanism that transfers the "congestion point" from the long-distance link to the network device closest to the transmitting node, so that congestion over long distances can be handled with low latency.
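The effect of a long feedback path can be illustrated with simple arithmetic. The following Python sketch is only an illustration and is not part of the proposed mechanism. It assumes a fiber propagation delay of about 5 microseconds per kilometer, ignores switching and queuing delay, and uses a hypothetical sender rate of 400 Gbit/s; the function names are chosen for this example only.

```python
# Assumption: light in optical fiber travels at roughly 200,000 km/s,
# i.e. about 5 microseconds per kilometer. Switching and queuing
# delays are ignored in this sketch.
FIBER_DELAY_S_PER_KM = 5e-6


def one_way_delay_s(distance_km: float) -> float:
    """Propagation delay of a fiber span, in seconds."""
    return distance_km * FIBER_DELAY_S_PER_KM


def bytes_in_flight(rate_bps: float, feedback_delay_s: float) -> float:
    """Bytes a sender emits at rate_bps before feedback can take effect."""
    return rate_bps * feedback_delay_s / 8


if __name__ == "__main__":
    distance_km = 120   # example span length
    rate_bps = 400e9    # hypothetical sender rate (400 Gbit/s)
    delay = one_way_delay_s(distance_km)
    print(f"one-way delay: {delay * 1e6:.0f} us")
    in_flight_mb = bytes_in_flight(rate_bps, delay) / 1e6
    print(f"data sent during one one-way delay: {in_flight_mb:.0f} MB")
```

Under these assumptions, every additional traversal of a 120 km link by the feedback signal adds about 600 microseconds, during which a sender at the assumed rate emits about 30 MB that downstream buffers must absorb.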

Figure 1 shows the specific process of the congestion control mechanism. H1 and H2 are the transmitting node and the receiving node, respectively. R11 is the next-hop device closest to the transmitting node (referred to as the proximal device), R12 is a device on the long-distance link, and the distance between R11 and R12 is in the range of hundreds of kilometers.

• First, each device monitors the network state, including the queue accumulation and buffer usage of each port, to determine whether congestion occurs on the link;

• If congestion occurs on the link, and the congested device (R12) is not the proximal device (R11) of the transmitting node, R12 sends a notification message to R11. The notification message contains information such as the number of the congested port, its queue depth, and its buffer usage;

• R11 determines the congestion degree of the device based on the content of the notification message, and calculates the number of CNPs or other flow-control protocol packets that need to be sent. The flow-control protocol packets contain information about the congested traffic flows;

• After receiving the flow-control protocol packets, H1 reduces the transmitting rate of the corresponding congested traffic flows to alleviate the congestion of network devices.

The traffic of large model training is periodic: if a certain flow is congested in the first training period, it is expected to be congested in every subsequent period. Therefore, this document has the network devices record information about forwarded packets in flow table entries, including which flows are congested. When a congested flow recurs periodically, R11 directly sends CNPs or other flow-control protocol packets to H1 based on the learned flow table entries to control the transmitting rate, and the remote congested device (R12) no longer needs to send notification messages. In this way, after the congestion information of the entire network has been obtained in the first training period, the traffic flows can be lossless in the remaining periods.
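The steps above, together with the flow table learning, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a definition of the mechanism. All class, function, and field names are hypothetical, and so are the thresholds, the mapping from buffer usage to a CNP count, and the rate reduction factor, because this document does not specify them. The sketch also assumes that the notification message identifies the congested flows.

```python
from dataclasses import dataclass, field


@dataclass
class CongestionNotification:
    """Message from a remote congested device (R12) to the proximal device (R11)."""
    port: int             # number of the congested port
    queue_depth: int      # bytes queued on the congested port
    buffer_usage: float   # fraction of the port buffer in use, 0.0 to 1.0
    flow_ids: tuple       # congested flows (assumed field, see lead-in)


def is_congested(queue_depth, buffer_usage,
                 depth_threshold=1_000_000, usage_threshold=0.8):
    """Step 1: local congestion check. Both thresholds are hypothetical."""
    return queue_depth >= depth_threshold or buffer_usage >= usage_threshold


def cnp_count(buffer_usage, max_cnps=8):
    """Step 3: hypothetical mapping from congestion degree to CNP count."""
    usage = min(max(buffer_usage, 0.0), 1.0)
    return max(1, int(usage * max_cnps))


@dataclass
class ProximalDevice:
    """R11: turns remote notifications into CNPs and learns congested flows."""
    flow_table: dict = field(default_factory=dict)  # flow_id -> CNP count

    def on_notification(self, msg):
        """First period: react to R12 and record the congested flows."""
        n = cnp_count(msg.buffer_usage)
        for flow_id in msg.flow_ids:
            self.flow_table[flow_id] = n
        return [(flow_id, n) for flow_id in msg.flow_ids]

    def on_flow_seen(self, flow_id):
        """Later periods: send CNPs from the learned entry, without R12."""
        n = self.flow_table.get(flow_id, 0)
        return [(flow_id, n)] if n else []


def reduce_rate(current_rate_bps, cnps, factor=0.5):
    """Step 4: hypothetical sender reaction, one multiplicative cut per CNP."""
    return current_rate_bps * factor ** cnps


if __name__ == "__main__":
    r11 = ProximalDevice()
    msg = CongestionNotification(port=3, queue_depth=2_000_000,
                                 buffer_usage=0.5, flow_ids=("flow-a",))
    if is_congested(msg.queue_depth, msg.buffer_usage):
        print(r11.on_notification(msg))   # first training period
    print(r11.on_flow_seen("flow-a"))     # subsequent training period
```

In the first period the CNPs for a flow are triggered by the notification from R12. In later periods `on_flow_seen` answers from the learned flow table, which is the case where the feedback no longer has to cross the long-distance link.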

Lossless interconnection technology for distributed AIDC networks has been a research hotspot in recent years. At present, the congestion control mechanism proposed in this document has been applied in a test environment on the existing network. Figure 2 and Figure 3 show the test environments of two AI training clusters, each of which deploys 512 GPUs. The distance between cluster A and cluster B is 120 km, and the spine switches of the two clusters are interconnected through wavelength division equipment with a capacity of 25.6T to train large models with billions of parameters collaboratively.

The experimental results show that the training performance of the distributed intelligent DCs reaches over 90% of that of a centralized single intelligent DC with the same number of GPUs, demonstrating the feasibility of the distributed AIDC lossless network scheme and the proposed congestion control mechanism.

Building distributed AI training clusters across multiple data centers is one of the important research directions for the future of AIDC lossless networks. The congestion control mechanism proposed in this document addresses congestion and packet loss in long-distance DC interconnection by shortening the congestion feedback time and adjusting the flow rate of the transmitting node according to the congestion degree. It thereby supports the construction of distributed AIDC lossless networks.

There is no additional security risk introduced by this design.

This document introduces no additional considerations for IANA.