<?xml version="1.0" encoding="US-ASCII"?>
<!-- This is built from a template for a generic Internet Draft. Suggestions for
     improvement welcome - write to Brian Carpenter, brian.e.carpenter @ gmail.com 
     This can be converted using the Web service at http://xml.resource.org/ -->
<!DOCTYPE rfc SYSTEM "rfc2629.dtd">
<?rfc sortrefs="yes"?>
<?rfc toc="yes"?>
<?rfc symrefs="yes"?>
<?rfc compact="yes"?>
<?rfc subcompact="no"?>
<?rfc topblock="yes"?>
<?rfc comments="no"?>
<rfc category="info" docName="draft-liu-opsawg-cco-cm-requirement-02"
     ipr="trust200902">
  <front>
    <title abbrev="CCO Control and Management Requirements">Requirements
    from the Control and Management Viewpoint for Collective Communication
    Optimization</title>

    <author fullname="Chang Liu" initials="C." surname="Liu">
      <organization>China Mobile</organization>

      <address>
        <postal>
          <street/>

          <city>Beijing</city>

          <code>100053</code>

          <country>China</country>
        </postal>

        <email>liuchangjc@chinamobile.com</email>
      </address>
    </author>

    <author fullname="Shiping Xu" initials="S." surname="Xu">
      <organization>China Mobile</organization>

      <address>
        <postal>
          <street/>

          <city>Beijing</city>

          <code>100053</code>

          <country>China</country>
        </postal>

        <email>xushiping@chinamobile.com</email>
      </address>
    </author>


    <date day="14" month="October" year="2024"/>

    <area>Ops &amp; Management</area>

    <workgroup>Operations and Management Area Working Group</workgroup>

    <keyword>collective communication</keyword>

    <keyword>in-network computing</keyword>

    <abstract>
      <t>Collective communication optimization is crucial for improving the
      performance of distributed applications, because communication has
      become a bottleneck that degrades application performance as
      distributed systems grow in scale. Industry and academia have been
      working on solutions to improve collective communication operations.
      However, unified guidelines have been lacking.</t>

      <t>This draft provides requirements on collective communication
      optimization from the control and management viewpoint.</t>
    </abstract>

    <note title="Requirements Language">
      <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
      "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
      document are to be interpreted as described in <xref
      target="RFC2119">RFC 2119</xref>.</t>
    </note>
  </front>

  <middle>
    <section anchor="intro" title="Introduction">
      <t>In recent years, with the development and evolution of various
      applications and businesses, especially the rapid growth of AI
      applications, distributed computing performance has become increasingly
      important and has gradually become a key factor limiting the progress
      of these applications. Because distributed computing involves a large
      amount of collective communication, collective communication
      performance is crucial. However, many problems remain to be solved. On
      one hand, many collective communication operations implemented by
      message-level communication libraries such as MPI and NCCL mainly rely
      on the unicast point-to-point communication mechanism, leading to
      redundant network traffic, underutilization of network resources, and
      waste of network capabilities. On the other hand, since the underlying
      network protocols and collective communication are not co-designed,
      there is a semantic gap between inter-process message transport and
      packet forwarding. There is therefore large room for optimizing
      collective communication, and industry and academia are actively
      promoting the development, implementation, and deployment of
      collective communication optimization.</t>

      <t>The Computing in the Network (COIN) research group also focuses on
      this topic. Its goal is mainly to investigate how network data plane
      programmability can improve the Internet architecture, with a broad
      scope that includes network function offloading, machine learning
      acceleration, in-network caching, in-network control, and more. In
      addition to the collective operation offloading solutions that COIN
      discusses, substituting multicast for point-to-point unicast,
      scheduling tasks and planning transport paths with topology awareness,
      and bridging the semantic gap between inter-process message transport
      and packet forwarding can also play a significant role in optimizing
      collective communication.</t>

      <t>This draft provides necessary requirements from the network control
      and management viewpoint, combined with the optimization solutions of
      collective communication offloading, multicast mechanisms, topology
      awareness, and bridging the semantic gap between inter-process message
      transport and packet forwarding, to guide the standardization work of
      collective communication optimization.</t>
    </section>

    <section title="Requirements">
      <section title="Memory Management">
        <t>The scarce memory resources that network devices provide for
        collective communication MUST be scheduled and controlled, e.g., by
        assigning a scheduling priority to collective communication
        offloading tasks. Compared to the volume of collective communication
        messages in applications such as AI and HPC, the memory that network
        devices such as programmable switches can provide for collective
        communication is severely mismatched and extremely scarce.</t>

        <t>Use Case <xref target="ESA"/>. The memory of a programmable
        switch is scarce compared to the volume of gradients transmitted in
        distributed training. Existing work such as pool-based streaming and
        dynamic sharing addresses this problem but is not yet sufficient. A
        use case for fully utilizing the memory of a programmable switch is
        that the switch's control and management module assigns a priority
        to each aggregation task and dynamically, preemptively schedules the
        aggregation tasks in the data plane, thus making fuller use of the
        memory organized as switch aggregators.</t>

        <figure align="center"
                title="The Mismatch between Device Memory and Communication Volume">
          <artwork align="center" type="ascii-art">+----------+  +----------+           +----------+
|          |  |          |           |          |
| worker 1 |  | worker 2 |           | worker n |
|          |  |          |  ... ...  |          |
+----+-----+  +----+-----+           +-----+----+
     |             |                       |
     |             |                       |
     +------+------+-----------------------+
            |GB/TB-level gradients
+-----------+-----------+           +-----------+
|           |           |           |           |
|  +--------+--------+  |           |           |
|  |     Switch      |  |           |  Control  |
|  |   Aggregators   |  |  Manage   |     &amp;     |
|  +-----------------&lt;--+-----------+ Management|
|  |Memory for others|  | Schedule  |           |
|  +-----------------+  |           |           |
| Switch Memory = 10MB  |           |           |
+-----------------------+           +-----------+</artwork>
        </figure>
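        <t>As a concrete illustration of the scheduling requirement above,
        the following sketch models an aggregator pool whose control and
        management module preemptively reclaims memory from lower-priority
        aggregation tasks. All class, method, and task names here are
        hypothetical and do not correspond to any real switch API.</t>

        <figure align="left"
                title="Illustrative Priority-Based Aggregator Scheduling">
          <artwork>
```python
class AggregatorPool:
    """Toy model of a switch's scarce aggregator memory managed with
    priority-based preemption; capacity is in abstract slots."""

    def __init__(self, total_slots):
        self.total_slots = total_slots
        self.running = []  # (priority, task_id, slots) per admitted task

    def used(self):
        return sum(slots for _, _, slots in self.running)

    def schedule(self, task_id, priority, slots):
        """Admit a task, preempting strictly lower-priority tasks when
        memory is short. Returns (admitted, preempted_task_ids)."""
        preempted = []
        victims = sorted(self.running)  # lowest priority first
        while self.used() + slots > self.total_slots and victims:
            vp, vid, vslots = victims.pop(0)
            if vp >= priority:
                break  # never preempt equal or higher priority
            self.running.remove((vp, vid, vslots))
            preempted.append(vid)
        if self.used() + slots > self.total_slots:
            return False, preempted
        self.running.append((priority, task_id, slots))
        return True, preempted
```
</artwork>
        </figure>

        <t>For example, with 10 aggregator slots, admitting a priority-5
        task that needs 5 slots preempts a priority-1 task already holding 8
        slots, so the data plane never idles memory on low-priority
        work.</t>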
      </section>

      <section title="Topology Management">
        <t>Topology awareness and mapping are REQUIRED in order to move some
        of the end-host computation onto network nodes for collective
        communication optimization. In many collective operation tasks, the
        logical relationship between nodes is usually described as a graph
        and then mapped onto the physical network. Therefore, collective
        communication offloading requires awareness of the network topology
        and efficient mappings.</t>

        <t>Use Case. In the parameter server architecture commonly used in
        distributed training, the parameter server can be reasonably mapped
        to the spine switches of a fat-tree physical network once the
        network topology is known. Under this mapping, traffic paths are
        simplified and the traffic volume of the whole network is greatly
        reduced. Compared to the traditional collective communication mode,
        the optimized end-to-network or end-to-network-to-end mode with
        topology awareness and mapping brings the physical and logical
        topologies closer together and makes them more unified.</t>

        <figure align="center"
                title="Topology Management and Topology Mapping">
          <artwork type="ascii-art">                 Logical Topology
                +----------------+
                |Parameter Server|
                +--------+-------+
                         |
          +----------+---+------------+
          |          |                |
     +----+---+ +----+---+        +---+----+
     |Worker 1| |Worker 2| ...... |Worker n|
     +--------+ +--------+        +--------+
                         |Mapping
                         |
              +----------+---------+
              |Management &amp; Control|
              |                    |
              | Topology Awareness |
              | Paths Planning     |
              +----------+---------+
                         |
                         |Mapping
                         v
                 Physical Topology
             +-----+         +-----+
             |Spine|         |Spine|
             +--+--+         +--+--+
                |               |
     +----------+-+------------++-----------+
     |            |            |            |
  +--+--+      +--+--+      +--+--+      +--+--+
  |Leaf |      |Leaf |      |Leaf |      |Leaf |
  +--+--+      +--+--+      +--+--+      +--+--+
     |            |            |            |
  +--+--+      +--+--+      +--+--+      +--+--+
  |     |      |     |      |     |      |     |
+-+-+ +-+-+  +-+-+ +-+-+  +-+-+ +-+-+  +-+-+ +-+-+
|GPU| |GPU|  |GPU| |GPU|  |GPU| |GPU|  |GPU| |GPU|
+---+ +---+  +---+ +---+  +---+ +---+  +---+ +---+</artwork>
        </figure>
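        <t>The traffic reduction claimed above can be sketched with a
        back-of-the-envelope model. The hop counts and functions below are
        hypothetical illustrations for a two-tier leaf-spine fabric, not
        measurements: they compare the total link traversals of gradient
        traffic when the parameter server runs on an end host versus when
        aggregation is mapped onto a spine switch.</t>

        <figure align="left"
                title="Illustrative Traffic Comparison for Topology-Aware Mapping">
          <artwork>
```python
def host_ps_traffic(n_workers, grad_bytes, hops_to_ps_host=4):
    # Each worker's gradient climbs leaf, spine, leaf, and the PS host
    # link (4 hops assumed), and the updated parameters travel the same
    # path back down.
    return 2 * n_workers * grad_bytes * hops_to_ps_host

def spine_ps_traffic(n_workers, grad_bytes, hops_to_spine=2):
    # With the parameter-server role mapped onto a spine switch,
    # gradients stop after leaf + spine (2 hops assumed), and results
    # travel 2 hops back down.
    return 2 * n_workers * grad_bytes * hops_to_spine
```
</artwork>
        </figure>

        <t>Under these assumed hop counts, spine-mapped aggregation halves
        the per-iteration gradient traffic crossing the fabric.</t>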
      </section>

      <section title="Interface Management">
        <t>Collective communication interfaces MUST be defined and managed
        to shield application developers from tedious network engineering
        details, such as flow control, packet organization, and
        chip-specific programming languages. Otherwise, application
        developers would need a great deal of arcane knowledge and
        expertise, which is beyond what they are willing to acquire and
        hinders the evolution of emerging applications.</t>

        <t>Use Case. Industry and academia have in fact proposed
        abstractions of collective communication operations, such as the
        collective communication libraries MPI, NCCL, and NetRPC <xref
        target="NetRPC"/>. In the control plane, these interfaces need to be
        configured and instantiated to realize part of the collective
        communication functionality.</t>
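        <t>The following sketch shows the kind of high-level interface the
        requirement calls for: the application invokes a collective
        operation on message-level data while packetization, flow control,
        and device programming stay hidden behind the API. The class and
        method names are illustrative inventions, not any real library's
        interface, and the body simply sums locally where a real
        implementation would negotiate offload targets and multicast groups
        with the control plane.</t>

        <figure align="left" title="Illustrative Collective Interface">
          <artwork>
```python
class CollectiveContext:
    def __init__(self, world):
        # world: one list of values per participating rank, standing in
        # for the per-rank send buffers a real library would register.
        self.world = world

    def allreduce(self):
        """Element-wise sum across all ranks; every rank observes the
        same result, with no network detail exposed to the caller."""
        sums = [0] * len(self.world[0])
        for buf in self.world:
            for i, v in enumerate(buf):
                sums[i] += v
        return sums
```
</artwork>
        </figure>

        <t>For example, CollectiveContext([[1, 2], [3, 4]]).allreduce()
        yields [4, 6] for every rank.</t>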
      </section>

      <section title="Data Management">
        <t>The semantic gap between application data units and network data
        units is REQUIRED to be bridged. This semantic mismatch poses a huge
        obstacle to the underlying network's support for upper-layer
        applications.</t>

        <t>Use Case. In the distributed training of LLMs, AllReduce, a
        common collective communication operation, involves the transport of
        large amounts of message-level gradient or token data, which is
        orders of magnitude larger than a network packet. Because network
        packets were not originally designed for large data transfers, there
        is a natural mismatch between the network and collective
        communication.</t>
      </section>

      <section title="Computational Resources Management">
        <t>The fusion of communication and computation operators is now a
        popular, consensus-based way to optimize LLM training and HPC. This
        practice makes the scheduling of communication and computation more
        efficient. A universal and direct way is to take computation into
        account when designing communication operators in communication
        libraries. In this process, unified management of computational
        resources, e.g., GPU AI cores or CPU cores, is REQUIRED to optimize
        the scalability of the computing cluster.</t>

        <t>Use Case. Training large language models (LLMs) efficiently in a
        distributed setting is a challenging task, primarily due to
        communication bottlenecks that arise when multiple nodes need to
        synchronize their data. Overlapping communication with computation
        via kernel fusion is a key strategy to improve training efficiency.
        This approach reduces idle time by initiating communication
        operations while the CPU or GPU is still performing computations,
        e.g., fusing AllGather and batched matrix multiplication into a
        single larger kernel to reduce kernel launch overhead and better
        overlap communication with computation.</t>
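        <t>The overlap pattern can be sketched with ordinary threads as a
        stand-in for GPU streams: while one chunk is being communicated in
        the background, the next chunk is computed, instead of alternating
        the two phases serially. The function and its arguments are
        illustrative, not a real communication library's API.</t>

        <figure align="left"
                title="Illustrative Communication/Computation Overlap">
          <artwork>
```python
import threading, queue

def overlap(chunks, compute, communicate):
    """Pipeline: a background thread communicates finished chunks while
    the main thread keeps computing the remaining ones."""
    q = queue.Queue()
    results = []

    def sender():
        while True:
            item = q.get()
            if item is None:
                break  # sentinel: no more chunks
            results.append(communicate(item))

    t = threading.Thread(target=sender)
    t.start()
    for c in chunks:
        q.put(compute(c))  # hand off as soon as each chunk is ready
    q.put(None)
    t.join()
    return results
```
</artwork>
        </figure>

        <t>The FIFO queue preserves chunk order, so the result is the same
        as the serial compute-then-communicate loop, with the two phases
        running concurrently.</t>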
      </section>
    </section>

    <section anchor="Security" title="Security Considerations">
      <t>TBD.</t>
    </section>

    <section anchor="IANA" title="IANA Considerations">
      <t>TBD.</t>
    </section>
  </middle>

  <back>
    <references title="Normative References">
      <?rfc include="reference.RFC.2119"?>
    </references>

    <references title="Informative References">
      <reference anchor="ESA">
        <front>
          <title>Efficient Data-Plane Memory Scheduling for In-Network
          Aggregation</title>

          <author fullname="Hao Wang" surname="Wang">
            <organization>iSING Lab, Hong Kong University of Science and
            Technology</organization>
          </author>

          <date year="2022"/>
        </front>
      </reference>

      <reference anchor="NetRPC">
        <front>
          <title>NetRPC: Enabling In-Network Computation in Remote Procedure
          Calls</title>

          <author fullname="Bohan Zhao" surname="Zhao">
            <organization>Tsinghua University</organization>
          </author>

          <date year="2023"/>
        </front>
      </reference>
    </references>
  </back>
</rfc>
