Internet-Draft | Routing in Dragonfly+ Topologies | March 2024 |
Afanasiev, et al. | Expires 5 September 2024 | [Page] |
This document provides an overview of the Dragonfly+ network topology and describes a routing implementation for IP networks with Dragonfly+ topology, including support for non-minimal routing.¶
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.¶
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.¶
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."¶
This Internet-Draft will expire on 5 September 2024.¶
Copyright (c) 2024 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.¶
Dragonfly [KIM2008] is a highly scalable, low-diameter, cost-efficient network topology that provides high bandwidth and large path diversity. Dragonfly topology was originally designed for HPC and supercomputing systems and is being adopted in a growing number of supercomputing networks. Its properties also make it an interesting candidate for data center network topology, especially the Dragonfly+ variant [SPHINER2017] with leaf-spine intra-group topology. However, building IP networks with Dragonfly+ topology is a non-trivial problem because IP networks lack many mechanisms traditionally available in HPC interconnection networks. Specifically, Dragonfly+ relies heavily on non-minimal routing and adaptive load balancing for efficient use of available network capacity.¶
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.¶
This section introduces the terminology used in this document.¶
Network design requirements are largely the same as in [RFC7938]. The most notable difference is the extensive use of non-minimal paths.¶
Body text¶
Dragonfly topology was introduced by Kim et al. [KIM2008]. It aims to decrease the cost and diameter of the network while providing good scalability. Dragonfly is a hierarchical topology that divides routers into groups connected by long (inter-group) links in a fully connected global network. Each group essentially implements a high-radix virtual router. Dragonfly is a direct topology, in which every router has a set of terminal connections leading to endpoints and a set of topological connections leading to other routers, some in the same group and some in other groups. While the original Dragonfly uses a fully connected intra-group topology, it does not prevent the use of other intra-group topologies. Different intra-group topologies produce different Dragonfly "flavors". The inter-group topology is always fully connected.¶
Dragonfly+ as proposed in [SPHINER2017] relies on an extended group topology in which intra-group routers are connected as a bipartite graph (leaf-spine or Clos-like topology). Dragonfly+ improves on conventional Dragonfly by supporting a significantly larger number of hosts. In addition, Dragonfly+ provides similar or better bisection bandwidth for various traffic patterns and requires a smaller number of buffers to avoid credit loop deadlocks in lossless networks. Dragonfly+ is an indirect topology in which only leaf nodes are connected to endpoints. TODO: spine sizing.¶
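The structure described above can be made concrete with a small model. The following Python sketch is illustrative only: the node naming, the function names, and in particular the global wiring rule (spine i of each group links to spine i of every other group) are assumptions made for this example and are not taken from [SPHINER2017] or from this document.

```python
from itertools import combinations


def build_dragonfly_plus(groups, leaves, spines):
    """Return (local_links, global_links) for a toy Dragonfly+ fabric.

    Nodes are tuples (group, role, index), role being "leaf" or "spine".
    Assumed wiring (hypothetical): spine i of every group has one global
    link to spine i of every other group.
    """
    local_links = set()
    global_links = set()
    for g in range(groups):
        # Intra-group: complete bipartite graph between leaves and spines.
        for leaf in range(leaves):
            for spine in range(spines):
                local_links.add(((g, "leaf", leaf), (g, "spine", spine)))
    # Inter-group: every pair of groups is directly connected.
    for a, b in combinations(range(groups), 2):
        for spine in range(spines):
            global_links.add(((a, "spine", spine), (b, "spine", spine)))
    return local_links, global_links


def groups_fully_connected(groups, global_links):
    """True if every pair of groups shares at least one global link."""
    connected = {frozenset((u[0], v[0])) for u, v in global_links}
    return all(
        frozenset(pair) in connected
        for pair in combinations(range(groups), 2)
    )


local_links, global_links = build_dragonfly_plus(groups=4, leaves=2, spines=2)
print(len(local_links), len(global_links),
      groups_fully_connected(4, global_links))  # prints: 16 12 True
```

Only spines carry global links and only leaves would attach endpoints, which is what makes the modeled fabric indirect. A real deployment would replace the assumed wiring rule with its own scheme.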
In Dragonfly and Dragonfly+ topologies there exists at least one direct global link between every pair of groups. Minimal inter-group routes traverse a single global link. The capacity of minimal routes between each pair of groups is lower than the aggregate link capacity of hosts in a group. Therefore, conventional minimal routing is not enough to obtain maximal throughput and efficiently support various traffic patterns. [KIM2008] introduces the concept of non-minimal adaptive routing. For Dragonfly+ we can define three priority levels of inter-group routes. We use the notations "L" and "G" below to express whether the route traverses a local or a global link, respectively.¶
LGLLGL routes normally appear only when some spines are not connected to at least one spine in every other group. In this case, non-minimal routes through an intermediate group might need to use different ingress and egress spines in the intermediate group. TODO: discuss imbalance, density and LGLLGL routes [WILKE2017]¶
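To illustrate the "L"/"G" notation, the following Python sketch derives the hop string of a path from the groups of consecutive nodes. The node representation (group, role, index), the function name, and the example paths are assumptions made for this illustration.

```python
def signature(path):
    """Return the L/G string for a path of (group, role, index) nodes.

    A hop is "L" (local) when both ends are in the same group and
    "G" (global) otherwise.
    """
    return "".join(
        "L" if a[0] == b[0] else "G"
        for a, b in zip(path, path[1:])
    )


# Source leaf in group 0, destination leaf in group 2, intermediate group 1.
src, dst = (0, "leaf", 0), (2, "leaf", 0)

minimal = [src, (0, "spine", 0), (2, "spine", 0), dst]
via_same_spine = [src, (0, "spine", 0), (1, "spine", 0), (2, "spine", 0), dst]
via_two_spines = [src, (0, "spine", 0), (1, "spine", 0), (1, "leaf", 0),
                  (1, "spine", 1), (2, "spine", 1), dst]

for name, path in [("minimal", minimal),
                   ("same spine", via_same_spine),
                   ("two spines", via_two_spines)]:
    sig = signature(path)
    print(name, sig, "global hops:", sig.count("G"))
```

The minimal path yields LGL with a single global hop. A detour that enters and leaves the intermediate group on the same spine yields LGGL, and a detour that has to change spines inside the intermediate group yields LGLLGL, which matches the case described above.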
One possible implementation is described in [WILKE2017]. TODO: describe wiring scheme invariant under group rotation (consistent renumbering of all groups by the same offset mod number of groups).¶
While the routing and forwarding setup described in this document makes it possible to propagate reachability information and install the forwarding state required for Dragonfly+ topologies, including non-minimal paths, it is not enough to use Dragonfly network capacity efficiently, especially in the presence of LGLLGL paths. Efficient mapping of traffic to paths in a Dragonfly network cannot be described by static mechanisms because ideally we would like to¶
This requires dynamic adaptive load balancing and coupling between adaptive load balancing and congestion control. Adaptive load balancing MUST be able to work without complete knowledge of network link utilization and queue state, since such state can change significantly over a period of several RTTs, and collecting and distributing global network utilization information often enough is infeasible in any network of practically interesting size. Adaptive routing can also work as a complementary failure handling mechanism with a much faster reaction time than routing convergence. TODO: separate document describing possible adaptive load balancing implementation using existing mechanisms.¶
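As an illustration of a decision that uses only locally available state, the following Python sketch picks a next hop by comparing local egress queue depth weighted by remaining hop count, in the spirit of the UGAL family of algorithms evaluated in [KIM2008]. It is not the mechanism proposed by this document (that is left to a separate document); the class, field names, and cost function are assumptions chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class NextHop:
    name: str
    hops: int         # remaining hop count along this path
    queue_depth: int  # local egress queue occupancy (hypothetical unit)


def pick_next_hop(candidates):
    """Pick the candidate with the smallest estimated cost.

    Cost is queue_depth * hops, computed from local state only.
    Ties are broken in favor of the shorter path, so an idle network
    always uses minimal routes.
    """
    return min(candidates, key=lambda c: (c.queue_depth * c.hops, c.hops))


congested = [NextHop("minimal", hops=3, queue_depth=10),
             NextHop("detour", hops=4, queue_depth=5)]
idle = [NextHop("minimal", hops=3, queue_depth=0),
        NextHop("detour", hops=4, queue_depth=0)]

print(pick_next_hop(congested).name)  # prints: detour
print(pick_next_hop(idle).name)       # prints: minimal
```

A decision of this kind needs no global utilization data, but it only sees congestion on the first hop. That limitation is one reason coupling with congestion control is called for above.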
This section describes a routing design that supports non-minimal paths. It uses only existing mechanisms: VRFs, route leaking, and EBGP as the routing protocol. EBGP is chosen for scalability and flexibility: routing policies and communities make it possible to implement additional logic and precisely control the propagation of routing updates. The routing design is based on the following principles:¶
To achieve the desired forwarding behavior, several VRFs are configured on every spine:¶
An additional VRF serving as a virtual link is configured if the network is using LGLLGL paths: a "reflect" VRF in each group containing local links. Since both the local VRF and the reflect VRF include leaf-spine links, some form of VRF multiplexing over leaf-spine links is required when LGLLGL paths are used.¶
Local VRF:
- imports minimal and non-minimal paths from the core VRF and installs them¶
Core VRF:
- imports locally originated paths from the local VRF in each group¶
- imports transit paths from the reflect VRF¶
Reflect VRF:
- imports minimal paths from the core VRF¶
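The import relationships listed above can be summarized as a small lookup table. The following Python sketch is a model for reasoning about which leaks are permitted, not router configuration; the path-kind labels and the function name are assumptions introduced for this example.

```python
# importing VRF -> {source VRF: kinds of paths accepted from it}
# The kind labels are hypothetical names for the categories in the text.
IMPORTS = {
    "local":   {"core": {"minimal", "non-minimal"}},
    "core":    {"local": {"local-origin"}, "reflect": {"transit"}},
    "reflect": {"core": {"minimal"}},
}


def may_import(dst_vrf, src_vrf, kind):
    """True if dst_vrf accepts paths of the given kind from src_vrf."""
    return kind in IMPORTS.get(dst_vrf, {}).get(src_vrf, set())


print(may_import("local", "core", "non-minimal"))    # prints: True
print(may_import("reflect", "core", "non-minimal"))  # prints: False
print(may_import("local", "reflect", "minimal"))     # prints: False
```

In this model the reflect VRF accepts only minimal paths from the core VRF, and the local VRF never imports directly from the reflect VRF.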
Each group is in a separate AS. Communities, routing policies and update propagation:¶
During import into local VRFs prepend ASPATH:¶
As a result, paths with C1, C2 and C3 will all have the same ASPATH length in local VRFs and will be eligible for ECMP.¶
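The effect of prepending can be illustrated with a short Python sketch that pads every path to the length of the longest one. The AS numbers, the original path lengths assigned to each community, and the choice of prepending the local AS are all assumptions invented for this example; the actual prepend counts are defined by the import policy.

```python
def equalize_as_path(paths, local_as):
    """Prepend local_as so that every AS path has the same length.

    paths maps a community label to its AS path (list of AS numbers).
    """
    target = max(len(as_path) for as_path in paths.values())
    return {
        label: [local_as] * (target - len(as_path)) + as_path
        for label, as_path in paths.items()
    }


# Hypothetical paths as seen before import into the local VRF.
paths = {
    "C1": [65003],
    "C2": [65002, 65003],
    "C3": [65004, 65002, 65003],
}

equalized = equalize_as_path(paths, local_as=65001)
for label, as_path in equalized.items():
    print(label, as_path, len(as_path))
```

After padding, all three paths have length 3, so the AS path length step of best path selection no longer prefers one over the others.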
TODO¶
TODO¶
Body text¶
This memo includes no request to IANA.¶
This document should not affect the security of the Internet.¶