# Network-Accelerated Machine Learning Systems

## Network Abstraction for Distributed Deep Learning Training

**Investigators: Yiting Xia, Jialong Li, and Yiming Lei, in cooperation with Rui Pan (Princeton University), Zhiqiang Xie (Stanford University), and Binhang Yuan (ETH Zürich)**

Recent years have witnessed the rapid development of deep learning: each leap in the model quality comes with increased scales of neural networks, from AlexNet with 61M parameters in 2012 to MT-NLG with 530B parameters in 2022. Various parallel strategies have beenadopted by distributed deep learning training (DDLT) frameworks to accommodate the ever-growing model sizes. As a result, communication among distributed workers, especially over a shared, highly dynamic network with competing training jobs, has become a notable bottleneck of the training process.

The networking community has a long history of resolving bandwidth contentions with flow scheduling, from individual flow scheduling to Coflow scheduling. Surprisingly, despite the popularity of DDLT applications, we have found no flow scheduling solution supporting the diverse DDLT paradigms in GPU clusters! Our analysis suggests two reasons. The first reason is due to the challenge of defining a global optimization goal across training jobs. The various DDLT paradigms implement drastically different workflows, which may translate into incompatible network requirements causing network-wide optimization to diverge. As such, communication optimizations for DDLT focus on data parallelism only, and most work conduct per-job optimization, with estimations of the available bandwidth. Pioneering explorations for flow scheduling in DDLT, also limited to data parallelism, faced exactly this problem. Particularly, CadentFlow identified multiple performance metrics, e.g., weights, deadlines, and priorities, which may pull the optimization from different directions; MLNet proposed to schedule flows by priorities, but how to set priorities to reflect application needs is unknown. The second reason is the lack of network abstraction for DDLT. The Coflow network abstraction for traditional cluster applications falls short in DDLT. Coflow defines a collection of semantically-related flows and minimizes the completion time of the last flow. This goal motivates the optimizer to schedule the flows to finish at the same time. Oftentimes in DDLT, though, the followup computations consuming the flow data do not start at the same time. Taking pipeline parallelism for example, each worker computes on sequential micro-batches and sends the results to a successor worker. Consecutive workers have data dependencies, and the computations per worker follow the order of input data. For high training throughput, GPU workers must be well coordinated to preserve the pipeline throughout the training lifetime. Delay or reordering of data may increase GPU idleness and educe training efficiency. To match this strict computation pattern, data flows across micro-batches should (ideally) finish in a staggered manner. Formulating the flows as a Coflow tends to finish them simultaneously, making the duration of this computation phase even longer than bandwidth fair sharing!

Through extensive workflow analysis, we generalize this observation to other DDLT paradigms: regardless of the great diversity, each DDLT paradigm has a unique, pre-defined computation pattern that regulates the finish times of flow transmissions. These computation patterns, which are essentially computation dependencies (i.e., DAG) and times, are prevalent in distributed applications. Yet, the repetitiveness of DDLT jobs, e.g., similar or identical computations across training layers and iterations, makes it possible to extract the patterns through computation profiling and convey the application-level guidelines to network flows. Following this insight, we aspire to fill the gap of flow scheduling in DDLT. We propose the EchelonFlow network abstraction to finish flows according to strict DDLT computation patterns [1], and with EchelonFlow comes our global optimization goal of minimizing communication time while maintaining the computation patterns, like preserving the arrangement of an echelon formation. EchelonFlow s the first network abstraction for flow scheduling in diverse DDLT paradigms. It is also extensible to future DDLT paradigms, as long as their computation patterns can be profiled.

Our contributions in this project are from four fronts. (1) We formally define EchelonFlow and formulate a global optimization goal for it. (2) We prove important properties of EchelonFlow. Particularly, EchelonFlow scheduling can minimize completion times of mainstream DDLT paradigms, and EchelonFlow is a superset of Coflow. (3) Through case studies, we show the expressiveness of EchelonFlow by presenting popular DDLT paradigms with the EchelonFlow abstraction. (4) We sketch the system implementation to discuss the practicality of EchelonFlow scheduling.

**References**

[1] R. Pan, Y. Lei, J. Li, Z. Xie, B. Yuan, and Y. Xia. Eﬃcient ﬂow scheduling in distributed deep learning training with echelon formation. In HotNets ’22, 21st ACM Workshop on Hot Topics in Networks, Austin, TX, USA, 2022, pp. 93–100. ACM.

## Network-Aware GPU Sharing for Distributed Deep Learning

**Investigators: Yiting Xia, Jialong Li, and Yiming Lei, in cooperation with Vadim Farutin (Saarland University)**

In recent years, the advent of distributed training jobs has significantly hastened the deployment of GPU clusters. However, the prevalent strategy for running training jobs on these clusters is exclusive, which restricts one GPU to be assigned to a single training job. This monopolistic approach leads to low utilization of GPU clusters. Empirical data obtained from Alibaba’s production GPU clusters indicates that a mere 10% of the GPUs manage to achieve more than 80% GPU utilization, highlighting the considerable potential for enhancing GPU utilization.

With the aim of improving GPU utilization, recent research has explored the utilization of GPU resources in time-sharing way for training workloads, such as Gandiva, AntMan, and Salus. These approaches primarily focus on implementing GPU time-sharing among multiple jobs, considering the network as a black box. Generally, GPUs require network transmission after computation (Forward/Backward propagation). These approaches assume that computation is the bottleneck in DNN training and do not take network transmission time into account. However, as the complexity of models and datasets rapidly improves, network transmission has become a crucial factor in DNN training and cannot be ignored when applying GPU sharing.

In contrast to the above approaches, when scheduling training jobs, Muri considers four types of resources, including CPU, storage, and network resources. Muri assumes that at any given time, a training job can utilize only one type of resource, resulting in the idleness of other resources. Muri’s key idea involves grouping multiple jobs on the same GPU and using four types of resources in an interleaving manner. As a result, Muri achieves a lower job completion time (JCT) and higher resource utilization in comparison to other baselines. Despite achieving higher resource utilization, Muri faces several problems.

Problem 1: Muri assumes a fixed network transmission time. Muri assumes a fixed network transmission time by profiling resource usage duration in advance. This assumption is impractical since the network transmission time varies due to competition for bandwidth among multiple training jobs. Additionally, the network transmission time may differ in a new deployment environment with different bandwidth capacities. Although Muri considers network transmission time when applying resource sharing, the assumption of a fixed network transmission time is not applicable in practice.

Problem 2: Muri does not consider device placement when mapping jobs to GPUs. Muri assumes that each GPU occupied by the same training job will send/receive the same amount of data, and therefore only decides which jobs should be mapped to the same GPU without considering which partitions of these grouped jobs should be placed together. However, if this assumption does not hold, we need to consider where to place the partitions since the placement of partitions plays a crucial role in determining the JCT. A simple observation is that assigning two partitions with heavy-loaded traffic on the same GPU will result in a long network transmission time. With the rapid development of DNN training models, more training jobs will not follow this assumption. Therefore, besides grouping jobs, it is imperative to consider where to place partitions, taking into account the communication patterns and data traffic of each training job.

Problem 3: Muri only groups jobs using the same number of GPUs. Muri only groups jobs using the same number of GPUs, thereby missing opportunities to realize higher GPU utilization. The new design should aim to group any number of jobs, regardless of their GPU requirements.

To address the aforementioned problems, we propose the Network-Aware GPU Sharing (NAGS) design for DNN training jobs. The NAGS design introduces a new concept called sharing group, which accommodates an arbitrary number of training jobs with any number of GPUs. Two metrics, namely, sharing factor and sharing utilization, are used to evaluate the sharing group’s performance. The sharing factor measures the degradation in JCT due to GPU sharing, while the sharing utilization measures the GPU resource utilization. NAGS tries to answer this research question: For training jobs with arbitrary amounts of sending/receiving data, how can we group jobs and place partitions to minimize JCT degradation and maximize GPU resource utilization?

To answer this question, we first propose two theorems and provide the corresponding proofs. Based on these theorems, we derive the minimum sharing factor and the optimal partition placement for any given set of jobs. Furthermore, we prove that constructing sharing groups optimally is an NP-hard problem. To overcome this challenge, we develop heuristic methods to construct three types of sharing groups.

Notably, the NAGS design is not dependent on strong assumptions. Only two main assumptions are required in this work: (1) proﬁling the traffic matrix of each job in advance, and (2) applying the “big switch” network model. Preliminary results validate the superiority of NAGS over Muri and other baselines.