Cloud data centers are evolving from a gigantic monolithic cluster of computing servers to diverse forms. One trend is edge computing, where small data centers known as cloudlets are built close to users as easy access points for Internet of Things (IoT) devices. The other trend is the increasingly modular placement of hardware resources. For example, racks with disaggregated functions such as computing, memory, and storage have become fundamental building blocks of data centers, and multiple data centers are interconnected as a region in dense metro areas. These changes motivate new network architectures for faster transmission, more flexible connectivity, and greater extensibility for future expansion. This project pushes forward fundamental research on the future cloud fabric, combining theory and practice. We abstract the requirements of the next-generation cloud environment and design suitable network topologies using graph theory as the main tool. Then we realize these ideal network topologies through practical design of device packaging, wiring, traffic routing, fault tolerance, and network maintenance.
Optical data center network (DCN) fabrics are reshaping infrastructure design in the cloud. However, there is a gap between the diverse optical hardware architectures and the system integration work needed to realize them as end-to-end workable systems. This research direction is to design and implement practical systems that enable different optical architectures in production DCNs. To that end, we abstract fundamental building blocks for optical DCNs, including global time synchronization with nanosecond-scale accuracy, generic routing regardless of the optical hardware, and an application-agnostic host stack. So far, we have implemented a prototype system with P4 on Tofino2 switches and libvma on Mellanox NICs. Extensive micro-benchmark studies with production DCN traffic show that our system keeps synchronization errors under 15ns and ensures zero packet loss with 99.93% achievable network utilization. We demonstrate three optical architectures on the system with real DCN applications and observe similar flow completion times for mice flows compared to electrical DCNs. We are open-sourcing the system as a tool for the networking community to test, improve, and deploy optical DCNs.
Cloud traffic is growing far faster than bandwidth upgrades in data centers, because traditional electrical switches have hit scaling bottlenecks in port density and power consumption. Looking forward, optical networking has natural advantages as a solution for large-scale, high-performance, and power-efficient cloud infrastructure. First, the optical network transmits signals at ultra-high bandwidth regardless of the modulation speed, which makes it future-proof against ever-growing bandwidth requirements. Second, many optical components are passive, and optical switches consume orders of magnitude less power than their electrical counterparts. Third, some optical switching technologies can scale to high port density, eliminating the inefficient hierarchies in electrical networks. This project explores the opportunities of optical cloud infrastructure through hardware-software co-design. We start from the physical-layer optical network interconnects and move up the network stack to make routing and transport protocols, systems, and network applications adapt to the hardware innovations. The goal is to provide an all-in-one solution for cloud service developers with a simple optical-network library that hides low-level complications.
Recent years have witnessed the rapid development of deep learning. Distributed deep learning training (DDLT) frameworks have adopted various parallel strategies to accommodate ever-growing model sizes. As a result, communication among distributed workers, especially over a shared, highly dynamic network with competing training jobs, has become a notable bottleneck of the training process. We aim to accelerate inter-node communication in machine learning systems. In one project, we propose the first network abstraction for DDLT and devise a generic method to model the drastically different computation patterns across training paradigms. We use the abstraction for flow scheduling in DDLT jobs and demonstrate its effectiveness with case studies. In another project, we introduce network-aware GPU sharing to improve the efficiency of job placement and GPU scheduling in machine learning clusters. Whereas previous scheduling mechanisms assume a fixed data transmission time, we model the network dynamically for the first time and provide tighter bounds on the data transmission time. Simulation results show that our scheduling method achieves high GPU utilization with minimal slowdown of training time.
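To illustrate why a fixed transmission-time assumption can mislead a scheduler, the following minimal sketch compares a fixed-bandwidth estimate with one computed from a time-varying bandwidth trace. It is not the model used in this project: the function names, the piecewise-constant trace format, and the numbers are all assumptions made for illustration.

```python
def fixed_time(size_gb, bandwidth_gbps):
    """Transfer time in seconds assuming constant bandwidth (gigabytes, Gbit/s)."""
    return size_gb * 8 / bandwidth_gbps


def dynamic_time(size_gb, trace):
    """Transfer time when available bandwidth changes over time.

    trace is a hypothetical list of (duration_s, available_gbps) segments;
    the last segment is assumed to persist until the transfer finishes.
    """
    remaining = size_gb * 8  # gigabits still to send
    elapsed = 0.0
    for i, (duration, bw) in enumerate(trace):
        is_last = i == len(trace) - 1
        # The transfer ends inside this segment if it has enough capacity,
        # or if this is the final (open-ended) segment.
        if bw > 0 and (is_last or bw * duration >= remaining):
            return elapsed + remaining / bw
        if is_last:
            raise ValueError("transfer never finishes: final bandwidth is zero")
        remaining -= bw * duration
        elapsed += duration
    raise ValueError("empty trace")


# A 10 GB gradient exchange on a nominal 40 Gbit/s link.
nominal = fixed_time(10, 40)
# The same exchange when a competing job halves the bandwidth for one second.
contended = dynamic_time(10, [(1.0, 40), (1.0, 20), (10.0, 40)])
```

In this example the fixed estimate is 2.0 seconds while the trace-based estimate is 2.5 seconds, so a scheduler that relies on the fixed value would underestimate when the GPU becomes free again.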
The complexity of large networks makes their management a daunting task. State-of-the-art network management systems program workflows of operational steps with arbitrary scripts, which pose substantial challenges to reliability. We leverage the fact that most modern network management systems are backed by a source-of-truth database and adapt database techniques to the context of network management. The network management framework exposes a programming model through which network operators convey the key management logic. Operators are then shielded from reliability concerns, such as distributed devices, operational conflicts, task atomicity, and failures, which are instead handled by the runtime system using database techniques. Our simulation evaluation and production case studies demonstrate the system’s effectiveness in minimizing network vulnerable time and resolving task conflicts. We open-source our simulator and task traces for academic researchers to contribute to this industrial problem.
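The idea of a runtime that handles conflicts and atomicity on behalf of the operator can be sketched with a toy in-memory store. This is only an illustration under simplifying assumptions, not the project's framework: the class `SourceOfTruth`, the method `run_task`, and the device names are hypothetical, and a real system would also have to coordinate with the physical devices.

```python
import copy


class ConflictError(Exception):
    """Raised when a task needs a device that another task already holds."""


class SourceOfTruth:
    """Toy in-memory stand-in for a source-of-truth database (hypothetical)."""

    def __init__(self, state):
        self.state = state  # device name -> config dict
        self.locks = {}     # device name -> name of the task holding it

    def run_task(self, task_name, devices, steps):
        # Conflict detection: refuse to start if any device is already held.
        for dev in devices:
            holder = self.locks.get(dev)
            if holder is not None:
                raise ConflictError(f"{dev} is held by {holder}")
        for dev in devices:
            self.locks[dev] = task_name
        snapshot = copy.deepcopy(self.state)
        try:
            for step in steps:
                step(self.state)
        except Exception:
            # Atomicity: any failing step rolls the whole task back.
            self.state = snapshot
            return False
        finally:
            for dev in devices:
                self.locks.pop(dev, None)
        return True


def drain(name):
    """Build a step that marks one device as drained."""
    def step(state):
        state[name]["drained"] = True
    return step


def unreachable(state):
    """A step that always fails, simulating a device failure."""
    raise RuntimeError("device unreachable")


db = SourceOfTruth({"sw1": {"drained": False}, "sw2": {"drained": False}})
committed = db.run_task("drain-sw1", ["sw1"], [drain("sw1")])
rolled_back = db.run_task("drain-sw2", ["sw2"], [drain("sw2"), unreachable])
```

Here the first task commits, while the second fails midway and leaves `sw2` in its original state. The operator writes only the steps; locking and rollback live in the runtime.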