Emerging services and hardware advancements are reshaping the landscape of cloud computing. Cloud data centers are evolving from gigantic monolithic clusters of computing servers into more diverse forms. One trend is edge computing, where small data centers known as cloudlets are built close to users as easy access points for Internet of Things (IoT) devices. The other trend is the increasingly modular placement of hardware resources. For example, racks with disaggregated functions such as computing, memory, and storage have become fundamental building blocks of data centers, and multiple data centers in dense metro areas are interconnected to form a region. These changes motivate new network architectures that offer faster transmission, more flexible connectivity, and greater extensibility. This project pushes forward fundamental research on the future cloud fabric by combining theory and practice. We abstract the requirements of the next-generation cloud environment and use graph theory to design suitable network topologies. We then realize these ideal topologies through practical design of device packaging, wiring, traffic routing, fault tolerance, and network maintenance.
Cloud traffic is growing far faster than data center bandwidth can be upgraded, because traditional electrical switches have hit scaling bottlenecks in port density and power consumption. Looking forward, optical networking has natural advantages as a solution for large-scale, high-performance, and power-efficient cloud infrastructure. First, an optical network transmits signals at ultra-high bandwidth regardless of the modulation speed, which makes it future-proof against ever-growing bandwidth requirements. Second, many optical components are passive, and optical switches consume orders of magnitude less power than their electrical counterparts. Third, some optical switching technologies can scale to high port density, eliminating the inefficient switch hierarchies found in electrical networks. This project explores the opportunities of optical cloud infrastructure through hardware-software co-design. We start from the physical-layer optical interconnects and move up the network stack, adapting routing and transport protocols, systems, and network applications to the hardware innovations. The goal is to provide cloud service developers with an all-in-one solution: a simple optical-network library that hides low-level complications.
Today's machine learning systems are increasingly distributed, as user services (e.g., computer vision, natural language processing, and recommendation systems) keep producing bigger models and demanding higher training speed. In distributed parallel training, a number of parameter servers and workers spread across different machines iteratively exchange high-volume, high-fanout gradients. In recent years, the remarkable success of computation accelerators such as GPUs and TPUs has shortened computation time so much that communication now dominates, making many machine learning jobs network-bound. This project aims to accelerate inter-node communication in machine learning systems. We approach the problem from two ends. One end is to benchmark the performance of machine learning systems under different network conditions to understand their fundamental requirements on the network, so as to rethink existing solutions such as RDMA. The other end is to leverage network offloading facilities, such as programmable switches and smart NICs, as network accelerators to speed up bottleneck operations.
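To make the notion of a network-bound job concrete, the following is a minimal back-of-the-envelope sketch, not the project's benchmarking methodology. It assumes a simplified parameter-server model in which each worker pushes its full gradient and pulls the full parameter set once per iteration, with no overlap between computation and communication. The class `TrainingJob`, its fields, and the functions `comm_seconds` and `is_network_bound` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class TrainingJob:
    """Hypothetical per-worker description of one training iteration."""
    model_params: int       # number of model parameters
    bytes_per_param: int    # e.g. 4 for float32 gradients
    compute_seconds: float  # accelerator time per iteration
    link_gbps: float        # worker link bandwidth in gigabits per second


def comm_seconds(job: TrainingJob) -> float:
    """Estimate per-iteration communication time for one worker.

    Assumption: one gradient push plus one parameter pull, so twice the
    model size crosses the worker's link each iteration.
    """
    bits = 2 * job.model_params * job.bytes_per_param * 8
    return bits / (job.link_gbps * 1e9)


def is_network_bound(job: TrainingJob) -> bool:
    """A job is network-bound here if communication outlasts computation."""
    return comm_seconds(job) > job.compute_seconds
```

Under these assumptions, a job with 100 million float32 parameters and 0.1 s of computation per iteration is network-bound on a 10 Gbps link and compute-bound on a 100 Gbps link. Shrinking `compute_seconds`, as a faster accelerator would, pushes the same job back toward being network-bound.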