Network and Cloud Systems

Low-Latency Routing in Optical Data Center Networks

Investigators: Yiting Xia, Jialong Li, and Yiming Lei, in cooperation with Federico De Marchi (Saarland University), Raj Joshi (National University of Singapore), and Balakrishnan Chandrasekaran (Vrije Universiteit Amsterdam)


The growth of data center networks (DCNs) has largely benefited from Moore’s law for networking: the bandwidth of electrical switches doubles every two years at the same cost and power. As this bandwidth scaling slows down, the networking community has started exploring high-radix passive optical network interconnects, which have a lower per-port cost and consume less power than electrical switches. The latest optical DCN designs deliver up to 4 times the bandwidth and consume only 23%–26% of the power of a cost-equivalent electrical DCN. A typical optical DCN fabric comprises a number of optical switches that interconnect electrical top-of-rack switches (ToRs) and end servers (see Fig. 34.1). The fabric uses circuit switching to establish dedicated optical circuits that are time-shared among the different ToR pairs. The delays incurred in establishing the circuits, however, substantially affect the latency-sensitive traffic (or “mice” flows) that uses these circuits.

In this work, we focus on minimizing the impact of these delays on latency-sensitive traffic. Prior attempts at solving this problem either use an electrical-optical hybrid network and send mice flows over the “always-on” electrical network, or reduce the circuit-establishment delays using novel optical switching hardware. The electrical-optical dual fabric, however, doubles the deployment and maintenance costs of DCNs, while novel switching hardware requires extensive customizations to commercial network devices and the standard network stack, e.g., to adapt to the 1 ns optical switching speed in Sirius. In contrast to such prior work, we propose a simple solution that leverages programmable switches: a routing algorithm with the specific objective of accelerating mice flows. The idea of leveraging routing to accelerate latency-sensitive flows has been used in prior work, albeit within a narrow scope. Opera pursued a meticulous co-design of the optical network topology and routing to guarantee that mice flows always have (multi-hop) optical paths available via intermediate ToRs. Like prior designs, however, Opera assumes that packets must be buffered on end servers, as optical switches are bufferless. As soon as an optical path is available, packets hop on that path and ride it to the destination. They cannot hop off at intermediate ToRs even if a different optical path later offers an earlier arrival time at the destination. Routing in Opera is, hence, sub-optimal: it searches for non-stop paths rather than the fastest paths. We offer support for packets to “hop off” at ToRs by rethinking packet buffering on ToRs.

Buffering at ToRs was previously deemed impossible due to the limited packet buffer on switches, the difficulty in synchronizing switches to coordinate with optical circuit configurations, and the lack of processing logic for scheduling packet transmissions at precise times. Recent programmable switches offer rich functionality that clears these technical obstacles. Switches can, for example, provide temporary buffering for a small number of packets, be time-synchronized at nanosecond-level precision, and provide time-based scheduled packet transmission via calendar queues. We exploit these recent innovations in programmable switches to realize a novel routing algorithm that minimizes the delays experienced by latency-sensitive flows. Our contributions are as follows [1]: (a) we present a Hop-On Hop-Off (HOHO) routing algorithm that provides the fastest paths: packets can “hop on” and “hop off” at intermediate ToRs to select the optical paths that minimize their arrival time at the destination; (b) we prove the optimality and robustness of the HOHO algorithm, and sketch its implementation on programmable switches, including the time synchronization, routing lookup, and packet buffering mechanisms; (c) in our packet-level simulations with real DCN traffic, HOHO reduces the flow completion times (FCTs) of latency-sensitive flows by up to 35% and reduces the average path length by 15% compared to Opera. HOHO uses at most 7 queues per egress port and a packet buffer of about 3.24 MB, which is far below the capacity limit of commercial switch ASICs.
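
The benefit of hopping off at intermediate ToRs can be illustrated with a small earliest-arrival search over a cyclic circuit schedule. This is only a toy sketch, not the HOHO algorithm from [1]: it assumes that a packet traverses at most one circuit per time slice, that ToRs can buffer a waiting packet for any number of slices, and that the schedule is given as one ToR-to-peer mapping per slice. The names earliest_arrival, direct_arrival, and SCHEDULE are hypothetical.

```python
def earliest_arrival(schedule, src, dst, start=0):
    """Earliest slice at whose start a packet can be at dst, plus the hops taken.

    schedule[t][u] is the ToR that u has a circuit to during slice t; the
    schedule repeats cyclically. Assumption: one circuit traversal per slice,
    and a packet may wait (be buffered) at any ToR.
    """
    period = len(schedule)
    tors = {src, dst}
    for circuits in schedule:
        tors.update(circuits.keys())
        tors.update(circuits.values())
    # reached[u] = list of (slice, from_tor, to_tor) hops used to get to u
    reached = {src: []}
    # The reachable set grows at least once per period or never again,
    # so period * len(tors) slices are enough to decide reachability.
    for t in range(start, start + period * len(tors) + 1):
        if dst in reached:
            return t, reached[dst]
        circuits = schedule[t % period]
        nxt = dict(reached)  # waiting at a ToR keeps it reachable
        for u, hops in reached.items():
            v = circuits.get(u)
            if v is not None and v not in nxt:
                nxt[v] = hops + [(t, u, v)]  # hop on the circuit u -> v
        reached = nxt
    return None


def direct_arrival(schedule, src, dst, start=0):
    """Arrival slice when the packet waits at src for the direct circuit."""
    period = len(schedule)
    for t in range(start, start + period):
        if schedule[t % period].get(src) == dst:
            return t + 1
    return None


# Hypothetical 4-ToR schedule: every pair is connected once per 3 slices.
SCHEDULE = [
    {"A": "B", "B": "A", "C": "D", "D": "C"},
    {"A": "C", "C": "A", "B": "D", "D": "B"},
    {"A": "D", "D": "A", "B": "C", "C": "B"},
]

print(earliest_arrival(SCHEDULE, "A", "D"))  # hop off at B, then ride B -> D
print(direct_arrival(SCHEDULE, "A", "D"))    # wait at A for the A -> D circuit
```

In this toy schedule, the packet that hops off at B reaches D one slice earlier than the packet that waits for the direct A-to-D circuit.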

[1] J. Li, Y. Lei, F. De Marchi, R. Joshi, B. Chandrasekaran, and Y. Xia. Hop-On Hop-Off routing: A fast tour across the optical data center network for latency-sensitive flows. In APNet ’22, 6th Asia-Pacific Workshop on Networking, 2022.

A General Framework for Fast-Switched Optical Data Center Networks

Investigators: Yiting Xia, Jialong Li, and Yiming Lei, in cooperation with Federico De Marchi (Saarland University), Zhengqing Liu (Ecole Polytechnique, Paris), Raj Joshi (National University of Singapore), and Balakrishnan Chandrasekaran (Vrije Universiteit Amsterdam)


The last 15 years have witnessed the emergence and development of optical data center networks (DCNs). A series of optical DCN architectures have been proposed to leverage the bandwidth, power, and cost advantages of optical interconnects. In contrast to electrical interconnects in traditional DCNs, optical interconnects use circuit switching to establish dedicated optical circuits between end points and shift the circuits across “time slices” to create time-shared networks. Circuit reconfiguration incurs a “switching delay” determined by the specific optical switching technology adopted by the architecture. This journey started from slow-switched optical DCNs, with tens of milliseconds of switching delay. Limited by the switching speed, this type of optical network has to work in tandem with an electrical network to avoid network partitioning, e.g., by either augmenting the electrical DCN with on-demand circuits to offload heavy traffic, or serving as “patch panels” for electrical switches and reconfiguring the network topology at a granularity of seconds to hours. For example, Jupiter, Google’s DCN fabric, has achieved a 5× capacity increase, a 41% power reduction, and a 30% cost reduction after deploying slow-switched optical interconnects in the network core. These optical interconnects provide large port counts to interconnect electrical switches and reconfigure the DCN topology when needed, e.g., at device upgrade and failure times, or once every few hours as the DCN traffic evolves.

Fast-switched optical DCNs, whose switching delays range from several nanoseconds to tens of microseconds, have gained increasing recognition in recent years, driven by the prevalence of mice flows in DCN applications. These architectures shift circuits continuously to route traffic in the optical domain on an all-optical network fabric. The removal of the electrical network further reduces cost compared to slow-switched optical DCNs, but at the same time deviates from the all-to-all connectivity assumed by conventional DCN designs. How to build a networked system that adapts to the transient circuits, which last merely microseconds under fast switching, is still largely unknown. We foresee challenges on multiple fronts for the implementation and eventual deployment of fast-switched optical DCNs. (1) How should network devices be time-synchronized network-wide at sub-microsecond or even nanosecond accuracy to keep traffic in sync with the rapidly reconfigured circuits, preferably in-band over the optical network fabric now that the electrical network is out of the way? (2) As the circuit duration drops to the same scale as the DCN RTT and the delays in the host stack, how should the top-of-rack switch (ToR) and host systems be designed to maintain good performance? (3) Each optical architecture, even once implemented, is a closed ecosystem with tightly coupled optical hardware and networked system; how can the network be upgraded from one architecture to another after deployment?

In this project, we address these challenges with OpenOptics, a general framework to make fast-switched optical DCNs practically realizable. OpenOptics aims to do for fast-switched optical DCNs what OpenFlow did for traditional networks. It allows specific optical hardware to be integrated into the general framework in a plug-and-play manner to obtain a workable end-to-end system, and cloud applications can run without changes as if on traditional DCNs. As optical technologies advance, different optical architectures can be realized straightforwardly on top of OpenOptics, and the system can remain intact when the DCN fabric is upgraded to newer optical hardware. By decoupling the software system from the optical hardware, we make the niche area of optical DCNs more accessible to network researchers. We will open-source OpenOptics to encourage real-world testing of, and education on, fast-switched optical DCNs. The enabler of generality in OpenOptics is HOHO routing, a unified routing algorithm, which we published at a workshop, that applies to different fast-switched optical DCN architectures. Most fast-switched optical DCN architectures use a pre-defined optical schedule, i.e., a repetitive sequence of circuit connections over time slices, to avoid expensive real-time traffic estimation and circuit planning under the short time-slice durations. HOHO routing takes advantage of this fact to abstract each architecture by its optical schedule. It takes the optical schedule as input and computes offline the lowest-latency paths (proven to be optimal) for mice flows. We replace the specific routing algorithm of each architecture with HOHO routing, which produces better paths for mice flows and preserves the direct paths between source and destination ToRs for elephant flows.
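
To make the notion of an optical schedule concrete, the following sketch builds one possible pre-defined schedule: a round-robin sequence of matchings in which every ToR pair gets a direct circuit exactly once per cycle. It uses the textbook circle method for round-robin tournaments and assumes one optical uplink per ToR and an even number of ToRs. It is not the schedule of any particular architecture, and the function name and data layout are hypothetical.

```python
def round_robin_schedule(num_tors):
    """Return a cyclic schedule as a list of per-slice circuit maps.

    schedule[t][u] is the ToR that u is connected to during slice t.
    Assumption: one uplink per ToR, so each slice is a perfect matching.
    """
    if num_tors < 2 or num_tors % 2:
        raise ValueError("num_tors must be an even number >= 2")
    m = num_tors - 1  # ToR m stays fixed; ToRs 0..m-1 rotate around it
    schedule = []
    for r in range(m):
        pairs = [(m, r)]
        for k in range(1, num_tors // 2):
            pairs.append(((r + k) % m, (r - k) % m))
        circuits = {}
        for u, v in pairs:
            circuits[u] = v  # circuits are bidirectional in this sketch
            circuits[v] = u
        schedule.append(circuits)
    return schedule


schedule = round_robin_schedule(4)
for t, circuits in enumerate(schedule):
    print(t, sorted(circuits.items()))
```

A routing computation that only consumes such a list of per-slice circuit maps does not need to know which optical hardware produced it, which is the kind of decoupling the schedule abstraction is meant to provide.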

With the unified HOHO routing, we can unify the ToR and host systems across architectures as well. We implement the runtime system for the offline HOHO algorithm on ToRs and hosts, using P4 on Intel Tofino2 switches and VMA on Mellanox NICs. We bear the aforementioned challenges in mind and embrace a systematic design by testing the boundaries of these commercial tools for fast-switched optical DCNs. Specifically, we realize network-wide in-band ToR synchronization based on profiled synchronization errors between Tofino2 switches; we implement HOHO routing on ToRs with careful measurements of the system delays of each critical step on Tofino2 switches; and we build an application-agnostic host network with a fair assessment of the overheads of kernel and kernel-bypass options. Our micro-benchmark evaluation of OpenOptics shows that our in-band ToR synchronization keeps the synchronization errors under 15 ns, our ToR system achieves zero packet loss with 99.93% achievable network utilization, and our host system sends 99.4% of packets inside the scheduled time slices. We demonstrate the generality of OpenOptics by realizing Mordia, RotorNet, and Opera, three fast-switched optical DCN architectures, on top of it. Case studies running Memcached and Gloo applications on them show that the tail flow completion times for mice flows in OpenOptics are only 16% worse than those of an electrical DCN.
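
Holding packets at a ToR until their scheduled time slice can be pictured with a calendar queue. The sketch below is a software model for illustration only, not the P4 implementation on Tofino2: it assumes a fixed number of FIFO queues per egress port, indexed by departure slice modulo the number of queues, with the queue of the current slice drained at each slice boundary. The class name, method names, and the value of 7 queues in the usage are hypothetical choices.

```python
from collections import deque


class CalendarQueue:
    """Toy model of per-egress-port calendar queues keyed by time slice."""

    def __init__(self, num_queues):
        self.queues = [deque() for _ in range(num_queues)]
        self.current_slice = 0

    def enqueue(self, packet, depart_slice):
        """Buffer a packet until the slice in which its circuit is up."""
        offset = depart_slice - self.current_slice
        if not 0 <= offset < len(self.queues):
            # Too far in the future (or already past) for the queues we have.
            raise ValueError("departure slice outside the calendar horizon")
        self.queues[depart_slice % len(self.queues)].append(packet)

    def advance(self):
        """Send everything scheduled for the current slice, then move on."""
        queue = self.queues[self.current_slice % len(self.queues)]
        sent = list(queue)
        queue.clear()
        self.current_slice += 1
        return sent


port = CalendarQueue(num_queues=7)
port.enqueue("p1", depart_slice=0)
port.enqueue("p2", depart_slice=2)
port.enqueue("p3", depart_slice=0)
for _ in range(3):
    print(port.current_slice, port.advance())
```

The number of queues bounds how many slices ahead a packet can be buffered, so in this model a routing scheme that never waits longer than a few slices at any ToR needs only a few queues per port.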