Investigators: Yiting Xia, Jialong Li, and Yiming Lei, in cooperation with Federico De Marchi (Saarland University), Raj Joshi (National University of Singapore), and Balakrishnan Chandrasekaran (Vrije Universiteit Amsterdam)
The growth of data center networks (DCNs) has largely benefited from Moore's law for networking—the bandwidth of electrical switches doubles every two years at the same cost and power. As this bandwidth scaling slows down, the networking community has started exploring high-radix passive optical network interconnects, which have lower per-port cost and consume less power than electrical switches. The latest optical-DCN designs deliver up to 4 times the bandwidth while consuming only 23%–26% of the power of a cost-equivalent electrical DCN. A typical optical DCN fabric comprises a number of optical switches that interconnect electrical top-of-rack switches (ToRs) and end servers (see Fig. 34.1). The fabric uses circuit switching to establish dedicated optical circuits that are time-shared among the different ToR pairs. The delays incurred in establishing the circuits, however, substantially affect the latency-sensitive traffic (or "mice" flows) that then uses these circuits.
In this paper, we focus on minimizing the impact of these delays on latency-sensitive traffic. Prior attempts at solving this problem either use an electrical-optical hybrid network and send mice flows over the "always-on" electrical network, or reduce the circuit-establishment delays using novel optical switching hardware. The electrical-optical dual fabric, however, doubles the deployment and maintenance costs of DCNs, while the latter requires extensive customizations to commercial network devices and the standard network stack, e.g., to adapt to the 1ns optical switching speed in Sirius. In contrast to such prior work, we propose a simple solution that leverages programmable switches: a routing algorithm with the specific objective of accelerating mice flows. The idea of leveraging routing to accelerate latency-sensitive flows has been used in prior work, albeit within a narrow scope. Opera pursued a meticulous co-design of the optical network topology and routing to guarantee that mice flows always have (multi-hop) optical paths available via intermediate ToRs. Like prior designs, however, Opera assumes that packets must be buffered on end servers, as optical switches are bufferless. As soon as an optical path is available, packets hop on that path and ride it until the destination; they cannot hop off at intermediate ToRs even if a different optical path later offers an earlier arrival time at the destination. Routing in Opera is hence sub-optimal: it searches for non-stop paths rather than the fastest paths. We offer support for packets to "hop off" at ToRs by rethinking packet buffering on ToRs.
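The difference between non-stop paths and fastest paths can be illustrated with a toy earliest-arrival search over a time-slotted circuit schedule. The following is a hypothetical sketch under simplified assumptions (a schedule given as sets of directed ToR pairs per slice, one slice per hop, buffering at any ToR), not the algorithm from the paper; `direct_arrival` is a single-hop baseline that only waits for a direct circuit.

```python
import heapq

def earliest_arrival(schedule, src, dst, t0, horizon=1000):
    """Earliest slice at which a packet leaving ToR `src` at slice `t0`
    can reach `dst`, allowing hop-on/hop-off at intermediate ToRs.

    schedule: list of sets of directed (u, v) ToR pairs, repeating with
    period len(schedule). Crossing a circuit takes one slice; a packet
    may also wait one slice buffered at its current ToR."""
    period = len(schedule)
    best = {src: t0}
    pq = [(t0, src)]  # (arrival slice, ToR), explored in arrival order
    while pq:
        t, u = heapq.heappop(pq)
        if u == dst:
            return t
        if t > best.get(u, float("inf")) or t - t0 > horizon:
            continue
        # Either wait one slice at u, or ride any circuit leaving u now.
        candidates = [(u, t + 1)]
        for a, b in schedule[t % period]:
            if a == u:
                candidates.append((b, t + 1))
        for v, tv in candidates:
            if tv < best.get(v, float("inf")):
                best[v] = tv
                heapq.heappush(pq, (tv, v))
    return None  # unreachable within the horizon

def direct_arrival(schedule, src, dst, t0):
    """Arrival if the packet only waits for a direct src->dst circuit."""
    t = t0
    while (src, dst) not in schedule[t % len(schedule)]:
        t += 1
    return t + 1
```

For example, with the three-slice schedule `[{(0,1),(2,3)}, {(1,2),(3,0)}, {(0,2),(1,3)}]`, a packet from ToR 0 to ToR 2 at slice 0 can ride 0→1 and then 1→2, arriving at slice 2, whereas waiting for the direct 0→2 circuit arrives at slice 3.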
Buffering at ToRs was deemed impossible due to the limited packet buffer on switches, the difficulty in synchronizing switches to coordinate with optical circuit configurations, and the lack of processing logic for scheduling packet transmissions at precise times. Recent programmable switches offer rich functionalities to clear these technical obstacles. Switches can, for example, provide temporal buffering for a small number of packets, be time-synchronized at nanosecond-level precision, and provide time-based scheduled packet transmission via calendar queues. We exploit these recent technological innovations in programmable switches to realize a novel routing algorithm that minimizes the delays experienced by latency-sensitive flows, and we summarize our contributions as follows: (a) we present a Hop-On Hop-Off (HOHO) routing algorithm that provides the fastest paths—packets can "hop on" and "hop off" at intermediate ToRs to select the best optical paths that minimize their arrival time at the destination; (b) we prove the optimality and robustness of the HOHO algorithm, and sketch its implementation on programmable switches, including the time synchronization, routing lookup, and packet buffering mechanisms; (c) in our packet-level simulations with real DCN traffic, HOHO reduces the flow completion times (FCTs) of latency-sensitive flows by up to 35% and the average path length by 15% compared to Opera. HOHO uses at most 7 queues per egress port and a packet buffer of about 3.24MB, which is far below the capacity limit of commercial switch ASICs.
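The calendar-queue mechanism mentioned above can be illustrated with a toy model: one FIFO per upcoming time slice, drained as the synchronized clock advances. This is an illustrative Python sketch, not the switch-ASIC data plane; the default of 7 queues simply mirrors the per-port queue count reported above.

```python
from collections import deque

class CalendarQueue:
    """Toy calendar queue: one FIFO per upcoming time slice.

    A packet tagged with a departure slice is held until the
    (synchronized) clock reaches that slice, then drained in order."""

    def __init__(self, num_queues=7):
        self.num_queues = num_queues
        self.queues = [deque() for _ in range(num_queues)]
        self.now = 0  # current time slice

    def enqueue(self, packet, depart_slice):
        # A packet can be scheduled at most num_queues - 1 slices ahead.
        assert self.now <= depart_slice < self.now + self.num_queues
        self.queues[depart_slice % self.num_queues].append(packet)

    def advance(self):
        """Drain packets due in the current slice, then move the clock on."""
        q = self.queues[self.now % self.num_queues]
        drained = list(q)
        q.clear()
        self.now += 1
        return drained
```

For instance, enqueuing packet "p1" for slice 0 and "p2" for slice 2 yields `["p1"]`, `[]`, and `["p2"]` over three successive calls to `advance()`.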
J. Li, Y. Lei, F. De Marchi, R. Joshi, B. Chandrasekaran, and Y. Xia. Hop-On Hop-Off routing: A fast tour across the optical data center network for latency-sensitive flows. In APNet '22, 6th Asia-Pacific Workshop on Networking, 2022.
Investigators: Yiting Xia, Jialong Li, and Yiming Lei, in cooperation with Federico De Marchi (Saarland University), Zhengqing Liu (Ecole Polytechnique, Paris), Raj Joshi (National University of Singapore), and Balakrishnan Chandrasekaran (Vrije Universiteit Amsterdam)
The last 15 years have witnessed the emergence and development of optical data center networks (DCNs). A series of optical DCN architectures have been proposed to leverage the bandwidth, power, and cost advantages of optical interconnects. Compared to the electrical interconnects in traditional DCNs, optical interconnects use circuit switching to establish dedicated optical circuits between end points and shift the circuits across "time slices" to create time-shared networks. Circuit reconfiguration incurs a "switching delay" determined by the specific optical switching technology adopted by the architecture. This journey started from slow-switched optical DCNs, with tens of milliseconds of switching delay. Limited by the switching speed, this type of optical network has to work in tandem with an electrical network to avoid network partitioning, e.g., either augmenting the electrical DCN with on-demand circuits to offload heavy traffic, or serving as "patch panels" for electrical switches that reconfigure the network topology on a seconds-to-hours granularity. For example, Jupiter—Google's DCN fabric—achieved a 5× capacity increase, 41% power reduction, and 30% cost reduction after deploying slow-switched optical interconnects in the network core. These optical interconnects provide large port counts to interconnect electrical switches and reconfigure the DCN topology when needed, e.g., at device upgrades and failures, or once every few hours as the DCN traffic evolves.
Fast-switched optical DCNs, whose switching delays range from several nanoseconds to tens of microseconds, have gained increasing recognition in recent years, driven by the prevalence of mice flows in DCN applications. These architectures shift circuits continuously to route traffic in the optical domain on an all-optical network fabric. The removal of the electrical network further reduces cost compared to slow-switched optical DCNs, but at the same time deviates from the all-to-all connectivity assumed by conventional DCN designs. How to build the networked system to adapt to the transient circuits, with merely microsecond-scale durations under fast switching, is still largely unknown. We foresee challenges on multiple fronts for the implementation and eventual deployment of fast-switched optical DCNs. (1) How should network devices be time-synchronized network-wide at sub-microsecond or even nanosecond accuracy to keep traffic in sync with the rapidly reconfigured circuits, preferably in-band over the optical network fabric now that the electrical network is out of the way? (2) As the circuit duration drops to the same scale as the DCN RTT and the delays in the host stack, how should the top-of-rack switch (ToR) and host systems be designed to maintain good performance? (3) Even if implemented, each optical architecture is a closed ecosystem with heavily coupled optical hardware and networked system; how can the network be upgraded from one architecture to another after deployment?
In this project, we address these challenges with OpenOptics, a general framework that makes fast-switched optical DCNs practically realizable. OpenOptics aims to do for fast-switched optical DCNs what OpenFlow did for traditional networks. It allows specific optical hardware to be integrated into the general framework in a plug-and-play manner to form a workable end-to-end system, and cloud applications can run without changes as if on traditional DCNs. As optical technologies advance, different optical architectures can be realized straightforwardly on top of OpenOptics, and the system can remain intact when the DCN fabric is upgraded to newer optical hardware. By decoupling the software system from the optical hardware, we make the niche area of optical DCNs more accessible to network researchers. We will open-source OpenOptics to encourage real-world testing and education on fast-switched optical DCNs. The enabler of generality in OpenOptics is HOHO routing, a unified routing algorithm we published at a workshop that applies to different fast-switched optical DCN architectures. Most fast-switched optical DCN architectures use a pre-defined optical schedule, i.e., a repetitive sequence of circuit connections over time slices, to avoid expensive real-time traffic estimation and circuit planning under the short time-slice durations. HOHO routing takes advantage of this fact to abstract each architecture by its optical schedule. It takes the optical schedule as input and computes offline the lowest-latency paths (proven to be optimal) for mice flows. We replace the specific routing algorithm of each architecture with HOHO routing, which produces better paths for mice flows and preserves the direct paths between source and destination ToRs for elephant flows.
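The optical-schedule abstraction can be made concrete with a small generator. As a hypothetical example (a simplified round-robin in the spirit of RotorNet, not its actual rotor matchings), the schedule below connects every directed ToR pair by a direct circuit exactly once per period; a repetitive sequence of per-slice matchings like this is the input form that a schedule-driven routing algorithm such as HOHO consumes.

```python
def rotor_schedule(num_tors):
    """Hypothetical round-robin optical schedule: during slice k,
    ToR i connects to ToR (i + k + 1) mod num_tors. Over one period
    of num_tors - 1 slices, every directed ToR pair gets a circuit."""
    period = num_tors - 1
    return [
        {(i, (i + k + 1) % num_tors) for i in range(num_tors)}
        for k in range(period)
    ]
```

For 4 ToRs this yields a 3-slice period whose matchings jointly cover all 12 directed ToR pairs, so every rack pair sees a direct circuit once per cycle.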
With the unified HOHO routing, we can unify the ToR and host systems across architectures as well. We implement the runtime system for the offline HOHO algorithm on ToRs and hosts, using P4 on Intel Tofino2 switches and VMA on Mellanox NICs. We bear the aforementioned challenges in mind and embrace a systematic design by testing the boundaries of these commercial tools for fast-switched optical DCNs. Specifically, we realize network-wide in-band ToR synchronization based on profiled synchronization errors between Tofino2 switches; we implement HOHO routing on ToRs with careful measurements of the system delays on Tofino2 switches at the critical steps; and we build an application-agnostic host network with a fair judgment of the overheads of kernel and kernel-bypass options. Our micro-benchmark evaluation of OpenOptics shows that our in-band ToR synchronization keeps the synchronization errors under 15ns, our ToR system achieves zero packet loss at 99.93% achievable network utilization, and our host system sends 99.4% of packets inside the scheduled time slices. We demonstrate the generality of OpenOptics by realizing Mordia, RotorNet, and Opera—three fast-switched optical DCN architectures—on top of it. Case studies running Memcached and Gloo applications on them show that the tail flow completion times for mice flows in OpenOptics are only 16% worse than those of an electrical DCN.