Computer Science > Distributed, Parallel, and Cluster Computing
[Submitted on 19 Feb 2026]
Title: Trivance: Latency-Optimal AllReduce by Shortcutting Multiport Networks
Abstract: AllReduce is a fundamental collective operation in distributed computing and a key performance bottleneck for large-scale training and inference. Its completion time is determined by the number of communication steps, which dominates latency-sensitive workloads, and the communication distance affecting both latency- and bandwidth-bound regimes. Direct-connect topologies, such as torus networks used in Google’s TPUv4, are particularly prone to large communication distances due to limited bisection bandwidth. Latency-optimal algorithms such as Bruck’s complete AllReduce in $\log_3 n$ steps on a bidirectional ring, but incur large communication distances that result in substantial congestion. In contrast, recent approaches such as Swing reduce communication distance and congestion, but are inherently required to perform $\log_2 n$ steps to complete AllReduce, sacrificing latency-optimality. In this paper, we present Trivance, a novel AllReduce algorithm that completes within $\log_3 n$ steps, while reducing congestion compared to Bruck’s algorithm by a factor of three and preserving bandwidth-optimality.
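To make the gap between $\log_2 n$ and $\log_3 n$ steps concrete, the short sketch below counts the steps each bound implies for a few ring sizes. It is an illustration of the step counts quoted in the abstract only, not an implementation of Trivance, Bruck's algorithm, or Swing. The helper name `steps` and the sample node counts are hypothetical, and non-powers of the radix are rounded up, which is an assumption rather than something the abstract states.

```python
def steps(n: int, radix: int) -> int:
    """Return the smallest s with radix**s >= n, i.e. ceil(log_radix(n)).

    Integer arithmetic is used instead of math.log to avoid
    floating-point rounding at exact powers of the radix.
    """
    if n < 1 or radix < 2:
        raise ValueError("need n >= 1 and radix >= 2")
    s, reach = 0, 1
    while reach < n:
        reach *= radix  # nodes covered grow by a factor of `radix` per step
        s += 1
    return s


if __name__ == "__main__":
    # Hypothetical node counts, chosen only for illustration.
    for n in (9, 64, 729, 4096):
        print(f"n={n:5d}  log2 bound: {steps(n, 2):2d} steps  "
              f"log3 bound: {steps(n, 3):2d} steps")
```

For example, with 729 nodes the radix-3 bound gives 6 steps where the radix-2 bound gives 10, which is the kind of latency difference the abstract refers to for latency-sensitive workloads.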