Optimizing Allreduce Operations for Modern Heterogeneous Architectures with Multiple Processes per GPU

Submitted on 18 Aug 2025 (v1), last revised 24 Feb 2026 (this version, v2). Subject: Computer Science > Distributed, Parallel, and Cluster Computing.

Abstract: Large inter-GPU all-reduce operations, prevalent throughout deep learning, are bottlenecked by communication costs. Emerging heterogeneous architectures are comprised of complex nodes, often containing $4$ GPUs and dozens to hundreds of CPU cores per node. Parallel applications are typically accelerated on the available GPUs, using only a single CPU core per GPU while the remaining cores sit idle. This paper presents novel optimizations to large GPU-aware all-reduce operations by extending the lane-aware algorithm to heterogeneous architectures and notably using multiple CPU cores per GPU to accelerate these operations. Using GPUDirect RDMA and host copy communications respectively, these multi-CPU-accelerated GPU-aware all-reduces yield speedups over system MPI of up to $3$x on LLNL’s Tuolumne supercomputer and up to $2.45$x for large MPI all-reduces across the NVIDIA A100 GPUs of NCSA’s Delta supercomputer.
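
The idea of using several CPU cores per GPU for one all-reduce can be illustrated with a small simulation. The sketch below is not the paper's implementation: it uses no MPI, GPUDirect RDMA, or host copies, and the function names `split_chunks` and `multi_process_allreduce` are hypothetical. It only assumes that each GPU's buffer is divided into contiguous chunks, that one process per GPU owns each chunk, and that each group of processes reduces its own chunk independently.

```python
from typing import List


def split_chunks(n: int, parts: int) -> List[range]:
    """Split indices 0..n-1 into `parts` contiguous, near-equal ranges."""
    base, extra = divmod(n, parts)
    ranges, start = [], 0
    for p in range(parts):
        size = base + (1 if p < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges


def multi_process_allreduce(gpu_buffers: List[List[int]],
                            procs_per_gpu: int) -> List[List[int]]:
    """Simulate a sum all-reduce in which each GPU is served by several processes.

    Assumption (illustrative only): process j of every GPU handles chunk j,
    so the chunks could be reduced concurrently on a real system.
    """
    n = len(gpu_buffers[0])
    assert all(len(buf) == n for buf in gpu_buffers), "buffers must match in length"

    result = [0] * n
    for chunk in split_chunks(n, procs_per_gpu):
        # One group of processes (one per GPU) reduces only this chunk.
        for i in chunk:
            result[i] = sum(buf[i] for buf in gpu_buffers)

    # After an all-reduce, every GPU holds the full reduced buffer.
    return [list(result) for _ in gpu_buffers]


if __name__ == "__main__":
    buffers = [[1, 2, 3, 4, 5], [10, 20, 30, 40, 50]]
    print(multi_process_allreduce(buffers, procs_per_gpu=2))
```

The reduced values do not depend on `procs_per_gpu`; in this simplified model only the division of work changes, which is why extra cores can help without altering the result.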

Sources:

https://arxiv.org/abs/2508.13397 (Latest source article published: 2026-02-26 05:00 UTC)