Optimizing Allreduce Operations for Modern Heterogeneous Architectures with Multiple Processes per GPU
Computer Science > Distributed, Parallel, and Cluster Computing. Submitted on 18 Aug 2025 (v1), last revised 24 Feb 2026 (this version, v2).