Joint Training on AMD and NVIDIA GPUs

Computer Science > Distributed, Parallel, and Cluster Computing
[Submitted on 20 Feb 2026]

Abstract: As large language models continue to scale, training demands on compute and system capacity grow rapidly, making single-vendor homogeneous clusters insufficient. This paper presents a technical solution for heterogeneous mixed training in AMD-NVIDIA environments. We first adopt a compatibility-oriented approach based on CPU-Forwarding Communication, with differentiated communication back-end selection across parallel groups and multi-NIC parallel data transfer. To achieve higher performance, we further propose another Device-Direct Communication approach, integrating a CPU-offloading P2P mechanism to enable direct cross-vendor GPU data transfer without host-memory staging. Experiments on LLaMA-8B and Qwen2-7B demonstrate that the proposed Device-Direct Communication approach achieves up to 98% of the throughput of an NVIDIA homogeneous system, while preserving training stability and correctness.

Article Summaries:

Summary

A recent study addresses the growing need for heterogeneous GPU clusters in large-language-model training. The authors propose two communication strategies for mixed AMD-NVIDIA environments. The first, CPU-Forwarding Communication, is compatibility-oriented: it selects different communication back-ends for different parallel groups and transfers data over multiple NICs in parallel. The second, Device-Direct Communication, targets higher performance: it integrates a CPU-offloading P2P (peer-to-peer) mechanism so that data moves directly between GPUs from different vendors without being staged in host memory. In experiments on LLaMA-8B and Qwen2-7B, the device-direct scheme reaches up to 98% of the throughput of a homogeneous NVIDIA system while preserving training stability and correctness. The work offers a practical path to training a single model across GPUs from more than one vendor.
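
To make the idea of per-group back-end selection more concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not say which back-ends are used or how groups are laid out. The function `select_backend`, the labels `vendor_native` and `cpu_forwarding`, and the 8-rank layout (tensor-parallel groups kept within one vendor, data-parallel groups spanning both) are all assumptions chosen for illustration.

```python
from enum import Enum


class Vendor(Enum):
    AMD = "amd"
    NVIDIA = "nvidia"


def select_backend(group_ranks, rank_vendor):
    """Pick a back-end label for one parallel group (hypothetical rule).

    Assumption: a group whose ranks all sit on one vendor's GPUs can use
    that vendor's native collective library, while a group that mixes
    vendors falls back to a host-mediated (CPU-forwarding) path.
    """
    vendors = {rank_vendor[rank] for rank in group_ranks}
    if len(vendors) == 1:
        return "vendor_native"
    return "cpu_forwarding"


# Hypothetical 8-rank job: ranks 0-3 on NVIDIA GPUs, ranks 4-7 on AMD GPUs.
rank_vendor = {
    rank: (Vendor.NVIDIA if rank < 4 else Vendor.AMD) for rank in range(8)
}

# Assumed layout: tensor parallelism stays inside one vendor,
# data parallelism pairs one NVIDIA rank with one AMD rank.
tensor_parallel_groups = [[0, 1, 2, 3], [4, 5, 6, 7]]
data_parallel_groups = [[0, 4], [1, 5], [2, 6], [3, 7]]

tp_backends = [select_backend(g, rank_vendor) for g in tensor_parallel_groups]
dp_backends = [select_backend(g, rank_vendor) for g in data_parallel_groups]

print("tensor-parallel:", tp_backends)
print("data-parallel:  ", dp_backends)
```

Under this assumed layout, only the cross-vendor data-parallel groups take the slower host-mediated path, which is one plausible reason to choose back-ends per group rather than once for the whole job.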

Sources:

https://arxiv.org/abs/2602.18007