• We are open-sourcing the initial version of RCCLX, an enhanced version of RCCL that we developed and tested on Meta’s internal workloads.
• RCCLX is fully integrated with Torchcomms and aims to empower researchers and developers to accelerate innovation, regardless of their chosen backend.
• Communication patterns for AI models are constantly evolving, as are hardware capabilities.
• We want to iterate quickly on collectives, transports, and novel features on AMD platforms.
• Earlier, we developed and open-sourced CTran, a custom transport library for the NVIDIA platform.
• With RCCLX, we have brought CTran to AMD platforms, enabling AllToAllvDynamic, a GPU-resident collective.

Article Summaries:

  • Meta has released RCCLX, an open-source GPU communication library for AMD platforms. RCCLX is an enhanced version of RCCL and is fully integrated with Torchcomms. It incorporates the CTran transport, which enables the GPU-resident AllToAllvDynamic collective, and introduces two key performance features. Direct Data Access (DDA) provides lightweight intra-node collectives that reduce AllReduce latency by up to 30% on MI300X GPUs, cutting time-to-incremental-token by roughly 10%. Low-precision collectives for FP32 and BF16 data further accelerate training and inference on Instinct MI300/MI350 GPUs. Meta plans to add the remaining CTran features in the coming months.
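For readers unfamiliar with the all-to-all-v pattern that AllToAllvDynamic takes its name from, the following is a minimal single-process sketch of the data movement only: every rank sends a variable-sized chunk to every other rank. It does not use RCCLX, CTran, or any GPU API. The function name `all_to_all_v` and the list-based buffers are hypothetical, chosen for illustration, and the sketch makes no claim about how RCCLX implements the collective.

```python
from typing import List, Sequence


def all_to_all_v(
    send_bufs: Sequence[Sequence[str]],
    send_counts: Sequence[Sequence[int]],
) -> List[List[str]]:
    """Simulate all-to-all-v among n ranks inside one process.

    send_bufs[r] is rank r's flat send buffer.
    send_counts[r][d] is how many elements rank r sends to rank d.
    Returns recv_bufs, where recv_bufs[d] holds the chunks destined for
    rank d, concatenated in source-rank order.
    """
    n = len(send_bufs)
    recv_bufs: List[List[str]] = [[] for _ in range(n)]
    for src in range(n):
        counts = send_counts[src]
        if len(counts) != n:
            raise ValueError(f"rank {src}: expected {n} counts, got {len(counts)}")
        if sum(counts) != len(send_bufs[src]):
            raise ValueError(f"rank {src}: counts do not cover the send buffer")
        offset = 0
        for dst in range(n):
            # Slice out the chunk addressed to dst and append it to dst's buffer.
            recv_bufs[dst].extend(send_bufs[src][offset:offset + counts[dst]])
            offset += counts[dst]
    return recv_bufs


# Hypothetical example: three ranks with uneven per-destination counts.
send_bufs = [["a0", "a1", "a2"], ["b0"], ["c0", "c1"]]
send_counts = [[1, 0, 2], [0, 1, 0], [1, 1, 0]]
recv = all_to_all_v(send_bufs, send_counts)
print(recv)  # [['a0', 'c0'], ['b0', 'c1'], ['a1', 'a2']]
```

In this sketch the counts are ordinary Python lists passed in by the caller. The summary describes AllToAllvDynamic as GPU-resident, a property this host-side illustration does not attempt to model.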
