Optimizing Communication for Mixture-of-Experts Training with Hybrid Expert Parallel

• In LLM training, Expert Parallel (EP) communication for hyperscale mixture-of-experts (MoE) models is challenging.
• EP communication is essentially all-to-all, but because it is dynamic and sparse (each token is routed only to its top-k experts rather than to all experts), it is hard to implement and optimize.
• This post details an efficient MoE EP communication solution, Hybrid-EP, and its use in the NVIDIA Megatron family of frameworks, on NVIDIA Quantum InfiniBand and NVIDIA Spectrum-X Ethernet platforms.
• It also dives into the effectiveness of Hybrid-EP in real-world model training.

Efficiency challenges of hyperscale MoE model training
• DeepSeek-V3 is a representative model of the new generation of large-scale fine-grained MoE models.
• Such models balance computational overhead with model performance through “hyperparameter size sparse activation,” but they also pose serious challenges for existing large-model training frameworks.
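To make the "dynamic and sparse all-to-all" point above concrete, here is a minimal pure-Python sketch of how top-k routing determines what each rank has to send. It is not Hybrid-EP's API or implementation: the function names (`topk_experts`, `dispatch_counts`, `simulate`), the contiguous expert-to-rank layout, and the uniform random scores standing in for a learned router are all assumptions made for illustration.

```python
import random

def topk_experts(scores, k):
    """Return the indices of the k largest router scores for one token."""
    return sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:k]

def dispatch_counts(routes_per_rank, num_experts, num_ranks):
    """Build the all-to-all send matrix.

    counts[src][dst] is the number of token copies that rank `src` must send
    to rank `dst`. Assumes experts are laid out contiguously across ranks
    (a hypothetical layout chosen for this sketch).
    """
    experts_per_rank = num_experts // num_ranks
    counts = [[0] * num_ranks for _ in range(num_ranks)]
    for src, routes in enumerate(routes_per_rank):
        for experts in routes:          # one entry per token
            for e in experts:           # one copy per selected expert
                counts[src][e // experts_per_rank] += 1
    return counts

def simulate(num_ranks=4, num_experts=16, tokens_per_rank=8, k=2, seed=0):
    """Route random tokens and return the resulting send matrix."""
    rng = random.Random(seed)
    routes_per_rank = []
    for _ in range(num_ranks):
        routes = []
        for _ in range(tokens_per_rank):
            # Random scores stand in for the output of a learned router.
            scores = [rng.random() for _ in range(num_experts)]
            routes.append(topk_experts(scores, k))
        routes_per_rank.append(routes)
    return dispatch_counts(routes_per_rank, num_experts, num_ranks)

if __name__ == "__main__":
    for src, row in enumerate(simulate()):
        print(f"rank {src} sends {row}")
```

Each row always sums to `tokens_per_rank * k`, but how that total is split across destination ranks depends on the router's decisions and changes from batch to batch. That variability is why the exchange cannot be planned with fixed message sizes the way a dense all-to-all can.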

Article Summaries:

NVIDIA has introduced Hybrid‑EP, a new communication library that optimizes expert‑parallel (EP) training for large‑scale mixture‑of‑experts (MoE) models such as DeepSeek‑V3. The library targets the all‑to‑all communication bottleneck that can consume over 50% of training time, and addresses load imbalance caused by dynamic routing of “hot” experts. Hybrid‑EP leverages NVIDIA’s Quantum InfiniBand and Spectrum‑X Ethernet, combining RDMA‑NVLink hardware with software‑level dispatch and gather operators to approach hardware bandwidth limits. Integrated into the Megatron‑Core framework, it supports FP8 precision, activation offloading, and fine‑grained recomputation, enabling more efficient, scalable MoE training on next‑generation GPUs.

Sources:

https://developer.nvidia.com/blog/optimizing-communication-for-mixture-of-experts-training-with-hybrid-expert-parallel/