Semantic Parallelism: Redefining Efficient MoE Inference via Model-Data Co-Scheduling

Computer Science > Machine Learning
[Submitted on 6 Mar 2025 (v1), last revised 24 Feb 2026 (this version, v4)]

Abstract: Prevailing LLM serving engines employ expert parallelism (EP) to implement multi-device inference of massive MoE models. However, the efficiency of expert-parallel inference is largely bounded by inter-device communication, because EP relies on expensive all-to-all collectives to route tokens to remote experts whenever a token and its experts are not collocated on the same GPU/NPU device. State-of-the-art schemes nevertheless treat expert device-placement and request (or token) device-scheduling as separate concerns, triggering excessive communication between them and compromising inference efficiency. This paper proposes Semantic Parallelism, a novel parallelism paradigm that minimizes the steep communication costs in EP-centric MoE serving via model-data collaborative scheduling. We implement Semantic Parallelism in a framework called Sem-MoE. Sem-MoE maximally collocates experts and their activating tokens onto the same device, using proactively modeled activation likelihood between them, and introduces three key techniques: (1) Offline model scheduling, which preliminarily clusters and collocates experts onto devices based on their co-activation tendencies for certain classes of input. (2) Online inter-request data scheduling for Attention-DP setups, which proactively reb
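The abstract does not spell out how Sem-MoE clusters experts, so the following is only an illustrative sketch of the general idea behind technique (1): group experts that tend to be activated by the same tokens onto one device, then count how many token-to-expert dispatches would still have to cross devices. The greedy heuristic, the function names (`coactivation_counts`, `place_experts`, `remote_dispatches`), and the toy routing trace are all assumptions made for illustration; they are not taken from the paper.

```python
from collections import Counter
from itertools import combinations


def coactivation_counts(traces):
    """Count how often each pair of experts is activated by the same token."""
    counts = Counter()
    for experts in traces:
        for pair in combinations(sorted(set(experts)), 2):
            counts[pair] += 1
    return counts


def place_experts(num_experts, num_devices, counts):
    """Greedy placement (hypothetical heuristic): put each expert on the
    non-full device whose resident experts it co-activates with most often,
    breaking ties in favor of the emptier device."""
    capacity = -(-num_experts // num_devices)  # ceiling division
    devices = [[] for _ in range(num_devices)]

    def affinity(expert, group):
        return sum(counts[tuple(sorted((expert, g)))] for g in group)

    for expert in range(num_experts):
        open_devices = [d for d in range(num_devices) if len(devices[d]) < capacity]
        best = max(
            open_devices,
            key=lambda d: (affinity(expert, devices[d]), -len(devices[d])),
        )
        devices[best].append(expert)
    return devices


def remote_dispatches(traces, devices):
    """Count token-to-expert dispatches that cross devices, assuming each
    token is scheduled onto the device hosting most of its experts."""
    owner = {e: d for d, group in enumerate(devices) for e in group}
    remote = 0
    for experts in traces:
        per_device = Counter(owner[e] for e in experts)
        remote += len(experts) - max(per_device.values())
    return remote


# Toy top-2 routing trace: each tuple lists the experts one token activates.
traces = [(0, 2), (1, 3), (0, 2), (1, 3), (0, 2), (1, 3)]
counts = coactivation_counts(traces)

clustered = place_experts(num_experts=4, num_devices=2, counts=counts)
contiguous = [[0, 1], [2, 3]]  # naive layout that ignores co-activation

print(clustered)                              # [[0, 2], [1, 3]]
print(remote_dispatches(traces, clustered))   # 0
print(remote_dispatches(traces, contiguous))  # 6
```

On this toy trace the co-activation-aware layout removes every cross-device dispatch, while the contiguous layout forces one remote dispatch per token. Real routing traces are far less cleanly separable, so this shows only the direction of the effect, not its size.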

Sources:

https://arxiv.org/abs/2503.04398 (Latest source article published: 2026-02-25 05:00 UTC)