Computer Science > Distributed, Parallel, and Cluster Computing
[Submitted on 25 Feb 2026]
Title: Multi-Layer Scheduling for MoE-Based LLM Reasoning
Abstract: Large Language Models (LLMs) have achieved remarkable success across a wide range of tasks, but serving them efficiently at scale remains a critical challenge due to their substantial computational and latency demands. While most existing inference frameworks rely on simple scheduling strategies such as First-Come-First-Serve (FCFS) at the engine level and Round-Robin (RR) at the scheduler or coordinator level, they often fail to fully utilize system resources and may suffer from issues such as head-of-line blocking and load imbalance. Recent advances in Mixture-of-Experts (MoE) models have also introduced new challenges in scheduling arising from expert parallelism and routing complexity. This research proposes a multi-layer scheduling framework tailored for MoE-based LLM serving. It targets scheduling at three levels: request-level, engine-level, and expert-level. At the request level, we explore algorithms such as Shortest-Job-First (SJF) and priority-aware aging to improve throughput and reduce la
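
The request-level idea named in the abstract (Shortest-Job-First combined with aging) can be illustrated with a small sketch. This is not the paper's algorithm: the `Request` fields, the `estimated_tokens` job-length estimate, the linear aging formula, and the `aging_rate` parameter are all assumptions made here for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Request:
    request_id: str
    estimated_tokens: int  # hypothetical job-length estimate (assumption)
    arrival_time: float    # seconds, on the same clock as `now`


def effective_cost(req: Request, now: float, aging_rate: float) -> float:
    """SJF key with linear aging: waiting lowers a request's cost (assumed formula)."""
    waited = max(0.0, now - req.arrival_time)
    return req.estimated_tokens - aging_rate * waited


def pick_next(queue: List[Request], now: float, aging_rate: float = 0.0) -> Optional[Request]:
    """Remove and return the request with the lowest effective cost.

    With aging_rate == 0 this is plain SJF; ties go to the earlier arrival.
    """
    if not queue:
        return None
    best = min(
        queue,
        key=lambda r: (effective_cost(r, now, aging_rate), r.arrival_time),
    )
    queue.remove(best)
    return best


def demo_queue() -> List[Request]:
    # A long request that has waited 10 s and a short one that has waited 1 s.
    return [
        Request("long", estimated_tokens=500, arrival_time=0.0),
        Request("short", estimated_tokens=50, arrival_time=9.0),
    ]


# Plain SJF picks the short request; with enough aging the long one is served first.
sjf_choice = pick_next(demo_queue(), now=10.0, aging_rate=0.0)
aged_choice = pick_next(demo_queue(), now=10.0, aging_rate=60.0)
```

In this sketch, plain SJF would keep deferring the long request as long as shorter ones keep arriving; the aging term is one simple way to bound that wait, at the cost of occasionally serving a long job ahead of short ones.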

