Computer Science > Distributed, Parallel, and Cluster Computing
[Submitted on 18 Feb 2026]
Title: FlowPrefill: Decoupling Preemption from Prefill Scheduling Granularity to Mitigate Head-of-Line Blocking in LLM Serving
Abstract: The growing demand for large language models (LLMs) requires serving systems to handle many concurrent requests with diverse service level objectives (SLOs). This exacerbates head-of-line (HoL) blocking during the compute-intensive prefill phase, where long-running requests monopolize resources and delay higher-priority ones, leading to widespread time-to-first-token (TTFT) SLO violations. While chunked prefill enables interruptibility, it introduces an inherent trade-off between responsiveness and throughput: reducing chunk size improves response latency but degrades computational efficiency, whereas increasing chunk size maximizes throughput but exacerbates blocking. This necessitates an adaptive preemption mechanism. However, dynamically balancing execution granularity against scheduling overheads remains a key challenge. In this paper, we propose FlowPrefill, a TTFT-goodput-optimized serving system that resolves this conflict by decoupling preemption granularity from scheduling frequency.
Article Summaries:
- FlowPrefill is a new large‑language‑model (LLM) serving system that tackles head‑of‑line (HoL) blocking during the compute‑intensive prefill phase. By decoupling preemption granularity from scheduling frequency, it avoids the trade‑off between responsiveness and throughput inherent in chunked prefill. The system introduces operator‑level preemption, allowing fine‑grained interruption at operator boundaries, and event‑driven scheduling, which triggers decisions only on request arrival or completion. Evaluations on real production traces show FlowPrefill can increase goodput by up to 5.6× compared to existing solutions while meeting diverse service‑level objectives.
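The two ideas in the summary above can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not FlowPrefill's implementation: the `Request` class, the `simulate` function, the "one operator per tick" cost model, and the convention that a lower `priority` value is more urgent are all hypothetical and chosen only for illustration. The running request can be interrupted at any operator boundary, but the scheduler itself runs only when a request arrives or completes.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Request:
    # Hypothetical request model; only `priority` takes part in ordering.
    priority: int                           # lower value = more urgent (assumed)
    name: str = field(compare=False)
    ops_left: int = field(compare=False)    # remaining prefill operators


def simulate(arrivals, max_ticks=100):
    """Toy model: each tick executes exactly one operator of the running request.

    `arrivals` maps a tick number to the list of requests arriving at that tick.
    Returns (finished, sched_calls), where `finished` is a list of
    (name, finish_tick) pairs and `sched_calls` counts scheduler invocations.
    """
    ready = []              # min-heap of waiting requests
    running = None
    finished = []
    sched_calls = 0
    pending_event = False   # set on arrival or completion

    for tick in range(max_ticks):
        for req in arrivals.get(tick, []):
            heapq.heappush(ready, req)
            pending_event = True

        # Event-driven scheduling: decide only after an arrival or a completion.
        if pending_event:
            pending_event = False
            sched_calls += 1
            if ready and (running is None or ready[0] < running):
                if running is not None:
                    # Preempt at the operator boundary; progress is kept.
                    heapq.heappush(ready, running)
                running = heapq.heappop(ready)

        if running is not None:
            running.ops_left -= 1           # execute one operator
            if running.ops_left == 0:
                finished.append((running.name, tick + 1))
                running = None
                pending_event = True

    return finished, sched_calls


if __name__ == "__main__":
    arrivals = {
        0: [Request(priority=5, name="long", ops_left=10)],
        3: [Request(priority=1, name="urgent", ops_left=2)],
    }
    print(simulate(arrivals, max_ticks=20))
```

In this toy run the urgent request arrives while the long one is mid-prefill and finishes at tick 5 instead of waiting until tick 12, while the scheduler is invoked only 4 times across 12 ticks of work. Real systems would also need to account for the cost of saving and restoring state at a preemption point, which this sketch ignores.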
Sources: