• Prefill/decode disaggregation improves latency-throughput tradeoff for large language model serving.
• Energy consumption remains high; autoscaling is too coarse-grained for rapid workload changes.
• BiScale introduces a two-tier energy optimization framework for disaggregated LLM serving.
• Coarse tier computes phase-aware placement and baseline frequencies to minimize energy while meeting SLOs.
• Fine tier adapts GPU frequency per iteration using MPC for prefill and slack-aware control for decode.
• Evaluation on 16x H100 cluster with Llama 3.3 70B shows up to 48% energy savings while meeting SLOs.
Article Summaries:
- BiScale is a two‑tier energy‑optimization framework for disaggregated large‑language‑model (LLM) serving. It jointly optimizes GPU placement and dynamic voltage‑frequency scaling (DVFS) across the prefill and decode stages using predictive latency and power models. At a coarse timescale, BiScale computes phase‑aware placement and baseline frequencies that satisfy service‑level objectives (SLOs) for time‑to‑first‑token (TTFT) and time‑per‑output‑token (TPOT). At a fine timescale, it adapts GPU frequency per iteration, using model‑predictive control (MPC) for prefill to account for queue dynamics and slack‑aware adjustment for decode, whose behavior is smoother and memory‑bound. Evaluated on a 16x H100 cluster running Llama 3.3 70B, BiScale meets TTFT/TPOT SLOs while cutting energy use by up to 39% in prefill and 48% in decode compared to the DistServe baseline.
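To make the idea of slack-aware frequency adjustment for decode more concrete, here is a minimal Python sketch. It is not BiScale's actual controller. The function name `pick_decode_frequency`, the `margin` parameter, the frequency list, and the toy latency model are all hypothetical, chosen only for illustration. The sketch assumes that some latency predictor is available and picks the lowest GPU frequency whose predicted per-token latency still fits within the TPOT SLO, minus a safety margin.

```python
from typing import Callable, Sequence


def pick_decode_frequency(
    freqs_mhz: Sequence[int],
    predict_latency_ms: Callable[[int], float],
    tpot_slo_ms: float,
    margin: float = 0.1,
) -> int:
    """Return the lowest frequency whose predicted per-token latency
    fits within the TPOT SLO after reserving a safety margin.

    All names and the margin-based policy are illustrative assumptions.
    """
    budget_ms = tpot_slo_ms * (1.0 - margin)
    # Try frequencies from lowest to highest: lower clocks draw less power.
    for freq in sorted(freqs_mhz):
        if predict_latency_ms(freq) <= budget_ms:
            return freq
    # No frequency meets the budget, so fall back to the fastest clock.
    return max(freqs_mhz)


def toy_latency_ms(freq_mhz: int) -> float:
    """Hypothetical latency model: a fixed memory-bound term plus a
    compute term that shrinks as the clock frequency rises."""
    return 20.0 + 30000.0 / freq_mhz


if __name__ == "__main__":
    freqs = [1000, 1200, 1500, 1980]  # assumed frequency steps in MHz
    for slo in (60.0, 48.0, 30.0):
        chosen = pick_decode_frequency(freqs, toy_latency_ms, slo)
        print(f"TPOT SLO {slo} ms: run decode at {chosen} MHz")
```

With a loose SLO the sketch settles on the lowest clock, and with an SLO that no frequency can meet it falls back to the highest one. A real controller would replace the toy model with a fitted latency and power model and would re-evaluate the choice every iteration.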
Sources: