Computer Science > Distributed, Parallel, and Cluster Computing
[Submitted on 18 Feb 2026]

Title: LLM-Driven Intent-Based Privacy-Aware Orchestration Across the Cloud-Edge Continuum

Abstract: With the rapid advancement of large language models (LLMs), efficiently serving LLM inference under limited GPU resources has become a critical challenge. Recently, an increasing number of studies have explored applying serverless computing paradigms to LLM serving in order to maximize resource utilization. However, LLM inference workloads are highly diverse, and modern GPU clusters are inherently heterogeneous, making it necessary to dynamically adjust deployment configurations online to better adapt to the elastic and dynamic nature of serverless environments. At the same time, enabling such online reconfiguration is particularly challenging due to the stateful nature of LLM inference and the massive size of model parameters. In this paper, we propose a dynamic pipeline reconfiguration approach that enables online adjustment of pipeline configurations while minimizing service downtime and performance degradation. Our method allows the system to select the optimal pipeline configuration in response to changing workloads.

Article Summaries:

  • Researchers have introduced a dynamic pipeline reconfiguration method for large language model (LLM) inference on heterogeneous GPU clusters. The approach targets the challenge of serving LLMs in serverless environments, where workloads vary and GPU resources are limited. By adjusting pipeline configurations online, the system keeps service downtime low (reported to be under 50 ms) and limits performance overhead to less than 10% for both time-to-first-token (TTFT) and time-per-output-token (TPOT). Experiments on NVIDIA A100 and L40 GPUs demonstrate the method's effectiveness in maintaining efficient, low-latency LLM inference across the cloud-edge continuum.
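
To make the reported bounds concrete, the following is a minimal sketch of how one might check a measured reconfiguration against them. It is not the paper's evaluation code: the names `LatencySample`, `relative_overhead`, and `within_reported_bounds` are hypothetical, the latency values in the usage lines are invented for illustration, and defining overhead as the relative increase over a steady-state baseline is an assumption.

```python
from dataclasses import dataclass

# Thresholds taken from the summary above; everything else is illustrative.
DOWNTIME_BUDGET_MS = 50.0  # reported service downtime bound
OVERHEAD_BUDGET = 0.10     # reported TTFT/TPOT overhead bound (10%)


@dataclass
class LatencySample:
    """Hypothetical container for one latency measurement."""
    ttft_ms: float  # time-to-first-token, in milliseconds
    tpot_ms: float  # time-per-output-token, in milliseconds


def relative_overhead(baseline: float, observed: float) -> float:
    """Relative increase of `observed` over `baseline` (assumed definition)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (observed - baseline) / baseline


def within_reported_bounds(
    baseline: LatencySample, during: LatencySample, downtime_ms: float
) -> bool:
    """True if downtime and both overheads stay under the reported bounds."""
    return (
        downtime_ms < DOWNTIME_BUDGET_MS
        and relative_overhead(baseline.ttft_ms, during.ttft_ms) < OVERHEAD_BUDGET
        and relative_overhead(baseline.tpot_ms, during.tpot_ms) < OVERHEAD_BUDGET
    )


# Made-up measurements: steady state vs. during a reconfiguration.
baseline = LatencySample(ttft_ms=200.0, tpot_ms=40.0)
during = LatencySample(ttft_ms=212.0, tpot_ms=43.0)
ok = within_reported_bounds(baseline, during, downtime_ms=35.0)
```

With these invented numbers, TTFT rises by 6% and TPOT by 7.5%, so the check passes; a TTFT of 230 ms during reconfiguration (15% over baseline) would fail it.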

Sources: