DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference
Computer Science > Distributed, Parallel, and Cluster Computing [Submitted on 25 Feb 2026]

Abstract: The performance of multi-turn, agentic LLM inference is increasingly dominated by KV-Cache storage I/O rather than computation. In prevalent disaggregated architectures, loading the massive KV-Cache from external storage creates a fundamental imbalance: storage NICs on prefill engines become bandwidth-saturated, while those on decoding engines remain idle. This asymmetry severely constrains overall system throughput. We present DualPath, an inference system that breaks this bottleneck by introducing dual-path KV-Cache loading. Beyond the traditional storage-to-prefill path, DualPath enables a novel storage-to-decode path, in which the KV-Cache is loaded into decoding engines and then efficiently transferred to prefill engines via RDMA over the compute network. DualPath combines this optimized data path, which inherently avoids network congestion and avoids interference with latency-critical model execution communications, with a global scheduler that dynamically balances ...
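
The dual-path idea can be illustrated with a toy model. The sketch below is not the paper's scheduler (the abstract does not describe its policy); it only shows, under simplifying assumptions, how routing some KV-Cache loads through otherwise idle decode-side storage NICs spreads the load. The names `Engine`, `pick_loader`, and `assign`, the greedy earliest-finish rule, and the bandwidth figures are all hypothetical, and the cost of the RDMA hop from decode to prefill engines is ignored.

```python
from dataclasses import dataclass


@dataclass
class Engine:
    name: str
    role: str                # "prefill" or "decode"
    nic_gbps: float          # assumed storage-NIC bandwidth in Gbit/s
    queued_gb: float = 0.0   # KV-Cache data (GB) already queued on this NIC

    def eta_seconds(self, size_gb: float) -> float:
        # Time until a new load of size_gb would finish on this storage NIC.
        return (self.queued_gb + size_gb) * 8 / self.nic_gbps


def pick_loader(engines, size_gb):
    # Hypothetical greedy rule: choose the engine whose storage NIC
    # would finish the load first (ties go to the earliest in the list).
    return min(engines, key=lambda e: e.eta_seconds(size_gb))


def assign(engines, size_gb):
    # Queue the load on the chosen engine and report which path it implies.
    loader = pick_loader(engines, size_gb)
    loader.queued_gb += size_gb
    if loader.role == "prefill":
        path = "storage-to-prefill"
    else:
        # In this path the KV-Cache would then be forwarded to a prefill
        # engine over the compute network; that hop is not modeled here.
        path = "storage-to-decode"
    return loader.name, path


engines = [Engine("p0", "prefill", 400.0), Engine("d0", "decode", 400.0)]
for _ in range(4):
    print(assign(engines, 10.0))
```

With two equally provisioned engines, the toy policy alternates between the two paths instead of piling every load onto the prefill-side NIC, which is the imbalance the abstract describes.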
Sources:
- https://arxiv.org/abs/2602.21548 (Latest source article published: 2026-02-26 05:00 UTC)