Learning to Evict from Key-Value Cache

Authors: Luca Moschella, Laura Manduchi, Ozan Sener

• The growing size of Large Language Models (LLMs) makes efficient inference challenging, primarily due to the memory demands of the autoregressive Key-Value (KV) cache.
• Existing eviction or compression methods reduce cost but rely on heuristics, such as recency or past attention scores, which serve only as indirect proxies for a token's future utility and introduce computational overhead.
• We reframe KV cache eviction as a reinforcement learning (RL) problem: learning to rank tokens by their predicted usefulness for future decoding.
• To this end, we introduce KV Policy (KVP), a framework of lightweight per-head RL agents trained on pre-computed generation traces using only key and value vectors.
• Each agent learns a specialized eviction policy guided by future utility, which evaluates the quality of the ranking across all cache budgets, requiring no modifications to the underlying LLM or additional inference.
• Evaluated across two different model families on the long-context benchmark RULER and the multi-turn dialogue benchmark OASST2-4k, KVP significantly outperforms baselines.

Article Summaries:

Learning to Evict from Key‑Value Cache

Researchers Luca Moschella, Laura Manduchi, and Ozan Sener propose a reinforcement‑learning framework, KV Policy (KVP), to manage the key‑value (KV) cache of large language models (LLMs). KVP trains lightweight per‑head agents on pre‑computed generation traces, using only key and value vectors to rank tokens by predicted future utility. Because each agent produces a ranking rather than a fixed keep/drop decision, the same policy applies at any cache budget, and it requires neither modifications to the underlying LLM nor additional inference. Evaluations on the RULER long‑context benchmark and the OASST2‑4k multi‑turn dialogue benchmark, across two model families, show KVP outperforming heuristic baselines. Zero‑shot tests on LongBench, BoolQ, and ARC indicate that the policy generalizes to longer contexts and diverse downstream tasks.
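
The ranking view of eviction can be illustrated with a small sketch. This is not the authors' implementation: the linear scorer, its weights, and the function names `linear_score`, `rank_tokens`, and `evict_to_budget` are all hypothetical stand-ins, and a trained per-head policy would replace the scorer. The sketch only shows why a single ranking over cached tokens can serve every cache budget: the ranking is computed once from key and value vectors, and a budget merely decides where to cut it.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def linear_score(key: Vector, value: Vector, w_key: Vector, w_value: Vector) -> float:
    """Hypothetical scorer: a weighted sum over one token's key and value vectors.

    Assumption for illustration only; the source does not describe the
    agents' architecture.
    """
    key_part = sum(k * w for k, w in zip(key, w_key))
    value_part = sum(v * w for v, w in zip(value, w_value))
    return key_part + value_part


def rank_tokens(
    keys: Sequence[Vector],
    values: Sequence[Vector],
    score_fn: Callable[[Vector, Vector], float],
) -> List[int]:
    """Return cache positions ordered from highest to lowest predicted utility."""
    scores = [score_fn(k, v) for k, v in zip(keys, values)]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def evict_to_budget(
    keys: Sequence[Vector],
    values: Sequence[Vector],
    ranking: Sequence[int],
    budget: int,
) -> Tuple[List[Vector], List[Vector], List[int]]:
    """Keep the `budget` highest-ranked entries, restoring positional order."""
    kept = sorted(ranking[:budget])
    return [keys[i] for i in kept], [values[i] for i in kept], kept


# Toy cache for one attention head: four tokens with 2-dimensional keys and values.
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [2.0, 0.0]]
values = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5], [0.0, 0.0]]

# Made-up weights standing in for a learned per-head policy.
w_key, w_value = [1.0, 0.0], [0.0, 0.5]

ranking = rank_tokens(keys, values, lambda k, v: linear_score(k, v, w_key, w_value))

# The same ranking is reused for two different budgets.
_, _, kept_two = evict_to_budget(keys, values, ranking, budget=2)
_, _, kept_three = evict_to_budget(keys, values, ranking, budget=3)
```

With these toy numbers the ranking is `[3, 0, 2, 1]`, so a budget of 2 keeps positions `[0, 3]` and a budget of 3 keeps `[0, 2, 3]`. A recency heuristic would instead keep the last positions regardless of content, which is the kind of indirect proxy the abstract contrasts against.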

Sources:

https://machinelearning.apple.com/research/evict (Latest source article published: 2026-02-23 00:00 UTC)