Asynchronous Verified Semantic Caching for Tiered LLM Architectures

Authors: Asmit Kumar Singh, Haozhe Wang, Laxmi Naga Santosh Attaluri, Tak Chiam, Weihua Zhu

Large language models (LLMs) now sit in the critical path of search, assistance, and agentic workflows, making semantic caching essential for reducing inference cost and latency. Production deployments typically use a tiered static-dynamic design: a static cache of curated, offline-vetted responses mined from logs, backed by a dynamic cache populated online. In practice, both tiers are commonly governed by a single embedding similarity threshold, which induces a hard tradeoff: conservative thresholds miss safe reuse opportunities, while aggressive thresholds risk serving semantically incorrect responses. We introduce Krites, an asynchronous, LLM-judged caching policy that expands static coverage without changing serving decisions. On the critical path, Krites behaves exactly like a standard static threshold policy. When the nearest static neighbor of the prompt falls just below the static threshold, Krites asynchronously invokes an LLM judge to verify whether the static response is acceptable for the new prompt.

Article Summaries:

  • Researchers introduce Krites, an asynchronous caching policy for tiered large-language-model (LLM) systems that expands the reach of the static cache without adding critical-path latency. Tiered designs pair a static cache of curated, offline-vetted responses with a dynamic cache populated online, and commonly govern both tiers with a single embedding-similarity threshold. That creates a trade-off: a conservative threshold misses safe reuse opportunities, while an aggressive one risks serving semantically incorrect responses. Krites leaves the serving decision unchanged. When a prompt's nearest static neighbor falls just below the threshold, it asynchronously asks an LLM judge to verify whether the static response is acceptable for that prompt. Approved matches are promoted to the dynamic cache, so future requests can reuse the curated answer. Trace-driven simulations on conversational and search workloads show up to a 3.9-fold increase in requests served by static or verified responses, with no change to critical-path latency.
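
The policy summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class name `KritesSketch`, the `gray_low` lower bound for "just below the threshold", the plain in-memory queue standing in for a background worker, and the choice to overwrite the dynamic entry for a prompt on promotion are all assumptions made for illustration. A single threshold `tau` governs both tiers, as the text describes for typical deployments.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Optional, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class Entry:
    embedding: List[float]
    response: str
    verified: bool = False  # True only for judge-approved static responses


def nearest(entries: Iterable[Entry], emb: List[float]) -> Tuple[Optional[Entry], float]:
    """Linear scan for the most similar entry (a real system would use an ANN index)."""
    best, best_sim = None, -1.0
    for e in entries:
        s = cosine(e.embedding, emb)
        if s > best_sim:
            best, best_sim = e, s
    return best, best_sim


class KritesSketch:
    """Hypothetical sketch: static threshold on the serving path, judge off it."""

    def __init__(self, static_entries: List[Entry],
                 judge: Callable[[str, str], bool],
                 tau: float = 0.95, gray_low: float = 0.80):
        self.static = static_entries
        self.dynamic: Dict[str, Entry] = {}  # keyed by prompt (assumption)
        self.judge = judge                   # stand-in for the LLM judge
        self.tau = tau                       # single threshold for both tiers
        self.gray_low = gray_low             # assumed lower edge of "just below"
        self.pending: List[Tuple[str, List[float], Entry]] = []

    def serve(self, prompt: str, emb: List[float],
              generate: Callable[[str], str]) -> Tuple[str, str]:
        """Critical path: identical to a plain threshold policy, plus an enqueue."""
        s_entry, s_sim = nearest(self.static, emb)
        if s_entry is not None and s_sim >= self.tau:
            return s_entry.response, "static"

        d_entry, d_sim = nearest(self.dynamic.values(), emb)
        hit = d_entry is not None and d_sim >= self.tau

        # Near miss on the static tier: schedule verification, never wait for it.
        in_gray_zone = s_entry is not None and s_sim >= self.gray_low
        if in_gray_zone and not (hit and d_entry.verified):
            self.pending.append((prompt, emb, s_entry))

        if hit:
            return d_entry.response, ("verified" if d_entry.verified else "dynamic")

        response = generate(prompt)
        self.dynamic[prompt] = Entry(emb, response)
        return response, "generated"

    def run_judge(self) -> int:
        """Off the critical path: promote judge-approved static responses."""
        promoted = 0
        while self.pending:
            prompt, emb, s_entry = self.pending.pop(0)
            if self.judge(prompt, s_entry.response):
                self.dynamic[prompt] = Entry(emb, s_entry.response, verified=True)
                promoted += 1
        return promoted
```

In this sketch, `serve` never calls the judge, so its latency matches that of an ordinary threshold policy. A prompt in the gray zone is answered from the dynamic cache or by generation the first time. Once `run_judge` has approved the static response, later requests that match the promoted entry receive the curated answer. In a deployment, `run_judge` would run on a background worker rather than being called explicitly.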

Sources: