Make Every Draft Count: Hidden State based Speculative Decoding

• arXiv, Computer Science: Computation and Language (cs.CL); submitted on 2 Feb 2026.
• Abstract: Speculative decoding has emerged as a pivotal technique to accelerate LLM inference: a lightweight draft model generates candidate tokens, which the target model then verifies in parallel.
• However, while this paradigm successfully increases the arithmetic intensity of memory-bound inference, it causes significant compute inefficiency: most draft tokens fail verification and are discarded, so the computation spent on them is wasted.
• Motivated by the goal of reclaiming this wasted computation, we propose a novel system that transforms discarded drafts into reusable tokens.
• Our key insight is to perform auto-regressive prediction at the hidden-state level and to postpone integrating token information until after the hidden states are generated, so the draft hidden states are not contaminated by incorrect tokens, which enables hidden-state reuse.
• To implement such a system, we first introduce a draft model architecture based on auto-regressive hidden states, which preserves richer semantics than…
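
Why parallel verification raises arithmetic intensity can be seen with a back-of-the-envelope count for a single linear layer. This is a simplified sketch, not taken from the paper: it assumes fp16 weights (2 bytes per parameter), counts only weight reads, and ignores activation and KV-cache traffic; `matmul_arithmetic_intensity` is a hypothetical helper.

```python
def matmul_arithmetic_intensity(
    d_in: int, d_out: int, n_tokens: int, bytes_per_param: int = 2
) -> float:
    """FLOPs per byte of weight traffic for one d_in x d_out linear layer.

    Simplifying assumption: only weight reads are counted (activations and the
    KV cache are ignored), roughly the regime of small-batch decoding.
    """
    flops = 2 * d_in * d_out * n_tokens            # one multiply-add per weight per token
    weight_bytes = d_in * d_out * bytes_per_param  # weights are read once per forward pass
    return flops / weight_bytes


if __name__ == "__main__":
    # Plain decoding: one token per pass.
    print(matmul_arithmetic_intensity(4096, 4096, n_tokens=1))  # 1.0
    # Verifying k = 4 drafts plus one bonus position in a single pass.
    print(matmul_arithmetic_intensity(4096, 4096, n_tokens=5))  # 5.0
```

Under these assumptions the weights are read once while k + 1 positions of work are done, so intensity grows roughly linearly in k until the layer becomes compute-bound. The catch the abstract points to is that the FLOPs spent on positions that are later rejected buy nothing.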
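
Where the discarded draft tokens come from can be sketched for greedy decoding. This is a simplified variant, assuming greedy (argmax) acceptance rather than the rejection-sampling rule used when sampling at nonzero temperature; `verify_greedy` is a hypothetical helper, not the paper's code. The target scores all k drafts plus one extra position in a single pass, and everything from the first mismatch onward is thrown away.

```python
from typing import List, Sequence, Tuple


def verify_greedy(
    draft_tokens: Sequence[int], target_argmax: Sequence[int]
) -> Tuple[List[int], int]:
    """Greedy verification of k draft tokens against one parallel target pass.

    target_argmax[i] is the target's greedy choice at draft position i; the
    extra last entry is the "bonus" token that follows all k drafts.
    Returns (tokens committed this step, number of draft tokens discarded).
    """
    if len(target_argmax) != len(draft_tokens) + 1:
        raise ValueError("the target must score k + 1 positions")
    committed: List[int] = []
    for i, tok in enumerate(draft_tokens):
        if tok != target_argmax[i]:
            # First mismatch: the target's token replaces this draft, and this
            # draft plus every later one is discarded.
            committed.append(target_argmax[i])
            return committed, len(draft_tokens) - i
        committed.append(tok)
    # Every draft matched, so the bonus token is committed as well.
    committed.append(target_argmax[-1])
    return committed, 0


if __name__ == "__main__":
    committed, wasted = verify_greedy([5, 9, 2, 7], [5, 9, 4, 1, 3])
    print(committed, wasted)  # [5, 9, 4] 2
```

A single early mismatch discards the whole remaining tail, which is why the wasted fraction can be large even when each individual draft token is fairly likely to be correct.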
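
The dependency structure behind the key insight, keeping the auto-regressive chain free of token information, can be illustrated with a deliberately tiny toy. Everything here is an assumption rather than the paper's architecture: hidden states are scalars, `step_token_conditioned`, `step_hidden_only`, and `late_readout` are hypothetical stand-ins for learned layers, and the weights are arbitrary. The sketch only shows that when the target rejects a draft token, a drafter that feeds tokens back into its state loses every later state, while a drafter whose state chain never reads tokens keeps all of them.

```python
import math
from typing import List

# Toy scalar "hidden states"; all weights below are arbitrary, hypothetical values.


def step_token_conditioned(h: float, token: int) -> float:
    # Conventional drafter: the next state reads the previous state AND the drafted token.
    return math.tanh(0.8 * h + 0.1 * token)


def step_hidden_only(h: float) -> float:
    # Hidden-state drafter (assumed form): the next state reads only the previous state.
    return math.tanh(0.8 * h + 0.3)


def late_readout(h: float, token: int) -> float:
    # Token information is merged only after the state chain exists (assumed form).
    return h + 0.1 * token


def token_conditioned_chain(h0: float, tokens: List[int]) -> List[float]:
    """States s_1..s_k with s_{i+1} = f(s_i, tokens[i]) and s_0 = h0."""
    states, h = [], h0
    for tok in tokens:
        h = step_token_conditioned(h, tok)
        states.append(h)
    return states


def hidden_only_chain(h0: float, k: int) -> List[float]:
    """States s_1..s_k with s_{i+1} = g(s_i); no token is ever read."""
    states, h = [], h0
    for _ in range(k):
        h = step_hidden_only(h)
        states.append(h)
    return states


def reusable_prefix(old: List[float], new: List[float]) -> int:
    """Number of leading states that are bit-identical after a token is corrected."""
    n = 0
    for a, b in zip(old, new):
        if a != b:
            break
        n += 1
    return n


if __name__ == "__main__":
    drafted = [3, 7, 1, 4]
    corrected = [3, 7, 6, 4]  # the target rejects position 2 and supplies token 6
    tc = reusable_prefix(token_conditioned_chain(0.5, drafted),
                         token_conditioned_chain(0.5, corrected))
    print("token-conditioned states reusable:", tc)  # 2
    ho_states = hidden_only_chain(0.5, len(drafted))
    ho = reusable_prefix(ho_states, hidden_only_chain(0.5, len(corrected)))
    print("hidden-only states reusable:", ho)  # 4
    # Only the cheap late readout at position 2 is redone with the corrected token.
    print("corrected readout:", late_readout(ho_states[2], corrected[2]))
```

In this toy the token-conditioned drafter keeps 2 of 4 states and the hidden-only drafter keeps all 4. How the actual system turns surviving hidden states into reusable tokens goes beyond what the abstract states.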

Sources:

https://arxiv.org/abs/2602.21224 (Latest source article published: 2026-02-26 05:00 UTC)