Trace Length is a Simple Uncertainty Signal in Reasoning Models

Authors: Siddhartha Devic†, Charlotte Peale‡, Arwen Bradley, Sinead Williamson, Preetum Nakkiran, Aravind Gollakota

Uncertainty quantification for LLMs is a key research direction towards addressing hallucination and other issues that limit their reliable deployment. In this work, we show that reasoning trace length is a simple and useful confidence estimator in large reasoning models. Through comprehensive experiments across multiple models, datasets, and prompts, we show that trace length performs in comparable but complementary ways to other zero-shot confidence estimators such as verbalized confidence. Our work reveals that reasoning post-training fundamentally alters the relationship between trace length and accuracy, going beyond prior work that had shown that post-training causes traces to grow longer in general (e.g., “overthinking”). We investigate the mechanisms behind trace length's performance as a confidence signal, observing that the effect remains even after adjusting for confounders such as problem difficulty and GRPO-induced length bias. We identify high-entropy or “forking” tokens as playing a key role in the mechanism.

Article Summaries:

Summary

A recent study demonstrates that the length of a reasoning trace (the sequence of intermediate steps produced by a large reasoning model) serves as an effective zero-shot confidence signal. Across multiple models, datasets, and prompts, trace length performs comparably to other zero-shot confidence estimators such as verbalized confidence, while providing complementary information. The authors show that reasoning post-training fundamentally alters the relationship between trace length and accuracy, a finding that goes beyond the earlier observation that post-training makes traces longer in general (“overthinking”). Their analysis identifies high-entropy or “forking” tokens as playing a key role in the mechanism, and the effect persists after adjusting for confounders such as problem difficulty and GRPO-induced length bias. The work positions trace length as a simple, practical confidence estimator for large reasoning models.
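
To make the idea of a length-based confidence estimator concrete, here is a minimal Python sketch of how one might check whether trace length separates correct from incorrect answers. It is an illustration, not the authors' evaluation code. Several things are assumptions that do not come from the summary above: the use of AUROC as the metric, the convention that shorter traces mean higher confidence (hence the negated length), the helper names `auroc` and `length_confidence`, and the toy numbers, which are made up purely for demonstration.

```python
from typing import List, Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a randomly chosen correct answer receives a higher
    confidence score than a randomly chosen incorrect one (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def length_confidence(trace_lengths: Sequence[int]) -> List[float]:
    """Hypothetical confidence score: the negated trace length.
    ASSUMPTION: shorter traces are treated as more confident."""
    return [-float(n) for n in trace_lengths]


# Toy, made-up data: token counts of six reasoning traces and whether
# each final answer was correct (1) or incorrect (0).
trace_lengths = [120, 150, 300, 900, 200, 1100]
is_correct = [1, 1, 1, 0, 0, 0]

score = auroc(length_confidence(trace_lengths), is_correct)
print(f"AUROC of negated trace length: {score:.3f}")
```

An AUROC near 0.5 would mean trace length carries no information about correctness under this convention, and a value well below 0.5 would suggest the sign assumption is reversed for the model in question. The same `auroc` helper could be applied to verbalized confidence scores to compare the two estimators on identical examples.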

Sources:

https://machinelearning.apple.com/research/trace-length