How Uncertain Is the Grade? A Benchmark of Uncertainty Metrics for LLM-Based Automatic Assessment

Computer Science > Artificial Intelligence. Submitted on 17 Feb 2026.

Abstract: The rapid rise of large language models (LLMs) is reshaping the landscape of automatic assessment in education. While these systems demonstrate substantial advantages in adaptability to diverse question types and flexibility in output formats, they also introduce new challenges related to output uncertainty, stemming from the inherently probabilistic nature of LLMs. Output uncertainty is an inescapable challenge in automatic assessment, as assessment results often play a critical role in informing subsequent pedagogical actions, such as providing feedback to students or guiding instructional decisions. Unreliable or poorly calibrated uncertainty estimates can lead to unstable downstream interventions, potentially disrupting students’ learning processes and resulting in unintended negative consequences. To systematically understand this challenge and inform future research, we benchmark a broad range of uncertainty quantification methods in the context of LLM-based automatic assessment.

Article Summaries:

Researchers have benchmarked uncertainty‑quantification methods for large language model (LLM) automatic grading systems, addressing a growing concern that probabilistic LLM outputs can produce unreliable confidence estimates. The study evaluated a wide range of metrics across multiple assessment datasets, LLM families, and decoding strategies to map how uncertainty behaves in grading scenarios. Findings highlight which uncertainty measures are most reliable and how model choice, task type, and generation control affect calibration. The work offers actionable insights for developing more trustworthy, uncertainty‑aware grading tools, aiming to reduce the risk of unstable downstream educational interventions.
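
To make "uncertainty" and "calibration" concrete, here is a minimal Python sketch of two generic measures often used for this kind of analysis: the entropy of grades sampled repeatedly for the same answer, and expected calibration error (ECE) over a set of graded answers. The text above does not list which metrics the paper benchmarks, so these two are assumptions chosen for illustration, and every function name is hypothetical.

```python
import math
from collections import Counter


def sample_entropy(grades):
    """Shannon entropy (in bits) of the empirical distribution of sampled grades.

    Assumption: `grades` holds the grades an LLM returned when asked to
    grade the same answer several times. 0 means every sample agreed.
    """
    counts = Counter(grades)
    n = len(grades)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def majority_confidence(grades):
    """Return the most frequent grade and the fraction of samples that chose it."""
    grade, count = Counter(grades).most_common(1)[0]
    return grade, count / len(grades)


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted mean gap between average confidence and accuracy.

    `confidences` are values in [0, 1]; `correct` marks whether each
    predicted grade matched the reference (for example, human) grade.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece
```

In this sketch, a grader that returns the same grade on every sample has entropy 0, while an even split between two grades gives 1 bit. A low ECE means the stated confidence tracks how often the grade is actually right, which is the property the summary refers to as calibration.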

Sources:

https://arxiv.org/abs/2602.16039