Quantifying construct validity in large language model evaluations

• LLM benchmarks often misrepresent true model capabilities due to contamination and annotator errors.
• Construct validity is essential to ensure benchmarks truly measure desired capabilities.
• Latent factor models ignore scaling laws, conflating size with capability.
• Scaling laws ignore measurement error, yielding uninterpretable, overfit capability estimates.
• Structured capabilities model integrates scale and error, extracting interpretable, generalisable capabilities.
• On OpenLLM Leaderboard, it outperforms latent factor models and scaling laws.
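The concern that a scale-blind summary of benchmark scores conflates size with capability can be illustrated with a toy simulation. The sketch below is not the paper's structured capabilities model. It assumes a hypothetical data-generating process in which each model's capability is its log size plus a size-independent residual, and each benchmark score is a noisy linear readout of that capability. All names and parameter values (`simulate`, `residualize`, the noise levels, the loading range) are assumptions made for illustration only.

```python
import random


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def simulate(n_models=300, n_benchmarks=6, noise_sd=0.3, seed=0):
    """Hypothetical generator: capability = log size + residual skill."""
    rng = random.Random(seed)
    # Each benchmark reads out capability with its own positive weight.
    loadings = [rng.uniform(0.5, 1.5) for _ in range(n_benchmarks)]
    log_size, residual, scores = [], [], []
    for _ in range(n_models):
        s = rng.uniform(0.0, 3.0)   # log model size (arbitrary units)
        r = rng.gauss(0.0, 0.5)     # skill not explained by size
        capability = s + r
        log_size.append(s)
        residual.append(r)
        # Benchmark score = loading * capability + measurement noise.
        scores.append([w * capability + rng.gauss(0.0, noise_sd)
                       for w in loadings])
    return log_size, residual, scores


def residualize(y, x):
    """Residuals of an ordinary least squares fit of y on x (with intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return [b - my - slope * (a - mx) for a, b in zip(x, y)]


log_size, residual, scores = simulate()

# Scale-blind summary: the average score across benchmarks.
composite = [sum(row) / len(row) for row in scores]
corr_size = pearson(composite, log_size)

# Crude scale adjustment: remove the part of the composite explained by size.
adjusted = residualize(composite, log_size)
corr_adjusted_size = pearson(adjusted, log_size)
corr_adjusted_residual = pearson(adjusted, residual)

print(f"composite vs log size:      {corr_size:.2f}")
print(f"adjusted  vs log size:      {corr_adjusted_size:.2f}")
print(f"adjusted  vs residual skill: {corr_adjusted_residual:.2f}")
```

In this toy setup the unadjusted composite tracks log size closely, while the size-adjusted score tracks the size-independent component instead. A linear adjustment like this is only a stand-in for the idea of modelling scale and measurement error together; it makes no claim about how the paper does so.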

Article Summaries:

[Submitted on 17 Feb 2026] Abstract: The LLM community often reports benchmark results as if they are synonymous with general model capabilities. However, benchmarks can have problems that distort performance, like test set contamination and annotator error. How can we know that a benchmark is a reliable indicator of some capability that we want to measure? This question concerns the construct validity of LLM benchmarks, and it requires separating bench

Sources:

https://arxiv.org/abs/2602.15532