CausalReasoningBenchmark: A Real-World Benchmark for Disentangled Evaluation of Causal Identification and Estimation

Computer Science > Artificial Intelligence. Submitted on 24 Feb 2026.

Abstract: Many benchmarks for automated causal inference evaluate a system’s performance based on a single numerical output, such as an Average Treatment Effect (ATE). This approach conflates two distinct steps in causal analysis: identification (formulating a valid research design under stated assumptions) and estimation (implementing that design numerically on finite data). We introduce CausalReasoningBenchmark, a benchmark of 173 queries across 138 real-world datasets, curated from 85 peer-reviewed research papers and four widely used causal-inference textbooks. For each query, a system must produce (i) a structured identification specification that names the strategy, the treatment, outcome, and control variables, and all design-specific elements, and (ii) a point estimate with a standard error. By scoring these two components separately, our benchmark enables granular diagnosis: it distinguishes failures in causal reasoning from errors in numerical execution. Baseline results with a state-of-the-art LLM show that, while the model correctly identifies the high-level strategy in 84% of cases, full identification-specification correctness drops to only 30%, revealing that the bottleneck lies in the nuanced details of research design rather than in computation.

Article Summaries:

The authors present CausalReasoningBenchmark, a new benchmark for evaluating automated causal inference. It contains 173 queries over 138 real-world datasets, curated from 85 peer-reviewed research papers and four causal-inference textbooks. For each query, a system must output two things: a structured identification specification (the strategy, the treatment, outcome, and control variables, and any design-specific elements) and a point estimate with its standard error. Because the two components are scored separately, the benchmark can tell failures in causal reasoning apart from errors in numerical execution. Baseline results with a state-of-the-art LLM show that the model picks the correct high-level strategy in 84% of cases but gets the full identification specification right in only 30%, which points to the details of research design, not computation, as the bottleneck. The benchmark is publicly released on Hugging Face to spur development of more robust causal-inference tools.
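
To make the two-part output concrete, here is a minimal Python sketch of how an identification specification and an estimate could be represented and scored separately. Everything in it is an assumption made for illustration: the class and field names, the exact-match rule for the full specification, the tolerance rule for the estimate, and the example query are hypothetical and are not taken from the benchmark's actual schema or scoring code.

```python
from dataclasses import dataclass, field


@dataclass
class IdentificationSpec:
    """Hypothetical container for the identification part of an answer."""
    strategy: str                      # e.g. "instrumental_variables"
    treatment: str
    outcome: str
    controls: frozenset = frozenset()  # order of controls should not matter
    design_elements: dict = field(default_factory=dict)  # e.g. {"instrument": ...}


@dataclass
class Estimate:
    """Hypothetical container for the estimation part of an answer."""
    point: float
    std_error: float


def score_identification(pred: IdentificationSpec, gold: IdentificationSpec) -> dict:
    """Score the strategy alone and the full specification (assumed exact match)."""
    return {
        "strategy_correct": pred.strategy == gold.strategy,
        "full_spec_correct": pred == gold,  # dataclass equality compares every field
    }


def score_estimate(pred: Estimate, gold: Estimate, z: float = 1.96) -> bool:
    """Assumed rule: the predicted point lies within z gold standard errors."""
    return abs(pred.point - gold.point) <= z * gold.std_error


# Made-up query: the prediction names the right strategy but the wrong instrument.
gold_spec = IdentificationSpec(
    strategy="instrumental_variables",
    treatment="schooling",
    outcome="wage",
    controls=frozenset({"age", "region"}),
    design_elements={"instrument": "quarter_of_birth"},
)
pred_spec = IdentificationSpec(
    strategy="instrumental_variables",
    treatment="schooling",
    outcome="wage",
    controls=frozenset({"region", "age"}),
    design_elements={"instrument": "distance_to_college"},
)

id_scores = score_identification(pred_spec, gold_spec)
est_ok = score_estimate(Estimate(0.12, 0.03), Estimate(0.10, 0.02))

print(id_scores)
print(est_ok)
```

In this toy case the strategy check passes while the full-specification check fails, and the estimate is scored independently of both. That is the kind of gap the reported 84% versus 30% figures describe, although the benchmark's real matching criteria may differ from the rules assumed here.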

Sources:

https://arxiv.org/abs/2602.20571 (Latest source article published: 2026-02-25 05:00 UTC)