Evaluation

TRACE: Trajectory-Aware Comprehensive Evaluation for Deep Research Agents

• Computer Science > Computation and Language [Submitted on 5 Feb 2026]

CausalReasoningBenchmark: A Real-World Benchmark for Disentangled Evaluation of Causal Identification and Estimation

• Computer Science > Artificial Intelligence [Submitted on 24 Feb 2026]

DREAM: Deep Research Evaluation with Agentic Metrics

• DREAM introduces agentic evaluation for AI research agents, addressing the lack of ground truth.
• Highlights Mirage of Synthesis: surface fluency can mask factual and reasoning flaw

A Systematic Evaluation of the Potential of Carbon-Aware Execution for Scientific Workflows

• Computer Science > Distributed, Parallel, and Cluster Computing [Submitted on 20 Aug 2025 (v1), last revised 20 Feb 2026 (v3)]

Simple Baselines are Competitive with Code Evolution

• Code evolution uses LLMs to mutate code, yet lacks baseline comparisons.
• Authors test two simple baselines across math bounds, agentic scaffolds, and ML contests.
• Baselines m

A Methodology for Identifying Evaluation Items for Practical Dialogue Systems Based on Business-Dialogue System Alignment Models

• Computer Science > Human-Computer Interaction [Submitted on 10 Jan 2026]

Toward Scalable Verifiable Reward: Proxy State-Based Evaluation for Multi-turn Tool-Calling LLM Agents

• Computer Science > Artificial Intelligence [Submitted on 18 Feb 2026]

BotzoneBench: Scalable LLM Evaluation via Graded AI Anchors

• Computer Science > Artificial Intelligence [Submitted on 22 Jan 2026]

PlotChain: Deterministic Checkpointed Evaluation of Multimodal LLMs on Engineering Plot Reading

• Computer Science > Artificial Intelligence [Submitted on 29 Jan 2026]

Group Note Draft: W3C Accessibility Guidelines Evaluation Methodology (WCAG-EM) 2.0

• The Accessibility Guidelines Working Group has published the first draft of a Group Note titled W3C Accessibility Guidelines Evaluation Methodology (WCAG-EM) 2.0.
• WCAG-EM desc

Community Evals: Because we're done trusting black-box leaderboards over the community

• Evaluation metrics have saturated (MMLU >91%, GSM8K >94%), yet models still fail real-world tasks.
• Inconsistent benchmark scores across papers, model cards, and platforms create no single t

CAISI Evaluation of DeepSeek AI Models Finds Shortcomings and Risks
