TRACE: Trajectory-Aware Comprehensive Evaluation for Deep Research Agents

• Computer Science > Computation and Language [Submitted on 5 Feb 2026]

Research & Labs · February 26, 2026 (updated February 26, 2026) · 2 min · 254 words
CausalReasoningBenchmark: A Real-World Benchmark for Disentangled Evaluation of Causal Identification and Estimation

• Computer Science > Artificial Intelligence [Submitted on 24 Feb 2026]

Research & Labs · February 25, 2026 (updated February 25, 2026) · 2 min · 331 words
DREAM: Deep Research Evaluation with Agentic Metrics

• DREAM introduces agentic evaluation for AI research agents, addressing the lack of ground truth.
• Highlights Mirage of Synthesis: surface fluency can mask factual and reasoning flaw

Research & Labs · February 24, 2026 (updated February 24, 2026) · 1 min · 181 words
A Systematic Evaluation of the Potential of Carbon-Aware Execution for Scientific Workflows

• Computer Science > Distributed, Parallel, and Cluster Computing [Submitted on 20 Aug 2025 (v1), last revised 20 Feb 2026 (this version, v3)]

Simple Baselines are Competitive with Code Evolution

• Code evolution uses LLMs to mutate code, yet lacks baseline comparisons.
• Authors test two simple baselines across math bounds, agentic scaffolds, and ML contests.
• Baselines m

Research & Labs · February 20, 2026 (updated February 24, 2026) · 1 min · 184 words
A Methodology for Identifying Evaluation Items for Practical Dialogue Systems Based on Business-Dialogue System Alignment Models

• Computer Science > Human-Computer Interaction [Submitted on 10 Jan 2026]

Research & Labs · February 19, 2026 (updated February 24, 2026) · 2 min · 239 words
Toward Scalable Verifiable Reward: Proxy State-Based Evaluation for Multi-turn Tool-Calling LLM Agents

• Computer Science > Artificial Intelligence [Submitted on 18 Feb 2026]

Research & Labs · February 19, 2026 (updated February 24, 2026) · 2 min · 255 words
Toward Scalable Verifiable Reward: Proxy State-Based Evaluation for Multi-turn Tool-Calling LLM Agents

• Computer Science > Artificial Intelligence [Submitted on 18 Feb 2026]

Research · February 19, 2026 (updated February 19, 2026) · 2 min · 226 words
BotzoneBench: Scalable LLM Evaluation via Graded AI Anchors

• Computer Science > Artificial Intelligence [Submitted on 22 Jan 2026]

Research · February 17, 2026 (updated February 19, 2026) · 2 min · 243 words
BotzoneBench: Scalable LLM Evaluation via Graded AI Anchors

• Computer Science > Artificial Intelligence [Submitted on 22 Jan 2026]

Research & Labs · February 17, 2026 (updated February 24, 2026) · 2 min · 243 words
PlotChain: Deterministic Checkpointed Evaluation of Multimodal LLMs on Engineering Plot Reading

• Computer Science > Artificial Intelligence [Submitted on 29 Jan 2026]

Research & Labs · February 17, 2026 (updated February 24, 2026) · 2 min · 280 words
PlotChain: Deterministic Checkpointed Evaluation of Multimodal LLMs on Engineering Plot Reading

• Computer Science > Artificial Intelligence [Submitted on 29 Jan 2026]

Research · February 17, 2026 (updated February 19, 2026) · 2 min · 280 words

Group Note Draft: W3C Accessibility Guidelines Evaluation Methodology (WCAG-EM) 2.0

• The Accessibility Guidelines Working Group has published the first draft of a Group Note titled W3C Accessibility Guidelines Evaluation Methodology (WCAG-EM) 2.0.
• WCAG-EM desc

Community Evals: Because we're done trusting black-box leaderboards over the community

• Evaluation metrics have saturated (MMLU >91%, GSM8K >94%), yet real-world tasks still fail.
• Inconsistent benchmark scores across papers, model cards, and platforms create no single t

CAISI Evaluation of DeepSeek AI Models Finds Shortcomings and Risks
