Benchmark

ProactiveMobile: A Comprehensive Benchmark for Boosting Proactive Intelligence on Mobile Devices

• Computer Science > Artificial Intelligence. Submitted on 25 Feb 2026.

CAGE: A Framework for Culturally Adaptive Red-Teaming Benchmark Generation

• Computer Science > Computers and Society. Submitted on 9 Feb 2026.

CausalReasoningBenchmark: A Real-World Benchmark for Disentangled Evaluation of Causal Identification and Estimation

• Computer Science > Artificial Intelligence. Submitted on 24 Feb 2026.

PreScience: A Benchmark for Forecasting Scientific Contributions

• Computer Science > Artificial Intelligence. Submitted on 24 Feb 2026. Abstract: Can AI systems train

ABD: Default Exception Abduction in Finite First Order Worlds

• ABD benchmark tests default‑exception abduction in finite first‑order logical worlds.
• Models generate sparse exception formulas to restore satisfiability under abnormality pred

Agentic AI for Scalable and Robust Optical Systems Control

• AgentOptics framework uses agentic AI for autonomous optical system control via MCP.
• Interprets natural language tasks, executes protocol-compliant actions across heterogeneous

Asymptotic Semantic Collapse in Hierarchical Optimization

• Asymptotic Semantic Collapse: dominant context absorbs individual semantics in multi‑agent language systems.
• Dominant Anchor Node with infinite inertia drives asymptotic alignm

INDUCTION: Finite-Structure Concept Synthesis in First-Order Logic

• INDUCTION benchmark tests finite-structure concept synthesis in first‑order logic across small relational worlds.
• Models output a single logical formula that uniformly explains

INSURE-Dial: A Phase-Aware Conversational Dataset & Benchmark for Compliance Verification and Phase Detection

• INSURE‑Dial is the first public benchmark for compliance‑aware voice agents in insurance calls.
• Corpus contains 50 real AI‑initiated calls and 1,000 synthetic calls, averaging

ReportLogic: Evaluating Logical Quality in Deep Research Reports

• LLMs increasingly synthesize research into structured reports, but logical reliability remains unassessed.
• ReportLogic benchmark quantifies report‑level logical quality for dee

AMUSE: Audio-Visual Benchmark and Alignment Framework for Agentic Multi-Speaker Understanding


IndicJR: A Judge-Free Benchmark of Jailbreak Robustness in South Asian Languages

• Computer Science > Artificial Intelligence. Submitted on 18 Feb 2026.

When AI Benchmarks Plateau: A Systematic Study of Benchmark Saturation

• Computer Science > Artificial Intelligence. Submitted on 18 Feb 2026.

How Uncertain Is the Grade? A Benchmark of Uncertainty Metrics for LLM-Based Automatic Assessment

• Computer Science > Artificial Intelligence. Submitted on 17 Feb 2026.

Benchmark of 1.4 million checked protein structures could sharpen AI predictions

• University of Missouri researchers have released the world’s largest collection of protein models with quality assessment, a groundbreaking new resource that could accelerate drug

AI system TongGeometry generates and solves olympiad-level geometry problems

• TongGeometry, an AI system, autonomously generates olympiad-level geometry problems for high-level competition.
• It also solves these problems, matching human expert accuracy an

Benchmark cuts Metaplanet target, says earnings show 'promise and peril' of bitcoin pivot

• Analysts say Metaplanet’s bitcoin-linked income business is becoming critical to funding expansion while avoiding forced BTC sales.

TemporalBench: A Benchmark for Evaluating LLM-Based Agents on Contextual and Event-Informed Time Series Tasks

• Computer Science > Artificial Intelligence. Submitted on 5 Feb 2026.

• TemporalBench offers a multi-domain benchmark for temporal reasoning in LLM agents.
• Four-tier taxonomy tests historical structure, context-free, contextual, and event-condition

Benchmark cuts Coinbase price target by 37% but says business is 'more diversified and durable' than ever

• Benchmark’s Mark Palmer cut his COIN price target to $267 from $421, citing worsening crypto market conditions, while reiterating a buy rating.

Alyah ⭐️: Toward Robust Evaluation of Emirati Dialect Capabilities in Arabic LLMs

• Arabic LLMs often evaluated only on Modern Standard Arabic, neglecting dialects.
• Dialects like Emirati carry unique vocabulary, syntax, and cultural context.
• Existing benchma

AssetOpsBench: Bridging the Gap Between AI Agent Benchmarks and Industrial Reality

• AssetOpsBench introduces a multi-agent benchmark for industrial asset lifecycle management.
• Covers 2.3M sensor telemetry points and 4.2K work orders.
• Evaluates agents across

How to Evaluate Jailbreak Methods: A Case Study with the StrongREJECT Benchmark

• When we began studying jailbreak evaluations, we found a fascinating paper claiming that you could jailbreak frontier LLMs simply by translating forbidden prompts into obscure la

• Researchers tested jailbreak via Scots Gaelic translation, initially replicating 43% success claim.
• GPT-4 responded with bomb instructions in Gaelic, but full output differed f

Are We Ready for Multi-Image Reasoning? Launching VHs: The Visual Haystacks Benchmark!

• Humans can sift through thousands of images, spotting subtle patterns, a skill AI still struggles to match.
• Traditional VQA systems answer questions about single images, missing