ProactiveMobile: A Comprehensive Benchmark for Boosting Proactive Intelligence on Mobile Devices

• Computer Science > Artificial Intelligence [Submitted on 25 Feb 2026]

Research & Labs · February 26, 2026 (updated February 26, 2026) · 2 min · 249 words
CAGE: A Framework for Culturally Adaptive Red-Teaming Benchmark Generation

• Computer Science > Computers and Society [Submitted on 9 Feb 2026]

Research & Labs · February 25, 2026 (updated February 25, 2026) · 2 min · 253 words
CausalReasoningBenchmark: A Real-World Benchmark for Disentangled Evaluation of Causal Identification and Estimation

• Computer Science > Artificial Intelligence [Submitted on 24 Feb 2026]

Research & Labs · February 25, 2026 (updated February 25, 2026) · 2 min · 331 words
PreScience: A Benchmark for Forecasting Scientific Contributions

• Computer Science > Artificial Intelligence [Submitted on 24 Feb 2026] Abstract: Can AI systems train

Research & Labs · February 25, 2026 (updated February 25, 2026) · 2 min · 279 words
ABD: Default Exception Abduction in Finite First Order Worlds

• ABD benchmark tests default‑exception abduction in finite first‑order logical worlds.
• Models generate sparse exception formulas to restore satisfiability under abnormality pred

Research & Labs · February 24, 2026 (updated February 24, 2026) · 1 min · 167 words
Agentic AI for Scalable and Robust Optical Systems Control

• AgentOptics framework uses agentic AI for autonomous optical system control via MCP.
• Interprets natural language tasks, executes protocol-compliant actions across heterogeneous

Asymptotic Semantic Collapse in Hierarchical Optimization

• Asymptotic Semantic Collapse: dominant context absorbs individual semantics in multi‑agent language systems.
• Dominant Anchor Node with infinite inertia drives asymptotic alignm

Research & Labs · February 24, 2026 (updated February 24, 2026) · 1 min · 207 words
INDUCTION: Finite-Structure Concept Synthesis in First-Order Logic

• INDUCTION benchmark tests finite-structure concept synthesis in first‑order logic across small relational worlds.
• Models output a single logical formula that uniformly explains

Research & Labs · February 24, 2026 (updated February 24, 2026) · 1 min · 191 words
INSURE-Dial: A Phase-Aware Conversational Dataset & Benchmark for Compliance Verification and Phase Detection

• INSURE‑Dial is the first public benchmark for compliance‑aware voice agents in insurance calls.
• Corpus contains 50 real AI‑initiated calls and 1,000 synthetic calls, averaging

Research & Labs · February 24, 2026 (updated February 24, 2026) · 1 min · 211 words
ReportLogic: Evaluating Logical Quality in Deep Research Reports

• LLMs increasingly synthesize research into structured reports, but logical reliability remains unassessed.
• ReportLogic benchmark quantifies report‑level logical quality for dee

Research & Labs · February 24, 2026 (updated February 24, 2026) · 1 min · 177 words
AMUSE: Audio-Visual Benchmark and Alignment Framework for Agentic Multi-Speaker Understanding

IndicJR: A Judge-Free Benchmark of Jailbreak Robustness in South Asian Languages

• Computer Science > Artificial Intelligence [Submitted on 18 Feb 2026]

Research & Labs · February 20, 2026 (updated February 24, 2026) · 2 min · 257 words
When AI Benchmarks Plateau: A Systematic Study of Benchmark Saturation

• Computer Science > Artificial Intelligence [Submitted on 18 Feb 2026]

Research & Labs · February 20, 2026 (updated February 24, 2026) · 2 min · 235 words
How Uncertain Is the Grade? A Benchmark of Uncertainty Metrics for LLM-Based Automatic Assessment

• Computer Science > Artificial Intelligence [Submitted on 17 Feb 2026]

Research · February 19, 2026 (updated February 19, 2026) · 2 min · 249 words
How Uncertain Is the Grade? A Benchmark of Uncertainty Metrics for LLM-Based Automatic Assessment

• Computer Science > Artificial Intelligence [Submitted on 17 Feb 2026]

Research & Labs · February 19, 2026 (updated February 24, 2026) · 2 min · 260 words

Benchmark of 1.4 million checked protein structures could sharpen AI predictions

• University of Missouri researchers have released the world’s largest collection of protein models with quality assessment, a groundbreaking new resource that could accelerate drug

Science · February 18, 2026 (updated February 24, 2026) · 1 min · 84 words

AI system TongGeometry generates and solves olympiad-level geometry problems

• TongGeometry, an AI system, autonomously generates olympiad-level geometry problems for high-level competition.
• It also solves these problems, matching human expert accuracy an

Science · February 17, 2026 (updated February 24, 2026) · 1 min · 129 words

Benchmark cuts Metaplanet target, says earnings show 'promise and peril' of bitcoin pivot

• Analysts say Metaplanet’s bitcoin-linked income business is becoming critical to funding expansion while avoiding forced BTC sales.

TemporalBench: A Benchmark for Evaluating LLM-Based Agents on Contextual and Event-Informed Time Series Tasks

• Computer Science > Artificial Intelligence [Submitted on 5 Feb 2026]

Research · February 17, 2026 (updated February 19, 2026) · 2 min · 267 words
TemporalBench: A Benchmark for Evaluating LLM-Based Agents on Contextual and Event-Informed Time Series Tasks

• TemporalBench offers a multi-domain benchmark for temporal reasoning in LLM agents.
• Four-tier taxonomy tests historical structure, context-free, contextual, and event-condition

Research & Labs · February 17, 2026 (updated February 24, 2026) · 1 min · 154 words

Benchmark cuts Coinbase price target by 37% but says business is 'more diversified and durable' than ever

• Benchmark’s Mark Palmer cut his COIN price target to $267 from $421, citing worsening crypto market conditions, while reiterating a buy rating.

Alyah ⭐️: Toward Robust Evaluation of Emirati Dialect Capabilities in Arabic LLMs

• Arabic LLMs are often evaluated only on Modern Standard Arabic, neglecting dialects.
• Dialects like Emirati carry unique vocabulary, syntax, and cultural context.
• Existing benchma

AssetOpsBench: Bridging the Gap Between AI Agent Benchmarks and Industrial Reality

• AssetOpsBench introduces a multi-agent benchmark for industrial asset lifecycle management.
• Covers 2.3M sensor telemetry points and 4.2K work orders.
• Evaluates agents across

How to Evaluate Jailbreak Methods: A Case Study with the StrongREJECT Benchmark

• When we began studying jailbreak evaluations, we found a fascinating paper claiming that you could jailbreak frontier LLMs simply by translating forbidden prompts into obscure la

Research · August 28, 2024 (updated February 19, 2026) · 2 min · 219 words
How to Evaluate Jailbreak Methods: A Case Study with the StrongREJECT Benchmark

• Researchers tested jailbreak via Scots Gaelic translation, initially replicating the 43% success claim.
• GPT-4 responded with bomb instructions in Gaelic, but full output differed f

Research & Labs · August 28, 2024 (updated February 24, 2026) · 1 min · 187 words
Are We Ready for Multi-Image Reasoning? Launching VHs: The Visual Haystacks Benchmark!

• Humans can sift through thousands of images, spotting subtle patterns, a skill AI still struggles to match.
• Traditional VQA systems answer questions about single images, missing

Research & Labs · July 20, 2024 (updated February 24, 2026) · 1 min · 200 words