Benchmarking

Benchmarking Distilled Language Models: Performance and Efficiency in Resource-Constrained Settings

• Computer Science > Computation and Language [Submitted on 28 Jan 2026]

Benchmarking Early Deterioration Prediction Across Hospital-Rich and MCI-Like Emergency Triage Under Constrained Sensing

• Computer Science > Computers and Society [Submitted on 9 Feb 2026]

Beyond single-channel agentic benchmarking

• Current AI safety benchmarks assess agents in isolation, ignoring human‑AI interaction dynamics.
• Single‑channel evaluation misrepresents operational safety, unlike redundancy‑b

DREAM: Deep Research Evaluation with Agentic Metrics

• DREAM introduces agentic evaluation for AI research agents, addressing lack of ground truth.
• Highlights Mirage of Synthesis: surface fluency can mask factual and reasoning flaw

AgentLAB: Benchmarking LLM Agents against Long-Horizon Attacks

• Computer Science > Artificial Intelligence [Submitted on 18 Feb 2026]

LLM-WikiRace: Benchmarking Long-term Planning and Reasoning over Real-World Knowledge Graphs

• Computer Science > Artificial Intelligence [Submitted on 18 Feb 2026]

Linux 7.0 Showing Some Early Performance Regressions On Intel Panther Lake

• Linux 7.0 kernel introduces regressions on Intel Panther Lake, reducing CPU and iGPU performance.
• Benchmarks on MSI Prestige 14 with Core Ultra X7 358H show slower results than

Quantifying construct validity in large language model evaluations

• LLM benchmarks often misrepresent true model capabilities due to contamination and annotator errors.
• Construct validity is essential to ensure benchmarks truly measure desired

LLM Inference Benchmarking - Measure What Matters

• By Piyush Srivastava, Karnik Modi, Stephen Varela, and Rithish Ramesh.
• Production-grade LLM inference is a complex systems challenge, requiring deep co-designs - from hardware pri

Advancing AI benchmarking with Game Arena

• DeepMind expands Game Arena, adding Werewolf and poker to benchmark AI beyond perfect-information games.
• Werewolf tests social deduction, communication, and deception handling

Ceilometer: Uber's Adaptive Benchmarking Framework

• At Uber, scale and reliability define our infrastructure.
• Every new server type, kernel upgrade, and configuration change must be rigorously vetted before it touch