Computer Science > Artificial Intelligence [Submitted on 18 Feb 2026]

Title: Towards a Science of AI Agent Reliability

Abstract: AI agents are increasingly deployed to execute important tasks. While rising accuracy scores on standard benchmarks suggest rapid progress, many agents still continue to fail in practice. This discrepancy highlights a fundamental limitation of current evaluations: compressing agent behavior into a single success metric obscures critical operational flaws. Notably, it ignores whether agents behave consistently across runs, withstand perturbations, fail predictably, or have bounded error severity. Grounded in safety-critical engineering, we provide a holistic performance profile by proposing twelve concrete metrics that decompose agent reliability along four key dimensions: consistency, robustness, predictability, and safety. Evaluating 14 agentic models across two complementary benchmarks, we find that recent capability gains have only yielded small improvements in reliability.
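The point that a single success metric can hide inconsistency across runs can be illustrated with a small sketch. The data, the agent names, and the `all_runs_agree` measure below are hypothetical and chosen only for illustration; they are not the twelve metrics proposed in the paper, which the abstract does not define.

```python
from statistics import mean

# Hypothetical data: for each task, the pass (1) / fail (0) outcome of four repeated runs.
agent_a = {"t1": [1, 1, 1, 1], "t2": [1, 1, 1, 1], "t3": [0, 0, 0, 0], "t4": [0, 0, 0, 0]}
agent_b = {"t1": [1, 0, 1, 0], "t2": [0, 1, 0, 1], "t3": [1, 1, 0, 0], "t4": [0, 0, 1, 1]}


def accuracy(runs):
    """Single-number view: mean success rate over all tasks and runs."""
    return mean(mean(outcomes) for outcomes in runs.values())


def all_runs_agree(runs):
    """Illustrative consistency measure: fraction of tasks whose runs all gave the same outcome."""
    return mean(1.0 if len(set(outcomes)) == 1 else 0.0 for outcomes in runs.values())


for name, runs in [("agent_a", agent_a), ("agent_b", agent_b)]:
    print(name, "accuracy:", accuracy(runs), "consistency:", all_runs_agree(runs))
```

In this toy data both agents have the same accuracy, yet one gives the same outcome on every repeat of a task while the other never does. An accuracy-only evaluation cannot tell them apart.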

