IBM and UC Berkeley Diagnose Why Enterprise Agents Fail Using IT-Bench and MAST

Ayhan Sebin, Saurabh Jha, Rohan Arora, Daby Sow, Mert Cemri, Melissa Pan, Ion Stoica

• IBM Research and UC Berkeley collaborated to study how agentic LLM systems break in real-world IT automation: tasks involving incident triage, logs/metrics queries, and Kubernetes actions in long-horizon tool loops.
• Benchmarks typically reduce performance to a single number, telling you whether an agent failed but never why.
• To solve this black-box problem, we applied MAST (Multi-Agent System Failure Taxonomy), an emerging practice for diagnosing agentic reliability.
• By leveraging MAST to analyze ITBench (the industry benchmark for SRE, Security, and FinOps automation), we turned raw execution traces into structured failure signatures, revealing exactly what broke and how to fix it.
• We annotated 310 ITBench SRE traces across three distinct model classes: Gemini-3-Flash, Kimi-K2, and GPT-OSS-120B.
• Key Findings:
  - Frontier models like Gemini-3-Flash fail cleanly (2.6 failure modes/trace), typically hitting isolated bottlenecks like verification.
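A "failure modes per trace" figure like the one quoted above could be derived from annotated traces roughly as follows. This is a minimal sketch under stated assumptions: the trace IDs, the label strings, and the `failure_signature` helper are hypothetical and do not reflect the actual ITBench or MAST data format, and the sketch assumes the average is taken over all annotated traces.

```python
from collections import Counter

# Hypothetical annotations: trace ID -> set of failure-mode labels.
# IDs and labels are illustrative only, not real ITBench or MAST data.
annotations = {
    "trace-001": {"incorrect_verification"},
    "trace-002": {"incorrect_verification", "premature_termination"},
    "trace-003": set(),  # a trace with no annotated failure mode
}


def failure_signature(annotations):
    """Return (mean failure modes per trace, number of traces per mode)."""
    if not annotations:
        return 0.0, Counter()
    # Count, for each failure mode, how many traces exhibit it.
    counts = Counter(mode for modes in annotations.values() for mode in modes)
    # Average number of distinct failure modes across all traces (assumption).
    mean = sum(len(modes) for modes in annotations.values()) / len(annotations)
    return mean, counts


mean_modes, mode_counts = failure_signature(annotations)
```

A low mean with one dominant mode would correspond to the "clean" failure pattern described above, while a high mean spread over many modes would suggest failures that compound within a trace.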

Article Summaries:

IBM Research and UC Berkeley applied the Multi‑Agent System Failure Taxonomy (MAST) to the IT‑Bench benchmark, which tests large‑language‑model agents on real‑world IT automation tasks such as incident triage, log queries, and Kubernetes operations. By annotating 310 execution traces from Gemini‑3‑Flash, Kimi‑K2, and GPT‑OSS‑120B, the study identified distinct failure patterns: frontier models mainly hit isolated bottlenecks like verification, while larger open models suffered cascading failures. The most common error was incorrect verification, and Kimi‑K2 frequently terminated prematurely or failed to recognize task completion. The authors recommend externalizing verification, adding explicit termination checks, and handling ambiguous inputs to improve agent reliability in enterprise settings.
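Two of the recommendations, externalized verification and an explicit termination check, can be illustrated with a small agent loop. This is only a sketch: `run_with_external_checks`, the `step` and `verify` callables, and the toy proposals are all hypothetical and are not taken from the ITBench or MAST code.

```python
from typing import Callable, Optional


def run_with_external_checks(
    step: Callable[[int], Optional[str]],
    verify: Callable[[str], bool],
    max_steps: int = 10,
) -> dict:
    """Run an agent loop whose completion claims are checked outside the agent.

    step(i) returns a proposed final answer, or None to keep working.
    verify(answer) is an independent check, e.g. re-querying system state.
    """
    for i in range(max_steps):
        answer = step(i)
        if answer is None:
            continue  # the agent has not claimed completion yet
        if verify(answer):
            # Explicit termination: stop only on an externally verified claim.
            return {"status": "verified", "answer": answer, "steps": i + 1}
        # Claim rejected: do not trust the agent's self-report, keep going.
    # Explicit termination: the step budget ran out without a verified answer.
    return {"status": "step_budget_exhausted", "answer": None, "steps": max_steps}


# Toy usage: the agent claims success too early at step 1, correctly at step 3.
proposals = {1: "restarted wrong pod", 3: "restarted checkout pod"}
result = run_with_external_checks(
    step=lambda i: proposals.get(i),
    verify=lambda a: a == "restarted checkout pod",
)
```

The design point is that the loop, not the model, decides when the task is done: an unverified claim never ends the run, and a fixed step budget guarantees the run ends even if no claim is ever verified.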

Sources:

https://huggingface.co/blog/ibm-research/itbenchandmast (Latest source article published: 2026-02-18 16:15 UTC)