WorkflowPerturb: Calibrated Stress Tests for Evaluating Multi-Agent Workflow Metrics

Computer Science > Artificial Intelligence [Submitted on 20 Feb 2026]

Abstract: LLM-based systems increasingly generate structured workflows for complex tasks. In practice, automatic evaluation of these workflows is difficult because metric scores are often not calibrated, and score changes do not directly communicate the severity of workflow degradation. We introduce WorkflowPerturb, a controlled benchmark for studying workflow evaluation metrics. It works by applying realistic, controlled perturbations to golden workflows. WorkflowPerturb contains 4,973 golden workflows and 44,757 perturbed variants across three perturbation types (Missing Steps, Compressed Steps, and Description Changes), each applied at severity levels of 10%, 30%, and 50%. We benchmark multiple metric families and analyze their sensitivity and calibration using expected score trajectories and residuals.
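
To make the perturbation setup concrete, the following is a minimal Python sketch of a Missing Steps perturbation at the three stated severity levels. It is not the authors' implementation: the function name `drop_steps`, the random selection of steps to remove, the rounding rule, and the toy ten-step workflow are all assumptions made for illustration. The final line only checks the arithmetic implied by the abstract (3 perturbation types × 3 severity levels per golden workflow).

```python
import random

SEVERITIES = (0.10, 0.30, 0.50)
PERTURBATION_TYPES = ("missing_steps", "compressed_steps", "description_changes")


def drop_steps(steps, severity, seed=0):
    """Hypothetical Missing Steps perturbation.

    Removes roughly `severity` of the steps (at least one when possible),
    always keeps at least one step, and preserves the original order.
    """
    rng = random.Random(seed)
    n_drop = min(max(1, round(len(steps) * severity)), max(0, len(steps) - 1))
    dropped = set(rng.sample(range(len(steps)), n_drop))
    return [step for i, step in enumerate(steps) if i not in dropped]


# Toy golden workflow (illustrative only, not taken from the dataset).
golden = [f"step {i}" for i in range(1, 11)]
variants = {severity: drop_steps(golden, severity) for severity in SEVERITIES}

# Each golden workflow yields one variant per (type, severity) pair.
n_variants = 4973 * len(PERTURBATION_TYPES) * len(SEVERITIES)
```

With a ten-step workflow, the three variants keep 9, 7, and 5 steps respectively, and `n_variants` matches the 44,757 perturbed variants reported in the abstract.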

Article Summaries:

WorkflowPerturb: Calibrated Stress Tests for Evaluating Multi-Agent Workflow Metrics

A new benchmark, WorkflowPerturb, has been introduced to assess the reliability of metrics used to evaluate workflows generated by large language models (LLMs). The dataset contains 4,973 “golden” workflows and 44,757 perturbed variants (nine per golden workflow), created by applying three realistic perturbation types (Missing Steps, Compressed Steps, and Description Changes) at severity levels of 10%, 30%, and 50%. The authors benchmark several metric families, examining their sensitivity and calibration through expected score trajectories and residuals. The findings reveal systematic differences across metric families and support severity-aware interpretation of workflow evaluation scores. The dataset will be released upon acceptance.
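
The idea of comparing a metric against an expected score trajectory can be sketched as follows. The summary does not define the expected trajectory, so this example assumes, purely for illustration, that a well-calibrated metric should fall linearly with severity (expected score = 1 − severity). The metric `step_f1`, the helper `expected_score`, and the toy workflow are likewise hypothetical and not taken from the paper.

```python
def step_f1(golden, candidate):
    """Toy metric: F1 over exact step matches (hypothetical, for illustration)."""
    g, c = set(golden), set(candidate)
    overlap = len(g & c)
    if overlap == 0:
        return 0.0
    precision = overlap / len(c)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)


def expected_score(severity):
    # Assumption: an ideally calibrated metric drops linearly with severity.
    return 1.0 - severity


# Toy golden workflow (illustrative only, not taken from the dataset).
golden = [f"step {i}" for i in range(1, 11)]

residuals = {}
for severity in (0.10, 0.30, 0.50):
    keep = len(golden) - round(len(golden) * severity)
    perturbed = golden[:keep]  # deterministic stand-in for a Missing Steps variant
    # Residual = observed metric score minus the assumed expected score.
    residuals[severity] = step_f1(golden, perturbed) - expected_score(severity)
```

Under these assumptions every residual is positive and grows with severity, meaning this toy metric under-reacts to missing steps relative to the linear expectation. This is the kind of pattern a residual analysis is meant to surface; a different expected trajectory would give different residuals.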

Sources:

https://arxiv.org/abs/2602.17990