Optimization Instability in Autonomous Agentic Workflows for Clinical Symptom Detection

Computer Science > Artificial Intelligence [Submitted on 17 Feb 2026]

Abstract: Autonomous agentic workflows that iteratively refine their own behavior hold considerable promise, yet their failure modes remain poorly characterized. We investigate optimization instability, a phenomenon in which continued autonomous improvement paradoxically degrades classifier performance, using Pythia, an open-source framework for automated prompt optimization. Evaluating three clinical symptoms with varying prevalence (shortness of breath at 23%, chest pain at 12%, and Long COVID brain fog at 3%), we observed that validation sensitivity oscillated between 1.0 and 0.0 across iterations, with severity inversely proportional to class prevalence. At 3% prevalence, the system achieved 95% accuracy while detecting zero positive cases, a failure mode obscured by standard evaluation metrics. We evaluated two interventions: a guiding agent that actively redirected optimization, which amplified overfitting rather than correcting it, and a selector agent that retrospectively identified the best-performing iteration, which successfully prevented catastrophic failure. With selector agent oversight, the system outperformed expert-curated lexicons on brain fog detection by 331% (F1) and on chest pain detection by 7%, despite requiring only a single natural language term as input.
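The gap between accuracy and sensitivity described in the abstract is easy to reproduce with a confusion matrix. The following is a minimal sketch, not code from the paper: the `Confusion` class and the counts (1,000 notes at 3% prevalence) are illustrative assumptions chosen so that accuracy comes out at 95% while no positive case is detected.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Binary confusion-matrix counts (hypothetical helper, not from the paper)."""
    tp: int  # true positives
    fp: int  # false positives
    fn: int  # false negatives
    tn: int  # true negatives

    def accuracy(self) -> float:
        total = self.tp + self.fp + self.fn + self.tn
        return (self.tp + self.tn) / total if total else 0.0

    def sensitivity(self) -> float:
        # Recall on the positive class: TP / (TP + FN).
        positives = self.tp + self.fn
        return self.tp / positives if positives else 0.0

    def f1(self) -> float:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0.
        denom = 2 * self.tp + self.fp + self.fn
        return 2 * self.tp / denom if denom else 0.0


# Assumed counts: 1,000 notes, 30 positives (3% prevalence), none detected,
# plus 20 false alarms among the 970 negatives.
example = Confusion(tp=0, fp=20, fn=30, tn=950)

print(f"accuracy={example.accuracy():.2f}")
print(f"sensitivity={example.sensitivity():.2f}")
print(f"f1={example.f1():.2f}")
```

With these assumed counts, accuracy is 0.95 while sensitivity and F1 are both 0, which is why accuracy alone cannot reveal this failure mode on a low-prevalence class.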

Article Summaries:

A recent study examined how autonomous AI workflows that self-optimize can become unstable when detecting rare clinical symptoms. Using the open-source Pythia framework for automated prompt optimization, researchers tested three symptoms: shortness of breath (23% prevalence), chest pain (12%), and Long COVID brain fog (3%). They found that validation sensitivity oscillated between 1.0 and 0.0 across optimization iterations, and that the instability grew more severe as prevalence fell. For brain fog, the lowest-prevalence symptom, the system reached 95% accuracy yet detected no positive cases. Two mitigation strategies were evaluated: an active guiding agent, which amplified overfitting instead of correcting it, and a retrospective selector agent, which chose the best-performing iteration. The selector approach prevented catastrophic failure and outperformed expert-curated lexicons by 331% (F1) on brain fog and by 7% on chest pain. The work highlights a critical failure mode of self-optimizing AI and shows that retrospective selection can stabilize low-prevalence classification tasks.
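Retrospective selection can be illustrated with a simple stand-in. The source does not say how the selector agent scores iterations, so this sketch assumes it ranks them by validation F1; `IterationResult`, `select_best`, and all values in `history` are hypothetical and invented for illustration, not results from the paper.

```python
from typing import NamedTuple, Sequence


class IterationResult(NamedTuple):
    """One optimization step and its validation metrics (hypothetical record)."""
    iteration: int
    val_sensitivity: float
    val_f1: float


def select_best(history: Sequence[IterationResult]) -> IterationResult:
    """Return the iteration with the highest validation F1.

    Assumption: F1 is the selection criterion. max() returns the first
    maximal element, so ties are resolved in favor of the earliest iteration.
    """
    if not history:
        raise ValueError("history must contain at least one iteration")
    return max(history, key=lambda result: result.val_f1)


# Invented trajectory in which sensitivity swings between 1.0 and 0.0.
history = [
    IterationResult(iteration=1, val_sensitivity=1.0, val_f1=0.10),
    IterationResult(iteration=2, val_sensitivity=0.0, val_f1=0.00),
    IterationResult(iteration=3, val_sensitivity=0.6, val_f1=0.55),
    IterationResult(iteration=4, val_sensitivity=0.0, val_f1=0.00),
    IterationResult(iteration=5, val_sensitivity=1.0, val_f1=0.12),
]

best = select_best(history)
print(f"selected iteration {best.iteration} with F1 {best.val_f1:.2f}")
```

The design point is that selection happens after the run and leaves the optimization loop untouched. Simply taking the final iteration here would yield a much weaker classifier (F1 0.12) than the one selected, and stopping at iteration 4 would yield one that detects nothing.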

Sources:

https://arxiv.org/abs/2602.16037