• Deployments drive 70% of incidents, making rapid fault detection critical for modern DevOps. • Datadog’s Automatic Faulty Deployment Detection uses APM telemetry to spot problematic releases. • Initial data set lacked labels, so engineers turned to weak supervision techniques. • Iterative unsupervised steps helped balance data and handle diverse application profiles. • The resulting supervised model accurately flags faulty deployments across multiple data centers. • Methodology can extend to other domains facing unlabeled, imbalanced data challenges.

Article Summaries:

  • Datadog’s new Automatic Faulty Deployment Detection feature tackles the problem that 70 % of production incidents stem from deployments. The team began with a massive, unlabeled log of service changes and faced three core challenges: no ground‑truth labels, a highly imbalanced class (faulty deployments are rare), and varied application profiles. To overcome this, they first defined a “faulty” deployment as one that triggers a significant rise in error rate. Using weak supervision and an iterative unsupervised pipeline, they generated pseudo‑labels, trained a supervised model, and refined it to handle diverse traffic patterns. The result is a scalable, data‑driven system that flags problematic releases across many environments.

Sources: