Failure is inevitable: Learning from a large outage, and building for reliability in depth at Datadog

Laura de Vesine, Rob Thomas, Maciej Kowalewski

In March 2023, Datadog experienced a rare, widespread incident that left large parts of our infrastructure only partially functional, but from a customer's perspective, our platform looked completely down. This square-wave failure pattern (up, then instantly down) revealed critical limitations in how our systems degraded and our ability to serve our customers. Since then, we've rethought our approach to failure across our products and infrastructure. In this post, we explain what we learned from the incident, how we responded, and what it takes to build for graceful degradation at scale.

The problem: What our March 2023 incident taught us

Revisiting our March 2023 incident, the immediate cause was an unsupervised global update, which caused a restart interaction that removed connectivity to approximately 50–60% of our Kubernetes nodes in production. While that level of platform loss would have a significant impact on any system, the effect on the Datadog platform was a nearly complete loss of user-facing functionality.

Article Summaries:

Datadog’s March 2023 outage, triggered by an unsupervised global update that disconnected 50–60% of its Kubernetes nodes, caused the platform to appear fully down for customers, despite some services still running. The incident exposed limits in Datadog’s degradation strategy and highlighted that root-cause analysis alone was insufficient. The company shifted focus to why the failure was so severe for users, noting that its design prioritized data correctness over partial availability. In response, Datadog is re-engineering its products and infrastructure to support graceful degradation at scale, aiming to prevent future outages from appearing catastrophic.
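
The trade-off between data correctness and partial availability can be illustrated with a small sketch. This is not Datadog's implementation: the shard readers, `ShardUnavailable`, `QueryResult`, `query_strict`, and `query_degraded` are all hypothetical names invented for this example. It assumes a query that fans out to several shards, some of which sit on unreachable nodes, and contrasts an all-or-nothing read with one that returns whatever data is available and reports how complete it is.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


class ShardUnavailable(Exception):
    """Raised by a (hypothetical) shard reader whose node is unreachable."""


@dataclass
class QueryResult:
    points: List[float]  # data from the shards that answered
    shards_ok: int       # number of shards that answered
    shards_total: int    # number of shards that were asked

    @property
    def complete(self) -> bool:
        return self.shards_ok == self.shards_total


Shards = Dict[str, Callable[[], List[float]]]


def query_strict(shards: Shards) -> QueryResult:
    """All-or-nothing: one unreachable shard fails the whole query.

    This favors correctness, but produces the square-wave pattern:
    the caller sees either a full answer or nothing at all.
    """
    points: List[float] = []
    for read in shards.values():
        points.extend(read())  # ShardUnavailable propagates to the caller
    return QueryResult(points, len(shards), len(shards))


def query_degraded(shards: Shards) -> QueryResult:
    """Graceful degradation: skip unreachable shards and report coverage."""
    points: List[float] = []
    ok = 0
    for read in shards.values():
        try:
            points.extend(read())
            ok += 1
        except ShardUnavailable:
            continue  # keep serving what is still available
    return QueryResult(points, ok, len(shards))
```

With one of three shards unreachable, `query_strict` raises and the caller gets nothing, while `query_degraded` returns the data from the two healthy shards with `complete` set to `False`, so a caller could display partial results along with a warning instead of an error page.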

Sources:

https://www.datadoghq.com/blog/engineering/rethinking-reliability/