• Kai Zong Khor, William Yu
• Many Datadog products offer a live view of their telemetry, allowing you to access your data in near real time from across your infrastructure.
• Live views improve responsiveness, but they also introduce strict requirements on data ingestion latency and scalability, as the systems backing the live view need to be able to handle large volumes of data in real time.
• At Datadog, we encountered these challenges as usage of our Processes and Containers products grew.
• Our system needed to be able to handle millions of incoming processes per second while still serving live data updates to users actively viewing the page.
• In this post, we'll discuss the challenges we encountered scaling our real-time data pipeline for processes and containers.
• We'll explain how we significantly improved the efficiency of this system to keep up with growing usage: reducing real-time traffic volume by 100x, slashing infrastructure costs by 98%, lowering the Datadog Agent's resource utilization, and improving the user experience, all at the same time.
Article Summaries:
- Datadog announced a major overhaul of its real-time process and container monitoring pipeline. As usage of the Processes and Containers products grew, the system had to ingest millions of processes per second while delivering live updates every two seconds. The original architecture required all tenant data to be stored in a single in-memory server, forcing costly vertical scaling and high data-traffic volumes. The new design reduces real-time traffic by 100×, cuts infrastructure costs by 98%, lowers Datadog Agent resource usage, and improves the user experience, all while maintaining responsive live views.
Sources: