Running a global observability platform means one thing above all: your infrastructure must never go down. When you’re responsible for monitoring thousands of customers’ applications 24/7, network failures aren’t just inconvenient; they’re existential threats. At New Relic, hundreds of clusters run across multiple clouds and regions. These clusters depend on a complex web of network connections: regional transit gateways, inter-regional hubs, and cross-cloud links. While we have resiliency to avoid single points of failure, if too many connections fail, our ability to provide real-time observability is compromised. We needed to know instantly if connectivity fails at any layer: within availability zones, between regions, or across cloud providers.

Article Summaries:

  • New Relic built an internal network‑monitoring system called Weather Station to keep its global observability platform from failing. The platform runs hundreds of clusters across multiple clouds and regions, relying on transit gateways, inter‑regional hubs, and cross‑cloud links. Traditional outage detection was too slow, so New Relic deployed a dedicated monitoring network that mirrors production topology and runs continuous connectivity checks. Weather Station now performs over 100,000 checks per hour, validating paths within availability zones, between regions, and across cloud providers. The system automatically adapts to new regions or clouds, providing engineers instant visibility into which network segment has failed.
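
As a rough illustration of checks that are tagged by network layer, here is a minimal Python sketch. It is not Weather Station's implementation, which the summary does not describe in detail. The `Endpoint` type, the scope labels, and the `tcp_check` and `run_checks` helpers are all hypothetical names, and a plain TCP connect is only assumed here as the simplest possible stand-in for a connectivity check.

```python
import socket
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    """A hypothetical probe target; every field name here is illustrative."""
    cloud: str
    region: str
    zone: str
    host: str
    port: int


def path_scope(src: Endpoint, dst: Endpoint) -> str:
    """Classify which network layer a check between two endpoints exercises."""
    if src.cloud != dst.cloud:
        return "cross-cloud"
    if src.region != dst.region:
        return "inter-region"
    if src.zone != dst.zone:
        return "inter-zone"  # assumed extra label, not named in the summary
    return "intra-zone"


def tcp_check(dst: Endpoint, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to dst succeeds within the timeout."""
    try:
        with socket.create_connection((dst.host, dst.port), timeout=timeout):
            return True
    except OSError:
        return False


def run_checks(src: Endpoint, targets, check=tcp_check) -> dict:
    """Run one check per target and group failed hosts by path scope."""
    failures = {}
    for dst in targets:
        if not check(dst):
            failures.setdefault(path_scope(src, dst), []).append(dst.host)
    return failures
```

Grouping failures by scope is what would let an engineer see at a glance whether a problem is confined to one zone or affects a cross-cloud link. The `check` parameter is injectable so the grouping logic can be exercised without touching a real network.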

Sources: