Hypergrowth isn't always easy

• Tailscale faced shaky uptime during holiday season, prompting transparency.
• Public uptime history page offers detailed incident logs and metrics.
• Coordination server issues can cause slow or partial outages, affecting some tailnets.
• Incident on Jan 5 lasted 24 minutes, impacted few tailnets, increased latency.
• Engineers detected issue early, repaired by taking shard offline, causing brief impact.
• Company commits to continuous improvement, documenting failures and lessons learned.

Article Summaries:

Tailscale has acknowledged that its uptime has been shakier than usual over the past month, including during the holiday season. The company maintains a public uptime history page with detailed incident logs, such as a 24‑minute incident on January 5 that affected a small number of tailnets and caused increased latency. Tailscale explains that engineers detected the underlying issue early and repaired it by taking a shard offline, and that this repair caused the brief, limited‑scope impact. The company says it is committed to continuous improvement by measuring blast radius, severity, and recovery time, and by documenting failures and lessons learned. The post also clarifies the evolution of its “coordination service,” which now runs on multiple servers, with each tailnet handled by a single coordination node at any given moment.
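To make the "blast radius" and "recovery time" measures concrete, here is a minimal Python sketch. It is an illustration only, not Tailscale's implementation: the `Incident` type, the function names, the tailnet-to-shard mapping, the shard names, and the calendar year and clock times are all hypothetical. Only the 24-minute duration and the one-node-per-tailnet model come from the summary above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical incident record; field names are assumptions."""
    started: datetime
    resolved: datetime
    affected_shards: set[str]


def blast_radius(assignment: dict[str, str], incident: Incident) -> float:
    """Fraction of tailnets whose single coordination shard was affected."""
    if not assignment:
        return 0.0
    hit = sum(1 for shard in assignment.values()
              if shard in incident.affected_shards)
    return hit / len(assignment)


def recovery_time(incident: Incident) -> timedelta:
    """Elapsed time from the start of the incident to its resolution."""
    return incident.resolved - incident.started


# Made-up assignment: each tailnet maps to exactly one shard at a time.
assignment = {
    "tailnet-a": "shard-1",
    "tailnet-b": "shard-2",
    "tailnet-c": "shard-2",
    "tailnet-d": "shard-3",
}

# Placeholder year and clock times; only the 24-minute span is from the summary.
incident = Incident(
    started=datetime(2000, 1, 5, 10, 0),
    resolved=datetime(2000, 1, 5, 10, 24),
    affected_shards={"shard-1"},
)

print(blast_radius(assignment, incident))  # share of tailnets on the affected shard
print(recovery_time(incident))             # duration of the incident
```

Because each tailnet lives on one node at a time, taking a single shard offline touches only the tailnets assigned to it, which is why the impact in this toy example stays limited to one of the four tailnets.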

Sources:

https://tailscale.com/blog/hypergrowth-isnt-always-easy