Improving MySQL® Cluster Uptime: Making MGR Viable at Scale

4 December 2025 / Global

Introduction

This is the second blog in a two-part series that describes how Uber adopted MySQL® Group Replication to improve MySQL cluster uptime. In the first part, we explored the architectural shift that took place within Uber’s MySQL infrastructure, from a reactive, externally-driven failover system to an internal consensus-based architecture powered by MGR (MySQL Group Replication). We introduced the concept of a three-node consensus group running in single-primary mode, explained the advantages of Paxos-based elections, and discussed how this setup addresses the core reliability challenges we previously faced.

In this part, we take this story further: How did we build this system in a scalable, operationally sound way? How do we safely onboard and offboard clusters? What happens when a node fails or is replaced?

Article Summaries:

The second part of Uber's blog series explains how the company scaled MySQL Group Replication (MGR) to improve cluster uptime. The team replaced a reactive, externally driven failover system with a three-node consensus group running in single-primary mode, using Paxos-based elections to avoid split-brain scenarios. They built an automated control plane that handles onboarding, offboarding, and rebalancing of clusters with one-click workflows. Onboarding selects a healthy bootstrap node, adds the remaining nodes to the group, synchronizes data, and then deactivates the previous asynchronous replication. Offboarding reverses these steps and restores asynchronous replication. Automation and strict configuration settings enable reliable, low-disruption operation at Uber's scale.
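The onboarding order described in the summary can be illustrated with a small simulation. This is only a sketch under stated assumptions, not Uber's control plane: the `Node` class, the `pick_bootstrap` and `onboard` functions, and the rule of choosing the healthy node with the lowest replication lag are all hypothetical. The source says only that a healthy bootstrap node is selected. No real MySQL calls are made; the code models the sequence of steps and nothing more.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical in-memory model of one MySQL host in a cluster."""
    name: str
    healthy: bool = True
    replication_lag_s: float = 0.0
    in_group: bool = False           # member of the replication group?
    async_replication: bool = True   # previous asynchronous replication still active?


def pick_bootstrap(nodes):
    """Pick a healthy node to bootstrap the group.

    Assumption: among healthy nodes, prefer the lowest replication lag.
    """
    candidates = [n for n in nodes if n.healthy]
    if not candidates:
        raise RuntimeError("no healthy node available to bootstrap the group")
    return min(candidates, key=lambda n: n.replication_lag_s)


def onboard(nodes):
    """Simulate onboarding and return the ordered list of steps taken."""
    steps = []

    # 1. Bootstrap the group on one healthy node.
    bootstrap = pick_bootstrap(nodes)
    bootstrap.in_group = True
    steps.append(f"bootstrap group on {bootstrap.name}")

    # 2. Add the remaining healthy nodes to the group.
    for node in nodes:
        if node is bootstrap:
            continue
        if not node.healthy:
            steps.append(f"skip unhealthy node {node.name}")
            continue
        node.in_group = True
        steps.append(f"join {node.name} to group")

    # 3. Wait for data to synchronize (modelled here as a single step).
    steps.append("wait for group members to synchronize")

    # 4. Only then turn off the previous asynchronous replication.
    for node in nodes:
        if node.in_group:
            node.async_replication = False
    steps.append("disable previous asynchronous replication")
    return steps


if __name__ == "__main__":
    cluster = [
        Node("db-a", replication_lag_s=2.0),
        Node("db-b", replication_lag_s=0.5),
        Node("db-c", replication_lag_s=1.0),
    ]
    for step in onboard(cluster):
        print(step)
```

Disabling the old replication path last means the cluster keeps its previous topology until the group has formed and caught up. That ordering is what lets offboarding act as a simple reversal of the same steps.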

Sources:

https://www.uber.com/blog/improving-mysql-cluster-uptime-part2/