How Uber Scaled Data Replication to Move Petabytes Every Day

• Uber’s global data lake exceeds 350 PB, spanning on‑prem and cloud regions.
• Limited bandwidth and multi‑region complexity make timely, reliable data replication a critical challenge.
• Uber leverages Hive Sync, built on Hadoop Distcp, to orchestrate large‑scale data copies.
• Distcp’s default design limits throughput; Uber introduced custom optimizations to boost speed and resilience.
• The architecture splits work into Copy Mappers, an Application Master, and a Committer for efficient parallelism.
• These enhancements enable Uber to reliably move petabytes daily, supporting disaster recovery and analytics workloads.
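
The division of labor among the Application Master, Copy Mappers, and Committer can be pictured with a small local-filesystem analogy. This is a toy sketch, not Distcp's or Uber's code: the function names (plan_copy, copy_mapper, commit), the greedy largest-first assignment, and the staging-directory rename are all assumptions chosen for illustration.

```python
import heapq
import os
import shutil


def plan_copy(files, num_mappers):
    """Planner role (hypothetical): assign files to mappers.

    `files` maps a relative path to its size in bytes. Files are handed out
    largest first, each to the mapper with the fewest bytes assigned so far.
    """
    heap = [(0, i) for i in range(num_mappers)]  # (bytes assigned, mapper id)
    plan = [[] for _ in range(num_mappers)]
    for path, size in sorted(files.items(), key=lambda kv: kv[1], reverse=True):
        load, mapper = heapq.heappop(heap)
        plan[mapper].append(path)
        heapq.heappush(heap, (load + size, mapper))
    return plan


def copy_mapper(paths, src_root, staging_root):
    """Copy Mapper role (hypothetical): copy assigned files into staging."""
    for rel in paths:
        dst = os.path.join(staging_root, rel)
        os.makedirs(os.path.dirname(dst), exist_ok=True)
        shutil.copyfile(os.path.join(src_root, rel), dst)


def commit(staging_root, final_root):
    """Committer role (hypothetical): publish staged output in one rename.

    Assumes final_root does not exist yet and shares a filesystem with
    staging_root, so readers never see a half-copied dataset.
    """
    os.rename(staging_root, final_root)
```

In this sketch a caller would run plan_copy once, invoke copy_mapper for each mapper's file list (in parallel in a real system), and call commit only after every mapper has finished.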

Article Summary:

Uber has upgraded its data‑replication system to move petabytes of data every day across on‑premise and cloud environments, serving a data lake that exceeds 350 PB. The company relies on Hive Sync, which uses Apache Hadoop Distcp for large‑scale copying, but Distcp’s original design struggled with Uber’s scale and bandwidth limits. To address this, Uber optimized Distcp’s architecture, improving job scheduling, container allocation, and block‑level copying, and integrated it with Hive Sync’s bulk and incremental sync capabilities. The result is a faster, more reliable, disaster‑ready data lake that keeps multi‑region datasets synchronized with minimal latency.
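
To see why bandwidth becomes the bottleneck at this scale, it helps to convert a daily volume into the average network rate it implies. The short calculation below is a back-of-envelope sketch: the helper name sustained_gbps and the choice of decimal petabytes (10^15 bytes) are assumptions, and the figures are not taken from Uber's article.

```python
PB = 10**15                      # decimal petabyte, in bytes (assumption)
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400 seconds


def sustained_gbps(petabytes_per_day):
    """Average rate, in gigabits per second, needed to move the volume in a day."""
    bytes_per_second = petabytes_per_day * PB / SECONDS_PER_DAY
    return bytes_per_second * 8 / 10**9
```

Under these assumptions, sustained_gbps(1) evaluates to roughly 92.6, so each petabyte moved per day requires about 92.6 Gbit/s of sustained throughput, before accounting for retries or uneven load.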

Sources:

https://www.uber.com/blog/scaled-data-replication/