Apache Hudi™ at Uber: Engineering for Trillion-Record-Scale Data Lake Operations

16 January / Global

The Foundation of Uber’s Data Platform

Uber operates one of the most diverse and demanding data ecosystems in the world. Every trip taken, order delivered, ad served, or real-time arrival time recalculated generates an unending stream of data. These data points come from hundreds of microservices, thousands of cities, and millions of riders, each with its own velocity, shape, and business-criticality. At the heart of this ecosystem lies Uber’s data lake: a multi-hundred-petabyte repository that fuels operational decisions, machine learning models, experimentation platforms, and real-time business intelligence. Powering this data lake requires far more than storing large volumes of data. Constant mutation, high cardinality, fast-changing schemas, and a relentless requirement for data freshness characterize Uber’s workloads.

Article Summaries:

Uber announced that it has built and deployed Apache Hudi, an open-source data lake storage engine, to meet the company’s need for trillion-record-scale, near-real-time data processing. The platform was created to address Uber’s rapidly growing, highly mutable data ecosystem, which spans hundreds of microservices, thousands of cities, and millions of riders, and for which existing file-based lake solutions could not support ACID transactions, fast upserts, or incremental processing. Hudi adds database-like primitives to the lake while preserving scalability, enabling Uber to keep tens of thousands of datasets fresh within minutes and powering operational decisions, ML models, and business intelligence across all product lines.
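To make the two primitives named in the summary concrete, here is a minimal Python sketch of an upsert keyed by record ID and an incremental read by commit time. It is a toy in-memory model written for illustration only: `ToyTable`, `upsert`, `incremental_read`, and the sample trip records are hypothetical names and data, not Hudi's API and not Uber's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Toy in-memory stand-in for a mutable lake table (hypothetical, not Hudi's API)."""
    rows: dict = field(default_factory=dict)  # record key -> (commit_time, payload)
    commit_time: int = 0

    def upsert(self, records):
        """Insert new keys and overwrite existing ones as a single commit."""
        self.commit_time += 1
        for key, payload in records.items():
            self.rows[key] = (self.commit_time, payload)
        return self.commit_time

    def incremental_read(self, since):
        """Return only the records written after commit `since`."""
        return {k: p for k, (t, p) in self.rows.items() if t > since}


table = ToyTable()
# First commit: two new trips (made-up sample data).
first = table.upsert({"trip-1": {"fare": 10.0}, "trip-2": {"fare": 7.5}})
# Second commit: one update to an existing trip and one new trip.
table.upsert({"trip-2": {"fare": 8.0}, "trip-3": {"fare": 12.0}})
# A downstream job that already consumed the first commit pulls only the changes.
changed = table.incremental_read(since=first)
print(changed)
```

The point of the sketch is the access pattern: a downstream consumer reads only what changed since its last checkpoint instead of rescanning the whole table, which is what makes minute-level freshness plausible at large scale.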

Sources:

https://www.uber.com/blog/apache-hudi-at-uber/