From Batch to Streaming: Accelerating Data Freshness in Uber's Data Lake

• Stay up to date with the latest from Uber Engineering Stay up to date with the latest from Uber Engineering Share Facebook X social Linkedin Envelope Link Introduction At Uber, the data lake is a foundational platform powering analytics and machine learning across the company. • Historically, ingestion into the lake was powered by batch jobs with freshness measured in hours. • As business needs evolved toward near-real-time insights, we re-architected ingestion to run on Apache Flink®, enabling fresher data, lower costs, and scalable operations at petabyte scale. • Over the past year, we built and validated IngestionNext, a new streaming-based ingestion system centered on Flink. • We proved its performance on some of Uber’s largest datasets, designed the control plane for operating thousands of jobs, and addressed streaming-specific challenges such as small file generation, partition skew, and checkpoint synchronization. • This blog describes the design of IngestionNext and early results that show improved freshness and meaningful efficiency gains compared to batch ingestion.

Article Summaries:

Uber has re‑engineered its data lake ingestion pipeline, moving from hourly batch jobs to a streaming architecture powered by Apache Flink. The new system, dubbed IngestionNext, ingests events from Kafka, writes them to Hudi‑formatted files, and is managed by a control plane that automates job lifecycle and scaling across thousands of datasets. This shift cuts data freshness from hours to minutes, enabling faster experimentation and model deployment, while reducing resource waste by eliminating fixed‑interval batch scheduling. Key challenges-such as small‑file generation, partition skew, and checkpoint synchronization-were addressed to maintain performance and reliability at petabyte scale.

Sources:

https://www.uber.com/blog/from-batch-to-streaming-accelerating-data-freshness-in-ubers-data-lake/