EXPEDIA GROUP TECHNOLOGY - DATA

Stop overwriting - start merging: a smarter approach to updating Iceberg tables

Apache Iceberg has emerged as a leading table format for data lakes, offering robust support for schema evolution, hidden partitioning, and time travel. When managing large-scale data, especially in partitioned tables, choosing the right data modification strategy is crucial. This article explores why MERGE INTO with the Merge-on-Read (MOR) strategy is often more advantageous than INSERT OVERWRITE, particularly as partitioning schemes evolve.

Understanding MERGE INTO vs. INSERT OVERWRITE

Apache Iceberg supports two primary strategies for handling row-level updates:

• Copy-on-Write (COW): Updates are applied by rewriting the entire data file that contains the affected rows, ensuring that all changes are reflected in the data files immediately. This approach provides strong consistency but can be resource-intensive.

• Merge-on-Read (MOR): Updates are stored as separate delta files (in Iceberg, delete files written alongside new data files), which are merged with the base data during query execution. This strategy optimizes write performance and is particularly beneficial for real-time analytics.

While both COW and MOR are designed for row-level updates, another common approach, INSERT OVERWRITE, rewrites entire partitions, which can be inefficient and lead to inconsistencies when partitioning schemes change.

INSERT OVERWRITE

Purpose: Completely replaces data in a table or partition with the result of a query.
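To make the difference between the strategies described above concrete, here is a minimal sketch in plain Python. It is a toy in-memory model, not Iceberg's implementation or API: `ToyTable`, its methods, the partition name, and the `rows_written` cost counter are all hypothetical names introduced for illustration. One partition holds four data files of 250 rows each, and a single row is changed three ways.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Toy model (an assumption, not Iceberg): partition -> list of immutable files."""
    data_files: dict = field(default_factory=dict)      # partition -> [tuple of (key, value)]
    position_deletes: set = field(default_factory=set)  # (partition, file_index, row_position)
    rows_written: int = 0  # data rows plus delete entries written; a rough cost proxy

    def append_file(self, partition, rows):
        self.data_files.setdefault(partition, []).append(tuple(rows))
        self.rows_written += len(rows)

    def insert_overwrite(self, partition, rows):
        """INSERT OVERWRITE: replace every file in the partition with the query result."""
        self.position_deletes = {d for d in self.position_deletes if d[0] != partition}
        self.data_files[partition] = []
        self.append_file(partition, rows)

    def update_cow(self, partition, key, value):
        """Copy-on-Write: rewrite each whole file that contains the key."""
        files = self.data_files[partition]
        for i, rows in enumerate(files):
            if any(k == key for k, _ in rows):
                files[i] = tuple((k, value if k == key else v) for k, v in rows)
                self.rows_written += len(rows)

    def update_mor(self, partition, key, value):
        """Merge-on-Read: record a delete for the old row, append a small new file."""
        found = False
        for i, rows in enumerate(self.data_files[partition]):
            for pos, (k, _) in enumerate(rows):
                if k == key and (partition, i, pos) not in self.position_deletes:
                    self.position_deletes.add((partition, i, pos))
                    self.rows_written += 1  # one delete entry
                    found = True
        if found:
            self.append_file(partition, [(key, value)])

    def scan(self, partition):
        """Read path: merge deletes with base data by skipping deleted positions."""
        result = {}
        for i, rows in enumerate(self.data_files.get(partition, [])):
            for pos, (k, v) in enumerate(rows):
                if (partition, i, pos) not in self.position_deletes:
                    result[k] = v
        return result


P = "2024-01-01"  # hypothetical partition value


def build():
    """Create a partition with 4 files of 250 rows, then reset the cost counter."""
    t = ToyTable()
    for f in range(4):
        t.append_file(P, [(f * 250 + j, "old") for j in range(250)])
    t.rows_written = 0  # measure only the cost of the update itself
    return t


cow, mor, overwrite = build(), build(), build()
cow.update_cow(P, 7, "new")
mor.update_mor(P, 7, "new")
overwrite.insert_overwrite(P, [(i, "new" if i == 7 else "old") for i in range(1000)])

# All three produce the same logical table contents.
assert cow.scan(P) == mor.scan(P) == overwrite.scan(P)
print(cow.rows_written, mor.rows_written, overwrite.rows_written)  # prints: 250 2 1000
```

In this toy model the write cost grows with the size of the file for COW and with the size of the partition for INSERT OVERWRITE, while MOR writes only the change and defers the merge to `scan`. The trade-off is that every MOR read has to check the accumulated deletes, which is why real tables with this pattern are typically compacted periodically.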
Article Summaries:
- Apache Iceberg’s MERGE INTO with Merge‑on‑Read (MOR) is presented as a more efficient alternative to INSERT OVERWRITE for updating partitioned tables. While Iceberg supports Copy‑on‑Write (COW) and Merge‑on‑Read (MOR) for row‑level changes, INSERT OVERWRITE rewrites entire partitions, which can be costly and inconsistent when partition schemes evolve. MERGE INTO targets only affected rows, appending delta logs that are merged at query time, reducing I/O, speeding writes, and lowering storage costs. The article highlights these performance and cost advantages, recommending MERGE INTO for incremental updates, CDC, and slowly changing dimensions.