Drastically Reducing Out-of-Memory Errors in Apache Spark at Pinterest

Felix Loesing | Software Engineer

In 2025, we set out to drastically reduce out-of-memory errors (OOMs) and cut resource usage in our Spark applications by automatically identifying tasks with higher memory demands and retrying them on larger executors with a feature we call Auto Memory Retries.

Spark Platform

Pinterest runs a large-scale Apache Spark deployment to satisfy the increasing demands of internal customers, such as AI/ML, experimentation, and reporting. We process 90k+ Spark jobs daily on tens of thousands of compute nodes with hundreds of PB in shuffle size.¹ Our clusters run on Kubernetes and mainly use Spark 3.2, with an upgrade to Spark 3.5 in progress. We use Apache Celeborn as our shuffle service and Apache YuniKorn as our scheduler, accelerate computation with Apache Gluten and Meta’s Velox, and submit jobs through our in-house submission service, Archer. An earlier blog post covers our data infrastructure in more detail.

Problem Identification

Historically, we knew that OOM errors were frequent in our clusters due to small executor sizes.

Article Summaries:

Pinterest’s Spark Platform has introduced “Auto Memory Retries” to cut out-of-memory (OOM) failures in its large-scale Apache Spark deployment. By automatically detecting tasks that exceed memory limits and retrying them on larger executors, the system avoids the need for manual tuning of executor sizes. The feature targets the 4.6% of job failures caused by OOMs, which at Pinterest’s scale involve 90k+ daily jobs and massive shuffle data. Early results show reduced resource consumption for OOM-failed jobs, improving reliability for AI/ML, experimentation, and reporting workloads running on Kubernetes-managed clusters.
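
The retry-on-larger-executor idea can be illustrated with a small simulation. This is only a sketch of the general pattern, not Pinterest's implementation: it does not use Spark at all, and every name in it (`Task`, `SimulatedOOM`, `run_on_executor`, `run_with_memory_retries`, the memory tiers) is hypothetical and chosen for illustration. It assumes a fixed list of executor sizes and that a task which fails with an OOM is simply retried on the next larger size.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


class SimulatedOOM(Exception):
    """Hypothetical stand-in for an executor out-of-memory failure."""


@dataclass
class Task:
    name: str
    peak_memory_mb: int  # what the task really needs; unknown to the scheduler


def run_on_executor(task: Task, executor_memory_mb: int) -> str:
    """Simulate one attempt: fail if the executor is smaller than the task's peak."""
    if task.peak_memory_mb > executor_memory_mb:
        raise SimulatedOOM(f"{task.name} exceeded {executor_memory_mb} MB")
    return f"{task.name} ok on {executor_memory_mb} MB"


def run_with_memory_retries(
    task: Task, tiers_mb: Sequence[int]
) -> Tuple[str, List[int]]:
    """Try each executor size in order, moving to the next one after an OOM.

    Returns the result and the list of sizes that were attempted.
    Raises SimulatedOOM if the task does not fit on any tier.
    """
    attempts: List[int] = []
    for memory_mb in tiers_mb:
        attempts.append(memory_mb)
        try:
            return run_on_executor(task, memory_mb), attempts
        except SimulatedOOM:
            continue  # retry on the next, larger executor size
    raise SimulatedOOM(f"{task.name} failed on all tiers {list(tiers_mb)}")


if __name__ == "__main__":
    tiers = [4096, 8192, 16384]  # assumed executor sizes in MB
    print(run_with_memory_retries(Task("small", 2000), tiers))
    print(run_with_memory_retries(Task("heavy", 6000), tiers))
```

In this sketch most tasks stay on the smallest executor and only the memory-hungry ones pay for a larger one, which is the intuition behind combining fewer OOM failures with lower overall resource usage.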

Sources:

https://medium.com/pinterest-engineering/drastically-reducing-out-of-memory-errors-in-apache-spark-at-pinterest-c55d7dac2257?source=rss----4c5a5f6279b6---4