Computer Science > Distributed, Parallel, and Cluster Computing
[Submitted on 14 Apr 2025 (v1), last revised 19 Feb 2026 (this version, v2)]
Title: DeepCompile: A Compiler-Driven Approach to Optimizing Distributed Deep Learning Training
Abstract: The rapid growth of deep learning models has increased the demand for efficient distributed training strategies. Fully sharded approaches like ZeRO-3 and FSDP partition model parameters across GPUs and apply optimizations such as prefetching and unsharding to reduce communication overhead. However, these systems lack fine-grained control over memory and communication scheduling, making it difficult to balance computation–communication overlap with memory requirements. Coordinating multiple optimizations such as prefetching and unsharding is also difficult, since their effects on memory usage can influence each other. To tackle these challenges, we propose DeepCompile, a compiler-based optimization framework for distributed training. DeepCompile transforms user-defined models into computation graphs and applies a series of profiling-guided optimization passes, each modifying the graph based on profiling information such as execution time and memory usage.
Article Summaries:
- DeepCompile: Compiler‑Driven Optimizations for Distributed Deep‑Learning Training
Researchers introduced DeepCompile, a compiler‑based framework that transforms user models into computation graphs and applies a series of profiling‑guided optimization passes. Each pass can insert, reorder, or remove operations such as all‑gather and memory allocation, giving fine‑grained control over communication‑computation overlap and memory usage. Because every pass works on the same graph and the same profiling data, optimizations whose memory effects interact can be coordinated; the system uses this to implement proactive prefetching, selective unsharding, and adaptive offloading. Reported benchmarks show speedups of up to 1.28× over ZeRO‑3 and up to 1.54× over FSDP, and a throughput increase of up to 7.01× in GPU‑constrained settings.
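To make the idea of a profiling-guided pass concrete, here is a minimal sketch in Python. It is not DeepCompile's implementation or API: the `Op` record, the flat list standing in for a computation graph, the memory model (a gathered parameter is freed right after the layer that uses it), and the functions `baseline_graph`, `peak_memory`, and `prefetch_pass` are all hypothetical, chosen only to illustrate a pass that reorders all-gather operations under a memory budget.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Op:
    name: str
    kind: str           # "allgather" or "compute" (toy op kinds, assumed)
    param: str          # parameter this op gathers or consumes
    mem_bytes: int = 0  # profiled size of the gathered parameter (allgather only)


def baseline_graph(layers: List[str], sizes: Dict[str, int]) -> List[Op]:
    """Gather each parameter immediately before the layer that uses it."""
    graph: List[Op] = []
    for p in layers:
        graph.append(Op(f"allgather_{p}", "allgather", p, sizes[p]))
        graph.append(Op(f"layer_{p}", "compute", p))
    return graph


def peak_memory(graph: List[Op]) -> int:
    """Peak bytes held by gathered parameters, assuming each one is
    released right after the compute op that consumes it."""
    live: Dict[str, int] = {}
    peak = 0
    for op in graph:
        if op.kind == "allgather":
            live[op.param] = op.mem_bytes
            peak = max(peak, sum(live.values()))
        else:
            live.pop(op.param, None)
    return peak


def prefetch_pass(graph: List[Op], budget: int) -> List[Op]:
    """Hoist each all-gather above earlier compute ops so communication can
    overlap with computation, as long as peak memory stays within budget."""
    graph = list(graph)
    for i in range(len(graph)):
        if graph[i].kind != "allgather":
            continue
        j = i
        # Move only past compute ops, so the order of gathers is preserved.
        while j > 0 and graph[j - 1].kind == "compute":
            candidate = graph[: j - 1] + [graph[j], graph[j - 1]] + graph[j + 1:]
            if peak_memory(candidate) > budget:
                break
            graph = candidate
            j -= 1
    return graph


sizes = {"a": 4, "b": 4, "c": 4}  # made-up profiled sizes
graph = baseline_graph(["a", "b", "c"], sizes)
optimized = prefetch_pass(graph, budget=8)
print([op.name for op in optimized])
```

With a budget of 8, the sketch hoists each gather one layer early, so at most two parameters are resident at once. With a budget of 4 it leaves the baseline order untouched. A real system would also need measured execution times to decide how far ahead a prefetch is worth issuing; this toy version only considers memory.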
Sources: