• Computer Science > Distributed, Parallel, and Cluster Computing [Submitted on 19 Feb 2026] Title: GPU Memory and Utilization Estimation for Training-Aware Resource Management: Opportunities and Limitations. Abstract: Collocating deep learning training tasks improves GPU utilization but causes drastic slowdowns due to resource contention and risks Out-of-Memory (OOM) failures. Accurate memory estimation is essential for robust collocation, while GPU utilization – a key proxy for resource contention – enables interference-aware scheduling to reduce slowdowns and improve throughput. Existing GPU memory estimators span three paradigms – analytical models, CPU-side libraries, and ML-based estimators – each with distinct limitations: dependence on detailed model specifications, intrusive integration, poor generalization, and varying latency overhead. GPU heterogeneity further complicates estimation, as identical tasks can exhibit markedly different memory footprints across hardware generations. GPU utilization remains comparatively understudied, further complicated by the non-additive nature of utilization metrics and hardware sensitivity. We conduct a systematic analysis of representative estimators from each paradigm – Horus, PyTorch FakeTensor, and our lightweight ML-based estimator – evaluating accuracy, generalizability, and practical overhead.
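As a rough illustration of the analytical paradigm mentioned in the abstract, the following minimal sketch estimates a lower bound on training memory for a plain MLP. It is not the Horus formula or the authors' estimator. The names `MLPSpec`, `count_parameters`, and `estimate_training_bytes` are hypothetical, and the sketch assumes fp32 tensors, an Adam-style optimizer with two state tensors per parameter, and one stored activation per layer. It ignores the CUDA context, allocator fragmentation, and kernel workspaces, so real footprints will be larger.

```python
from dataclasses import dataclass

BYTES_FP32 = 4  # assumption: all tensors are stored as 32-bit floats


@dataclass(frozen=True)
class MLPSpec:
    """Hypothetical description of a fully connected network."""
    layer_sizes: tuple[int, ...]  # e.g. (784, 512, 10): input, hidden, output
    batch_size: int


def count_parameters(layer_sizes):
    """Weights plus biases of each dense layer."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


def estimate_training_bytes(spec, optimizer_states_per_param=2):
    """Lower-bound estimate: weights + gradients + optimizer state + activations."""
    params = count_parameters(spec.layer_sizes)
    # One copy for weights, one for gradients, plus the optimizer state tensors.
    static_bytes = params * (2 + optimizer_states_per_param) * BYTES_FP32
    # Assumption: the input batch and every layer output are kept for backward.
    activation_bytes = spec.batch_size * sum(spec.layer_sizes) * BYTES_FP32
    return static_bytes + activation_bytes


if __name__ == "__main__":
    spec = MLPSpec(layer_sizes=(784, 512, 10), batch_size=64)
    print(f"{estimate_training_bytes(spec) / 2**20:.2f} MiB")
```

For the example specification the estimate is about 6.53 MiB. Note that the formula needs the full layer specification up front, which is the "dependence on detailed model specifications" limitation the abstract attributes to analytical models.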
Article Summaries:
- A recent study examines how to predict GPU memory usage and utilization for deep-learning training tasks, both of which matter for efficiently collocating tasks on shared GPUs. The authors compare three estimation approaches (analytical models, CPU-side libraries, and lightweight machine-learning predictors) using a synthetic dataset of MLPs, CNNs, and Transformers. Their experiments reveal that analytical models are hardware-specific, CPU-side libraries require intrusive integration, and ML models struggle to generalize across architectures. The paper also highlights the limited research on GPU utilization estimation, noting that utilization metrics are non-additive and hardware-sensitive. All datasets and tools are released to aid further investigation.
- Computer Science > Distributed, Parallel, and Cluster Computing [Submitted on 19 Feb 2026 (v1), last revised 24 Feb 2026 (this version, v2)] Title: GPU Memory and Utilization Estimation for Training-Aware Resource Management: Opportunities and Limitations. Abstract: Collocating deep learning training tasks improves GPU utilization but risks resource contention, severe slowdowns, and out-of-memory (OOM) failures. Accurate memory estimation is essential for robust collocation, and GPU utilization estimation – a key proxy for contention – enables interference-aware sche
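To make the collocation reasoning in the summaries concrete, here is a minimal sketch under stated assumptions. The helpers `fits_in_memory` and `utilization_bounds` are hypothetical and do not come from the paper. The first assumes that per-task memory estimates simply add up and reserves a 10% headroom against estimation error. The second treats utilization as the fraction of time in which at least one kernel is running, so that the combined value of collocated tasks depends on how their busy periods overlap and cannot be obtained by summing. It also ignores any slowdown caused by contention.

```python
def fits_in_memory(estimates_bytes, capacity_bytes, headroom=0.1):
    """Return True if the summed estimates fit within capacity minus headroom.

    Assumption: memory footprints of collocated tasks are additive.
    """
    usable = capacity_bytes * (1.0 - headroom)
    return sum(estimates_bytes) <= usable


def utilization_bounds(utils):
    """Bounds on combined time-based utilization of collocated tasks.

    Each value in `utils` is a fraction in [0, 1]. If busy periods overlap
    completely, the combined value equals the largest one; if they never
    overlap, it equals the sum, capped at 1.0.
    """
    if not utils:
        raise ValueError("utils must contain at least one value")
    lower = max(utils)
    upper = min(1.0, sum(utils))
    return lower, upper


if __name__ == "__main__":
    GIB = 2**30
    print(fits_in_memory([6 * GIB, 8 * GIB], capacity_bytes=16 * GIB))
    print(utilization_bounds([0.6, 0.7]))
```

Two tasks measured at 60% and 70% in isolation would therefore land somewhere between 70% and 100% when collocated, which is why a scheduler cannot treat utilization like memory and add the individual values.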
Sources:
- https://arxiv.org/abs/2602.17817 (Latest source article published: 2026-02-25 05:00 UTC)