Accelerating Mobile Inference through Fine-Grained CPU-GPU Co-Execution

[Submitted on 24 Oct 2025 (v1), last revised 18 Feb 2026 (this version, v2)]

Abstract: Deploying deep neural networks on mobile devices is increasingly important but remains challenging due to limited computing resources. On the other hand, their unified memory architecture and narrower gap between CPU and GPU performance provide an opportunity to reduce inference latency by assigning tasks to both CPU and GPU. The main obstacles for such collaborative execution are the significant synchronization overhead required to combine partial results, and the difficulty of predicting execution times of tasks assigned to CPU and GPU (due to the dynamic selection of implementations and parallelism level). To overcome these obstacles, we propose both a lightweight synchronization mechanism based on OpenCL fine-grained shared virtual memory (SVM) and machine learning models to accurately predict execution times. Notably, these models capture the performance characteristics of GPU kernels and account for their dispatch times. A comprehensive evaluation on four mobile platforms shows that our approach can quickly select CPU-GPU co-execution strategies achieving up to 1.89x speedup for linear layers and 1.75x speedup for convolutional layers (close to the achievable maximum values of 2.01x and 1.87x, respectively, found by exhaustive grid search on a Pi

Article Summaries:

Researchers have developed a method to speed up deep-learning inference on mobile devices by coordinating CPU and GPU execution at a fine-grained level. The approach uses OpenCL’s fine-grained shared virtual memory (SVM) to reduce the synchronization overhead of combining partial results, and machine-learning models to predict task execution times, including GPU kernel dispatch times. Evaluated on four mobile platforms, the technique selects co-execution strategies that deliver up to 1.89× speedup for linear layers and 1.75× for convolutional layers, close to the achievable maximums of 2.01× and 1.87× found by exhaustive grid search on a Pixel 5. The work demonstrates that collaborative CPU-GPU execution can significantly lower inference latency on resource-constrained mobile hardware.
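
To make the role of execution-time prediction concrete, here is a minimal Python sketch of how a CPU-GPU split could be chosen once per-device latency predictors are available. It is not the authors' implementation: the function pick_split, the toy linear predictors toy_cpu_ms and toy_gpu_ms, and every constant in them are hypothetical stand-ins for the learned models the paper describes. The sketch assumes a layer's output channels can be divided between the two devices, that both parts run concurrently, and that a fixed synchronization cost applies whenever both devices take part.

```python
from typing import Callable, Tuple


def pick_split(
    total_channels: int,
    predict_cpu_ms: Callable[[int], float],
    predict_gpu_ms: Callable[[int], float],
    sync_overhead_ms: float = 0.0,
) -> Tuple[int, float]:
    """Return (channels given to the CPU, predicted latency in ms).

    Tries every possible split and keeps the one with the lowest
    predicted latency. The two devices are assumed to run in parallel,
    so a split costs as much as its slower side, plus the cost of
    synchronizing when both devices are used.
    """
    best_split, best_latency = 0, float("inf")
    for cpu_channels in range(total_channels + 1):
        gpu_channels = total_channels - cpu_channels
        # A device with no work assigned contributes no time.
        cpu_ms = predict_cpu_ms(cpu_channels) if cpu_channels else 0.0
        gpu_ms = predict_gpu_ms(gpu_channels) if gpu_channels else 0.0
        latency = max(cpu_ms, gpu_ms)
        if cpu_channels and gpu_channels:
            latency += sync_overhead_ms
        if latency < best_latency:
            best_split, best_latency = cpu_channels, latency
    return best_split, best_latency


# Hypothetical cost models (illustrative numbers only, not measurements).
def toy_cpu_ms(channels: int) -> float:
    return 0.02 * channels  # CPU: cost grows linearly with channels


def toy_gpu_ms(channels: int) -> float:
    return 0.1 + 0.01 * channels  # GPU: fixed dispatch time + linear cost


if __name__ == "__main__":
    split, latency = pick_split(300, toy_cpu_ms, toy_gpu_ms, sync_overhead_ms=0.05)
    gpu_only = toy_gpu_ms(300)
    print(f"CPU channels: {split}, predicted latency: {latency:.2f} ms")
    print(f"Predicted speedup over GPU-only: {gpu_only / latency:.2f}x")
```

In this toy setting the fixed dispatch term and the synchronization cost both shift the best split away from a naive proportional division, which is why the summary stresses accurate prediction and cheap synchronization. A real selector would replace the toy predictors with trained models and would likely search a coarser set of candidate splits.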

Sources:

https://arxiv.org/abs/2510.21081