Completed Hyperparameter Transfer across Modules, Width, Depth, Batch and Duration

Authors: Bruno Mlodozeniec, Pierre Ablin, Louis Béthune, Dan Busbridge, Michal Klein, Jason Ramapuram, Marco Cuturi

Hyperparameter tuning can dramatically impact the training stability and final performance of large-scale models. Recent works on neural network parameterisations, such as μP, have enabled transfer of optimal global hyperparameters across model sizes. These works propose an empirical practice of searching for optimal global base hyperparameters at a small model size and transferring them to a large size. We extend these works in two key ways. First, to handle scaling along the most important scaling axes, we propose the Complete(d) Parameterisation, which unifies scaling in width and depth (using an adaptation of CompleteP) as well as in batch size and training duration. Secondly, with our parameterisation, we investigate per-module hyperparameter optimisation and transfer.

Article Summaries:

The paper extends recent work on hyperparameter transfer for large-scale models by introducing the Complete(d) Parameterisation, which unifies scaling across width, depth, batch size, and training duration. Building on the μP framework and an adaptation of CompleteP, the authors explore per-module hyperparameter optimisation (learning rates, AdamW settings, weight decay, initialisation scales, and residual block multipliers) across a high-dimensional search space. They identify practical challenges and offer guidelines for navigating this landscape. Experiments on modern large language models demonstrate that, with the new parameterisation, transferred per-module hyperparameters yield significant training speed gains while maintaining stability.
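
The search-small, transfer-large practice described above can be illustrated with a minimal Python sketch. It is not the paper's Complete(d) Parameterisation: it assumes only the widely used μP-style width rule for Adam-type optimisers, in which the learning rate of hidden weight matrices is scaled by base_width / width while embedding-like parameters keep their base value. The module names, the base values, and the helper `transfer_width` are hypothetical, and weight decay is left unchanged purely for simplicity. Scaling in depth, batch size, and training duration is not covered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModuleHParams:
    """Hyperparameters for one module type (illustrative fields only)."""
    lr: float
    weight_decay: float


def transfer_width(base, base_width, width):
    """Transfer per-module base hyperparameters from base_width to width.

    Assumption: muP-style rule for Adam-type optimisers, where hidden
    weight matrices use lr * base_width / width and other parameters keep
    their base learning rate. Weight decay is copied unchanged, which is a
    simplification and not a rule taken from the paper.
    """
    scale = base_width / width
    transferred = {}
    for name, (hparams, is_hidden_matrix) in base.items():
        lr = hparams.lr * scale if is_hidden_matrix else hparams.lr
        transferred[name] = ModuleHParams(lr=lr, weight_decay=hparams.weight_decay)
    return transferred


# Hypothetical per-module values, as if found by a search at a small width.
BASE_WIDTH = 256
base_hparams = {
    # name: (base hyperparameters, whether the module is a hidden weight matrix)
    "embedding": (ModuleHParams(lr=1e-2, weight_decay=0.0), False),
    "attention": (ModuleHParams(lr=3e-3, weight_decay=0.1), True),
    "mlp": (ModuleHParams(lr=2e-3, weight_decay=0.1), True),
}

if __name__ == "__main__":
    # Reuse the small-model values at four times the width.
    for name, hparams in transfer_width(base_hparams, BASE_WIDTH, 1024).items():
        print(name, hparams)
```

Keeping a separate entry per module type is what makes per-module transfer possible in this sketch: each module carries its own base values, and the same scaling rule is applied to each entry independently.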

Sources:

https://machinelearning.apple.com/research/completed-hyperparameter