Abstract: Deep learning models are usually trained with stochastic gradient descent-based algorithms, but these optimizers face inherent limitations, such as slow convergence and stringent assumptions for convergence. In particular, data heterogeneity arising from distributed settings poses significant challenges to their theoretical and numerical performance. Here we develop an algorithm called PISA (preconditioned inexact stochastic alternating direction method of multipliers). Grounded in rigorous theoretical guarantees, the algorithm converges under the sole assumption of Lipschitz continuity of the gradient on a bounded region, thereby removing the need for other conditions commonly imposed by stochastic methods. This capability enables the proposed algorithm to tackle the challenge of data heterogeneity effectively. Moreover, the algorithmic architecture enables scalable parallel computing and supports various preconditioners, such as second-order information, second moment and orthogonalized momentum by Newton-Schulz iterations.
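
For reference, the Lipschitz-gradient condition named in the abstract is conventionally written as below. The symbols f (the training objective), L (the Lipschitz constant) and Ω (the bounded region) are notation assumed here for illustration; the source does not give its own notation.

```latex
\[
\|\nabla f(x) - \nabla f(y)\| \;\le\; L\,\|x - y\|
\qquad \text{for all } x,\, y \in \Omega,
\]
% Assumed notation: \Omega is a bounded subset of the parameter space, L > 0.
```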

Article Summaries:

  • A new optimization framework, PISA (preconditioned inexact stochastic ADMM), has been introduced for training deep learning models. Unlike conventional stochastic gradient descent (SGD) methods, PISA guarantees convergence assuming only that the gradient is Lipschitz continuous on a bounded region, eliminating many restrictive conditions that hinder existing optimizers. The algorithm supports flexible preconditioners (second-order information, second-moment statistics, and momentum orthogonalized via Newton-Schulz iterations) and is structured for scalable parallel training. Two lightweight variants, SISA and NSISA, were evaluated on vision, language, reinforcement learning, generative adversarial, and recurrent models, consistently outperforming state-of-the-art optimizers in numerical experiments.
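
To make "momentum orthogonalized via Newton-Schulz iterations" concrete, here is a minimal sketch of the classic cubic Newton-Schulz iteration, which pushes a matrix toward its nearest orthogonal factor using only matrix products. The text above does not say which Newton-Schulz variant, coefficients, or step count PISA uses, so the cubic update, the Frobenius-norm scaling, the 20-step default, and the names `newton_schulz_orthogonalize` and `momentum` are all assumptions made for illustration.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def newton_schulz_orthogonalize(m, steps=20):
    """Approximate the orthogonal polar factor of a nonzero, full-rank matrix.

    Uses the classic cubic update X <- 1.5*X - 0.5*X*X^T*X (an assumed variant,
    not necessarily the one used by the algorithm described above).
    """
    # Scale by the Frobenius norm so every singular value lies in (0, 1],
    # which is inside the iteration's convergence region (0, sqrt(3)).
    norm = math.sqrt(sum(v * v for row in m for v in row))
    x = [[v / norm for v in row] for row in m]
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [
            [1.5 * xv - 0.5 * cv for xv, cv in zip(x_row, c_row)]
            for x_row, c_row in zip(x, xxtx)
        ]
    return x


# Hypothetical 2x2 "momentum" matrix, chosen only to exercise the routine.
momentum = [[2.0, 1.0], [0.0, 1.0]]
q = newton_schulz_orthogonalize(momentum)
gram = matmul(q, transpose(q))  # close to the identity if q is orthogonal
```

Each step drives every singular value of the scaled matrix toward 1 while leaving the singular vectors unchanged, so the result keeps the direction structure of the input but discards its scale. Pure-Python lists are used only to keep the sketch dependency-free; a real optimizer would use a tensor library.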
