Fine-tuning vision-language models on memory-constrained devices

A new hybrid optimization approach allows edge devices to fine-tune vision-language models using only forward passes, achieving up to 7% higher accuracy than existing techniques.

Fine-tuned vision-language models (VLMs) have shown remarkable performance across many computer vision tasks. However, backpropagation (the standard method for adjusting model weights during fine-tuning, which works backward from the output error) is computationally expensive and thus impractical on resource-constrained edge devices. An alternative is to use fine-tuning strategies that rely solely on forward passes, significantly lowering the computational requirements. Zeroth-order (ZO) estimation is one such method, but existing ZO-based VLM fine-tuning methods remain substantially inferior to backpropagation-based training in terms of accuracy and convergence. One major challenge is ZO's high variance, which can make estimated gradients (the directions of weight adjustment resulting from a batch of training data) inconsistent and noisy.
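To make the variance problem concrete, here is a minimal sketch of a generic randomized two-point ZO gradient estimator, written in plain Python. It is not the estimator used by any particular method discussed in this article; the function names (`zo_gradient`, `quadratic_loss`, `mean_error`), the smoothing step `mu`, and the toy quadratic loss are all assumptions chosen for illustration. The estimator only calls the loss function (forward passes) and never computes an analytic gradient.

```python
import random


def zo_gradient(loss_fn, w, mu=1e-3, num_samples=1, rng=random):
    """Randomized two-point zeroth-order gradient estimate (illustrative).

    For each random Gaussian direction u, the central difference
    (f(w + mu*u) - f(w - mu*u)) / (2*mu) approximates the directional
    derivative along u. Scaling u by that slope and averaging over
    num_samples directions gives a noisy estimate of the gradient.
    """
    dim = len(w)
    grad = [0.0] * dim
    for _ in range(num_samples):
        u = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        w_plus = [wi + mu * ui for wi, ui in zip(w, u)]
        w_minus = [wi - mu * ui for wi, ui in zip(w, u)]
        # Two forward evaluations per direction; no backward pass.
        slope = (loss_fn(w_plus) - loss_fn(w_minus)) / (2.0 * mu)
        for i in range(dim):
            grad[i] += slope * u[i] / num_samples
    return grad


def quadratic_loss(w):
    """Toy loss (an assumption for this demo); its true gradient is w."""
    return 0.5 * sum(wi * wi for wi in w)


def mean_error(num_samples, trials=200, dim=50, seed=0):
    """Average distance between the ZO estimate and the true gradient."""
    rng = random.Random(seed)
    w = [1.0] * dim
    total = 0.0
    for _ in range(trials):
        g = zo_gradient(quadratic_loss, w, num_samples=num_samples, rng=rng)
        total += sum((gi - wi) ** 2 for gi, wi in zip(g, w)) ** 0.5
    return total / trials


err_1 = mean_error(1)
err_32 = mean_error(32)
print(f"mean error, 1 direction:   {err_1:.2f}")
print(f"mean error, 32 directions: {err_32:.2f}")
```

In this toy setting, averaging over more random directions lowers the estimation error, but each extra direction costs two more forward passes. That trade-off is one reason variance reduction matters for forward-only fine-tuning.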

Article Summaries:

  • At NeurIPS 2025, researchers presented SharpZO, a hybrid optimization method that fine-tunes vision-language models using only forward passes, making it suitable for edge devices with limited memory. SharpZO combines two stages: a global exploration stage, which employs a sharpness-aware covariance-matrix adaptation evolution strategy (CMA-ES) to smooth the loss landscape, and a local-search stage, which applies a modified zeroth-order (ZO) estimator to suppress noisy gradient estimates. Experiments show that SharpZO improves accuracy by up to 7% over existing forward-only methods such as ZIP and BlackVIP, and on several tasks its performance approaches that of first-order, backpropagation-based techniques like CoOp.

Sources: