Editor’s note: We are republishing a blog post from the Mobius team, originally published in 2023, that introduced a now widely used quantization algorithm. We plan to continue this line of work by collaborating with the open source community on inference optimization and will be sharing more updates soon.

~ ~ ~

Large Language Models (LLMs) have revolutionized various subfields of machine learning like natural language processing, speech recognition and computer vision, enabling machines to understand and generate outputs with unprecedented accuracy and fluency. However, one of the most critical challenges in deploying LLMs is their expensive memory requirements, for both training and inference. Quantization methods such as bitsandbytes, GPTQ and AWQ have made it possible to use large models such as the popular Llama-2 with significantly less memory, enabling the machine learning community to conduct remarkable research using a single consumer-grade GPU.

In this article, we propose a new quantization technique called Half-Quadratic Quantization (HQQ).

Article Summaries:

  • The Mobius team has released a new weight-only quantization method called Half-Quadratic Quantization (HQQ). HQQ requires no calibration data, yet it achieves compression quality comparable to calibration-based techniques such as GPTQ and AWQ. In tests, HQQ quantizes the 70-billion-parameter Llama-2 model to 2-bit in under five minutes, more than 50 times faster than GPTQ, and the resulting model outperforms the full-precision Llama-2-13B at similar memory usage. The team plans to collaborate with the open-source community on further inference optimizations and will share additional updates soon.
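
To make the "similar memory usage" comparison concrete, here is a rough back-of-the-envelope sketch of weight-only memory footprints. The helper `weight_memory_gb`, the group size of 64, and the 32 bits of per-group metadata (for example a 16-bit scale plus a 16-bit zero-point) are illustrative assumptions, not HQQ's actual settings, and the estimate ignores activations, the KV cache, and any layers left unquantized.

```python
def weight_memory_gb(n_params, bits, group_size=None, meta_bits_per_group=0):
    """Estimate weight storage in GB (1 GB = 1e9 bytes).

    n_params: number of weights in the model.
    bits: bits used to store each weight.
    group_size / meta_bits_per_group: hypothetical group-wise quantization
        overhead (e.g. one scale and one zero-point stored per group).
    """
    bits_per_weight = float(bits)
    if group_size:
        # Spread the per-group metadata cost evenly over the group's weights.
        bits_per_weight += meta_bits_per_group / group_size
    return n_params * bits_per_weight / 8 / 1e9


# Full-precision (16-bit) 13B model versus a 2-bit 70B model.
fp16_13b = weight_memory_gb(13e9, 16)
# Assumed settings: groups of 64 weights, 32 metadata bits per group.
q2_70b = weight_memory_gb(70e9, 2, group_size=64, meta_bits_per_group=32)

print(f"13B @ 16-bit: {fp16_13b:.1f} GB")
print(f"70B @ 2-bit (+ assumed metadata): {q2_70b:.1f} GB")
```

Under these assumptions both models land in the same range of roughly 20 to 26 GB, which is why a 2-bit 70B model and a 16-bit 13B model are a natural pairing for comparison.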
