Differential Transformer V2

• DIFF V2 doubles the number of query heads and keeps the number of KV heads constant for efficient attention.
• Uses differential attention: subtracts paired heads within the same GQA group.
• Applies a sigmoid-weighted lam (λ) to the subtracted head to balance the two heads' contributions and keep gradients stable.
• Eliminates custom kernels by running on flash_attn, which improves speed and reduces implementation complexity.
• Maintains the output projection size, so it integrates with baseline Transformers and preserves model capacity.
• Shows faster decoding than DIFF V1 with comparable performance.
• Open-source implementation available on GitHub (microsoft/unilm).
• Adds minimal overhead when applied to large language models, which makes it practical for production use.
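The points above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the reference implementation: the helper names (`gqa_attention`, `diff_attn_v2`), the interleaved pairing of heads (even index minus odd index), and the `(N, h, 1)` shape of `lam` are assumptions made for the example, and a plain causal softmax attention stands in for the flash_attn kernel.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gqa_attention(q, k, v):
    """Causal grouped-query attention (stand-in for a fused kernel).

    q: (N, Hq, d); k, v: (N, Hkv, d); returns (N, Hq, d).
    """
    n, hq, d = q.shape
    group = hq // k.shape[1]
    # Repeat each KV head so that consecutive query heads share it.
    k = np.repeat(k, group, axis=1)
    v = np.repeat(v, group, axis=1)
    scores = np.einsum("qhd,khd->hqk", q, k) / np.sqrt(d)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    probs = softmax(scores, axis=-1)
    return np.einsum("hqk,khd->qhd", probs, v)

def diff_attn_v2(q, k, v, lam):
    """Differential attention with doubled query heads (assumed layout).

    q: (N, 2h, d); k, v: (N, h_kv, d); lam: (N, h, 1); returns (N, h, d).
    """
    attn = gqa_attention(q, k, v)
    # Assumed pairing: heads 2j and 2j+1 sit in the same GQA group.
    attn1, attn2 = attn[:, 0::2], attn[:, 1::2]
    return attn1 - sigmoid(lam) * attn2

rng = np.random.default_rng(0)
N, h, h_kv, d = 5, 4, 2, 8
q = rng.standard_normal((N, 2 * h, d))
k = rng.standard_normal((N, h_kv, d))
v = rng.standard_normal((N, h_kv, d))
lam = rng.standard_normal((N, h, 1))
out = diff_attn_v2(q, k, v, lam)
print(out.shape)  # (5, 4, 8)
```

The output has h heads rather than 2h, so whatever consumes it sees the same shape as in a baseline Transformer with h heads.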

Article Summaries:

Differential Transformer V2 (DIFF V2) is a lightweight attention variant that doubles the number of query heads while keeping the key-value heads unchanged. Because the two heads in each subtracted pair belong to the same GQA group, the attention itself needs no custom kernel, and decoding runs at the same speed as a standard Transformer, even on memory-bound LLM workloads. The design keeps the head dimension aligned with the baseline: the differential operation reduces the doubled heads back to the baseline size, so the output projection needs no additional parameters or FLOPs. During pre-training on H-series and B-series GPUs the throughput loss is negligible, and when combined with linear-time prefilling techniques (e.g., YOCO), DIFF V2 offers efficient long-sequence processing.
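A rough accounting shows why the doubled query heads leave the KV cache and the output projection at baseline size. The function `attn_shapes`, the example dimensions, and the per-head `W_lam` projection are hypothetical, chosen only to illustrate the argument; biases and other layers are ignored.

```python
def attn_shapes(d_model, h, h_kv, d, diff_v2):
    """Count attention parameters and per-token KV cache entries for one layer.

    Assumes plain linear projections without bias; W_lam (one scalar per
    head per token) is an assumption about how lam is produced.
    """
    hq = 2 * h if diff_v2 else h  # only the query heads are doubled
    params = {
        "W_q": d_model * hq * d,
        "W_k": d_model * h_kv * d,
        "W_v": d_model * h_kv * d,
        "W_o": h * d * d_model,  # input is h*d after the subtraction
        "W_lam": d_model * h if diff_v2 else 0,
    }
    kv_cache_per_token = 2 * h_kv * d  # keys plus values
    return params, kv_cache_per_token

base_params, base_kv = attn_shapes(4096, 32, 8, 128, diff_v2=False)
diff_params, diff_kv = attn_shapes(4096, 32, 8, 128, diff_v2=True)

print(base_kv, diff_kv)                          # 2048 2048
print(base_params["W_o"], diff_params["W_o"])    # 16777216 16777216
print(base_params["W_q"], diff_params["W_q"])    # 16777216 33554432
```

Under these assumptions only the query projection (plus the small lam projection) grows. Memory-bound decoding is dominated by reading the KV cache, which is unchanged.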

Sources:

https://huggingface.co/blog/microsoft/diff-attn-v2