Differential Transformer V2

• DiffTransformer V2 doubles the number of query heads while keeping the number of KV heads constant, for efficient attention.
• It uses differential attention: the outputs of paired heads within the same GQA group are subtracted.
• Applies si
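
As a rough illustration of the head doubling and paired subtraction described above, here is a minimal NumPy sketch. It is not the authors' implementation. The function name `diff_attention`, the tensor layout, the choice to pair adjacent query heads (2i, 2i+1), and the fixed scalar `lam` that scales the subtracted head are all assumptions made for this example; masking, projections, and normalization are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def diff_attention(q, k, v, lam=0.5):
    """Hypothetical sketch of differential attention over a GQA layout.

    q:    (2*h, n, d)  doubled query heads
    k, v: (h_kv, n, d) shared key/value heads
    lam:  assumed fixed scalar weight on the subtracted head
    Returns an array of shape (h, n, d).
    """
    two_h, n, d = q.shape
    h_kv = k.shape[0]
    # Each KV head must serve an even number of query heads so that
    # every pair (2i, 2i+1) falls inside one GQA group.
    assert two_h % (2 * h_kv) == 0
    group = two_h // h_kv

    # Broadcast each KV head to the query heads in its group.
    k_rep = np.repeat(k, group, axis=0)  # (2h, n, d)
    v_rep = np.repeat(v, group, axis=0)  # (2h, n, d)

    # Ordinary scaled dot-product attention for every query head.
    scores = q @ k_rep.transpose(0, 2, 1) / np.sqrt(d)  # (2h, n, n)
    out = softmax(scores) @ v_rep                       # (2h, n, d)

    # Differential step: subtract the odd head from its even partner.
    return out[0::2] - lam * out[1::2]

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 5, 4))  # 8 query heads = 2 * 4
k = rng.normal(size=(2, 5, 4))  # 2 KV heads, so 4 query heads per group
v = rng.normal(size=(2, 5, 4))
out = diff_attention(q, k, v)
print(out.shape)
```

Because both heads in a pair read the same keys and values, only the query side is doubled; the KV tensors keep their original head count in this sketch.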