Differential Transformer V2
• DiffTransformer V2 doubles query heads, keeps KV heads constant for efficient attention. • Uses differential attention: subtracts paired heads within same GQA group. • Applies si
• DiffTransformer V2 doubles query heads, keeps KV heads constant for efficient attention. • Uses differential attention: subtracts paired heads within same GQA group. • Applies si