Computer Science > Distributed, Parallel, and Cluster Computing
[Submitted on 27 Jan 2026 (v1), last revised 24 Feb 2026 (this version, v2)]
Title: Mapping Gemma3 onto an Edge Dataflow Architecture
Abstract:
• We present the first end-to-end deployment of the Gemma3 family of large language and vision models on a tiled edge dataflow architecture (AMD Ryzen AI NPU).
• Our work introduces a set of hardware-aware techniques.
• For prefill, we introduce an efficient dequantization engine, optimize tiled matrix multiplication kernels, and propose FlowQKV, a chunked, pipelined attention mechanism.
• For decoding, we introduce FusedDQP, which fuses dequantization and projection into a single kernel, and FlowKV, which restructures attention to sustain high memory bandwidth utilization.
• Together with a compact Q4NX 4-bit quantization format, these methods yield up to $5.2\times$ faster prefill and $4.8\times$ faster decoding versus the iGPU, and $33.5\times$ and $2.2\times$ over the CPU, respectively.
• Power efficiency improves by as much as $67.2\times$ and $222.9\times$ compared to the iGPU and CPU, respectively.
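The abstract names a 4-bit format (Q4NX) and a fused dequantize-and-project kernel (FusedDQP) but does not describe either in detail. The sketch below is only a generic illustration of the underlying idea, not the paper's method: it uses plain symmetric 4-bit quantization with one scale per group and computes a dot product directly from the packed weights, applying each scale once per group instead of first materializing a full floating-point weight row. The function names, the group size of 32, and the nibble packing order are all assumptions made for this example.

```python
from typing import List, Tuple

# Hypothetical group size; the abstract does not state the Q4NX layout.
GROUP = 32


def quantize_row(w: List[float], group: int = GROUP) -> Tuple[bytes, List[float]]:
    """Symmetric 4-bit quantization of one weight row, one scale per group.

    Each value is stored as an unsigned nibble (signed value + 8),
    two nibbles per byte, low nibble first.
    """
    assert len(w) % group == 0 and group % 2 == 0
    packed = bytearray()
    scales: List[float] = []
    for start in range(0, len(w), group):
        block = w[start:start + group]
        amax = max(abs(v) for v in block)
        scale = amax / 7.0 if amax > 0 else 1.0
        scales.append(scale)
        q = [max(-8, min(7, round(v / scale))) + 8 for v in block]
        for i in range(0, group, 2):
            packed.append(q[i] | (q[i + 1] << 4))
    return bytes(packed), scales


def dequantize_row(packed: bytes, scales: List[float], group: int = GROUP) -> List[float]:
    """Reference path: expand the packed row back to floats."""
    out: List[float] = []
    for i, byte in enumerate(packed):
        scale = scales[(2 * i) // group]
        out.append(((byte & 0x0F) - 8) * scale)
        out.append(((byte >> 4) - 8) * scale)
    return out


def fused_dequant_dot(packed: bytes, scales: List[float], x: List[float],
                      group: int = GROUP) -> float:
    """Dot product of a packed 4-bit row with x, dequantizing on the fly.

    The integer weights are accumulated against x within each group and the
    group scale is applied once, so no float copy of the row is built.
    """
    bytes_per_group = group // 2
    total = 0.0
    for gi, scale in enumerate(scales):
        acc = 0.0
        base = gi * bytes_per_group
        for j in range(bytes_per_group):
            byte = packed[base + j]
            k = gi * group + 2 * j
            acc += ((byte & 0x0F) - 8) * x[k]
            acc += ((byte >> 4) - 8) * x[k + 1]
        total += scale * acc
    return total
```

Calling `quantize_row` on each row of a projection matrix and `fused_dequant_dot` per row against the activation vector gives a matrix-vector product of the kind a decode step needs. On real hardware the benefit would come from reading only the packed bytes and scales from memory, which this pure-Python version cannot show.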
Sources:
- https://arxiv.org/abs/2602.06063 (Latest source article published: 2026-02-25 05:00 UTC)