Hi everyone, I’m a junior compiler engineer who recently started working on a backend for a custom NPU. I’m looking for architectural advice on how to split responsibilities between TVM (frontend) and LLVM (backend).

The Context: Our stack uses TVM as the frontend and LLVM as the backend. The flow is roughly: TVM (Relay/TIR) -> LLVM IR -> LLVM Backend Optimization -> Machine Binary. Currently, I am trying to implement a lowering pass for convolution operations that respects our NPU’s specific constraints.

The Problem: Our NPU has a Scratch Pad Memory (SPM) of limited size, so input features often won’t fit entirely in the SPM.
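
To make the SPM constraint concrete, here is a rough footprint check. It is only a sketch: the SPM capacity, the int8 element size, the CHW layout, and the helper names are all assumptions for illustration, since the post does not state them. It also ignores the space that weights and output buffers would need.

```python
# Hypothetical numbers; the post does not give the real SPM size or data type.
SPM_BYTES = 256 * 1024   # assumed scratch pad capacity
BYTES_PER_ELEM = 1       # assumed int8 activations


def feature_map_bytes(channels, height, width, bytes_per_elem=BYTES_PER_ELEM):
    """Size in bytes of one dense CHW input feature map."""
    return channels * height * width * bytes_per_elem


def fits_in_spm(channels, height, width, spm_bytes=SPM_BYTES):
    """True if the whole input feature map fits in the SPM at once."""
    return feature_map_bytes(channels, height, width) <= spm_bytes


def max_tile_rows(channels, width, kernel_h,
                  spm_bytes=SPM_BYTES, bytes_per_elem=BYTES_PER_ELEM):
    """Largest count of full-width input rows (all channels) that fits in SPM.

    Returns 0 when fewer than kernel_h rows fit, because such a tile
    could not produce a single output row.
    """
    rows = spm_bytes // (channels * width * bytes_per_elem)
    return rows if rows >= kernel_h else 0


if __name__ == "__main__":
    print(feature_map_bytes(64, 112, 112))  # bytes needed for a 64x112x112 map
    print(fits_in_spm(64, 112, 112))        # whether it fits in the assumed SPM
    print(max_tile_rows(64, 112, 3))        # input rows per tile for a 3x3 kernel
```

Under these assumed sizes, a 64x112x112 int8 feature map does not fit, which is the situation that forces tiling.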

Article Summaries:

  • A junior compiler engineer is developing a backend for a custom NPU that uses TVM as the frontend and LLVM as the backend. The NPU’s limited Scratch Pad Memory (SPM) forces data to be tiled and moved from DRAM, but a naive C‑to‑LLVM approach produced overly complex nested loops and poor vectorization. The engineer proposes that TVM handle all tiling and data movement, emitting a custom intrinsic (e.g., llvm.npu.conv2d_tile) that assumes data resides in the SPM. LLVM would then lower this intrinsic to the NPU instruction. The question is whether this is the recommended architecture, how much logic should reside in the intrinsic, and whether LLVM should manage the memory hierarchy.
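
The proposed split can be sketched in plain Python as a functional model. This is not TVM or LLVM code. `npu_conv2d_tile` is a hypothetical stand-in for the `llvm.npu.conv2d_tile` intrinsic mentioned in the post, and `conv2d_tiled` models the tiling and data movement that the post suggests TVM should own. The single-channel, stride-1, no-padding convolution and the row-wise tiling are simplifying assumptions.

```python
def npu_conv2d_tile(spm_in, spm_w, out_rows, out_cols):
    """Stand-in for the hypothetical tile intrinsic.

    Assumes both operands already sit in the SPM; it knows nothing
    about DRAM, tiling, or halos. Single channel, stride 1, no padding.
    """
    kh, kw = len(spm_w), len(spm_w[0])
    return [
        [
            sum(spm_in[r + i][c + j] * spm_w[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_cols)
        ]
        for r in range(out_rows)
    ]


def conv2d_tiled(dram_in, weights, tile_out_rows):
    """Outer loop: the part the post proposes to keep on the TVM side.

    Splits the output into row tiles, copies each input tile plus its
    halo of (kernel_h - 1) rows into a modeled SPM buffer, and calls
    the tile kernel once per tile.
    """
    kh, kw = len(weights), len(weights[0])
    h, w = len(dram_in), len(dram_in[0])
    out_h, out_w = h - kh + 1, w - kw + 1

    dram_out = []
    for r0 in range(0, out_h, tile_out_rows):
        rows = min(tile_out_rows, out_h - r0)  # last tile may be shorter
        # Modeled DRAM -> SPM copy: the tile's rows plus the halo rows.
        spm_in = [row[:] for row in dram_in[r0:r0 + rows + kh - 1]]
        dram_out.extend(npu_conv2d_tile(spm_in, weights, rows, out_w))
    return dram_out


if __name__ == "__main__":
    image = [[r * 5 + c for c in range(5)] for r in range(6)]
    kernel = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
    # Tiled and untiled runs should agree regardless of the tile height.
    assert conv2d_tiled(image, kernel, 1) == conv2d_tiled(image, kernel, 4)
```

In this model the tile kernel stays small and fixed-shape, while every decision about tile size and halo handling lives in the outer loop, which mirrors the division of labor the post asks about.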

Sources: