[RFC] TensaLang: A tensor-first language for LLM inference, lowering through MLIR to CPU/CUDA

Hello, I’ve been working on a project called TensaLang and it’s finally at a point worth sharing. It’s a small language + compiler + runtime for writing LLM forward passes directly in source code, lowering through MLIR to CPU (LLVM JIT) or CUDA (NVVM).

GitHub: BenChaliah/Tensa-Lang. TensaLang is a tensor-first programming language, compiler, and runtime that lets you write a model’s inference engine (e.g. LLMs) and sampling in a high-level language, then compile it through MLIR to multiple targets (e.g. CPU, CUDA, ROCm).
Website/Docs: https://tensa-lang.org
Example weights: DatarusAI/Tensa-Lang on Hugging Face

Motivation

Many inference runtimes couple model logic tightly to backend-specific kernels. This creates friction on two fronts:

- Targeting new hardware means building a new runtime or forking an existing one, because kernel logic, memory management, and scheduling are entangled with backend assumptions.

Article Summaries:

TensaLang: a tensor-first language for LLM inference

Ben Chaliah has released TensaLang, a lightweight language, compiler, and runtime that lets developers write large-language-model (LLM) forward passes directly in source code. The .tl language exposes tensors, loops, and reductions, while MLIR handles target-specific lowering to CPU (LLVM JIT) or CUDA (NVVM). The stack includes pattern-matched cuBLAS dispatch, fused attention modes, an arena allocator, and safetensors support. TensaLang is still in beta but has been tested on Llama-2 7B and Qwen2.5-Coder-0.5B across CPU and GPU backends. The project aims to separate algorithmic logic from backend details, offering a readable, inspectable end-to-end inference pipeline.
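For readers unfamiliar with safetensors, the weight format named in the summary, here is a minimal Python sketch of its on-disk layout: an 8-byte little-endian header length, a JSON header giving each tensor's dtype, shape, and byte offsets, then the raw tensor bytes. This is not TensaLang's loader; the function names `build_safetensors`, `read_safetensors_header`, and `tensor_bytes` are hypothetical and exist only for illustration.

```python
import json
import struct


def build_safetensors(tensors):
    """Serialize {name: (dtype, shape, raw_bytes)} into the safetensors layout.

    Hypothetical helper for illustration; real files are normally written
    with the safetensors library itself.
    """
    header = {}
    payload = b""
    for name, (dtype, shape, raw) in tensors.items():
        begin = len(payload)
        payload += raw
        # Offsets are relative to the start of the data section, not the file.
        header[name] = {
            "dtype": dtype,
            "shape": shape,
            "data_offsets": [begin, len(payload)],
        }
    header_bytes = json.dumps(header, separators=(",", ":")).encode("utf-8")
    # File = u64 little-endian header length + JSON header + tensor bytes.
    return struct.pack("<Q", len(header_bytes)) + header_bytes + payload


def read_safetensors_header(blob):
    """Return (header_dict, data_start) for a safetensors byte string."""
    (header_len,) = struct.unpack_from("<Q", blob, 0)
    header = json.loads(blob[8:8 + header_len].decode("utf-8"))
    return header, 8 + header_len


def tensor_bytes(blob, header, data_start, name):
    """Slice out the raw bytes of one tensor using its data_offsets."""
    begin, end = header[name]["data_offsets"]
    return blob[data_start + begin:data_start + end]


# Round-trip a two-element float32 tensor through the layout.
blob = build_safetensors({"w": ("F32", [2], struct.pack("<2f", 1.0, 2.0))})
header, data_start = read_safetensors_header(blob)
values = struct.unpack("<2f", tensor_bytes(blob, header, data_start, "w"))
```

Because the header records offsets up front, a runtime can map the file and hand out tensor views without parsing or copying the weights, which is presumably part of why the format suits inference stacks like this one.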

Sources:

https://discourse.llvm.org/t/rfc-tensalang-a-tensor-first-language-for-llm-inference-lowering-through-mlir-to-cpu-cuda/89892