• Custom Kernels for All from Codex and Claude tl;dr: We built an agent skill that teaches coding agents how to write production CUDA kernels. • Then we pointed Claude and Codex at two real targets: a diffusers pipeline and a transformers model. • The agents produced working kernels for both, with correct PyTorch bindings and benchmarks, end to end. • Writing CUDA kernels is hard. • Writing CUDA kernels that correctly integrate with transformers and diffusers is harder. • There are architecture-specific memory access patterns, vectorization strategies, warp shuffle reductions, and a dozen integration pitfalls that trip up even experienced developers.

Article Summaries:

  • Hugging Face released a new “cuda‑kernels” skill that lets large‑language‑model agents-such as Claude and Codex-write production‑ready CUDA kernels for deep‑learning workloads. The skill packages GPU‑specific optimization knowledge (e.g., H100 vs. A100 memory layouts) and library integration details for Diffusers and Transformers. Users install the skill with a single command and then prompt the agent to generate kernels for tasks like RMSNorm or attention on a target model (e.g., Qwen3‑8B). The agents output fully‑compiled kernels with PyTorch bindings, benchmark results, and end‑to‑end performance gains, demonstrating automated, high‑quality kernel development.

Sources: