CodeScaler: Scaling Code LLM Training and Test-Time Inference via Execution-Free Reward Models

Computer Science > Machine Learning. Submitted on 4 Feb 2026.

Abstract: Reinforcement Learning from Verifiable Rewards (RLVR) has driven recent progress in code large language models by leveraging execution-based feedback from unit tests, but its scalability is fundamentally constrained by the availability and reliability of high-quality test cases. We propose CodeScaler, an execution-free reward model designed to scale both reinforcement learning training and test-time inference for code generation. CodeScaler is trained on carefully curated preference data derived from verified code problems and incorporates syntax-aware code extraction and validity-preserving reward shaping to ensure stable and robust optimization. Across five coding benchmarks, CodeScaler improves Qwen3-8B-Base by an average of +11.72 points, outperforming binary execution-based RL by +1.82 points, and enables scalable reinforcement learning on synthetic datasets without any test cases. At inference time, CodeScaler serves as an effective test-time scaling method, achieving performance comparable to unit test approaches while providing a 10-fold reduction in latency. Moreover, CodeScaler surpasses existing reward models on RM-Bench not only in the code domain (+3.3 points), but also in general and reasoning domains (+2.7 points on average).
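The abstract names syntax-aware code extraction and validity-preserving reward shaping without giving details. The following is a minimal sketch of one plausible reading, not the authors' implementation: pull the last fenced block that parses as Python out of a model response, and fall back to a fixed penalty when no valid code is found. The names `extract_code`, `shaped_reward`, `score_fn`, and the `invalid_reward` value are all hypothetical.

```python
import ast
import re
from typing import Callable, Optional

# Build the fence marker programmatically so this example stays easy to embed.
TICKS = "`" * 3
FENCE = re.compile(TICKS + r"(?:python|py)?[ \t]*\n(.*?)" + TICKS, re.DOTALL)


def extract_code(response: str) -> Optional[str]:
    """Return the last fenced block that parses as Python, or None."""
    for block in reversed(FENCE.findall(response)):
        if not block.strip():
            continue  # skip empty fences
        try:
            ast.parse(block)
        except (SyntaxError, ValueError):
            continue  # not syntactically valid Python, try an earlier block
        return block
    return None


def shaped_reward(
    response: str,
    score_fn: Callable[[str], float],
    invalid_reward: float = -1.0,  # assumed penalty, not a value from the paper
) -> float:
    """Score valid code with score_fn; give a fixed penalty otherwise."""
    code = extract_code(response)
    if code is None:
        return invalid_reward
    return score_fn(code)
```

Here `score_fn` stands in for a learned reward model. Gating on syntactic validity keeps the policy from being rewarded for responses that contain no usable code, which is one way a shaping rule could preserve validity during optimization.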

Article Summaries:

Summary

Researchers introduced CodeScaler, an execution‑free reward model that scales reinforcement learning (RL) for code generation and serves as a test‑time scaling method at inference. Trained on curated preference data from verified code problems, CodeScaler incorporates syntax‑aware code extraction and validity‑preserving reward shaping to keep optimization stable. On five coding benchmarks, it improves Qwen3‑8B‑Base by an average of +11.72 points, surpassing binary execution‑based RL by +1.82 points, and enables RL on synthetic datasets without any test cases. At inference, it performs comparably to unit‑test approaches while cutting latency tenfold. CodeScaler also outperforms existing reward models on RM‑Bench, by +3.3 points in the code domain and by +2.7 points on average across the general and reasoning domains.

CodeScaler: Execution‑Free Reward Models for Code LLMs

Researchers have introduced CodeScaler, a reward model that removes the need for code execution during reinforcement learning and test‑time inference. Trained on preference data from verified coding problems, it uses syntax‑aware code extraction and validity‑preserving reward shaping to stabilize optimization. On five coding benchmarks, CodeScaler improves Qwen3‑8B‑Base by an average of 11.72 points, outperforming binary execution‑based RL by 1.82 points. It enables scalable reinforcement learning on synthetic datasets without test cases and, as a test‑time scaling method, performs comparably to unit‑test approaches at one‑tenth the latency. The model also surpasses existing reward models on RM‑Bench, by 3.3 points in the code domain and by 2.7 points on average across the general and reasoning domains.
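A common way to use a reward model for test‑time scaling is best‑of‑N selection: sample several candidate solutions, score each one, and return the highest‑scoring candidate without running any tests. The summary does not state which selection procedure CodeScaler uses, so the sketch below is an illustration under that assumption. `best_of_n` and `toy_score` are hypothetical names, and `toy_score` is a placeholder, not a real reward model.

```python
from typing import Callable, List, Sequence, Tuple


def best_of_n(
    candidates: Sequence[str],
    score_fn: Callable[[str], float],
) -> Tuple[str, List[float]]:
    """Score every candidate once and return the best one plus all scores."""
    if not candidates:
        raise ValueError("need at least one candidate")
    scores = [score_fn(c) for c in candidates]
    # Ties are resolved in favor of the earliest candidate.
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return candidates[best_index], scores


def toy_score(code: str) -> float:
    """Placeholder scorer: prefers code that contains a return statement."""
    return 1.0 if "return" in code else 0.0


if __name__ == "__main__":
    samples = ["def f(x): pass", "def f(x): return x"]
    best, scores = best_of_n(samples, toy_score)
    print(best, scores)
```

Because scoring needs only a forward pass of the reward model per candidate, this kind of selection avoids the sandboxing and execution cost of unit‑test‑based reranking, which is consistent with the latency reduction reported above.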

Sources:

https://arxiv.org/abs/2602.17684