• Code evolution uses LLMs to mutate code, yet lacks baseline comparisons.
• Authors test two simple baselines across math bounds, agentic scaffolds, and ML contests.
• Baselines match or outperform sophisticated pipelines in all three domains.
• Performance hinges on search space design and prompt domain knowledge, not the search itself.
• Agentic scaffolds suffer from high variance and small datasets; hand‑crafted majority vote wins.
• Authors propose low‑stochasticity evaluation methods and best‑practice guidelines for future code evolution work.
Article Summaries:
- A recent study shows that simple baseline methods can match or outperform sophisticated code-evolution pipelines that use large language models to mutate code. The authors evaluated two baselines across three domains (improving mathematical bounds, designing agentic scaffolds, and machine-learning competitions) and found the baselines competitive in all three. Their analysis attributes the results mainly to the design of the search space and the domain knowledge encoded in prompts, rather than to the evolutionary search itself. The paper also identifies high variance and small datasets as pitfalls when evaluating agentic scaffolds, and proposes evaluation methods with lower stochasticity. The authors conclude with best-practice recommendations for future code-evolution research.
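To see why small datasets make scaffold comparisons noisy, here is a minimal sketch of the standard binomial error on an accuracy score. It is not the authors' evaluation method; the function names and the dataset sizes are hypothetical, and it assumes each problem is an independent pass/fail trial.

```python
import math


def accuracy_standard_error(accuracy: float, n_problems: int) -> float:
    """Standard error of an accuracy estimated from n independent pass/fail problems."""
    if n_problems <= 0:
        raise ValueError("n_problems must be positive")
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must lie in [0, 1]")
    return math.sqrt(accuracy * (1.0 - accuracy) / n_problems)


def approx_95_interval(accuracy: float, n_problems: int) -> tuple[float, float]:
    """Normal-approximation 95% interval, clipped to [0, 1]."""
    half_width = 1.96 * accuracy_standard_error(accuracy, n_problems)
    return max(0.0, accuracy - half_width), min(1.0, accuracy + half_width)


if __name__ == "__main__":
    # Hypothetical benchmark sizes, chosen only for illustration.
    for n in (30, 100, 1000):
        low, high = approx_95_interval(0.5, n)
        print(f"n={n}: 95% interval for an observed accuracy of 0.5 is [{low:.3f}, {high:.3f}]")
```

Under these assumptions the interval narrows only with the square root of the dataset size, so a gap of a few points between two scaffolds on a few dozen problems can fall within noise.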
Sources: