• SWE-bench Verified is increasingly contaminated and mismeasures frontier coding progress. • Our analysis shows flawed tests and training leakage. • We recommend SWE-bench Pro.

Article Summaries:

  • OpenAI has stopped using the SWE‑bench Verified benchmark to gauge frontier model performance, citing contamination and unreliability. A recent audit found that 59.4% of audited problems contain flawed test cases that reject correct solutions, and that all tested frontier models could reproduce the gold patch or problem details, indicating that they were trained on the same data. As a result, progress on SWE‑bench Verified no longer reflects genuine coding ability but rather prior exposure to the benchmark. OpenAI recommends that developers report results on the newer, uncontaminated SWE‑bench Pro and is developing further evaluations.
  • A recent report has led to the discontinuation of SWE‑bench Verified as a benchmark for evaluating software engineering progress. The analysis found that the dataset contains significant contamination and training leakage, which undermines the validity of its results. Test cases were identified as flawed, leading to inflated or misleading performance metrics for state‑of‑the‑art models. As a result, the authors recommend switching to the newer SWE‑bench Pro, which is designed to address these issues and provide a more reliable assessment of frontier coding capabilities.

Sources: