ResearchGym: Evaluating Language Model Agents on Real-World AI Research

• Computer Science > Artificial Intelligence. Submitted on 16 Feb 2026.
• We introduce ResearchGym, a benchmark and execution environment for evaluating AI agents on end-to-end research.
• To instantiate this, we repurpose five oral and spotlight papers from ICML, ICLR, and ACL.
• From each paper’s repository, we preserve the datasets, evaluation harness, and baseline implementations but withhold the paper’s proposed method.
• This results in five containerized task environments comprising 39 sub-tasks in total.
• Within each environment, agents must propose novel hypotheses, run experiments, and attempt to surpass strong human baselines on the paper’s metrics.
• In a controlled evaluation of an agent powered by GPT-5, we observe a sharp capability–reliability gap.

Article Summaries:

Abstract: We introduce ResearchGym, a benchmark and execution environment for evaluating AI agents on end-to-end research. To instantiate this, we repurpose five oral and spotlight papers from ICML, ICLR, and ACL. From each paper’s repository, we preserve the datasets, evaluation harness, and baseline implementations but withhold the paper’s proposed method. This results in five containerized task environments comprising 39 sub-tasks in total.
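
The setup described in the abstract (a task environment that keeps the baselines and asks the agent to surpass them on the paper's metrics) can be pictured with a small sketch. This is not ResearchGym's actual code or API: the class names `SubTask` and `TaskEnvironment`, the method `sub_tasks_beaten`, and every score below are hypothetical placeholders chosen only to illustrate the idea of comparing an agent's results against preserved baselines, sub-task by sub-task.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SubTask:
    """One evaluation target inside a task environment (hypothetical structure)."""
    name: str
    baseline_score: float          # score of the preserved baseline implementation
    higher_is_better: bool = True  # some metrics (e.g. error rates) are minimized


@dataclass
class TaskEnvironment:
    """Hypothetical stand-in for one task environment built from a paper's repository."""
    paper: str
    sub_tasks: list[SubTask] = field(default_factory=list)

    def sub_tasks_beaten(self, agent_scores: dict[str, float]) -> list[str]:
        """Return the names of sub-tasks where the agent's score beats the baseline."""
        beaten = []
        for task in self.sub_tasks:
            score = agent_scores.get(task.name)
            if score is None:
                continue  # the agent produced no result for this sub-task
            if task.higher_is_better:
                improved = score > task.baseline_score
            else:
                improved = score < task.baseline_score
            if improved:
                beaten.append(task.name)
        return beaten


# Placeholder values for illustration only; these are not results from the paper.
env = TaskEnvironment(
    paper="example-paper",
    sub_tasks=[
        SubTask("accuracy_split_a", baseline_score=0.80),
        SubTask("error_split_b", baseline_score=0.25, higher_is_better=False),
    ],
)
beaten = env.sub_tasks_beaten({"accuracy_split_a": 0.83, "error_split_b": 0.30})
print(beaten)
```

Tracking the metric direction per sub-task is a design choice of this sketch, made so that "surpassing the baseline" stays well defined when a metric is minimized rather than maximized.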

Sources:

https://arxiv.org/abs/2602.15112