Why we no longer evaluate SWE-bench Verified

• SWE-bench Verified is increasingly contaminated and mismeasures frontier coding progress. • Our analysis shows flawed tests and training leakage. • We recommend SWE-bench Pro.

The Peace-Athabasca Delta is at risk. Here's what we can do to evaluate the threats

• River deltas are among the most complex and productive environments on Earth. • Yet, they face serious threats from upstream industrialization and climate change, which alter sup

Science · February 18, 2026 (updated February 24, 2026) · 1 min · 99 words

How to Evaluate Jailbreak Methods: A Case Study with the StrongREJECT Benchmark

• When we began studying jailbreak evaluations, we found a fascinating paper claiming that you could jailbreak frontier LLMs simply by translating forbidden prompts into obscure la

Research · August 28, 2024 (updated February 19, 2026) · 2 min · 219 words