SWE-bench Pro
Hold · Tools
A proposed alternative to SWE-bench Verified, intended to measure coding progress with less contamination.
Why it's here
Placed in Hold: 1 article of evidence from 1 source, led by research-stage coverage, with none in the last 30 days. Confidence: 24%. With little accumulated evidence, the placement defaults conservatively pending more signal.
Evidence (1)
- 7 · OpenAI Blog · 2/23/2026 · research · OpenAI Stops Evaluating on SWE-bench Verified
OpenAI says SWE-bench Verified has become increasingly contaminated and no longer measures frontier coding progress reliably. The company cites flawed tests and training leakage, and recommends using SWE-bench Pro instead.