SWE-bench Verified
Hold · Tools
A benchmark suite for evaluating software engineering and coding agents on real-world GitHub issues.
Why it's here
Placed in Hold: 1 article of evidence from 1 source, led by research-stage coverage, with none in the last 30 days. Confidence 24%. With little accumulated evidence, the placement defaults conservatively pending more signal.
Evidence (1)
- 7 · OpenAI Blog · 2/23/2026 · research · OpenAI Stops Evaluating on SWE-bench Verified
OpenAI says SWE-bench Verified has become increasingly contaminated and no longer measures frontier coding progress reliably. The company cites flawed tests and training leakage, and recommends using SWE-bench Pro instead.