SWE-bench Verified

Hold

Tools

A benchmark suite for evaluating software engineering and coding agents on real-world GitHub issues.

Why it's here

Placed in Hold: 1 article of evidence from 1 source, led by research-stage coverage, with none in the last 30 days. Confidence 24%. With so little accumulated evidence, the placement defaults conservatively pending more signal.

Evidence (1)

7 · OpenAI Blog · 2/23/2026 · research
OpenAI Stops Evaluating on SWE-bench Verified
OpenAI says SWE-bench Verified has become increasingly contaminated and no longer reliably measures frontier coding progress. The company cites flawed tests and training-data leakage, and recommends using SWE-bench Pro instead.