
Monthly-SWEBench

🏆 2026.03: Claude-Opus-4.6 · Coding

Monthly-SWEBench Leaderboard

Contributors: Haiyang Shen*1 · Xinbo Xu*1 · Xuanzhong Chen1 · Wendong Xu1 · Elvis Zhang1 · Kaiyuan Chen2 · Rui Wang2
Xiaobo Hu2 · Yang Liu2 · Yixin Ren2 · Yuan Gong2 · Liang Chen1 · Kuan Li1
1UniPat AI, 2xbench

Correspondence: <wendongxu@unipat.ai>, <kuanli@unipat.ai>

A continuously updated benchmark with fresh, real GitHub issues across multiple programming languages. Every month, we source ~100 new high-quality tasks from freshly merged PRs, keeping the benchmark alive and the leaderboard honest.

  • 112 total tasks (68 bug fix, 44 non-bug fix)
  • 10 LLMs evaluated
  • Pass@1 metric
The next evaluation round (Apr 2026) is coming soon and will include additional harnesses. New models and updated scores will be added monthly.

Task Categories

112 real-world tasks sourced from closed GitHub PRs

  • 68 bugfix: bug-oriented fixes & maintenance patches
  • 44 non_bugfix: feature additions, API evolution & engineering improvements

Construction Pipeline

From GitHub PRs to a fixed evaluation pool

  1. Sample candidates from closed PRs, with multi-round filtering by time window, repository scope, and task quality
  2. Construct each task in Harbor format: environment, instruction, and verifier
  3. Run the oracle to validate solvability and verifier correctness; retain only stably evaluable tasks
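The Harbor schema itself is not reproduced in this post; as a rough sketch, each task bundles the three components named in step 2 plus the oracle solution used in step 3. All field names below are illustrative, not the actual Harbor format:

```python
from dataclasses import dataclass

@dataclass
class Task:
    """Minimal sketch of a benchmark task record.

    The real Harbor schema may differ; these fields only mirror
    the components described in the pipeline steps above.
    """
    task_id: str
    environment: str        # e.g. a container image pinning the repo snapshot
    instruction: str        # natural-language problem statement given to the agent
    verifier: str           # path to the fail-to-pass test script
    reference_patch: str = ""  # oracle solution, used only for validation

task = Task(
    task_id="demo-001",
    environment="python:3.12-repo-snapshot",
    instruction="Fix the off-by-one error in pagination.",
    verifier="tests/run_verifier.sh",
)
```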

Quality Assurance

Multi-layer QC before entering the eval pool

  • Remove future branches to avoid contamination
  • Multi-round filtering during PR selection and task construction
  • Oracle pass: every task has a known-reachable reference solution
  • Claude Opus 4.6 Pass@4 rollout to calibrate difficulty and fairness

Key Findings (Mar 2026)

1. High task difficulty. Most models average 30-60 episodes per task, indicating these are real multi-step engineering challenges requiring extended reasoning and trial-and-error, not simple one-shot patches.
2. Trajectory efficiency varies widely. Gemini achieves strong accuracy with the lowest average episode count (~13), while GLM-5.1, MiniMax, Qwen, and GLM-5 require 3-5x more steps, revealing significant differences in how directly models solve problems.
3. Clear model tiers. Opus leads decisively; GLM-5.1 and GPT-5.4-High form a strong second tier; Qwen-3.6-Plus and Gemini compete closely behind; GLM-5, Kimi, and MiniMax cluster together, showing emerging but still maturing agent capabilities across providers.
4. Bug fix vs. non-bug fix signal. All models score higher on bug fix tasks (localized repairs, regression fixes) than on non-bug fix tasks (feature additions, API evolution), confirming that broader-scope changes remain significantly harder for current agents.
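The leaderboard reports Pass@1, and the quality-assurance stage uses a Pass@4 rollout. For reference, the standard unbiased pass@k estimator (computed from n rollouts of which c succeed) can be written as below; the helper name is ours, not part of the benchmark tooling:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples
    drawn without replacement from n rollouts (c correct) passes.
    """
    if n - c < k:
        # Fewer failures than samples drawn: a success is guaranteed.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Pass@1 reduces to the empirical success rate c / n.
print(pass_at_k(4, 1, 1))  # → 0.25
```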

The Rise and Fall of Static Benchmarks

If you can't measure it, you can't improve it. But if your measurement never changes, you stop improving and start memorizing.
On the lifecycle of AI benchmarks

SWE-bench (2023) took a practical approach to evaluating coding agents: derive tasks from real GitHub issues in popular Python repositories and use the existing test suites to judge correctness. The benchmark gained wide adoption, and model performance on it improved rapidly over the following years.

But scores went up faster than real-world usefulness.

As more teams optimized for SWE-bench, scores kept rising while practitioners reported little matching improvement in day-to-day coding assistance. The benchmark was being overfitted, not through explicit cheating, but through the gradual contamination of training data that inevitably absorbs widely-studied public benchmarks.

The Overfitting Cycle

A benchmark is released → the community studies it intensely → solutions, discussions, and patch patterns enter the public corpus → models train on that corpus → benchmark scores inflate → the benchmark loses its ability to discriminate real capability. This cycle is not a failure of any single team; it is a structural property of static benchmarks.

SWE-bench Verified attempted to address quality concerns. OpenAI partnered with the original authors to have human annotators review 500 tasks from the original 2,294, filtering out under-specified or ambiguous problems. This improved the benchmark’s reliability, but it was still a static, fixed set. OpenAI eventually announced they would no longer evaluate on SWE-bench Verified, citing concerns that even the curated subset was becoming saturated.

SWE-bench-Pro pushed further with harder tasks. Yet the underlying problem remains: any fixed set of tasks has an expiration date. The moment a benchmark is published, its clock starts ticking toward obsolescence.

SWE-bench Live introduced periodic updates with new tasks, taking a step toward liveness. However, it remains Python-focused, covers only bug fixes, and lacks rigorous instruction-test alignment verification, leaving room for ambiguous or under-specified tasks to slip through.


Why We Need a Living, Verified, and Broad Benchmark

The problems outlined above are not isolated issues. They reflect fundamental gaps in existing benchmarks: liveness (static task sets expire), verification (task quality is not systematically guaranteed), and breadth (limited to Python and bug fixes only). A next-generation software engineering benchmark must address all three. Monthly-SWEBench is built around six pillars spanning these dimensions.

🕑
Anti-Contamination by Design
Every month, ~100 new tasks from freshly merged PRs across fresh repositories. All repositories and PRs are newly selected and were not part of any previous release, making data contamination and benchmark-specific optimization structurally impossible.
🌐
Broad Coverage
Covering Python, Go, C++, Java, Rust, TypeScript, JavaScript and more across ~20 repositories per month, reflecting the real diversity of open-source ecosystems.
Sufficient Complexity
Tasks are challenging enough to differentiate model capabilities. Typical tasks span 2+ files and 30+ lines of meaningful changes, requiring genuine repository-level understanding.
🔗
Instruction-Test Alignment
Every tested behavior is explicitly described in the instruction. No hidden API names, parameter orders, or interface contracts that silently fail correct implementations.
🔬
Rigorous Test Quality
Tests run real code with multiple inputs and check actual outputs. No source-code text matching, no single-case tests bypassable by hardcoding.
🛠
Bug Fix + Feature Tasks
Unlike prior benchmarks that focus exclusively on bug fixes, we include both bug fixes and feature additions. We carefully curate feature PRs that extend existing interfaces, making fail-to-pass verification possible for a broader task spectrum.
Why "Live" Alone Is Not Enough

Freshness prevents overfitting, but fresh tasks can still be poorly constructed:

  • Instruction-test misalignment: the issue description says one thing, but the test suite silently requires specific API names or implementation details never mentioned
  • Insufficient tests: some tasks have only one or two test cases, trivially bypassable through hardcoded outputs
  • Dirty PRs: real-world PRs bundle unrelated changes, and tasks inherit this noise
  • Non-behavioral verification: some test scripts check source code text rather than running the code
  • Bug-fix-only coverage: feature additions, behavior changes, and API extensions are equally important in real development but are excluded by prior benchmarks because they are harder to validate with fail-to-pass testing

Monthly-SWEBench addresses all of these through systematic verification at every stage of the pipeline.


The following table summarizes where each benchmark stands across these dimensions:

| Benchmark | Tasks | Languages | Task Types | Updated | Test Quality | Instruction-Test Alignment |
|---|---|---|---|---|---|---|
| SWE-bench | 2,294 | Python only | Bug fix only | Static | Partial | No |
| SWE-bench Verified | 500 | Python only | Bug fix only | Static | Human review | Partial |
| SWE-bench-Pro | Varies | Python only | Bug fix only | Static | Yes | Partial |
| SWE-bench Live | Growing | Python-focused | Bug fix only | Periodic | Partial | Partial |
| Monthly-SWEBench | ~100/month | Multiple languages | Bug fix + Feature | Monthly | Rigorous pipeline | Strict alignment |

The Monthly-SWEBench Pipeline

Our benchmark is not hand-crafted. It is produced by a multi-stage pipeline that starts from the global stream of open-source activity and distills it into high-quality, verified tasks. Here is what happens at each stage.

Repository Discovery → PR Collection & Filtering → Task Construction & Quality Gates → Fail-to-Pass Verification → Alignment Verification → Monthly Release
1. Repository Discovery & Filtering

  • Scan GitHub for actively maintained repositories across Python, Go, C++, Java, Rust, TypeScript, and JavaScript
  • Require a minimum star count (≥500) and recent activity (pushed within the last 30 days)
  • Apply multi-dimensional filtering: test signal presence, release cadence (≥3 releases), code scale (10K-1M lines of code), commit activity
  • Each monthly batch targets ~20 high-quality, actively maintained repositories spanning multiple language ecosystems
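The discovery filters above can be sketched as a single predicate over repository metadata. The dictionary schema here is illustrative (in practice these fields would come from the GitHub API or a code-scale tool), not part of the benchmark's actual implementation:

```python
from datetime import datetime, timedelta

def is_candidate_repo(repo: dict, now: datetime) -> bool:
    """Apply the repository-discovery filters described above.

    Field names are illustrative stand-ins for GitHub metadata.
    """
    recently_pushed = (
        now - datetime.fromisoformat(repo["pushed_at"]) <= timedelta(days=30)
    )
    return (
        repo["stars"] >= 500                       # minimum star count
        and recently_pushed                        # active in last 30 days
        and repo["release_count"] >= 3             # release cadence
        and 10_000 <= repo["lines_of_code"] <= 1_000_000  # code scale
        and repo["has_tests"]                      # test signal present
    )
```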
2. PR Collection & Filtering

  • Collect PRs merged to the default branch within a specific time window (e.g., the past 30 days)
  • Must include test changes: PRs without any test file modifications are excluded, ensuring every candidate PR has a verifiable testing signal
  • Minimum complexity: at least 1 changed source file and 30+ lines of code additions (excluding pure documentation, configuration, or formatting changes)
  • Maximum scope: exclude overly large PRs (>100 changed files) that would be impractical as individual tasks
  • Exclude pure UI/cosmetic changes, dependency bumps, and CI configuration updates
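Analogously, the PR-level filters can be sketched as a predicate. The field names and the crude path-based heuristics for detecting test and documentation files are illustrative assumptions, not the benchmark's actual rules:

```python
DOC_SUFFIXES = (".md", ".rst", ".txt")  # illustrative doc/config heuristic

def is_candidate_pr(pr: dict) -> bool:
    """Apply the PR collection filters described above (sketch)."""
    touches_tests = any("test" in path for path in pr["changed_files"])
    source_files = [p for p in pr["changed_files"]
                    if not p.endswith(DOC_SUFFIXES)]
    return (
        pr["merged_to_default_branch"]      # merged within the time window
        and touches_tests                   # verifiable testing signal
        and len(source_files) >= 1          # at least one source file
        and pr["additions"] >= 30           # minimum complexity
        and len(pr["changed_files"]) <= 100 # maximum scope
    )
```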
3. Task Construction & Quality Gates

Each PR that passes the earlier filters is transformed into a standalone benchmark task in Harbor format. The task construction process enforces multiple quality gates:

📄 Instruction-Test Strict Alignment

Every behavior that the test suite verifies must be explicitly described in the task instruction. If the test checks a specific function name, parameter signature, return type, CLI flag, or file path, the instruction must specify it. No hidden requirements.

🚫 Irrelevant Functionality Filtering

Real-world PRs are often "dirty": a single PR may fix a bug, refactor adjacent code, update docs, and tweak CI, all in one commit. We extract the core functional change and filter out everything else. The resulting task focuses on one coherent problem.

🔬 Optimized Verification Scripts

Tests must be behavioral verification: run the actual code with real inputs and check the actual outputs. We prohibit source-code text matching (grep for function names), exit-code-only checks, and single-input tests that can be trivially hardcoded around. Each fail-to-pass test requires multiple distinct inputs and checks output content with sufficient information density.
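The contrast can be made concrete with a toy example. Here `parse_duration` and the file path are hypothetical; only the pattern matters:

```python
def text_match_check(repo_path: str) -> bool:
    # PROHIBITED pattern: passes if the function name merely appears
    # in the source text (even in a comment); the code never runs.
    return "parse_duration" in open(f"{repo_path}/timeutil.py").read()

def behavioral_check(parse_duration) -> bool:
    # REQUIRED pattern: run the code on multiple distinct inputs and
    # compare actual outputs, so a hardcoded answer cannot pass.
    cases = {"90s": 90, "2m": 120, "1h30m": 5400}
    return all(parse_duration(text) == seconds
               for text, seconds in cases.items())
```

A solution that hardcodes a single expected output passes the first check but not the second, which is exactly why multiple distinct inputs are required.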

Solver-Perspective Review

Every task is reviewed from the perspective of a solver who has never seen the original PR and can only read the instruction plus the unpatched repository. If a reasonable solver could produce a semantically correct implementation that differs in interface shape (e.g., different method naming) and the test would reject it, the task is disqualified until the instruction or test is fixed.

📚 Multi-dimensional Test Roles

Each task includes multiple test categories working together:

  • Regression tests: guard existing functionality that must not break
  • Fail-to-pass core tests: directly verify the primary fix or feature
  • Fail-to-pass edge tests: cover boundary conditions and special inputs
  • Fail-to-pass error handling tests: verify graceful handling of invalid inputs
4. Fail-to-Pass Verification

Every task must pass a two-phase verification to confirm that the task is solvable and that the tests correctly distinguish between unpatched and patched states.

Beyond Bug Fixes: Why Feature Tasks Are Hard but Valuable

Most existing SWE-bench variants focus exclusively on bug fix PRs: for bug fixes, the code already exists and simply produces wrong results, making fail-to-pass straightforward. But real-world software engineering is not just about fixing bugs. Feature additions, behavior changes, and API extensions are equally important and often harder.

We invest significant effort in carefully selecting feature-type PRs that modify or extend existing interfaces rather than creating entirely new modules. For example, adding a new parameter to an existing function, changing the behavior of an existing command, or extending an existing data format. In all these cases, the pre-patch code runs but behaves differently, making fail-to-pass verification possible. This careful curation allows Monthly-SWEBench to cover a fuller spectrum of real development work that prior benchmarks have left out.
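A toy example of the "extend an existing interface" pattern (all names invented): a feature PR adds a `sep` parameter to an existing function with a backward-compatible default. The pre-patch code runs, but a test exercising the new parameter fails on it and passes on the patched version, which is the fail-to-pass pattern:

```python
# Pre-patch: runs fine, but has no `sep` parameter.
def join_names_v1(names):
    return ", ".join(names)

# Post-patch: the feature PR adds `sep` with a compatible default.
def join_names_v2(names, sep=", "):
    return sep.join(names)

def fail_to_pass_test(join_names) -> bool:
    # Exercises the new behavior: raises TypeError on v1, passes on v2.
    try:
        return join_names(["a", "b"], sep=" | ") == "a | b"
    except TypeError:
        return False
```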

Two-Phase Verification

Each task is run twice from the same repository snapshot:

| Phase | Solve Script | Expected Reward | Regression Tests | Fail-to-Pass Tests |
|---|---|---|---|---|
| Phase 1 (unpatched) | Empty (no fix) | reward = 0 | All pass | All fail |
| Phase 2 (patched) | Reference solution | reward = 1 | All pass | All pass |

This guarantees that tests are neither too easy (passing without a fix) nor broken (failing even with the correct fix). In Phase 1, the fail-to-pass tests must fail because the code itself is buggy, not because infrastructure is missing.
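The two-phase check can be sketched as a single function. Here `run_tests(patch)` is an assumed callable that executes the task's test suite on a repo snapshot with `patch` applied and returns whether the regression and fail-to-pass tests passed; the signature is ours, not the benchmark's:

```python
def verify_task(run_tests, reference_patch) -> bool:
    """Two-phase verification sketch (assumed `run_tests` interface).

    A task is accepted only if tests fail for the right reason
    unpatched (Phase 1) and fully pass with the oracle (Phase 2).
    """
    # Phase 1: empty solve script. Regressions pass, F2P tests fail.
    reg_ok_1, f2p_ok_1 = run_tests(patch=None)
    # Phase 2: reference solution applied. Everything passes.
    reg_ok_2, f2p_ok_2 = run_tests(patch=reference_patch)
    return reg_ok_1 and not f2p_ok_1 and reg_ok_2 and f2p_ok_2
```

A verifier whose fail-to-pass tests already pass unpatched (too easy) or still fail patched (broken) is rejected by this check.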

5. Instruction-Test Alignment Verification

  • Direct review of instructions and test cases to ensure every tested behavior is explicitly described in the task instruction
  • Review of agent reasoning trajectories to distinguish genuine task difficulty from errors caused by ambiguous or under-specified instructions
  • The final task set is carefully curated to be discriminative across current agent models
6. Monthly Release

  • ~100 verified tasks are released each month along with updated leaderboard results
  • All tasks come from fresh PRs and fresh repositories that were not part of any previous release
  • Every released task has passed all quality gates, fail-to-pass verification, and instruction-test alignment verification

Selection funnel (per month):

  • Raw PRs: ~10,000
  • Merged to default branch: ~5,000
  • Has test changes: ~2,000
  • Meets complexity threshold: ~800
  • Passes quality gates: ~300
  • Fail-to-Pass verified: ~150
  • Alignment verified: ~100

Conclusion

The era of static, unverified software engineering benchmarks is ending. As language models grow more capable, three problems become critical: fixed task sets get overfitted, poorly constructed tasks give misleading signals, and narrow scope fails to reflect the diversity of real-world engineering. A benchmark must address all three.

Monthly-SWEBench addresses all three:

  • Live: monthly releases of ~100 tasks from freshly merged PRs across fresh repositories ensure models cannot memorize their way to high scores
  • Verified: every task passes strict instruction-test alignment review, behavioral test quality checks, fail-to-pass dual-phase verification, and solver-perspective auditing to ensure no hidden requirements
  • Broad: covering Python, Go, C++, Java, Rust, TypeScript, JavaScript and more across ~20 repositories, with both bug fixes and feature additions covering a fuller spectrum of real-world software engineering than any prior benchmark

Honest evaluation is a prerequisite for genuine progress. We need benchmarks that reflect where AI coding agents actually are, not where they were six months ago.

Citation

If you find Monthly-SWEBench useful in your research, please cite:

@misc{monthlyswebench2026,
    title={Monthly-SWEBench: A Living, Rigorously Verified Benchmark
           for Real-World Software Engineering},
    author={Haiyang Shen and Xinbo Xu and Xuanzhong Chen and Wendong Xu and Elvis Zhang and Kaiyuan Chen and Xiaobo Hu and Rui Wang and Yang Liu and Yixin Ren and Yuan Gong and Liang Chen and Kuan Li},
    year={2026},
    note={Blog post: https://unipat.ai/benchmarks/MonthlySWEBench}
}