Monthly-SWEBench
🏆 2026.03: Claude-Opus-4.6 · Coding
Monthly-SWEBench Leaderboard
Xiaobo Hu² · Yang Liu² · Yixin Ren² · Yuan Gong² · Liang Chen¹ · Kuan Li¹
¹UniPat AI, ²xbench
Correspondence: <wendongxu@unipat.ai>, <kuanli@unipat.ai>
A continuously updated benchmark with fresh, real GitHub issues across multiple programming languages. Every month, we source ~100 new high-quality tasks from freshly merged PRs, keeping the benchmark alive and the leaderboard honest.
Task Categories
112 real-world tasks sourced from closed GitHub PRs
Construction Pipeline
From GitHub PRs to a fixed evaluation pool
1. Candidates sampled from closed PRs with multi-round filtering by time window, repository scope, and task quality
2. Construct each task: environment, instruction, and verifier
3. Package all tasks in Harbor format; run the oracle to validate solvability and verifier correctness, retaining only stably-evaluable tasks
Quality Assurance
Multi-layer QC before entering the eval pool
- Remove future branches to avoid contamination
- Multi-round filtering during PR selection and task construction
- Oracle pass: every task has a known-reachable reference solution
- Claude Opus 4.6 Pass@4 rollout to calibrate difficulty and fairness
Key Findings (Mar 2026)
The Rise and Fall of Static Benchmarks
SWE-bench (2023) took a practical approach to evaluating coding agents: derive tasks from real GitHub issues in popular Python repositories and use the existing test suites to judge correctness. The benchmark gained wide adoption, and model performance on it improved rapidly over the following years.
But scores went up faster than real-world usefulness.
As more teams optimized for SWE-bench, scores kept rising while practitioners reported little matching improvement in day-to-day coding assistance. The benchmark was being overfitted, not through explicit cheating, but through gradual contamination: training corpora inevitably absorb widely studied public benchmarks.
A benchmark is released → the community studies it intensely → solutions, discussions, and patch patterns enter the public corpus → models train on that corpus → benchmark scores inflate → the benchmark loses its ability to discriminate real capability. This cycle is not a failure of any single team; it is a structural property of static benchmarks.
SWE-bench Verified attempted to address quality concerns. OpenAI partnered with the original authors to have human annotators review 500 tasks from the original 2,294, filtering out under-specified or ambiguous problems. This improved the benchmark’s reliability, but it was still a static, fixed set. OpenAI eventually announced they would no longer evaluate on SWE-bench Verified, citing concerns that even the curated subset was becoming saturated.
SWE-bench-Pro pushed further with harder tasks. Yet the underlying problem remains: any fixed set of tasks has an expiration date. The moment a benchmark is published, its clock starts ticking toward obsolescence.
SWE-bench Live introduced periodic updates with new tasks, taking a step toward liveness. However, it remains Python-focused, covers only bug fixes, and lacks rigorous instruction-test alignment verification, leaving room for ambiguous or under-specified tasks to slip through.
Why We Need a Living, Verified, and Broad Benchmark
The problems outlined above are not isolated issues. They reflect fundamental gaps in existing benchmarks: liveness (static task sets expire), verification (task quality is not systematically guaranteed), and breadth (limited to Python and bug fixes only). A next-generation software engineering benchmark must address all three. Monthly-SWEBench is built around six pillars spanning these dimensions.
Freshness prevents overfitting, but fresh tasks can still be poorly constructed:
- Instruction-test misalignment: the issue description says one thing, but the test suite silently requires specific API names or implementation details that are never mentioned
- Insufficient tests: some tasks have only one or two test cases, trivially bypassable through hardcoded outputs
- Dirty PRs: real-world PRs bundle unrelated changes, and tasks inherit this noise
- Non-behavioral verification: some test scripts check source code text rather than running the code
- Bug-fix-only coverage: feature additions, behavior changes, and API extensions are equally important in real development but are excluded by prior benchmarks because they are harder to validate with fail-to-pass testing

Monthly-SWEBench addresses all of these through systematic verification at every stage of the pipeline.
The following table summarizes where each benchmark stands across these dimensions:
| Benchmark | Tasks | Languages | Task Types | Updated | Test Quality | Instruction-Test Alignment |
|---|---|---|---|---|---|---|
| SWE-bench | 2,294 | Python only | Bug fix only | Static | Partial | No |
| SWE-bench Verified | 500 | Python only | Bug fix only | Static | Human review | Partial |
| SWE-bench-Pro | Varies | Python only | Bug fix only | Static | Yes | Partial |
| SWE-bench Live | Growing | Python-focused | Bug fix only | Periodic | Partial | Partial |
| Monthly-SWEBench | ~100/month | Multiple languages | Bug fix + Feature | Monthly | Rigorous pipeline | Strict alignment |
The Monthly-SWEBench Pipeline
Our benchmark is not hand-crafted. It is produced by a multi-stage pipeline that starts from the global stream of open-source activity and distills it into high-quality, verified tasks. Here is what happens at each stage.
Pipeline stages: Repository Discovery & Filtering → PR Collection & Filtering → Task Construction & Quality Gates → Fail-to-Pass Verification → Instruction-Test Alignment Verification → Monthly Release
Repository Discovery & Filtering
- Scan GitHub for actively maintained repositories across Python, Go, C++, Java, Rust, TypeScript, and JavaScript
- Require a minimum star count (≥500) and recent activity (pushed within the last 30 days)
- Apply multi-dimensional filtering: test signal presence, release cadence (≥3 releases), code scale (10K~1M lines of code), commit activity
- Each monthly batch targets ~20 high-quality, actively maintained repositories spanning multiple language ecosystems (see the sketch below)
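These discovery criteria amount to a GitHub search query plus a few per-repository checks. The following is a minimal sketch, assuming the public GitHub REST API via `requests`; the thresholds come from the criteria above, while the function names are invented and code scale is approximated by repository byte counts rather than true lines of code (the test-signal and commit-activity checks are omitted for brevity).

```python
# Minimal, illustrative sketch of the repository discovery filters.
import datetime
import requests

API = "https://api.github.com"

def candidate_repos(language: str, min_stars: int = 500, days: int = 30) -> list[dict]:
    """Search one language ecosystem for active, sufficiently popular repositories."""
    pushed_after = (datetime.date.today() - datetime.timedelta(days=days)).isoformat()
    query = f"language:{language} stars:>={min_stars} pushed:>={pushed_after} archived:false"
    resp = requests.get(f"{API}/search/repositories",
                        params={"q": query, "sort": "updated", "per_page": 50})
    resp.raise_for_status()
    return resp.json()["items"]

def passes_extra_filters(full_name: str) -> bool:
    """Per-repository checks: release cadence and rough code scale."""
    releases = requests.get(f"{API}/repos/{full_name}/releases",
                            params={"per_page": 3}).json()
    if len(releases) < 3:                        # require at least 3 releases
        return False
    # /languages reports bytes per language; bytes stand in here for the
    # 10K~1M lines-of-code window (an assumption, not the real heuristic).
    total_bytes = sum(requests.get(f"{API}/repos/{full_name}/languages").json().values())
    return 500_000 <= total_bytes <= 50_000_000
```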
PR Collection & Filtering
- Collect PRs merged to the default branch within a specific time window (e.g., the past 30 days)
- Must include test changes: PRs without any test file modifications are excluded, ensuring every candidate PR has a verifiable testing signal
- Minimum complexity: at least 1 changed source file and 30+ lines of code additions (excluding pure documentation, configuration, or formatting changes)
- Maximum scope: exclude overly large PRs (more than 100 changed files) that would be impractical as individual tasks
- Exclude pure UI/cosmetic changes, dependency bumps, and CI configuration updates (a code sketch of these filters follows below)
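A minimal sketch of how these PR-level filters could be applied to a single merged PR, assuming its changed-file list and per-file addition counts have already been fetched; the data shapes, helper names, and path heuristics here are illustrative assumptions, not the actual pipeline code.

```python
# Illustrative candidate-PR filter (names and heuristics are hypothetical).
from dataclasses import dataclass

TEST_HINTS = ("test", "tests/", "_test.", "spec.")
NON_SOURCE_HINTS = (".md", ".rst", ".txt", ".yml", ".yaml", ".toml", ".json",
                    ".lock", ".github/", "docs/")

@dataclass
class ChangedFile:
    path: str
    additions: int

def is_test_file(path: str) -> bool:
    return any(hint in path.lower() for hint in TEST_HINTS)

def is_source_file(path: str) -> bool:
    p = path.lower()
    return not is_test_file(p) and not any(hint in p for hint in NON_SOURCE_HINTS)

def keep_pr(files: list[ChangedFile]) -> bool:
    """Apply the filters above: test signal, minimum complexity, maximum scope."""
    if len(files) > 100:                                   # maximum scope
        return False
    if not any(is_test_file(f.path) for f in files):       # must modify tests
        return False
    source = [f for f in files if is_source_file(f.path)]
    if not source:                                         # at least 1 changed source file
        return False
    return sum(f.additions for f in source) >= 30          # 30+ added lines of code
```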
Task Construction & Quality Gates
Each PR that passes the earlier filters is transformed into a standalone benchmark task in Harbor format. The task construction process enforces multiple quality gates:
📄 Instruction-Test Strict Alignment
Every behavior that the test suite verifies must be explicitly described in the task instruction. If the test checks a specific function name, parameter signature, return type, CLI flag, or file path, the instruction must specify it. No hidden requirements.
🚫 Irrelevant Functionality Filtering
Real-world PRs are often "dirty": a single PR may fix a bug, refactor adjacent code, update docs, and tweak CI, all in one commit. We extract the core functional change and filter out everything else. The resulting task focuses on one coherent problem.
🔬 Optimized Verification Scripts
Tests must be behavioral verification: run the actual code with real inputs and check the actual outputs. We prohibit source-code text matching (grep for function names), exit-code-only checks, and single-input tests that can be trivially hardcoded around. Each fail-to-pass test requires multiple distinct inputs and checks output content with sufficient information density.
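To make the distinction concrete, here is a sketch contrasting a prohibited check with an acceptable behavioral fail-to-pass test. The module and function (`mypkg.durations.parse_duration`) are hypothetical names invented for illustration, not drawn from any released task.

```python
import pytest

from mypkg.durations import parse_duration   # hypothetical function added by the reference patch

# PROHIBITED: a "check" that greps the source text and never executes any code.
def bad_check_source_text():
    with open("mypkg/durations.py") as src:
        assert "def parse_duration" in src.read()   # trivially gamed, verifies no behavior

# REQUIRED: behavioral fail-to-pass tests with multiple distinct inputs,
# asserting on actual output content rather than exit codes or source text.
@pytest.mark.parametrize("text,seconds", [
    ("90s", 90),        # core behavior
    ("2m30s", 150),     # compound units
    ("1h", 3600),       # larger unit
    ("0s", 0),          # boundary input
])
def test_parse_duration_values(text, seconds):
    assert parse_duration(text) == seconds

def test_parse_duration_rejects_invalid_input():
    with pytest.raises(ValueError):
        parse_duration("not-a-duration")
```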
⚖️ Solver-Perspective Review
Every task is reviewed from the perspective of a solver who has never seen the original PR and can only read the instruction plus the unpatched repository. If a reasonable solver could produce a semantically correct implementation that differs in interface shape (e.g., different method naming) and the test would reject it, the task is disqualified until the instruction or test is fixed.
📚 Multi-dimensional Test Roles
Each task includes multiple test categories working together:
- Regression tests: guard existing functionality that must not break
- Fail-to-pass core tests: directly verify the primary fix or feature
- Fail-to-pass edge tests: cover boundary conditions and special inputs
- Fail-to-pass error handling tests: verify graceful handling of invalid inputs
Fail-to-Pass Verification
Every task must pass a two-phase verification to confirm that the task is solvable and that the tests correctly distinguish between unpatched and patched states.
Beyond Bug Fixes: Why Feature Tasks Are Hard but Valuable
Most existing SWE-bench variants focus exclusively on bug-fix PRs: for bug fixes, the code already exists and simply produces wrong results, so fail-to-pass verification is straightforward. But real-world software engineering is not just about fixing bugs. Feature additions, behavior changes, and API extensions are equally important and often harder.
We invest significant effort in carefully selecting feature-type PRs that modify or extend existing interfaces rather than creating entirely new modules. For example, adding a new parameter to an existing function, changing the behavior of an existing command, or extending an existing data format. In all these cases, the pre-patch code runs but behaves differently, making fail-to-pass verification possible. This careful curation allows Monthly-SWEBench to cover a fuller spectrum of real development work that prior benchmarks have left out.
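As a small illustration of why such feature tasks still admit fail-to-pass verification, consider a hypothetical task that asks for a new `binary=` keyword on an existing `format_size` helper (all names here are invented): the pre-patch code runs fine, but the test below fails with `TypeError` until the reference patch adds the parameter.

```python
# Hypothetical pre-patch helper: it works today, but has no `binary` option.
def format_size(n_bytes: int) -> str:
    return f"{n_bytes / 1000:.1f} kB"

# Fail-to-pass test for the requested feature (a `binary=` keyword argument):
#   unpatched  -> TypeError (unexpected keyword argument)  => test fails
#   patched    -> returns "1.0 KiB"                        => test passes
def test_binary_units():
    assert format_size(1024, binary=True) == "1.0 KiB"
```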
Two-Phase Verification
Each task is run twice from the same repository snapshot:
| Phase | Solve Script | Expected Reward | Regression Tests | Fail-to-Pass Tests |
|---|---|---|---|---|
| Phase 1 (unpatched) | Empty (no fix) | reward = 0 | All pass | All fail |
| Phase 2 (patched) | Reference solution | reward = 1 | All pass | All pass |
This guarantees that tests are neither too easy (passing without a fix) nor broken (failing even with the correct fix). In Phase 1, the fail-to-pass tests must fail because the code itself is buggy, not because infrastructure is missing.
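A minimal sketch of how this two-phase check could be orchestrated for a single task is shown below; `run_task` stands in for whatever actually builds the environment, applies the solve script, and runs the test suite, so its name, arguments, and result fields are assumptions rather than the real harness API.

```python
# Illustrative two-phase validation loop (all names are hypothetical).
from dataclasses import dataclass

@dataclass
class PhaseResult:
    reward: int                 # 0 or 1, as reported by the task's verifier
    regression_passed: bool     # existing tests still green
    fail_to_pass_passed: bool   # new tests pass only once the fix is applied

def run_task(task_id: str, solve_script: str | None) -> PhaseResult:
    """Build the task snapshot, optionally apply a solve script, then run all tests."""
    raise NotImplementedError("stand-in for the real execution harness")

def validate_task(task_id: str, reference_solution: str) -> bool:
    """Return True only if the task passes both verification phases."""
    unpatched = run_task(task_id, solve_script=None)               # Phase 1: no fix
    patched = run_task(task_id, solve_script=reference_solution)   # Phase 2: oracle fix
    phase1_ok = (unpatched.reward == 0
                 and unpatched.regression_passed
                 and not unpatched.fail_to_pass_passed)
    phase2_ok = (patched.reward == 1
                 and patched.regression_passed
                 and patched.fail_to_pass_passed)
    return phase1_ok and phase2_ok
```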
Instruction-Test Alignment Verification
- Direct review of instructions and test cases to ensure every tested behavior is explicitly described in the task instruction
- Review of agent reasoning trajectories to distinguish genuine task difficulty from errors caused by ambiguous or under-specified instructions
- The final task set is carefully curated to be discriminative across current agent models
Monthly Release
- ~100 verified tasks are released each month along with updated leaderboard results
- All tasks come from fresh PRs and fresh repositories that were not part of any previous release
- Every released task has passed all quality gates, fail-to-pass verification, and instruction-test alignment verification
Conclusion
The era of static, unverified software engineering benchmarks is ending. As language models grow more capable, three problems become critical: fixed task sets get overfitted, poorly constructed tasks give misleading signals, and narrow scope fails to reflect the diversity of real-world engineering. A benchmark must address all three.
Monthly-SWEBench addresses all three:
- Live: monthly releases of ~100 tasks from freshly merged PRs across fresh repositories ensure models cannot memorize their way to high scores
- Verified: every task passes strict instruction-test alignment review, behavioral test quality checks, fail-to-pass dual-phase verification, and solver-perspective auditing to ensure no hidden requirements
- Broad: Python, Go, C++, Java, Rust, TypeScript, JavaScript, and more across ~20 repositories, with both bug fixes and feature additions, spanning a fuller spectrum of real-world software engineering than any prior benchmark
Honest evaluation is a prerequisite for genuine progress. We need benchmarks that reflect where AI coding agents actually are, not where they were six months ago.
Citation
If you find Monthly-SWEBench useful in your research, please cite:
@misc{monthlyswebench2026,
  title={Monthly-SWEBench: A Living, Rigorously Verified Benchmark for Real-World Software Engineering},
  author={Haiyang Shen and Xinbo Xu and Xuanzhong Chen and Wendong Xu and Elvis Zhang and Kaiyuan Chen and Xiaobo Hu and Rui Wang and Yang Liu and Yixin Ren and Yuan Gong and Liang Chen and Kuan Li},
  year={2026},
  note={Blog post: https://unipat.ai/benchmarks/MonthlySWEBench}
}