
Monthly-SWEBench

🏆 2026.03: Claude-Opus-4.6 · Coding

Monthly-SWEBench Leaderboard

Contributors: Haiyang Shen*1 · Xinbo Xu*1 · Xuanzhong Chen1 · Wendong Xu1 · Elvis Zhang1 · Kaiyuan Chen2 · Rui Wang2
Xiaobo Hu2 · Yang Liu2 · Yixin Ren2 · Yuan Gong2 · Liang Chen1 · Kuan Li1
1UniPat AI, 2xbench

Correspondence: <wendongxu@unipat.ai>, <kuanli@unipat.ai>

A continuously updated benchmark with fresh, real GitHub issues across multiple programming languages. Every month, we source ~100 new high-quality tasks from freshly merged PRs, keeping the benchmark alive and the leaderboard honest.

  • 112 total tasks (68 bug fix, 44 non-bug fix)
  • 10 LLMs evaluated
  • Pass@1 metric
The next evaluation round (Apr 2026) is coming soon and will include additional harnesses. New models and updated scores will be added monthly.

Task Categories

112 real-world tasks sourced from closed GitHub PRs

  • 68 bugfix: bug-oriented fixes & maintenance patches
  • 44 non_bugfix: feature additions, API evolution & engineering improvements

Construction Pipeline

From GitHub PRs to a fixed evaluation pool

  1. Sample candidates from closed PRs, with multi-round filtering by time window, repository scope, and task quality
  2. Construct each task in Harbor format: environment, instruction, and verifier
  3. Run the oracle to validate solvability and verifier correctness; retain only stably evaluable tasks
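The Harbor schema itself is not reproduced in this post; as a rough sketch, each task bundles the three components named in step 2 plus the oracle solution used in step 3. All field names below are illustrative, not the actual Harbor format:

```python
from dataclasses import dataclass

@dataclass
class Task:
    """Minimal sketch of a benchmark task record.

    The real Harbor schema may differ; these fields only mirror
    the components described in the pipeline steps above.
    """
    task_id: str
    environment: str        # e.g. a container image pinning the repo snapshot
    instruction: str        # natural-language problem statement given to the agent
    verifier: str           # path to the fail-to-pass test script
    reference_patch: str = ""  # oracle solution, used only for validation

task = Task(
    task_id="demo-001",
    environment="python:3.12-repo-snapshot",
    instruction="Fix the off-by-one error in pagination.",
    verifier="tests/run_verifier.sh",
)
```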

Quality Assurance

Multi-layer QC before entering the eval pool

  • Remove future branches to avoid contamination
  • Multi-round filtering during PR selection and task construction
  • Oracle pass: every task has a known-reachable reference solution
  • Claude Opus 4.6 Pass@4 rollout to calibrate difficulty and fairness

Key Findings (Mar 2026)

1. High task difficulty. Most models average 30-60 episodes per task, indicating these are real multi-step engineering challenges requiring extended reasoning and trial-and-error, not simple one-shot patches.
2. Trajectory efficiency varies widely. Gemini achieves strong accuracy with the lowest average episode count (~13), while GLM-5.1, MiniMax, Qwen, and GLM-5 require 3-5x more steps, revealing significant differences in how directly models solve problems.
3. Clear model tiers. Opus leads decisively; GLM-5.1 and GPT-5.4-High form a strong second tier; Qwen-3.6-Plus and Gemini compete closely behind; GLM-5, Kimi, and MiniMax cluster together, showing emerging but still maturing agent capabilities across providers.
4. Bug fix vs. non-bug fix signal. All models score higher on bug fix tasks (localized repairs, regression fixes) than on non-bug fix tasks (feature additions, API evolution), confirming that broader-scope changes remain significantly harder for current agents.
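The leaderboard reports Pass@1, and the quality-assurance stage uses a Pass@4 rollout. For reference, the standard unbiased pass@k estimator (computed from n rollouts of which c succeed) can be written as below; the helper name is ours, not part of the benchmark tooling:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples
    drawn without replacement from n rollouts (c correct) passes.
    """
    if n - c < k:
        # Fewer failures than samples drawn: a success is guaranteed.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Pass@1 reduces to the empirical success rate c / n.
print(pass_at_k(4, 1, 1))  # → 0.25
```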

The Rise and Fall of Static Benchmarks

If you can't measure it, you can't improve it. But if your measurement never changes, you stop improving and start memorizing.
On the lifecycle of AI benchmarks

SWE-bench (2023) took a practical approach to evaluating coding agents: derive tasks from real GitHub issues in popular Python repositories and use the existing test suites to judge correctness. The benchmark gained wide adoption, and model performance on it improved rapidly over the following years.

But scores went up faster than real-world usefulness.

As more teams optimized for SWE-bench, scores kept rising while practitioners reported little matching improvement in day-to-day coding assistance. The benchmark was being overfitted, not through explicit cheating, but through the gradual contamination of training data that inevitably absorbs widely-studied public benchmarks.

The Overfitting Cycle

A benchmark is released → the community studies it intensely → solutions, discussions, and patch patterns enter the public corpus → models train on that corpus → benchmark scores inflate → the benchmark loses its ability to discriminate real capability. This cycle is not a failure of any single team; it is a structural property of static benchmarks.

SWE-bench Verified attempted to address quality concerns. OpenAI partnered with the original authors to have human annotators review 500 tasks from the original 2,294, filtering out under-specified or ambiguous problems. This improved the benchmark’s reliability, but it was still a static, fixed set. OpenAI eventually announced they would no longer evaluate on SWE-bench Verified, citing concerns that even the curated subset was becoming saturated.

SWE-bench-Pro pushed further with harder tasks. Yet the underlying problem remains: any fixed set of tasks has an expiration date. The moment a benchmark is published, its clock starts ticking toward obsolescence.

SWE-bench Live introduced periodic updates with new tasks, taking a step toward liveness. However, it remains Python-focused, covers only bug fixes, and lacks rigorous instruction-test alignment verification, leaving room for ambiguous or under-specified tasks to slip through.


Why We Need a Living, Verified, and Broad Benchmark

The problems outlined above are not isolated issues. They reflect fundamental gaps in existing benchmarks: liveness (static task sets expire), verification (task quality is not systematically guaranteed), and breadth (limited to Python and bug fixes only). A next-generation software engineering benchmark must address all three. Monthly-SWEBench is built around six pillars spanning these dimensions.

🕑
Anti-Contamination by Design
Every month, ~100 new tasks from freshly merged PRs across fresh repositories. All repositories and PRs are newly selected and were not part of any previous release, making data contamination and benchmark-specific optimization structurally impossible.
🌐
Broad Coverage
Covering Python, Go, C++, Java, Rust, TypeScript, JavaScript and more across ~20 repositories per month, reflecting the real diversity of open-source ecosystems.
Sufficient Complexity
Tasks are challenging enough to differentiate model capabilities. Typical tasks span 2+ files and 30+ lines of meaningful changes, requiring genuine repository-level understanding.
🔗
Instruction-Test Alignment
Every tested behavior is explicitly described in the instruction. No hidden API names, parameter orders, or interface contracts that silently fail correct implementations.
🔬
Rigorous Test Quality
Tests run real code with multiple inputs and check actual outputs. No source-code text matching, no single-case tests bypassable by hardcoding.
🛠
Bug Fix + Feature Tasks
Unlike prior benchmarks that focus exclusively on bug fixes, we include both bug fixes and feature additions. We carefully curate feature PRs that extend existing interfaces, making fail-to-pass verification possible for a broader task spectrum.
Why "Live" Alone Is Not Enough

Freshness prevents overfitting, but fresh tasks can still be poorly constructed:

  • Instruction-test misalignment: the issue description says one thing, but the test suite silently requires specific API names or implementation details never mentioned
  • Insufficient tests: some tasks have only one or two test cases, trivially bypassable through hardcoded outputs
  • Dirty PRs: real-world PRs bundle unrelated changes, and tasks inherit this noise
  • Non-behavioral verification: some test scripts check source code text rather than running the code
  • Bug-fix-only coverage: feature additions, behavior changes, and API extensions are equally important in real development but are excluded by prior benchmarks because they are harder to validate with fail-to-pass testing

Monthly-SWEBench addresses all of these through systematic verification at every stage of the pipeline.


The following table summarizes where each benchmark stands across these dimensions:

| Benchmark | Tasks | Languages | Task Types | Updated | Test Quality | Instruction-Test Alignment |
|---|---|---|---|---|---|---|
| SWE-bench | 2,294 | Python only | Bug fix only | Static | Partial | No |
| SWE-bench Verified | 500 | Python only | Bug fix only | Static | Human review | Partial |
| SWE-bench-Pro | Varies | Python only | Bug fix only | Static | Yes | Partial |
| SWE-bench Live | Growing | Python-focused | Bug fix only | Periodic | Partial | Partial |
| Monthly-SWEBench | ~100/month | Multiple languages | Bug fix + Feature | Monthly | Rigorous pipeline | Strict alignment |

The Monthly-SWEBench Pipeline

Our benchmark is not hand-crafted. It is produced by a multi-stage pipeline that starts from the global stream of open-source activity and distills it into high-quality, verified tasks. Here is what happens at each stage.

Repository Discovery → PR Collection & Filtering → Task Construction & Quality Gates → Fail-to-Pass Verification → Alignment Verification → Monthly Release
1. Repository Discovery & Filtering

  • Scan GitHub for actively maintained repositories across Python, Go, C++, Java, Rust, TypeScript, and JavaScript
  • Require a minimum star count (≥500) and recent activity (pushed within the last 30 days)
  • Apply multi-dimensional filtering: test signal presence, release cadence (≥3 releases), code scale (10K-1M lines of code), commit activity
  • Each monthly batch targets ~20 high-quality, actively maintained repositories spanning multiple language ecosystems
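The discovery filters above can be sketched as a single predicate over repository metadata. The dictionary schema here is illustrative (in practice these fields would come from the GitHub API or a code-scale tool), not part of the benchmark's actual implementation:

```python
from datetime import datetime, timedelta

def is_candidate_repo(repo: dict, now: datetime) -> bool:
    """Apply the repository-discovery filters described above.

    Field names are illustrative stand-ins for GitHub metadata.
    """
    recently_pushed = (
        now - datetime.fromisoformat(repo["pushed_at"]) <= timedelta(days=30)
    )
    return (
        repo["stars"] >= 500                       # minimum star count
        and recently_pushed                        # active in last 30 days
        and repo["release_count"] >= 3             # release cadence
        and 10_000 <= repo["lines_of_code"] <= 1_000_000  # code scale
        and repo["has_tests"]                      # test signal present
    )
```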
2. PR Collection & Filtering

  • Collect PRs merged to the default branch within a specific time window (e.g., the past 30 days)
  • Must include test changes: PRs without any test file modifications are excluded, ensuring every candidate PR has a verifiable testing signal
  • Minimum complexity: at least 1 changed source file and 30+ lines of code additions (excluding pure documentation, configuration, or formatting changes)
  • Maximum scope: exclude overly large PRs (>100 changed files) that would be impractical as individual tasks
  • Exclude pure UI/cosmetic changes, dependency bumps, and CI configuration updates
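Analogously, the PR-level filters can be sketched as a predicate. The field names and the crude path-based heuristics for detecting test and documentation files are illustrative assumptions, not the benchmark's actual rules:

```python
DOC_SUFFIXES = (".md", ".rst", ".txt")  # illustrative doc/config heuristic

def is_candidate_pr(pr: dict) -> bool:
    """Apply the PR collection filters described above (sketch)."""
    touches_tests = any("test" in path for path in pr["changed_files"])
    source_files = [p for p in pr["changed_files"]
                    if not p.endswith(DOC_SUFFIXES)]
    return (
        pr["merged_to_default_branch"]      # merged within the time window
        and touches_tests                   # verifiable testing signal
        and len(source_files) >= 1          # at least one source file
        and pr["additions"] >= 30           # minimum complexity
        and len(pr["changed_files"]) <= 100 # maximum scope
    )
```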
3. Task Construction & Quality Gates

Each PR that passes the earlier filters is transformed into a standalone benchmark task in Harbor format. The task construction process enforces multiple quality gates:

📄 Instruction-Test Strict Alignment

Every behavior that the test suite verifies must be explicitly described in the task instruction. If the test checks a specific function name, parameter signature, return type, CLI flag, or file path, the instruction must specify it. No hidden requirements.

🚫 Irrelevant Functionality Filtering

Real-world PRs are often "dirty": a single PR may fix a bug, refactor adjacent code, update docs, and tweak CI, all in one commit. We extract the core functional change and filter out everything else. The resulting task focuses on one coherent problem.

🔬 Optimized Verification Scripts

Tests must be behavioral verification: run the actual code with real inputs and check the actual outputs. We prohibit source-code text matching (grep for function names), exit-code-only checks, and single-input tests that can be trivially hardcoded around. Each fail-to-pass test requires multiple distinct inputs and checks output content with sufficient information density.
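The contrast can be made concrete with a toy example. Here `parse_duration` and the file path are hypothetical; only the pattern matters:

```python
def text_match_check(repo_path: str) -> bool:
    # PROHIBITED pattern: passes if the function name merely appears
    # in the source text (even in a comment); the code never runs.
    return "parse_duration" in open(f"{repo_path}/timeutil.py").read()

def behavioral_check(parse_duration) -> bool:
    # REQUIRED pattern: run the code on multiple distinct inputs and
    # compare actual outputs, so a hardcoded answer cannot pass.
    cases = {"90s": 90, "2m": 120, "1h30m": 5400}
    return all(parse_duration(text) == seconds
               for text, seconds in cases.items())
```

A solution that hardcodes a single expected output passes the first check but not the second, which is exactly why multiple distinct inputs are required.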

Solver-Perspective Review

Every task is reviewed from the perspective of a solver who has never seen the original PR and can only read the instruction plus the unpatched repository. If a reasonable solver could produce a semantically correct implementation that differs in interface shape (e.g., different method naming) and the test would reject it, the task is disqualified until the instruction or test is fixed.

📚 Multi-dimensional Test Roles

Each task includes multiple test categories working together:

  • Regression tests: guard existing functionality that must not break
  • Fail-to-pass core tests: directly verify the primary fix or feature
  • Fail-to-pass edge tests: cover boundary conditions and special inputs
  • Fail-to-pass error handling tests: verify graceful handling of invalid inputs
4. Fail-to-Pass Verification

Every task must pass a two-phase verification to confirm that the task is solvable and that the tests correctly distinguish between unpatched and patched states.

Beyond Bug Fixes: Why Feature Tasks Are Hard but Valuable

Most existing SWE-bench variants focus exclusively on bug fix PRs: for bug fixes, the code already exists and simply produces wrong results, making fail-to-pass straightforward. But real-world software engineering is not just about fixing bugs. Feature additions, behavior changes, and API extensions are equally important and often harder.

We invest significant effort in carefully selecting feature-type PRs that modify or extend existing interfaces rather than creating entirely new modules. For example, adding a new parameter to an existing function, changing the behavior of an existing command, or extending an existing data format. In all these cases, the pre-patch code runs but behaves differently, making fail-to-pass verification possible. This careful curation allows Monthly-SWEBench to cover a fuller spectrum of real development work that prior benchmarks have left out.
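A toy example of the "extend an existing interface" pattern (all names invented): a feature PR adds a `sep` parameter to an existing function with a backward-compatible default. The pre-patch code runs, but a test exercising the new parameter fails on it and passes on the patched version, which is the fail-to-pass pattern:

```python
# Pre-patch: runs fine, but has no `sep` parameter.
def join_names_v1(names):
    return ", ".join(names)

# Post-patch: the feature PR adds `sep` with a compatible default.
def join_names_v2(names, sep=", "):
    return sep.join(names)

def fail_to_pass_test(join_names) -> bool:
    # Exercises the new behavior: raises TypeError on v1, passes on v2.
    try:
        return join_names(["a", "b"], sep=" | ") == "a | b"
    except TypeError:
        return False
```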

Two-Phase Verification

Each task is run twice from the same repository snapshot:

| Phase | Solve Script | Expected Reward | Regression Tests | Fail-to-Pass Tests |
|---|---|---|---|---|
| Phase 1 (unpatched) | Empty (no fix) | reward = 0 | All pass | All fail |
| Phase 2 (patched) | Reference solution | reward = 1 | All pass | All pass |

This guarantees that tests are neither too easy (passing without a fix) nor broken (failing even with the correct fix). In Phase 1, the fail-to-pass tests must fail because the code itself is buggy, not because infrastructure is missing.
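The two-phase check can be sketched as a single function. Here `run_tests(patch)` is an assumed callable that executes the task's test suite on a repo snapshot with `patch` applied and returns whether the regression and fail-to-pass tests passed; the signature is ours, not the benchmark's:

```python
def verify_task(run_tests, reference_patch) -> bool:
    """Two-phase verification sketch (assumed `run_tests` interface).

    A task is accepted only if tests fail for the right reason
    unpatched (Phase 1) and fully pass with the oracle (Phase 2).
    """
    # Phase 1: empty solve script. Regressions pass, F2P tests fail.
    reg_ok_1, f2p_ok_1 = run_tests(patch=None)
    # Phase 2: reference solution applied. Everything passes.
    reg_ok_2, f2p_ok_2 = run_tests(patch=reference_patch)
    return reg_ok_1 and not f2p_ok_1 and reg_ok_2 and f2p_ok_2
```

A verifier whose fail-to-pass tests already pass unpatched (too easy) or still fail patched (broken) is rejected by this check.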

5. Instruction-Test Alignment Verification

  • Direct review of instructions and test cases to ensure every tested behavior is explicitly described in the task instruction
  • Review of agent reasoning trajectories to distinguish genuine task difficulty from errors caused by ambiguous or under-specified instructions
  • The final task set is carefully curated to be discriminative across current agent models
6. Monthly Release

  • ~100 verified tasks are released each month along with updated leaderboard results
  • All tasks come from fresh PRs and fresh repositories that were not part of any previous release
  • Every released task has passed all quality gates, fail-to-pass verification, and instruction-test alignment verification

Selection funnel (per month):

  • Raw PRs: ~10,000
  • Merged to default branch: ~5,000
  • Has test changes: ~2,000
  • Meets complexity threshold: ~800
  • Passes quality gates: ~300
  • Fail-to-Pass verified: ~150
  • Alignment verified: ~100

Conclusion

The era of static, unverified software engineering benchmarks is ending. As language models grow more capable, three problems become critical: fixed task sets get overfitted, poorly constructed tasks give misleading signals, and narrow scope fails to reflect the diversity of real-world engineering. A benchmark must address all three.

Monthly-SWEBench addresses all three:

  • Live: monthly releases of ~100 tasks from freshly merged PRs across fresh repositories ensure models cannot memorize their way to high scores
  • Verified: every task passes strict instruction-test alignment review, behavioral test quality checks, fail-to-pass dual-phase verification, and solver-perspective auditing to ensure no hidden requirements
  • Broad: covering Python, Go, C++, Java, Rust, TypeScript, JavaScript and more across ~20 repositories, with both bug fixes and feature additions covering a fuller spectrum of real-world software engineering than any prior benchmark

Honest evaluation is a prerequisite for genuine progress. We need benchmarks that reflect where AI coding agents actually are, not where they were six months ago.

Citation

If you find Monthly-SWEBench useful in your research, please cite:

@misc{monthlyswebench2026,
    title={Monthly-SWEBench: A Living, Rigorously Verified Benchmark
           for Real-World Software Engineering},
    author={Haiyang Shen and Xinbo Xu and Xuanzhong Chen and Wendong Xu and Elvis Zhang and Kaiyuan Chen and Xiaobo Hu and Rui Wang and Yang Liu and Yixin Ren and Yuan Gong and Liang Chen and Kuan Li},
    year={2026},
    note={Blog post: https://unipat.ai/benchmarks/MonthlySWEBench}
}