Benchmarks

RoadmapBench

๐Ÿ† Claude-Opus-4.7: 39.1% ยท Software Engineering

Can coding agents implement real version upgrades? RoadmapBench evaluates frontier models on 115 multi-target tasks spanning 17 open-source repositories and 5 programming languages.

115 Tasks · 17 Repositories · 5 Languages · ~3.7K LOC / Oracle · 12 Models

Leaderboard

Below is the resolved rate across frontier models on the OpenHands scaffold. Resolved % measures the fraction of tasks where all targets pass (full reward); per-model completion score, average interaction turns, and token usage appear in the Full Results table below.

[Bar chart: Resolved Rate (%) by model, OpenHands scaffold, single trial per model]

Full Results

Metrics. For task t with K subtasks, each carrying a weight reflecting its complexity, the per-task reward is the weighted fraction of passed subtasks. Resolved Rate is the fraction of tasks with full reward (all subtasks pass). Completion Score averages the per-task reward across all tasks, crediting partial progress. We also report average agent turns and median output tokens per task. Single trial per model.
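
As a concrete illustration of these definitions, the sketch below computes the per-task reward, Resolved Rate, and Completion Score from subtask pass/fail results. The data structures and function names are illustrative only, not the benchmark's actual code.

```python
from dataclasses import dataclass

@dataclass
class SubtaskResult:
    weight: float   # complexity weight assigned to this subtask
    passed: bool    # did the subtask's tests pass?

def task_reward(subtasks: list[SubtaskResult]) -> float:
    """Weighted fraction of passed subtasks for one task."""
    total = sum(s.weight for s in subtasks)
    return sum(s.weight for s in subtasks if s.passed) / total

def resolved_rate(tasks: list[list[SubtaskResult]]) -> float:
    """Fraction of tasks with full reward (every subtask passes)."""
    return sum(task_reward(t) == 1.0 for t in tasks) / len(tasks)

def completion_score(tasks: list[list[SubtaskResult]]) -> float:
    """Mean per-task reward, crediting partial progress."""
    return sum(task_reward(t) for t in tasks) / len(tasks)
```
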
#   Model            Org          Scaffold   Resolved %  Completion Score  Turns  Tokens
1   Claude-Opus-4.7  Anthropic    OpenHands  39.1        0.692             140    44K
2   Claude-Opus-4.6  Anthropic    OpenHands  32.2        0.627             141    42K
3   GPT-5.4          OpenAI       OpenHands  29.6        0.497             171    93K
4   Gemini-3.1-Pro   Google       OpenHands  20.9        0.439             133    26K
5   DeepSeek-V4-Pro  DeepSeek     OpenHands  18.3        0.486             140    64K
5   GLM-5.1          Zhipu AI     OpenHands  18.3        0.453             163    38K
7   Kimi-K2.6        Moonshot AI  OpenHands  14.8        0.432             159    76K
8   Mimo-V2.5-Pro    Xiaomi       OpenHands  13.9        0.440             156    66K
9   Qwen3.6-Plus     Alibaba      OpenHands  12.2        0.424             150    47K
10  Kimi-K2.5        Moonshot AI  OpenHands  11.3        0.378             110    29K
11  MiniMax-M2.7     MiniMax      OpenHands  10.4        0.332             124    38K
12  Qwen3.5-397B     Alibaba      OpenHands  9.6         0.383             111    35K

How It Works

The agent receives a source-version repository snapshot and a multi-target instruction, then implements the specified functionality inside a pinned Docker environment. Evaluation is performed via weighted subtask-level tests against behaviors introduced in the target version.
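
As a rough sketch of this flow, the snippet below runs one target's adapted tests inside the task container via docker exec and records a pass/fail result per target. The container name, test paths, and pytest invocation are assumptions for illustration, not the benchmark's actual harness.

```python
import subprocess

def run_target_tests(container: str, test_path: str) -> bool:
    """Run one target's adapted tests inside the pinned task container."""
    proc = subprocess.run(
        ["docker", "exec", container, "python", "-m", "pytest", "-q", test_path],
        capture_output=True,
        text=True,
    )
    return proc.returncode == 0

def evaluate_task(container: str, target_tests: dict[str, str]) -> dict[str, bool]:
    """Verify each target independently; results feed the weighted per-task reward."""
    return {name: run_target_tests(container, path)
            for name, path in target_tests.items()}
```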

Overview of a RoadmapBench task. The agent operates in a Docker environment with the repository pinned at the source version. It receives a multi-target specification describing what to implement. Tests adapted from the upstream test suite verify each target independently.

Dataset Overview

RoadmapBench covers 17 open-source repositories across 5 programming languages, organized into 5 domain groups. Oracle patches range from under 300 to over 30,000 lines changed, with a median of approximately 3,700 lines and 51 files touched per task.

(a) Task count per repository (outer ring) grouped by domain (inner ring): ML & Data (36), Web & RPC (17), ORM & Val (25), Infra & Tool (23), UI & Ren (14). (b) Distribution of oracle patch size (LOC) per repository; the dashed line marks the median of 3,714 LOC.

Analysis

Step Efficiency & Compute Scaling

Frontier models (Claude-Opus-4.7, GPT-5.4) resolve roughly 30–39% of tasks with moderate step budgets. By contrast, models such as GLM-5.1 and Kimi-K2.6 consume comparable or larger budgets yet stay below a 20% resolved rate, indicating lower step efficiency. Most models plateau within 200 steps; Claude-Opus-4.7 is the main exception, continuing to improve beyond that threshold.
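
A minimal sketch of the step-budget ablation in panel (b) of the figure below, assuming the number of agent steps is known for each resolved task; the function and argument names are illustrative.

```python
def cumulative_resolved(steps_of_resolved_tasks: list[int],
                        n_tasks: int,
                        budgets: range) -> dict[int, float]:
    """Resolved rate achievable if every run were cut off at each step budget."""
    return {b: sum(s <= b for s in steps_of_resolved_tasks) / n_tasks
            for b in budgets}

# Example: 3 of 115 tasks resolved, at 40, 180, and 230 steps respectively.
curve = cumulative_resolved([40, 180, 230], n_tasks=115, budgets=range(50, 301, 50))
```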

Efficiency analysis. (a) Efficiency Landscape: resolved rate vs. average agent steps; dashed lines mark fleet averages (134 steps, 18%). (b) Step Budget Ablation: cumulative resolved rate as the step budget increases.

Tool Composition & Usage

Across all models, Explore, Edit, and Execute dominate tool usage. The key differentiator is allocation: Claude-4.7 achieves the highest resolved rate with the lowest Explore ratio (35%), indicating more efficient code localization. Weaker models spend over half their budget on exploration without translating it into successful edits.
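
Computing this composition is straightforward once each tool call is mapped to an intent category; the sketch below shows one way to do it per trajectory. The tool names and the mapping are hypothetical, and only the three categories named above are spelled out (the benchmark's full six-category taxonomy is not listed here).

```python
from collections import Counter

# Hypothetical tool-name to intent-category mapping; only Explore, Edit, and
# Execute are named in the text, so anything else falls back to "Other".
INTENT = {
    "view_file": "Explore",
    "search_code": "Explore",
    "edit_file": "Edit",
    "run_shell": "Execute",
}

def tool_composition(tool_calls: list[str]) -> dict[str, float]:
    """Fraction of tool calls per intent category for one trajectory."""
    counts = Counter(INTENT.get(name, "Other") for name in tool_calls)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}
```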

Tool usage analysis. (a) Tool composition per model (6 intent categories), sorted by resolved rate. (b) Per-task tool call distributions for three representative models spanning the full performance range.

Failure Mode Analysis

As model capability decreases, the dominant failure mode shifts from Implementation Error (incorrect behavior in buildable code) to Build Error and Missing Implementation (incomplete or non-compiling code). For Claude-Opus-4.6, 58% of failures are Implementation Errors, dominated by code defects and requirement misinterpretation. For weaker models, Build Error and Missing Implementation account for over 70% of failures.

Error distribution for three representative models. Inner ring: category proportions. Outer ring: sub-type breakdowns. The failure mode shifts from (a) implementation correctness to (c) code construction as capability decreases.