Benchmarks

EvoCode-Bench

Opus-4.8: 59.1 Dataset Score · Coding

Iteration · Multi-turn · Harbor Official Multi-Step + Terminus-2

EvoCode-Bench tests whether coding agents can keep a project working as user requests change. It evaluates agents on 26 stateful coding tasks and 227 rounds, with the same workspace and agent session preserved for 5-15 rounds.

26 tasks · 227 rounds · 5-15 rounds per task · 13 agents · score strategy: mean

Leaderboard

The current leaderboard is the 2026-06-25 update to the clean re-release, evaluated under the Harbor official multi-step format with one attempt per task. The main ranking mode is Dataset Score: each round is verified independently with a binary reward, each task scores passed_rounds / total_rounds, and the final score is the mean across the 26 tasks. Per-round, per-test-case detail for every task is available on the interactive results site. The clean evaluation trajectories are released with the Hugging Face dataset; each reached benchmark round includes an agent/trajectory.json file for auditing model behavior and Avg Rounds.

Dataset score — Harbor official multi-step — one attempt per task, mean of per-round rewards
# | Agent | Dataset Score | Case Score | Avg Rounds | Perfect Tasks | Reasoning
Leaderboard metric definitions.
  • Dataset Score is the headline score on a 0–100 scale. For each task, compute passed_rounds / total_rounds, then average that task score over all 26 tasks and multiply by 100. A round earns 1 only if every required test case passes; if the chain aborts before later rounds, those unreached rounds count as 0. This is similar in spirit to MT@1 because it is one attempt per task, but it is not the legacy paper MT@1: this table uses Harbor official multi-step full-chain runs and averages binary per-round rewards within each task first.
  • Case Score is the finer-grained companion on a 0–100 scale. For each round, compute passed_test_cases / total_test_cases; build failures and unreached rounds count as 0. Then average over the task's rounds and over the 26 tasks. It credits partial progress that Dataset Score hides; GPT-5.5, for example, scores 29.5 on binary round rewards but 81.8 on test cases.
  • Avg Rounds is the mean number of agent-tool interactions per reached benchmark round, computed from each reached round's agent/trajectory.json. It is not the number of benchmark rounds in a task. If a chain stops early, missing later benchmark rounds are not included in this average, while they still count as 0 for Dataset Score and Case Score.
  • Perfect Tasks counts tasks where every benchmark round passed. A value of 9 / 26 means 9 full tasks were completed end to end, or 34.6% full-task completion.
The interactive site exposes the per-round and per-test-case pass rate behind every cell.
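
The metric definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's own recomputation script: the `RoundResult` record and its field names are hypothetical, and the example data is invented for the demonstration.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RoundResult:
    """Outcome of one benchmark round (hypothetical record layout)."""
    passed_cases: int
    total_cases: int
    reached: bool = True  # False if the chain aborted before this round
    built: bool = True    # False if the build failed in this round


def round_reward(r: RoundResult) -> int:
    """Binary reward: 1 only if the round was reached and every test case passed."""
    ok = r.reached and r.built and r.total_cases > 0 and r.passed_cases == r.total_cases
    return int(ok)


def case_fraction(r: RoundResult) -> float:
    """Share of passed test cases; build failures and unreached rounds count as 0."""
    if not (r.reached and r.built) or r.total_cases == 0:
        return 0.0
    return r.passed_cases / r.total_cases


def dataset_score(tasks: dict) -> float:
    """Mean over tasks of passed_rounds / total_rounds, on a 0-100 scale."""
    return 100 * mean(mean(round_reward(r) for r in rounds) for rounds in tasks.values())


def case_score(tasks: dict) -> float:
    """Mean over tasks of the per-round test-case pass fraction, on a 0-100 scale."""
    return 100 * mean(mean(case_fraction(r) for r in rounds) for rounds in tasks.values())


def perfect_tasks(tasks: dict) -> int:
    """Number of tasks in which every round earned the binary reward."""
    return sum(all(round_reward(r) for r in rounds) for rounds in tasks.values())


# Invented example: task_a passes round 1, partially passes round 2, never reaches round 3.
tasks = {
    "task_a": [RoundResult(5, 5), RoundResult(4, 5), RoundResult(0, 5, reached=False)],
    "task_b": [RoundResult(3, 3), RoundResult(2, 2)],
}
print(round(dataset_score(tasks), 1))  # 66.7
print(round(case_score(tasks), 1))     # 80.0
print(perfect_tasks(tasks))            # 1
```

The example shows why the two scores can diverge: round 2 of `task_a` contributes 0 to Dataset Score but 0.8 to Case Score.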

A note on evaluation integrity (results updated 2026-06-25). While auditing trajectories from our first run we found a Harbor framework issue: in the default shared multi-step mode, the previous round's grading script (/tests/test.sh) and reward persist into the next round's agent workspace, so an agent can read the grader from inside its own run. We observed this on 12 of 26 tasks (22 task–model pairs), concentrated in a few models. We reported it upstream (issue #1960, fix PR #1961), patched our harness, and re-ran the whole benchmark clean — the numbers above are from the patched runs and replace the leaderboard published June 13–16. Earlier v1 numbers are historical only and are not shown in the current leaderboard. Per-(task, model, round) detail and the full changelog are in the GitHub repo.

How It Works

Each task is a sequence of user rounds run in one persistent Docker container with one continuous agent session. At round i, the agent receives a new instruction, edits the same workspace, and is verified by a cumulative test suite covering all still-active requirements through round i. Each round is scored with a binary 0/1 reward. A failed round does not short-circuit the chain: later rounds still run, and the trial score is the mean of the per-round rewards. Rounds left unreached because a chain aborts count as 0. Because every round is scored independently, the per-round pattern is itself diagnostic of which requirements an agent can and cannot satisfy.
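
One way to picture "all still-active requirements through round i" is as a filter over requirements that each have an introduction round and, optionally, a round in which a later instruction overrides them. The sketch below is illustrative only: the benchmark ships concrete per-round test suites, and the `Requirement` type, its fields, and the example names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Requirement:
    name: str
    introduced: int                # round in which the requirement first applies
    retired: Optional[int] = None  # round in which a later instruction overrides it


def active_requirements(reqs: List[Requirement], round_index: int) -> List[str]:
    """Names of the requirements a cumulative suite would check at `round_index`."""
    return [
        r.name
        for r in reqs
        if r.introduced <= round_index and (r.retired is None or round_index < r.retired)
    ]


# Invented example: round 4 replaces the JSON default with a YAML default.
reqs = [
    Requirement("csv_export", introduced=1),
    Requirement("json_default", introduced=2, retired=4),
    Requirement("yaml_default", introduced=4),
]
print(active_requirements(reqs, 3))  # ['csv_export', 'json_default']
print(active_requirements(reqs, 4))  # ['csv_export', 'yaml_default']
```

An agent that keeps honoring `json_default` at round 4 would fail the cumulative check even if its round-4 change works in isolation.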

Evaluation Scaffold

EvoCode-Bench runs on the Harbor official multi-step format — the same framework used by Terminal-Bench 2.0 — using its native [[steps]] sequencing, a single persistent workspace, a per-step verifier, and multi_step_reward_strategy = "mean" aggregation. No fork is required to run a full task. Single-round fast-forward (solving a target round from a reference-completed prior state) is provided by our fork harbor-official-fast-forward.

Reproduce. Download the task archives and clean evaluation trajectories from Hugging Face, then run harbor run <task> --agent oracle | nop | <model>. The oracle reference solutions score 1.0 on every round; empty submissions score 0. See github.com/UniPat-AI/EvoCodeBench for the evaluation scripts and leaderboard recomputation command.
⚠ Legacy — Paper Evaluation. Everything below is the original paper evaluation, run on the legacy harbor_multiturn fork with the MT@4 / SR / Comp metrics (best-of-four, fail-stop). Those numbers use a different runner and a different scoring rule than the official multi-step dataset score above and are not directly comparable. They are kept here for reproducibility of the paper.

Leaderboard (Legacy · MT@4)

The main score is MT@4: a best-of-four fail-stop multi-round score. A round receives credit only if at least one attempt reaches that round with a workspace that still satisfies all active cumulative tests.

| Agent | Setting | MT@4 | SR | Comp |
|---|---|---|---|---|
| Opus 4.7 | high reasoning effort | 54.0 | 76.7 | 42.3% |
| GPT 5.5 | high reasoning effort | 52.4 | 74.4 | 38.5% |
| Opus 4.6 | default configured reasoning | 44.0 | 78.9 | 34.6% |
| GLM 5.1 | thinking enabled | 36.2 | 63.9 | 15.4% |
| Kimi K2.6 | thinking enabled | 31.9 | 59.0 | 23.1% |
| DS V4 Pro | high reasoning effort | 30.6 | 56.4 | 19.2% |
| Qwen 3.6 | thinking enabled | 29.4 | 57.3 | 15.4% |
| MiMo V2.5 | high reasoning effort | 17.3 | 7.9 | 11.5% |
| Gemini 3.1 | high reasoning effort | 13.7 | 46.7 | 11.5% |
| DS V4 Flash | high reasoning effort | 9.4 | 46.3 | 0.0% |
| Qwen 3.5 | thinking enabled | 4.6 | 44.1 | 0.0% |
| MiniMax M2.7 | reasoning split enabled | 3.7 | 30.0 | 0.0% |
| Doubao 2.0 | high reasoning effort | 1.9 | 23.8 | 0.0% |

MT@4 score · Harbor + Terminus-2 scaffold · four attempts per multi-round task

Full Results (Legacy · MT@4)

Metrics. MT@4 is the four-attempt fail-stop multi-round score. SR is single-round pass rate after earlier rounds have been completed by the reference solution. Gap is SR minus MT@4, showing how much isolated round-solving overstates persistent execution. Comp is full-task completion through the final round in at least one attempt. Model sublabels show the evaluation setting from the paper appendix.
| # | Agent | Provider | Setting | MT@4 | SR | Gap | Comp (%) | Avg Turns | Output Tok. |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Claude-Opus-4.7-High | Anthropic | high reasoning effort | 54.0 | 76.7 | +22.7 | 42.3 | 590.6 | 50.0K |
| 2 | GPT-5.5-High | OpenAI | high reasoning effort | 52.4 | 74.4 | +22.0 | 38.5 | 456.3 | 74.1K |
| 3 | Claude-Opus-4.6 | Anthropic | default configured reasoning | 44.0 | 78.9 | +34.9 | 34.6 | 747.5 | 734.2K |
| 4 | GLM-5.1 | Zhipu AI | thinking enabled | 36.2 | 63.9 | +27.7 | 15.4 | 859.8 | 104.2K |
| 5 | Kimi-K2.6 | Moonshot AI | thinking enabled | 31.9 | 59.0 | +27.1 | 23.1 | 1155.5 | 92.5K |
| 6 | DeepSeek-V4-Pro | DeepSeek | high reasoning effort | 30.6 | 56.4 | +25.8 | 19.2 | 1134.8 | 168.8K |
| 7 | Qwen3.6-Plus | Alibaba | thinking enabled | 29.4 | 57.3 | +27.9 | 15.4 | 629.3 | 103.1K |
| 8 | Xiaomi-MiMo-V2.5-Pro | Xiaomi | high reasoning effort | 17.3 | 7.9 | -9.4 | 11.5 | 754.8 | 125.7K |
| 9 | Gemini-3.1-Pro-Preview | Google | high reasoning effort | 13.7 | 46.7 | +33.0 | 11.5 | 261.3 | 72.7K |
| 10 | DeepSeek-V4-Flash | DeepSeek | high reasoning effort | 9.4 | 46.3 | +36.9 | 0.0 | 1104.7 | 148.7K |
| 11 | Qwen3.5-397B-A17B | Alibaba | thinking enabled | 4.6 | 44.1 | +39.5 | 0.0 | 587.8 | 53.0K |
| 12 | MiniMax-M2.7 | MiniMax | reasoning split enabled | 3.7 | 30.0 | +26.3 | 0.0 | 600.4 | 59.2K |
| 13 | Doubao-Seed-2.0-Pro | ByteDance | high reasoning effort | 1.9 | 23.8 | +21.9 | 0.0 | 211.1 | 18.5K |

Main results. SR measures isolated single-round success from a reference-fast-forwarded workspace; MT@4 measures whether the agent's own workspace keeps passing as rounds accumulate.

How It Works (Legacy · MT@4)

Each task is a sequence of user rounds. At round i, the agent receives a new instruction, edits the same Docker workspace, and is evaluated by cumulative tests for all still-active requirements through round i. If the cumulative verifier fails, the multi-turn attempt stops and later rounds receive zero credit.

Overview of EvoCode-Bench. Task construction pairs every round with an instruction, reference solution, and cumulative tests. MT@4 keeps one container and one agent session across rounds; SR fast-forwards to the reference state before the target round.
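
The fail-stop, best-of-four rule can be sketched as follows. A round earns credit only if some attempt passes every round up to and including it, so the credited rounds for a task equal the longest passing prefix across attempts. Normalizing by the task's round count on a 0-100 scale is an assumption made here for illustration; the function names are hypothetical and this is not the harbor_multiturn implementation.

```python
from typing import List, Sequence


def fail_stop_depth(round_passed: Sequence[bool]) -> int:
    """Number of leading rounds passed before the first failure."""
    depth = 0
    for ok in round_passed:
        if not ok:
            break  # fail-stop: nothing after the first failure earns credit
        depth += 1
    return depth


def mt_at_k(attempts: List[Sequence[bool]], total_rounds: int) -> float:
    """Best-of-k fail-stop credit for one task, on a 0-100 scale (assumed normalization)."""
    best = max((fail_stop_depth(a) for a in attempts), default=0)
    return 100 * best / total_rounds


def completed(attempts: List[Sequence[bool]], total_rounds: int) -> bool:
    """True if at least one attempt passes every round through the final one."""
    return any(fail_stop_depth(a) == total_rounds for a in attempts)


# Invented example: four attempts on a 5-round task.
attempts = [
    [True, True, False],
    [True, False],
    [True, True, True, False],
    [False],
]
print(mt_at_k(attempts, total_rounds=5))    # 60.0
print(completed(attempts, total_rounds=5))  # False
```

Here the third attempt reaches the deepest passing prefix (three rounds), so the task earns credit for rounds 1 through 3 only.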

Evaluation Scaffold (Legacy · harbor_multiturn)

The paper evaluation used harbor_multiturn, the multi-turn Harbor fork released at github.com/UniPat-AI/harbor_multiturn. The scaffold adds persistent Docker workspaces, continuous agent sessions, round-boundary verifier swaps, reference fast-forwarding for SR, snapshot/resume lineage, and fail-stop reward aggregation.

Where it fits. A stock Harbor task runs one instruction followed by one verifier. EvoCode-Bench needs one task to run as a sequence of rounds. harbor_multiturn handles that protocol: it delivers each round instruction, preserves the workspace, swaps in cumulative tests for the current round, records verifier/multiround_results.json, and writes the aggregate reward used by MT@4.

Dataset Overview

The paper groups EvoCode-Bench tasks along two axes: interaction style, or how users communicate across rounds, and engineering activity, or what kind of code change the round asks for. Each cell reports tasks / rounds.

| Activity | Capability Measured | Explorative | Contractual | Document-Driven | Total |
|---|---|---|---|---|---|
| Construction | Building a system incrementally while preserving earlier features and interfaces. | 9 / 80 | 3 / 37 | 1 / 7 | 13 / 124 |
| Spec Evolution | Updating an implementation after a later round overturns a core assumption. | 1 / 8 | 1 / 7 | 1 / 7 | 3 / 22 |
| Review | Improving non-functional properties such as performance, security, and observability without regression. | 3 / 21 | 1 / 7 | 1 / 9 | 5 / 37 |
| Migration | Moving a legacy system to a new implementation style while keeping backward compatibility. | 3 / 29 | 1 / 7 | 1 / 8 | 5 / 44 |
| Total | | 16 / 138 | 6 / 58 | 4 / 31 | 26 / 227 |

Round Length. Tasks span 5-15 rounds of evolving state, with 227 cumulative verification points. The longer tasks expose late-stage state accumulation and regression risk.
Requirement Pressure. Rounds are annotated with non-exclusive change types: 198 extensions, 69 corrections, and 42 conflicts. In total, 110 rounds carry at least one correction or conflict.
Technical Coverage. Tasks cover MLOps, data engineering, systems programming, scientific computing, testing and automation, infrastructure, and security.
Behavioral Verification. Tests check observable behavior rather than the reference implementation path, so different valid internal designs can pass the same cumulative contract.
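
As a consistency check on the requirement-pressure counts, inclusion-exclusion gives the overlap between correction rounds and conflict rounds. This assumes each figure counts distinct rounds carrying that label, which the text suggests but does not state outright; the set names C (corrections) and K (conflicts) are notation introduced here.

```latex
\begin{aligned}
|C \cup K| &= |C| + |K| - |C \cap K| \\
110 &= 69 + 42 - |C \cap K| \\
|C \cap K| &= 1
\end{aligned}
```

Under that reading, one round is tagged as both a correction and a conflict, and the label total of 198 + 69 + 42 = 309 exceeds the 227 rounds because the labels are non-exclusive.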

Analysis (Legacy · Paper)

Single-Round Skill Does Not Imply Persistent Reliability

SR exceeds MT@4 by roughly 22 to 40 points for 12 of the 13 agents; the exception is Xiaomi-MiMo-V2.5-Pro, whose SR (7.9) falls below its MT@4 (17.3). Claude-Opus-4.6 has the highest SR at 78.9 but ranks third on persistent execution at 44.0 MT@4. The reranking shows that solving an isolated round from a clean reference state is different from living with one's own earlier edits.

SR vs. MT@4 by round. SR stays comparatively stable while MT@4 falls with depth, showing the cost of accumulated workspace state.

Workspace State Drives a Large Share of Failures

A controlled comparison shows that 57.0% of failed MT@4 round records are solvable under SR from a reference-completed state. The state penalty grows with depth: only 15.0% of round-1 MT failures are SR-solvable, rising above 80% beyond round 12.

Workspace state penalty. Many multi-turn failures are not isolated instruction failures. They happen because the agent's accumulated workspace no longer satisfies the active contract.
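
The controlled comparison amounts to a per-depth ratio: among round records that failed under MT, what share passes under SR from the reference-completed state. The sketch below shows that computation on invented data; the tuple layout and function name are hypothetical, and the paper's exact record format is not given here.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def sr_solvable_share(records: Iterable[Tuple[int, bool, bool]]) -> Dict[int, float]:
    """Share of failed MT round records that pass under SR, keyed by round index.

    Each record is (round_index, mt_passed, sr_passed).
    """
    failed = defaultdict(int)
    solvable = defaultdict(int)
    for round_index, mt_passed, sr_passed in records:
        if mt_passed:
            continue  # only failed multi-turn records enter the comparison
        failed[round_index] += 1
        if sr_passed:
            solvable[round_index] += 1
    return {r: solvable[r] / failed[r] for r in sorted(failed)}


# Invented example: few early failures are SR-solvable, more of the later ones are.
records = [
    (1, False, False),
    (1, False, False),
    (1, False, True),
    (1, True, True),
    (2, False, True),
    (2, False, False),
]
shares = sr_solvable_share(records)
print(shares[2])  # 0.5
```

A rising share with depth would indicate that later failures come from the agent's accumulated workspace rather than from the round's instruction alone.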

Failure Patterns Are Tier-Dependent

Missed active requirements dominate every tier, but the secondary modes differ. Lower-tier agents fail early: 57.4% of lower-tier trial failures occur in the first 20% of rounds. Stronger agents survive long enough for stale behavior, regressions, and conflict-resolution errors to appear.

First failure distribution. Lower-tier failures concentrate early; top-tier failures are spread deeper into trajectories.
Failure modes. Regressions and stale behavior become visible only after agents build enough working functionality for later rounds to break or replace it.

Failing Rounds Consume More Tokens

At the same round index, failed trials usually produce more output tokens than passing trials. Across rounds 1-9, failing trials emit 1.1x to 3.1x as many generated tokens as passing trials at the same depth. This is a diagnostic association rather than a causal claim.

Within-round token usage. Higher effort often accompanies harder or degraded workspace states, but more output does not guarantee recovery.