Blog

Terminal-X: Evaluate Coding Agents across Depth, Iteration, and Evolution in Terminal Environments

2026-05-02

Terminal-X tests whether coding agents can solve calibrated terminal tasks in one shot, preserve correctness across multi-turn requirement changes, and implement real version-upgrade roadmaps.

UniPat AI Coding Team

contact@unipat.ai

Real software engineering usually happens in the terminal: in shells, on real filesystems, and through real build pipelines and test suites. We believe the most credible way to evaluate LLM coding agents is to put them in terminal environments where every claim of correctness has to survive execution: code must compile, programs must run, tests must pass, and adversarial inputs must be handled. Static string matching, multiple-choice questions, and isolated code-completion benchmarks all leave room for shortcuts that real terminal tasks do not. Building on this paradigm, we introduce three complementary datasets targeting three core capability dimensions. DeepTerminalBench tests deep engineering ability with 50 calibrated tasks across 20+ domains. EvoCodeBench tests multi-turn consistency with 26 tasks that evolve over 5–15 rounds of changing requirements. RoadmapBench tests long-horizon feature development through real version-upgrade tasks, covering 115 tasks from 17 repositories across 5 programming languages. Together, these datasets evaluate coding agents along the dimensions that matter most in real-world software engineering: depth, iteration, and evolution.
50
DeepTerminalBench Tasks
20+
Engineering Domains
12
Workload Types
~42
Avg Agent Steps per Attempt (Claude-Opus-4.6)
26
Multi-Turn Tasks
227
Evaluated Rounds
5-15
Rounds per Task (Standard)
4
Engineering Activities
3
Interaction Styles
115
Version Upgrade Tasks
17
Open-Source Repositories
5
Programming Languages
~3,700
Median Oracle Patch (LOC)

Main Results

Execution-Based Results Across the Three Datasets

The three result blocks correspond to three evaluation settings. DeepTerminalBench reports how often a model completes a calibrated single-shot terminal task. EvoCodeBench reports whether correctness carries across changing requirements: its table includes both persistent multi-turn success (MT@4) and isolated single-round success (SR) so the state-carrying gap is explicit. RoadmapBench reports how much of a real version-upgrade roadmap a model completes under phase-weighted tests.

🧪DeepTerminalBench Depth · Single-shot 50 tasks · pass@1 · terminus-2 · calibrated by Claude-Opus-4.6 Pass@4
Metric. Pass@1 is the single-attempt task success rate: each model receives one agent run per task, and a task is counted as passed only if the hidden tests/test.sh verification returns reward = 1.0. Avg Turns reports how many agent-model exchanges the run uses on average. Output Tok. (K) reports how many tokens the model generates during those exchanges, in thousands. Calibration. All 50 tasks are validated under Claude-Opus-4.6 with four attempts — every task is solved by at least one of the four Pass@4 attempts, ensuring it is reachable for a frontier-tier model rather than impossible by construction; the curated subset specifically retains tasks where strong models still fail under single-attempt Pass@1, preserving headroom for the leaderboard.
# Model Pass@1 Pass / Fail Avg Turns Output Tok. (K)
🥇1 Claude-Opus-4.7 34.00% 17/33 14.7 11.4
🥈2 GPT-5.5-high 32.00% 16/34 9.5 13.3
🥉3 DeepSeek-V4-Pro 32.00% 16/34 30.4 46.0
4 Kimi-K2.6 32.00% 16/34 29.6 42.0
5 Qwen-3.6-Plus 32.00% 16/34 20.9 43.3
6 GLM-5.1 30.00% 15/35 24.2 32.1
7 Qwen-3.5-397B-A17B 24.00% 12/38 26.1 21.2
8 Gemini-3.1-Pro 16.00% 8/42 10.5 40.1
9 MiniMax-M2.7 10.00% 5/45 36.8 34.6
Per-task pass/fail map
A 50-task breakdown showing how many of the 9 models solved each task, ordered by difficulty; the full per-task listing appears in Section 11.1.
🔄EvoCodeBench Iteration · Multi-turn 26 tasks · 227 rounds · multi-turn + single-round · terminus-2
Metrics. MT@4 is the four-attempt fail-stop multi-turn score: for each task and round, a pass is credited if any of four persistent attempts reaches that round successfully, then passed rounds are averaged. SR is the single-round pass rate from a reference-fast-forwarded workspace. Comp is the fraction of tasks completed through the final round. Avg. Turns reports the average number of agent-model exchanges needed to execute a full multi-turn task; partial trajectories are scaled to the task's full round horizon. Output Tok. (K) reports provider-reported generated-token usage across those exchanges, in thousands; hidden reasoning tokens are included only when the provider reports them as completion/output tokens.
# Model MT@4 SR Comp Avg. Turns Output Tok. (K)
🥇1 Claude-Opus-4.7 54.0 76.7 42.3 590.6 50.0
🥈2 GPT-5.5-High 52.4 74.4 38.5 456.3 74.1
🥉3 Claude-Opus-4.6 44.0 78.9 34.6 747.5 734.2
4 GLM-5.1 36.2 63.9 15.4 859.8 104.2
5 Kimi-K2.6 31.9 59.0 23.1 1155.5 92.5
6 DeepSeek-V4-Pro 30.6 56.4 19.2 1134.8 168.8
7 Qwen3.6-Plus 29.4 57.3 15.4 629.3 103.1
8 Xiaomi-MiMo-V2.5-Pro 17.3 7.9 11.5 754.8 125.7
9 Gemini-3.1-Pro-Preview 13.7 46.7 11.5 261.3 72.7
10 DeepSeek-V4-Flash 9.4 46.3 0.0 1104.7 148.7
11 Qwen3.5-397B-A17B 4.6 44.1 0.0 587.8 53.0
12 MiniMax-M2.7 3.7 30.0 0.0 600.4 59.2
The SR-to-MT@4 gap measures state-carrying failure: a model may solve a round from a clean reference state but fail when the same round depends on its own earlier edits.
🛠️RoadmapBench Evolution · Version-upgrade 115 tasks · 17 repos · 5 languages · openhands + terminus-2
Metrics. Resolved % is the fraction of tasks where the agent achieves full reward (all phases pass). Completion is the average per-task phase-pass fraction (0–1 scale), capturing partial progress. Output Tok. (K) reports median generated tokens per task. Results are single-trial per model. Two scaffoldings are reported: OpenHands (open-source multi-tool agent) and Terminus 2 (standardized harness).

OpenHands

# Model Resolved % Completion Avg. Turns Output Tok. (K)
🥇1 Claude-Opus-4.7 39.5% 0.702 110.3 42
🥈2 Claude-Opus-4.6 32.2% 0.627 140.7 42
🥉3 GPT-5.4 29.6% 0.497 170.7 93
4 Gemini-3.1-Pro 20.9% 0.439 133.4 26
5 DeepSeek-V4-Pro 18.3% 0.486 140.2 64
6 GLM-5.1 18.3% 0.453 163.2 38
7 Kimi-K2.6 15.6% 0.456 140.0 70
8 Qwen3.6-Plus 12.2% 0.424 150.3 47
9 Kimi-K2.5 11.3% 0.378 110.3 29
10 MiniMax-M2.7 10.4% 0.332 123.5 38

1. Introduction

We believe this paradigm—real shells, real filesystems, real build and test tooling, and execution-verified rewards—provides the right substrate for evaluating LLM coding agents. We build on the Harbor task format because it offers a standardized, reproducible way to place agents in executable engineering environments where correctness can be verified by running code rather than judging surface-level outputs. On top of this substrate, we construct three datasets along orthogonal capability dimensions, together covering the full spectrum of real-world coding work:

Direction 1 — Depth

DeepTerminalBench

Can an agent solve a complex engineering task in one shot? We independently develop a large pool of Terminal-Bench 2.0–compatible tasks, fully conforming to the Harbor task specification. From this pool we curate 50 tasks — each a self-contained engineering task inside a Docker container, calibrated so that the strongest model's four-attempt pass rate (Pass@4) falls between 0.25 and 0.75: hard enough that frontier models fail often, yet demonstrably solvable.

Direction 2 — Iteration

EvoCodeBench

Can an agent sustain quality across many rounds of evolving requirements? We introduce the first systematic multi-turn coding evaluation dataset: 26 tasks comprising 227 rounds of evolving requirements, covering 4 engineering activities crossed with 3 interaction styles. Tasks feature cumulative state, requirement evolution, specification conflicts, and backward-compatibility constraints — mirroring how coding agents are actually used.

Direction 3 — Evolution

RoadmapBench

Can an agent implement substantial new functionality across a real library version upgrade? We present the first dataset targeting long-horizon feature implementation from release roadmaps: 115 tasks across 17 repositories in 5 programming languages, each requiring an agent to transform a repository pinned at an earlier release toward the target version behaviors using only a roadmap-style instruction.

The three datasets use different task formats, but the admission criteria are shared:

  • Quality: Every task is grounded in real-world engineering scenarios, oracle-verified (reward = 1.0), and reviewed by domain engineering experts for technical accuracy and test fairness.
  • Diversity: DeepTerminalBench tasks span 20+ domains and 12 workload types; multi-turn tasks span 4 engineering activities and 3 interaction styles; version upgrade tasks span 17 repositories in 5 programming languages — ensuring no critical skill is left untested.
  • Scalability: Task designs draw from continuously growing real-world sources (GitHub, Stack Overflow, CVE databases, RFCs, production postmortems, CS textbooks, and open-source release cycles). New domains, languages, and difficulty tiers can be added as the engineering landscape evolves.

2. Three Datasets at a Glance

This table fixes the comparison axes used throughout the article: evaluation axis, task unit, starting state, scoring rule, and difficulty source.

Dimension DeepTerminalBench EvoCodeBench RoadmapBench
Evaluation axis Depth Iteration Evolution
Central question Can the agent solve a complex engineering task in one shot? Can the agent sustain quality across many rounds of evolving requirements? Can the agent implement substantial new functionality across a real version upgrade?
Task count 50 curated tasks 26 multi-turn tasks 115 version-upgrade tasks
Unit of work 1 instruction → 1 verified delivery 5–15 rounds, cumulative tests 3–12 phases, weighted by importance
Starting state Pre-configured Docker env (median 10 files, up to 50+) Empty / skeleton / legacy system (varies by task) Real repository snapshot pinned at the earlier release
Scoring Binary (pass/fail) Mean of per-round binary rewards, fail-stop Weighted sum over phases
Coverage axes Domain × Workload (20+ × 12) Interaction Style × Engineering Activity (3 × 4) Language × Repository (5 × 17)
Harbor extension needed None — uses the stock single-shot harness Persistent environment and round boundary protocol Pinned baseline image, phase-level test orchestration
Primary source of difficulty Multi-file diagnosis, deep domain knowledge, simultaneous constraints Cumulative state, requirement evolution, backward compatibility Long-horizon feature synthesis under API ambiguity
How to choose the dataset

Use DeepTerminalBench to measure single-session engineering competence on a dense self-contained task. Use EvoCodeBench to measure how an agent handles changing requirements, specification conflicts, and regression risk across a working session. Use RoadmapBench to measure whether an agent can synthesize substantial new code against a roadmap in an evolving repository.


3. Shared Foundation: The Harbor Task Spec

All three datasets use the Harbor task specification introduced by Terminal-Bench 2.0. This makes the benchmark runnable by Harbor-compatible scaffolds such as Claude Code, OpenHands, Codex CLI, and Terminus-2, while keeping the new contributions concentrated in task semantics: multi-turn state for EvoCodeBench and version-upgrade structure for RoadmapBench.

A stock single-shot Harbor task contains five components:

  • instruction.md: What the agent must do — natural-language specification with requirements, constraints, expected outputs.
  • environment/: Where it works — Dockerfile with all dependencies, plus baseline code/data/fixtures.
  • tests/: How correctness is judged — test suite plus a shell harness producing a binary reward.
  • solution/: The ground truth — reference oracle implementation achieving reward = 1.0.
  • task.toml: Machine-readable metadata — domain, difficulty, resource limits, timeouts.
Black-Box Evaluation

In all three datasets, the agent receives only instruction.md and the environment. It cannot access tests/ or solution/. The test script produces a reward signal (binary in DeepTerminalBench, per-round in EvoCodeBench, weighted in RoadmapBench) that is never revealed during the run.

Each dataset extends this stock format in a different way. The next section walks through the three variants.


4. Task Format: Three Variants

4.1 DeepTerminalBench: Flat Structure

DeepTerminalBench uses the stock Harbor task layout unchanged:

task/
├── task.toml
├── instruction.md
├── environment/
│   └── Dockerfile               # runtime + pre-configured files
├── solution/
│   └── solve.sh                 # reference solution
└── tests/
    └── test.sh                  # verification script

The execution model is simple: Harbor builds the Docker environment, delivers instruction.md to the agent, the agent produces code, and tests/test.sh checks correctness. One instruction, one execution, one verification.

4.2 Multi-Turn (EvoCodeBench): Rounds with Cumulative Tests

For multi-turn tasks, we introduce round directories. Each round carries its own instruction, solution, and tests, while the environment and top-level metadata remain shared:

task/
├── task.toml                   # metadata (round count, change types)
├── instruction.md             # top-level task description
├── environment/
│   └── Dockerfile               # shared across all rounds
├── round_1/
│   ├── instruction.md
│   ├── solution/solve.sh       # incremental: this round's delta only
│   └── tests/test.sh           # cumulative: verifies all still-valid behavior up to here
├── round_2/ ...
└── round_N/

Three things make this work. First, solutions are incremental — each round's solve.sh applies only that round's delta. Second, tests are cumulative — round N's tests re-verify everything still valid from rounds 1 through N, so regressions caused by later changes are caught immediately. Third, tests verify behavioral contracts rather than implementation details: instructions describe what the system should do, tests check the system's external behavior, and round N's tests cannot assume the agent chose the same code structure as the reference solution in round N-1. Together these principles allow different agents to reach equivalent behavioral goals through divergent implementation paths without breaking the evaluation.
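To make the behavioral-contract idea concrete, here is a minimal sketch of what a cumulative round check might look like. Everything in it (the CLI path, subcommands, flags, and file names) is hypothetical; the point is the pattern: the test drives the delivered program through its external interface and asserts on observable behavior, never on the agent's code structure.

# Illustrative cumulative check for round 3 of a hypothetical CLI task.
# It re-verifies a round-1 contract (deterministic output) and the round-3 addition.
import hashlib
import subprocess

def run(args):
    # Exercise the delivered tool only through its public command-line interface.
    return subprocess.run(["/app/pipeline", *args], capture_output=True, text=True)

def sha256(path):
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

# Round-1 contract, still in force: repeated runs produce bit-identical output.
run(["process", "--in", "/data/input.csv", "--out", "/tmp/a.csv"])
run(["process", "--in", "/data/input.csv", "--out", "/tmp/b.csv"])
assert sha256("/tmp/a.csv") == sha256("/tmp/b.csv"), "output is not deterministic"

# Round-3 contract: the new validate subcommand must reject malformed input via exit code.
bad = run(["validate", "--in", "/data/malformed.csv"])
assert bad.returncode != 0, "validation contract violated"

print("round 3 cumulative checks passed")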

4.3 Version Upgrade (RoadmapBench): Repository Snapshot + Roadmap Phases

RoadmapBench keeps the single-shot execution model but reshapes what each component contains. The environment is a real-world repository snapshot pinned at an earlier release (with commits and tags after that point stripped to prevent information leakage), and the instruction is a multi-phase roadmap describing behaviors introduced in the target version:

Phase N: <Feature Name>
├── Background — why this functionality is needed
├── Behavioral specification
│   ├── API signatures, parameter semantics, default values
│   ├── Expected error behaviors and edge cases
│   └── Export paths and import locations
└── Integration context — how it fits the existing codebase

Like EvoCodeBench's principles, the roadmap specifies WHAT to build but never HOW — no algorithm names, implementation steps, or code snippets. Tests are organized as phase-level files (test_01_*.py, test_02_*.py, …) adapted from the official upstream test suite, and a runner computes a weighted reward across phases.

🔒 Information Boundaries

The agent may freely inspect and modify the repository, but has no access to the target version source code, the test files, or the oracle patch. The repository identity and version numbers are withheld from the instruction to prevent shortcut-based lookup.

4.4 Side-by-Side Structural Comparison

DeepTerminalBench EvoCodeBench RoadmapBench
Instruction Single instruction.md One instruction.md per round directory (plus top-level summary) Multi-phase roadmap in one instruction.md
Environment Pre-configured Dockerfile Shared Dockerfile across all rounds Pinned earlier-release repository snapshot
Tests One test.sh producing binary reward Cumulative test.sh per round Phase-level test_NN_*.py + weighted runner
Solution One solve.sh Per-round incremental solve.sh Oracle patch derived from real cross-version diff
Session model One container, one agent session, one submission One container and one agent session spanning all rounds One container, one agent session, one submission

5. Sources & Coverage

5.1 A Shared Pool of Real-World Problems

All three datasets draw from the same kinds of real-world engineering artifacts — the scenarios that practicing engineers actually encounter. Grounding tasks in these sources keeps the evaluation tied to observable engineering work and gives the datasets a renewable source of future tasks.

GitHub Repositories and Issue Trackers

Real project structures, build configurations, and bug reports. B-tree corruption tasks are modeled on LevelDB/BoltDB/SQLite; CI recovery tasks draw from real .github/workflows. RoadmapBench goes one step further and uses actual cross-version diffs from 17 open-source repositories as ground truth.

Stack Overflow and Developer Forums

Recurring pain points — segfaults in multithreaded C, cron timezone bugs, CommonJS-to-ESM migrations — signal what engineers actually struggle with.

CVE Databases and Security Advisories

Published vulnerabilities (path traversal, credential exposure, ReDoS, timing-unsafe comparisons) provide ready-made attack surfaces. Security hardening tasks are built from real CWE categories with concrete exploit payloads.

RFCs, Format Specs and Protocol Standards

PCAP binary format, WAV/RIFF headers, tar archive structure, HMAC-SHA256 chains, OpenAPI specifications — real-world standards that practitioners must implement correctly.

Production Incident Postmortems and SRE Runbooks

Cascading failures, WAL corruption after unclean shutdowns, Kubernetes pod crash loops, distributed lock starvation — scenarios on-call engineers actually face.

CS Textbooks, Papers, and Release Cycles

Raft consensus, Chandy-Lamport snapshots, dominator tree algorithms, Fibonacci heaps — algorithmic depth that cannot be derived from shallow pattern matching. For RoadmapBench specifically, release narratives and upstream test suites provide the ground truth for version-upgrade tasks.

These sources are continuously growing. New CVEs are published daily, new open-source projects introduce novel architectures, new RFCs define new protocols, and every release cycle yields fresh version pairs — all three datasets can expand without exhausting authentic task sources.

5.2 Three Complementary Taxonomies

Each dataset uses the two axes that expose its main source of score variation. DeepTerminalBench varies engineering domain and workload type; EvoCodeBench varies interaction style and engineering activity; RoadmapBench varies programming language and source repository.

DeepTerminalBench: Domain × Workload Type

Tasks are organized along domain (what area of engineering or science) and workload type (what kind of work the agent must perform). The cross-product covers over 20 domains crossed with 12 workload types. The 50 curated tasks sample broadly rather than exhaustively covering every combination; the taxonomy reserves additional domains for future expansion, so adding one new domain immediately creates 12 new domain-workload combinations.

Domains span: Code and Build, Repo Engineering, Testing and Quality, Debugging and Observability, Data Engineering, Database and Storage, Systems and Networking, Cloud and DevOps, Security, ML/AI and MLOps, Scientific and Numerical Computing, Automation and Productivity, Compiler and Language Tooling, File Format and Protocol, Distributed Systems and Concurrency, API and Schema Design, Data Structures and Algorithms, Simulation and Modeling, CLI and Interactive Tools, Workflow and Rule Engine — with Embedded/IoT, Container Orchestration, Cryptography, and Network Protocol Engineering reserved for expansion.

Workload types include: Greenfield Implementation, Brownfield Modification, Bug Localization and Fix, Migration and Upgrade, Integration and E2E Wiring, Performance Optimization, Reliability and Recovery, Security Patch and Hardening, Reproducibility and Verification, Forensics and Analysis, Automation Scripting, and Explanation and Reporting.

EvoCodeBench: Interaction Style × Engineering Activity

Multi-turn tasks vary along interaction style (how information flows across rounds) and engineering activity (what type of work each round involves).

Explorative

Vibe Coding Style

Requirements emerge through interaction. Round 1 is detailed; subsequent rounds shrink drastically, often to a single sentence.

Contractual

Agentic Style

The user knows what they want but can't specify it all at once. Each round provides detailed specs of roughly equal length; later rounds may revise earlier behavior.

Document-Driven

Doc-Driven Style

Key semantics live in project artifacts (specs, schemas, AGENTS.md). Instructions simply say "implement per the doc" or "doc updated, sync implementation."

Construction

Incremental Construction

The agent must build new functionality over several rounds, where each round adds features that later rounds depend on.

Spec Evolution

Spec Evolution & Conflict

The agent must adapt to new specs while preserving unchanged old behavior. At least one round explicitly overturns a core assumption established earlier.

Review-Driven

Review-Driven Improvement

Round 1 delivers full functionality. Subsequent rounds are code review feedback: performance, error handling, security, logging — all while keeping existing behavior unchanged.

Migration

Migration & Modernization

Starting from a complete legacy system, migrate incrementally to a new paradigm while preserving external behavior at every step.

The 3 interaction styles crossed with 4 engineering activities yield 12 task-type combinations. Allocation across EvoCodeBench:

Activity \ Style Explorative Contractual Document-Driven
Construction 8 4 1
Spec Evolution & Conflict 1 1 1
Review-Driven Improvement 3 1 1
Migration & Modernization 3 1 1

RoadmapBench: Programming Language × Repository Domain

RoadmapBench's axes reflect a reality: software engineers work across languages, and library evolution patterns recur in every ecosystem. The current release covers 17 repositories across 5 languages.

🐍

Python

Polars, PyG, Optuna, Falcon, spaCy — ML frameworks, dataframe engines, web servers, NLP pipelines.

41 tasks • 5 repos
📘

TypeScript

MikroORM, Prisma, Valibot — ORM frameworks, database clients, validation libraries.

22 tasks • 3 repos
⚙️

C++

Glaze, thread-pool — high-performance serialization and concurrent task scheduling.

20 tasks • 2 repos
🔵

Go

Fiber, Kitex, Fyne — RPC frameworks, web frameworks, GUI toolkits.

17 tasks • 3 repos
🦀

Rust

Ratatui, Diesel, Slint, Ruff — TUI frameworks, ORM, GUI toolkits, linting tools.

15 tasks • 4 repos

Why Three Different Taxonomies

Each taxonomy answers a different diagnostic question. DeepTerminalBench's Domain×Workload table shows which kinds of engineering work a model handles in a single dense session. EvoCodeBench's Interaction×Activity table shows which forms of requirement change break multi-turn reliability. RoadmapBench's Language×Repository table shows where version-upgrade ability depends on ecosystem-specific APIs, build systems, and conventions.


6. Dataset Statistics

6.1 Side-by-Side Task Statistics

Metric DeepTerminalBench EvoCodeBench RoadmapBench
Task count 50 curated tasks 26 multi-turn tasks 115 tasks across 17 repos
Unit of work per task 42 agent steps (median) 5–15 rounds 5 phases median, up to 12
Total evaluation units 50 tasks 227 rounds ~590 phases
Instruction length (median) ~5,000 chars / ~650 words Varies by round; round 1 typically largest Multi-phase; each phase ~300–800 words
Reference solution size (median) ~680 lines across solution/ Cumulative across rounds ~3,700 LOC per oracle patch
Test code (median) ~570 lines, 32 test functions Per-round cumulative test.sh Phase-level test files from upstream
Pre-configured env files (median) 10 files, up to 50+ Empty to complete legacy system Full earlier-release repository
Difficulty distribution Opus four-attempt pass rate (Pass@4): 22 at 0.25 • 17 at 0.50 • 11 at 0.75 12 task types; round count distribution 5–15 Median 5 phases; oracle patch ~3,700 LOC median

6.2 Difficulty Calibration: Three Methodologies

The three datasets calibrate difficulty in different ways because difficulty means different things across the three settings.

DeepTerminalBench: Four-Attempt Pass Rate (Pass@4) Band [0.25, 0.75]

50 tasks were selected from a larger pool so that Claude Opus 4.6 achieves Pass@4 between 0.25 and 0.75. This filters out both unsolved tasks and saturated tasks. 22 tasks land at Pass@4 = 0.25 (lowest success band), 17 at 0.50, 11 at 0.75.
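A minimal sketch of the selection rule (the task names other than the released B-tree task, and all reward records, are invented for illustration):

# Keep only tasks whose four-attempt pass rate lands in the [0.25, 0.75] band.
def pass_at_4(rewards):
    return sum(rewards) / len(rewards)

attempt_rewards = {
    "fix-b-tree-index-manager-bugs": [1, 0, 0, 0],   # 0.25 -> kept (hard but reachable)
    "trivial-hello-script":          [1, 1, 1, 1],   # 1.00 -> dropped (saturated)
    "unreachable-prototype":         [0, 0, 0, 0],   # 0.00 -> dropped (no solvability evidence)
}

curated = {t for t, r in attempt_rewards.items() if 0.25 <= pass_at_4(r) <= 0.75}
print(curated)   # {'fix-b-tree-index-manager-bugs'}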

EvoCodeBench: Fail-Stop + Cumulative Tests

Difficulty is structural rather than statistical. Later rounds raise difficulty when they stress cross-round dependencies: specification conflicts, backward compatibility, and multiple valid implementation paths. The fail-stop design makes partial progress measurable: an agent that passes rounds 1–3 but fails round 4 receives 3/N rather than 0.

RoadmapBench: Multi-Phase Weighted Scoring

Difficulty emerges naturally from the number and interdependence of phases. Tasks range from 3 to 12 phases (median 5), with core architectural phases weighted higher than peripheral utilities. The weighted scoring produces a continuous difficulty gradient: stronger models complete more phases and earn higher partial credit, while weaker models fail at earlier architectural boundaries. Oracle patches are ~3,700 LOC median — roughly two orders of magnitude larger than bug-fix benchmarks.

6.3 Agent Behavior Under Evaluation

DeepTerminalBench: Opus 4.6 Across 200 Rollouts (50 tasks × 4 attempts)

Metric Mean Median Min Max
Agent steps (tool calls) 42.5 42 14 160
Duration per attempt 17.4 min 14.8 min 8.7 min 39.1 min
Input tokens 2.1M 1.7M 445K 7.6M
Output tokens 35K 29K 8.7K 76K

DeepTerminalBench takeaways:

  • 44% of tasks trigger at least one timeout across 4 attempts, indicating that many tasks push agents to their execution budget limits.
  • ~83% of tasks include pre-existing environment files that the agent must read and work with before writing anything new.
  • Lower Pass@4 corresponds to longer attempts — Pass@4 = 0.25 tasks average 31 minutes per attempt, while Pass@4 = 0.75 tasks average 15 minutes.

EvoCodeBench: Multi-Turn Results Across 13 Models

  • MT@4: best Claude-Opus-4.7 at 54.0. Claude-Opus-4.7 and GPT-5.5-High are the only models above 50 MT@4.
  • Single-round pass rate (SR): best Claude-Opus-4.6 at 78.9. High isolated-round pass rate does not imply high persistent multi-round score.
  • Full-task completion: best Claude-Opus-4.7 at 42.3. The top completion rate remains below half of tasks.
  • Round degradation: all models, 46.7 → 21.3. Average pass rate falls from round 1 to round 5 under fail-stop scoring.

EvoCodeBench takeaways:

  • Persistent execution scores lower than isolated-round evaluation. Claude-Opus-4.6 reaches 78.9 SR but 44.0 MT@4; GLM-5.1 drops from 63.9 SR to 36.2 MT@4; Kimi-K2.6 drops from 59.0 SR to 31.9 MT@4.
  • End-to-end completion remains limited. Claude-Opus-4.7 has the highest full-task completion at 42.3, followed by GPT-5.5-High at 38.5 and Claude-Opus-4.6 at 34.6.
  • The common failure symptom is concrete, not generic. In 957 high-confidence annotations from executed failed round fragments, 89.1% are missed active-round obligations, 6.5% are environment/tooling failures, and 3.4% are regressions. Correction rounds raise the regression share to 11.2%; conflict-labeled rounds raise conflict mishandling to 14.9%.
  • Persistent state is the central stressor. The same round instruction receives higher scores when earlier rounds have already been applied by the reference solution than when the model must live with its own earlier edits.

RoadmapBench: Phase-Weighted Scoring Produces Graded Separation

The leaderboard results above demonstrate that multi-phase weighted scoring naturally separates model capabilities: the top model (Claude-Opus-4.7) achieves 39.5% resolved and a 0.702 completion score. Crucially, even models with low resolve rates achieve meaningful completion scores — confirming that phase-level partial credit exposes capability gradients invisible to binary pass/fail evaluation.

Task scale exceeds short-prompt coding

Across all three datasets, tasks require sustained multi-step engineering. The largest DeepTerminalBench rollout reaches 160 agent steps and nearly 40 minutes of wall-clock time. EvoCodeBench tasks span 5–15 rounds of evolving state, with 227 cumulative verification points. RoadmapBench oracle patches have a median of ~3,700 lines across a median of 5 phases. The evaluation requires agents to maintain context, correlate information across files, and make engineering tradeoffs across an extended terminal session.

6.4 Comparison with Related Benchmarks

Benchmark Tasks Task Type Unit of Work Oracle Size Scoring
HumanEval / MBPP 164 / 974 Algorithm 1 function <20 LOC Binary
SWE-bench Verified 500 Bug fix 1 patch ~33 LOC Binary
Terminal-Bench 2.0 89 Mixed single-shot 1 session Binary
DeepTerminalBench 50 Deep single-shot ~42 steps ~680 LOC Binary (Pass@k calibrated)
EvoCodeBench 26 Multi-turn 5–15 rounds Cumulative Mean of rounds, fail-stop
RoadmapBench 115 Feature impl. 5 phases ~3,700 LOC Weighted by phase

Oracle patches in RoadmapBench are roughly two orders of magnitude larger than those in bug-fix benchmarks, confirming that version-upgrade tasks require sustained multi-step engineering rather than single-function edits.


7. Verification & Scoring

7.1 Shared Principle: Execution-Based Verification

All three datasets emphasize dynamic, execution-based testing over static output checking. The majority of verification compiles, executes, and observes program behavior at runtime. Static output checks supplement execution-based verification but do not replace it. This layering prevents "teaching to the test" — solutions that produce correct output but use incorrect or insecure implementations will fail behavioral or attack-based verification.
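As a sketch of this layering, the skeleton below chains a dynamic run with static inspection of its artifacts and one adversarial probe. The paths, the entry point, and the expected values are hypothetical; real tasks wire equivalent checks into tests/test.sh.

import json
import os
import subprocess

# Dynamic layer: execute the delivered program and require a clean exit.
proc = subprocess.run(
    ["python3", "/app/solution_entry.py", "--input", "/data/events.log"],
    capture_output=True, text=True, timeout=120,
)
assert proc.returncode == 0, f"program failed: {proc.stderr[-500:]}"

# Static layer: the required artifact must exist, parse, and contain correct values.
report_path = "/app/output/report.json"
assert os.path.exists(report_path), "report.json was not produced"
with open(report_path) as f:
    report = json.load(f)
assert report["total_events"] == 1042, "event count is wrong"   # illustrative expected value

# Adversarial layer: a malicious input must be rejected, not silently processed.
attack = subprocess.run(
    ["python3", "/app/solution_entry.py", "--input", "/data/../../etc/passwd"],
    capture_output=True, text=True,
)
assert attack.returncode != 0, "path traversal was not rejected"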

For DeepTerminalBench, the median task employs three verification strategies simultaneously. The prevalence of each strategy across the 50 tasks:

Dynamic (execution-based)
  • Program execution / exit code: ~80%
  • Performance / timeout enforcement: ~72%
  • Build / compilation verification: ~46%
  • Database state queries: ~24%
  • Adversarial security payloads: ~22%

Static (inspection-based)
  • Output file existence / structure: ~84%
  • Output value correctness: ~78%
  • Cryptographic integrity checks: ~30%
  • AST / pattern matching: ~12%
  • File diff / binary comparison: ~10%

EvoCodeBench and RoadmapBench tests follow the same execution-first philosophy but apply it at the round/phase level rather than task level.

7.2 Three Scoring Schemes

Because the three datasets have different task units (single session vs. multiple rounds vs. multiple phases), each uses a different scoring formula. All three share one principle: the scoring unit is a binary pass at the lowest level, and the task score aggregates these.

DeepTerminalBench: Binary Reward

Each task produces a single binary reward: pass (1) or fail (0). The verification script exercises the delivered solution end-to-end with both dynamic and static checks; any failure collapses the reward to 0. The main leaderboard reports Pass@1: one attempt per task, averaged across all 50 tasks. Separately, Pass@k across multiple attempts yields per-task success rates used for difficulty calibration.

Formula: $\mathrm{Pass@1} = \frac{1}{T} \sum_{t=1}^{T} \mathrm{reward}_t$, where $\mathrm{reward}_t \in \{0, 1\}$ and $T = 50$.

EvoCodeBench: Mean of Per-Round Binary Rewards, Fail-Stop

Each round produces a binary reward (pass / fail). A failed round terminates the task immediately (fail-stop), and any rounds not executed count as 0. Because of fail-stop, the reward sequence is always a run of 1s followed by 0s — never a gap.

Formula: $\mathrm{Score} = \frac{1}{N} \sum_{i=1}^{N} r_i$ where $r_i \in \{0, 1\}$

Concretely, in a 5-round task: passing all 5 rounds scores 1.0; passing rounds 1–3 but failing round 4 scores 3/5 = 0.6; failing round 1 scores 0.0. The main table reports MT@4 by crediting a round if any of four independent multi-turn attempts reaches it successfully: $\mathrm{MT@4} = \frac{1}{T}\sum_{t}\frac{1}{N_t}\sum_{i=1}^{N_t}\max_{a \le 4} r_{t,a,i}$. It reports SR as $\mathrm{mean}_{t,i}\, s_{t,i}$, where $s_{t,i}$ is the binary reward for a single target round after fast-forwarding earlier rounds with reference solutions. Comp is $\frac{1}{T}\sum_{t}\mathbf{1}\left[\max_{a \le 4} r_{t,a,N_t} = 1\right]$. Avg. Turns and Output Tok. (K) report the interaction count and generated-token cost required to obtain those trajectories. Per-round rewards are recorded as intermediate artifacts and can serve as dense training signals for fine-tuning or RL.
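A short sketch of how these quantities fall out of recorded per-round rewards (the reward values are invented for illustration):

# Fail-stop rewards for one 5-round task across four persistent attempts.
# Each row is a run of 1s followed by 0s: rounds after the first failure count as 0.
task_attempts = [
    [1, 1, 1, 0, 0],   # attempt 1 fails at round 4 -> score 3/5
    [1, 1, 0, 0, 0],   # attempt 2 fails at round 3 -> score 2/5
    [1, 1, 1, 1, 1],   # attempt 3 completes        -> score 5/5
    [1, 0, 0, 0, 0],   # attempt 4 fails at round 2 -> score 1/5
]

def fail_stop_score(rewards):
    # Single-attempt score: mean of per-round binary rewards.
    return sum(rewards) / len(rewards)

n_rounds = len(task_attempts[0])
# MT@4 for this task: a round is credited if any of the four attempts reached it.
mt4 = sum(max(att[i] for att in task_attempts) for i in range(n_rounds)) / n_rounds
# Comp for this task: 1 if any attempt passed the final round.
comp = max(att[-1] for att in task_attempts)

print(fail_stop_score(task_attempts[0]), mt4, comp)   # 0.6 1.0 1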

RoadmapBench: Weighted Sum Over Phases

Each task is divided into n phases. Let $p_i \in \{0, 1\}$ indicate whether phase i passes and $w_i$ denote its weight. The task reward is:

Formula: $\mathrm{reward} = \sum_i w_i \cdot p_i / \sum_i w_i$

Weights reflect the architectural centrality of each phase — core modules that later phases depend on receive higher weights than peripheral utilities. A model that completes 3 of 5 phases scores above 0 rather than being indistinguishable from one that implements 0 phases, providing a denser signal than binary scoring.
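A worked example of the formula, with invented weights for a 5-phase task whose two core phases carry most of the weight:

# Hypothetical phase weights and pass indicators for one task.
weights = [0.3, 0.3, 0.2, 0.1, 0.1]   # core architectural phases weighted higher
passed  = [1,   1,   1,   0,   0]     # the model completed the first three phases

reward = sum(w * p for w, p in zip(weights, passed)) / sum(weights)
print(reward)   # 0.8: substantial partial credit instead of a flat 0 under binary scoring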


8. Quality Assurance

8.1 Shared Gates: Structure + Oracle + Expert + Calibration

All three datasets share a four-gate quality assurance pipeline before any task is admitted:

  1. Static Structure Validation — The task package is well-formed: Dockerfile builds, test harness conforms to conventions, metadata is valid, instruction provides sufficient context. Automated checks guarantee structural integrity.
  2. Oracle Verification — The reference solution is deployed into the task's Docker container and the full test suite runs end-to-end. Only tasks where the oracle achieves reward = 1.0 (RoadmapBench: also baseline < 1.0) are retained; the oracle run is the solvability evidence for admission.
  3. Agent-Assisted Expert Review — Domain engineering experts review task instructions, test suites, and oracle solutions for technical accuracy, specification completeness, and test fairness. Experts verify that instructions contain sufficient information for the task to be solvable without hidden assumptions, that test assertions are reasonable and not overly brittle, and that the oracle represents sound engineering practice. Automated checks ensure structural integrity; expert review checks whether the task is a valid engineering evaluation item.
  4. Multi-Model Pass@k Calibration — Multiple LLM agents attempt each task (k=4 trials). Tasks passed by every model on every attempt or failed by every model on every attempt are flagged and either adjusted or excluded.

Beyond these shared gates, each dataset has one additional mechanism tailored to its setting.

8.2 Dataset-Specific QA Mechanisms

DeepTerminalBench: Pass@4 ∈ [0.25, 0.75]

The 50 released tasks are specifically those where Opus 4.6 Pass@4 falls in the 0.25–0.75 range. This keeps the released set away from both saturation and universal failure, and provides score variation among evaluated models. Tasks outside this band are flagged for review, difficulty adjustment, or exclusion.

EvoCodeBench: Path-Divergence via Behavioral Contracts

Multi-turn tasks introduce path divergence: different agents can reach the end of round N-1 through different implementation paths, but round N's tests must accept all valid paths. EvoCodeBench addresses this by requiring all tests to check the system's behavioral contract (external observable behavior) rather than implementation details. Round N's tests cannot assume the agent used the same code structure as the reference solution in round N-1 — only that the behavioral contract is preserved. Expert review explicitly audits every test for implementation coupling before admission.

RoadmapBench: Attribution-Driven Failure Classification

After automated verification, tasks enter multi-model rollout. Failures are classified into two categories: task-side defects (T-type) — missing instruction details, over-constrained assertions, environment mismatches, or test bugs — versus model-side failures (M-type) — incorrect architecture, buggy implementation, or failed debugging. T-type issues are iteratively fixed and re-verified through the oracle gate. A task is admitted to the dataset only after all T-type issues are resolved, ensuring that remaining failures reflect genuine model limitations rather than benchmark noise.

The T-type/M-type methodology is most formally developed in RoadmapBench, but the same spirit applies to DeepTerminalBench and EvoCodeBench quality control.


9. Infrastructure: Extending Harbor

Harbor was originally designed for single-shot tasks: one instruction, one agent execution, one verification. DeepTerminalBench uses that engine unchanged. EvoCodeBench adds persistent sessions, round boundaries, and resume support; RoadmapBench adds pinned repository images and phase-level scoring. The implementation differences explain what each benchmark setting costs to run reproducibly.

9.1 Baseline: Harbor's Single-Shot Engine

The baseline execution model is simple: Harbor builds the Docker environment, delivers instruction.md to the agent, runs the agent inside the container, then runs tests/test.sh to produce a reward. DeepTerminalBench uses this directly — every DeepTerminalBench task is a stock Harbor task, runnable by any Harbor-compatible scaffolding without modification.
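In rough Python form, the loop looks like the sketch below. This is not Harbor's actual API; the docker invocations and the way the agent command receives the instruction are illustrative assumptions, but the shape of the loop is the point.

import subprocess

def sh(*args):
    # Thin wrapper; returns the completed process without raising.
    return subprocess.run(list(args), capture_output=True, text=True)

def evaluate_single_shot(task_dir, agent_cmd):
    # 1. Build the task environment and start a long-lived container.
    sh("docker", "build", "-t", "task-env", f"{task_dir}/environment")
    sh("docker", "run", "-d", "--name", "task", "task-env", "sleep", "infinity")
    # 2. Deliver the instruction to the agent (here: on stdin of the scaffold command).
    instruction = open(f"{task_dir}/instruction.md").read()
    subprocess.run(agent_cmd, input=instruction, text=True)
    # 3. Run the hidden verification script and map its exit code to a binary reward.
    sh("docker", "cp", f"{task_dir}/tests/test.sh", "task:/tmp/test.sh")
    verdict = sh("docker", "exec", "task", "bash", "/tmp/test.sh")
    sh("docker", "rm", "-f", "task")
    return 1.0 if verdict.returncode == 0 else 0.0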

9.2 EvoCodeBench Extensions: Persistent Multi-Round Execution

Supporting multi-turn evaluation required rethinking several layers of Harbor. The overarching goal: make the evaluation environment behave like a real multi-turn coding session. When a user works with a coding agent across multiple turns, the agent remembers the conversation, the file system accumulates changes, installed packages persist, and running services stay alive. None of this resets between turns.

Persistent Environment

The core design decision is one Docker container and one agent session for the entire multi-turn task. The container is created once at the start. The agent is initialized once. Neither is torn down or rebuilt between rounds. All state accumulates naturally: files written in round 1 are visible in round 5, packages installed in round 2 remain available in round 4, the agent's own conversation history carries over throughout. Restarting the environment between rounds would measure a different setting: isolated repeated tasks rather than one persistent coding session.

Round Boundary Protocol

Between every two consecutive rounds, the framework executes a round boundary protocol:

  1. The agent completes its work for the current round.
  2. The framework runs the current round's cumulative verification script.
  3. A binary reward (pass/fail) is recorded.
  4. If the round passes, an optional state snapshot is saved, and the next round begins.
  5. If the round fails, the entire task terminates immediately (fail-stop).

Fail-stop is deliberate: since later rounds depend on correct earlier implementations, allowing an agent to continue after a failure would produce meaningless results. From the agent's perspective, nothing visible happens at round boundaries — it simply receives the next instruction and continues working. Verification and checkpointing happen transparently.
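In sketch form, the boundary protocol is a loop around the persistent session. The helpers passed in (send_instruction, verify_round, snapshot) are placeholders for whatever the harness actually wires to the container; only the control flow is the point.

def run_multiturn_task(round_instructions, send_instruction, verify_round, snapshot):
    # One container, one agent session; this loop only inserts verification between rounds.
    rewards = []
    n = len(round_instructions)
    for i, instruction in enumerate(round_instructions, start=1):
        send_instruction(instruction)      # agent works; files, packages, history persist
        passed = verify_round(i)           # cumulative tests for rounds 1..i
        rewards.append(1 if passed else 0)
        if not passed:                     # fail-stop: unexecuted rounds count as 0
            rewards.extend([0] * (n - i))
            break
        snapshot(i)                        # optional checkpoint for later resume
    return sum(rewards) / n                # mean of per-round binary rewards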

State Management and Partial Evaluation

Building a multi-turn dataset requires running thousands of trials during development. Running every task from round 1 every time would be prohibitively expensive. We need the ability to start from the middle, which means placing the environment into the exact state it would be in after round N-1 without actually executing rounds 1 through N-1.

Two mechanisms address this:

  • Reference solution fast-forward. The framework executes early rounds using the task's reference solution to bring the environment to a known-good state, then hands off to the target agent at the desired round. Fast-forwarded rounds are not verified and not scored; they serve purely to prepare the workspace.
  • Snapshot-based resume. At each round boundary, the framework saves a full snapshot of the container state. A later run can restore from any saved snapshot, skip all prior rounds entirely, and begin from that checkpoint. Historical round results are preserved and carried forward into the new run's scoring.

These enable (1) difficulty calibration — iterate on a specific round's design without re-running the chain; (2) partial evaluation — test an agent on rounds 3–5 only; (3) cross-agent comparison — run the reference solution through all rounds, then resume with different agents from the same checkpoint.
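A minimal sketch of the fast-forward mechanism under the same abstractions (the exec helper and paths are illustrative):

def fast_forward(task_dir, container_exec, target_round):
    # Replay the task's own reference solutions for rounds 1..target_round-1.
    # These rounds are neither verified nor scored; they only prepare the workspace
    # so the agent under test can start cleanly at target_round.
    for r in range(1, target_round):
        container_exec(f"bash {task_dir}/round_{r}/solution/solve.sh")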

9.3 RoadmapBench Extensions: Pinned Images and Phase Orchestration

RoadmapBench extends Harbor more modestly, along two axes:

  • Pinned baseline Docker image. The Docker image built during oracle verification (where the baseline fails and the oracle passes) is saved and reused for all subsequent evaluations, ensuring reproducibility across runs even as upstream dependencies drift.
  • Phase-level test orchestration + weighted reward. The test runner executes phase-level test files (test_01_*.py, test_02_*.py, ...) independently, records per-phase pass/fail, and computes the weighted sum as the task reward. Partial-progress scoring requires the runner to continue past phase failures rather than short-circuit, unlike EvoCodeBench's fail-stop model; a minimal sketch follows this list.
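The sketch below shows the orchestration pattern. The glob pattern, the use of pytest as the per-phase runner, and the uniform default weights are assumptions for illustration, not the actual runner's interface.

import glob
import re
import subprocess

def phase_weighted_reward(test_dir, weights=None):
    # Discover phase-level test files such as test_01_*.py, test_02_*.py, ...
    phase_files = sorted(glob.glob(f"{test_dir}/test_[0-9][0-9]_*.py"))
    results = {}
    for path in phase_files:
        phase = int(re.search(r"test_(\d+)_", path).group(1))
        # Keep going after a failed phase; partial progress is exactly what we want to score.
        proc = subprocess.run(["pytest", "-q", path], capture_output=True)
        results[phase] = 1.0 if proc.returncode == 0 else 0.0
    if not results:
        return 0.0
    w = weights or {p: 1.0 for p in results}   # uniform weights unless a weight map is given
    return sum(w[p] * results[p] for p in results) / sum(w.values())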

Beyond these, RoadmapBench reuses Harbor's stock execution engine without modification — a single instruction, a single agent session, a single verification pass.

9.4 Shared Limitations

  • Snapshot restoration is not perfect. EvoCodeBench snapshots capture the container's file system but not all runtime state. If an agent's work depends on running background processes, in-memory caches, or live network connections from earlier rounds, these will not survive a save-and-restore cycle.
  • Tests must be deterministic and non-interactive. All verification scripts must run without human input and produce consistent results. Tests depending on external network state, wall-clock time, or interactive prompts are not supported.
  • Snapshot storage cost. For long EvoCodeBench tasks, full container snapshots across many rounds can accumulate significant storage. By default, snapshots are saved only for rounds that pass verification, reducing storage for failed trials.
  • No mid-round recovery. If an agent crashes mid-round in EvoCodeBench, that round must be re-executed from the previous snapshot.

10. Task Examples

Below are two representative tasks per dataset, chosen to cover different task structures within each benchmark. For space, we summarize each example rather than reproducing the full instruction, focusing on the design pattern and the difficulty mechanism.

10.1 DeepTerminalBench

B-Tree Index Manager 11-Bug Fix (Python, 31 tests, Opus Pass@4: 0.25)

A disk-backed B-tree index manager for a lightweight embedded database lives at /app/btree/, spanning 7 Python source files (node, tree, pager, serializer, cursor, wal, concurrency). A recent refactor introduced 11 distinct bugs spanning all 5 layers of the storage engine stack:

  • BUG-001 — Leaf-split key loss: median key promoted to parent is also left in the right child, so the left child silently loses keys[t-1].
  • BUG-004 — Trailing child pointer lost on serialize: serialize_node writes len(keys) child page-IDs instead of len(keys) + 1, corrupting the tree after checkpoint+recovery.
  • BUG-008 — Header signedness: num_keys packed as signed int16 wraps negative for >= 32768 keys.
  • BUG-009 — WAL recovery replays aborted transactions.
  • BUG-011 — RWLock starvation: busy-wait spin causes 100% CPU.
  • ...and 6 more bugs touching range scans, merges, cursors, and free-page reuse.

The public API must remain unchanged. A read-only diagnostics script at /app/run_diagnostics.py exercises the module end-to-end and writes a structured JSON report. Language is Python 3.

Environment (9 files)

/app/btree/__init__.py
/app/btree/node.py
/app/btree/tree.py
/app/btree/pager.py
/app/btree/serializer.py
/app/btree/cursor.py
/app/btree/wal.py
/app/btree/concurrency.py
/app/run_diagnostics.py (read-only diagnostic script)

The agent inherits a complete B-tree storage engine codebase: 7 Python source files implementing the core data structure, page serialization, write-ahead logging, cursor iteration, and thread-safe concurrent access, plus a diagnostics script that exercises the entire module and produces a structured report.

Oracle Solution

The oracle fixes all 11 bugs across 7 files. Key techniques include: correcting the B-tree split algorithm to properly promote median keys without duplication; fixing serialization to write len(keys)+1 child pointers; changing num_keys from signed to unsigned int16; making WAL recovery filter out aborted transactions; populating the free-page list so freed pages are reused; replacing the busy-wait spin with threading.Condition; and correcting range_scan boundaries to inclusive comparisons.
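For intuition, the sketch below shows the invariant behind the first two fixes in a generic, textbook-style form (minimum degree t): a split promotes the median key exactly once, and a node with k keys always carries k + 1 children. The benchmark's own node classes and on-disk layout differ; this is illustrative only.

class Node:
    def __init__(self, leaf=True):
        self.leaf, self.keys, self.children = leaf, [], []

def split_child(parent, idx, t):
    # Split parent's idx-th child, which is full (2t - 1 keys).
    child = parent.children[idx]
    right = Node(leaf=child.leaf)
    median = child.keys[t - 1]
    right.keys = child.keys[t:]              # keys strictly greater than the median
    child.keys = child.keys[:t - 1]          # keys strictly smaller than the median
    if not child.leaf:
        right.children = child.children[t:]  # a node with k keys keeps k + 1 children
        child.children = child.children[:t]
    parent.keys.insert(idx, median)          # the median moves up exactly once
    parent.children.insert(idx + 1, right)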

Tests (31 functions)

Functional correctness: test_range_scan_101_results, test_cursor_count_10000, test_search_after_merge.

Persistence and recovery: test_serialization_children_preserved checks that serialize-then-deserialize roundtrips preserve all child pointers. test_wal_recovery_skips_aborted verifies aborted-transaction keys do not reappear after replay.

Concurrency and resource management: test_no_cpu_spike, test_rwlock_uses_condition, test_free_page_direct.

Edge cases: empty tree operations, single-key tree, duplicate inserts, delete of non-existent key, range with low==high, interleaved commit/abort WAL recovery.

Why This Task Is Difficult

The 11 bugs span 7 files across 5 layers (tree ops, binary serialization, page management, WAL recovery, concurrency). The key difficulty is cross-layer bug interaction. Fixing BUG-001 (leaf-split key loss) requires understanding the B-tree split invariant; but if BUG-004 (serialization writes wrong child-pointer count) is not also fixed, the corrected split silently corrupts on the next checkpoint/recovery cycle. Similarly, fixing BUG-009 (WAL replay of aborted transactions) requires understanding the WAL transaction lifecycle; but if BUG-007 (free-page reuse) is not also fixed, the data file grows without bound.

In failed trajectories, agents typically fix 8–9 of 11 bugs but miss either BUG-004 (easy to overlook because it only manifests after checkpoint+recovery) or BUG-007 (symptom is file growth, not data corruption, so it is deprioritized). The interaction between bugs means partial fixes often break other tests in non-obvious ways.

Kubernetes Pod Scheduler Optimizer (Rust, ~20 tests, Opus Pass@4: 0.75)

The agent is given a Rust-based Kubernetes pod scheduler simulator at /app/scheduler_slow.rs that reads cluster state from JSON and produces correct scheduling decisions — but takes ~18 seconds on the test dataset. The task is to rewrite it as /app/scheduler.rs so it finishes in under 2 seconds while producing bit-for-bit identical output. The scheduler must compute six things: (1) pod scheduling with taints, tolerations, node/pod affinity and anti-affinity, node selector matching, resource fit, and a composite score combining resource balance and affinity preference; (2) per-node utilization; (3) per-namespace summaries; (4) anti-affinity conflict detection; (5) bin-packing standard deviation across resources; (6) preemption: for unschedulable pods with PreemptLowerPriority policy, find the minimum-cost eviction set on some node. Deterministic sorting everywhere. Standard library only. Compile with rustc -O and print elapsed milliseconds to stderr.

Environment (3 files)

/app/scheduler_slow.rs (the slow reference implementation)
/app/data/cluster_state.json (cluster topology, pods, nodes)
/app/data/expected_output.json (exact expected scheduling decisions)

The agent receives a working but slow Rust implementation, a large cluster state JSON, and the exact expected output. The task is pure performance optimization — the output must be bit-for-bit identical, just produced 9x faster.

Oracle Solution

The oracle rewrites the scheduler from scratch in optimized Rust. Key techniques: pre-indexing nodes by label sets for O(1) node-selector matching; incremental resource-availability tracking instead of re-scanning all nodes; caching taint-toleration compatibility; priority-based preemption with early termination; efficient data structures (HashMap, BTreeMap) for topology lookups; eliminating unnecessary JSON serialization in intermediate steps.

Tests

Correctness: the primary test compares the output file byte-for-byte against expected_output.json. Every scheduling decision — which pod on which node, which pods are preempted, which remain unscheduled — must match exactly. No partial credit.

Performance: 2-second wall-clock limit (down from ~18s baseline) under rustc -O.

Build verification: compiles without errors or warnings under the default release profile.

Why This Task Is Difficult

The task combines three requirements that must all hold simultaneously: Kubernetes scheduling semantics, Rust systems programming, and algorithmic optimization.

The scheduling specification is dense. Node affinity has both required and preferred expressions with 6 operators. Taints and tolerations have 3 effects with both Equal and Exists operators. Preemption follows priority ordering with a specific policy. The agent must understand all of these correctly — not approximately — because the test requires exact output match.

Writing correct, optimized Rust adds a language-level constraint: the ownership model, borrow checker, and lifetime system prevent many common optimization patterns. The performance requirement (under 2s, down from 18s) requires an algorithmic improvement — an agent that produces correct code with the same O(N×M) complexity will pass correctness but fail the performance gate.

In failed trajectories, the most common pattern: Rust code compiles and runs under 2s but produces slightly different output due to a misunderstood scheduling rule (typically pod anti-affinity topology key handling or PreferNoSchedule taint semantics). The exact-match requirement leaves no room for approximation.

10.2 EvoCodeBench

Representative Real Task: Deterministic Data Pipeline CLI

This example is a released EvoCodeBench task, deterministic-data-pipeline-go. It asks the agent to build and then extend a Go CLI called dpipe for deterministic CSV processing, reproducibility metadata, validation, profiling, lineage, quality checks, snapshots, and changelogs. The task is representative because it combines all three multi-turn stressors in one session: cumulative feature growth, corrections to earlier behavior, and explicit conflicts with prior output formats.

Task: dpipe Deterministic Data Pipeline (15 Rounds)
R1
Initial build: implement dpipe as a deterministic Go CLI that reads CSV files, applies configured transformations, writes bit-identical outputs across repeated runs, and supports reproducibility verification.
R3
Correction + extension: fix linear-fill boundary extrapolation while adding additional pipeline capabilities. The round tests both the new behavior and the earlier deterministic CSV pipeline.
R5
Conflict: change the verify command checksum algorithm from SHA-256 to BLAKE2b-256. Existing verification behavior must be updated without breaking earlier ingestion, transformation, profiling, and validation behavior.
R7
Correction + extension: fix lineage manifest graph edges and add further reproducibility functionality. This round checks that lineage metadata remains consistent with the manifest semantics introduced earlier.
R10
Format conflict: replace the manifest's single blake2b JSON field with two new fields. The model has to migrate the output contract while preserving the rest of the manifest and verification behavior.
R12
Correction: change the default normalize method so omitted or empty method behaves as minmax. This tests whether the agent can revise an earlier transform without disturbing explicit zscore and minmax behavior.
R15
Late conflict: change snapshot row hashes from MD5 to SHA-256 while also adding a tag transform. The final tests exercise the full accumulated system: transforms, auditability, lineage, manifest formats, snapshots, and changelogs.

The key property is cumulative verification. A correct round 15 implementation cannot merely satisfy the latest hash-format change; it must also preserve deterministic output, earlier checksum semantics where still valid, profile/validate/quality/schema behavior, audit logging, lineage metadata, and snapshot/changelog contracts introduced across the previous 14 rounds. This is the EvoCodeBench setting in concrete form: later requests are small, but each one is evaluated against the whole evolving tool.

10.3 RoadmapBench

Example 1: PyTorch Geometric 2.4 → 2.5 (Python)

📈 Task Overview

The agent receives the PyG source tree at version 2.4.0 and must implement features introduced in 2.5.0. The task spans 5 phases: (1) the new Index class and EdgeIndex.unbind(); (2) EdgeIndex.sparse_resize_() with validation logic; (3) the ClusterPooling layer; (4) the LinkPredMRR metric; (5) GATConv residual connections. Each phase tests API-level behavioral contracts using tests adapted from the official PyG test suite.

Difficulty source: The agent must synthesize new classes that fit PyG's internal tensor semantics, register them in the correct __init__.py export paths, and preserve backward compatibility with existing 2.4 APIs — all without seeing the target version source code.

Example 2: Fiber v2.49 → v2.50 (Go)

📄 Task Overview

Go library upgrades often introduce new middleware, context propagation mechanisms, or routing features. This task asks the agent to implement new request context helpers, extend the middleware chain API, and add a new built-in middleware with configurable behavior. The instruction provides behavioral contracts (which HTTP status codes to return under which conditions) but not the Go struct definitions or interface signatures.

Difficulty source: Go's type system and interface semantics require precise API surface alignment. The agent must discover the correct embedding and delegation patterns by reading the existing middleware implementations rather than from the instruction alone. Whereas the PyG example tests Python/ML ecosystem fluency, this example tests Go idioms and the ability to generalize RoadmapBench's pattern across language ecosystems.


11. Evaluation Insights

Why We Use a High-Performing Scaffolding

The same underlying model can produce different scores under different agent frameworks. To reduce scaffolding as a confounder, we evaluate with Harbor Terminus-2 — a standardized agent harness with terminal access, file editing, and code execution inside a Docker container (see the Terminal-Bench 2.0 leaderboard for comparative scaffolding performance). The analyses below therefore focus on behavior under this fixed harness.

11.1 DeepTerminalBench: Per-Task Pass/Fail Map

The breakdown below renders the same 9-model results as the leaderboard, but at task granularity: each row is one of the 50 tasks (ordered by how many models solved it, descending), and the count after each task is the number of models that passed it in this single-trial Pass@1 run. Reading down the list exposes the difficulty spectrum — from a few tasks that are widely solved to the seven at the bottom that no model solves.
Per-task solve counts — tasks ordered by how many of the 9 models solve them (descending)
Each count is the number of models that passed the task in this single-trial Pass@1 run, out of the nine evaluated models (Claude-Opus-4.7, GPT-5.5, DeepSeek-V4-Pro, Kimi-K2.6, Qwen-3.6-Plus, GLM-5.1, Qwen-3.5-397B-A17B, Gemini-3.1-Pro, MiniMax-M2.7). The bottom 7 tasks were solved by no model; the top rows show the easier shared core.
d12w3_fix-pipeline-executor-bugs · 7
d12_w3_d12w3_fix-log-analysis-alert-pipeline-bugs · 6
d17_w5_d17w5_doc-similarity-clustering · 5
d18w3_fix-particle-collision-sim · 5
d2w10_build-dependency-graph-forensics · 5
d4w10_microservice-crash-cascade-analyzer · 5
d5_w7_d5w7_streaming-aggregation-checkpoint-restore · 5
d5w6_json-log-aggregation-indexing-pipeline · 5
d9w8_security-patch-credential-exposure-bash-deployment · 5
d12_w6_d12w6_build-cache-parallel-execution-optimizer · 4
d13_w5_d13w5_static-analysis-pipeline-integration · 4
d4_w2_d4w2_enhance-application-metrics-collector · 4
d6_w6_d6w6_wal-compaction-checkpoint-optimizer · 4
d10w7_ml-feature-store-snapshot-recovery-drift · 3
d15_w6_d15w6_concurrent-hashmap-lock-striping-optimizer · 3
d19w4_migrate-bash-backup-to-python · 3
d3w6_d3w6-optimize-c-regression-test-harness · 3
d4_w2_d4w2_enhance-log-analysis-alert-engine · 3
d4w10_performance-triage-diagnose-runtime-anomalies-distributed · 3
d6_w3-schema-registry-rollback-chain · 3
d6w3_fix-b-tree-index-manager-bugs · 3
d12_w10_d12w10_backup-chain-forensics · 2
d14w2_extend-tar-archive-format-parser · 2
d16_w9_d16w9_api-contract-snapshot-regression-checker · 2
d16w8_harden-graphql-project-api · 2
d1_w4_d1w4_migrate-commonjs-to-esm-node-app · 2
d2_w1_d2w1_full-text-document-search-engine · 2
d5_w7_d5w7_ingestion-queue-dead-letter-recovery · 2
d5w2_fix-etl-pipeline-bugs-add-lineage · 2
d7w11_proc-tree-budget-enforcer · 2
d7w7_resume-checkpoint-batch-pipeline · 2
d9_w6_d9w6_rbac-abac-policy-optimizer · 2
d10w7_ml-training-checkpoint-recovery-system · 1
d11_w8_d11w8_numerical-simulation-engine-security-remediation · 1
d13w5_build-microschema-validator-compiler · 1
d14_w9_d14w9_binary-log-replay-determinism-validator · 1
d15w1_chandy-lamport-distributed-snapshot-simulator · 1
d1w9_reproducible-build-pipeline-artifact-verification · 1
d20w2_java-approval-chain-parallel-escalation · 1
d3_w4_d3w4_unittest-to-pytest-migration-verifier · 1
d4_w3_d4w3_fix-metrics-dashboard-collection-bugs · 1
d8_w8_d8w8_k8s-manifest-generator-security-hardening · 1
d8w7_ci-cd-pipeline-recovery-orchestrator · 1
d11_w8_d11w8_signal-processing-pipeline-security-hardening · 0
d17w12_identify-sorting-algorithm-trace-forensics · 0
d18w11_fsm-batch-test-coverage-analysis · 0
d5w10_etl-pipeline-audit-trail-forensic-reconstruction · 0
d6_w2_d6w2_lsm-tree-compaction-bloom-filter-enhancement · 0
d6_w8_d6w8_schema-migration-ddl-deserialization-fix · 0
d8w10_kubernetes-cluster-incident-forensics · 0

11.2 DeepTerminalBench: 200-Rollout Deep Dive

We examined Claude Opus 4.6 across 200 rollouts (50 tasks × 4 attempts). Since tasks were selected using Opus Pass@4 calibration, this is not a model leaderboard; it is a task-difficulty analysis using one high-scoring agent as the probe.

Across the 200 rollouts, 89 succeeded and 111 failed. Comparing these two populations yields several measurable contrasts:

  • Failed attempts run slightly longer: 17.7 minutes on average versus 17.0 for successes — failing agents spend their extra time on unproductive exploration.
  • 23% of failures end in timeout. These are not wrong answers produced quickly; they are cases where the agent ran out of budget while still actively working — the tasks genuinely require sustained multi-step reasoning.
  • Lower-Pass@4 tasks take longer: tasks at Pass@4 = 0.25 average 31 minutes per attempt, while Pass@4 = 0.75 tasks average 15.
  • Multi-file bug diagnosis, cross-component integration, and deep domain knowledge dominate the 22 tasks in the Pass@4 = 0.25 band — e.g., WAL semantics, K8s scheduling priorities, or C memory conventions that must be inferred from context rather than stated.
Insight 1: Process strategy determines outcome

Under identical scaffolding, the sequence of actions the model chooses matters enormously. Successful attempts test incrementally: write one module, run the test suite, observe which tests pass, fix issues, move to the next module. Failed attempts write everything first, then run tests for the first time at the end — encountering 20+ failures with no diagnostic signal about which module caused which error.

This is not a scaffolding deficiency — the scaffolding allows incremental testing in both cases. The difference is whether the model chooses to test early. The effect is especially pronounced in the B-tree 11-bug task, where agents that test after each bug fix pass at much higher rates than those that attempt all 11 fixes before running any tests. The ability to decompose a complex task into verifiable increments is itself a core model capability, distinct from raw code-generation quality.
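
The contrast is easy to state as a workflow. Below is a minimal sketch of the incremental strategy, with hypothetical module names and a stubbed edit step; the point is when the test suite runs, not how the agent edits files. The failing strategy runs the same pytest command exactly once, after every module has been written, and loses per-module attribution.

# Sketch of the "test after every increment" strategy. Module names are
# illustrative and edit_module() is a stand-in for the agent editing files.
import subprocess

MODULES = ["parser", "indexer", "query_engine"]  # illustrative decomposition

def edit_module(name: str) -> None:
    # Stand-in for the agent writing or repairing this module's source files.
    print(f"[agent] implementing {name}")

for name in MODULES:
    edit_module(name)
    # Run only this module's tests so any failure points at the code just written.
    result = subprocess.run(
        ["pytest", f"tests/test_{name}.py", "-q"],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        print(f"[agent] {name} failed, fixing before moving on:\n{result.stdout}")
        edit_module(name)  # in practice the agent repeats this fix-and-retest step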

Insight 2: Remaining misses concentrate in hidden interactions

In bug-fix and security-hardening tasks requiring 8–14 fixes, agents consistently find and fix 80–90% of issues but miss 1–2 subtle ones. The pattern is remarkably stable across tasks and runs.

In the B-tree task, agents reliably fix the obvious bugs but miss BUG-004 (serialization child count) because it only manifests after a checkpoint-recovery cycle, not during normal in-memory operations. In the SIEM pipeline task, agents fix most integration bugs but miss the per-service severity threshold filtering because it requires reading a separate config file not referenced in any component's API. The missed issues usually require domain knowledge or cross-file exploration beyond the local symptom. The agent often looks in the right area but implements an incomplete fix — for example, adding lock protection to shared state but not to the escalation path, or normalizing 3 of 4 timestamp formats but missing the CEF format.

Insight 3: Correctness under simultaneous constraints is harder than correctness alone

Many tasks impose simultaneous constraints that interact: thread safety + performance targets, security hardening + backward compatibility, format compliance + error recovery, algorithmic correctness + exact output matching. Agents consistently handle each constraint in isolation but fail when they must all hold at once.

In the K8s scheduler task, agents produce Rust code that either compiles and runs fast but produces wrong output (misunderstood scheduling semantics), or produces correct output but exceeds the 2-second time limit (correct algorithm but unoptimized). The high-failure region is the conjunction: correct and fast. These tasks measure whether an agent can satisfy interacting constraints at the same time, not whether it can satisfy each constraint separately.
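
Verification for such tasks gates on the conjunction directly. A minimal sketch, assuming a built scheduler binary at ./scheduler, a fixture input, and an expected-output file; all of these names and the exact budget are illustrative rather than the task's real files.

# Illustrative conjunction gate: the attempt passes only if the output matches
# AND the run finishes inside the time budget. Paths and the budget are made up.
import subprocess
import time
from pathlib import Path

BUDGET_SECONDS = 2.0

start = time.monotonic()
result = subprocess.run(
    ["./scheduler", "fixtures/cluster_state.json"],
    capture_output=True, text=True, timeout=BUDGET_SECONDS * 5,
)
elapsed = time.monotonic() - start

correct = result.stdout == Path("fixtures/expected_schedule.txt").read_text()
fast = elapsed <= BUDGET_SECONDS

# Reward is all-or-nothing on the conjunction, which is exactly where most
# attempts fall: correct-but-slow and fast-but-wrong both score zero.
reward = 1.0 if (result.returncode == 0 and correct and fast) else 0.0
print(f"correct={correct} fast={fast} reward={reward}")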

11.3 EvoCodeBench: Single-Round Skill Does Not Become Multi-Turn Reliability

EvoCodeBench measures a pattern absent from single-shot benchmarks: agents can solve many individual rounds when each round starts from a clean reference state, but fail when they must carry forward their own implementation. On 26 multi-turn tasks and 227 evaluated rounds, Claude-Opus-4.7 leads MT@4 at 54.0, GPT-5.5-High follows at 52.4, and Claude-Opus-4.6 reaches 44.0. These are the only three models above 40 MT@4.

The SR-to-MT@4 gap quantifies the cost of carrying state across rounds. Claude-Opus-4.6 reaches 78.9 single-round pass rate but 44.0 MT@4. GLM-5.1 reaches 63.9 SR but 36.2 MT@4, Kimi-K2.6 reaches 59.0 SR but 31.9 MT@4, and DeepSeek-V4-Pro reaches 56.4 SR but 30.6 MT@4. The drop is evidence that isolated instruction following is insufficient; the model must preserve a correct evolving workspace while requirements accumulate, change, and sometimes conflict.

Round-level results show how quickly reliability decays. Averaged across model-round observations, pass rates fall from 46.7 at round 1 to 26.9 at round 3, 21.3 at round 5, and 17.2 at round 7. Under fail-stop scoring, these drops quantify the cost of long-horizon state management: one regression can terminate the rest of the task.
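
To make the scoring mechanics concrete, here is a minimal sketch of fail-stop aggregation over per-round outcomes. It reflects the rule stated above, that one failed round terminates the remainder of the task; the round outcomes in the example are invented, and the benchmark's actual metric may aggregate differently across tasks and attempts.

# Illustrative fail-stop scoring: rounds are evaluated in order and the first
# failure ends the trajectory, so later rounds score zero even if the agent
# could have handled them in isolation.
def fail_stop_score(round_results: list[bool]) -> float:
    passed = 0
    for ok in round_results:
        if not ok:
            break  # one regression terminates the rest of the task
        passed += 1
    return passed / len(round_results)

# An agent that clears 4 of 7 rounds and then regresses scores 4/7, even though
# it might have passed the final two rounds from a clean reference state.
print(fail_stop_score([True, True, True, True, False, True, True]))  # ~0.571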

Failure annotations sharpen the interpretation. The largest category is not broad conversational forgetting: among high-confidence labels, most failures are concrete current-round specification misses under cumulative tests. Regression and stale-contract errors are minority modes overall, but they become more visible on correction and conflict rounds, where the agent must revise behavior without breaking still-active obligations.

11.4 RoadmapBench: Cross-Model Separation

RoadmapBench results reveal clear separation across model capabilities. Claude-Opus-4.6 leads with 32.2% resolved under OpenHands and 31.3% under Terminus 2. Failure modes differ systematically by capability level:

  • Top-tier models (Claude-Opus-4.6, GPT-5.4) complete core phases reliably but miss subtle edge cases — error behavior when constraints conflict, API contract details not explicitly spelled out, or integration between newly introduced modules.
  • Mid-tier models (Gemini-3.1-Pro, DeepSeek-V4-Pro, GLM-5.1) pass the first 1–2 phases but fail to integrate later phases that depend on earlier architectural decisions.
  • Lower-tier models fail at the initial architectural setup — placing new code in the wrong modules, missing export registrations, or breaking existing APIs before implementing new ones.

The Completion Score captures this gradient: even models with low resolve rates achieve meaningful partial credit (e.g., DeepSeek-V4-Pro: 18.3% resolved but 0.486 completion), confirming that phase-weighted scoring exposes capability differences invisible to binary evaluation.
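
A minimal sketch of phase-weighted partial credit makes the distinction explicit. The weights and outcomes below are invented for illustration; each RoadmapBench task defines its own phases and weights.

# Illustrative phase-weighted completion: partial credit accrues per passed
# phase, so an attempt that clears the core phases but misses late integration
# still registers progress that a binary resolved/unresolved score would hide.
def completion_score(phases: list[tuple[float, bool]]) -> float:
    total = sum(weight for weight, _ in phases)
    earned = sum(weight for weight, passed in phases if passed)
    return earned / total if total else 0.0

phases = [
    (3.0, True),   # core data structures
    (2.0, True),   # public API surface
    (2.0, False),  # integration with earlier architectural decisions
    (1.0, False),  # edge-case error behavior
]
print(completion_score(phases))  # 0.625, even though "resolved" would be 0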

11.5 Capability vs. Consistency: A Cross-Dataset Finding

Consistency is the cross-dataset gap

Across all three datasets, every released task has at least one successful attempt. Yet the same model can fail another attempt at the same task after choosing a different exploration order, fix strategy, or code structure. The shared source of score variance is therefore not raw capability but whether the model can reliably execute a multi-step engineering plan under uncertainty.

DeepTerminalBench surfaces this as Pass@4 variance on the same task. EvoCodeBench surfaces it as fail-stop failure on round N after correctly handling rounds 1 through N-1. RoadmapBench surfaces it through T-type/M-type attribution, where task-side defects are removed before model-side failures are analyzed. The common measurement target is consistency across long engineering trajectories.


12. Looking Forward: Toward Adaptive Evaluation

The remaining evaluation constraint is path divergence: two correct agents may reach the same behavior through different code structures, and tests must accept both without becoming too loose. EvoCodeBench handles this with behavioral-contract tests, RoadmapBench handles it with attribution-driven task fixes, and DeepTerminalBench uses Pass@k calibration to filter brittle tasks. These mechanisms reduce false failures, but they still rely on human experts to anticipate valid implementation paths.

The research direction is an adaptive evaluation loop that generates later instructions and tests from the agent's actual intermediate state:

The Vision: Adaptive Evaluation Loop

A Question-Generation Agent observes the agent's current system state and dynamically generates the next round's instruction, maintaining logical consistency while adapting to whatever implementation path the agent chose.

A Test-Generation Agent reads the current round's instruction and the agent's actual code, then generates verification scripts truly matched to the current state.

Question Agent (sees current state) -> generates next instruction
    -> Coding Agent executes
        -> Test Agent (sees instruction + code) -> generates tests -> verifies
            -> Question Agent (sees new state) -> generates next instruction
                -> ...

This would make later evaluation steps conditional on the implementation path actually taken by the agent. The current static datasets (191 tasks total across DeepTerminalBench, EvoCodeBench, and RoadmapBench) provide the fixed-task baseline for studying whether such adaptive evaluation improves attribution.
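
In code form, the loop reduces to three placeholder roles passed around a shared workspace. The sketch below only fixes the control flow; how each role is implemented (prompting, models, consistency checks, stopping criteria) is exactly the open research question, and every name here is hypothetical.

# Skeleton of the adaptive evaluation loop. The three agent roles are trivial
# placeholders; real question generation, coding, and test generation are the
# open problems, not shown here.
from dataclasses import dataclass, field

@dataclass
class Workspace:
    state: str = "initial repository"
    history: list[str] = field(default_factory=list)

def question_agent(ws: Workspace) -> str:
    # Placeholder: derive the next instruction from the agent's actual state.
    return f"next requirement consistent with: {ws.state}"

def coding_agent(ws: Workspace, instruction: str) -> None:
    # Placeholder: the evaluated agent edits the workspace to satisfy the instruction.
    ws.state = f"{ws.state} -> work for ({instruction})"

def test_agent(ws: Workspace, instruction: str) -> bool:
    # Placeholder: generate and run tests matched to this instruction and this code.
    return True

def adaptive_loop(rounds: int) -> Workspace:
    ws = Workspace()
    for _ in range(rounds):
        instruction = question_agent(ws)
        coding_agent(ws, instruction)
        if not test_agent(ws, instruction):
            break  # fail-stop, mirroring the static multi-turn setting
        ws.history.append(instruction)
    return ws

print(len(adaptive_loop(3).history))  # 3 rounds completed in this toy run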


13. Summary

  • Three directions of terminal-based evaluation: We expand Terminal-Bench along three orthogonal directions — depth, iteration, and evolution — all grounded in containerized terminal environments that exercise the same tools, constraints, and feedback loops as human engineers. Task designs draw from continuously growing real-world sources (GitHub, CVEs, RFCs, production postmortems, CS textbooks, and open-source release cycles), enabling the datasets to expand as the engineering landscape evolves.
  • DeepTerminalBench — depth: 50 curated tasks selected from a larger pool using Claude Opus 4.6 Pass@4 calibration, each a complex engineering task in a rich pre-configured environment (median 10 files, up to 50+). Verification is primarily dynamic and execution-based — programs are compiled, run, and their runtime behavior validated, including adversarial security payloads and performance gates. The main leaderboard reports Claude-Opus-4.7 at 34.0 Pass@1, with GPT-5.5-high, DeepSeek-V4-Pro, Kimi-K2.6 and Qwen-3.6-Plus all clustered at 32.0; the separate 200-rollout analysis uses four attempts per task to analyze task-difficulty variation and failure patterns.
  • EvoCodeBench — multi-turn iteration: 26 multi-turn tasks comprising 227 evaluated rounds capture the core stressors of iterative agent work: cumulative state management, requirement evolution, specification conflicts, and backward compatibility. Tasks are organized along two orthogonal axes — interaction style (explorative, contractual, document-driven) and engineering activity (construction, spec evolution, review-driven improvement, migration) — yielding 12 distinct task-type combinations. Evaluation across 13 models reports long-horizon score decay: Claude-Opus-4.7 leads at 54.0 MT@4, GPT-5.5-High follows at 52.4, Claude-Opus-4.6 remains at 44.0, and all other models are at or below 36.2. All tests verify behavioral contracts rather than implementation details, directly addressing path divergence in multi-turn evaluation.
  • RoadmapBench — version upgrade evolution: 115 tasks across 17 repositories in 5 programming languages (Python, TypeScript, C++, Go, Rust) target long-horizon feature implementation across real version releases. Each task gives the agent a repository pinned at an earlier release and a roadmap-style instruction; the agent must implement behaviors introduced in the target version without access to the new source code, test files, or oracle patch. Multi-phase weighted scoring captures partial progress (median 5 phases, max 12), and oracle patches are roughly two orders of magnitude larger than those in typical bug-fix benchmarks (~3,700 LOC median). Attribution-driven quality control separates task defects from model-side failures, so remaining failures are attributable to model behavior rather than dataset defects.
  • Shared infrastructure: All datasets are built on the Harbor evaluation framework. For multi-turn tasks, Harbor is extended with persistent environments, round boundary protocols, state snapshots, and partial evaluation support. For version upgrade tasks, Harbor manages pinned Docker environments, phase-level test orchestration, and weighted reward computation — enabling reproducible, scalable evaluation across all three dataset dimensions.