Terminal-X: Evaluate Coding Agents across Depth, Iteration, and Evolution in Terminal Environments
2026-05-02
Terminal-X tests whether coding agents can solve calibrated terminal tasks in one shot, preserve correctness across multi-turn requirement changes, and implement real version-upgrade roadmaps.
UniPat AI Coding Team
Main Results
Execution-Based Results Across the Three Datasets
The three result blocks correspond to three evaluation settings. DeepTerminalBench reports how often a model completes a calibrated single-shot terminal task. EvoCodeBench reports whether correctness carries across changing requirements: its table includes both persistent multi-turn success (MT@4) and isolated single-round success (SR) so the state-carrying gap is explicit. RoadmapBench reports how much of a real version-upgrade roadmap a model completes under phase-weighted tests.
DeepTerminalBench
Pass@1 counts a task as solved when tests/test.sh verification returns reward = 1.0. Avg Turns reports how many agent-model exchanges a run uses on average. Output Tok. (K) reports how many tokens the model generates during those exchanges, in thousands.
Calibration: All 50 tasks are validated under Claude-Opus-4.6 with four attempts. Every task is solved in at least one of the four attempts, so each task is reachable for a frontier-tier model rather than impossible by construction. The curated subset specifically retains tasks where strong models still fail under single-attempt Pass@1, preserving headroom for the leaderboard.
| # | Model | Pass@1 | Pass / Fail |
|---|---|---|---|
| 🥇1 | Claude-Opus-4.7 | 34.0% | 17/33 |
| 🥈2 | GPT-5.5-high | 32.0% | 16/34 |
| 🥉3 | DeepSeek-V4-Pro | 32.0% | 16/34 |
| 4 | Kimi-K2.6 | 32.0% | 16/34 |
| 5 | Qwen-3.6-Plus | 32.0% | 16/34 |
| 6 | GLM-5.1 | 30.0% | 15/35 |
| 7 | Qwen-3.5-397B-A17B | 24.0% | 12/38 |
| 8 | Gemini-3.1-Pro | 16.0% | 8/42 |
| 9 | MiniMax-M2.7 | 10.0% | 5/45 |
EvoCodeBench
| # | Model | MT@4 | SR | Comp |
|---|---|---|---|---|
| 🥇1 | Claude-Opus-4.7 | 54.0 | 76.7 | 42.3 |
| 🥈2 | GPT-5.5-High | | 74.4 | 38.5 |
| 🥉3 | Claude-Opus-4.6 | 44.0 | 78.9 | 34.6 |
| 4 | GLM-5.1 | 36.2 | 63.9 | 15.4 |
| 5 | Kimi-K2.6 | 31.9 | 59.0 | 23.1 |
| 6 | DeepSeek-V4-Pro | | 56.4 | 19.2 |
| 7 | Qwen3.6-Plus | | 57.3 | 15.4 |
| 8 | Xiaomi-MiMo-V2.5-Pro | | 7.9 | 11.5 |
| 9 | Gemini-3.1-Pro-Preview | | 46.7 | 11.5 |
| 10 | DeepSeek-V4-Flash | | 46.3 | 0.0 |
| 11 | Qwen3.5-397B-A17B | | 44.1 | 0.0 |
| 12 | MiniMax-M2.7 | | 30.0 | 0.0 |
RoadmapBench
OpenHands
| # | Model | Resolved % | Completion |
|---|---|---|---|
| 🥇1 | Claude-Opus-4.7 | | 0.702 |
| 🥈2 | Claude-Opus-4.6 | 32.2 | 0.627 |
| 🥉3 | GPT-5.4 | | 0.497 |
| 4 | Gemini-3.1-Pro | | 0.439 |
| 5 | DeepSeek-V4-Pro | | 0.486 |
| 6 | GLM-5.1 | | 0.453 |
| 7 | Kimi-K2.6 | | 0.456 |
| 8 | Qwen3.6-Plus | | 0.424 |
| 9 | Kimi-K2.5 | | 0.378 |
| 10 | MiniMax-M2.7 | | 0.332 |
1. Introduction
We believe this paradigm—real shells, real filesystems, real build and test tooling, and execution-verified rewards—provides the right substrate for evaluating LLM coding agents. We build on the Harbor task format because it offers a standardized, reproducible way to place agents in executable engineering environments where correctness can be verified by running code rather than judging surface-level outputs. On top of this substrate, we construct three datasets along orthogonal capability dimensions, together covering the full spectrum of real-world coding work:
DeepTerminalBench
Can an agent solve a complex engineering task in one shot? We independently develop a large pool of Terminal-Bench 2.0–compatible tasks, fully conforming to the Harbor task specification. From this pool we curate 50 tasks — each a self-contained engineering task inside a Docker container, calibrated so that the top model succeeds on fewer than half of attempts.
EvoCodeBench
Can an agent sustain quality across many rounds of evolving requirements? We introduce the first systematic multi-turn coding evaluation dataset: 26 tasks comprising 227 rounds of evolving requirements, covering 4 engineering activities crossed with 3 interaction styles. Tasks feature cumulative state, requirement evolution, specification conflicts, and backward-compatibility constraints — mirroring how coding agents are actually used.
RoadmapBench
Can an agent implement substantial new functionality across a real library version upgrade? We present the first dataset targeting long-horizon feature implementation from release roadmaps: 115 tasks across 17 repositories in 5 programming languages, each requiring an agent to bring a repository pinned at an earlier release up to the target version's behavior, guided only by a roadmap-style instruction.
The three datasets use different task formats, but the admission criteria are shared:
- Quality: Every task is grounded in real-world engineering scenarios, oracle-verified (reward = 1.0), and reviewed by domain engineering experts for technical accuracy and test fairness.
- Diversity: DeepTerminalBench tasks span 20+ domains and 12 workload types; multi-turn tasks span 4 engineering activities and 3 interaction styles; version upgrade tasks span 17 repositories in 5 programming languages — ensuring no critical skill is left untested.
- Scalability: Task designs draw from continuously growing real-world sources (GitHub, Stack Overflow, CVE databases, RFCs, production postmortems, CS textbooks, and open-source release cycles). New domains, languages, and difficulty tiers can be added as the engineering landscape evolves.
2. Three Datasets at a Glance
This table fixes the comparison axes used throughout the article: evaluation axis, task unit, starting state, scoring rule, and difficulty source.
| Dimension | DeepTerminalBench | EvoCodeBench | RoadmapBench |
|---|---|---|---|
| Evaluation axis | Depth | Iteration | Evolution |
| Central question | Can the agent solve a complex engineering task in one shot? | Can the agent sustain quality across many rounds of evolving requirements? | Can the agent implement substantial new functionality across a real version upgrade? |
| Task count | 50 curated tasks | 26 multi-turn tasks | 115 version-upgrade tasks |
| Unit of work | 1 instruction → 1 verified delivery | 5–15 rounds, cumulative tests | 3–12 phases, weighted by importance |
| Starting state | Pre-configured Docker env (median 10 files, up to 50+) | Empty / skeleton / legacy system (varies by task) | Real repository snapshot pinned at the earlier release |
| Scoring | Binary (pass/fail) | Mean of per-round binary rewards, fail-stop | Weighted sum over phases |
| Coverage axes | Domain × Workload (20+ × 12) | Interaction Style × Engineering Activity (3 × 4) | Language × Repository (5 × 17) |
| Harbor extension needed | None — uses the stock single-shot harness | Persistent environment and round boundary protocol | Pinned baseline image, phase-level test orchestration |
| Primary source of difficulty | Multi-file diagnosis, deep domain knowledge, simultaneous constraints | Cumulative state, requirement evolution, backward compatibility | Long-horizon feature synthesis under API ambiguity |
Use DeepTerminalBench to measure single-session engineering competence on a dense self-contained task. Use EvoCodeBench to measure how an agent handles changing requirements, specification conflicts, and regression risk across a working session. Use RoadmapBench to measure whether an agent can synthesize substantial new code against a roadmap in an evolving repository.
3. Shared Foundation: The Harbor Task Spec
All three datasets use the Harbor task specification introduced by Terminal-Bench 2.0. This makes the benchmark runnable by Harbor-compatible scaffolds such as Claude Code, OpenHands, Codex CLI, and Terminus-2, while keeping the new contributions concentrated in task semantics: multi-turn state for EvoCodeBench and version-upgrade structure for RoadmapBench.
A stock single-shot Harbor task contains five components:
| Component | Role |
|---|---|
| instruction.md | What the agent must do: natural-language specification with requirements, constraints, expected outputs. |
| environment/ | Where it works: Dockerfile with all dependencies, plus baseline code/data/fixtures. |
| tests/ | How correctness is judged: test suite plus a shell harness producing a binary reward. |
| solution/ | The ground truth: reference oracle implementation achieving reward = 1.0. |
| task.toml | Machine-readable metadata: domain, difficulty, resource limits, timeouts. |
In all three datasets, the agent receives only instruction.md and the environment. It cannot access tests/ or solution/. The test script produces a reward signal (binary in DeepTerminalBench, per-round in EvoCodeBench, weighted in RoadmapBench) that is never revealed during the run.
Each dataset extends this stock format in a different way. The next section walks through the three variants.
4. Task Format: Three Variants
4.1 DeepTerminalBench: Flat Structure
DeepTerminalBench uses the stock Harbor task layout unchanged:
├── task.toml
├── instruction.md
├── environment/
│ └── Dockerfile # runtime + pre-configured files
├── solution/
│ └── solve.sh # reference solution
└── tests/
    └── test.sh # verification script
The execution model is simple: Harbor builds the Docker environment, delivers instruction.md to the agent, the agent produces code, and tests/test.sh checks correctness. One instruction, one execution, one verification.
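The flat layout lends itself to a mechanical well-formedness check. The following is a minimal sketch that assumes only the file names shown in the tree above; `REQUIRED_PATHS` and `missing_components` are hypothetical names introduced for illustration and are not part of Harbor.

```python
from pathlib import Path

# Relative paths taken from the directory tree above.
# The helper below is illustrative only, not a Harbor API.
REQUIRED_PATHS = (
    "task.toml",
    "instruction.md",
    "environment/Dockerfile",
    "solution/solve.sh",
    "tests/test.sh",
)


def missing_components(task_dir):
    """Return the required relative paths that are absent under task_dir."""
    root = Path(task_dir)
    return [rel for rel in REQUIRED_PATHS if not (root / rel).is_file()]
```

An empty return value means all five components are present; any returned path points at the piece a task author still has to supply.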
4.2 Multi-Turn (EvoCodeBench): Rounds with Cumulative Tests
For multi-turn tasks, we introduce round directories. Each round carries its own instruction, solution, and tests, while the environment and top-level metadata remain shared:
├── task.toml # metadata (round count, change types)
├── instruction.md # top-level task description
├── environment/
│ └── Dockerfile # shared across all rounds
├── round_1/
│ ├── instruction.md
│ ├── solution/solve.sh # incremental: this round's delta only
│ └── tests/test.sh # cumulative: verifies all still-valid behavior up to here
├── round_2/ ...
└── round_N/
Three things make this work. First, solutions are incremental — each round's solve.sh applies only that round's delta. Second, tests are cumulative — round N's tests re-verify everything still valid from rounds 1 through N, so regressions caused by later changes are caught immediately. Third, tests verify behavioral contracts rather than implementation details: instructions describe what the system should do, tests check the system's external behavior, and round N's tests cannot assume the agent chose the same code structure as the reference solution in round N-1. Together these principles allow different agents to reach equivalent behavioral goals through divergent implementation paths without breaking the evaluation.
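The "cumulative but still valid" rule can be illustrated with a small data model. This is a sketch under stated assumptions: each round is described by the behavioral contracts it adds and the earlier contracts it revokes. The `adds` and `revokes` keys and the contract names are hypothetical and do not reflect the actual task.toml schema.

```python
def cumulative_contracts(rounds):
    """Return, for each round, the contracts that round's tests re-verify.

    rounds: list of dicts with optional "adds" and "revokes" lists of
    contract names (hypothetical schema). A revoked contract drops out of
    every later round; everything else carries forward.
    """
    active = set()
    per_round = []
    for spec in rounds:
        active -= set(spec.get("revokes", ()))
        active |= set(spec.get("adds", ()))
        per_round.append(sorted(active))
    return per_round


# Round 3 replaces CSV parsing with TSV parsing but must keep JSON export.
example = cumulative_contracts([
    {"adds": ["parse_csv"]},
    {"adds": ["export_json"]},
    {"adds": ["parse_tsv"], "revokes": ["parse_csv"]},
])
# example == [["parse_csv"], ["export_json", "parse_csv"], ["export_json", "parse_tsv"]]
```

In this toy case, a round 3 submission that breaks JSON export fails even though round 3 never mentions it, which is the regression-catching behavior described above.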
4.3 Version Upgrade (RoadmapBench): Repository Snapshot + Roadmap Phases
RoadmapBench keeps the single-shot execution model but reshapes what each component contains. The environment is a real-world repository snapshot pinned at an earlier release (with commits and tags after that point stripped to prevent information leakage), and the instruction is a multi-phase roadmap describing behaviors introduced in the target version:
├── Background — why this functionality is needed
├── Behavioral specification
│ ├── API signatures, parameter semantics, default values
│ ├── Expected error behaviors and edge cases
│ └── Export paths and import locations
└── Integration context — how it fits the existing codebase
As in EvoCodeBench, the roadmap specifies WHAT to build but never HOW: no algorithm names, implementation steps, or code snippets. Tests are organized as phase-level files (test_01_*.py, test_02_*.py, …) adapted from the official upstream test suite, and a runner computes a weighted reward across phases.
The agent may freely inspect and modify the repository, but has no access to the target version source code, the test files, or the oracle patch. The repository identity and version numbers are withheld from the instruction to prevent shortcut-based lookup.
4.4 Side-by-Side Structural Comparison
| | DeepTerminalBench | EvoCodeBench | RoadmapBench |
|---|---|---|---|
| Instruction | Single instruction.md | One instruction.md per round directory (plus top-level summary) | Multi-phase roadmap in one instruction.md |
| Environment | Pre-configured Dockerfile | Shared Dockerfile across all rounds | Pinned earlier-release repository snapshot |
| Tests | One test.sh producing binary reward | Cumulative test.sh per round | Phase-level test_NN_*.py + weighted runner |
| Solution | One solve.sh | Per-round incremental solve.sh | Oracle patch derived from real cross-version diff |
| Session model | One container, one agent session, one submission | One container and one agent session spanning all rounds | One container, one agent session, one submission |
5. Sources & Coverage
5.1 A Shared Pool of Real-World Problems
All three datasets draw from the same kinds of real-world engineering artifacts — the scenarios that practicing engineers actually encounter. Grounding tasks in these sources keeps the evaluation tied to observable engineering work and gives the datasets a renewable source of future tasks.
GitHub Repositories and Issue Trackers
Real project structures, build configurations, and bug reports. B-tree corruption tasks are modeled on LevelDB/BoltDB/SQLite; CI recovery tasks draw from real .github/workflows. RoadmapBench goes one step further and uses actual cross-version diffs from 17 open-source repositories as ground truth.
Stack Overflow and Developer Forums
Recurring pain points — segfaults in multithreaded C, cron timezone bugs, CommonJS-to-ESM migrations — signal what engineers actually struggle with.
CVE Databases and Security Advisories
Published vulnerabilities (path traversal, credential exposure, ReDoS, timing-unsafe comparisons) provide ready-made attack surfaces. Security hardening tasks are built from real CWE categories with concrete exploit payloads.
RFCs, Format Specs and Protocol Standards
PCAP binary format, WAV/RIFF headers, tar archive structure, HMAC-SHA256 chains, OpenAPI specifications — real-world standards that practitioners must implement correctly.
Production Incident Postmortems and SRE Runbooks
Cascading failures, WAL corruption after unclean shutdowns, Kubernetes pod crash loops, distributed lock starvation — scenarios on-call engineers actually face.
CS Textbooks, Papers, and Release Cycles
Raft consensus, Chandy-Lamport snapshots, dominator tree algorithms, Fibonacci heaps — algorithmic depth that cannot be derived from shallow pattern matching. For RoadmapBench specifically, release narratives and upstream test suites provide the ground truth for version-upgrade tasks.
These sources are continuously growing. New CVEs are published daily, new open-source projects introduce novel architectures, new RFCs define new protocols, and every release cycle yields fresh version pairs — all three datasets can expand without exhausting authentic task sources.
5.2 Three Complementary Taxonomies
Each dataset uses the two axes that expose its main source of score variation. DeepTerminalBench varies engineering domain and workload type; EvoCodeBench varies interaction style and engineering activity; RoadmapBench varies programming language and source repository.
DeepTerminalBench: Domain × Workload Type
Tasks are organized along domain (what area of engineering or science) and workload type (what kind of work the agent must perform). The cross-product covers over 20 domains crossed with 12 workload types. The 50 curated tasks sample broadly rather than exhaustively covering every combination; the taxonomy reserves additional domains for future expansion, so adding one new domain immediately creates 12 new domain-workload combinations.
Domains span: Code and Build, Repo Engineering, Testing and Quality, Debugging and Observability, Data Engineering, Database and Storage, Systems and Networking, Cloud and DevOps, Security, ML/AI and MLOps, Scientific and Numerical Computing, Automation and Productivity, Compiler and Language Tooling, File Format and Protocol, Distributed Systems and Concurrency, API and Schema Design, Data Structures and Algorithms, Simulation and Modeling, CLI and Interactive Tools, Workflow and Rule Engine — with Embedded/IoT, Container Orchestration, Cryptography, and Network Protocol Engineering reserved for expansion.
Workload types include: Greenfield Implementation, Brownfield Modification, Bug Localization and Fix, Migration and Upgrade, Integration and E2E Wiring, Performance Optimization, Reliability and Recovery, Security Patch and Hardening, Reproducibility and Verification, Forensics and Analysis, Automation Scripting, and Explanation and Reporting.
EvoCodeBench: Interaction Style × Engineering Activity
Multi-turn tasks vary along interaction style (how information flows across rounds) and engineering activity (what type of work each round involves).
Vibe Coding Style
Requirements emerge through interaction. Round 1 is detailed; subsequent rounds shrink drastically, often to a single sentence.
Agentic Style
The user knows what they want but can't specify it all at once. Each round provides detailed specs of roughly equal length; later rounds may revise earlier behavior.
Doc-Driven Style
Key semantics live in project artifacts (specs, schemas, AGENTS.md). Instructions simply say "implement per the doc" or "doc updated, sync implementation."
Incremental Construction
The agent must build new functionality over several rounds, where each round adds features that later rounds depend on.
Spec Evolution & Conflict
The agent must adapt to new specs while preserving unchanged old behavior. At least one round explicitly overturns a core assumption established earlier.
Review-Driven Improvement
Round 1 delivers full functionality. Subsequent rounds are code review feedback: performance, error handling, security, logging — all while keeping existing behavior unchanged.
Migration & Modernization
Starting from a complete legacy system, migrate incrementally to a new paradigm while preserving external behavior at every step.
The 3 interaction styles crossed with 4 engineering activities yield 12 task-type combinations. Allocation across EvoCodeBench:
| Activity \ Style | Vibe Coding (Explorative) | Agentic (Contractual) | Doc-Driven |
|---|---|---|---|
| Construction | 8 | 4 | 1 |
| Spec Evolution & Conflict | 1 | 1 | 1 |
| Review-Driven Improvement | 3 | 1 | 1 |
| Migration & Modernization | 3 | 1 | 1 |
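As a quick consistency check, the allocation can be totalled by row and by column. The sketch below copies the counts from the table; the variable names are illustrative only, and the tuple positions follow the table's left-to-right style order.

```python
# Counts copied from the allocation table above.
# Tuple positions follow the table's three style columns, left to right.
ALLOCATION = {
    "Construction": (8, 4, 1),
    "Spec Evolution & Conflict": (1, 1, 1),
    "Review-Driven Improvement": (3, 1, 1),
    "Migration & Modernization": (3, 1, 1),
}

# Tasks per engineering activity (row sums).
row_totals = {activity: sum(counts) for activity, counts in ALLOCATION.items()}

# Tasks per interaction style (column sums).
column_totals = [sum(col) for col in zip(*ALLOCATION.values())]

# Overall task count and number of populated style/activity combinations.
total_tasks = sum(row_totals.values())
populated_cells = sum(1 for counts in ALLOCATION.values() for c in counts if c > 0)
```

The totals come to 26 tasks with all 12 combinations populated, matching the dataset size stated earlier. More than half of the tasks sit in the first style column.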
RoadmapBench: Programming Language × Repository Domain
RoadmapBench's axes reflect a reality: software engineers work across languages, and library evolution patterns recur in every ecosystem. The current release covers 17 repositories across 5 languages.
Python
Polars, PyG, Optuna, Falcon, spaCy: ML frameworks, dataframe engines, web servers, NLP pipelines.
41 tasks • 5 repos
TypeScript
MikroORM, Prisma, Valibot: ORM frameworks, database clients, validation libraries.
22 tasks • 3 repos
C++
Glaze, thread-pool: high-performance serialization and concurrent task scheduling.
20 tasks • 2 repos
Go
Fiber, Kitex, Fyne: web frameworks, RPC frameworks, GUI toolkits.
17 tasks • 3 repos
Rust
Ratatui, Diesel, Slint, Ruff: TUI frameworks, ORM, GUI toolkits, linting tools.
15 tasks • 4 repos
Why Three Different Taxonomies
Each taxonomy answers a different diagnostic question. DeepTerminalBench's Domain×Workload table shows which kinds of engineering work a model handles in a single dense session. EvoCodeBench's Interaction×Activity table shows which forms of requirement change break multi-turn reliability. RoadmapBench's Language×Repository table shows where version-upgrade ability depends on ecosystem-specific APIs, build systems, and conventions.
6. Dataset Statistics
6.1 Side-by-Side Task Statistics
| Metric | DeepTerminalBench | EvoCodeBench | RoadmapBench |
|---|---|---|---|
| Task count | 50 curated tasks | 26 multi-turn tasks | 115 tasks across 17 repos |
| Unit of work per task | 42 agent steps (median) | 5–15 rounds | 5 phases median, up to 12 |
| Total evaluation units | — | 227 rounds | ~590 phases |
| Instruction length (median) | ~5,000 chars / ~650 words | Varies by round; round 1 typically largest | Multi-phase; each phase ~300–800 words |
| Reference solution size (median) | ~680 lines across solution/ | Cumulative across rounds | ~3,700 LOC per oracle patch |
| Test code (median) | ~570 lines, 32 test functions | Per-round cumulative test.sh | Phase-level test files from upstream |
| Pre-configured env files (median) | 10 files, up to 50+ | Empty to complete legacy system | Full earlier-release repository |
| Difficulty distribution | Opus four-attempt pass rate (Pass@4): 22 at 0.25 • 17 at 0.50 • 11 at 0.75 | 12 task types; round count distribution 5–15 | Median 5 phases; oracle patch ~3,700 LOC median |
6.2 Difficulty Calibration: Three Methodologies
The three datasets calibrate difficulty in different ways because difficulty means different things across the three settings.
DeepTerminalBench: Four-Attempt Pass Rate (Pass@4) Band [0.25, 0.75]
50 tasks were selected from a larger pool so that Claude Opus 4.6 achieves Pass@4 between 0.25 and 0.75. This filters out both unsolved tasks and saturated tasks. 22 tasks land at Pass@4 = 0.25 (lowest success band), 17 at 0.50, 11 at 0.75.
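The band filter can be sketched as follows. This is an illustrative reading of the selection rule, not the authors' pipeline: `four_attempt_pass_rate` and `select_calibrated` are hypothetical names, and the toy pool is invented for the example.

```python
def four_attempt_pass_rate(rewards):
    """Fraction of the four attempts that returned reward = 1."""
    if len(rewards) != 4:
        raise ValueError("expected exactly four binary rewards")
    return sum(rewards) / 4


def select_calibrated(pool, low=0.25, high=0.75):
    """Keep tasks whose four-attempt pass rate lies in [low, high].

    pool: dict mapping a task id to its four binary rewards (toy data here).
    """
    rates = {tid: four_attempt_pass_rate(r) for tid, r in pool.items()}
    return {tid: rate for tid, rate in rates.items() if low <= rate <= high}


toy_pool = {
    "never_solved": [0, 0, 0, 0],   # rate 0.00, dropped as unsolved
    "hard": [1, 0, 0, 0],           # rate 0.25, kept
    "medium": [1, 1, 0, 0],         # rate 0.50, kept
    "easy": [1, 1, 0, 1],           # rate 0.75, kept
    "saturated": [1, 1, 1, 1],      # rate 1.00, dropped as saturated
}
selected = select_calibrated(toy_pool)
# selected == {"hard": 0.25, "medium": 0.5, "easy": 0.75}
```

With four attempts the only rates inside the band are 0.25, 0.50, and 0.75, which is why the released tasks fall into exactly three difficulty bands.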
EvoCodeBench: Fail-Stop + Cumulative Tests
Difficulty is structural rather than statistical. Later rounds raise difficulty when they stress cross-round dependencies: specification conflicts, backward compatibility, and multiple valid implementation paths. The fail-stop design makes partial progress measurable: an agent that passes rounds 1–3 but fails round 4 receives 3/N rather than 0.
RoadmapBench: Multi-Phase Weighted Scoring
Difficulty emerges naturally from the number and interdependence of phases. Tasks range from 3 to 12 phases (median 5), with core architectural phases weighted higher than peripheral utilities. The weighted scoring produces a continuous difficulty gradient: stronger models complete more phases and earn higher partial credit, while weaker models fail at earlier architectural boundaries. Oracle patches are ~3,700 LOC median — roughly two orders of magnitude larger than bug-fix benchmarks.
6.3 Agent Behavior Under Evaluation
DeepTerminalBench: Opus 4.6 Across 200 Rollouts (50 tasks × 4 attempts)
| Metric | Mean | Median | Min | Max |
|---|---|---|---|---|
| Agent steps (tool calls) | 42.5 | 42 | 14 | 160 |
| Duration per attempt | 17.4 min | 14.8 min | 8.7 min | 39.1 min |
| Input tokens | 2.1M | 1.7M | 445K | 7.6M |
| Output tokens | 35K | 29K | 8.7K | 76K |
DeepTerminalBench takeaways:
- 44% of tasks trigger at least one timeout across 4 attempts, indicating that many tasks push agents to their execution budget limits.
- ~83% of tasks include pre-existing environment files that the agent must read and work with before writing anything new.
- Lower Pass@4 corresponds to longer attempts — Pass@4 = 0.25 tasks average 31 minutes per attempt, while Pass@4 = 0.75 tasks average 15 minutes.
EvoCodeBench: Multi-Turn Results Across 13 Models
| Metric | Best Model | Best Score | Key Observation |
|---|---|---|---|
| MT@4 | Claude-Opus-4.7 | 54.0 | Claude-Opus-4.7 and GPT-5.5-High are the only models above 50 MT@4. |
| Single-round pass rate (SR) | Claude-Opus-4.6 | 78.9 | High isolated-round pass rate does not imply high persistent multi-round score. |
| Full-task completion | Claude-Opus-4.7 | 42.3 | The top completion rate remains below half of tasks. |
| Round degradation | All models | 46.7 → 21.3 | Average pass rate falls from round 1 to round 5 under fail-stop scoring. |
EvoCodeBench takeaways:
- Persistent execution scores lower than isolated-round evaluation. Claude-Opus-4.6 reaches 78.9 SR but 44.0 MT@4; GLM-5.1 drops from 63.9 SR to 36.2 MT@4; Kimi-K2.6 drops from 59.0 SR to 31.9 MT@4.
- End-to-end completion remains limited. Claude-Opus-4.7 has the highest full-task completion at 42.3, followed by GPT-5.5-High at 38.5 and Claude-Opus-4.6 at 34.6.
- The common failure symptom is concrete, not generic. In 957 high-confidence annotations from executed failed round fragments, 89.1% are missed active-round obligations, 6.5% are environment/tooling failures, and 3.4% are regressions. Correction rounds raise the regression share to 11.2%; conflict-labeled rounds raise conflict mishandling to 14.9%.
- Persistent state is the central stressor. The same round instruction receives higher scores when earlier rounds have already been applied by the reference solution than when the model must live with its own earlier edits.
RoadmapBench: Phase-Weighted Scoring Produces Graded Separation
The leaderboard results above demonstrate that multi-phase weighted scoring naturally separates model capabilities: Claude-Opus-4.6, for example, achieves 32.2% resolved and a 0.627 completion score. Crucially, even models with low resolve rates achieve meaningful completion scores, confirming that phase-level partial credit exposes capability gradients invisible to binary pass/fail evaluation.
Across all three datasets, tasks require sustained multi-step engineering. The largest DeepTerminalBench rollout reaches 160 agent steps and nearly 40 minutes of wall-clock time. EvoCodeBench tasks span 5–15 rounds of evolving state, with 227 cumulative verification points. RoadmapBench oracle patches have a median size of ~3,700 lines, and tasks have a median of 5 phases. The evaluation requires agents to maintain context, correlate information across files, and make engineering tradeoffs across an extended terminal session.
6.4 Comparison with Related Benchmarks
| Benchmark | Tasks | Task Type | Unit of Work | Oracle Size | Scoring |
|---|---|---|---|---|---|
| HumanEval / MBPP | 164 / 974 | Algorithm | 1 function | <20 LOC | Binary |
| SWE-bench Verified | 500 | Bug fix | 1 patch | ~33 LOC | Binary |
| Terminal-Bench 2.0 | 89 | Mixed single-shot | 1 session | — | Binary |
| DeepTerminalBench | 50 | Deep single-shot | ~42 steps | ~680 LOC | Binary (Pass@k calibrated) |
| EvoCodeBench | 26 | Multi-turn | 5–15 rounds | Cumulative | Mean of rounds, fail-stop |
| RoadmapBench | 115 | Feature impl. | 5 phases | ~3,700 LOC | Weighted by phase |
Oracle patches in RoadmapBench are roughly two orders of magnitude larger than those in bug-fix benchmarks, confirming that version-upgrade tasks require sustained multi-step engineering rather than single-function edits.
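The "two orders of magnitude" figure follows directly from the oracle sizes in the table. A short check, using only the ~3,700 LOC and ~33 LOC values quoted there (the constant names are illustrative):

```python
import math

# Oracle sizes copied from the comparison table above (approximate values).
ROADMAPBENCH_ORACLE_LOC = 3700
SWE_BENCH_VERIFIED_ORACLE_LOC = 33

# Size ratio and its base-10 logarithm (orders of magnitude).
ratio = ROADMAPBENCH_ORACLE_LOC / SWE_BENCH_VERIFIED_ORACLE_LOC
orders_of_magnitude = math.log10(ratio)
```

The ratio is a little over 110, so the base-10 logarithm lands just above 2.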
7. Verification & Scoring
7.1 Shared Principle: Execution-Based Verification
All three datasets emphasize dynamic, execution-based testing over static output checking. The majority of verification compiles, executes, and observes program behavior at runtime. Static output checks supplement execution-based verification but do not replace it. This layering prevents "teaching to the test" — solutions that produce correct output but use incorrect or insecure implementations will fail behavioral or attack-based verification.
For DeepTerminalBench, the median task employs three verification strategies simultaneously. The prevalence of each strategy across the 50 tasks:
| Verification strategy | Share of tasks |
|---|---|
| Program execution / exit code | ~80% |
| Performance / timeout enforcement | ~72% |
| Build / compilation verification | ~46% |
| Database state queries | ~24% |
| Adversarial security payloads | ~22% |
| Output file existence / structure | ~84% |
| Output value correctness | ~78% |
| Cryptographic integrity checks | ~30% |
| AST / pattern matching | ~12% |
| File diff / binary comparison | ~10% |
EvoCodeBench and RoadmapBench tests follow the same execution-first philosophy but apply it at the round/phase level rather than task level.
7.2 Three Scoring Schemes
Because the three datasets have different task units (single session vs. multiple rounds vs. multiple phases), each uses a different scoring formula. All three share one principle: the scoring unit is a binary pass at the lowest level, and the task score aggregates these.
DeepTerminalBench: Binary Reward
Each task produces a single binary reward: pass (1) or fail (0). The verification script exercises the delivered solution end-to-end with both dynamic and static checks; any failure collapses the reward to 0. The main leaderboard reports Pass@1: one attempt per task, averaged across all 50 tasks. Separately, Pass@k across multiple attempts yields per-task success rates used for difficulty calibration.
Formula: $\mathrm{Pass@1} = \frac{1}{T} \sum_{t=1}^{T} \mathrm{reward}_t$, where $\mathrm{reward}_t \in \{0, 1\}$ and $T = 50$.
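As a worked instance of the formula, a 17 pass / 33 fail split over the 50 tasks gives Pass@1 = 0.34. The helper name `pass_at_1` is illustrative only.

```python
def pass_at_1(rewards):
    """Mean of per-task binary rewards, one attempt per task."""
    if not rewards:
        raise ValueError("need at least one task reward")
    return sum(rewards) / len(rewards)


# 17 passes and 33 failures across T = 50 tasks.
rewards = [1] * 17 + [0] * 33
score = pass_at_1(rewards)  # 17 / 50 = 0.34
```

Because every task contributes 0 or 1, one additional solved task moves the score by exactly 1/50 = 0.02.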
EvoCodeBench: Mean of Per-Round Binary Rewards, Fail-Stop
Each round produces a binary reward (pass / fail). A failed round terminates the task immediately (fail-stop), and any rounds not executed count as 0. Because of fail-stop, the reward sequence is always a run of 1s followed by 0s — never a gap.
Formula: $\mathrm{Score} = \frac{1}{N} \sum_{i=1}^{N} r_i$ where $r_i \in \{0, 1\}$
Concretely, in a 5-round task: passing all 5 rounds scores 1.0; passing rounds 1–3 but failing round 4 scores 3/5 = 0.6; failing round 1 scores 0.0. The main table reports MT@4 by crediting a round if any of four independent multi-turn attempts reaches it successfully: $\mathrm{MT@4} = \operatorname{mean}_{t} \frac{1}{N_t} \sum_{i=1}^{N_t} \max_{a \le 4} r_{t,a,i}$. It reports SR as $\operatorname{mean}_{t,i} \, s_{t,i}$, where $s_{t,i}$ is the binary reward for a single target round after fast-forwarding earlier rounds with reference solutions. Comp is $\operatorname{mean}_{t} \, \mathbb{1}\left[\max_{a \le 4} r_{t,a,N_t} = 1\right]$. Avg. Turns and Output Tok. (K) report the interaction count and generated-token cost required to obtain those trajectories. Per-round rewards are recorded as intermediate artifacts and can serve as dense training signals for fine-tuning or RL.
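The three aggregate metrics can be computed from per-round rewards as sketched below. This is an illustrative reading of the definitions in the text, assuming rewards are stored as nested lists indexed by task, attempt, and round. All function names and the toy data are hypothetical.

```python
def fail_stop(outcomes):
    """Zero out every round after the first failure."""
    rewards, alive = [], True
    for ok in outcomes:
        alive = alive and bool(ok)
        rewards.append(1 if alive else 0)
    return rewards


def task_score(rewards):
    """Mean of per-round binary rewards for one attempt."""
    return sum(rewards) / len(rewards)


def mt_at_k(tasks):
    """tasks[t][a][i]: fail-stop reward of attempt a on round i of task t."""
    per_task = []
    for attempts in tasks:
        n_rounds = len(attempts[0])
        # Credit a round if any attempt reached it successfully.
        best = [max(a[i] for a in attempts) for i in range(n_rounds)]
        per_task.append(sum(best) / n_rounds)
    return sum(per_task) / len(per_task)


def comp(tasks):
    """Share of tasks whose final round passes in at least one attempt."""
    done = [max(a[-1] for a in attempts) for attempts in tasks]
    return sum(done) / len(done)


def single_round(isolated):
    """isolated[t][i]: reward for round i of task t run in isolation."""
    flat = [s for rounds in isolated for s in rounds]
    return sum(flat) / len(flat)


# Toy data: one 5-round task and one 2-round task, four attempts each.
toy = [
    [fail_stop(o) for o in ([1, 1, 1, 0, 1], [1, 1, 0, 0, 0],
                            [1, 0, 0, 0, 0], [1, 1, 1, 1, 0])],
    [fail_stop(o) for o in ([1, 1], [0, 1], [1, 0], [0, 0])],
]
mt4 = mt_at_k(toy)       # (4/5 + 2/2) / 2
completion = comp(toy)   # only the second task is ever finished
```

Note that `fail_stop([1, 1, 1, 0, 1])` yields `[1, 1, 1, 0, 0]`: the round 5 outcome is discarded because round 4 failed, so that attempt scores 3/5.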
RoadmapBench: Weighted Sum Over Phases
Each task is divided into n phases. Let $p_i \in \{0, 1\}$ indicate whether phase i passes and $w_i$ denote its weight. The task reward is:
Formula: $\mathrm{reward} = \sum_i w_i \cdot p_i / \sum_i w_i$
Weights reflect the architectural centrality of each phase — core modules that later phases depend on receive higher weights than peripheral utilities. A model that completes 3 of 5 phases scores above 0 rather than being indistinguishable from one that implements 0 phases, providing a denser signal than binary scoring.
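A minimal sketch of the weighted reward follows. The weights in the example are invented for illustration; the source does not publish per-phase weights, and `phase_weighted_reward` is a hypothetical name.

```python
def phase_weighted_reward(passed, weights):
    """Weighted share of passed phases: sum(w_i * p_i) / sum(w_i)."""
    if len(passed) != len(weights) or not weights:
        raise ValueError("passed and weights must be non-empty and equal length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w for p, w in zip(passed, weights) if p) / total


# Hypothetical 5-phase task: two core phases weighted 3, one weighted 2,
# and two peripheral phases weighted 1.
weights = [3, 3, 2, 1, 1]
core_only = phase_weighted_reward([1, 1, 1, 0, 0], weights)        # 8 / 10
peripheral_only = phase_weighted_reward([0, 0, 1, 1, 1], weights)  # 4 / 10
```

Both runs pass 3 of 5 phases, yet the one that completes the core phases scores twice as high, which is the intended effect of weighting by architectural centrality.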
8. Quality Assurance
8.1 Shared Gates: Structure + Oracle + Expert + Calibration
All three datasets share a four-gate quality assurance pipeline before any task is admitted:
- Static Structure Validation — The task package is well-formed: Dockerfile builds, test harness conforms to conventions, metadata is valid, instruction provides sufficient context. Automated checks guarantee structural integrity.
- Oracle Verification — The reference solution is deployed into the task's Docker container and the full test suite runs end-to-end. Only tasks where the oracle achieves reward = 1.0 (RoadmapBench: also baseline < 1.0) are retained; the oracle run is the solvability evidence for admission.
- Agent-Assisted Expert Review — Domain engineering experts review task instructions, test suites, and oracle solutions for technical accuracy, specification completeness, and test fairness. Experts verify that instructions contain sufficient information for the task to be solvable without hidden assumptions, that test assertions are reasonable and not overly brittle, and that the oracle represents sound engineering practice. Automated checks ensure structural integrity; expert review checks whether the task is a valid engineering evaluation item.
- Multi-Model Pass@k Calibration — Multiple LLM agents attempt each task (k=4 trials). Tasks passed by every model on every attempt or failed by every model on every attempt are flagged and either adjusted or excluded.
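The calibration gate in the last bullet can be sketched as a simple classifier over per-model attempt results. The labels "saturated", "unsolved", and "ok" and the function name are illustrative choices, and the model names in the example are placeholders.

```python
def calibration_flag(results):
    """Classify one task from multi-model attempt results.

    results: dict mapping a model name to its list of binary rewards
    (k = 4 attempts per model in the text). Returns "saturated" if every
    attempt by every model passed, "unsolved" if every attempt failed,
    and "ok" otherwise.
    """
    all_rewards = [r for attempts in results.values() for r in attempts]
    if not all_rewards:
        raise ValueError("no attempts recorded")
    if all(all_rewards):
        return "saturated"
    if not any(all_rewards):
        return "unsolved"
    return "ok"


flag = calibration_flag({
    "model_a": [1, 1, 0, 1],
    "model_b": [0, 0, 0, 0],
})
# flag == "ok": the task separates the two models
```

Tasks flagged "saturated" or "unsolved" carry no signal for ranking models, which is why the gate sends them back for adjustment or exclusion.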
Beyond these shared gates, each dataset has one additional mechanism tailored to its setting.
8.2 Dataset-Specific QA Mechanisms
DeepTerminalBench: Pass@4 ∈ [0.25, 0.75]
The 50 released tasks are specifically those where Opus 4.6 Pass@4 falls in the 0.25–0.75 range. This keeps the released set away from both saturation and universal failure, and provides score variation among evaluated models. Tasks outside this band are flagged for review, difficulty adjustment, or exclusion.
EvoCodeBench: Path-Divergence via Behavioral Contracts
Multi-turn tasks introduce path divergence: different agents can reach the end of round N-1 through different implementation paths, but round N's tests must accept all valid paths. EvoCodeBench addresses this by requiring all tests to check the system's behavioral contract (external observable behavior) rather than implementation details. Round N's tests cannot assume the agent used the same code structure as the reference solution in round N-1 — only that the behavioral contract is preserved. Expert review explicitly audits every test for implementation coupling before admission.
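The difference between a behavioral-contract test and an implementation-coupled test can be shown with a toy example. Everything below is hypothetical (the inventory classes and both test functions are invented for illustration and are not EvoCodeBench code): two valid implementation paths satisfy the same contract, and only the coupled test rejects one of them.

```python
from collections import Counter


class DictInventory:
    """One valid path: plain dict storage in a private attribute."""

    def __init__(self) -> None:
        self._items = {}

    def add(self, name: str, qty: int) -> None:
        self._items[name] = self._items.get(name, 0) + qty

    def total(self, name: str) -> int:
        return self._items.get(name, 0)


class CounterInventory:
    """Another valid path: Counter storage under a different attribute name."""

    def __init__(self) -> None:
        self.stock = Counter()

    def add(self, name: str, qty: int) -> None:
        self.stock[name] += qty

    def total(self, name: str) -> int:
        return self.stock[name]


def behavioral_test(inventory_cls) -> bool:
    """Checks only externally observable behavior."""
    inv = inventory_cls()
    inv.add("bolt", 3)
    inv.add("bolt", 2)
    return inv.total("bolt") == 5 and inv.total("nut") == 0


def coupled_test(inventory_cls) -> bool:
    """Anti-pattern: assumes the reference solution's private attribute name."""
    return hasattr(inventory_cls(), "_items")


for cls in (DictInventory, CounterInventory):
    print(cls.__name__, behavioral_test(cls), coupled_test(cls))
```

Both classes pass the behavioral test, while the coupled test fails `CounterInventory` even though its behavior is correct. That is the kind of false failure the expert audit is meant to remove.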
RoadmapBench: Attribution-Driven Failure Classification
After automated verification, tasks enter multi-model rollout. Failures are classified into two categories: task-side defects (T-type) — missing instruction details, over-constrained assertions, environment mismatches, or test bugs — versus model-side failures (M-type) — incorrect architecture, buggy implementation, or failed debugging. T-type issues are iteratively fixed and re-verified through the oracle gate. A task is admitted to the dataset only after all T-type issues are resolved, ensuring that remaining failures reflect genuine model limitations rather than benchmark noise.
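The admission rule can be stated as a short predicate. This is a minimal sketch under assumed names (`FailureType`, `Failure`, and `admissible` are illustrative, not part of the RoadmapBench tooling): a task is admitted only when every T-type issue is resolved, while open M-type failures do not block admission.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class FailureType(Enum):
    TASK_SIDE = "T"   # missing instruction detail, brittle assertion, env mismatch, test bug
    MODEL_SIDE = "M"  # wrong architecture, buggy implementation, failed debugging


@dataclass
class Failure:
    description: str
    kind: FailureType
    resolved: bool = False


def admissible(failures: List[Failure]) -> bool:
    """A task is admissible once no unresolved T-type issue remains."""
    return all(f.resolved for f in failures if f.kind is FailureType.TASK_SIDE)


# Hypothetical rollout findings for one candidate task.
findings = [
    Failure("assertion pins an exact error message", FailureType.TASK_SIDE),
    Failure("agent placed the new class in the wrong module", FailureType.MODEL_SIDE),
]
print(admissible(findings))  # False: the T-type issue is still open

findings[0].resolved = True  # fixed and re-verified through the oracle gate
print(admissible(findings))  # True: only a model-side failure remains
```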
The T-type/M-type methodology is most formally developed in RoadmapBench, but the same spirit applies to DeepTerminalBench and EvoCodeBench quality control.
9. Infrastructure: Extending Harbor
Harbor was originally designed for single-shot tasks: one instruction, one agent execution, one verification. DeepTerminalBench uses that engine unchanged. EvoCodeBench adds persistent sessions, round boundaries, and resume support; RoadmapBench adds pinned repository images and phase-level scoring. The implementation differences explain what each benchmark setting costs to run reproducibly.
9.1 Baseline: Harbor's Single-Shot Engine
The baseline execution model is simple: Harbor builds the Docker environment, delivers instruction.md to the agent, runs the agent inside the container, then runs tests/test.sh to produce a reward. DeepTerminalBench uses this directly — every DeepTerminalBench task is a stock Harbor task, runnable by any Harbor-compatible scaffolding without modification.
9.2 EvoCodeBench Extensions: Persistent Multi-Round Execution
Supporting multi-turn evaluation required rethinking several layers of Harbor. The overarching goal: make the evaluation environment behave like a real multi-turn coding session. When a user works with a coding agent across multiple turns, the agent remembers the conversation, the file system accumulates changes, installed packages persist, and running services stay alive. None of this resets between turns.
Persistent Environment
The core design decision is one Docker container and one agent session for the entire multi-turn task. The container is created once at the start. The agent is initialized once. Neither is torn down or rebuilt between rounds. All state accumulates naturally: files written in round 1 are visible in round 5, packages installed in round 2 remain available in round 4, and the agent's own conversation history carries over throughout. Restarting the environment between rounds would measure a different setting: isolated repeated tasks rather than one persistent coding session.
Round Boundary Protocol
Between every two consecutive rounds, the framework executes a round boundary protocol:
- The agent completes its work for the current round.
- The framework runs the current round's cumulative verification script.
- A binary reward (pass/fail) is recorded.
- If the round passes, an optional state snapshot is saved, and the next round begins.
- If the round fails, the entire task terminates immediately (fail-stop).
Fail-stop is deliberate: since later rounds depend on correct earlier implementations, allowing an agent to continue after a failure would produce meaningless results. From the agent's perspective, nothing visible happens at round boundaries — it simply receives the next instruction and continues working. Verification and checkpointing happen transparently.
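The round boundary protocol can be sketched as a loop. This is not Harbor's API: the in-memory dict stands in for the persistent container, and each round is a hypothetical pair of callables (the agent's work and the cumulative verifier).

```python
import copy
from typing import Callable, Dict, List, Tuple

Workspace = Dict[str, str]
Round = Tuple[Callable[[Workspace], None], Callable[[Workspace], bool]]


def run_rounds(rounds: List[Round]) -> Tuple[List[int], List[Workspace]]:
    workspace: Workspace = {}       # created once, never reset between rounds
    rewards: List[int] = []
    snapshots: List[Workspace] = []
    for agent_step, verify in rounds:
        agent_step(workspace)       # the agent completes its work for the round
        passed = verify(workspace)  # cumulative verification for this round
        rewards.append(int(passed))  # binary reward
        if not passed:
            break                   # fail-stop: later rounds are never attempted
        snapshots.append(copy.deepcopy(workspace))  # snapshot only after a pass
    return rewards, snapshots


# Hypothetical three-round task; the agent gets round 2 wrong.
rounds: List[Round] = [
    (lambda ws: ws.update(a="v1"), lambda ws: ws.get("a") == "v1"),
    (lambda ws: ws.update(b="oops"),
     lambda ws: ws.get("a") == "v1" and ws.get("b") == "v2"),
    (lambda ws: ws.update(c="v3"), lambda ws: True),
]

rewards, snapshots = run_rounds(rounds)
print(rewards)         # [1, 0]
print(len(snapshots))  # 1
```

Round 3 is never executed, so its reward is absent rather than zero. How unattempted rounds are scored is a separate decision that this sketch does not model.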
State Management and Partial Evaluation
Building a multi-turn dataset requires running thousands of trials during development. Running every task from round 1 every time would be prohibitively expensive. We need the ability to start from the middle, which means placing the environment into the exact state it would be in after round N-1 without actually executing rounds 1 through N-1.
Two mechanisms address this:
- Reference solution fast-forward. The framework executes early rounds using the task's reference solution to bring the environment to a known-good state, then hands off to the target agent at the desired round. Fast-forwarded rounds are not verified and not scored; they serve purely to prepare the workspace.
- Snapshot-based resume. At each round boundary, the framework saves a full snapshot of the container state. A later run can restore from any saved snapshot, skip all prior rounds entirely, and begin from that checkpoint. Historical round results are preserved and carried forward into the new run's scoring.
These mechanisms enable three workflows: (1) difficulty calibration, iterating on a specific round's design without re-running the chain; (2) partial evaluation, testing an agent on rounds 3–5 only; and (3) cross-agent comparison, running the reference solution through all rounds and then resuming with different agents from the same checkpoint.
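A minimal sketch of fast-forward plus checkpoint resume, with an in-memory dict standing in for the container state. The function names and the toy reference solution are assumptions for illustration; real snapshots would capture a container file system rather than a Python dict.

```python
import copy
from typing import Callable, Dict, List

Workspace = Dict[str, int]
Step = Callable[[Workspace], None]


def fast_forward(reference_steps: List[Step], upto_round: int) -> Workspace:
    """Apply the reference solution for rounds 1..upto_round, unverified and unscored."""
    workspace: Workspace = {}
    for step in reference_steps[:upto_round]:
        step(workspace)
    return workspace


def resume(snapshot: Workspace, agent_step: Step) -> Workspace:
    """Start one agent from a private copy of the snapshot."""
    workspace = copy.deepcopy(snapshot)
    agent_step(workspace)
    return workspace


# Hypothetical reference solution: each round records one completed feature.
reference: List[Step] = [
    lambda ws: ws.update(round1=1),
    lambda ws: ws.update(round2=1),
    lambda ws: ws.update(round3=1),
]
checkpoint = fast_forward(reference, upto_round=2)

# Two hypothetical agents attempt round 3 from the same checkpoint.
agent_a = resume(checkpoint, lambda ws: ws.update(round3=1))
agent_b = resume(checkpoint, lambda ws: ws.update(round1=0))  # regresses round 1

print(checkpoint)  # {'round1': 1, 'round2': 1}
print(agent_a)     # {'round1': 1, 'round2': 1, 'round3': 1}
print(agent_b)     # {'round1': 0, 'round2': 1}
```

Copying the snapshot before each resume is what keeps the comparison fair: one agent's regression cannot leak into the state another agent starts from.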
9.3 RoadmapBench Extensions: Pinned Images and Phase Orchestration
RoadmapBench extends Harbor more modestly, along two axes:
- Pinned baseline Docker image. The Docker image built at Gate 1 verification (where the baseline fails and oracle passes) is saved and reused for all subsequent evaluations, ensuring reproducibility across runs even as upstream dependencies drift.
- Phase-level test orchestration + weighted reward. The test runner executes phase-level test files (test_01_.py, test_02_.py, ...) independently, records per-phase pass/fail, and computes the weighted sum as the task reward. Partial-progress scoring requires the runner to continue past phase failures rather than short-circuit, unlike EvoCodeBench's fail-stop model.
Beyond these, RoadmapBench reuses Harbor's stock execution engine without modification — a single instruction, a single agent session, a single verification pass.
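The contrast with fail-stop can be sketched as a runner that records every phase and never short-circuits. The phase names, weights, and callables below are hypothetical stand-ins for phase test files, and dividing by the total weight is an assumption (it is a no-op if the weights already sum to 1).

```python
from typing import Callable, Dict, Tuple


def run_phases(
    phase_tests: Dict[str, Callable[[], bool]],
    weights: Dict[str, float],
) -> Tuple[float, Dict[str, bool]]:
    """Run every phase independently and return (weighted reward, per-phase results)."""
    results: Dict[str, bool] = {}
    for name, test in phase_tests.items():
        try:
            results[name] = bool(test())
        except Exception:
            results[name] = False  # a crashing phase counts as failed; keep going
    total = sum(weights[name] for name in phase_tests)
    earned = sum(weights[name] for name, ok in results.items() if ok)
    return earned / total, results


def broken_phase() -> bool:
    raise RuntimeError("phase 2 is not implemented")


# Hypothetical task: the core phase carries the largest weight.
phase_tests = {"test_01": lambda: True, "test_02": broken_phase, "test_03": lambda: True}
weights = {"test_01": 5, "test_02": 3, "test_03": 2}

reward, results = run_phases(phase_tests, weights)
print(results)  # {'test_01': True, 'test_02': False, 'test_03': True}
print(reward)   # 0.7
```

Phase 3 still earns credit after phase 2 fails, which is exactly what a fail-stop runner would not allow.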
9.4 Shared Limitations
- Snapshot restoration is not perfect. EvoCodeBench snapshots capture the container's file system but not all runtime state. If an agent's work depends on running background processes, in-memory caches, or live network connections from earlier rounds, these will not survive a save-and-restore cycle.
- Tests must be deterministic and non-interactive. All verification scripts must run without human input and produce consistent results. Tests depending on external network state, wall-clock time, or interactive prompts are not supported.
- Snapshot storage cost. For long EvoCodeBench tasks, full container snapshots across many rounds can accumulate significant storage. By default, snapshots are saved only for rounds that pass verification, reducing storage for failed trials.
- No mid-round recovery. If an agent crashes mid-round in EvoCodeBench, that round must be re-executed from the previous snapshot.
10. Task Examples
Below are two representative tasks per dataset, chosen to cover different task structures within each benchmark. For space, we summarize each example rather than reproducing the full instruction, focusing on the design pattern and the difficulty mechanism.
10.1 DeepTerminalBench
B-Tree Index Manager 11-Bug Fix (Python, 31 tests, Opus Pass@4: 0.25)
A disk-backed B-tree index manager for a lightweight embedded database lives at /app/btree/, spanning 7 Python source files (node, tree, pager, serializer, cursor, wal, concurrency). A recent refactor introduced 11 distinct bugs spanning all 5 layers of the storage engine stack:
- BUG-001 (leaf-split key loss): the median key promoted to the parent is also left in the right child, so the left child silently loses keys[t-1].
- BUG-004 (trailing child pointer lost on serialize): serialize_node writes len(keys) child page-IDs instead of len(keys) + 1, corrupting the tree after checkpoint+recovery.
- BUG-008 (header signedness): num_keys packed as signed int16 wraps negative for >= 32768 keys.
- BUG-009: WAL recovery replays aborted transactions.
- BUG-011 (RWLock starvation): busy-wait spin causes 100% CPU.
- ... and 6 more bugs touching range scans, merges, cursors, and free-page reuse.
The public API must remain unchanged. A read-only diagnostics script at /app/run_diagnostics.py exercises the module end-to-end and writes a structured JSON report. Language is Python 3.
Environment (9 files)
/app/btree/node.py
/app/btree/tree.py
/app/btree/pager.py
/app/btree/serializer.py
/app/btree/cursor.py
/app/btree/wal.py
/app/btree/concurrency.py
/app/run_diagnostics.py (read-only diagnostic script)
The agent inherits a complete B-tree storage engine codebase: 7 Python source files implementing the core data structure, page serialization, write-ahead logging, cursor iteration, and thread-safe concurrent access, plus a diagnostics script that exercises the entire module and produces a structured report.
Oracle Solution
The oracle fixes all 11 bugs across 7 files. Key techniques include: correcting the B-tree split algorithm to properly promote median keys without duplication; fixing serialization to write len(keys)+1 child pointers; changing num_keys from signed to unsigned int16; making WAL recovery filter out aborted transactions; populating the free-page list so freed pages are reused; replacing the busy-wait spin with threading.Condition; and correcting range_scan boundaries to inclusive comparisons.
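Two of these fixes are easy to illustrate in isolation. The page layout below is invented for the example (the task's real on-disk format is not given here): it shows a node serializer that writes len(keys) + 1 child page IDs behind an unsigned 16-bit key count, and why reading that count as signed goes negative.

```python
import struct
from typing import List, Tuple

HEADER = "<HB"  # assumed layout: unsigned 16-bit key count, then an is-internal flag


def serialize_node(keys: List[int], children: List[int]) -> bytes:
    is_internal = bool(children)
    if is_internal and len(children) != len(keys) + 1:
        raise ValueError("an internal node needs len(keys) + 1 children")
    out = struct.pack(HEADER, len(keys), int(is_internal))
    out += struct.pack(f"<{len(keys)}q", *keys)          # 64-bit keys
    out += struct.pack(f"<{len(children)}I", *children)  # all child page IDs
    return out


def deserialize_node(buf: bytes) -> Tuple[List[int], List[int]]:
    num_keys, is_internal = struct.unpack_from(HEADER, buf, 0)
    offset = struct.calcsize(HEADER)
    keys = list(struct.unpack_from(f"<{num_keys}q", buf, offset))
    offset += 8 * num_keys
    if not is_internal:
        return keys, []
    # The trailing child pointer is the one a len(keys)-sized loop would drop.
    children = list(struct.unpack_from(f"<{num_keys + 1}I", buf, offset))
    return keys, children


keys, children = deserialize_node(serialize_node([10, 20], [1, 2, 3]))
print(keys, children)  # [10, 20] [1, 2, 3]

# Signedness: the same two bytes read as signed int16 turn 40000 negative.
raw = struct.pack("<H", 40000)
print(struct.unpack("<h", raw)[0])  # -25536
print(struct.unpack("<H", raw)[0])  # 40000
```

The round trip is the property that test_serialization_children_preserved is described as checking: every child pointer written must come back after deserialization.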
Tests (31 functions)
Functional correctness: test_range_scan_101_results, test_cursor_count_10000, test_search_after_merge.
Persistence and recovery: test_serialization_children_preserved checks that serialize-then-deserialize roundtrips preserve all child pointers. test_wal_recovery_skips_aborted verifies aborted-transaction keys do not reappear after replay.
Concurrency and resource management: test_no_cpu_spike, test_rwlock_uses_condition, test_free_page_direct.
Edge cases: empty tree operations, single-key tree, duplicate inserts, delete of non-existent key, range with low==high, interleaved commit/abort WAL recovery.
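The behavior checked by test_wal_recovery_skips_aborted can be sketched as a two-pass replay. The record format is an assumption made for this example (dicts with txn, op, key, and value fields), not the task's actual WAL encoding.

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[object]]


def recover(wal_records: List[Record]) -> Dict[object, object]:
    """Replay only the operations that belong to committed transactions."""
    # Pass 1: find every transaction that reached a commit record.
    committed = {r["txn"] for r in wal_records if r["op"] == "commit"}
    # Pass 2: apply operations in log order, skipping aborted or unfinished work.
    state: Dict[object, object] = {}
    for r in wal_records:
        if r["txn"] not in committed:
            continue
        if r["op"] == "put":
            state[r["key"]] = r["value"]
        elif r["op"] == "delete":
            state.pop(r["key"], None)
    return state


# Hypothetical log with interleaved commit, abort, and an unfinished transaction.
wal = [
    {"txn": 1, "op": "put", "key": "a", "value": 1},
    {"txn": 2, "op": "put", "key": "b", "value": 2},
    {"txn": 1, "op": "commit"},
    {"txn": 2, "op": "abort"},
    {"txn": 3, "op": "put", "key": "c", "value": 3},
    {"txn": 3, "op": "commit"},
    {"txn": 4, "op": "put", "key": "d", "value": 4},  # never committed
]
print(recover(wal))  # {'a': 1, 'c': 3}
```

Keys written by the aborted transaction 2 and the unfinished transaction 4 do not reappear after replay.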
Why This Task Is Difficult
The 11 bugs span 7 files across 5 layers (tree ops, binary serialization, page management, WAL recovery, concurrency). The key difficulty is cross-layer bug interaction. Fixing BUG-001 (leaf-split key loss) requires understanding the B-tree split invariant; but if BUG-004 (serialization writes wrong child-pointer count) is not also fixed, the corrected split silently corrupts on the next checkpoint/recovery cycle. Similarly, fixing BUG-009 (WAL replay of aborted transactions) requires understanding the WAL transaction lifecycle; but if BUG-007 (free-page reuse) is not also fixed, the data file grows without bound.
In failed trajectories, agents typically fix 8–9 of 11 bugs but miss either BUG-004 (easy to overlook because it only manifests after checkpoint+recovery) or BUG-007 (symptom is file growth, not data corruption, so it is deprioritized). The interaction between bugs means partial fixes often break other tests in non-obvious ways.
Kubernetes Pod Scheduler Optimizer (Rust, ~20 tests, Opus Pass@4: 0.75)
The agent is given a Rust-based Kubernetes pod scheduler simulator at /app/scheduler_slow.rs that reads cluster state from JSON and produces correct scheduling decisions but takes ~18 seconds on the test dataset. It must rewrite the simulator as /app/scheduler.rs so that it finishes in under 2 seconds while producing bit-for-bit identical output. The scheduler must compute six things: (1) pod scheduling with taints, tolerations, node/pod affinity and anti-affinity, node selector matching, resource fit, and a composite score combining resource balance and affinity preference; (2) per-node utilization; (3) per-namespace summaries; (4) anti-affinity conflict detection; (5) bin-packing standard deviation across resources; (6) preemption: for unschedulable pods with the PreemptLowerPriority policy, find the minimum-cost eviction set on some node. Deterministic sorting everywhere. Standard library only. Compile with rustc -O and print elapsed milliseconds to stderr.
Environment (3 files)
/app/data/cluster_state.json (cluster topology, pods, nodes)
/app/data/expected_output.json (exact expected scheduling decisions)
/app/scheduler_slow.rs (working but slow reference implementation)
The agent receives a working but slow Rust implementation, a large cluster state JSON, and the exact expected output. The task is pure performance optimization: the output must be bit-for-bit identical, just produced at least 9x faster (from ~18 seconds to under 2 seconds).
Oracle Solution
The oracle rewrites the scheduler from scratch in optimized Rust. Key techniques: pre-indexing nodes by label sets for O(1) node-selector matching; incremental resource-availability tracking instead of re-scanning all nodes; caching taint-toleration compatibility; priority-based preemption with early termination; efficient data structures (HashMap, BTreeMap) for topology lookups; eliminating unnecessary JSON serialization in intermediate steps.
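The pre-indexing idea can be illustrated with an inverted index from label pairs to node names. This is a simplified variant written in Python for readability, not the oracle's Rust code: it intersects one set per selector entry instead of looking up whole label sets, and the node data is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Labels = Dict[str, str]


def build_label_index(nodes: Dict[str, Labels]) -> Dict[Tuple[str, str], Set[str]]:
    """Map each (label key, label value) pair to the nodes that carry it."""
    index: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    for name, labels in nodes.items():
        for pair in labels.items():
            index[pair].add(name)
    return index


def match_node_selector(
    index: Dict[Tuple[str, str], Set[str]],
    all_nodes: List[str],
    selector: Labels,
) -> List[str]:
    """Nodes matching every selector pair, without scanning each node's labels."""
    candidates = set(all_nodes)
    for pair in selector.items():
        candidates &= index.get(pair, set())
    return sorted(candidates)  # sorted so that the result order is deterministic


# Hypothetical cluster.
nodes = {
    "node-a": {"zone": "us-1", "disk": "ssd"},
    "node-b": {"zone": "us-1", "disk": "hdd"},
    "node-c": {"zone": "us-2", "disk": "ssd"},
}
index = build_label_index(nodes)
print(match_node_selector(index, list(nodes), {"zone": "us-1", "disk": "ssd"}))  # ['node-a']
print(match_node_selector(index, list(nodes), {"disk": "ssd"}))  # ['node-a', 'node-c']
```

The index is built once per cluster state, so each pod's selector costs a few set intersections rather than a pass over every node.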
Tests
Correctness: the primary test compares the output file byte-for-byte against expected_output.json. Every scheduling decision — which pod on which node, which pods are preempted, which remain unscheduled — must match exactly. No partial credit.
Performance: 2-second wall-clock limit (down from ~18s baseline) under rustc -O.
Build verification: compiles without errors or warnings under the default release profile.
Why This Task Is Difficult
The task combines three requirements that must all hold simultaneously: Kubernetes scheduling semantics, Rust systems programming, and algorithmic optimization.
The scheduling specification is dense. Node affinity has both required and preferred expressions with 6 operators. Taints and tolerations have 3 effects with both Equal and Exists operators. Preemption follows priority ordering with a specific policy. The agent must understand all of these correctly — not approximately — because the test requires exact output match.
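To give a sense of how precise these rules are, here is a sketch of taint and toleration matching. It follows the matching rules in the upstream Kubernetes documentation as an assumption; the simulator's exact rules are defined by the task, not by this example, and the function names are illustrative.

```python
from typing import Dict, List

Taint = Dict[str, str]
Toleration = Dict[str, str]

HARD_EFFECTS = {"NoSchedule", "NoExecute"}  # PreferNoSchedule is only a soft preference


def tolerates(toleration: Toleration, taint: Taint) -> bool:
    effect = toleration.get("effect", "")
    if effect and effect != taint["effect"]:
        return False  # an empty toleration effect matches every effect
    key = toleration.get("key", "")
    operator = toleration.get("operator", "Equal")
    if operator == "Exists":
        # Exists ignores the value; an empty key matches every taint key.
        return key == "" or key == taint["key"]
    if operator == "Equal":
        return key == taint["key"] and toleration.get("value", "") == taint.get("value", "")
    return False


def passes_hard_taints(tolerations: List[Toleration], taints: List[Taint]) -> bool:
    """A node is eligible when every NoSchedule/NoExecute taint is tolerated."""
    return all(
        any(tolerates(t, taint) for t in tolerations)
        for taint in taints
        if taint["effect"] in HARD_EFFECTS
    )


# Hypothetical node taints and pod tolerations.
taints = [
    {"key": "dedicated", "value": "gpu", "effect": "NoSchedule"},
    {"key": "maintenance", "value": "soon", "effect": "PreferNoSchedule"},
]
gpu_pod = [{"key": "dedicated", "operator": "Equal", "value": "gpu", "effect": "NoSchedule"}]

print(passes_hard_taints(gpu_pod, taints))  # True
print(passes_hard_taints([], taints))       # False
```

An agent that treats PreferNoSchedule as a hard filter, or ignores the empty-key case, still produces plausible schedules. That is why such mistakes only surface under an exact-match test.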
Writing correct, optimized Rust adds a language-level constraint: the ownership model, borrow checker, and lifetime system prevent many common optimization patterns. The performance requirement (under 2s, down from 18s) requires an algorithmic improvement — an agent that produces correct code with the same O(N×M) complexity will pass correctness but fail the performance gate.
In failed trajectories, the most common pattern: Rust code compiles and runs under 2s but produces slightly different output due to a misunderstood scheduling rule (typically pod anti-affinity topology key handling or PreferNoSchedule taint semantics). The exact-match requirement leaves no room for approximation.
10.2 EvoCodeBench
Representative Real Task: Deterministic Data Pipeline CLI
This example is a released EvoCodeBench task, deterministic-data-pipeline-go. It asks the agent to build and then extend a Go CLI called dpipe for deterministic CSV processing, reproducibility metadata, validation, profiling, lineage, quality checks, snapshots, and changelogs. The task is representative because it combines all three multi-turn stressors in one session: cumulative feature growth, corrections to earlier behavior, and explicit conflicts with prior output formats.
dpipe Deterministic Data Pipeline (15 Rounds)
- Build dpipe as a deterministic Go CLI that reads CSV files, applies configured transformations, writes bit-identical outputs across repeated runs, and supports reproducibility verification.
- Change the verify command checksum algorithm from SHA-256 to BLAKE2b-256. Existing verification behavior must be updated without breaking earlier ingestion, transformation, profiling, and validation behavior.
- Replace the blake2b JSON field with two new fields. The model has to migrate the output contract while preserving the rest of the manifest and verification behavior.
- Revise the normalize method so that an omitted or empty method behaves as minmax. This tests whether the agent can revise an earlier transform without disturbing explicit zscore and minmax behavior.
- Change snapshot row hashes from MD5 to SHA-256 while also adding a tag transform. The final tests exercise the full accumulated system: transforms, auditability, lineage, manifest formats, snapshots, and changelogs.
The key property is cumulative verification. A correct round 15 implementation cannot merely satisfy the latest hash-format change; it must also preserve deterministic output, earlier checksum semantics where still valid, profile/validate/quality/schema behavior, audit logging, lineage metadata, and snapshot/changelog contracts introduced across the previous 14 rounds. This is the EvoCodeBench setting in concrete form: later requests are small, but each one is evaluated against the whole evolving tool.
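Cumulative verification with a superseded contract can be sketched as follows. The three rounds are a made-up miniature, not the released task's rounds, and the sketch is in Python although the task itself is in Go: a later round replaces the checksum contract by reusing its name, while the determinism check from round 1 stays active.

```python
import hashlib
from typing import Callable, Dict, List

Checksum = Callable[[bytes], str]


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def blake2b_256_hex(data: bytes) -> str:
    # BLAKE2b with a 32-byte (256-bit) digest.
    return hashlib.blake2b(data, digest_size=32).hexdigest()


# Each round adds checks; reusing a name supersedes the earlier contract.
ROUND_CHECKS: List[Dict[str, Callable[[Checksum], bool]]] = [
    {"deterministic": lambda f: f(b"a,b\n1,2\n") == f(b"a,b\n1,2\n")},
    {"checksum_algorithm": lambda f: f(b"x") == sha256_hex(b"x")},
    {"checksum_algorithm": lambda f: f(b"x") == blake2b_256_hex(b"x")},  # correction round
]


def verify_round(round_no: int, implementation: Checksum) -> bool:
    """Run every contract still active after rounds 1..round_no."""
    active: Dict[str, Callable[[Checksum], bool]] = {}
    for checks in ROUND_CHECKS[:round_no]:
        active.update(checks)
    return all(check(implementation) for check in active.values())


print(verify_round(2, sha256_hex))       # True
print(verify_round(3, sha256_hex))       # False: stale contract after the correction
print(verify_round(3, blake2b_256_hex))  # True
```

An implementation that never migrates keeps passing the old round but fails the new one, which is the stale-contract failure mode that correction rounds are designed to expose.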
10.3 RoadmapBench
Example 1: PyTorch Geometric 2.4 → 2.5 (Python)
The agent receives the PyG source tree at version 2.4.0 and must implement features introduced in 2.5.0. The task spans 5 phases: (1) the new Index class and EdgeIndex.unbind(); (2) EdgeIndex.sparse_resize_() with validation logic; (3) the ClusterPooling layer; (4) LinkPredMRR metric and GATConv residual connections. Each phase tests API-level behavioral contracts using tests adapted from the official PyG test suite.
Difficulty source: The agent must synthesize new classes that fit PyG's internal tensor semantics, register them in the correct __init__.py export paths, and preserve backward compatibility with existing 2.4 APIs — all without seeing the target version source code.
Example 2: Fiber v2.49 → v2.50 (Go)
Go library upgrades often introduce new middleware, context propagation mechanisms, or routing features. This task asks the agent to implement new request context helpers, extend the middleware chain API, and add a new built-in middleware with configurable behavior. The instruction provides behavioral contracts (which HTTP status codes to return under which conditions) but not the Go struct definitions or interface signatures.
Difficulty source: Go's type system and interface semantics require precise API surface alignment. The agent must discover the correct embedding and delegation patterns by reading the existing middleware implementations rather than from the instruction alone. Whereas the PyG example tests Python/ML ecosystem fluency, this example tests Go idioms and the ability to generalize RoadmapBench's pattern across language ecosystems.
11. Evaluation Insights
The same underlying model can produce different scores under different agent frameworks. To reduce scaffolding as a confounder, we evaluate with Harbor Terminus-2 — a standardized agent harness with terminal access, file editing, and code execution inside a Docker container (see the Terminal-Bench 2.0 leaderboard for comparative scaffolding performance). The analyses below therefore focus on behavior under this fixed harness.
11.1 DeepTerminalBench: Per-Task Pass/Fail Map
| Task | Opus-4.7 | GPT-5.5 | DSK-V4 | K2.6 | Q3.6 | GLM-5.1 | Q3.5-A | Gemini | MM-2.7 | ✓ |
|---|---|---|---|---|---|---|---|---|---|---|
| d12w3_fix-pipeline-executor-bugs | | | | | | | | | | 7 |
| d12_w3_d12w3_fix-log-analysis-alert-pipeline-bugs | | | | | | | | | | 6 |
| d17_w5_d17w5_doc-similarity-clustering | | | | | | | | | | 5 |
| d18w3_fix-particle-collision-sim | | | | | | | | | | 5 |
| d2w10_build-dependency-graph-forensics | | | | | | | | | | 5 |
| d4w10_microservice-crash-cascade-analyzer | | | | | | | | | | 5 |
| d5_w7_d5w7_streaming-aggregation-checkpoint-restore | | | | | | | | | | 5 |
| d5w6_json-log-aggregation-indexing-pipeline | | | | | | | | | | 5 |
| d9w8_security-patch-credential-exposure-bash-deployment | | | | | | | | | | 5 |
| d12_w6_d12w6_build-cache-parallel-execution-optimizer | | | | | | | | | | 4 |
| d13_w5_d13w5_static-analysis-pipeline-integration | | | | | | | | | | 4 |
| d4_w2_d4w2_enhance-application-metrics-collector | | | | | | | | | | 4 |
| d6_w6_d6w6_wal-compaction-checkpoint-optimizer | | | | | | | | | | 4 |
| d10w7_ml-feature-store-snapshot-recovery-drift | | | | | | | | | | 3 |
| d15_w6_d15w6_concurrent-hashmap-lock-striping-optimizer | | | | | | | | | | 3 |
| d19w4_migrate-bash-backup-to-python | | | | | | | | | | 3 |
| d3w6_d3w6-optimize-c-regression-test-harness | | | | | | | | | | 3 |
| d4_w2_d4w2_enhance-log-analysis-alert-engine | | | | | | | | | | 3 |
| d4w10_performance-triage-diagnose-runtime-anomalies-distributed | | | | | | | | | | 3 |
| d6_w3-schema-registry-rollback-chain | | | | | | | | | | 3 |
| d6w3_fix-b-tree-index-manager-bugs | | | | | | | | | | 3 |
| d12_w10_d12w10_backup-chain-forensics | | | | | | | | | | 2 |
| d14w2_extend-tar-archive-format-parser | | | | | | | | | | 2 |
| d16_w9_d16w9_api-contract-snapshot-regression-checker | | | | | | | | | | 2 |
| d16w8_harden-graphql-project-api | | | | | | | | | | 2 |
| d1_w4_d1w4_migrate-commonjs-to-esm-node-app | | | | | | | | | | 2 |
| d2_w1_d2w1_full-text-document-search-engine | | | | | | | | | | 2 |
| d5_w7_d5w7_ingestion-queue-dead-letter-recovery | | | | | | | | | | 2 |
| d5w2_fix-etl-pipeline-bugs-add-lineage | | | | | | | | | | 2 |
| d7w11_proc-tree-budget-enforcer | | | | | | | | | | 2 |
| d7w7_resume-checkpoint-batch-pipeline | | | | | | | | | | 2 |
| d9_w6_d9w6_rbac-abac-policy-optimizer | | | | | | | | | | 2 |
| d10w7_ml-training-checkpoint-recovery-system | | | | | | | | | | 1 |
| d11_w8_d11w8_numerical-simulation-engine-security-remediation | | | | | | | | | | 1 |
| d13w5_build-microschema-validator-compiler | | | | | | | | | | 1 |
| d14_w9_d14w9_binary-log-replay-determinism-validator | | | | | | | | | | 1 |
| d15w1_chandy-lamport-distributed-snapshot-simulator | | | | | | | | | | 1 |
| d1w9_reproducible-build-pipeline-artifact-verification | | | | | | | | | | 1 |
| d20w2_java-approval-chain-parallel-escalation | | | | | | | | | | 1 |
| d3_w4_d3w4_unittest-to-pytest-migration-verifier | | | | | | | | | | 1 |
| d4_w3_d4w3_fix-metrics-dashboard-collection-bugs | | | | | | | | | | 1 |
| d8_w8_d8w8_k8s-manifest-generator-security-hardening | | | | | | | | | | 1 |
| d8w7_ci-cd-pipeline-recovery-orchestrator | | | | | | | | | | 1 |
| d11_w8_d11w8_signal-processing-pipeline-security-hardening | | | | | | | | | | 0 |
| d17w12_identify-sorting-algorithm-trace-forensics | | | | | | | | | | 0 |
| d18w11_fsm-batch-test-coverage-analysis | | | | | | | | | | 0 |
| d5w10_etl-pipeline-audit-trail-forensic-reconstruction | | | | | | | | | | 0 |
| d6_w2_d6w2_lsm-tree-compaction-bloom-filter-enhancement | | | | | | | | | | 0 |
| d6_w8_d6w8_schema-migration-ddl-deserialization-fix | | | | | | | | | | 0 |
| d8w10_kubernetes-cluster-incident-forensics | | | | | | | | | | 0 |
11.2 DeepTerminalBench: 200-Rollout Deep Dive
We examined Claude Opus 4.6 across 200 rollouts (50 tasks × 4 attempts). Since tasks were selected using Opus Pass@4 calibration, this is not a model leaderboard; it is a task-difficulty analysis using one high-scoring agent as the probe.
Across 200 rollouts, 89 succeeded and 111 failed. Comparing the two populations, and the tasks behind them, gives four observations:
- Failed attempts run slightly longer. They average 17.7 minutes vs. 17.0 minutes for successes, consistent with failing agents spending extra time on unproductive exploration.
- 23% of failures end in timeout. These are not wrong answers produced quickly; they are cases where the agent ran out of budget while still actively working, which suggests these tasks require sustained multi-step reasoning.
- Lower Pass@4 tasks take longer. Pass@4 = 0.25 tasks average 31 minutes per attempt; Pass@4 = 0.75 tasks average 15 minutes.
- Multi-file bug diagnosis, cross-component integration, and deep domain knowledge are common among the 22 tasks in the Pass@4 = 0.25 band. Examples include understanding WAL semantics, K8s scheduling priorities, or C memory conventions that are inferred from context rather than stated.
Under identical scaffolding, the sequence of actions the model chooses matters enormously. Successful attempts test incrementally: write one module, run the test suite, observe which tests pass, fix issues, move to the next module. Failed attempts write everything first, then run tests for the first time at the end — encountering 20+ failures with no diagnostic signal about which module caused which error.
This is not a scaffolding deficiency: the scaffolding allows incremental testing in both cases. The difference is whether the model chooses to test early. The effect is especially pronounced in the B-tree 11-bug task, where agents that test after each bug fix pass at much higher rates than those attempting all 11 fixes before running any tests. The ability to decompose a complex task into verifiable increments is itself a core model capability, distinct from raw code generation quality.
In bug-fix and security-hardening tasks requiring 8–14 fixes, agents consistently find and fix 80–90% of issues but miss 1–2 subtle ones. The pattern is remarkably stable across tasks and runs.
In the B-tree task, agents reliably fix the obvious bugs but miss BUG-004 (serialization child count) because it only manifests after a checkpoint-recovery cycle, not during normal in-memory operations. In the SIEM pipeline task, agents fix most integration bugs but miss the per-service severity threshold filtering because it requires reading a separate config file not referenced in any component's API. The missed issues usually require domain knowledge or cross-file exploration beyond the local symptom. The agent often considers the right area but implements an incomplete fix — for example, adding lock protection to shared state but not to the escalation path, or normalizing 3 of 4 timestamp formats but missing the CEF format.
Many tasks impose simultaneous constraints that interact: thread safety + performance targets, security hardening + backward compatibility, format compliance + error recovery, algorithmic correctness + exact output matching. Agents consistently handle each constraint in isolation but fail when they must all hold at once.
In the K8s scheduler task, agents produce Rust code that either compiles and runs fast but produces wrong output (misunderstood scheduling semantics), or produces correct output but exceeds the 2-second time limit (correct algorithm but unoptimized). The high-failure region is the conjunction: correct and fast. These tasks measure whether an agent can satisfy interacting constraints at the same time, not whether it can satisfy each constraint separately.
11.3 EvoCodeBench: Single-Round Skill Does Not Become Multi-Turn Reliability
EvoCodeBench measures a pattern absent from single-shot benchmarks: agents can solve many individual rounds when each round starts from a clean reference state, but fail when they must carry forward their own implementation. On 26 multi-turn tasks and 227 evaluated rounds, Claude-Opus-4.7 leads MT@4 at 54.0, GPT-5.5-High follows at 52.4, and Claude-Opus-4.6 reaches 44.0. These are the only three models above 40 MT@4.
The SR-to-MT@4 gap quantifies the cost of carrying state across rounds. Claude-Opus-4.6 reaches 78.9 single-round pass rate but 44.0 MT@4. GLM-5.1 reaches 63.9 SR but 36.2 MT@4, Kimi-K2.6 reaches 59.0 SR but 31.9 MT@4, and DeepSeek-V4-Pro reaches 56.4 SR but 30.6 MT@4. The drop is evidence that isolated instruction following is insufficient; the model must preserve a correct evolving workspace while requirements accumulate, change, and sometimes conflict.
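The gap is simple arithmetic on the figures quoted above. The snippet below only recomputes it from those four SR and MT@4 pairs; the "retained" ratio (MT@4 divided by SR) is an extra derived quantity added here for illustration, not a metric the benchmark reports.

```python
# (SR, MT@4) pairs as quoted in the paragraph above.
scores = {
    "Claude-Opus-4.6": (78.9, 44.0),
    "GLM-5.1": (63.9, 36.2),
    "Kimi-K2.6": (59.0, 31.9),
    "DeepSeek-V4-Pro": (56.4, 30.6),
}

gaps = {}
for model, (sr, mt4) in scores.items():
    gaps[model] = round(sr - mt4, 1)       # points lost when state must be carried
    retained = round(100 * mt4 / sr, 1)    # share of SR that survives, in percent
    print(f"{model}: gap = {gaps[model]} points, retained = {retained}%")
```

For these four models the gap ranges from 25.8 to 34.9 points, so each keeps only a little more than half of its single-round pass rate once it must build on its own earlier work.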
Round-level results show how quickly reliability decays. Averaged across model-round observations, pass rates fall from 46.7 at round 1 to 26.9 at round 3, 21.3 at round 5, and 17.2 at round 7. Under fail-stop scoring, these drops quantify the cost of long-horizon state management: one regression can terminate the rest of the task.
Failure annotations sharpen the interpretation. The largest category is not broad conversational forgetting: among high-confidence labels, most failures are concrete current-round specification misses under cumulative tests. Regression and stale-contract errors are minority modes overall, but they become more visible on correction and conflict rounds, where the agent must revise behavior without breaking still-active obligations.
11.4 RoadmapBench: Cross-Model Separation
RoadmapBench results reveal clear separation across model capabilities. Claude-Opus-4.6 leads with 32.2% resolved (OpenHands) and 31.3% (Terminus 2). Failure modes differ systematically by capability level:
- Top-tier models (Claude-Opus-4.6, GPT-5.4) complete core phases reliably but miss subtle edge cases — error behavior when constraints conflict, API contract details not explicitly spelled out, or integration between newly introduced modules.
- Mid-tier models (Gemini-3.1-Pro, DeepSeek-V4-Pro, GLM-5.1) pass the first 1–2 phases but fail to integrate later phases that depend on earlier architectural decisions.
- Lower-tier models fail at the initial architectural setup — placing new code in the wrong modules, missing export registrations, or breaking existing APIs before implementing new ones.
The Completion Score captures this gradient: even models with low resolve rates achieve meaningful partial credit (e.g., DeepSeek-V4-Pro: 18.3% resolved but 0.486 completion), confirming that phase-weighted scoring exposes capability differences invisible to binary evaluation.
11.5 Capability vs. Consistency: A Cross-Dataset Finding
Across all three datasets, every released task has at least one successful attempt. The same model can fail another attempt after choosing a different exploration order, fix strategy, or code structure. This points to reliable execution of multi-step engineering plans under uncertainty as the shared source of score variance.
DeepTerminalBench surfaces this as Pass@4 variance on the same task. EvoCodeBench surfaces it as fail-stop failure on round N after correctly handling rounds 1 through N-1. RoadmapBench surfaces it through T-type/M-type attribution, where task-side defects are removed before model-side failures are analyzed. The common measurement target is consistency across long engineering trajectories.
12. Looking Forward: Toward Adaptive Evaluation
The remaining evaluation constraint is path divergence: two correct agents may reach the same behavior through different code structures, and tests must accept both without becoming too loose. EvoCodeBench handles this with behavioral-contract tests, RoadmapBench handles it with attribution-driven task fixes, and DeepTerminalBench uses Pass@k calibration to filter brittle tasks. These mechanisms reduce false failures, but they still rely on human experts to anticipate valid implementation paths.
The research direction is an adaptive evaluation loop that generates later instructions and tests from the agent's actual intermediate state:
The Vision: Adaptive Evaluation Loop
A Question-Generation Agent observes the agent's current system state and dynamically generates the next round's instruction, maintaining logical consistency while adapting to whatever implementation path the agent chose.
A Test-Generation Agent reads the current round's instruction and the agent's actual code, then generates verification scripts truly matched to the current state.
-> Coding Agent executes
-> Test Agent (sees instruction + code) -> generates tests -> verifies
-> Question Agent (sees new state) -> generates next instruction
-> ...
This would make later evaluation steps conditional on the implementation path actually taken by the agent. The current static datasets (191 tasks total across DeepTerminalBench, EvoCodeBench, and RoadmapBench) provide the fixed-task baseline for studying whether such adaptive evaluation improves attribution.
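The control flow of such a loop can be sketched with plain callables. Everything here is hypothetical: the three agents are toy stubs rather than LLM-backed components, the state is a dict, and stopping at the first failed round is an assumption carried over from EvoCodeBench rather than a stated design decision.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, int]
Verifier = Callable[[State], bool]


def adaptive_loop(
    coding_agent: Callable[[str, State], State],
    test_agent: Callable[[str, State], Verifier],
    question_agent: Callable[[State], str],
    first_instruction: str,
    max_rounds: int,
) -> List[Tuple[str, bool]]:
    state: State = {}
    instruction = first_instruction
    history: List[Tuple[str, bool]] = []
    for _ in range(max_rounds):
        state = coding_agent(instruction, state)  # Coding Agent executes
        verify = test_agent(instruction, state)   # tests generated from instruction + code
        passed = verify(state)
        history.append((instruction, passed))
        if not passed:
            break                                 # assumed fail-stop
        instruction = question_agent(state)       # next instruction from the new state
    return history


# Toy stubs standing in for the three agents.
def toy_coder(instruction: str, state: State) -> State:
    new_state = dict(state)
    new_state[instruction] = 1  # "implements" the requested feature
    return new_state


def toy_tester(instruction: str, state: State) -> Verifier:
    return lambda s: s.get(instruction) == 1


def toy_questioner(state: State) -> str:
    return f"feature_{len(state) + 1}"


history = adaptive_loop(toy_coder, toy_tester, toy_questioner, "feature_1", max_rounds=3)
print(history)  # [('feature_1', True), ('feature_2', True), ('feature_3', True)]
```

The point of the structure is that both the verifier and the next instruction are functions of the state the agent actually produced, not of a reference solution's state.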
13. Summary
- Three directions of terminal-based evaluation: We expand Terminal-Bench along three orthogonal directions — depth, iteration, and evolution — all grounded in containerized terminal environments that exercise the same tools, constraints, and feedback loops as human engineers. Task designs draw from continuously growing real-world sources (GitHub, CVEs, RFCs, production postmortems, CS textbooks, and open-source release cycles), enabling the datasets to expand as the engineering landscape evolves.
- DeepTerminalBench — depth: 50 curated tasks selected from a larger pool using Claude Opus 4.6 Pass@4 calibration, each a complex engineering task in a rich pre-configured environment (median 10 files, up to 50+). Verification is primarily dynamic and execution-based — programs are compiled, run, and their runtime behavior validated, including adversarial security payloads and performance gates. The main leaderboard reports Claude-Opus-4.7 at 34.0 Pass@1, with GPT-5.5-high, DeepSeek-V4-Pro, Kimi-K2.6 and Qwen-3.6-Plus all clustered at 32.0; the separate 200-rollout analysis uses four attempts per task to analyze task-difficulty variation and failure patterns.
- EvoCodeBench (multi-turn iteration): 26 multi-turn tasks comprising 227 evaluated rounds capture the core stressors of iterative agent work: cumulative state management, requirement evolution, specification conflicts, and backward compatibility. Tasks are organized along two orthogonal axes, interaction style (explorative, contractual, document-driven) and engineering activity (construction, spec evolution, review-driven improvement, migration), yielding 3 × 4 = 12 distinct task-type combinations. Evaluation across 13 models shows long-horizon score decay: Claude-Opus-4.7 leads at 54.0 MT@4, GPT-5.5-High follows at 52.4, Claude-Opus-4.6 scores 44.0, and all other models are at or below 36.2. All tests verify behavioral contracts rather than implementation details, which directly addresses path divergence in multi-turn evaluation.
- RoadmapBench (version-upgrade evolution): 115 tasks across 17 repositories in 5 programming languages (Python, TypeScript, C++, Go, Rust) target long-horizon feature implementation across real version releases. Each task gives the agent a repository pinned at an earlier release and a roadmap-style instruction; the agent must implement behaviors introduced in the target version without access to the new source code, test files, or oracle patch. Multi-phase weighted scoring captures partial progress (median 5 phases, max 12), and oracle patches (~3,700 LOC median) are roughly two orders of magnitude larger than the patches in bug-fix benchmarks. Attribution-driven quality control separates task defects from model-side failures, so remaining failures are attributable to model behavior rather than dataset defects.
- Shared infrastructure: All datasets are built on the Harbor evaluation framework. For multi-turn tasks, Harbor is extended with persistent environments, round boundary protocols, state snapshots, and partial evaluation support. For version-upgrade tasks, Harbor manages pinned Docker environments, phase-level test orchestration, and weighted reward computation. Together these enable reproducible, scalable evaluation across all three datasets.
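To make the multi-phase weighted scoring mentioned for RoadmapBench more concrete, here is a minimal Python sketch. The exact formula is not restated in this summary, so the sketch assumes that a task's score is the weight-normalized sum of per-phase test pass fractions; the function name `phase_weighted_score`, the tuple layout, and the example weights are all hypothetical.

```python
from typing import List, Tuple

# Each phase is (weight, tests_passed, tests_total); this layout is an assumption.
Phase = Tuple[float, int, int]


def phase_weighted_score(phases: List[Phase]) -> float:
    """Return a score in [0, 1]: weighted mean of per-phase pass fractions."""
    total_weight = sum(weight for weight, _, _ in phases)
    if total_weight <= 0:
        raise ValueError("phase weights must sum to a positive value")
    earned = 0.0
    for weight, passed, total in phases:
        if total > 0:  # a phase with no tests contributes nothing
            earned += weight * (passed / total)
    return earned / total_weight


# Illustrative three-phase task: phase 1 fully passes, phase 2 half passes,
# phase 3 fails entirely.
example_phases = [(1.0, 4, 4), (2.0, 1, 2), (1.0, 0, 5)]
score = phase_weighted_score(example_phases)  # (1*1.0 + 2*0.5 + 1*0.0) / 4 = 0.5
```

Under this assumption, an agent that completes early phases but stalls later still receives partial credit in proportion to the weight of the phases it finished, which is the behavior the summary attributes to multi-phase scoring.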