SaaS-Bench

🏆 Claude Opus 4.7: 43.9% · Computer-Use Agents

Can computer-use agents leverage real-world SaaS to solve professional workflows?

SaaS-Bench evaluates computer-use agents inside 23 real, deployable SaaS systems, where progress depends on persistent state, cross-application coordination, domain constraints, and verifiable final artifacts. The strongest model resolves only 3.8% of tasks end-to-end.

23 SaaS Systems · 106 Tasks · 3,971 Checkpoints · 6 Domains

SaaS-Bench Leaderboard

| Model | Vendor | Released | Avg. Steps | Checkpoint Score (%) | Resolved Score (%) |
|---|---|---|---|---|---|
| Claude Opus 4.7 | Anthropic | 2026-04 | 175 | 43.9 | 3.8 |
| GPT-5.5 High | OpenAI | 2026-04 | 200 | 43.8 | 1.9 |
| Claude Opus 4.6 | Anthropic | 2026-02 | 257 | 43.2 | 1.9 |
| GPT-5.4 High | OpenAI | 2026-03 | 252 | 37.0 | 3.8 |
| Kimi K2.6 | Moonshot AI | 2026-04 | 269 | 34.1 | 0.9 |
| Qwen 3.6 Plus | Alibaba | 2026-04 | 249 | 29.9 | 1.9 |
| Kimi K2.5 | Moonshot AI | 2026-02 | 270 | 27.7 | 0.0 |
| Gemini 3.1 Pro | Google DeepMind | 2026-02 | 140 | 27.1 | 0.0 |
| Doubao Seed 2.0 Pro | ByteDance | 2026-02 | 216 | 27.1 | 1.9 |
| Step 3.7 Flash | StepFun | n/a | 272 | 25.1 | 0.0 |
| Gemini 3.5 Flash High | Google DeepMind | 2026-05 | 324 | 23.3 | 0.9 |
| Claude Sonnet 4.6 | Anthropic | 2026-02 | 155 | 23.3 | 0.9 |
| DeepSeek V4 Pro (text-only)† | DeepSeek | 2026-04 | 236 | 21.5 | 1.4 |
| GLM-5.1 (text-only)† | Zhipu AI | 2026-04 | 166 | 17.4 | 0.0 |
| MiniMax M2.7 (text-only)† | MiniMax | 2026-03 | 256 | 15.8 | 0.0 |

Checkpoint Score: overall checkpoint score (the bar height in the original chart).

Resolved Score: fraction of tasks with all checkpoints passing.

† Text-only domains; overall checkpoint score reported on text-only tasks.
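The two leaderboard metrics can diverge sharply, and a small sketch makes the gap concrete. The Python below is a minimal illustration, not SaaS-Bench's actual scoring code: the `results` mapping, the function names, and the toy data are all hypothetical. Pooling checkpoints across all tasks is also an assumption, since the page does not say whether the overall checkpoint score is pooled or averaged per task.

```python
from typing import Dict, List

# Hypothetical input shape: task id -> pass/fail outcome for each checkpoint.
Results = Dict[str, List[bool]]


def checkpoint_score(results: Results) -> float:
    """Percentage of checkpoints passed, pooled over all tasks (assumed aggregation)."""
    total = sum(len(checks) for checks in results.values())
    passed = sum(sum(checks) for checks in results.values())
    return 100.0 * passed / total if total else 0.0


def resolved_score(results: Results) -> float:
    """Percentage of tasks in which every checkpoint passed."""
    if not results:
        return 0.0
    resolved = sum(1 for checks in results.values() if checks and all(checks))
    return 100.0 * resolved / len(results)


# Toy data (invented for illustration): partial progress on most tasks,
# but only one task finished completely.
toy_results: Results = {
    "task_a": [True, True, True, True],
    "task_b": [True, True, False, False],
    "task_c": [True, False, False, False],
    "task_d": [False, False, False, False],
}

print(f"checkpoint score: {checkpoint_score(toy_results):.2f}%")
print(f"resolved score:   {resolved_score(toy_results):.2f}%")
```

On this toy data, 7 of 16 checkpoints pass (43.75%) while only 1 of 4 tasks is fully resolved (25%). A single failed checkpoint is enough to keep a task out of the resolved count, which is why the resolved score can sit far below the checkpoint score.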

The best checkpoint score is 43.9% (Claude Opus 4.7), but only 3.8% of tasks are resolved end-to-end. Agents can start work, but rarely finish it.
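To put the resolved percentages in absolute terms, the following sketch converts task counts into the rounded percentages shown on the leaderboard. It assumes the resolved score is computed over all 106 tasks and rounded to one decimal place. Neither detail is stated on the page, and the assumption does not hold for the text-only rows, which are scored on a subset whose size is not given. The constant and function names are illustrative.

```python
TOTAL_TASKS = 106          # from the headline statistics
TOTAL_CHECKPOINTS = 3971   # from the headline statistics


def resolved_pct(n_resolved: int, n_tasks: int = TOTAL_TASKS) -> float:
    """Resolved score (%) for a given number of fully resolved tasks."""
    return round(100.0 * n_resolved / n_tasks, 1)


# Average number of checkpoints that must all pass for one task to count as resolved.
avg_checkpoints_per_task = TOTAL_CHECKPOINTS / TOTAL_TASKS

for n in range(5):
    print(f"{n} resolved task(s) -> {resolved_pct(n)}%")
print(f"average checkpoints per task: {avg_checkpoints_per_task:.2f}")
```

Under these assumptions, 3.8% corresponds to 4 of 106 tasks, 1.9% to 2 tasks, and 0.9% to 1 task. A task carries about 37 checkpoints on average, all of which must pass for it to count as resolved.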

Six Professional Domains

Business Operations & Finance

Expense reimbursement, accounting closeout, invoice and payment workflows spanning HR, accounting, and CRM systems.

Frappe HRMS · BigCapital · Twenty CRM · ERPNext

Software Engineering

Test execution audits, regression tracking, and project management work packages from IDE to issue tracker.

code-server · Baserow · OpenProject

Healthcare Administration

Duplicate patient merges, clinical data integrity audits, and HIPAA-compliant audit reporting.

OpenEMR · OnlyOffice

Team Collaboration

Document creation, cloud sharing with tiered access, and email distribution with read receipts.

OnlyOffice · ownCloud · Roundcube · Mattermost

Artisan Agri-Food Supply Chain

Inventory traceability and harvest log cross-referencing across warehouse and farm management systems.

Grocy · FarmOS

Independent Media Creation

Knowledge synthesis from academic sources into structured research notes for content production.

SiYuan · PDF input

Citation

For full details, see the SaaS-Bench blog. If you find SaaS-Bench useful in your research, please cite:

@misc{shi2026saasbench,
  title   = {SaaS-Bench: Can Computer-Use Agents Leverage Real-World SaaS to Solve Professional Workflows?},
  author  = {{UniPat AI}},
  year    = {2026},
  url     = {https://unipat.ai/blog/SaaS-Bench},
}
