Benchmarks

SaaS-Bench

๐Ÿ† Claude Opus 4.7: 43.9% ยท Computer-Use Agents

Can computer-use agents leverage real-world SaaS to solve professional workflows?

SaaS-Bench evaluates computer-use agents inside 23 real, deployable SaaS systems, where progress depends on persistent state, cross-application coordination, domain constraints, and verifiable final artifacts. The strongest model resolves fewer than 4% of tasks end-to-end.

23 SaaS Systems
106 Tasks
3,971 Checkpoints
6 Domains

SaaS-Bench Leaderboard

| Model | Vendor | Released | Avg. Steps | Checkpoint Score (%) | Resolved Score (%) |
|---|---|---|---|---|---|
| Claude Opus 4.7 | Anthropic | 2026-04 | 175 | 43.9 | 3.8 |
| GPT-5.5 High | OpenAI | 2026-04 | 200 | 43.8 | 1.9 |
| Claude Opus 4.6 | Anthropic | 2026-02 | 257 | 43.2 | 1.9 |
| GPT-5.4 High | OpenAI | 2026-03 | 252 | 37.0 | 3.8 |
| Kimi K2.6 | Moonshot AI | 2026-04 | 269 | 34.1 | 0.9 |
| Qwen 3.6 Plus | Alibaba | 2026-04 | 249 | 29.9 | 1.9 |
| Kimi K2.5 | Moonshot AI | 2026-02 | 270 | 27.7 | 0.0 |
| Gemini 3.1 Pro | Google DeepMind | 2026-02 | 140 | 27.1 | 0.0 |
| Doubao Seed 2.0 Pro | ByteDance | 2026-02 | 216 | 27.1 | 1.9 |
| Gemini 3.5 Flash High | Google DeepMind | 2026-05 | 324 | 23.3 | 0.9 |
| Claude Sonnet 4.6 | Anthropic | 2026-02 | 155 | 23.3 | 0.9 |
| DeepSeek V4 Pro† (text-only) | DeepSeek | 2026-04 | 236 | 21.5 | 1.4 |
| GLM-5.1† (text-only) | Zhipu AI | 2026-04 | 166 | 17.4 | 0.0 |
| MiniMax M2.7† (text-only) | MiniMax | 2026-03 | 256 | 15.8 | 0.0 |

Checkpoint Score: overall checkpoint score (shown as bar height in the original chart).
Resolved Score: fraction of tasks with all checkpoints passing.
† Text-only domains; overall checkpoint score reported on text-only tasks.

43.9% is the best checkpoint score (Claude Opus 4.7), but only 3.8% of tasks are resolved end-to-end. Agents can start work, but rarely finish it.

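The gap between the checkpoint score and the resolved score comes from how the two are aggregated. The sketch below shows one way to compute both from per-task checkpoint counts. It is an illustration under stated assumptions, not the benchmark's scoring code: the `TaskResult` type, the task names, and the counts are hypothetical, and the checkpoint score is assumed to be the mean of per-task pass fractions (the page does not say whether tasks or checkpoints are averaged).

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    # Hypothetical record of one task run; not a SaaS-Bench data structure.
    task_id: str
    passed: int  # checkpoints the agent passed
    total: int   # checkpoints defined for the task


def checkpoint_score(results):
    """Assumed definition: mean per-task fraction of checkpoints passed, in percent."""
    return 100.0 * sum(r.passed / r.total for r in results) / len(results)


def resolved_score(results):
    """Percent of tasks in which every checkpoint passed."""
    return 100.0 * sum(r.passed == r.total for r in results) / len(results)


# Made-up results: partial progress on most tasks, one task fully resolved.
results = [
    TaskResult("expense-closeout", passed=30, total=40),
    TaskResult("patient-merge", passed=10, total=40),
    TaskResult("harvest-audit", passed=20, total=20),
    TaskResult("research-notes", passed=5, total=50),
]

print(f"checkpoint score: {checkpoint_score(results):.1f}%")
print(f"resolved score:   {resolved_score(results):.1f}%")
```

In this toy data the agent passes about half of all checkpoints but fully resolves only one task in four, so partial credit accumulates while a single missed checkpoint leaves a task unresolved.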
Six Professional Domains

Business Operations & Finance

Expense reimbursement, accounting closeout, invoice and payment workflows spanning HR, accounting, and CRM systems.

Frappe HRMS · BigCapital · Twenty CRM · ERPNext

Software Engineering

Test execution audits, regression tracking, and project management work packages from IDE to issue tracker.

code-server · Baserow · OpenProject

Healthcare Administration

Duplicate patient merges, clinical data integrity audits, and HIPAA-compliant audit reporting.

OpenEMR · OnlyOffice

Team Collaboration

Document creation, cloud sharing with tiered access, and email distribution with read receipts.

OnlyOffice · ownCloud · Roundcube · Mattermost

Artisan Agri-Food Supply Chain

Inventory traceability and harvest log cross-referencing across warehouse and farm management systems.

Grocy · FarmOS

Independent Media Creation

Knowledge synthesis from academic sources into structured research notes for content production.

SiYuan · PDF input

Citation

For full details, please read the SaaS-Bench blog. If you find SaaS-Bench useful in your research, please cite:

@misc{shi2026saasbench,
  title   = {SaaS-Bench: Can Computer-Use Agents Leverage Real-World SaaS to Solve Professional Workflows?},
  author  = {{UniPat AI}},
  year    = {2026},
  url     = {https://unipat.ai/blog/SaaS-Bench},
}