SaaS-Bench
Claude Opus 4.7: 43.9% · Computer-Use Agents
Can computer-use agents leverage real-world SaaS to solve professional workflows?
SaaS-Bench evaluates computer-use agents inside 23 real, deployable SaaS systems, where progress depends on persistent state, cross-application coordination, domain constraints, and verifiable final artifacts. The strongest model resolves fewer than 4% of tasks end-to-end.
[Figure: model leaderboard. Labels as extracted: Opus 4.7, High, Opus 4.6, High, Plus, 3.1 Pro, 2.0 Pro, Flash High, Sonnet 4.6, V4 Pro†, M2.7†]
Business Operations & Finance
Expense reimbursement, accounting closeout, invoice and payment workflows spanning HR, accounting, and CRM systems.
Software Engineering
Test execution audits, regression tracking, and project management work packages from IDE to issue tracker.
Healthcare Administration
Duplicate patient merges, clinical data integrity audits, and HIPAA-compliant audit reporting.
Team Collaboration
Document creation, cloud sharing with tiered access, and email distribution with read receipts.
Artisan Agri-Food Supply Chain
Inventory traceability and harvest log cross-referencing across warehouse and farm management systems.
Independent Media Creation
Knowledge synthesis from academic sources into structured research notes for content production.
Citation
For full details, see the SaaS-Bench blog. If you find SaaS-Bench useful in your research, please cite:
@misc{shi2026saasbench,
title = {SaaS-Bench: Can Computer-Use Agents Leverage Real-World SaaS to Solve Professional Workflows?},
author = {UniPat AI},
year = {2026},
url = {https://unipat.ai/blog/SaaS-Bench},
}







