SaaS-Bench
Can computer-use agents leverage real-world SaaS to solve professional workflows? 23 deployable SaaS systems, 106 tasks, 3,971 checkpoints across 6 professional domains.
115 real-world version-upgrade tasks from 17 repositories across 5 programming languages. Evaluates whether coding agents can implement substantial new functionality guided only by a multi-target specification.
A continuously updated benchmark evaluating AI coding agents on real-world software engineering tasks from GitHub issues.
A dynamic evaluation engine for AI prediction systems, featuring multi-point aligned Elo ranking, three-track data collection, and adaptive scheduling across diverse domains including finance, politics, crypto, sports, and esports.
A benchmark for visual reasoning that challenges frontier MLLMs, yet 3-year-olds can solve it.