UniPat AI

🏆 2026.05: GPT-5.5-High · Coding

Monthly-SWEBench

A continuously updated benchmark evaluating AI coding agents on real-world software engineering tasks from GitHub issues.

🏆 Claude Opus 4.7: 43.9% · Computer-Use Agents

SaaS-Bench

Can computer-use agents leverage real-world SaaS to solve professional workflows? 23 deployable SaaS systems, 106 tasks, 3,971 checkpoints across 6 professional domains.

Claude-Opus-4.7: 54.0 MT@4 · Coding

EvoCode-Bench

26 multi-turn coding tasks with 227 evaluated rounds. Each task keeps the same workspace and agent session while requirements change, accumulate, and sometimes conflict.

🏆 Claude-Opus-4.7: 39.1% · Coding

RoadmapBench

115 real-world version-upgrade tasks from 17 repositories across 5 programming languages. Evaluates whether coding agents can implement substantial new functionality guided only by a multi-target specification.

🏆 EchoZ-1.0: Elo 1035.7 · Prediction Intelligence

Echo Leaderboard

A dynamic evaluation engine for AI prediction systems, featuring multi-point aligned Elo ranking, three-track data collection, and adaptive scheduling across diverse domains including finance, politics, crypto, sports, and esports.

🏆 Seed-2.0: 60.6% · Visual Reasoning

BabyVision

A visual reasoning benchmark that challenges frontier MLLMs yet can be solved by 3-year-olds.

Define the Frontier and Track the Acceleration.

Monthly-SWEBench

SaaS-Bench

EvoCode-Bench

RoadmapBench

Echo Leaderboard

BabyVision