Monthly-SWEBench
A continuously updated benchmark evaluating AI coding agents on real-world software engineering tasks from GitHub issues.
A dynamic evaluation engine for AI prediction systems, featuring multi-point aligned Elo ranking, three-track data collection, and adaptive scheduling across diverse domains including finance, politics, crypto, sports, and esports.
A benchmark for visual reasoning that challenges frontier MLLMs yet can be solved by 3-year-olds.