🏆 2026.03: Claude-Opus-4.6 · Coding

Monthly-SWEBench

A continuously updated benchmark evaluating AI coding agents on real-world software engineering tasks from GitHub issues.

🏆 EchoZ-1.0: Elo 1035.7 · Prediction Intelligence

Echo Leaderboard

A dynamic evaluation engine for AI prediction systems, featuring multi-point aligned Elo ranking, three-track data collection, and adaptive scheduling across diverse domains including finance, politics, crypto, sports, and esports.

🏆 Seed-2.0: 60.6% · Visual Reasoning

BabyVision

A visual reasoning benchmark that challenges frontier MLLMs yet can be solved by 3-year-olds.