2026-05-06

ExpertEval: Can AI Match the Judgment of Seasoned Professionals?

ExpertEval is a large-scale, expert-annotated evaluation infrastructure spanning Medicine, Finance, and Law. We measure productive intelligence: the capacity to reason under genuine professional constraints where errors carry irreversible consequences.

2026-05-02

Terminal-X: Evaluate Coding Agents across Depth, Iteration, and Evolution in Terminal Environments

Terminal-X is a suite of three terminal-based datasets for evaluating LLM coding agents: DeepTerminalBench, with 50 calibrated tasks across 20+ engineering domains; EvoCodeBench, with 26 multi-turn tasks comprising 227 evaluated rounds of evolving requirements; and RoadmapBench, with 115 version-upgrade tasks from 17 real open-source repositories across 5 programming languages. Together, they evaluate coding agents along three practical dimensions of real-world software work: depth, iteration, and evolution.

2026-03-27

Echo: Towards General AI Prediction

We present Echo, a full-stack prediction intelligence system centred on EchoZ-1.0, the first large language model trained end-to-end under the Train-on-Future paradigm. The system spans a dynamic evaluation engine, a post-training pipeline, and an AI-native prediction API.

2026-03-09

SWE-Vision: A Minimal Agent for Advancing Visual Intelligence

While coding capabilities have surpassed human-level performance on many benchmarks, visual reasoning continues to lag behind. In this work, we introduce SWE-Vision, a minimal agentic workflow that leverages a simple coding environment to enhance visual understanding and offers a more achievable direction for test-time scaling.

2026-01-12

BabyVision: Visual Reasoning Beyond Language

State-of-the-art MLLMs achieve PhD-level language reasoning but struggle with visual tasks that 3-year-olds solve effortlessly. We introduce BabyVision, a benchmark revealing the infancy of AI vision.