UniPat AI

2026-05-25

SaaS-Bench: Can Computer-Use Agents Leverage Real-World SaaS to Solve Professional Workflows?

AI agents can browse the web — but can they actually do your job? SaaS-Bench puts computer-use agents inside 23 real SaaS systems to find out, with 106 professional workflows spanning finance, healthcare, engineering, and more.

2026-05-06

ExpertEval: Can AI Match the Judgment of Seasoned Professionals?

ExpertEval is a large-scale, expert-annotated evaluation infrastructure spanning Medicine, Finance, and Law. We measure productive intelligence: the capacity to reason under genuine professional constraints where errors carry irreversible consequences.

2026-05-02

Terminal-X: Evaluate Coding Agents across Depth, Iteration, and Evolution in Terminal Environments

Terminal-X evaluates coding agents in executable terminal tasks. It combines DeepTerminalBench for single-shot depth, EvoCode-Bench for multi-turn iteration, and RoadmapBench for version-upgrade work on real repositories.

2026-03-27

Echo: Towards General AI Prediction

We present Echo, a full-stack prediction intelligence system spanning a dynamic evaluation engine, a post-training pipeline, and an AI-native prediction API. It is centred on EchoZ-1.0, the first large language model trained end-to-end under the Train-on-Future paradigm.

2026-03-09

SWE-Vision: A Minimal Agent for Advancing Visual Intelligence

While coding capabilities have surpassed human-level performance on many benchmarks, visual reasoning continues to lag behind. In this work, we introduce SWE-Vision, a minimal agentic workflow that leverages a simple coding environment to enhance visual understanding. This approach also offers a more achievable direction for test-time scaling.

2026-03-04

UniScientist: Advancing Universal Scientific Research Intelligence

UniScientist is designed to advance universal scientific research intelligence through a unified paradigm. Leveraging an evolving polymathic synthesis, we generate research-grade data that enables structured, rubric-based supervision.

2026-01-12

BabyVision: Visual Reasoning Beyond Language

State-of-the-art MLLMs achieve PhD-level language reasoning but struggle with visual tasks that 3-year-olds solve effortlessly. We introduce BabyVision, a benchmark revealing the infancy of AI vision.