
Echo: Towards General AI Prediction

2026-03-27

Contributors: Zhengwei Tao1,2,*, Rui Min1,*, Hongfeng He1,*, Jialong Wu1,2, Baixuan Li1, Liang Chen1, Wentao Zhang2, Kuan Li1

Affiliations: 1UniPat AI  2Peking University

*Equal Contribution

Correspondence: zhengweitao@unipat.ai, kuanli@unipat.ai

General AI Prediction Leaderboard

March 2026  ·  Elo Score

#1 EchoZ-1.0 · 1034.2
#2 Gemini-3.1-Pro · 1032.2
#3 Claude-Opus-4.6 · 1017.2
#4 Grok-4.1-Fast · 1011.3
#5 Seed-2.0-Pro · 1008.9
#6 GPT-5.2 · 1005.2
#7 GLM-5 · 1001.8
#8 Minimax-M2.5 · 992.0
#9 Qwen3-Max · 991.4
#10 Deepseek-V3.2 · 989.7
#11 Kimi-K2.5 · 976.1

(Market Avg shown as a reference line.)
Snapshot as of March 27, 2026

Introduction

Prediction has never been absent from human history, yet the question “Why do we predict?” has rarely required an answer. From prehistoric hunters anticipating trajectories and prey behavior, to farmers reading the sky, to traders interpreting markets—today we forecast game outcomes, stock prices, and even presidential elections. At scale, in modern prediction markets, forecasting becomes a recursive echo of millions of minds predicting one another. In doing so, it encodes social meaning and economic value, and spans the full spectrum of human computation from the finite to the boundless.

We present Echo, a full-stack prediction intelligence system built around EchoZ-1.0, the first large language model to be trained end-to-end with the Train-on-Future paradigm. Echo consists of three core components: (i) General AI Prediction Leaderboard, a dynamic evaluation engine that addresses key limitations in existing prediction benchmarks, including limited categories and poorly aligned prediction points across models; (ii) Train-on-Future, a post-training pipeline that teaches models to reason about future events rather than rely on past answers; and (iii) an AI-native prediction API that produces structured, evidence-based probability reports across a wide range of domains.

General AI Prediction Leaderboard: time-calibrated and question-source-balanced; an evergreen prediction benchmark.
EchoZ-1.0: Train-on-Future with a behaviour-oriented rubric; the state-of-the-art predictor.
The Prediction API: customer- and enterprise-level prediction service; outperforms human market experts.

General AI Prediction Leaderboard

A prediction problem is a question whose correct resolution lies in the future. Formally, each prediction instance $\mathcal{Q}_i$ is associated with a candidate answer set $\mathcal{A}_i$ and a resolution horizon $D_i$ (in days), where the true outcome remains unknown until some future time. The goal of a forecasting model is to assign, before resolution, a probability distribution over the possible answers.

Given such a prediction question $\mathcal{Q}_i$, the model produces at prediction time $t < D_i$ a probability vector:

$$\hat{\mathbf{p}}_i(t) = \bigl(\hat{P}(a_j \mid \mathcal{Q}_i,\, t)\bigr)_{j=1}^{|\mathcal{A}_i|}, \quad \sum_{j} \hat{P}(a_j \mid \mathcal{Q}_i, t) = 1, \quad t < D_i$$

This distribution represents the model’s belief over all candidate outcomes at time $t$, under the information available before the question resolves. A reliable benchmark should therefore evaluate the quality of $\hat{\mathbf{p}}_i(t)$ across a diverse set of prediction questions, while also providing a principled basis for comparing forecasts made at heterogeneous prediction times. Existing online benchmarks fall short on both requirements, due to two structural deficiencies:

Limitation I — Timing Asymmetry
Prediction difficulty is inherently uneven across an event’s lifecycle: as resolution approaches, relevant information accumulates, making later forecasts systematically easier than earlier ones. As a result, predictions made at different time points are not directly comparable. Existing benchmarks often sidestep this comparability issue with scale, aggregating large numbers of temporally misaligned forecasts instead of explicitly correcting for the underlying asymmetry.
Limitation II — Question Source Gap
Existing benchmarks draw almost exclusively from prediction markets and periodically updated data platforms, producing a proposition space that is easy to resolve but structurally narrow. This design systematically overlooks non-standard judgments in specialised domains, emerging topics that lack corresponding market infrastructure, and richer cross-domain forecasting tasks whose outcomes cannot be cleanly expressed in existing platform-native formats.

Leaderboard Design

To tackle the existing benchmarks' deficiencies, the Leaderboard is built around a three-stage pipeline for fair and scalable prediction evaluation. First, it acquires questions from three complementary sources for broad coverage. Second, it schedules and prioritizes prediction samples over each question’s lifecycle. Finally, evaluation is framed as pairwise battles and a global Elo-style optimization produces the final leaderboard, emphasizing robustness, reliability, expeditiousness, diversity, and flexibility.


Three-Source Data Acquisition and Automated Resolution

Source 1 — Prediction Market
Prediction markets such as Polymarket offer two valuable properties: consensus anchoring and precise resolution. The system applies multi-dimensional filtering to select eligible contracts: each contract must define explicit resolution rules, concern an event that has not yet occurred, and not already sit at decisive odds. Once the corresponding real-world event satisfies the resolution condition, the system resolves the ground-truth label automatically and with full certainty.
Source 2 — Synthesis from Trends
Real-time trend signals, including Google Trends, surface emerging topics of interest, while a web-crawling cluster gathers the supporting context around them. On top of this, an agentic question-synthesis engine generates challenging prediction questions about events that have not yet unfolded. Throughout each question’s lifecycle, answer agents periodically search the web, monitor developments, and close the resolution loop automatically once the event resolves, without human intervention.
Source 3 — Expert Annotation
Domain experts across fields such as scientific research, engineering, and medicine curate question sets centered on high-value uncertainties that arise naturally in their day-to-day professional workflows. They not only formulate the questions, but also provide authoritative adjudications at predefined future timestamps based on the final observed facts.

Prediction Data Selection: Two-Phase Algorithm

In principle, each prediction problem can be forecast at arbitrary time points before resolution, making exhaustive benchmarking impossible. We therefore introduce an efficient two-phase data selection scheduler designed to achieve three goals: (1) cover each question’s lifecycle with well-distributed prediction timestamps; (2) stabilize the day-to-day prediction workload; and (3) ensure a steady flow of resolved questions and maintain a live, continuously updated benchmark.

Phase 1 — Total Prediction Count Estimation

Unlike static benchmark questions, prediction questions evolve over time: the same question answered at different timestamps can carry different information and have different evaluation value. More concretely, later predictions are usually informed by more evidence and therefore tend to be more stable as resolution approaches. In our benchmark design, longer-horizon questions are assigned more prediction points, since they leave more time for new evidence to emerge. But if we increased the number of prediction points linearly with question lifetime, the cost would grow too quickly: a 30-day question would require 10 times as many predictions as a 3-day question. To avoid this, we use a logarithmic schedule that increases the number of prediction points as the lifetime $D_i$ grows. Let $T_i$ denote the target number of prediction points for question $i$. We implement this schedule with the following form:

$$T_i = \operatorname{round}(1.35 \cdot \ln D_i + 0.5)$$

We fit the parameters so that the schedule yields approximately 4 and 7 predictions for question lifetimes of 10 and 90 days, respectively. This logarithmic compression also reduces the spread in prediction frequency across questions, which helps naturally smooth the distribution of final-prediction dates.
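As a sketch, the schedule above reduces to a one-line function (the function name is ours):

```python
import math

def target_prediction_count(lifetime_days: int) -> int:
    """Logarithmic schedule: T_i = round(1.35 * ln(D_i) + 0.5)."""
    return round(1.35 * math.log(lifetime_days) + 0.5)

# A 3-day question gets 2 prediction points; a 30-day question gets 5,
# not the 10x that a linear schedule would imply.
```

Plugging in the fitted targets, `target_prediction_count(10)` gives 4 and `target_prediction_count(90)` gives 7, matching the calibration described above.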

Phase 2 — Priority Scoring

To decide which questions should be predicted next while keeping the schedule balanced over time, we assign each question a priority score. Let $W_i$ denote days elapsed since the last prediction, $P_i$ the number of predictions already completed, $R_i = T_i - P_i$ the remaining count, and $D_i'$ the remaining time until resolution. The priority score is:

$$S_i = \frac{W_i}{D_i' / R_i} = \frac{W_i R_i}{D_i'}$$

A higher $S_i$ means the question should be given higher priority. This scoring rule has two desirable properties. The first is elastic recovery: if a question remains unsampled for a long period, $W_i$ continues to grow until the question is scheduled again. This helps preserve coverage across the question's lifecycle (goal 1). The second is resolution-aware prioritization: as $D_i'$ gets smaller, the target interval $D_i' / R_i$ also gets smaller. This gives questions approaching resolution higher priority and helps maintain a steady flow of resolved questions (goal 3). Finally, each day the scheduler selects the top-$B$ questions by $S_i$. The budget $B$ controls the daily prediction volume (goal 2).
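A minimal sketch of the daily selection step; the dictionary field names are illustrative, not from the production system:

```python
def priority_score(wait_days: float, remaining: int, days_left: float) -> float:
    """S_i = W_i * R_i / D_i' (higher means schedule sooner)."""
    return wait_days * remaining / days_left

def daily_batch(questions: list[dict], budget: int) -> list[dict]:
    """Pick the top-B questions by priority score for today's predictions."""
    return sorted(
        questions,
        key=lambda q: priority_score(q["wait"], q["remaining"], q["days_left"]),
        reverse=True,
    )[:budget]
```

A question close to resolution (small `days_left`) or left unsampled for a long time (large `wait`) naturally rises to the top of the batch, reflecting the two properties described above.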

Multi-Point Aligned Elo Framework

The framework's core mechanism is prediction-point alignment. We compare models only when they make predictions on the same question at the same prediction point, ensuring the same information and time remaining until resolution. We then jointly estimate model ratings with a Bradley-Terry model by maximum likelihood.

Battle Construction

After alignment, we compute a Brier Score $\text{BS} = (1 - p^{c})^2$ for each prediction, where $p^{c}$ is the probability assigned to the correct outcome. Lower scores indicate better prediction performance. Each aligned comparison is then treated as a battle, indexed by $k$, with competitors denoted by $a_k$ and $b_k$.

Soft label $v_k$: A binary win/loss label is too coarse for probabilistic forecasting because it cannot capture differences in confidence. Even when two models predict the same correct outcome, they may assign very different probabilities. A strict binary label would treat these forecasts as equivalent, which would ignore meaningful differences in calibration. To preserve this information, we map the Brier Score difference between models $a_k$ and $b_k$ to a continuous target $v_k \in (0,1)$, where larger values mean that model $a_k$ performs better than model $b_k$ by a wider margin:

$$v_k = \frac{1}{1 + 10^{(\text{BS}_{a_k} - \text{BS}_{b_k})/\sigma}}, \quad \sigma = 0.10 \text{ (default; sensitivity analysed below)}$$

When the two models have the same Brier Score, $v_k = 0.5$. If model $a_k$ achieves a lower Brier Score than model $b_k$, then $v_k$ moves closer to 1; if $a_k$ performs worse, then $v_k$ moves closer to 0. The parameter $\sigma$ controls the sensitivity of the soft outcome to a given Brier Score gap.

Weight $w_k$: Longer-horizon forecasts are generally harder because they are made with less prior information. We therefore weight each battle by its prediction lead time $\Delta t_k$:

$$w_k = 1 + \gamma \ln(1 + \Delta t_k), \quad \gamma = 1.0$$

The parameter $\gamma$ controls the weighting strength. Each battle is therefore written as $(a_k, b_k, v_k, w_k)$, where $a_k$ and $b_k$ are the two models being compared, $v_k$ is the soft outcome, and $w_k$ is the battle weight.
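The battle construction above can be sketched directly with the stated defaults $\sigma = 0.10$ and $\gamma = 1.0$ (function names are ours):

```python
import math

def brier(p_correct: float) -> float:
    """BS = (1 - p^c)^2; lower is better."""
    return (1.0 - p_correct) ** 2

def soft_label(bs_a: float, bs_b: float, sigma: float = 0.10) -> float:
    """v_k in (0, 1); above 0.5 means model a had the lower Brier score."""
    return 1.0 / (1.0 + 10.0 ** ((bs_a - bs_b) / sigma))

def battle_weight(lead_time_days: float, gamma: float = 1.0) -> float:
    """w_k = 1 + gamma * ln(1 + Δt); longer-lead forecasts count more."""
    return 1.0 + gamma * math.log(1.0 + lead_time_days)
```

Note how the soft label preserves calibration differences: if both models pick the correct outcome but one assigns 0.9 and the other 0.6, their Brier scores (0.01 vs 0.16) still produce a label well above 0.5 rather than a tie.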

Bradley-Terry MLE

Let $\mathbf{r}$ be the vector of model ratings, and let $r_i$ denote the rating of model $i$. The Bradley-Terry model maps rating differences to pairwise win probabilities. In battle $k$, the probability that model $a_k$ beats model $b_k$ is:

$$P(a_k \succ b_k \mid \mathbf{r}) = \frac{1}{1 + 10^{(r_{b_k} - r_{a_k})/400}}$$

We estimate the ratings by maximizing the following weighted log-likelihood, where $v_k$ is the observed soft outcome and $w_k$ is the battle weight:

$$\mathcal{L}(\mathbf{r}) = \sum_{k} w_k \left[ v_k \log P(a_k \succ b_k \mid \mathbf{r}) + (1 - v_k) \log\bigl(1 - P(a_k \succ b_k \mid \mathbf{r})\bigr) \right]$$

This objective assigns higher ratings to models whose implied win probabilities match the observed soft outcomes across battles, with greater emphasis on battles that carry larger weights.

$\hat{\mathbf{r}} = \arg\max_{\mathbf{r}} \mathcal{L}(\mathbf{r})$ is solved with L-BFGS-B. Because the Bradley-Terry negative log-likelihood is convex in $\mathbf{r}$, the optimization converges to the global optimum.
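The fitting step can be sketched with NumPy and SciPy. Because the Bradley-Terry likelihood is invariant to a common shift of all ratings, some normalisation is needed; centering the ratings at zero is our assumption here, not a detail stated above:

```python
import numpy as np
from scipy.optimize import minimize

def fit_ratings(battles, n_models):
    """battles: iterable of (a, b, v, w) tuples; returns zero-centered ratings."""
    a, b, v, w = (np.array(x) for x in zip(*battles))

    def neg_log_likelihood(r):
        # P(a beats b) under the Elo-scaled Bradley-Terry model
        p = 1.0 / (1.0 + 10.0 ** ((r[b] - r[a]) / 400.0))
        p = np.clip(p, 1e-12, 1.0 - 1e-12)
        return -np.sum(w * (v * np.log(p) + (1.0 - v) * np.log(1.0 - p)))

    result = minimize(neg_log_likelihood, np.zeros(n_models), method="L-BFGS-B")
    return result.x - result.x.mean()  # pin the mean to zero for identifiability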

Robustness: Rank Stability Under Missing Scheduled Predictions

In practice, different models may not always complete their scheduled predictions on time every day due to API failures or server interruptions. A practical ranking method should therefore remain stable when some scheduled prediction days are missed. Average Brier Score is sensitive to this because it is computed only over the questions a model has answered. Once some days are omitted, the score is based on a smaller subset of questions, which may not have the same difficulty as the full set. As a result, the ranking can become unstable and may fail to genuinely reflect a model's prediction performance.

Elo works differently. Each battle compares two models only on questions that both of them answered. If one model misses a scheduled prediction day, the battles for that day simply do not happen, which means the remaining battles are still based on shared questions. As a result, missing scheduled predictions reduce the number of comparisons, but they do not change which questions each battle is based on.

In our experiment, we select a subset of models and simulate missing scheduled predictions by randomly dropping between 10% and 70% of their prediction days. For each drop rate, we run 50 trials and record the standard deviation of each model's rank across trials using the recomputed rankings. This measures how much the ranking varies when different scheduled prediction days are missing. We also report the gap using the Volatility Ratio (Average Brier rank Std ÷ Elo rank Std).

Result: As the drop rate increases, rank variability under Average Brier grows faster than under Elo at every tested drop rate. The Volatility Ratio stays between 1.4× and 1.8× across all drop rates, showing that Elo rankings remain consistently more stable when scheduled predictions are incomplete.

Figure: Intermittent Sweep

Reliability: Rank Consistency Under Model Set Variation

Since Average Brier Score computes each model's score independently, adding or removing other models does not affect the ranking of the remaining models. The situation is different for Elo rankings, which estimate all model ratings jointly from the full set of pairwise battles. When a model is removed, the ratings of other models may shift because some battles are no longer observed. If these shifts are large, the Elo ranking becomes unstable as the model set changes. To evaluate how sensitive Elo is to such changes, we design a Leave-$K$-Out experiment.

We first fit the Bradley–Terry model using all battles from the full model set and treat the resulting ranking as the reference. We then vary $K$ from 1 to 6. For each $K$, we sample a large number of model combinations to remove. Specifically, we remove all battles involving those $K$ models and refit the Bradley-Terry model on the remaining ones. We then compare the new ranking with the reference ranking after excluding the removed models. We use Kendall's $\tau$ to measure how much the ranking changes:

$$\tau = \frac{n_c - n_d}{\binom{N-K}{2}}$$

Here $N$ is the number of established models in the full model set, and the denominator $\binom{N-K}{2}$ indicates the total number of model pairs among the remaining models. For each pair, we check whether the two models appear in the same relative order in both the reference ranking and the new ranking. If they do, the pair is concordant ($n_c$); if the order is reversed, the pair is discordant ($n_d$). A value of $\tau = 1$ means all pairwise orders are preserved, while $\tau = 0$ means the number of preserved pairs equals the number of flipped pairs. For each $K$, we report the mean $\tau$ and the 95% confidence interval across all sampled removal combinations.
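The Kendall's $\tau$ computation above, restricted to the surviving models, can be sketched as follows (rankings are passed as model-to-rank dictionaries; this layout is our choice):

```python
from itertools import combinations

def kendall_tau(reference: dict, new: dict) -> float:
    """tau = (n_c - n_d) / C(N-K, 2) over the models present in `new`."""
    models = list(new)
    concordant = discordant = 0
    for m1, m2 in combinations(models, 2):
        sign = (reference[m1] - reference[m2]) * (new[m1] - new[m2])
        if sign > 0:
            concordant += 1      # same relative order in both rankings
        elif sign < 0:
            discordant += 1      # order flipped
    pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / pairs
```

A fully preserved ranking yields 1.0 and a fully reversed one yields -1.0, matching the interpretation given above.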

Result: When removing one model at a time (K = 1), the mean τ across all removal combinations is 0.994, indicating that nearly all pairwise orderings are preserved. As K increases to 6, mean τ decreases to 0.978, which still represents high rank agreement. The 95% confidence bands remain relatively narrow at every K, confirming that the Bradley–Terry rankings are robust to changes in the model set.

Figure: Leave-K-Out Robustness

Expeditiousness: New Model Ranking Convergence

When a new model joins the leaderboard, only a subset of questions has resolved, and those questions are more likely to be earlier-resolving ones. Under Average Brier Score, the new model is ranked only on this subset, while established models are evaluated on a broader mix of questions with both short and long resolution horizons. The two groups are therefore evaluated on different question distributions, which makes the comparison unfair. As more questions resolve over time, the distribution changes, and the Average Brier ranking changes with it.

Under the Elo ranking, each battle compares two models on the same question at the same prediction time. The comparison is therefore made within a shared setting. The difficulty of the question is shared by both models in that battle and does not favor one side over the other. Although a new model with only a few battles may receive a noisier rating, each battle still provides a valid relative comparison, and the rating is not systematically biased by the resolved question set. This suggests that Elo rankings would approach their stable values faster than Average Brier rankings when a new model joins. We test this with the following experiment. We simulate cold-start entry for each established model over a 20-day observation window. On each day $d$, we define the Rank Percentile Error as follows:

$$\text{Error}_m(d) = \frac{|\text{rank}_m(d) - \text{rank}_m^*|}{N}$$

where $N$ is the number of established models. For a given model $m$, we pretend that it has just joined and compute its rank using only its first $d$ days of predictions; this is denoted $\text{rank}_m(d)$. All other established models still use their full prediction history. We then compare this early rank with $\text{rank}_m^\ast$, the rank of model $m$ computed from all 20 days of its predictions. The absolute difference between $\text{rank}_m(d)$ and $\text{rank}_m^\ast$, divided by $N$, is the Rank Percentile Error on day $d$. We repeat this for every established model and average the resulting errors. For visual comparison, the figure normalises each method's error curve by its Day-1 value, so the two methods can be compared on the same scale.
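The error metric and the Day-1 normalisation are straightforward to express (function names are ours):

```python
def rank_percentile_error(early_rank: int, final_rank: int, n_models: int) -> float:
    """Error_m(d) = |rank_m(d) - rank_m*| / N."""
    return abs(early_rank - final_rank) / n_models

def normalize_to_day_one(curve: list[float]) -> list[float]:
    """Scale an error curve by its Day-1 value, as in the figure."""
    return [e / curve[0] for e in curve]
```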

Result: Elo's ranking error drops faster than Average Brier's throughout the 20-day window. To measure convergence speed, we record when each method's error first falls below 40% of its starting value. Elo reaches this point at Day 5.4, while Average Brier reaches it at Day 14.5. Elo converges 2.7× faster.

Figure: Newcomer Convergence

Diversity: Coverage Across Domains and Horizons

Prediction performance varies across domains. A model may perform well on political events but struggle with financial or scientific questions. To reduce domain bias in the rankings, Echo's question pool spans 7 domains: Politics & Governance, Economy & Finance, Sports & Entertainment, Science & Environment, Crypto & Digital Assets, Esports & Gaming, and Other. New questions are added daily so the benchmark continues to cover ongoing real-world events. The domain distribution is as follows:

Question Distribution by Domain (7 categories)

Politics & Governance: 334 questions (34.5%)
Economy & Finance: 191 (19.8%)
Sports & Entertainment: 147 (15.2%)
Crypto & Digital Assets: 92 (9.5%)
Science & Environment: 82 (8.5%)
Other: 72 (7.4%)
Esports & Gaming: 49 (5.1%)

Active prediction question distribution as of March 27, 2026

Flexibility: Open Participation Interface

Existing dynamic prediction leaderboards require participants to submit predictions periodically to keep the evaluation valid, which places a significant burden on them. Thanks to the design mechanisms above, Echo allows participants to freely choose the submission date and topic. The experiments above verify that Echo still provides reliable evaluation results under any submission pattern.


EchoZ-1.0: Train on the Future

Deep Research Agent Framework for Prediction

EchoZ-1.0 is built on the ReAct framework. At each timestep $t$, the agent performs three steps: it first generates a thought $\theta_t$ (what to do next and why), then takes an action $\alpha_t$ (such as running a web search or visiting a page), and finally receives an observation $o_t$ (the result of the action). These three steps form a triple $(\theta_t, \alpha_t, o_t)$. As the agent iterates, the full history of all previous triples accumulates into a trajectory:

$$H_t = (\theta_0, \alpha_0, o_0,\; \theta_1, \alpha_1, o_1,\; \ldots,\; \theta_{t-1}, \alpha_{t-1}, o_{t-1})$$

At each step, the agent decides what to do next based on the question $\mathcal{Q}_i$ and everything it has seen so far ($H_t$):

$$(\theta_t, \alpha_t) \sim \pi\!\left(\cdot \mid \mathcal{Q}_i,\, H_t\right)$$

The loop continues until the agent decides to produce a final answer. At that step $T$, the action $\alpha_T$ is set to "answer" and the agent outputs a probability distribution over the possible outcomes:

$$\hat{\mathbf{p}}_i(T) = \mathrm{Answer}(\mathcal{Q}_i,\, H_T)$$

This framework allows the model to actively search for information and reason over multiple steps before making a prediction. However, training such an agent introduces two key difficulties.
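The ReAct loop described above can be sketched in a few lines; `policy` and `tools` are hypothetical stand-ins for the model $\pi$ and its search/browse environment:

```python
def react_predict(question, policy, tools, max_steps: int = 20):
    """Minimal ReAct loop sketch for prediction."""
    history = []  # H_t: accumulated (thought, action, observation) triples
    for _ in range(max_steps):
        thought, action = policy(question, history)  # (θ_t, α_t) ~ π(· | Q_i, H_t)
        if action["name"] == "answer":
            return action["distribution"]  # p̂_i(T): probabilities over outcomes
        observation = tools[action["name"]](**action["args"])  # o_t
        history.append((thought, action, observation))
    return None  # step budget exhausted without a final answer
```

A stub policy that searches once and then answers exercises the full loop end to end.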

Deficiencies of Train-on-Past

Deficiency I — Engineering Paradox
Completely blocking access to answer-containing open-web resources is extremely difficult, arguably impossible, because critical websites continuously update their content as time passes, making the historical web state irrecoverable.
Deficiency II — Outcome-Oriented Bias
Real-world events exhibit high stochasticity and heavy-tailed distributions. Rigorous prediction logic may be penalised by low-probability events that nonetheless occur; defective guesses may succeed by chance. Direct outcome-based feedback leads models to overfit to noise, inducing survivorship bias.

Three Core Mechanisms of the Train-on-Future Pipeline

① Dynamic Question Synthesis
The system rejects static historical datasets in favour of a dynamic question synthesis engine driven by real-time data streams. An automated pipeline continuously ingests current macroeconomic indicators and frontier developments, and drives an agent to automatically generate novel, high-entropy prediction propositions about events yet to occur. Because every question concerns a future event, the data leakage pathway inherent in Train-on-Past is eliminated, and models must reason on genuinely unknown distributions.
② Rubrics Search
Training rewards are based on the model's reasoning process, not just whether the final prediction is correct. For each domain, we run a rubrics search to find the best scoring rubric. We start with a set of candidate rubrics, use each one to score model prediction trajectories, and rank the models accordingly. The rubric whose ranking best matches the Elo ranking from real outcomes is selected. Because the rubric is aligned with actual forecasting performance, the training signal is more stable and potentially transfers across different questions within the same domain.
③ Map-Reduce Agent Architecture
A distributed Map-Reduce agent execution architecture is designed on top of the trained model. In the Map phase, the architecture decomposes macro-level questions into orthogonal sub-task streams, dispatching multiple agents for domain-specific information collection and parallel logical reasoning. In the Reduce phase, an aggregation node resolves cross-source conflicts, aligns causal chains, and outputs a synthesised probabilistic decision. The loop supports multiple rounds of adaptive iteration, re-running Map-Reduce cycles until information coverage and reasoning depth plateau.
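The adaptive Map-Reduce loop of mechanism ③ can be sketched as follows; `decompose`, `run_agent`, and `aggregate` are hypothetical hooks for the components described above, not the production implementation:

```python
from concurrent.futures import ThreadPoolExecutor

def map_reduce_predict(question, decompose, run_agent, aggregate, max_rounds: int = 3):
    """Adaptive Map-Reduce agent loop sketch."""
    findings = []
    for _ in range(max_rounds):
        subtasks = decompose(question, findings)  # Map: orthogonal sub-task streams
        if not subtasks:  # coverage and reasoning depth have plateaued
            break
        with ThreadPoolExecutor() as pool:  # parallel information collection
            findings.extend(pool.map(run_agent, subtasks))
    return aggregate(question, findings)  # Reduce: synthesised probabilistic decision
```

Each round re-decomposes in light of accumulated findings; returning an empty subtask list is how `decompose` signals that another Map-Reduce cycle would add nothing.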

Each model produces a prediction trajectory for every question, recording its full prediction process from information gathering to final probability output. We cannot rely solely on resolved outcomes to judge trajectory quality, for two reasons. First, prediction outcomes are noisy: a model can reason well and still be wrong, or reason poorly and happen to be right. Second, many questions take days or weeks to resolve, making outcome-based feedback slow and sparse. We therefore need a way to evaluate the prediction process itself. The goal of the rubric search is to find a rubric that can assess trajectory quality and produce model rankings consistent with the Elo ranking.

Directly using an LLM to score an entire trajectory produces coarse-grained supervision and lacks consistency across runs. Instead, we decompose the evaluation into concrete dimensions, such as source reliability and probability calibration. Each dimension has clearly defined scoring levels, and together they form a rubric. The LLM judge scores each trajectory dimension by dimension. Compared to a single holistic score, this structured approach produces more stable and fine-grained results.

However, designing a good rubric by hand is difficult. Prediction questions are inherently noisy: even a well-reasoned trajectory can lead to a wrong outcome, and a poorly reasoned one can happen to be correct. Without grounding in data, a hand-designed rubric may not capture the dimensions that actually correlate with forecasting accuracy. This introduces bias into the evaluation. On top of that, different domain categories have different characteristics. The reasoning patterns that lead to accurate predictions in Sports & Entertainment could be different from those in Crypto & Digital Assets, and a single rubric cannot capture both well.

To avoid these issues, we ground the rubric design in data rather than human intuition. We frame it as a data-driven search problem: the search objective is to maximise Spearman's $\rho$ between the rubric-based model ranking and the Elo ranking. We run the search independently for each of the primary domain categories, starting from a general rubric. In each round, an LLM generates new rubric candidates informed by feedback from previous rounds, such as which dimensions contribute to the scoring process. Each candidate is evaluated on held-out data, and the top-performing candidates are retained for further iterations.
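The search loop can be sketched as a simple beam search over rubric candidates; `propose` (LLM candidate generation), `rank_models` (rubric-based scoring of trajectories), and `correlation` (e.g. Spearman's $\rho$ against the Elo ranking) are hypothetical hooks standing in for the components described above:

```python
def rubric_search(seed, propose, rank_models, elo_ranking, correlation,
                  rounds: int = 5, beam: int = 3):
    """Iterative rubric search sketch: keep the candidates whose induced
    model ranking best matches the Elo ranking."""
    pool = [seed]
    for _ in range(rounds):
        candidates = pool + [propose(r) for r in pool]  # mutate the survivors
        scored = sorted(
            ((correlation(rank_models(r), elo_ranking), r) for r in candidates),
            key=lambda pair: pair[0],
            reverse=True,
        )
        pool = [r for _, r in scored[:beam]]  # retain top performers
    return pool[0]
```

With toy hooks (rubrics as integers that creep toward a target "Elo" value), the loop demonstrably climbs toward the candidate with the highest correlation.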

Rubrics Improvement

A sample of the searched rubrics is shown below:

Rubrics by Domain

Resolution Criteria Parsing5 · 3 · 1

Evaluates whether the trajectory explicitly extracts and strictly tests evidence against all specific technical constraints and edge cases in the prompt.

5Explicitly extracts and strictly tests evidence against 2+ specific technical constraints from the prompt (e.g., 'sovereign UN member', 'majority of minutes').
3Notes the general goal but applies constraints loosely, missing subtle exclusions or edge cases.
1Ignores specific definitions entirely, leading to the inclusion of clearly disqualified events or locations.
Rhetoric vs Execution Disambiguation5 · 3 · 1

Assesses the ability to separate a politician's public threats, promises, or social media posts from formal administrative, legal, or military execution.

5Explicitly contrasts a political statement, threat, or social media post with the verifiable presence or absence of formal execution (e.g., court filings, signed orders).
3Notes political statements and questions their execution, but does not deeply investigate the formal execution mechanism.
1Treats a politician's verbal threat, promise, or social media post as definitive proof of a completed formal action.
Primary Institutional Sourcing5 · 3 · 1

Measures the effort to bypass secondary news summaries and directly access authoritative government, legal, or military repositories.

5Directly accesses or specifically searches for primary official repositories (e.g., CourtListener dockets, AFRICOM press releases, official embassy sites).
3Relies primarily on detailed secondary reporting from reputable news organizations to infer official actions.
1Relies entirely on search engine snippets, rumors, or unverified claims without seeking primary institutional confirmation.
Procedural Constraint Assessment5 · 3 · 1

Evaluates understanding of specific rules, schedules, and hurdles of relevant government, legal, or military institutions.

5Analyzes specific procedural rules, legislative calendars, or bureaucratic hurdles (e.g., SCOTUS opinion days, War Powers Act 48-hour rules, declassification review steps).
3Mentions general government delays or processes but lacks specific procedural details or institutional schedules.
1Assumes complex government, legal, or military actions can happen instantaneously without procedural friction.
Timeline Feasibility Calculation5 · 3 · 1

Measures whether the trajectory calculates the exact time left and assesses if the required actions are logistically and procedurally possible.

5Calculates the exact days/hours remaining and explicitly evaluates if required bureaucratic, legal, or military procedures can realistically conclude in that narrow window.
3Notes the deadline date but only loosely connects it to the pace of required events, without specific feasibility analysis.
1Ignores the remaining time entirely or assumes complex government/military actions can happen instantaneously.
Geospatial and Entity Precision5 · 3 · 1

Assesses the rigor of verifying exact geographic locations, borders, or official territorial and sovereign definitions.

5Verifies exact geographic details (e.g., specific intersections, border sides) or entity statuses (e.g., UN member state, designated cartel) using maps, coordinates, or official definitions.
3Identifies general regions, cities, or groups but fails to verify precise boundaries, coordinates, or official legal statuses required by the prompt.
1Assumes locations or entity statuses incorrectly (e.g., assuming an observer state is a UN member state or confusing maritime strikes with land strikes).
Negative Evidence Weighting5 · 3 · 1

Measures whether the trajectory actively uses the absence of information or action as a strong predictive signal.

5Actively searches for and heavily weights the absence of an event (e.g., empty dockets, lack of AFRICOM reports, empty public schedules) as a strong predictive signal.
3Notes missing information or a lack of news but does not deeply analyze what this absence means for the probability.
1Only looks for confirming evidence, completely ignoring the predictive value of what is missing or assuming the search tool simply failed.
Chronological Boundary Enforcement5 · 3 · 1

Checks strict adherence to the exact start and end dates specified in the prompt, ensuring past events fall inside the valid window.

5Strictly enforces the start and end dates, explicitly verifying if past events fall inside or outside the required timeframe (e.g., checking market creation timestamps against event occurrences).
3Mentions past events and the general timeframe but does not rigorously verify if they fall strictly inside the exact start-to-finish window.
1Explicitly uses an event that clearly falls outside the specified timeframe (e.g., before market creation) as proof of resolution.
False Positive Filtration (5 · 3 · 1)

Assesses the ability to identify and discard misleading search snippets, misdated articles, or misattributed actions.

5 — Actively scrutinizes search snippets to debunk misleading dates, recycled old news, metadata errors, or actions misattributed to the wrong actor.
3 — Notes a potential date discrepancy or odd headline but accepts it superficially without deep verification of the source.
1 — Blindly accepts misdated snippets, treating old events (e.g., 2019 videos, 2024 strikes) or template errors as current news.
Quantitative Base Rate Calculation (5 · 3 · 1)

Checks if the trajectory establishes a quantitative historical baseline or frequency to mathematically anchor its projection.

5 — Calculates a specific numerical frequency or historical baseline (e.g., X posts per day, Y strikes per week, average days pending) from past data to mathematically ground the projection.
3 — Mentions historical trends generally (e.g., 'they strike frequently', 'he posts a lot') without calculating a specific rate or numerical baseline.
1 — Makes no attempt to establish a historical baseline or frequency, relying entirely on narrative speculation.
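One concrete way to turn such a baseline into a probability — an illustrative assumption, not a method the rubric prescribes — is a Poisson model over the historical event rate:

```python
import math

def prob_at_least_one(rate_per_day: float, window_days: float) -> float:
    """P(at least one event in the window), assuming events arrive at a
    constant Poisson rate estimated from historical frequency."""
    return 1.0 - math.exp(-rate_per_day * window_days)

# Hypothetical baseline: ~0.4 strikes/day over the past 90 days,
# 10 days left in the question window.
prob_at_least_one(0.4, 10)  # ≈ 0.98
```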
Diplomatic and Proxy Disambiguation (5 · 3 · 1)

Assesses the ability to differentiate between the various roles nations play, such as mediators versus physical hosts, or proxies versus states.

5 — Explicitly distinguishes between mediating nations vs. physical host locations, or between proxy/allied actions vs. direct state-on-state kinetic actions.
3 — Notes multiple countries or groups involved but blurs their specific logistical, diplomatic, or command roles.
1 — Conflates the mediator country for the physical meeting location, or conflates proxy actions with direct state actions, treating them as identical.
Prediction Market Rule Calibration (5 · 3 · 1)

Evaluates the extraction of specific odds from prediction markets and the analysis of their underlying resolution rules.

5 — Extracts specific percentage odds from prediction markets AND explicitly analyzes their specific underlying resolution rules or relevant derivative markets.
3 — Mentions prediction market odds or general betting sentiment as a supporting data point but does not investigate the exact technical rules for how the market resolves.
1 — Completely ignores available prediction market data or fundamentally misinterprets what the market odds represent.
Catalyst and Off-Ramp Tracking (5 · 3 · 1)

Evaluates the identification of specific upcoming events that could trigger or prevent the predicted outcome.

5 — Pinpoints specific upcoming dates, scheduled meetings, expiring ultimatums, or holidays within the window that would serve as direct catalysts or off-ramps for the outcome.
3 — Mentions general upcoming events or trends but does not tie them directly to specific trigger dates or deadlines affecting the outcome.
1 — Analyzes the situation entirely statically, looking only at past events without searching for future triggers or catalysts.
Contradiction Resolution (5 · 3 · 1)

Assesses how the trajectory handles conflicting reports or data points discovered during the investigation.

5 — Actively identifies conflicting information (e.g., lagging official sites vs. breaking news, mismatched dates) and uses logical deduction to determine ground truth.
3 — Notes conflicting reports in the search results but leaves them unresolved, or arbitrarily picks one side without deep verification.
1 — Blindly accepts the first piece of information found, ignoring or missing obvious contradictions in the gathered data.
Reporting Lag Awareness (5 · 3 · 1)

Evaluates whether the trajectory accounts for delays in official reporting or docket updates.

5 — Recognizes that official reporting (e.g., AFRICOM releases, court docket uploads, ISW maps) may have a 24-48 hour lag and explicitly accounts for this delay in the timeline analysis.
3 — Takes current reports at face value without explicitly considering potential administrative or reporting lags.
1 — Assumes that if an event isn't reported today, it definitely hasn't happened recently, ignoring basic reporting delays.
Recent Window Monitoring (5 · 3 · 1)

Checks if the trajectory specifically searches for updates in the immediate 24-48 hours prior to the simulated current date.

5 — Conducts targeted searches specifically for the 24-48 hours immediately preceding the simulated current date to catch late-breaking developments or docket updates.
3 — Searches generally for recent news from the past few weeks but does not explicitly filter for the absolute latest daily updates.
1 — Relies on weeks-old or months-old information without checking if the situation changed right before the current date.
Semantic and Legal Rigor (5 · 3 · 1)

Evaluates how strictly the trajectory distinguishes between closely related but legally or militarily distinct terms.

5 — Carefully distinguishes between closely related but legally/practically distinct terms (e.g., 'entering' vs 'capturing', 'intelligence support' vs 'boots on the ground', 'military engagement' vs 'declaration of war').
3 — Shows some awareness of definitions but occasionally conflates similar terms in its reasoning.
1 — Blatantly conflates distinct concepts, leading to a fundamentally flawed evaluation of the resolution criteria.
Tool Failure Adaptation (5 · 3 · 1)

Evaluates the trajectory's persistence and adaptability when technical tools fail to return expected data.

5 — Explicitly documents a tool failure (e.g., 404 error, API limit, blank page) and actively pivots to alternative search queries, different APIs, or secondary sources to find the data.
3 — Encounters a tool failure and tries one basic alternative, but gives up quickly without exhausting options.
1 — Halts the investigation, skips reasoning, or hallucinates data after encountering a tool failure.
Sustained Condition Analysis (5 · 3 · 1)

Evaluates whether the trajectory verifies if a specific condition was sustained over a required period rather than a momentary blip.

5 — Explicitly analyzes whether a specific condition (e.g., market price > 20%, map shading) was sustained over a required duration (e.g., majority of minutes in a 2-hour window, full update cycle) rather than a momentary spike.
3 — Notes a historical spike or condition being met briefly but does not deeply analyze its duration or stability against the rules.
1 — Assumes a single reported momentary price point or unconfirmed report guarantees a sustained condition.
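The spike-versus-sustained distinction is easy to make explicit. A minimal sketch, assuming a per-minute price series and a "majority of minutes in the window" rule (both illustrative):

```python
def sustained(prices, threshold=0.20, required_fraction=0.5):
    """True if the series exceeds the threshold for at least the
    required fraction of observations, not merely in a brief spike."""
    above = sum(1 for p in prices if p > threshold)
    return above / len(prices) >= required_fraction

spike = [0.18] * 118 + [0.25, 0.26]   # 2 of 120 minutes above 20%
steady = [0.22] * 70 + [0.19] * 50    # 70 of 120 minutes above 20%
assert not sustained(spike)
assert sustained(steady)
```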
Evidence-Probability Calibration (5 · 3 · 1)

Measures how strictly the final probability reflects the gathered evidence, base rate math, and resolved contradictions.

5 — Final probability strictly and explicitly follows from the synthesized evidence and base rates, reserving extreme odds (<5% or >95%) only for mathematically or structurally guaranteed events.
3 — Final probability generally aligns with the evidence direction, but the specific number feels arbitrary or slightly overconfident/underconfident.
1 — Final probability contradicts the gathered evidence, uses extreme odds on highly uncertain future events, or is generated zero-shot.

Performance Analysis

Ranking Robustness Across Sensitivity Parameters

In the Elo framework, the parameter σ controls how Brier Score differences between two models are converted into win probabilities. When σ is small, a small Brier Score difference can produce a win probability close to 0 or 1, meaning that even a slightly better Brier Score on one question can strongly influence the battle outcome. When σ is large, the same difference produces a win probability closer to 0.5, and the ranking depends more on the number of battles won than on the size of each score gap.

The default value used in our framework is σ = 0.10. To check how much the ranking changes when σ varies, we sweep 9 values across σ ∈ [0.01, 0.50]. At each σ, we recompute the full Bradley–Terry ranking from all battles using the same data. For each model, we record its highest and lowest rank across the 9 evaluation settings.
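The post does not specify the functional form of the Brier-to-win-probability mapping; a logistic link is one natural choice that reproduces the limiting behaviour described (small σ makes tiny score gaps decisive, large σ pushes every battle toward 0.5). A sketch under that assumption:

```python
import math

def win_prob(brier_a: float, brier_b: float, sigma: float = 0.10) -> float:
    """Soft win probability for model A (lower Brier Score is better).
    The logistic form is an assumption; only the sigma limits are
    taken from the text."""
    return 1.0 / (1.0 + math.exp((brier_a - brier_b) / sigma))

win_prob(0.18, 0.22, sigma=0.01)  # ≈ 0.98: a tiny gap nearly decides the battle
win_prob(0.18, 0.22, sigma=0.50)  # ≈ 0.52: barely better than a coin flip
```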

Result: EchoZ is the most stable, staying at #1 across the entire σ range. Gemini-3.1-Pro fluctuates within #2–#3. Most other models fluctuate within 2–3 positions. GPT-5.2 shows the widest range at #2–#9 (8 positions). Overall, the ranking order is consistent across all evaluated σ values, with position changes concentrated in adjacent ranks.
[Figure: Sigma Sensitivity — per-model rank range across the σ sweep]

Win Rate Against the Human Market

Each battle compares EchoZ with the human market on the same question within the same prediction batch. Win rate is the fraction of battles EchoZ wins on Brier Score, stratified by domain (Politics & Governance), forecast horizon (7+ days ahead), and market certainty (human confidence 55%–70%).

Politics & Governance: 63.2%
Long-Term (7+ days): 59.3%
Market Uncertain: 57.9%
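The stratified win rate reduces to a per-stratum Brier comparison; a sketch with hypothetical battle records (ties counted as losses for simplicity):

```python
def win_rate(battles, stratum):
    """Fraction of same-question battles won on Brier Score within one
    stratum. Each record is (stratum, model_brier, market_brier);
    the data below are hypothetical."""
    scores = [(m, h) for s, m, h in battles if s == stratum]
    wins = sum(1 for m, h in scores if m < h)  # lower Brier Score wins
    return wins / len(scores)

battles = [
    ("politics", 0.10, 0.15),
    ("politics", 0.20, 0.12),
    ("politics", 0.08, 0.30),
]
win_rate(battles, "politics")  # 2/3
```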

AI-Native Prediction API

We are building an AI-native prediction API around EchoZ-1.0. Given a structured prediction question, the API is designed to return a fully grounded analytic report: a direct probabilistic answer, a stratified evidence base with source attribution, a counterfactual fragility assessment, monitoring recommendations, and a complete structured prediction record — each synthesised through a multi-round Map-Reduce agent cycle over live web evidence. The sample outputs below preview what a response looks like across the full domain taxonomy.
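Since the API is still being built, the request sketch below is hypothetical — the endpoint URL and field names are assumptions chosen to mirror the sample reports in this post:

```python
import json
import urllib.request

payload = {
    "question": "Which company will be the largest by market cap on 2026-03-31?",
    "options": ["NVIDIA", "Apple", "Alphabet"],
    "resolution_deadline": "2026-03-31T16:00:00-04:00",
}
req = urllib.request.Request(
    "https://api.unipat.ai/v1/predict",  # assumed endpoint
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# The response would carry the report sections shown in the samples:
# direct_answer, probability_distribution, evidence_base,
# counterfactual_fragility, monitoring_recommendations.
```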

Sample API Output

Economy & Finance — Prediction Date: 2026-03-18

Question: Which company will be the largest company in the world by market cap at market close ET on March 31, 2026?

NVIDIA · Apple · Tesla · Amazon · Microsoft · Alphabet · Saudi Aramco

Direct Answer: NVIDIA  |  Confidence: High

Probability Distribution

Candidate | Probability
NVIDIA | 0.98
Alphabet | 0.01
Apple | 0.01
Tesla / Amazon / Microsoft / Saudi Aramco | 0.00

Confidence Rationale

Driven by an insurmountable ~$700B mathematical buffer, combined with consistent market data showing low options volatility, resolved regulatory risks, and robust prediction market consensus (~98.7%).

Evidence Base

Evidence | Why Decisive | Role
NVIDIA market cap ~$4.43T–$4.45T; ~$700B premium over Alphabet ($3.74T) and Apple ($3.71T) | A $700B gap is mathematically insurmountable in 9 trading days absent a historic catalyst | Direct
U.S. Commerce Dept. withdrew sweeping AI chip export rule on March 13, 2026 | Eliminates the most significant regulatory headwind before the target date | Supporting
Options market implies ±1.98% expected move heading into March 20 Triple Witching | Derivatives markets are not pricing in a gamma crash that could erase the 15%+ lead | Supporting
Qatar helium facility shutdown: supply chain risk, but TSMC has not halted production | Macro supply chain risk has not materialised into an acute factory shutdown | Supporting
Polymarket assigns ~98.7% probability to NVIDIA remaining #1 | Reflects overwhelming market consensus consistent with mathematical realities | Supporting

Counterfactual Fragility: Low — A broad macro shock would impact the entire technology sector concurrently, preserving NVIDIA's relative lead. Only an idiosyncratic NVIDIA-specific Black Swan (e.g., C-suite fraud revelation, SEC trading halt) could reverse the prediction.

Monitoring Recommendations: Closing volumes and options volatility on March 20 Triple Witching · Qatar helium shortage escalation and TSMC production-halt risk · Unexpected idiosyncratic NVIDIA news · Unlikely megamerger announcements from Alphabet or Apple

Politics & Governance — Prediction Date: 2026-03-18

Question: Who will win the 2026 Illinois Governor Republican primary election?

Darren Bailey · Rick Heidner · Gregg Moore · Max Solomon · Ted Dabrowski · James Mendrick · Joseph Severino · Other

Direct Answer: Darren Bailey  |  Confidence: High

Probability Distribution

Candidate | Probability
Darren Bailey | 0.9990
Ted Dabrowski | 0.0004
Rick Heidner / James Mendrick | 0.0001 each
Gregg Moore / Max Solomon / Joseph Severino / Other | 0.0001 each

Confidence Rationale

Multiple authoritative media decision desks called the race; lead is mathematically certain (137K votes vs. ~34K remaining); legal challenges barred by the 95% vote-share threshold under Illinois election law.

Evidence Base

Evidence | Why Decisive | Role
AP, NBC, CBS, NYT declared Bailey winner on the night of March 17, 2026 | Consensus among major national media decision desks is the primary indicator of a resolved result | Direct
Bailey holds 298,844 votes (53.5%) vs. Dabrowski's 161,087 (28.8%); lead of 137,757 vs. ~34,000 remaining | Margin is ~4× remaining uncounted ballots; reversal is mathematically impossible | Direct
Illinois law (10 ILCS 5/22-9.1) requires runner-up ≥ 95% of winner's total for discovery recount; Dabrowski at ~54% | Eliminates any legal pathway for challenging the result via recount | Supporting
Illinois Republican Party Chairman officially congratulated Bailey | Institutional recognition confirms Bailey as presumptive nominee | Supporting
Moore and Severino were removed from the ballot in January 2026; Solomon ran for State Treasurer | Confirms multiple listed options were not viable gubernatorial candidates | Supporting

Counterfactual Fragility: Low — Only reversal scenario: sudden death or legal disqualification of Bailey prior to official certification (~April 7, 2026). Even then, Bailey would still have won the primary; Illinois Election Code Section 7-61 directs the Republican State Central Committee to fill the vacancy.

Monitoring Recommendations: Illinois State Board of Elections certification on/around April 7, 2026 · Breaking news regarding Bailey's health or legal eligibility prior to certification

Crypto & Digital Assets — Prediction Date: 2026-03-18

Question: Will Ethereum (ETH/USDT on Binance) reach a new all-time high price by 11:59 PM ET on March 31, 2026?

Yes · No

Direct Answer: No  |  Confidence: High

Probability Distribution

Outcome | Probability
No | 0.99
Yes | 0.01

Confidence Rationale

Overwhelming mathematical improbability of a 112%+ gain in 13 days, combined with strong macroeconomic headwinds (Fed holding rates at 3.50%–3.75%, US-Iran conflict) and unanimous prediction market consensus.

Evidence Base

Evidence | Why Decisive | Role
ETH/USDT ATH: $4,956.78 | Establishes the exact target price required for “Yes” resolution | Direct
Current price: ~$2,220–$2,340; requires ~112%–123% gain in 13 days | Establishes the baseline against which the target is assessed | Direct
USDT peg holds at $1.00; Binance order book deeply liquid ($35M at 2% depth) | Mathematically rules out artificial nominal price spike via Tether collapse or flash wick | Direct
Federal Reserve held rates steady at 3.50%–3.75% on March 18 amid US-Iran conflict | Confirms restrictive macroeconomic environment suppressing risk-asset parabolic rallies | Supporting
Polymarket assigns <1% probability to ATH by end of March | Demonstrates unanimous market consensus that the event is functionally impossible | Supporting

Counterfactual Fragility: Low — The only viable reversal would be a catastrophic Tether peg collapse (>50% overnight), which would mathematically double the nominal ETH/USDT price. Given Tether's current profitability, regulatory cooperation, and strict auditing, this scenario is exceptionally unlikely.

Monitoring Recommendations: USDT peg stability for catastrophic deviations from $1.00 · Binance ETH/USDT order book and matching engine anomalies through March 31 · Emergency Federal Reserve weekend liquidity interventions

Sports & Entertainment — Prediction Date: 2026-03-18

Question: Which team will finish as the #1 seed in the 2025–26 NBA Western Conference by the conclusion of the regular season on April 12, 2026?

15-team Western Conference field

Direct Answer: Oklahoma City Thunder  |  Confidence: High

Probability Distribution

Team | Probability
Oklahoma City Thunder | 0.899
San Antonio Spurs | 0.100
Los Angeles Lakers | 0.001
All others | 0.000

Confidence Rationale

Mathematical standings advantage (3.0-game lead with 13 games remaining), OKC primary stars healthy, magic number only 11, and SAS facing the league's toughest remaining schedule.

Evidence Base

Evidence | Why Decisive | Role
OKC (54–15) leads SAS (51–18) by 3.0 games; both have 13 games remaining | Establishes OKC's commanding mathematical advantage | Direct
SAS holds head-to-head tiebreaker (season series 4–1 over OKC) | SAS needs only to tie OKC's final record — not pass them | Direct
SAS faces the #1 toughest remaining schedule in the NBA (.560 opponent win %) | Makes a flawless run required to close a 3-game deficit extremely difficult | Direct
OKC's magic number to clinch the #1 seed outright: 11 | OKC requires only moderate success to mathematically lock the seed | Direct
LA Lakers (43–25) have a max possible 57 wins; OKC currently has 54 | Confirms this is a two-team race | Supporting

Counterfactual Fragility: Low — For SAS to steal the seed, a season-ending injury to Shai Gilgeous-Alexander or Chet Holmgren would need to trigger an OKC collapse while SAS runs the table against the league's hardest schedule.

Monitoring Recommendations: Outcome of NBA investigation into OKC resting 10 players (potential multi-game suspension) · Daily injury and rest status of Victor Wembanyama (ankle soreness) · San Antonio's performance during their early-April stretch against motivated playoff contenders

Esports & Gaming — Prediction Date: 2026-03-18

Question: Will another official trailer for Grand Theft Auto VI (GTA VI) be released by Rockstar Games or any of its official accounts by 11:59 PM ET on March 31, 2026?

Yes · No

Direct Answer: No  |  Confidence: High

Probability Distribution

Outcome | Probability
No | 0.96
Yes | 0.04

Confidence Rationale

Explicit corporate guidance directly contradicts a March timeline; absence of required technical and operational precursors; unanimous insider and market consensus against a March release.

Evidence Base

Evidence | Why Decisive | Role
Take-Two CEO Strauss Zelnick explicitly stated GTA VI marketing begins Summer 2026 (Q3 2026 earnings call) | Explicit corporate guidance directly contradicts a March release timeline | Direct
As of March 18, Rockstar has not initiated its traditional 48–72 hour teaser protocol (logo changes, social media blackouts) | A digital teaser protocol is a historically required precursor to any major Rockstar trailer drop | Direct
ESRB rating remains “Rating Pending”; storefronts show only “Wishlist” functionality without active pre-orders | Backend prerequisites for a major trailer launch remain incomplete | Direct
Rockstar posted a three-week GTA Online roadmap concluding on April 1, 2026 | Creates a clean schedule gap only after the prediction window closes | Supporting
Tier-1 industry insiders (e.g., Tom Henderson) predict next GTA VI trailer in August 2026 | Credible leaks corroborate the corporate Summer 2026 marketing timeline | Supporting
Polymarket traders assess probability of a March 31 trailer release at 4% | Strong financial and crowd-sourced consensus against a March release | Supporting

Counterfactual Fragility: Low — Reversal requires Take-Two to have deliberately misled investors during their Q3 earnings call and executed an unprecedented unannounced shadow-drop on March 31. If a sudden Rockstar social media blackout or unscheduled investor call materialised in the final week of March, “Yes” probability would rapidly exceed 80%.

Monitoring Recommendations: Rockstar official social media accounts for blackout protocols or logo changes in the final week of March · Internal YouTube API watchers or backend pre-order staging leaks · Unexpected SEC 8-K filings from Take-Two Interactive prior to March 31


Report Quality Assessment

The five case studies span the complete domain taxonomy — macroeconomic structure, electoral dynamics, digital asset markets, professional sports, and the entertainment industry — and collectively demonstrate that EchoZ-1.0 reports exhibit four defining qualities:

① Evidence Granularity
Each report identifies 4–6 distinct evidential items, assigns each a role classification (direct vs. supporting), links to verifiable primary sources, and articulates precisely why the evidence is decisive for the prediction.
② Probabilistic Rigour
Probability assignments are accompanied by explicit rationales anchored in quantitative constraints (mathematical gaps, statistical baselines, market consensus) rather than intuitive assertions. Residual probability mass is allocated with transparent justification.
③ Counterfactual Completeness
Every report includes a structured fragility assessment identifying the specific scenario that would reverse the prediction, enabling users to monitor for that precise signal rather than managing generic uncertainty.
④ Actionable Monitoring
Each report concludes with concrete, time-stamped monitoring items derived from identified investigation gaps, translating prediction uncertainty into a structured surveillance agenda.

Conclusion

Echo constitutes a self-consistent prediction intelligence architecture in which the evaluation engine, training paradigm, and production API are mutually reinforcing rather than independently constructed. The multi-point aligned Elo framework resolves the timing-asymmetry and cold-start deficiencies that plague aggregated Brier Score approaches; the Train-on-Future pipeline breaks the engineering paradox of leak-free backtesting, replacing outcome-oriented noise with behaviour-oriented signal; and the Map-Reduce agent architecture translates the trained model's causal reasoning capacity into structured, evidence-grounded reports. EchoZ-1.0 consistently outperforms the human market across domains, forecast horizons, and market certainty levels, positioning it as a substantively capable forecasting agent across the full spectrum of domain-diverse prediction tasks.

Citation

If you find Echo useful in your research, please cite:

@misc{unipat2026echo,
  title   = {Echo: Towards General AI Prediction},
  author  = {UniPat AI},
  year    = {2026},
  url     = {https://unipat.ai/blog/Echo}
}