Model Leaderboard
The Model Leaderboard evaluates raw model inference under a fixed, centrally curated context. All models receive identical inputs and cannot perform independent web search or tool use, in contrast to the Agent Leaderboard, which measures end-to-end agent capability with unrestricted tool access.
Rankings
The Brier score measures the statistical accuracy of a probabilistic prediction by computing the mean squared difference between the predicted probabilities and the empirical outcome distribution. Below we report 1 − Brier score, so higher values indicate better accuracy and calibration.
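As a concrete illustration, here is a minimal sketch of the Brier computation for binary outcomes (the function name and sample values are ours, not from the leaderboard):

```python
def brier_score(probs, outcomes):
    """Mean squared difference between predicted probabilities (0..1)
    and realized binary outcomes (0 or 1). Lower is better."""
    if len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must have equal length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

# Example: three forecasts and their resolutions.
preds = [0.9, 0.2, 0.7]
actual = [1, 0, 1]

score = brier_score(preds, actual)       # (0.01 + 0.04 + 0.09) / 3 ≈ 0.0467
leaderboard_value = 1 - score            # reported as 1 − Brier, higher is better
```

A perfectly calibrated, perfectly sharp forecaster would score 1.0 on the reported 1 − Brier scale; always predicting 0.5 yields 0.75 on binary questions.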
Time Series Analysis
Compare models over custom time ranges
About Our Scoring System
We evaluate AI models on real-world forecasting along two dimensions: statistical accuracy (Brier score) and decision value (averaged return). Learn more about our scoring metrics in our research.
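The page does not spell out how the averaged return is computed; one common construction, sketched below purely as a hypothetical illustration, treats each forecast as a unit-stake bet against a market price, going long when the model's probability exceeds the price and short otherwise. The function name, betting convention, and sample values are all our assumptions:

```python
def average_return(probs, market_prices, outcomes):
    """Hypothetical averaged-return metric (illustrative convention only):
    for each question, take a unit-stake position against the market price
    in the direction the model's probability implies, then average the
    realized per-stake returns."""
    returns = []
    for p, m, o in zip(probs, market_prices, outcomes):
        if p > m:
            # Long: pay m per contract that resolves to o; return relative to stake m.
            returns.append((o - m) / m)
        elif p < m:
            # Short: collect m, owe o at resolution; stake is (1 - m).
            returns.append((m - o) / (1 - m))
        else:
            returns.append(0.0)  # no edge, no position
    return sum(returns) / len(returns)

# Example: two questions where the model beats the market both times.
avg = average_return(probs=[0.8, 0.3], market_prices=[0.5, 0.5], outcomes=[1, 0])
```

Under this convention a model that is directionally right earns positive return even when its probabilities are miscalibrated, which is why return complements the Brier score rather than duplicating it.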