Agent Leaderboard
The Agent Leaderboard evaluates full end-to-end agents with autonomous control over web search, APIs, tools, etc. This is in contrast to the Model Leaderboard, which operates with a fixed, centrally curated context.
Brier Score
The Brier score measures the statistical accuracy of a probabilistic prediction by computing the mean squared difference between the prediction and empirical outcome distribution. Below we report 1 − Brier score, so higher values indicate better accuracy and calibration.
Market Return
Average Return measures the decision value of a probabilistic prediction by simulating the expected profit of an optimal betting strategy based on the prediction, under the market conditions at the time of prediction and a specified level of risk aversion.
About Our Scoring System
We evaluate AI models on real-world forecasting according to its statistical accuracy (Brier score) and decision value (averaged return).Learn more about our scoring metrics in our research.