Large language models (LLMs) have evolved far beyond clever chatbots — they’re now powerful autonomous agents capable of reasoning, planning, and executing complex tasks. They can write code, assist in scientific discovery, and automate entire workflows in marketing or engineering.

This rapid progress raises an exciting question: can these AI agents conquer the stock market, one of the most challenging, high-stakes arenas in the world?

The potential is enormous. An AI capable of analyzing market data, interpreting news, and making profitable trades could transform finance. But testing whether an LLM has what it takes is not straightforward.

Most existing financial benchmarks for LLMs are like written exams — they measure “book smarts” through question-answering tasks (e.g., “What was Apple’s revenue last quarter?”). While these are useful, they’re nothing like the chaotic, fast-moving reality of live trading. Acing a finance quiz doesn’t mean you can make money when prices swing wildly.

To address that gap, the research paper STOCKBENCH introduces a benchmark that simulates real-world trading for LLM agents. Instead of answering trivia, agents are given a portfolio and must make sequential buy, sell, or hold decisions every day for several months, reacting dynamically to market prices, company fundamentals, and breaking news.

This article explores how STOCKBENCH was built, how the researchers turned generic LLMs into trading agents, and what happened when today’s top models entered this virtual trading floor. The results reveal both the promise and the limitations of AI in finance.


Why We Need a Realistic Benchmark for Financial AI

Before STOCKBENCH, evaluating AI trading ability was difficult and prone to bias. The authors argue that a worthwhile benchmark must satisfy three principles:

  1. Realistic Market Interaction — Agents should operate in a dynamic environment with constant change, responding to prices, fundamentals, and real-time news.
  2. Continuous Decision Making — Trading is iterative and long-term, not a one-off prediction. Agents must make repeated decisions over an extended horizon.
  3. Contamination-Free Data — Models must not have been trained on the test data. If a benchmark uses older historical data, the model may have “seen” the answers during training, making evaluation unfair.

Existing benchmarks often fail on one or more of these points.

A table comparing STOCKBENCH to nine other financial benchmarks. STOCKBENCH is the only one that checks all five desirable criteria: Market Simulation, Multi-Month Horizon, Continuous Decision, Contamination Free, and Direct Economic Value.

Table: STOCKBENCH meets all five criteria for a realistic, fair, and meaningful assessment of financial AI agents.


Inside STOCKBENCH: Building a Virtual Trading Floor

STOCKBENCH has two core components:

  1. Back-Trading Environment — a realistic simulation built from historical market data, fundamentals, and news.
  2. Stock-Trading Agent Workflow — a defined routine that the LLM follows each trading day.

An overview diagram of STOCKBENCH. On the left, the “Back-Trading Environment” shows the data inputs: investment targets, price & fundamental data, and news. On the right, the “Stock Trading Agent Workflow” outlines the 4-step process the agent follows: Portfolio Overview, In-depth Analysis, Decision Generation, and Execution.

Figure 1: STOCKBENCH framework: environment inputs on the left, agent workflow on the right.


The Back-Trading Environment: What the Agent Sees

Three pillars define the environment:

  1. Investment Targets — Agents trade only the top 20 highest-weighted stocks in the Dow Jones Industrial Average (DJIA). These large, stable companies reduce the impact of luck and ensure diverse industry coverage.

A donut chart showing the industry distribution of the 20 selected stocks, including Technology, Finance, Consumer & Retail, Industry & Manufacturing, and Medical Care, with company tickers listed for each sector.

Figure 2: Industry distribution of STOCKBENCH’s selected DJIA constituents.

  2. Historical Market Data — Each day, agents see the opening price, market cap, P/E ratio, dividend yield, and 52-week range for each stock.

  3. News Corpora — For each stock, agents get the five most relevant news headlines and summaries from the prior 48 hours. This lets them react to sentiment and events.

All data spans March 3 to June 30, 2025, a window chosen to fall after the knowledge cutoffs of every tested LLM, ensuring no data contamination.
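To make the inputs concrete, here is a sketch of what a single day's observation for one stock might look like. The field names and values are illustrative assumptions, not the benchmark's actual data format:

```python
# A hypothetical daily observation for one stock. Field names and values
# are illustrative; the benchmark's actual format may differ.
observation = {
    "date": "2025-03-03",
    "ticker": "MSFT",
    "open_price": 398.25,              # opening price in USD
    "market_cap": 2.96e12,             # market capitalization in USD
    "pe_ratio": 32.1,                  # price-to-earnings ratio
    "dividend_yield": 0.0075,          # dividend yield as a fraction
    "week52_range": (344.79, 468.35),  # 52-week low and high
    "news": [                          # up to five items from the prior 48 hours
        {"headline": "...", "summary": "...", "published": "2025-03-01T14:05Z"},
    ],
}
```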


The Stock-Trading Agent Workflow: The Daily Routine

The workflow repeats each market day:

  • Step 1: Portfolio Overview — Morning scan: see prices, holdings, and recent news.
  • Step 2: In-Depth Stock Analysis — Select stocks for deeper review; receive detailed fundamentals.
  • Step 3: Decision Generation — Choose to increase, decrease, or hold each stock position, specifying target dollar allocations.
  • Step 4: Execution & Validation — Target dollar allocations are converted into share quantities and checked against available cash; invalid trades are revised before execution.

This process transforms an LLM from a text model into a trading agent making daily, risk-conscious decisions.
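To see how the four steps fit together, here is a minimal Python sketch of the daily loop. The `env` and `agent` interfaces and every method name are assumptions for illustration, not the benchmark's published API:

```python
def run_episode(env, agent):
    """One back-trading episode: repeat the four-step routine each market day.
    `env` and `agent` are hypothetical interfaces standing in for the
    benchmark's actual components."""
    for day in env.trading_days():
        # Step 1: portfolio overview -- prices, holdings, recent news.
        overview = env.observe(day)

        # Step 2: in-depth analysis of the stocks the agent selects.
        picks = agent.select_stocks(overview)
        details = env.fundamentals(day, picks)

        # Step 3: decision generation -- target dollar allocations per stock.
        decisions = agent.decide(overview, details)

        # Step 4: execution & validation -- convert dollars to shares,
        # reject trades the cash balance cannot cover, request a revision.
        orders, errors = env.validate(day, decisions)
        if errors:
            orders, _ = env.validate(day, agent.revise(decisions, errors))
        env.execute(day, orders)
    return env.portfolio_value()
```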


The Grand Experiment: Testing LLMs in STOCKBENCH

The Setup

  • Models: Proprietary giants like GPT-5 and Claude-4-Sonnet, plus leading open-weight models such as Qwen3, Kimi-K2, DeepSeek, and GLM-4.5.
  • Starting Point: Each agent begins with $100,000 and no holdings.
  • Duration: 82 trading days over four months.
  • Baseline: An equal-weight, buy-and-hold portfolio, the classic passive strategy (a minimal computation sketch follows this list).
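For reference, the passive baseline is straightforward to compute. The sketch below assumes a `(days, n_stocks)` price matrix and, as a simplification, ignores dividends and whole-share constraints:

```python
import numpy as np

def passive_baseline(prices, initial_cash=100_000.0):
    """Equal-weight buy-and-hold: split cash evenly across all tickers on
    day 0, then hold. `prices` is a (days, n_stocks) array of daily prices.
    A sketch -- whole-share constraints and dividends are ignored."""
    prices = np.asarray(prices, dtype=float)
    n_stocks = prices.shape[1]
    shares = (initial_cash / n_stocks) / prices[0]  # shares bought per stock
    return prices @ shares                          # portfolio value per day
```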

Scoring the Agents

Three metrics measure success:

  1. Final Return — Total % change in portfolio value: \[ \mathrm{Final~Return} = \frac{V_T - V_0}{V_0} \]
  2. Maximum Drawdown — Worst % drop from a peak: \[ \mathrm{Max~Drawdown} = \min_{t \in [0,T]} \left( \frac{V_t - \max_{s\leq t} V_s}{\max_{s\leq t} V_s} \right) \]
  3. Sortino Ratio — Risk-adjusted return penalizing only downside volatility: \[ \mathrm{Sortino~Ratio} = \frac{R_p}{\sigma_d}, \quad \sigma_d = \sqrt{\frac{1}{N_d} \sum_{i=1}^{N_d} \min(R_i,0)^2} \]

These are combined into a Composite Rank:

\[ \mathrm{Composite~Rank} = \frac{z(\mathrm{Final~Return}) - z(\mathrm{Max~Drawdown}) + z(\mathrm{Sortino~Ratio})}{3} \]
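A minimal Python implementation of these metrics might look like the following. Two assumptions to flag: `R_p` is taken here as the mean daily return, and the composite rank passes max drawdown as a positive magnitude so that subtracting its z-score penalizes larger drops; the paper's exact conventions may differ.

```python
import numpy as np

def final_return(values):
    """Total % change in portfolio value: (V_T - V_0) / V_0."""
    return (values[-1] - values[0]) / values[0]

def max_drawdown(values):
    """Worst % drop from a running peak; returns a value <= 0."""
    values = np.asarray(values, dtype=float)
    running_peak = np.maximum.accumulate(values)
    return float(np.min((values - running_peak) / running_peak))

def sortino_ratio(daily_returns):
    """Mean daily return over downside deviation (only losing days count)."""
    r = np.asarray(daily_returns, dtype=float)
    sigma_d = np.sqrt(np.mean(np.minimum(r, 0.0) ** 2))
    return float(r.mean() / sigma_d) if sigma_d > 0 else float("inf")

def composite_rank(models):
    """Average of z-scored metrics across models, following the formula above.

    `models` maps name -> (final_return, max_drawdown_magnitude, sortino).
    Drawdown is passed as a positive magnitude -- an assumption about the
    paper's sign convention."""
    names = list(models)
    m = np.array([models[n] for n in names], dtype=float)
    z = (m - m.mean(axis=0)) / m.std(axis=0)   # z-score each metric across models
    scores = (z[:, 0] - z[:, 1] + z[:, 2]) / 3.0
    return dict(zip(names, scores))
```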

Results

Table 2 showing the performance of 14 models on STOCKBENCH. Kimi-K2 ranks first overall, with a 1.9% return and -11.8% max drawdown. Several LLMs outperform the Passive Baseline, which had a 0.4% return and -15.2% max drawdown.

Table 2: Performance comparison of LLM agents vs. the passive baseline. A less negative max drawdown and a higher Sortino ratio indicate better risk management.

Key findings:

  1. AI Can Be Profitable — Most agents beat the baseline’s 0.4% return. Kimi-K2 reached 1.9%, with Qwen3 variants up to 2.5%.
  2. Better Risk Management — Every agent posted a smaller drawdown than the baseline’s -15.2%; top models stayed near -11%.
  3. Reasoning ≠ Trading Excellence — Complex reasoning models didn’t always outperform instruction-tuned peers in trading, showing that market decision-making requires more than raw reasoning skill.

Deeper Analysis: What Impacts Performance?

Scaling the Portfolio

Table 3 showing that as the number of stocks increases from 5 to 30, the performance of both Kimi-K2 and GPT-OSS-120B degrades, with returns falling and volatility rising. Kimi-K2 shows more robustness than the smaller model.

Table 3: Portfolio size vs. performance. Larger portfolios strain agent capacity.

Performance dropped as the number of tradable stocks increased — especially for smaller models like GPT-OSS-120B. Larger models like Kimi-K2 remained more resilient at moderate sizes (10–20 stocks).


Common Agent Errors

The workflow requires agents to perform arithmetic and to emit strictly formatted JSON, and two error types dominate (a validation sketch follows the list):

  • Arithmetic Errors — Miscalculating share quantities.
  • Schema Errors — Violating the JSON output format.
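The sketch below shows how such a validation step might surface both error classes. The decision schema is hypothetical, since the paper's exact JSON format is not reproduced here:

```python
import json

# Hypothetical decision schema -- the benchmark's actual format may differ.
REQUIRED_KEYS = {"ticker", "action", "target_dollars"}
VALID_ACTIONS = {"increase", "decrease", "hold"}

def validate_decisions(raw, cash_available, prices):
    """Flag the two failure modes: schema errors (malformed output) and
    arithmetic errors (planned buys the cash balance cannot cover)."""
    try:
        decisions = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"schema error: invalid JSON ({exc})"]
    if not isinstance(decisions, list):
        return None, ["schema error: expected a JSON array of decisions"]

    errors, planned_spend = [], 0.0
    for d in decisions:
        if not isinstance(d, dict) or not REQUIRED_KEYS <= d.keys():
            errors.append(f"schema error: malformed decision {d!r}")
            continue
        if d["action"] not in VALID_ACTIONS:
            errors.append(f"schema error: unknown action {d['action']!r}")
            continue
        ticker = d["ticker"]
        if ticker not in prices:
            errors.append(f"schema error: unknown ticker {ticker!r}")
            continue
        if d["action"] == "increase":
            # Dollars -> whole shares at today's price; fractional shares
            # are assumed unavailable. Sells are ignored in this cash check.
            shares = int(d["target_dollars"] // prices[ticker])
            planned_spend += shares * prices[ticker]
    if planned_spend > cash_available:
        errors.append("arithmetic error: planned buys exceed available cash")
    return decisions, errors
```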

A bar chart comparing error rates for ‘Think’ vs ‘Instruct’ models. ‘Think’ models have fewer arithmetic errors but more schema errors, while ‘Instruct’ models show the opposite pattern.

Figure 3: Error rates by model type. ‘Think’ models excel at math accuracy but often violate the output format; ‘Instruct’ models show the reverse pattern.


Do Agents Use the Data?

An ablation study removed the news data, then both news and fundamentals.

Table 4 showing an ablation study. For both Kimi-K2 and GPT-OSS-120B, removing news and then fundamental data causes the cumulative return to consistently decrease, proving the value of these information sources.

Table 4: Removing key data sources consistently reduces returns.

Results confirm agents rely on both modalities — textual news and numerical fundamentals — to guide trades.


Market Regimes: Bull vs. Bear

A chart showing how model rankings based on cumulative return change between a downturn period and an upturn period. Some models, like GPT-OSS-120B, perform much better in the upturn, while Kimi-K2 is more stable across both.

Figure 4: Agent performance rankings shift between bear (downturn) and bull (upturn) markets.

Agents struggled in a downturn: none beat the baseline. In an upturn, most surpassed it. Kimi-K2 stayed relatively stable across both scenarios, while others excelled only in bullish conditions.


Conclusion: The Road Ahead for AI Traders

STOCKBENCH delivers the most realistic evaluation so far of LLMs as stock-trading agents. The takeaways are:

  • Promise: LLM agents can integrate diverse market signals, make profitable trades, and manage risk better than passive strategies.
  • Limitations: Returns are modest, models struggle with larger portfolios, are sensitive to market regimes, and aren’t yet consistently superior to baselines — especially in bearish markets.

By open-sourcing STOCKBENCH, the researchers provide a valuable foundation for advancing AI trading agents. The road to a true AI Wall Street contender remains long, but benchmarks like STOCKBENCH offer the tools to navigate it.