Large Language Models (LLMs) have mastered writing poetry, generating code, and summarizing emails. But hand an LLM a spreadsheet of raw stock market data and ask, “What is the story here?”, and the results are often surprisingly lackluster.
While models like GPT-4 are excellent at fluency, they struggle with data narration—the ability to transform complex, structured data into meaningful, analytical stories. In the business world, this is a critical skill. It’s not enough to say “Stock X went up”; an analyst needs to explain the trend, identify the cause, and predict the implication.
Today, we are diving into a research paper titled “DATATALES: A Benchmark for Real-World Intelligent Data Narration.” This work introduces a new, rigorous benchmark designed to test whether AI can truly act as a financial analyst. We will explore how the dataset was built, the complex reasoning it requires, and why even state-of-the-art models are currently failing to make the grade.
The Problem: Why Current Benchmarks Fall Short
Before we look at the solution, we need to understand the gap in the field. There is a sub-field of AI called “Data-to-Text Generation.” Several benchmarks already exist here, such as:
- RotoWire: Generating basketball game summaries from box scores.
- ToTTo: Generating sentences from Wikipedia tables.
However, these tasks are primarily descriptive. They require the model to translate a cell value (e.g., “Points: 24”) into a sentence (“Player X scored 24 points”). They rarely require deep analysis.
Financial reporting is different. It requires analytical complexity. A financial report doesn’t just read numbers; it synthesizes them to find insights. As shown in the example below, a good report involves historical references (“since November”), trend analysis (“20 basis point swing”), and causal reasoning (“investors reassessed rate-hike wagers”).

The researchers realized that existing datasets were too simple to test this level of reasoning. They needed a benchmark that mirrored the difficulty of real-world financial analysis. Enter DATATALES.
Introducing DATATALES
DATATALES is a dataset comprising 4.9k financial market reports paired with their corresponding tabular data. Unlike previous datasets that might focus on small tables, DATATALES pairs narratives with comprehensive financial ticker data.
How Was It Built?
Creating a high-quality dataset for data narration isn’t as simple as scraping the web. The researchers followed a meticulous three-step process:
- Market Report Collection: They sourced daily market reports from financial platforms covering equities, treasuries, currencies, and commodities.
- Sentence Classification: Real-world reports often contain information not found in the data (like political news). To ensure the benchmark tests data reasoning, they filtered the text. They kept sentences about “Market Movements” and “Predictions” (which are grounded in data) and removed purely external context.
- Data Extraction: They aligned the text with actual historical market data (Open, High, Low, Close prices, Volume, etc.) extracted from sources like Yahoo! Finance (a rough sketch of this step appears below).
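A minimal sketch of what this extraction and alignment step could look like, assuming the `yfinance` package and a fixed lookback window; the paper does not specify its exact tooling, so treat the function name and parameters as illustrative:

```python
# Sketch of the data-extraction step: pull the OHLCV rows a report could be
# grounded in. Using yfinance is our assumption; the paper only says the data
# comes from sources like Yahoo! Finance.
import pandas as pd
import yfinance as yf

def extract_market_data(ticker: str, report_date: str, lookback_days: int = 7) -> pd.DataFrame:
    """Download Open/High/Low/Close/Volume rows ending on the report date."""
    end = pd.Timestamp(report_date)
    start = end - pd.Timedelta(days=lookback_days)
    # yfinance treats `end` as exclusive, so add one day to include the report date
    return yf.download(ticker, start=start, end=end + pd.Timedelta(days=1))

# Example: the table behind a hypothetical S&P 500 report dated 2023-06-01
table = extract_market_data("^GSPC", "2023-06-01")
print(table[["Open", "High", "Low", "Close", "Volume"]])
```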

As illustrated in Figure 2, this pipeline results in a “Clean” dataset where the text is strictly grounded in the numbers provided in the accompanying tables. This allows for a fair evaluation: if the model hallucinates a number, it’s a failure of reasoning, not a lack of access to external news.
A Leap in Complexity
How does DATATALES compare to the “easy” benchmarks mentioned earlier? The comparison table below highlights the difference. While datasets like ToTTo usually have small input sizes and require no advanced analysis, DATATALES involves large input sizes and demands Causal Relation, Trend Analysis, and Prediction.

The Core Method: Anatomy of a Financial Insight
The heart of this paper is the analysis of how insights are generated. The researchers didn’t just dump data; they categorized the types of reasoning required to write these reports. They identified seven key operations, ranging from simple to complex.
The Hierarchy of Operations
- Simple Lookup: Retrieving a specific number (e.g., “The stock closed at $100”).
- Basic Quantitative:
  - Comparison: “Stock A performed better than Stock B.”
  - Subtraction/Rate of Change: Calculating the difference or percentage gain/loss.
- Advanced Analytical:
  - Trend Analysis: Identifying patterns over time (e.g., “The 7th straight session of gains”).
  - Causal Analysis: Linking a movement to a driver (e.g., “Tech stocks led the Nasdaq higher”).
  - Predictive Analysis: Forecasting future movement based on current data.
The flowchart below visualizes how a model must navigate these operations. To generate a single sentence, the model might need to perform a lookup, calculate a rate of change, and then identify a trend.

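To make that chain concrete, here is a toy Python sketch (our own illustration, not code from the paper) that composes a lookup, a rate-of-change calculation, and a simple trend check over a series of closing prices:

```python
# Toy example of chaining operations from the hierarchy above:
# lookup -> rate of change -> trend, ending in one narrated sentence.
import pandas as pd

def lookup_close(prices: pd.Series, date: str) -> float:
    """Simple Lookup: retrieve the closing price on a given date."""
    return float(prices.loc[date])

def rate_of_change(prices: pd.Series, date: str) -> float:
    """Subtraction / Rate of Change: percentage move versus the prior session."""
    idx = prices.index.get_loc(date)
    prev, curr = prices.iloc[idx - 1], prices.iloc[idx]
    return (curr - prev) / prev * 100

def straight_gains(prices: pd.Series, date: str) -> int:
    """Trend Analysis: count consecutive sessions of gains ending on `date`."""
    idx, streak = prices.index.get_loc(date), 0
    while idx - streak - 1 >= 0 and prices.iloc[idx - streak] > prices.iloc[idx - streak - 1]:
        streak += 1
    return streak

prices = pd.Series(
    [98.0, 99.5, 100.2, 101.0, 101.8],
    index=["2023-05-26", "2023-05-30", "2023-05-31", "2023-06-01", "2023-06-02"],
)
close = lookup_close(prices, "2023-06-02")
pct = rate_of_change(prices, "2023-06-02")
streak = straight_gains(prices, "2023-06-02")
print(f"The index closed at {close:.2f}, up {pct:.2f}%, its {streak}th straight session of gains.")
```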
The Dimension of Time
One of the unique challenges in finance is that “data” isn’t just what happened today. It’s what happened yesterday, last week, and last month.
The researchers analyzed the “Time Gap” in their reports: the difference between the report date and the date of the data being referenced. As shown in the histogram below, while most references are to the same day, a significant portion refers to data from a week, a month, or even a year prior.

This implies that a capable model cannot just look at a single row of data. It must ingest a large window of historical data (e.g., the last 7 days of stock prices) to accurately narrate trends.
Experiments: Putting LLMs to the Test
The researchers tested several major models, including Llama-2-7B, Llama-2-13B, GPT-3.5-Turbo, and GPT-4. They evaluated the models in two settings:
- Zero-shot: Asking the model to write a report without seeing examples.
- Fine-tuned: Training the Llama models specifically on the DATATALES training set.
They also varied the input data: providing only the Same Day data versus providing 1 Week of historical data.
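The paper does not publish its exact prompt templates, but the two input settings can be pictured roughly as follows; the `serialize_table` helper and the prompt wording are our own assumptions:

```python
# Rough illustration of the "Same Day" vs. "1 Week" input settings:
# serialize either one row or a trailing week of rows into prompt text.
import pandas as pd

def serialize_table(table: pd.DataFrame, report_date: str, same_day_only: bool) -> str:
    window = table.loc[:report_date].tail(1 if same_day_only else 7)
    rows = window.round(2).to_csv()  # dates plus OHLCV columns as plain text
    return (
        "You are a financial analyst. Write a market report for "
        f"{report_date}, grounded only in the table below.\n\n{rows}"
    )

# same_day_prompt = serialize_table(table, "2023-06-01", same_day_only=True)
# one_week_prompt = serialize_table(table, "2023-06-01", same_day_only=False)
```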
How Do You Grade an AI Analyst?
Evaluating text generation is notoriously difficult. The authors used three primary metrics:
- Style: BLEU scores measuring how closely the generated text matched the phrasing of the human-written reference reports.
- Insightfulness: Human experts rated the generated reports on “Impact” (breadth of the claim) and “Significance” (magnitude of the change).
- Factuality: This is the most innovative and critical metric. Financial reports are useless if the numbers are wrong.
To test factuality, they used a “fill-in-the-blank” approach. They took a human-written report, cut it off right before a number, and asked the model to predict the next token based on the table.
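Conceptually, the check looks something like the sketch below; the regex, function names, and the `call_model` placeholder are ours, and the paper’s exact protocol may differ:

```python
# Fill-in-the-blank factuality probe: cut a reference sentence right before
# its first number and check whether the model, given the table, completes it
# with the correct value.
import re

def make_factuality_probe(sentence: str):
    """Split a reference sentence at its first number: returns (prefix, gold_number)."""
    match = re.search(r"\d+(?:[.,]\d+)*", sentence)
    if match is None:
        return None
    return sentence[: match.start()], match.group()

prefix, gold = make_factuality_probe("The Nasdaq closed 0.8% higher at 13,240.")
# prefix == "The Nasdaq closed ", gold == "0.8"

# `call_model` stands in for whatever LLM client is being evaluated.
# prediction = call_model(f"{table_as_text}\n\nComplete the sentence: {prefix}")
# is_factual = prediction.strip().startswith(gold)
```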

The Results: A Reality Check for AI
The results paint a sobering picture of current AI capabilities in specialized domains.
1. Factuality is Low
Even the best models struggled to get the numbers right. As seen in the breakdown below, accuracy drops significantly as the complexity of the operation increases.
- Simple Lookup: Models perform decently but not perfectly.
- Subtraction/Comparison: Performance degrades.
- Trend/Causal Analysis: Accuracy is very low (often below 20-30%).
Interestingly, GPT-4 (represented by the solid blue and orange bars) generally outperformed the open-source Llama models, but even GPT-4 struggled with “Trend Analysis” when given only same-day data.

The Historical Data Paradox: You might assume giving the model 1 week of data would improve accuracy. Surprisingly, for operations like “Lookup,” adding historical data often decreased accuracy. This is likely a “needle in a haystack” problem—the more data the model has to scan, the harder it is to retrieve the specific number required.
2. Insightfulness vs. Accuracy
There was a tradeoff between being “right” and being “insightful.”
- GPT-4 tended to be more factually accurate but less “insightful” in terms of significance. It stuck to safe, descriptive statements.
- Fine-tuned Llama models hallucinated facts more often, but their reports read more like expert analysis and earned higher “Impact” scores.
However, providing historical data did improve the Significance score. This confirms that to write a truly meaningful report, models need access to historical context, even if they currently struggle to process it accurately.

As shown in Table 5, sentences involving Causal Analysis and Predictive Analysis were rated as having the highest impact and significance. These are exactly the types of insights human analysts value most, and exactly the areas where current LLMs struggle the most.
3. Style Matters
Finally, looking at the writing style (Table 6), fine-tuning made a massive difference. The fine-tuned Llama models achieved much higher BLEU scores and used more correct domain-specific verbs and entities compared to the zero-shot GPT models. This suggests that while GPT-4 is smart, it doesn’t naturally speak “Finance” without specific instruction.

Conclusion & The Future of Data Narration
The DATATALES benchmark reveals that we are not yet at the point where we can blindly trust an LLM to act as a financial analyst. The task requires a combination of retrieval (finding the right number), arithmetic (calculating change), and logic (inferring cause and effect)—a “triple threat” that challenges even the most advanced models.
The authors formally define this challenge as a mapping function:
\[ y = M(T_{i,j} \mid i \leq E_T,\ j \leq D_T) \]
where the model \(M\) must generate a narrative \(y\) from a matrix of market data \(T\) spanning multiple entities \(E_T\) and days \(D_T\).
Key Takeaways:
- Complexity: Real-world data narration requires complex reasoning (trends, causality), not just data description.
- The Accuracy Gap: Models hallucinate numbers, especially when performing math or analyzing trends over time.
- Context is King: To generate significant insights, models need historical data, but handling large context windows without losing precision remains an open challenge.
The researchers suggest promising paths forward, such as using intermediate “insight recommendation” systems (calculating the insights before generating the text) or integrating visual data. For now, DATATALES stands as a rigorous testing ground for the next generation of intelligent data agents.