The capabilities of Large Language Models (LLMs) like GPT-4 and Llama 2 have exploded in recent years. We know they can write poetry, debug code, and summarize history. But can they look at a string of numbers representing a stock price or a patient’s heart rate and “understand” what is happening?

Time series analysis—the study of data points collected over time—is critical for finance, healthcare, climate science, and energy. Traditionally, this domain has belonged to statistical models (like ARIMA) or to specialized deep learning architectures. However, researchers from J.P. Morgan AI Research recently asked a compelling question: Can general-purpose LLMs analyze time series data without specific fine-tuning?

In their paper, Evaluating Large Language Models on Time Series Feature Understanding, the authors propose a rigorous framework to test just that. This post dives into their methodology, their new taxonomy of time series features, and the surprising (and sometimes disappointing) results of how modern AI handles numerical sequences.

The Core Problem: Text vs. Temporal Data

LLMs are trained on vast corpora of text. While they encounter numbers, they process them as tokens (text fragments) rather than mathematical entities. When a data scientist looks at a time series, they look for specific characteristics:

  • Is it going up or down? (Trend)
  • Does it repeat every week? (Seasonality)
  • Did something weird happen on Tuesday? (Anomalies)
  • Is the data becoming more erratic? (Volatility)

For an LLM to be useful in automated reporting or analysis, it needs to recognize these features purely from a textual representation of the data. The researchers aimed to identify which of these features LLMs can natively comprehend and where they fail.
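
To make “a textual representation of the data” concrete, here is a minimal sketch of what a model might actually be shown. The serialization template below is an assumption for illustration; the specific formats the authors tested are covered later in this post.

```python
# Illustrative only: how a numeric series might be rendered as text
# before being handed to an LLM. The template is assumed here,
# not taken from the paper.
import numpy as np

def serialize_plain(dates, values):
    """One 'Date: ..., Value: ...' line per observation."""
    return "\n".join(f"Date: {d}, Value: {v:.2f}" for d, v in zip(dates, values))

dates = [f"2024-01-{day:02d}" for day in range(1, 8)]
values = np.array([100.0, 101.5, 103.2, 102.8, 105.0, 106.1, 104.9])

prompt = (
    "Here is a time series:\n"
    + serialize_plain(dates, values)
    + "\n\nIs there a trend in this time series? Answer Yes or No."
)
print(prompt)
```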

Step 1: A Taxonomy of Time Series

To evaluate “understanding,” one must first define it. The authors developed a comprehensive taxonomy of time series features. This isn’t just a list; it is a hierarchy of complexity ranging from simple visual patterns to complex statistical properties.

Table 1: Taxonomy of time series characteristics.

As shown in Table 1 above, the taxonomy is split into Univariate (single variable) and Multivariate (multiple variables) categories:

  1. Trend & Seasonality: The most basic features. Is the data moving in a direction? Is there a cycle?
  2. Anomalies: Spikes, level shifts (where the average value suddenly changes), or missing data.
  3. Volatility: This is harder. It refers to the variance. “Clustered volatility” is common in finance, where markets are calm for a while, then chaotic, then calm again.
  4. Structural Breaks & Stationarity: These are advanced statistical concepts. A structural break might occur if the underlying mechanism generating the data changes (e.g., a policy change in economics). Stationarity checks if the statistical properties (mean, variance) remain constant over time.
  5. Multivariate Features: How do two lines relate? Do they move together (Correlation)? Does one lead the other (Lead-lag)?
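
For readers who want to see what these features look like outside of an LLM, the sketch below measures a few of them with standard Python tooling (numpy, pandas, statsmodels). This is the conventional statistical route, not anything the paper’s models compute internally; the period and parameters are illustrative.

```python
# Conventional ways to measure a few taxonomy features on a series x.
# These are standard statistical checks, not part of the paper's LLM pipeline.
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller, acf

rng = np.random.default_rng(0)
t = np.arange(200)
x = 0.05 * t + np.sin(2 * np.pi * t / 20) + rng.normal(0, 0.3, size=200)

# Trend: slope of a least-squares line fit.
slope = np.polyfit(t, x, 1)[0]

# Seasonality: a large autocorrelation at the candidate period (here, 20).
seasonal_strength = acf(x, nlags=20)[20]

# Volatility: rolling standard deviation over a 20-point window.
rolling_vol = pd.Series(x).rolling(20).std()

# Stationarity: Augmented Dickey-Fuller test (small p-value suggests stationarity).
adf_pvalue = adfuller(x)[1]

print(f"slope={slope:.3f}, acf@20={seasonal_strength:.2f}, ADF p-value={adf_pvalue:.3f}")
```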

Step 2: Generating the Benchmark Data

One challenge in evaluating time series models is that real-world data is messy and often lacks “ground truth.” If you look at a stock chart, experts might disagree on exactly when a trend started or ended.

To solve this, the authors created a synthetic dataset. By generating the data mathematically, they knew exactly what features were present. If they generated a sine wave with a spike at index 50, they knew the ground truth was “Seasonality + Anomaly.”
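
A minimal sketch of that idea, using the “sine wave with a spike at index 50” example above; the generator and its parameters are illustrative stand-ins, not the paper’s exact ones.

```python
# Build a synthetic series whose ground-truth label is known by construction:
# "Seasonality + Anomaly". Parameters are illustrative, not the paper's.
import numpy as np

rng = np.random.default_rng(42)
n = 100
t = np.arange(n)

seasonal = np.sin(2 * np.pi * t / 25)   # period-25 sine wave
noise = rng.normal(0, 0.1, size=n)      # small Gaussian noise
series = seasonal + noise
series[50] += 3.0                       # sudden spike at index 50

ground_truth = {"seasonality": True, "anomaly_index": 50, "trend": False}
```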

Figure 1: Example synthetically generated time series.

Figure 1 illustrates the diversity of this synthetic data. You can see simple trends with noise (top-left), complex regime changes where the behavior shifts halfway through (top-right), and non-stationary data that wanders randomly (bottom-left).

The researchers didn’t just generate numbers; they generated textual descriptions for them. This allowed them to test whether the LLM can match a chart to its description.

For a closer look at the specific shapes the LLMs were tested on, look at the univariate examples below. Notice the “Step spike” (a permanent jump) versus a “Sudden spike” (a blip).

Table 6: Examples of the generated univariate time series.

The Experiment: Testing the Models

The researchers tested five prominent LLMs: GPT-4, GPT-3.5, Llama2-13B, Vicuna-13B, and Phi-3.

They evaluated the models on four distinct tasks:

  1. Feature Detection: A binary “Yes/No” question. (e.g., “Is there a trend in this time series?”)
  2. Feature Classification: Multiple choice. (e.g., “Is the trend positive, negative, or quadratic?”)
  3. Information Retrieval: Finding specific data points. (e.g., “What is the value on 2024-01-01?”)
  4. Arithmetic Reasoning: Computing values. (e.g., “What is the minimum value in the series?”)

They used different prompting strategies, including Zero-Shot (just asking) and Chain-of-Thought (CoT) (asking the model to “think step by step”).
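
The snippet below sketches how such prompts might be assembled for the four tasks, in both zero-shot and chain-of-thought form. The wording is assumed for illustration and is not quoted from the paper.

```python
# Hypothetical prompt templates for the four tasks, in zero-shot and
# chain-of-thought flavours. Wording is assumed, not the paper's templates.
SERIES_TEXT = "Date: 2024-01-01, Value: 100\nDate: 2024-01-02, Value: 103\n..."

TASKS = {
    "detection":      "Is there a trend in this time series? Answer Yes or No.",
    "classification": "Is the trend positive, negative, or quadratic?",
    "retrieval":      "What is the value on 2024-01-01?",
    "arithmetic":     "What is the minimum value in the series?",
}

def build_prompt(task, use_cot=False):
    """Combine the serialized series, an optional CoT instruction, and the question."""
    question = TASKS[task]
    cot = "Think step by step before giving your final answer.\n" if use_cot else ""
    return f"Here is a time series:\n{SERIES_TEXT}\n\n{cot}{question}"

print(build_prompt("detection", use_cot=True))
```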

Key Results: Who is the Time Series Champion?

The results highlight a significant divide between proprietary “frontier” models (GPT-4) and smaller open-source models.

1. General Performance Overview

The radar charts below summarize the performance. The red line (GPT-4) consistently encompasses the others, indicating superior performance across almost all metrics.

Figure 2: Feature detection and arithmetic reasoning scores.

Key Takeaways from the Radar Charts:

  • Feature Detection (Left Chart): Almost all models detect a “Trend” well (high scores on the Trend axis). However, look at “Stationarity” and “Structural Break”: the scores collapse toward the center. This indicates that while LLMs can see “up” or “down,” they struggle with complex statistical concepts.
  • Arithmetic (Right Chart): The disparity is stark. GPT-4 and GPT-3.5 are near perfect at finding Min/Max values and dates. The smaller models (Vicuna, Llama2) struggle significantly with these retrieval tasks.

2. The Limits of “Understanding”

While the radar chart looks good for GPT-4, the detailed breakdown reveals limitations.

  • Trends: GPT-4 achieved an F1 score of 0.89 in trend detection using Chain-of-Thought prompting.
  • Seasonality: It scored a massive 0.98.
  • The Hard Stuff: For “Stationarity” (is the mean constant?), GPT-4’s zero-shot score was 0.33—essentially random guessing or refusing to answer.

Interestingly, for complex statistical questions like stationarity, GPT-4 often hallucinated or simply stated it couldn’t perform the statistical test required, which is technically true (it’s a language model, not a statistical software package).

3. Arithmetic and Retrieval

One might assume that finding the maximum value in a list of numbers is easy for a computer. But for an LLM, which predicts the next token based on probability, numbers are tricky.
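
You can see the problem for yourself with a tokenizer. The snippet below uses OpenAI’s tiktoken library purely as an illustration (it is not an experiment from the paper): a multi-digit value is typically split into several tokens, so the model never handles the number as a single quantity.

```python
# Illustration: how a GPT-style tokenizer splits numeric text.
# Requires `pip install tiktoken`; not part of the paper's experiments.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")
for text in ["12345.678", "2024-01-01", "maximum"]:
    token_ids = enc.encode(text)
    pieces = [enc.decode([tid]) for tid in token_ids]
    print(f"{text!r} -> {pieces}")
```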

Table 2: Performances across all reasoning tasks.

Table 2 shows that GPT-4 is nearly perfect (1.00 Accuracy) at retrieving values and finding minimums. However, smaller models like Llama2-13B drop to 0.54 accuracy on retrieving a value on a specific date. This suggests that precise numerical reasoning is an “emergent property”—it only appears reliably in the most capable models.

Deep Dive: What Factors Break the Models?

The researchers went beyond simple accuracy scores to investigate why models fail. They identified three critical factors: formatting, length, and position bias.

Factor 1: Data Formatting

How do you feed a time series into ChatGPT? Do you use a CSV format? Do you put spaces between digits?

The researchers tested roughly 9 different textual formats. The results were counter-intuitive.

Table 4: Performance measured by accuracy for different time series formats.

As shown in Table 4:

  • Plain text (e.g., Date: 2020-01-01, Value: 100) often outperformed structured formats like CSV or JSON for retrieval tasks.
  • “Spaces” (inserting spaces between digits, proposed in previous literature to help tokenization) actually destroyed performance for models like Llama2 and Vicuna (Accuracy dropped to ~0.05).
  • Symbolic: Adding arrows (↑, ↓) to indicate direction helped significantly with Trend Classification, acting as a “hint” to the model. A few of these formats are sketched in code below.
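
The snippet below sketches a few of these format variants for one short series. The templates are illustrative approximations rather than the paper’s exact definitions.

```python
# Sketches of a few serialization formats (names and templates are
# illustrative approximations, not the paper's exact definitions).
values = [100, 102, 101, 105]
dates = ["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04"]

# "Plain": one labeled line per observation.
plain = "\n".join(f"Date: {d}, Value: {v}" for d, v in zip(dates, values))

# "CSV": a header row followed by comma-separated rows.
csv = "date,value\n" + "\n".join(f"{d},{v}" for d, v in zip(dates, values))

# "Spaces": digits separated by spaces, e.g. 102 -> "1 0 2".
spaces = " , ".join(" ".join(str(v)) for v in values)

# "Symbolic": append an arrow showing the direction of change.
arrows = [""] + ["\u2191" if b > a else "\u2193" for a, b in zip(values, values[1:])]
symbolic = ", ".join(f"{v}{arrow}" for v, arrow in zip(values, arrows))

print(plain, csv, spaces, symbolic, sep="\n\n")
```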

Factor 2: Time Series Length

LLMs have a context window (a limit on how much text they can process). But even within that window, performance degrades as complexity increases.

Figure 3: Retrieval performance for different time series lengths.

Figure 3 shows that as the number of data points increases (x-axis), the accuracy (y-axis) trends downward.

  • GPT-3.5 (Blue) and Phi-3 (Orange) are relatively robust. Their lines stay high.
  • Llama2 (Green) and Vicuna (Red) crash hard. Once the time series exceeds roughly 60 data points, Llama2’s ability to retrieve information plummets. This suggests smaller models lose “focus” over long sequences of numbers.

Factor 3: Position Bias

Does it matter where the answer is? If the maximum value is at the very end of the list, is the model more likely to find it?

The study found evidence of Position Bias (or Recency Bias). In smaller models particularly, performance varied depending on which “quadrant” of the time series contained the target information. The models were often better at identifying features or values that appeared later in the context window (more recently generated/read text). GPT-4, notably, was largely immune to this, maintaining consistent attention across the whole sequence.

Conclusion and Future Implications

This research paper serves as a reality check for the “AI for everything” hype train.

The Good: State-of-the-art LLMs (GPT-4) are excellent zero-shot analysts for basic tasks. They can reliably detect trends, seasonality, and perform arithmetic retrieval on short-to-medium length time series. They can generate accurate text descriptions of these charts.

The Bad: They struggle with the “quant” side of things. Concepts like stationarity, volatility clustering, and structural breaks—critical for financial risk modeling—are poorly understood by current LLMs. They simply lack the statistical intuition (or the ability to compute internal statistics) required to diagnose these features accurately via text.

The Implications for Students: If you are building an application that summarizes data:

  1. Use the biggest model available. Numerical reasoning degrades sharply as models get smaller.
  2. Pre-process your data. Don’t just dump raw numbers; formatting matters. Using “Plain” formats or enriching the data with symbolic hints (arrows, diffs) helps the model significantly.
  3. Don’t trust the AI for stats. Use Python/R for the heavy statistical lifting (calculating volatility, checking stationarity) and use the LLM to interpret those results, rather than asking the LLM to calculate them from raw data. A minimal sketch of this pattern follows below.
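
Here is a minimal sketch of that last pattern, assuming an OpenAI-style chat client and the statsmodels library: the statistics are computed in Python, and only the computed results are handed to the model for interpretation.

```python
# Pattern from point 3: compute the statistics in Python, then ask the LLM
# to interpret the *results* rather than the raw numbers.
# Assumes the `openai` client library and an OPENAI_API_KEY in the environment.
import numpy as np
from statsmodels.tsa.stattools import adfuller
from openai import OpenAI

rng = np.random.default_rng(1)
series = np.cumsum(rng.normal(0, 1, size=250))   # a random walk, for illustration

stats = {
    "mean": float(np.mean(series)),
    "std": float(np.std(series)),
    "adf_pvalue": float(adfuller(series)[1]),    # stationarity test
}

prompt = (
    "You are summarizing a time series for a business report.\n"
    f"Computed statistics: {stats}\n"
    "In two sentences, explain whether the series looks stationary and how volatile it is."
)

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",                              # any capable chat model
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```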

The future likely lies in multimodal models that can “see” the plot image rather than reading raw numbers, or systems that give LLMs access to code interpreters (like Python sandboxes) to perform the math before generating an opinion.