Introduction
We have all seen the magic of a great data storyteller. Think of Hans Rosling explaining global population growth with animated bubble charts, or a detailed investigative piece in The New York Times where the text weaves seamlessly through interactive visualizations. These narratives don’t just dump numbers on a page; they contextualize data, highlighting trends and causal relationships to deliver a clear message.
However, crafting these stories is incredibly difficult. It requires a rare intersection of skills: data analysis, graphic design, and narrative writing. For business analysts, journalists, and educators, the process of identifying insights (“story pieces”), designing the right charts, and writing the accompanying text is a time-consuming, mentally taxing bottleneck.
With the rise of Large Language Models (LLMs), the obvious question arises: Can we automate this? Can we hand an AI a spreadsheet and ask it to write a compelling, visually supported article?
While LLMs are great at writing text, they often struggle with the “multimodal” nature of data stories—coordinating accurate numbers, insightful text, and precise visualization specifications simultaneously. To solve this, researchers have introduced DATANARRATIVE, a new benchmark and a novel “Agentic Framework” designed to mimic the human editorial process.

As shown in Figure 1 above, a data story isn’t just a caption; it is a sequence of panels where the visualization and the text evolve together to make a point. In this post, we will deconstruct how the DATANARRATIVE paper tackles this challenge, moving from a simple prompt to a complex, multi-agent AI system.
The Background: Why is this Hard?
To understand the solution, we must first understand the limitations of current technology.
The Data Storytelling Gap
Data-driven storytelling combines visualizations (communicating patterns and outliers) with text (explaining context). Early attempts to automate this relied on rule-based systems. Tools like “DataShot” or “Calliope” could generate simple fact sheets, but they lacked the narrative flow that makes a story engaging. They were rigid and often missed the “big picture.”
The LLM Limitation
Modern LLMs (like GPT-4) are excellent at generating fluent text. However, when you ask an LLM to look at a complex data table and generate a full story with charts:
- Hallucination: It might invent numbers that aren’t in the table.
- Lack of Planning: It often dives into writing without structuring the narrative, leading to rambling text.
- Visual Disconnect: The text might describe a trend that the requested chart doesn’t actually show.
Furthermore, research in this specific niche has been stalled by a lack of high-quality training data. There hasn’t been a standard “benchmark” dataset of high-quality data stories for researchers to test their models against.
The Benchmark: Building DATANARRATIVE
Before building a model, the researchers needed data. They constructed the DATANARRATIVE corpus, a collection of 1,449 data stories sourced from three high-quality platforms:
- Pew Research Center: Known for deep, report-style journalism on social issues.
- Tableau Public: A hub for business intelligence and community visualizations.
- GapMinder: Educational data stories on global development.
This is not a simple collection of captions. As seen in the table below, these stories are semantically rich and diverse.

The dataset covers a wide range of topics. While Pew leans heavily into Politics & Policy, the Tableau and GapMinder subsets introduce variety in Economy, Education, and Health.

Reverse-Engineering the Data
One interesting engineering challenge the researchers faced was that the raw data tables for these stories weren’t always available—often, they only had the images of the charts. To build a training set where the AI learns to go from Table \(\rightarrow\) Story, they had to reverse-engineer the source material.
They utilized a Vision-Language Model (Gemini-1.0-pro-vision) to “read” the chart images and extract the underlying data tables. This allowed them to create a training pair: the input (data table) and the output (the human-written story).
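To get a concrete sense of this step, here is a minimal Python sketch using the Google Generative AI SDK. The prompt wording, the file name, and the `extract_table` helper are illustrative assumptions, not the authors’ actual pipeline.

```python
# Minimal sketch of chart-to-table extraction with a vision-language model.
# Prompt wording and output handling are illustrative assumptions.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")  # assumes an API key is available
model = genai.GenerativeModel("gemini-1.0-pro-vision")

def extract_table(chart_path: str) -> str:
    """Ask the VLM to read a chart image and return the underlying data as CSV text."""
    chart = Image.open(chart_path)
    prompt = (
        "Extract the underlying data table from this chart. "
        "Return it as CSV with a header row and no commentary."
    )
    response = model.generate_content([prompt, chart])
    return response.text  # raw CSV; a real pipeline would validate and parse this

csv_table = extract_table("pew_chart_panel1.png")  # hypothetical file name
print(csv_table)
```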

The Core Method: The Agentic Framework
This is the most significant contribution of the paper. The researchers found that simply asking an LLM (Direct Prompting) to “write a data story” produced mediocre results. The models would often get facts wrong or lose narrative focus.
To fix this, they proposed an LLM-Agent Framework. Inspired by how human writers work, they split the job into distinct stages and assigned the AI two complementary roles:
- The Generator (Actor): Creative, writes the content.
- The Evaluator (Critic): Analytical, checks for errors and logic.
The framework operates in a feedback loop across two main stages: Planning and Narration.

Let’s break down the workflow illustrated in Figure 2 above.
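Before walking through the stages, here is a minimal Python sketch of the Generator/Evaluator feedback loop that underlies both of them. The `llm_call` placeholder, the prompt wording, and the fixed round limit are assumptions for illustration, not the paper’s implementation.

```python
# Minimal sketch of the Generator (Actor) / Evaluator (Critic) feedback loop.
def llm_call(system_prompt: str, user_prompt: str) -> str:
    """Placeholder for a chat-completion request to GPT-4o, LLaMA-3, etc."""
    raise NotImplementedError

def generate(task: str, table: str, feedback: str = "") -> str:
    """Generator (Actor): produce or revise content for the given task."""
    prompt = f"Data table:\n{table}\n\nTask: {task}"
    if feedback:
        prompt += f"\n\nRevise your previous output using this feedback:\n{feedback}"
    return llm_call("You are the Generator: write the requested content.", prompt)

def critique(task: str, table: str, draft: str) -> str:
    """Evaluator (Critic): check the draft strictly against the table."""
    prompt = (
        f"Data table:\n{table}\n\nTask: {task}\n\nDraft:\n{draft}\n\n"
        "Check every claim against the table. Reply 'APPROVED' or list the errors."
    )
    return llm_call("You are the Evaluator: verify facts and logic.", prompt)

def actor_critic(task: str, table: str, max_rounds: int = 3) -> str:
    """Run the generate -> critique -> revise loop until approval or a round limit."""
    draft = generate(task, table)
    for _ in range(max_rounds):
        feedback = critique(task, table, draft)
        if feedback.strip().upper().startswith("APPROVED"):
            break
        draft = generate(task, table, feedback)
    return draft
```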
Stage 1: The Planning Stage
You wouldn’t write a research paper without an outline; an AI shouldn’t write a data story without one either.
Step A: Reflection (Understanding the Data)
Before writing a single sentence of the story, the Generator Agent is asked to generate a “Reflection.” This is a bulleted list of insights, trends, and outliers found in the data table.
- The Critic’s Role: The Evaluator Agent looks at this reflection and compares it strictly against the data table. If the Generator claims “sales doubled,” but the table shows a 10% increase, the Evaluator flags it. The Generator must then revise the reflection.
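Continuing the sketch above (reusing `actor_critic` and the extracted `csv_table`), the Reflection step is just one instantiation of that loop; the task prompt below is an assumed paraphrase, not the authors’ exact instruction.

```python
# Reflection step expressed with the loop sketched earlier.
reflection_task = (
    "List the key insights, trends, and outliers in the table as bullet points. "
    "Quote exact numbers from the table to support every claim."
)
reflection = actor_critic(reflection_task, table=csv_table)
```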
Step B: Outline Generation
Once the data is understood, the Generator creates a narrative outline. It decides the “Beginning” (Intro), “Middle” (Analysis), and “End” (Conclusion). It also determines where charts should appear to support the text.
- The Critic’s Role: The Evaluator checks if the outline follows the user’s intent and if the flow is logical. It ensures the story has a coherent structure before the heavy lifting of writing begins.
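The paper does not prescribe a particular outline schema, but a structure along these lines, where each panel pairs a narrative role with its supporting chart, conveys what the Planning stage has to produce. The dictionary below is purely illustrative.

```python
# One plausible outline structure (illustrative schema, not the paper's):
# each panel pairs a narrative role with the chart meant to support it.
outline = {
    "title": "How attitudes shifted over the last decade",
    "panels": [
        {"role": "beginning", "point": "Introduce the overall trend", "chart": "line"},
        {"role": "middle", "point": "Compare subgroups where the trend diverges", "chart": "bar"},
        {"role": "end", "point": "Summarize the divide and what it implies", "chart": None},
    ],
}
```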
Stage 2: The Narration Stage
Now that the AI has a verified plan, it moves to execution.
Step C: Narration
The Generator writes the full text of the story, paragraph by paragraph. Crucially, it also generates Visualization Specifications (code that defines what the chart looks like, e.g., JSON parameters for a bar chart).
- The Critic’s Role: The Evaluator performs a final sweep. It checks:
  - Factual Consistency: Does the text match the numbers in the table?
  - Chart Accuracy: Do the visualization specs actually visualize the correct data?
If errors are found, the Critic issues a “Revision Plan,” and the Generator rewrites the specific sections. This iterative “Write \(\rightarrow\) Critique \(\rightarrow\) Revise” loop significantly reduces hallucinations.
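To make “visualization specification” concrete, here is what a Vega-Lite-style bar-chart spec might look like, written as a Python dict. The field names and values are placeholders rather than data from the paper, but they show exactly what the Evaluator has to cross-check against the source table.

```python
# A Vega-Lite-style bar-chart spec expressed as a Python dict. Field names and
# values are placeholders; the Evaluator's job is to verify them against the table.
bar_spec = {
    "mark": "bar",
    "data": {"values": [
        {"category": "Group A", "value": 42},   # placeholder rows
        {"category": "Group B", "value": 58},
    ]},
    "encoding": {
        "x": {"field": "category", "type": "nominal"},
        "y": {"field": "value", "type": "quantitative", "title": "Share of respondents (%)"},
    },
}
```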
Experiments and Results
Does this complex agent setup actually work better than just asking GPT-4 to write the story in one go? The researchers conducted extensive experiments to find out.
They compared the Agentic Framework against a Direct Prompting baseline across three different models: GPT-4o, LLaMA-3-70b, and LLaMA-3-8b.
Automatic Evaluation
Using an LLM-as-a-judge (Gemini-1.5-pro), they performed pairwise comparisons. The judge looked at two stories (one from the Agent, one from Direct prompting) and decided which was better based on informativeness, coherence, and accuracy.

As shown in Table 4, the difference is stark. For GPT-4o, the Agentic framework won nearly 78% of the time. This confirms that even the most powerful models benefit significantly from the “Plan-and-Critique” structure.
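For intuition, a pairwise judgment like this can be sketched as a single prompt (reusing the `llm_call` placeholder from the earlier sketch). The rubric wording and the order swap are assumptions; randomizing presentation order is a standard way to reduce position bias in LLM judges.

```python
# Sketch of a pairwise LLM-as-a-judge comparison (rubric wording is assumed).
import random

def judge(story_a: str, story_b: str, table: str) -> str:
    """Return 'A' or 'B' for whichever story the judge model prefers."""
    first, second, flipped = story_a, story_b, False
    if random.random() < 0.5:  # randomize presentation order to reduce position bias
        first, second, flipped = story_b, story_a, True
    prompt = (
        f"Data table:\n{table}\n\nStory 1:\n{first}\n\nStory 2:\n{second}\n\n"
        "Which story is more informative, coherent, and factually accurate? "
        "Answer with '1' or '2' only."
    )
    verdict = llm_call("You are an impartial judge of data stories.", prompt).strip()
    if verdict.startswith("1"):
        return "B" if flipped else "A"
    return "A" if flipped else "B"
```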
Human Evaluation
Metrics are useful, but human judgment is the gold standard for storytelling. The authors recruited human evaluators to rate the stories on attributes like “Clarity,” “Visualization Quality,” and “Factual Correctness.”

The human evaluation results (Table 5) mirror the automated findings. The Agentic framework achieved a 74-75% win rate on Informativeness and Narrative Quality. The gap in “Visualization Quality” (59% win rate) was smaller but still favored the agents.
What does a generated story look like?
Below is an example of a success story generated by GPT-4o using the framework. The model successfully integrated multiple charts (Line charts for trends, Bar charts for comparisons) and wrote coherent text explaining the political divides shown in the data.

Challenges and Limitations
While the Agentic framework is a major step forward, the paper provides an honest look at where AI still stumbles. Data storytelling is unforgiving; a single wrong number can ruin credibility.
Hallucinations and Factual Errors
Even with a Critic agent, errors slip through. In the example below, the model generated a story about voter enthusiasm. However, the text claimed a value of 42%, while the underlying table showed 59%. It also hallucinated a peak date that didn’t match the data.

The “Small Model” Problem
The framework works wonders with giant models like GPT-4o, but smaller open-source models (like LLaMA-3-8b) struggle. They tend to lose the plot. In the example below, the LLaMA model creates a “Coherence” issue. Panel 3 discusses statistics unrelated to the header, and Panel 4 simply repeats the text from Panel 3 verbatim.

This suggests that while the architecture (Agents) is sound, the engine (the LLM) needs a minimum level of reasoning capability to handle the complexity of planning and self-correction.
Conclusion
The DATANARRATIVE paper bridges a significant gap in automated content creation. By treating data storytelling not as a simple generation task, but as a multi-step planning and editing process, the authors demonstrated that AI can produce high-quality, multimodal narratives.
The key takeaways are:
- Planning is Essential: LLMs cannot simply “wing it” with complex data. They need a reflection and outlining phase.
- Critics are Necessary: A dedicated “Evaluator” agent acts as a safety net, catching factual errors that a single generator would miss.
- Multimodal is Hard: Coordinating text and visualization specifications is far more difficult than generating text alone, but agentic workflows make it viable.
This research opens the door to powerful human-in-the-loop tools. Imagine a “Co-pilot for Analysts” where the AI drafts the initial reflection and outline, and the human expert refines the narrative. As models become more capable and less prone to hallucination, the days of staring at a blank spreadsheet wondering how to tell its story might soon be over.