Large Language Models (LLMs) have mastered the art of text. They can write poetry, summarize legal documents, and even debug code. However, when you ask an LLM to visualize complex, real-world data—specifically, to write code that generates sophisticated plots—the results often fall short.
While models like GPT-4 are competent at creating basic bar charts or line graphs, they frequently struggle with more intricate visualizations like 3D surface plots, volumetric data, or vector fields. Furthermore, existing datasets for training these models are often limited to Question Answering (QA) tasks rather than generation, or they lack the diversity of plot types needed for scientific and professional applications.
In this post, we will dive into Text2Chart31, a research paper that tackles these limitations head-on. We will explore how the researchers created a massive dataset of 31 unique plot types and developed a novel Reinforcement Learning (RL) technique that fine-tunes models without requiring expensive human feedback.

As shown in Figure 1, the paper moves beyond simple pairwise plots (like bar charts) to tackle complex “Gridded” and “3D & Volumetric” visualizations. It introduces a pipeline that aligns natural language descriptions, code, and visual outputs through a clever feedback loop.
The Problem with Current Chart Generation
To understand the contribution of this paper, we first need to look at the gaps in the current landscape.
- Limited Datasets: Most existing datasets (like PlotQA or ChartQA) are designed for visual question answering (e.g., “What is the highest value in this bar chart?”). They aren’t optimized for generating charts from text. Furthermore, they are heavily biased toward common chart types. If you need a model to generate a “3D Triangular Surface Plot,” standard datasets won’t help.
- Supervised Fine-Tuning (SFT) Limits: Standard fine-tuning shows the model a description alongside the correct code and trains it to minimize the difference between its output and that reference. However, chart generation involves multiple components: the text description, the code, the data table, and the resulting visual. Standard SFT often fails to capture the intricate relationships between these modalities.
Contribution 1: The Text2Chart31 Dataset
The researchers introduced Text2Chart31, a dataset designed specifically to address the scarcity of complex plot types.
Unlike previous collections, Text2Chart31 focuses on the Matplotlib library, a standard tool in the Python data science stack. The dataset contains 11.1K data tuples. Each tuple isn’t just a question-answer pair; it is a comprehensive package (see the sketch after this list) containing:
- Description (\(x\)): A natural language explanation of what the chart shows.
- Code (\(c\)): The Python code to generate the chart.
- Data Table (\(d\)): The raw data used.
- Reasoning Steps (\(r\)): The logical steps taken to choose the plot type.
- Plot (\(y\)): The final visual output.
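To make this structure concrete, here is a minimal sketch of how a single tuple could be represented in Python. The class and field names are illustrative assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass

@dataclass
class ChartTuple:
    """Illustrative container for one Text2Chart31 data point (names are assumed)."""
    description: str      # x: natural language description of the chart
    code: str             # c: Matplotlib code that renders the chart
    data_table: str       # d: raw data, e.g. serialized as CSV
    reasoning_steps: str  # r: rationale for choosing this plot type
    plot_path: str        # y: path to the rendered image

example = ChartTuple(
    description="A 3D surface plot of terrain elevation over a 50x50 grid.",
    code="import matplotlib.pyplot as plt\n# ... plotting code ...",
    data_table="x,y,elevation\n0,0,12.3\n...",
    reasoning_steps="Continuous values over two spatial dimensions suggest a 3D surface.",
    plot_path="plots/terrain_surface.png",
)
```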
A Library of 31 Plot Types
The “31” in the name refers to the distinct plot types covered. The diversity here is significant. As seen in the image below, the dataset spans five major categories, ranging from standard Pairwise Charts (Bar, Line) to complex 3D & Volumetric Charts (3D Surface, Voxels) and Irregularly Gridded Charts.

To prevent the dataset from regurgitating the same topics (e.g., “Sales over Time”), the authors used a topic generation engine to enforce semantic diversity. The distribution of keywords in the dataset covers a wide array of subjects, from environmental science to economics.

Contribution 2: The Hierarchical Generation Pipeline
Collecting high-quality data for 3D and volumetric charts is difficult because these charts rarely appear in scraped web data in a structured format (text + code + data). To solve this, the authors built a Hierarchical Pipeline to synthesize the dataset using GPT-3.5-turbo and GPT-4.
This was not a simple “ask GPT-4 to write code” process. It involved a rigorous, multi-step workflow designed to minimize hallucinations and ensure that the generated code actually executes.

Here is the breakdown of the pipeline shown in Figure 2 (a code sketch follows the list):
- Topic Generation: A topic is selected from a diverse pool to avoid repetition.
- Description Generation: GPT-3.5 creates a plot description based on seed examples.
- Self-Evaluation: GPT-4 acts as a critic, checking if the description is logical and compatible with the requested plot type.
- Code & Data Generation: GPT-4 generates the Python code and the corresponding data table.
- Cycle Consistency Verification: This is the most critical quality control step.
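A rough Python sketch of this loop is below. The callable arguments stand in for the LLM calls described above; this is not the paper's released code, just an outline of the control flow.

```python
def generate_one_sample(plot_type, topic_fn, describe_fn, critique_fn, synth_fn, verify_fn,
                        max_attempts=5):
    """Outline of the hierarchical pipeline for one data point (hypothetical helpers).

    topic_fn     -- step 1: pick a topic from a diverse pool
    describe_fn  -- step 2: GPT-3.5 drafts a plot description
    critique_fn  -- step 3: GPT-4 checks description/plot-type compatibility
    synth_fn     -- step 4: GPT-4 writes the code and data table
    verify_fn    -- step 5: cycle-consistency verification
    """
    for _ in range(max_attempts):
        topic = topic_fn(plot_type)
        description = describe_fn(topic, plot_type)
        if not critique_fn(description, plot_type):
            continue  # failed self-evaluation; try a new description
        code, data = synth_fn(description)
        if verify_fn(description, code):
            return {"description": description, "code": code, "data": data}
    return None  # no attempt survived verification; the sample is discarded
```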
The Power of Cycle Consistency
How do you verify that a generated chart is good without a human looking at it? The authors used Cycle Consistency.
They took the generated Code, used it to create a plot, and then asked an LLM to describe that plot back into text. If the regenerated description matched the original description, the data point was deemed high quality. If they diverged, it was discarded.
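As a minimal illustration, the check below compares the two descriptions with a simple string-similarity ratio; the paper's comparison is LLM-based, so difflib and the 0.6 threshold here are stand-in assumptions.

```python
import difflib

def passes_cycle_consistency(original_desc: str, regenerated_desc: str,
                             threshold: float = 0.6) -> bool:
    """Keep a sample only if the description recovered from the generated plot
    is close enough to the original description (crude stand-in for the
    paper's LLM-based comparison)."""
    similarity = difflib.SequenceMatcher(
        None, original_desc.lower(), regenerated_desc.lower()).ratio()
    return similarity >= threshold

# A 2D-histogram request answered with bar-chart code regenerates a description
# about a bar chart, so the similarity drops and the sample is discarded.
print(passes_cycle_consistency(
    "A 2D histogram of temperature versus humidity readings.",
    "A bar chart comparing average temperature across five cities."))
```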
Here is an example of a failed verification. The description asked for a “2D histogram,” but the code produced a simple bar chart. The system caught this mismatch because the regenerated description didn’t match the original intent.

Conversely, here is a successful verification. The description asked for a “3D scatter plot” with specific data attributes. The code produced the correct visualization, and the regenerated description accurately reflected the original request.

Contribution 3: RL-Based Instruction Tuning
Once the dataset was created, the researchers needed to train a model to use it. They employed a two-stage training process:
- Supervised Fine-Tuning (SFT): The baseline training where the model learns to predict the ground truth code from the description.
- Reinforcement Learning (RL) with Automatic Feedback: The advanced stage where the model refines its policy.
Crucially, this RL phase does not require human feedback (i.e., it is not RLHF). Instead, it uses the dataset and the intrinsic properties of the task to generate rewards.
Reward 1: Preference Reward
The first reward function focuses on code correctness. The researchers constructed a “preference dataset” by pairing code generated by their SFT model (\(c^-\)) with the ground truth code (\(c^+\)) from the dataset, treating the latter as the preferred sample.
They trained a reward model to prefer the ground truth code. The policy network (\(\pi_{\theta_1}\)) is then optimized using PPO (Proximal Policy Optimization) to maximize this reward.

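The paper's exact objective appears in an equation figure that is not reproduced here; in standard notation, a Bradley-Terry-style reward-model loss and a KL-regularized policy objective take roughly this form (the precise formulation in the paper may differ):

\[
\mathcal{L}_{\text{RM}}(\phi) = -\,\mathbb{E}_{(x,\, c^+,\, c^-)} \Big[ \log \sigma\big( r_\phi(x, c^+) - r_\phi(x, c^-) \big) \Big]
\]

\[
\max_{\theta_1}\; \mathbb{E}_{x \sim \mathcal{D},\; \hat{c} \sim \pi_{\theta_1}(\cdot \mid x)} \big[ r_\phi(x, \hat{c}) \big] \;-\; \beta\, \mathbb{D}_{\text{KL}}\!\big( \pi_{\theta_1}(\cdot \mid x) \,\big\|\, \pi_{\text{SFT}}(\cdot \mid x) \big)
\]

Here \(r_\phi\) is the learned preference reward model, \(\sigma\) is the sigmoid, and \(\beta\) controls how far the policy may drift from the SFT model.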
Optimizing this objective shifts the model’s distribution toward generating code that is statistically similar to the correct, executable ground truth code.
Reward 2: Alignment Reward
The second reward leverages the Cycle Consistency concept seen in the data generation phase. The goal is to ensure that the code generated by the model produces a visual that aligns with the user’s text description.
The process works like a loop (a code sketch follows the list):
- Model takes a description (\(x\)) \(\rightarrow\) Generates Code (\(\hat{c}\)).
- A secondary model takes Code (\(\hat{c}\)) \(\rightarrow\) Regenerates Description (\(\hat{x}\)).
- The Alignment Reward compares the similarity between the original description (\(x\)) and the regenerated description (\(\hat{x}\)) using BERTScore.
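A minimal sketch of this reward using the bert_score package is shown below; treating the raw F1 as the reward, rather than some scaled or clipped variant, is an assumption.

```python
from bert_score import score  # pip install bert-score

def alignment_reward(original_desc: str, regenerated_desc: str) -> float:
    """Alignment reward: BERTScore F1 between the user's description (x)
    and the description regenerated from the model's code (x-hat)."""
    # bert_score expects lists of candidate and reference sentences.
    P, R, F1 = score([regenerated_desc], [original_desc], lang="en", verbose=False)
    return F1.item()

# Example usage (sketch): a code sample whose plot, once described again,
# still mentions the same plot type and variables earns a high reward.
r = alignment_reward(
    "A 3D scatter plot of height, weight, and age for 200 patients.",
    "A 3D scatter plot showing patients' height, weight, and age.")
```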

By maximizing this reward, the model learns to generate code that captures all the semantic details of the prompt, ensuring nothing is “lost in translation” when the code is executed.
Experiments and Results
The researchers compared their fine-tuned models (based on Llama 3 Instruct-8B and Code Llama) against massive proprietary models like GPT-4, GPT-4o, and Claude 3 Opus.
Task 1: Description-to-Chart
The primary task was generating code from a text description. The metric “Error Ratio” measures how often the generated code fails to run or produces the wrong plot type.
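As a rough illustration of how such a metric can be computed, the sketch below executes each generated script in a subprocess and counts failures; detecting the wrong plot type would need an additional check (e.g., the cycle-consistency comparison above), which is omitted here.

```python
import os
import subprocess
import sys
import tempfile

def execution_error_ratio(generated_scripts, timeout=60):
    """Fraction of generated Matplotlib scripts that fail to run (sketch only;
    the paper's Error Ratio also counts plots of the wrong type as errors)."""
    failures = 0
    for code in generated_scripts:
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            # Force a non-interactive backend so the scripts run headlessly.
            f.write("import matplotlib\nmatplotlib.use('Agg')\n" + code)
            path = f.name
        try:
            result = subprocess.run([sys.executable, path],
                                    capture_output=True, timeout=timeout)
            failures += int(result.returncode != 0)
        except subprocess.TimeoutExpired:
            failures += 1
        finally:
            os.remove(path)
    return failures / len(generated_scripts)
```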

Key Takeaway: Look at the [SFT+RLpref] L3I-8B row in Table 2. This represents the Llama 3 8B model fine-tuned with their method.
- It achieved a Total Error Ratio of 14.55%.
- It outperforms Claude 3 Opus (14.90%) and GPT-3.5-turbo (18.62%), and is comparable to GPT-4-turbo (14.27%).
- This is remarkable because Llama 3 8B is significantly smaller than these proprietary models.
The improvement is most visible in complex categories like 3D & Volumetric, where open-source baselines usually fail.
Task 2: Raw Data-to-Chart
In this task, the model is given a raw data table and must (1) reason about the data to find the best plot type and (2) generate the description and code.

As shown in Table 3, the fine-tuned Llama 3 (SFT L3I-8B) achieved a Hit Rate of 0.413, meaning it correctly identified the most suitable plot type 41.3% of the time, significantly outperforming GPT-4 (0.286) and Claude 3 Opus (0.294) on this specific benchmark.
Human Evaluation
Metrics like BLEU or Error Ratio are useful, but do the charts actually look good? The researchers conducted a human evaluation where annotators compared the plots generated by different models against a reference.

Figure 3 shows that the authors’ fine-tuned Code Llama 13B model ([SFT] CLI-13B) achieved a 47.7% Win Rate against the base Llama 3 model, with very few losses. Even against GPT-3.5-turbo, the fine-tuned models held their ground, winning or tying the majority of the time.
Conclusion
The Text2Chart31 paper presents a significant step forward in automated data visualization. By moving beyond simple QA datasets and embracing a hierarchical generation pipeline, the authors created a resource that covers the “long tail” of complex scientific plotting.
More importantly, their RL-based instruction tuning demonstrates that we don’t always need human feedback to improve LLMs. By using Cycle Consistency—checking if the output can be translated back to the input—we can create automated feedback loops that allow smaller, open-source models to punch above their weight class, rivaling state-of-the-art proprietary systems in specialized tasks.
For students and practitioners, this implies that specialized fine-tuning, backed by high-quality synthetic data, is a viable path to building powerful tools for data analysis and visualization.