Introduction
If you’ve ever sat in a geometry class, you know that the text of a problem is often useless without the diagram next to it. “Find the length of side \(AC\)” means nothing if you can’t see the triangle. This reliance on visual aids makes geometry one of the most challenging frontiers for Artificial Intelligence.
While Large Language Models (LLMs) have become incredibly adept at solving text-based math word problems, they hit a wall when genuine visual reasoning is required. Even state-of-the-art Multi-modal Large Language Models (MLLMs)—models that can “see” images and read text—often struggle to match human performance in geometry. They might misinterpret a diagram or fail to connect the angles shown in the figure to the numerical values in the text.
The primary bottleneck isn’t necessarily the model architecture; it’s the data. Existing datasets are often either too difficult (taken straight from complex high school textbooks) or “misaligned” (where the text describes a shape that doesn’t perfectly match the image due to poor data augmentation).
In this deep dive, we are looking at a fascinating paper titled “GeoGPT4V: Towards Geometric Multi-modal Large Language Models with Geometric Image Generation.” The researchers introduce a clever pipeline to generate high-quality, “aligned” geometry problems. By simplifying complex problems and using code to generate precise diagrams, they created a dataset that significantly boosts the geometric reasoning capabilities of AI models.
The Problem with Current Geometry AI
To understand why this paper is significant, we first need to understand the current landscape of AI mathematics.
The Visual Gap
Research has shown that human accuracy drops significantly when solving geometry problems if the visual aid is removed. We need the diagram to ground our logic. For an AI to solve these problems, it needs strong “visual perception”—the ability to identify that a line is a tangent or that a triangle is isosceles based on the image.
Current top-tier models like GPT-4V and Gemini have made strides, but open-source models (like LLaVA or ShareGPT4V) often lag behind.
The Data Dilemma
To make these models smarter, we need to train them. However, researchers face a “Goldilocks” problem with available data:
- Too Hard: Open-source datasets often consist of problems extracted from exams or textbooks. These require complex reasoning chains that models haven’t yet learned the basics for. It’s like trying to teach a student calculus before they know algebra.
- Too Broken: To create more data, researchers often use “data augmentation.” A common technique is taking a problem and asking an LLM (like ChatGPT) to change the numbers in the text. However, if you change “Side A is 5” to “Side A is 10” in the text, but you don’t update the image, the image and text are now misaligned. This confuses the model and hurts learning.
The authors of GeoGPT4V realized that to improve performance, they needed a dataset that was easier (curriculum learning) and perfectly aligned (text matches image).
The GeoGPT4V Method
The core contribution of this paper is a novel pipeline designed to generate this “perfect” training data. Instead of relying on manual annotation or risky text-only augmentation, they devised a four-step automated process.

As shown in Figure 1 above, the pipeline moves from existing hard data to new, simplified, and visually accurate data. Let’s break down each step of this workflow.
Step 1: Question-Answer Generation (Simplification)
The process starts with an existing dataset of geometry problems (denoted as “QA Example from Geometry3K” in the figure). The goal here is Curriculum Learning—the idea that models learn better if they start with simple concepts before moving to complex ones.
The researchers use GPT-4V to act as a “teacher.” They feed the complex problem to the model and instruct it to create a simplified version. The instructions prompt the model to:
- Create lead-up problems (stepping stones).
- Create sub-problems.
- Incorporate the final answer into the question conditions to reduce complexity.
For example, if the original problem asks for a complex calculation involving area and height, the simplified version might just ask to calculate the area given the base and height explicitly. This helps the model grasp basic geometric concepts.
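To make this concrete, here is a minimal sketch of what such a simplification call could look like using the OpenAI Python SDK. The prompt wording, model name, and function shape are illustrative assumptions on my part, not the authors’ exact implementation.

```python
import base64
from openai import OpenAI  # assumes the OpenAI Python SDK is installed

client = OpenAI()

SIMPLIFY_PROMPT = (
    "You are a geometry teacher. Given the problem and its diagram, "
    "write ONE simpler lead-up or sub-problem. You may move the final "
    "answer into the conditions to reduce complexity. "
    "Return the new question and its answer."
)

def simplify_problem(question: str, answer: str, image_path: str) -> str:
    """Ask a vision model for an easier variant of a geometry problem.
    (Illustrative sketch; the paper's actual prompt and parsing differ.)"""
    image_b64 = base64.b64encode(open(image_path, "rb").read()).decode()
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # placeholder; the paper uses GPT-4V
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"{SIMPLIFY_PROMPT}\n\nQuestion: {question}\nAnswer: {answer}"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```

The key point is simply that the original question, its answer, and the diagram all go into the prompt, and the model returns an easier variant.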
Step 2: Geometric Image Generation
This is the most innovative part of the pipeline. Once the new, simplified text question is created, the old image is no longer accurate. Using a generative image model (like DALL-E or Midjourney) is risky because those models struggle with the precise mathematical constraints of geometry (e.g., ensuring a specific angle is exactly 30 degrees).
Instead, the researchers use a code-based approach.
They feed the new simplified question to GPT-4 and ask it to generate Wolfram (Mathematica) code. Wolfram is a computational language that is excellent for plotting mathematical graphs and shapes.
- By generating code, the output is mathematically precise.
- If the code says `Triangle[{{0,0}, {8,0}, {4, 13}}]`, the resulting image will mathematically correspond to a base of 8 and a height of 13.
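As a rough illustration (my own sketch, not the paper’s actual prompts or tooling), the generated Wolfram code for that triangle might look like the string below, executed here with the `wolframscript` command-line tool; the filename and invocation are assumptions.

```python
import subprocess

# What a generated Wolfram snippet might look like for the triangle example:
# base 8 along the x-axis, apex at height 13, with labels for the given lengths.
wolfram_code = """
Export["candidate.png",
  Graphics[{
    Triangle[{{0, 0}, {8, 0}, {4, 13}}],
    Text["8", {4, -0.7}],
    Text["13", {4.7, 6.5}]
  }]]
"""

# Run the code with the Wolfram Engine (assumes `wolframscript` is installed).
subprocess.run(["wolframscript", "-code", wolfram_code], check=True)
```

Because the vertices are explicit coordinates, the rendered figure cannot drift away from the numbers stated in the question.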
Step 3: Execution and Diversity
Generating code with LLMs can be hit-or-miss. Sometimes the code has syntax errors, or it draws the shape off-canvas.
To mitigate this, the pipeline generates \(K\) different versions of the code (where \(K=3\)). They execute all of these code snippets to produce \(K\) distinct candidate images. This increases the probability that at least one of the images will be perfect.
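In code, this retry-through-diversity step could look something like the sketch below, where `generate_code` and `render_image` are hypothetical callables wrapping the GPT-4 request and the Wolfram execution from the previous step.

```python
from pathlib import Path
from typing import Callable

K = 3  # number of code candidates per question, as in the paper

def collect_candidate_images(
    question: str,
    generate_code: Callable[[str], str],       # e.g. the GPT-4 call from Step 2
    render_image: Callable[[str, str], Path],  # e.g. the wolframscript execution
) -> list[Path]:
    """Generate K code candidates and keep only the images that render cleanly."""
    images: list[Path] = []
    for i in range(K):
        code = generate_code(question)  # ask the LLM for Wolfram plotting code
        try:
            images.append(render_image(code, f"candidate_{i}.png"))
        except Exception:
            # Syntax errors or failed exports simply mean this candidate is skipped.
            continue
    return images
```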
Step 4: Scoring and Filtering
Now the system has a simplified question and several candidate images. Which image is best?
The researchers bring GPT-4V back into the loop as a “grader.” The model evaluates the alignment between the generated image and the text description. It assigns a score (0 to 1) based on how well the image depicts the problem.
- Is the triangle actually equilateral?
- Are the labels legible?
- Does it match the numbers in the question?
The system selects the image with the highest score. If the best score is below a threshold (0.9), the data point is discarded to ensure high quality.
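The selection logic itself is simple enough to express in a few lines. In this sketch, `score_alignment` stands in for the GPT-4V grading prompt (not reproduced here), and 0.9 is the threshold mentioned above.

```python
from pathlib import Path
from typing import Callable, Optional

SCORE_THRESHOLD = 0.9  # below this, the whole sample is discarded

def select_best_image(
    question: str,
    candidates: list[Path],
    score_alignment: Callable[[str, Path], float],  # GPT-4V grader, returns 0..1
) -> Optional[Path]:
    """Keep the best-aligned candidate image, or drop the sample entirely."""
    if not candidates:
        return None
    scored = [(score_alignment(question, img), img) for img in candidates]
    best_score, best_img = max(scored, key=lambda pair: pair[0])
    return best_img if best_score >= SCORE_THRESHOLD else None
```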
The result is the GeoGPT4V Dataset: A collection of 4.9K newly generated, simplified, and perfectly aligned geometry problems, combined with 19K existing open-source problems.
Analysis: Is the Data Actually Better?
Before training models, the authors verified that their pipeline actually achieved its goals: making problems easier and ensuring better image-text alignment.

Figure 2(a) (the donut chart) confirms the difficulty adjustment. When GPT-4V compared the original problems to the generated ones, it found that 41% of the generated problems were easier, and 44% were of equal difficulty. This confirms that the dataset successfully introduces a “curriculum” of simpler problems.
Figure 2(b) (the bar chart) shows the crucial improvement in alignment.
- G-LLaVA: This represents the previous method of just rewriting text. The alignment score is low (0.6754) because the images weren’t updated.
- Generated Images: The GeoGPT4V method achieves a massive jump in alignment score to 0.9636.
This proves that using code generation to create new images is superior to reusing old images with modified text.
Experiments and Results
The researchers trained several open-source models (LLaVA-1.5, ShareGPT4V, and InternVL) using their new dataset. They tested these models on two major benchmarks: MathVista and MathVision.
The results were impressive.

Table 2 provides a comprehensive look at the performance. Here are the key takeaways:
- Consistent Improvement: Look at the rows ending in “-G” (e.g., LLaVA-1.5-G). These are the models trained with the GeoGPT4V dataset. In almost every single metric, the “-G” models outperform their standard counterparts.
- Significant Gains: For LLaVA-1.5-7B, the Geometry Problem Solving (GPS) score jumped from 20.67% to 32.69%, a relative improvement of roughly 58%. ShareGPT4V-13B saw its GPS score rise from 27.4% to 43.27%.
- Closing the Gap: The models trained on GeoGPT4V started to close the gap with proprietary giants. For example, InternVL-G (40B parameters) achieved a score of 64.42% on MathVista GPS, outperforming GPT-4V (50.5%) and Gemini-1.0-Ultra (56.2%).
Why Did It Work? (Ablation Studies)
Science is about knowing why something works. The authors performed ablation studies to isolate the factors contributing to this success.

Table 3 answers two critical questions:
Q1: Was generating new images necessary? The row “- Image Generation” shows what happens if they used the simplified text but kept the old, original images. The score drops from 32.69 (GeoGPT4V) to 30.77. This confirms that misalignment hurts performance and generating fresh images is crucial.
Q2: Was the scoring/filtering step necessary? The row “- Image Scoring” shows what happens if they just picked a random image from the generated batch instead of using GPT-4V to score them. The score drops, confirming that the quality control step adds value.
Is the Generated Data Better than Open Source?
One might wonder if the improvement just came from adding more data, regardless of its quality. To test this, the authors compared mixing their generated data versus just using open-source data.

Table 5 reveals the answer.
- Base: Training only on open-source data.
- Mix: Mixing open-source with GeoGPT4V data.
The “Mix” strategy yielded the highest results (33.52 on MathVista GPS vs 29.33 for Base). This implies that the improvement isn’t just about volume; the nature of the GeoGPT4V data (simple and aligned) helps the model learn features that open-source data alone cannot teach.
Conclusion
The work presented in “GeoGPT4V” highlights a critical lesson in the development of Multi-modal LLMs: Data quality is king.
By acknowledging that existing datasets were too difficult and often misaligned, the authors built a pipeline that mirrors good pedagogy. They acted as teachers simplifying complex subjects and as technical illustrators ensuring diagrams matched the descriptions perfectly.
The use of Wolfram code generation to bridge the gap between text and visual geometry is particularly clever. It bypasses the “hallucination” problems of standard image generators, ensuring that if the text says a triangle has a height of 10, the pixels in the image reflect exactly that.
For students and researchers in AI, this paper serves as a proof of concept that we don’t always need bigger models to solve hard problems. Sometimes, we just need to be smarter about how we generate the data we feed them. The GeoGPT4V dataset and the models trained on it represent a significant step forward in making AI capable of seeing and understanding the geometric world.