Introduction
“Drawing is not what one sees but what one can make others see.” — Edgar Degas.
In scientific research, software engineering, and education, a picture is worth a thousand words, but only if that picture is accurate. While we have witnessed a revolution in generative AI with tools like Midjourney and DALL-E, a glaring gap remains in their capabilities: structured, logical diagrams.
Ask a standard image generator to create a “neural network architecture with three layers,” and you will likely get a beautiful, artistic hallucination. The connections might go nowhere, the text will be illegible gibberish, and the logical flow will be nonexistent. On the other hand, asking a coding assistant to “write code for a plot” works for simple bar charts but often fails when the visual requirements become complex or unique, such as specific flowchart logic or intricate mind maps.
The core problem is a tension between visual realism and structural logic: models that draw pixels struggle to encode relationships, while models that write code struggle to plan layouts.
To bridge this gap, a research team from the University of Chinese Academy of Sciences and Westlake University has introduced DiagramAgent, a comprehensive framework designed specifically for Text-to-Diagram generation and editing. Alongside this model, they released DiagramGenBenchmark, a dataset tailored to evaluate how well AI can handle the strict logic of structured visuals.

As shown in Figure 1 above, the difference is stark. While text-to-image methods produce “hallucinated” visuals and standard text-to-code methods struggle with logical structure, DiagramAgent leverages a multi-agent approach to produce precise, editable, and colorful diagrams defined by code.
In this post, we will deconstruct how DiagramAgent works, the new benchmark used to test it, and why this represents a significant step forward for automated technical visualization.
The Challenge of Structured Visuals
Before diving into the solution, we must understand why generating diagrams is harder than generating a picture of a sunset.
The Limitations of Pixels
Standard image generation models (Diffusion models, GANs) operate in pixel space. They learn statistical correlations between text and pixel arrangements. However, diagrams are not defined by textures or lighting; they are defined by relationships and hierarchy. A flowchart arrow must connect Box A to Box B precisely. If a model generates pixels, it doesn’t “know” that an arrow represents a logical flow; it just knows that black lines often appear near rectangles. This leads to the “spaghetti code” of visual diagrams—messy, incoherent, and uneditable.
The Rigidness of Code
Alternatively, we can use Large Language Models (LLMs) to generate code (like Python’s Matplotlib or Graphviz). This ensures the lines are straight and the text is readable. However, standard LLMs often lack the “visual imagination” required to plan a complex layout. They might generate syntactically correct code that produces a cluttered, overlapping mess. Furthermore, editing these diagrams via text prompts is notoriously difficult because the model often struggles to map a visual change (e.g., “move the red node left”) back to the underlying code parameters.
The Solution: DiagramAgent
The researchers propose a framework called DiagramAgent. Instead of relying on a single AI model to do everything, they employ an agentic workflow that splits the cognitive load into four distinct roles: Planning, Coding, Checking, and Translating.

As illustrated in Figure 2, the process is not linear. It involves feedback loops and verification steps to ensure the final output is not just compilable, but correct. Let’s break down the four core agents.
1. The Plan Agent (The Manager)
The process begins with a user query, such as “Draw a flowchart for a login process.” The Plan Agent analyzes this request. If the instruction is vague, it performs Query Expansion—it uses an LLM to fill in the missing details (e.g., specifying distinct start/end nodes, decision diamonds, and error paths) to ensure the downstream agents have enough information.
\[ x_{comp} = f_{expand}(x_{ins}) \]

If the user wants to edit an existing diagram, the Plan Agent routes the task differently, engaging the Diagram-to-Code Agent to first understand the current state of the visual.
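To make the query-expansion step concrete, here is a minimal sketch of how such a planner might be wired up in Python. The prompt wording, the `call_llm` helper, and the routing dictionary are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of the Plan Agent: x_comp = f_expand(x_ins).
# `call_llm` is a hypothetical helper around any instruction-tuned LLM;
# plug in whichever client you actually use.

EXPAND_PROMPT = """You are planning a technical diagram.
Expand the user's request into a complete specification: list every node,
edge, label, and the diagram type (flowchart, graph, mind map, ...).

User request: {instruction}
Expanded specification:"""

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your preferred LLM client here")

def plan_agent(instruction: str, existing_diagram: str | None = None) -> dict:
    """Expand a vague instruction and decide how the task should be routed."""
    expanded = call_llm(EXPAND_PROMPT.format(instruction=instruction))
    task = "edit" if existing_diagram is not None else "generate"
    return {"task": task, "spec": expanded, "diagram": existing_diagram}
```

In generation mode the expanded specification goes straight to the Code Agent; in editing mode the existing diagram is first handed to the Diagram-to-Code Agent described below.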
2. The Code Agent (The Architect)
Once a detailed plan is established, the Code Agent takes over. This agent is built on a fine-tuned version of Qwen2.5-Coder. Its job is to translate the natural language plan into a specific domain language, typically LaTeX (using TikZ) or DOT (using Graphviz). These languages are vector-based and logically structured, making them perfect for diagrams.
\[ c_{diag} = f_{code}(x_{comp}) \]

Because the output is code in a vector-based language, every node the agent defines is rendered as a crisp, legible vector object, sidestepping the illegible-text problem of diffusion models.
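To illustrate what the Code Agent's output might look like, here is a hand-written DOT version of the login-flowchart example, wrapped in a small Python helper that compiles it with Graphviz's `dot` command. The node names and styling are invented for this example; in DiagramAgent the code itself would come from the fine-tuned Qwen2.5-Coder model.

```python
import subprocess

# A plausible Code Agent output for "draw a flowchart for a login process",
# written in DOT. In DiagramAgent this string would be produced by the model.
LOGIN_FLOWCHART_DOT = """
digraph login {
    rankdir=TB;
    start [shape=oval,    label="Start"];
    creds [shape=box,     label="Enter username / password"];
    valid [shape=diamond, label="Credentials valid?"];
    home  [shape=box,     label="Show dashboard"];
    error [shape=box,     label="Show error message"];
    done  [shape=oval,    label="End"];

    start -> creds -> valid;
    valid -> home  [label="yes"];
    valid -> error [label="no"];
    error -> creds;
    home  -> done;
}
"""

def render_dot(dot_source: str, out_path: str = "diagram.png") -> None:
    """Compile DOT source to an image using the Graphviz CLI."""
    subprocess.run(["dot", "-Tpng", "-o", out_path],
                   input=dot_source.encode(), check=True)

render_dot(LOGIN_FLOWCHART_DOT)
```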
3. The Diagram-to-Code Agent (The Translator)
This is arguably the most crucial innovation for the Editing task. If you give the system an image of a diagram and ask it to “change the blue box to red,” the system first needs to understand the code that would generate that image.
\[ c'_{diag} = f_{\text{diagram-to-code}}(D_{ori}) \]

This agent looks at a visual diagram (\(D_{ori}\)) and reverse-engineers the source code (\(c'_{diag}\)). This allows the system to manipulate the diagram programmatically rather than trying to edit pixels directly.
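Here is a minimal sketch of how this inverse step could be wired, assuming a vision-language model that accepts an image plus a text prompt; the `call_vlm` helper and prompt wording are assumptions for illustration, not an API from the paper.

```python
# Sketch of the Diagram-to-Code step: c'_diag = f_diagram-to-code(D_ori).
# `call_vlm` is a hypothetical wrapper around any vision-language model that
# takes an image and a text prompt and returns text.

DIAGRAM_TO_CODE_PROMPT = (
    "Here is an image of a technical diagram. Write TikZ or DOT source code "
    "that reproduces it as faithfully as possible: same nodes, same labels, "
    "same edges, same layout. Return only the code."
)

def call_vlm(image_path: str, prompt: str) -> str:
    raise NotImplementedError("plug in a vision-language model client here")

def diagram_to_code(image_path: str) -> str:
    """Reverse-engineer editable source code from a rendered diagram image."""
    return call_vlm(image_path, DIAGRAM_TO_CODE_PROMPT)
```

Once the diagram exists as code, an instruction such as "change the blue box to red" becomes a targeted code modification rather than a pixel-space operation.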
4. The Check Agent (The QA Team)
Generating code is risky; a single missing bracket can crash the compiler. The Check Agent acts as a rigorous Quality Assurance step. It performs two functions:
- Debugging: It compiles the code. If the compiler throws an error, it sends the error message back to the Code Agent for a fix.
- Verification: Even if the code compiles, it might look wrong. The Check Agent uses a Vision-Language Model (GPT-4o) to look at the rendered image and compare it against the user's original request.

\[ f_{\text{check}}(c) = f_{\text{debug}}(c_{diag}, c_{mod}, c'_{diag}) + f_{\text{verify}}(c_{diag}, c_{mod}, c'_{diag}) \]

This loop ensures that the final output is not only valid code but also a faithful visual representation of the user's intent.
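The debugging half of this loop amounts to "compile, read the error, ask for a fix, repeat." Below is a minimal sketch of that loop for DOT output; the retry limit, callback names, and failure message are assumptions for illustration.

```python
import subprocess

def compile_dot(dot_source: str, out_path: str = "diagram.png") -> str | None:
    """Try to compile DOT source; return the compiler error text, or None on success."""
    result = subprocess.run(["dot", "-Tpng", "-o", out_path],
                            input=dot_source.encode(), capture_output=True)
    return None if result.returncode == 0 else result.stderr.decode()

def check_agent(dot_source: str, fix_code, verify_visual, max_retries: int = 3) -> str:
    """Debug-then-verify loop.

    `fix_code(source, feedback)` asks the Code Agent for a corrected version;
    `verify_visual(image_path)` asks a VLM whether the render matches the request.
    Both are callbacks supplied by the rest of the pipeline.
    """
    for _ in range(max_retries):
        error = compile_dot(dot_source)
        if error is not None:                      # debugging: the compiler failed
            dot_source = fix_code(dot_source, error)
            continue
        if verify_visual("diagram.png"):           # verification: matches user intent?
            return dot_source
        dot_source = fix_code(dot_source, "rendered image does not match the request")
    return dot_source
```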
DiagramGenBenchmark: A New Standard
To train and evaluate these agents, the researchers needed high-quality data. Existing datasets focused either on natural scenes (like COCO) or on simple charts. They therefore created DiagramGenBenchmark, a curated dataset of over 6,900 samples covering eight distinct categories of structured visuals.

As seen in Figure 3, the diversity is significant. The benchmark includes:
- Model Architectures: Neural networks and system designs.
- Flowcharts: Decision trees and logic flows.
- Graphs: Directed and Undirected network graphs.
- Data Charts: Bar charts, Line charts, and Tables.
- Mind Maps: Hierarchical concept clusters.
The dataset includes training samples for three specific tasks: Generation (Text \(\to\) Diagram), Coding (Image \(\to\) Code), and Editing (Text + Diagram \(\to\) New Diagram).
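To make the three task formats concrete, here is what individual samples might look like; the field names and values are invented for illustration and are not taken from the released dataset.

```python
# Hypothetical shapes of DiagramGenBenchmark samples for the three tasks.
generation_sample = {
    "task": "generation",                  # text -> diagram
    "instruction": "Draw a three-layer feedforward neural network.",
    "target_code": r"\begin{tikzpicture} ... \end{tikzpicture}",  # TikZ ground truth
}

coding_sample = {
    "task": "coding",                      # image -> code
    "image": "samples/flowchart_0421.png",
    "target_code": "digraph { ... }",      # DOT ground truth
}

editing_sample = {
    "task": "editing",                     # text + diagram -> new diagram
    "instruction": "Change the decision node's fill color to red.",
    "source_code": "digraph { d [shape=diamond, fillcolor=blue] }",
    "target_code": "digraph { d [shape=diamond, fillcolor=red] }",
}
```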
Experimental Results
Does this multi-agent approach actually work better than just asking GPT-4o to “write some LaTeX”? The experiments suggest a resounding yes.
Diagram Generation Performance
The researchers compared DiagramAgent against state-of-the-art models including GPT-4o, DeepSeek-Coder, and Llama-3. They used metrics like Pass@1 (does the code compile and run on the first try?) and Visual Fidelity (does it look like the reference?).

Table 2 highlights the dominance of DiagramAgent. It achieves a Pass@1 rate of 58.15%, significantly higher than GPT-4o (49.81%) and nearly double that of general coding models like WizardCoder. The CodeBLEU score (measuring code similarity) is also state-of-the-art at 86.83.
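Reading Pass@1 as "the fraction of test instructions whose first generated program compiles and renders," a minimal sketch of the computation might look like this; the sample format and callback names are assumptions.

```python
def pass_at_1(samples, generate_code, compiles) -> float:
    """Percentage of samples whose first generated program compiles successfully.

    `generate_code(instruction)` produces a single candidate program;
    `compiles(code)` returns True if that code renders without error.
    Both are stand-ins for the real model and compiler wrappers.
    """
    successes = sum(compiles(generate_code(s["instruction"])) for s in samples)
    return 100.0 * successes / len(samples)
```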
The Power of Verification (Ablation Study)
Why does DiagramAgent perform so well? The ablation study reveals that the Check Agent is vital.

Table 3 shows that removing the compiler check drops performance, and removing the GPT-4o verification hurts visual-fidelity metrics such as LPIPS. The combination of verifying code syntax and visual output produces the robust performance seen in the main results.
Diagram Editing Capabilities
Perhaps the most impressive results come from the editing tasks. This is historically a very difficult problem for AI—taking an existing complex structure and making a specific, localized change without breaking the rest of the diagram.

In Table 6, DiagramAgent achieves a staggering 98.00% Pass@1 rate for editing tasks. This indicates that once the system has successfully “translated” the diagram into code (via the Diagram-to-Code agent), the Code Agent is extremely efficient at modifying that code to reflect user changes (e.g., changing colors, line styles, or labels).
Human Evaluation
Objective metrics are useful, but do humans actually prefer the diagrams? The researchers enlisted human evaluators to rate the outputs on a scale of 1-5.

Figure 4 visualizes these results. DiagramAgent (represented by the outermost line) consistently scores higher in both generation and editing tasks compared to powerful closed-source models like Gemini and GPT-4o.
Error Analysis: Where does it fail?
Despite the high scores, the model is not perfect. The authors provide a transparent look at common failure modes.

As shown in Figure 14, the model sometimes struggles with:
- Shape Understanding: Using a rectangle when a circle was requested.
- Structure Understanding: Connecting nodes in the wrong order (e.g., a neural network layer pointing backward).
- Content Understanding: Hallucinating text or mislabeling axes.
These errors highlight that while the code generation is strong, the model’s semantic understanding of complex spatial relationships still has room for improvement.
Conclusion
The DiagramAgent framework represents a shift in how we think about generative AI for visuals. By moving away from pixel-based diffusion and embracing a structured, code-centric, multi-agent approach, the researchers have unlocked the ability to create precise, editable, and logically coherent diagrams.
For students and professionals, this technology hints at a future where creating a complex system architecture diagram or a scientific flowchart is as easy as typing a sentence. No longer bound by the tedious drag-and-drop of manual tools or the hallucinated mess of image generators, we are moving toward a world where words can be instantly transformed into structured knowledge.
This paper not only provides a powerful tool but also establishes the DiagramGenBenchmark, paving the way for future research to tackle the remaining challenges in layout optimization and complex spatial reasoning.