The world of 3D graphics—the backbone of modern video games, blockbuster movies, and architectural visualization—is notoriously complex. Creating a photorealistic scene isn’t just about artistic vision; it requires technical mastery of sophisticated software like Blender, Maya, or Unreal Engine. An artist doesn’t just “draw” a 3D chair; they manipulate geometry nodes, adjust material shaders, tweak lighting coordinates, and wrangle with physics simulations.
Because this process is so time-consuming and specialized, researchers have been racing to automate it using Artificial Intelligence. We’ve seen the rise of Vision-Language Models (VLMs) that can look at an image and understand what’s in it. The dream is simple: tell an AI, “Make the lights dimmer and turn that wooden table into glass,” and have it execute the task instantly.
However, there is a significant hurdle. While we have benchmarks to test how well an AI can write code or answer math questions, we haven’t had a comprehensive way to measure how well an AI can act as a 3D graphics artist—until now.
In this post, we are diving deep into BlenderGym, a pioneering research paper that introduces the first comprehensive benchmark for VLM-based 3D graphics editing. We will explore how researchers are testing AI agents within Blender, why current models struggle against human novices, and how the “verify and refine” approach might be the key to unlocking true autonomous 3D creation.

The Problem: Why 3D Editing is Hard for AI
To understand why BlenderGym is necessary, we first need to distinguish between generation and editing. Generative AI (like Midjourney) creates pixels from scratch. But in a 3D production pipeline, you rarely want a flat image. You need a 3D scene represented by code and parameters that you can modify later.
Previous attempts to automate this relied on Large Language Models (LLMs) to write code, but these systems often lacked “eyes.” They couldn’t see if the chair they just generated was floating three feet above the floor. Newer VLMs bridge this gap by processing both code and visual information.
However, evaluating these models has been messy. Researchers would often rely on:
- Limited Samples: Testing on a handful of scenes.
- Human Evaluation: Expensive, slow, and hard to scale.
- AI-as-Judge: Asking GPT-4 to grade its own homework, which introduces massive bias.
BlenderGym solves this by treating 3D editing as a reconstruction task. The benchmark provides a “Start Scene” and a “Goal Scene.” The AI’s job is to edit the underlying Python code of the Start Scene so that the resulting render looks exactly like the Goal Scene. This allows for objective, quantitative evaluation without needing a human in the loop.
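To make the scoring concrete, here is a minimal sketch of what a reconstruction-style metric could look like, assuming the edited render and the goal render are available as same-sized NumPy arrays. The paper reports Photometric Loss and CLIP-based similarity; the pixel-wise error below is only an illustrative stand-in, not the benchmark's exact implementation.

```python
import numpy as np

def photometric_loss(render: np.ndarray, goal: np.ndarray) -> float:
    """Mean pixel-wise squared error between the edited render and the goal render.

    Both images are float arrays in [0, 1] with identical shape (H, W, 3).
    This is an illustrative stand-in for the benchmark's actual metrics.
    """
    return float(np.mean((render - goal) ** 2))

# Lower is better: an edit that reproduces the goal scene exactly scores 0.
```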
The BlenderGym Benchmark: Five Pillars of 3D Editing
The researchers constructed 245 handcrafted scene pairs covering five essential skills required of any 3D artist. As illustrated in Figure 1 above, these tasks range from simple movements to complex procedural generation.
1. Object Placement
This tests spatial reasoning. Can the VLM perceive where objects are and move them to a target location? The AI must calculate coordinates or use relative directions (e.g., “move the chair to the right of the table”).
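As a rough illustration, an object-placement edit in Blender's Python API (bpy) can boil down to a one-line change like the sketch below. The object name and coordinates are made up for illustration; they are not taken from the benchmark scenes.

```python
import bpy

# Hypothetical example: shift the chair one unit along +X so it sits
# to the right of the table (object name and values are illustrative only).
chair = bpy.data.objects["Chair"]
chair.location.x += 1.0

# Absolute placement works the same way:
# chair.location = (2.0, 0.5, 0.0)
```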
2. Lighting Adjustment
Lighting sets the mood of a scene. This task involves manipulating light sources—changing their color, intensity, position, and orientation. Crucially, the AI sometimes has to infer the position of a light source based solely on the shadows it casts, a task requiring high-level physical understanding.
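A hedged sketch of what such an edit can look like in bpy is below; the light name, color, energy, and position are illustrative values, not taken from the benchmark.

```python
import bpy

# Hypothetical example: warm up and brighten a light, then move it.
light_obj = bpy.data.objects["KeyLight"]   # the light's scene object (position/rotation)
light_data = light_obj.data                # the underlying light datablock (color/intensity)

light_data.color = (1.0, 0.85, 0.7)        # RGB in [0, 1]
light_data.energy = 800.0                  # intensity in watts for point/area lights
light_obj.location = (2.0, -1.5, 3.0)
```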
3. Procedural Material Editing
In Blender, materials are often defined by “nodes” (a visual programming language represented as code). The AI must look at a surface (e.g., a paved road) and edit the code to change its texture (e.g., to a dirt path), adjusting roughness, color, and normals.
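For a sense of what the underlying edit looks like, the sketch below tweaks a Principled BSDF shader node through bpy; the material name and values are illustrative, not from the benchmark.

```python
import bpy

# Hypothetical example: retexture a surface by editing its shader node inputs.
mat = bpy.data.materials["Road"]
bsdf = mat.node_tree.nodes["Principled BSDF"]

bsdf.inputs["Base Color"].default_value = (0.35, 0.25, 0.15, 1.0)  # RGBA, dirt-like brown
bsdf.inputs["Roughness"].default_value = 0.9                       # matte, dusty look
```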
4. Blend Shape Manipulation
This involves morphing the geometry of an object using pre-defined sliders (blend shapes). Common examples include changing facial expressions on a character or modifying the body shape of a car. The AI must match semantic labels (like “smile”) to the correct visual outcome.
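In bpy terms, this usually means setting shape-key values, roughly as in the sketch below (the object and key names are hypothetical).

```python
import bpy

# Hypothetical example: dial in a facial expression via shape key values.
face = bpy.data.objects["Character_Head"]
keys = face.data.shape_keys.key_blocks

keys["Smile"].value = 1.0       # fully applied
keys["Brow_Raise"].value = 0.4  # partially applied; values typically range 0..1
```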
5. Procedural Geometry Editing
Perhaps the most difficult task, this requires editing the code that generates the 3D mesh itself. The AI might need to change the height of a procedural plant, the thickness of a donut, or the stacking order of books.
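Since these assets are generated by code, the edit typically amounts to changing the parameters of a generation routine. The sketch below is purely illustrative; the function and parameter names are invented, not the benchmark's.

```python
# Hypothetical sketch: procedural assets are often driven by a parameterized
# script, so the edit is a change to the generation parameters themselves.

def build_plant(height=1.0, leaf_count=8, stem_radius=0.05):
    """Placeholder for a procedural generator that emits Blender geometry."""
    ...

# Before the edit, the scene script might call build_plant(height=1.0);
# the model's job is to rewrite that call, e.g.:
build_plant(height=1.8, leaf_count=8, stem_radius=0.05)
```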
The Method: The Generator-Verifier Architecture
How does a VLM actually perform these edits? The researchers utilized a multi-agent system based on a pipeline known as BlenderAlchemy. It’s not enough to just ask a model to “fix the scene.” The process is broken down into a structured loop of generation and verification.

As shown in Figure 2, the system operates in two main phases:
Phase 1: The Generator
The Generator is the creative engine. It consists of two sub-agents:
- The Brainstormer: It looks at the Start and Goal images and identifies the visual gap (e.g., “The TV is too small”). It then scans the Python script to find the relevant lines of code (e.g., `TV_Size = 1.0`).
- The Code Editor: It takes the Brainstormer’s instructions and writes the actual Python code to apply the fix (e.g., changing `TV_Size` to `5.0`).
To maximize the chances of success, the Generator produces multiple different “drafts” or candidate edits in parallel.
Phase 2: The Verifier
This is the quality control step. The Verifier is a VLM that takes the rendered images of the Generator’s drafts and compares them to the Goal image. It asks a simple question: “Which of these attempts looks most like the goal?”
The best candidate is selected, and this becomes the new “Start Scene” for the next iteration. This cycle repeats, incrementally refining the scene.
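A compressed sketch of this generate-verify loop is below. The three callables stand in for the Generator (Brainstormer plus Code Editor), the Blender render step, and the Verifier; they are placeholders, and the real BlenderAlchemy-style pipeline adds prompting details, error handling, and parallel rendering on top.

```python
def edit_scene(start_script, goal_image, generate_candidates, render, verify,
               rounds=3, num_candidates=4):
    """Iteratively refine a Blender script toward a goal render.

    generate_candidates, render, and verify are placeholder callables for
    the Generator, the Blender render step, and the Verifier; they are not
    real APIs from the paper.
    """
    current = start_script
    for _ in range(rounds):
        # Phase 1: the Generator proposes several candidate edits.
        candidates = generate_candidates(current, goal_image, num_candidates)

        # Each candidate script is executed in Blender and rendered to an image.
        renders = [render(script) for script in candidates]

        # Phase 2: the Verifier picks the render that looks most like the goal.
        best = verify(renders, goal_image)

        # The winning edit becomes the starting point for the next iteration.
        current = candidates[best]
    return current
```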
Human vs. Machine: The Reality Check
One of the most valuable contributions of this paper is the inclusion of a human baseline. The researchers asked average Blender users to attempt these tasks (capped at 8 minutes per task) and compared their performance to state-of-the-art VLMs like GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5.
The results were stark. 3D graphics editing remains an unsolved problem for AI.

As you can see in the comparison above, even powerful closed-source models struggle with tasks that humans find relatively easy.
- Lighting: In the top row, the goal is a pinkish ambient light. While the human gets close, models like GPT-4o and Gemini produce intense purples or blues.
- Blend Shapes: In the bottom row, the models fail to capture the specific combination of changes required for the police car (lights, bumper, engine cover).
Why do the models fail?
The paper identifies several failure modes:
- Visual Hallucination: The “Brainstormer” often invents differences that don’t exist or fails to spot subtle ones.
- Code Syntax Errors: Writing executable Blender Python (BPY) code is difficult. For procedural tasks, open-source models failed to produce any working code in more than 75% of attempts.
- Semantic Disconnect: The model might correctly identify that the “road needs to be yellow” but edits the wrong variable in the code, changing the sky color instead.

Figure 11 highlights the coding struggle. In the “Material” task, models like Claude 3.5 Sonnet and GPT-4o attempt to change road markings but stumble on the syntax, inserting non-existent arguments or formatting the code incorrectly, rendering the edit useless.
To visualize the difficulty across all tasks, the researchers provide a calibration of metrics in the figure below. It shows how numerical scores (like Photometric Loss and CLIP score) correspond to visual quality.

The Insight: Scaling the Verifier
While the benchmark results show that current models aren’t ready to replace 3D artists, the paper offers a fascinating path forward involving Inference Scaling.
In the world of Large Language Models (LLMs), we know that “thinking longer” (generating more tokens or reasoning steps) often leads to better answers. The authors of BlenderGym asked: Does this apply to VLM graphics editing?
They discovered that simply asking the model to generate more ideas isn’t enough. The bottleneck is often the Verifier. If the system produces a perfect edit but the Verifier can’t recognize it, the edit is discarded.
Experiment: Scaling Up Verification
The researchers introduced a parameter \(k\) to the verification process. Instead of asking the Verifier to pick the winner once, they run the verification tournament \(k\) times (shuffling the candidates) and aggregate the results. This effectively gives the Verifier “more time to think.”
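One plausible way to implement this aggregation, sketched below, is to treat each of the \(k\) shuffled verification runs as a vote and pick the candidate chosen most often. The paper describes running \(k\) shuffled verification passes and aggregating; the majority-vote scheme and the `verify_once` callable here are assumptions for illustration.

```python
import random
from collections import Counter

def scaled_verify(candidates, goal_image, verify_once, k=5, seed=0):
    """Run the Verifier k times over shuffled candidate orderings and
    return the candidate chosen most often.

    verify_once(ordered_candidates, goal_image) -> index of the preferred
    candidate in that ordering; it is a placeholder for one VLM
    verification query.
    """
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(k):
        order = list(range(len(candidates)))
        rng.shuffle(order)                        # reshuffle to reduce position bias
        shuffled = [candidates[i] for i in order]
        winner_in_shuffled = verify_once(shuffled, goal_image)
        votes[order[winner_in_shuffled]] += 1     # map back to the original index
    return votes.most_common(1)[0][0]
```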

The results (Figure 5) are compelling. As the number of verification queries increases, the selected edits get significantly closer to the goal. Notably, the open-source model InternVL2-8B, when allowed to scale its verification compute, actually outperformed unscaled versions of GPT-4o and Claude 3.5 Sonnet.
The Strategic Allocation of Compute
This leads to a critical question for system designers: If you have a fixed budget (time or money), should you spend it on generating more ideas or on double-checking the ideas you have?
The paper suggests the answer depends on your total budget.

Looking at the Photometric Loss graph in Figure 6:
- Low Budget (< 100 queries): It’s better to prioritize Generation. You need to cast a wide net to find any decent solution. The “Generate” lines (lower verification ratio) perform better here.
- High Budget (> 100 queries): It’s better to prioritize Verification. Once you have a sufficient pool of ideas, the marginal gain of a new idea is low. The value comes from rigorously filtering for the absolute best one. The purple line (VeriRatio = 0.73) clearly wins in the long run.
This finding aligns with recent “System 2” reasoning research in LLMs: meaningful improvement in AI performance demands a shift from sheer generation capacity to a “propose-verify-improve” workflow.
Conclusion and Future Implications
BlenderGym serves as a reality check for the field of generative 3D. While flashy demos often suggest that text-to-3D is a solved problem, the ability to perform precise, code-based edits—the kind required for professional workflows—is still lagging behind human capability.
However, the paper provides a clear roadmap for improvement. It’s not just about training bigger models; it’s about better systems. By treating VLMs not just as creators but as critics, and by strategically allocating compute to the verification phase, we can significantly boost performance.
For students and researchers entering this field, BlenderGym offers a standardized playground. It moves the goalposts from “does this look cool?” to “is this exactly what I asked for?”, a necessary maturation for AI if it is ever to become a true co-pilot for 3D artists.
Key Takeaways
- BlenderGym is the first comprehensive benchmark for code-based VLM 3D editing.
- Humans still rule: Current VLMs struggle with the nuance of 3D attributes and BPY syntax.
- Verification is key: Scaling the “critic” (verifier) is often more effective than scaling the “creator” (generator).
- Compute Strategy: As you scale up, allocate more resources to verification to achieve the best results.