The field of Generative AI has moved at a breakneck pace. We started with blurry 2D images, moved to high-fidelity photorealism, and have now arrived at the frontier of generating 3D assets from simple text prompts. Tools like DreamFusion and various mesh generators can create “a beagle in a detective’s outfit” in seconds.
But there is a catch.
Most current methods generate what we call “unstructured” assets. The beagle, the detective hat, and the magnifying glass are all fused into a single, continuous mesh or radiance field. For a game developer or an animator, this is a problem. You cannot simply take the hat off the dog, nor can you animate the legs independently, because the model doesn’t know where the leg ends and the body begins. In professional workflows, structure is just as important as appearance.
Enter PartGen, a new framework proposed by researchers from the University of Oxford and Meta AI. PartGen moves beyond generating “statues” and starts generating “action figures”—objects composed of distinct, meaningful, and separable parts.

As illustrated above, PartGen can take a text prompt, an image, or even an existing raw 3D scan and decompose it into semantic components—like separating a crown from a panda or wheels from a toy duck.
In this deep dive, we will explore how PartGen utilizes multi-view diffusion models to tackle two massive challenges: identifying where parts begin and end, and—crucially—hallucinating the geometry of parts that are hidden from view.
The Problem with “Fused” Geometry
To understand why PartGen is necessary, we must look at the limitations of standard 3D reconstruction. Most text-to-3D pipelines work in two stages:
- Multi-view Generation: A diffusion model generates images of an object from different angles (e.g., front, back, left, right).
- Reconstruction: A neural network (like a Large Reconstruction Model, or LRM) takes these images and lifts them into a 3D shape.
This process is inherently superficial—it only cares about the visible surface (the “shell”). If a character is wearing a backpack, the back of the character underneath the backpack simply doesn’t exist in the generated geometry. If you were to segment the backpack and remove it, you would be left with a gaping hole in the character’s back.
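To make this two-stage flow concrete, here is a minimal Python sketch. The type aliases and function names (`GenerateViews`, `Reconstruct`, `standard_text_to_3d`) are our own placeholders for any multi-view generator and feed-forward reconstructor, not code from the paper.

```python
from typing import Any, Callable, List

# Hypothetical interfaces for illustration: any multi-view diffusion model and
# any feed-forward reconstructor (e.g. an LRM-style network) would fit here.
GenerateViews = Callable[[str], List[Any]]   # text prompt -> N rendered views
Reconstruct = Callable[[List[Any]], Any]     # N views -> one fused "shell" mesh


def standard_text_to_3d(prompt: str,
                        generate_views: GenerateViews,
                        reconstruct: Reconstruct) -> Any:
    """Typical two-stage pipeline: the output is a single fused surface.

    Nothing here knows about parts, so geometry hidden behind other parts
    (e.g. the back of a character under a backpack) is never created.
    """
    views = generate_views(prompt)     # stage 1: e.g. front / back / left / right
    shell = reconstruct(views)         # stage 2: lift visible surfaces into 3D
    return shell
```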
PartGen proposes a pipeline that doesn’t just cut the mesh; it understands the object’s internal structure and regenerates the missing pieces.
The PartGen Pipeline
The researchers approach the problem by leveraging the power of Multi-View Diffusion Models. These are generative models trained not just to make one image, but to make a consistent grid of images representing an object from multiple angles.
PartGen repurposes these models for two specific tasks: Segmentation and Completion.

The architecture, shown in Figure 2, follows a distinct sequence (sketched in code after the list below):
- Grid View Generation: Generate the initial multi-view images of the object.
- Multi-View Segmentation: Identify the parts across all views consistently.
- Part Completion: Inpaint and “hallucinate” the hidden geometry of each part.
- Reconstruction: Lift the completed parts into 3D.
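As a minimal sketch of how these four stages compose, assuming hypothetical callables for each stage (none of these names come from the paper’s code):

```python
def partgen_pipeline(prompt, generate_views, segment_views,
                     complete_part, reconstruct):
    """Hypothetical composition of PartGen's four stages (names are ours)."""
    grid = generate_views(prompt)                 # 1. multi-view grid of the whole object
    part_masks = segment_views(grid)              # 2. view-consistent part masks
    parts_3d = []
    for mask in part_masks:
        full_views = complete_part(grid, mask)    # 3. inpaint the part's hidden surfaces
        parts_3d.append(reconstruct(full_views))  # 4. lift each completed part into 3D
    return parts_3d                               # list of separable 3D parts
```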
Let’s break down the two core innovations: the segmentation strategy and the context-aware completion.
1. Multi-View Segmentation as a Generative Task
Traditional segmentation in computer vision is often treated as a classification problem: “Is this pixel part of a car or a road?” However, decomposing a 3D object is ambiguous. Does an artist consider a “shoe” one part, or are the “sole,” “laces,” and “tongue” separate parts? There is no single “gold standard.”
To handle this ambiguity, PartGen casts segmentation as a generative coloring problem.
The researchers fine-tuned a multi-view image generator to output color-coded segmentation maps instead of RGB images. Because the model is generative and stochastic (random), it can sample different plausible segmentations. If you run it once, it might segment a tank into “turret” and “body.” Run it again, and it might split the “tracks” from the “body” as well. This captures the variety of human artistic intent.
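Because the segmenter is a sampler rather than a classifier, you can draw several decompositions and keep whichever suits the task. The sketch below assumes a hypothetical `seg_diffusion` callable standing in for the fine-tuned multi-view generator that outputs color-coded part maps; the mask-extraction step is our own illustration.

```python
import torch

def sample_segmentations(seg_diffusion, view_grid, n_samples=3):
    """Draw several plausible decompositions from a stochastic segmenter.

    Each sample may split the object differently (e.g. 'turret + body'
    vs. 'turret + body + tracks'), reflecting different artistic choices.
    """
    samples = []
    for seed in range(n_samples):
        generator = torch.Generator().manual_seed(seed)
        colour_map = seg_diffusion(view_grid, generator=generator)   # (H, W, 3)
        # Each distinct colour in the map corresponds to one part, consistently
        # across all views in the grid.
        colours = torch.unique(colour_map.reshape(-1, 3), dim=0)
        masks = [(colour_map == c).all(dim=-1) for c in colours]     # binary part masks
        samples.append(masks)
    return samples
```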

As seen in the comparison above, this approach yields significantly cleaner results for 3D objects compared to standard 2D segmentation models like SAM (Segment Anything Model). While SAM is excellent at identifying edges in single images, it often struggles to maintain consistency across multiple views of the same object. PartGen’s model, trained explicitly on multi-view artist data, understands that the “head” in the front view corresponds to the “head” in the side view.
2. Contextual Part Completion: Filling in the Blanks
This is arguably the most technically impressive contribution of the paper. Once a part (e.g., a chair leg or a character’s shirt) is segmented, it is often partially occluded by other parts.
If we simply fed the segmented view to a 3D reconstruction model, the model would fail to create a watertight mesh because it lacks information about the hidden surfaces. PartGen solves this by training a Multi-View Part Completion Network.
This network takes three inputs:
- The Masked Part: The image of the part with everything else blacked out.
- The Mask: The binary shape of the part.
- The Context (Full Object): The original image of the entire object.
The “Context” is vital. Imagine you are reconstructing a partially visible wheel on a car. If the model only sees the sliver of the wheel, it might reconstruct a generic cylinder. But if it sees the whole car (the context), it knows the style, texture, and proportion of the car, and can generate a wheel that fits the specific aesthetic of that vehicle.
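A minimal sketch of how those three inputs might be assembled as conditioning is shown below. We assume channel-wise concatenation, a common choice for inpainting-style diffusion models; the paper’s exact conditioning mechanism may differ.

```python
import torch

def build_completion_conditioning(view_grid, part_mask):
    """Assemble the three completion inputs described above.

    Shapes are illustrative: view_grid is a (3, H, W) RGB grid of the full
    object, part_mask is a (1, H, W) binary mask of the target part.
    """
    masked_part = view_grid * part_mask     # 1. the part, everything else blacked out
    context = view_grid                     # 3. the full object, so style and scale can be inferred
    # 2. the mask itself sits between them; concatenate along the channel axis.
    conditioning = torch.cat([masked_part, part_mask, context], dim=0)  # (3+1+3, H, W)
    return conditioning
```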

Figure 5 demonstrates this capability. Look at the top row (the green sphere). The input is heavily occluded—essentially a slice of a sphere. The model successfully “hallucinates” the rest of the sphere to make it a complete, round object. In the bottom row, it reconstructs the hidden side of a tank turret. This is what transforms a surface-level scan into a functional 3D asset.
Training Data: Learning from Artists
To train these models, you cannot rely on standard photo datasets. The model needs to understand how humans structure 3D objects. The researchers utilized a dataset of roughly 140,000 artist-created 3D assets.

These assets (Figure 3) come naturally decomposed: a 3D artist builds a snowman by placing a hat mesh on top of a head mesh. By rendering these distinct meshes into multi-view images and segmentation maps, the researchers created a rich training ground for PartGen to learn the “grammar” of object decomposition.
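Conceptually, turning one decomposed asset into a training example might look like the sketch below; `render`, `with_flat_colour`, and `palette` are illustrative stand-ins, not the authors’ tooling.

```python
def make_training_example(asset_parts, render, palette):
    """Turn one artist-decomposed asset into a (views, part-map) training pair.

    `asset_parts` is the list of separate meshes an artist authored (hat, head,
    body, ...), `render` is any multi-view renderer, and `palette` assigns one
    flat colour per part.
    """
    rgb_views = render(asset_parts)                    # ordinary multi-view renders
    part_map_views = render(
        [mesh.with_flat_colour(palette[i])             # hypothetical recolouring helper
         for i, mesh in enumerate(asset_parts)]
    )                                                  # colour-coded segmentation maps
    return rgb_views, part_map_views
```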
Experimental Results
The researchers evaluated PartGen against several baselines, primarily focusing on how well it segments objects and how accurately it reconstructs hidden geometry.
Segmentation Performance
In the segmentation task, PartGen was compared against SAM2 and its fine-tuned variants. The metric used was Mean Average Precision (mAP), which measures how well the predicted parts align with ground-truth artist decompositions.
The results were compelling. Because PartGen treats segmentation as a multi-view consistent generative task, it outperformed SAM2-based approaches by a large margin.
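For intuition, the core of any such evaluation is IoU-based matching between predicted and ground-truth parts; mAP then averages precision over confidence thresholds and decompositions. The toy sketch below shows only that matching step, not the paper’s full protocol.

```python
import numpy as np

def precision_recall_at_iou(pred_masks, gt_masks, iou_thresh=0.5):
    """Toy matching step: a predicted part counts as correct if it overlaps
    some still-unmatched ground-truth part above the IoU threshold.
    Assumes boolean mask arrays, predictions sorted by confidence."""
    matched = set()
    hits = 0
    for pred in pred_masks:
        for j, gt in enumerate(gt_masks):
            if j in matched:
                continue
            union = np.logical_or(pred, gt).sum()
            iou = np.logical_and(pred, gt).sum() / union if union else 0.0
            if iou >= iou_thresh:
                matched.add(j)
                hits += 1
                break
    precision = hits / max(len(pred_masks), 1)
    recall = hits / max(len(gt_masks), 1)
    return precision, recall
```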

The recall curves above (Figure 9) clearly show PartGen (the top lines) retrieving a higher percentage of correct parts compared to the baselines.
Completion Accuracy
For part completion, the challenge is harder to quantify—how do you measure the quality of a “hallucination”? The researchers compared the reconstructed parts against the ground truth hidden geometry using metrics like CLIP score (semantic similarity) and LPIPS (perceptual similarity).
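Both metrics are available off the shelf. Below is a rough sketch of scoring a completed-part render against ground truth using the `lpips` and OpenAI `clip` packages; the paper’s exact evaluation protocol may differ.

```python
import torch
import torch.nn.functional as F
import lpips   # pip install lpips
import clip    # pip install git+https://github.com/openai/CLIP.git

# CLIP's standard input normalisation constants.
_CLIP_MEAN = torch.tensor([0.48145466, 0.4578275, 0.40821073]).view(1, 3, 1, 1)
_CLIP_STD = torch.tensor([0.26862954, 0.26130258, 0.27577711]).view(1, 3, 1, 1)


def completion_metrics(pred_img, gt_img, device="cpu"):
    """Compare a completed-part render with the ground-truth render.

    pred_img / gt_img: (1, 3, H, W) tensors with values in [0, 1]. Returns
    LPIPS (lower is better) and CLIP cosine similarity (higher is better).
    """
    # LPIPS expects inputs scaled to [-1, 1].
    lpips_fn = lpips.LPIPS(net="alex").to(device)
    dist = lpips_fn(pred_img.to(device) * 2 - 1, gt_img.to(device) * 2 - 1).item()

    # CLIP similarity: resize + normalise, embed both renders, take cosine sim.
    model, _ = clip.load("ViT-B/32", device=device)

    def embed(img):
        img = F.interpolate(img, size=224, mode="bilinear", align_corners=False)
        img = (img - _CLIP_MEAN.to(device)) / _CLIP_STD.to(device)
        with torch.no_grad():
            return model.encode_image(img)

    sim = torch.cosine_similarity(embed(pred_img.to(device)),
                                  embed(gt_img.to(device))).item()
    return {"lpips": dist, "clip_similarity": sim}
```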

Table 2 highlights that Context is king. The version of the model that included the full object context (the second row) significantly outperformed the version without context. This proves that the network isn’t just memorizing shapes; it is actively using the surrounding visual information to infer what the hidden geometry should look like.
Applications: Editing and Decomposition
The utility of PartGen extends beyond just generating assets. It enables new workflows for editing.
Part-Aware Generation
PartGen handles text-to-3D and image-to-3D tasks while keeping the parts distinct. This allows for complex generations where internal parts are fully modeled, such as a “hamburger” where the patty and lettuce are complete meshes inside the bun, rather than just surface textures.

3D Part Editing
Perhaps the most exciting application for creatives is 3D Part Editing. Because the object is structurally decomposed, a user can target a specific part and use a text prompt to modify only that component.

As shown in Figure 7, you can take a character in a karate uniform and swap the top for a “Hawaii shirt” or change a standard coffee cup into a “pink cup with square bottom.” This is achieved by fine-tuning the completion network to accept text instructions along with the mask, effectively telling the model: “Keep the rest of the object, but regenerate this masked area to look like a cowboy hat.”

The mechanism for this (Figure 8) involves masking the target part and conditioning the diffusion model on the remaining context plus a text prompt describing the new desired part.
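A minimal sketch of that conditioning setup is shown below, with `completion_model` standing in for the text-conditioned completion network; the names and call signature are ours, not the paper’s.

```python
def edit_part(view_grid, part_mask, prompt, completion_model):
    """Sketch of the editing mechanism described around Figure 8.

    The target part is masked out of the multi-view grid; the diffusion model
    then regenerates only that region, conditioned on the untouched context
    and a text description of the replacement part.
    """
    context = view_grid * (1 - part_mask)   # keep everything except the edited part
    edited_views = completion_model(
        context=context,
        mask=part_mask,
        text=prompt,                        # e.g. "a Hawaii shirt"
    )
    # The edited multi-view images are then reconstructed into 3D as before.
    return edited_views
```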
Conclusion
PartGen represents a significant step toward making generative 3D useful for actual production pipelines. By moving away from monolithic, fused meshes and towards structured, part-based representations, it bridges the gap between AI generation and human design workflows.
The core takeaway here is the versatility of Diffusion Models. We often think of them as simple image creators, but PartGen demonstrates they are excellent engines for:
- Ambiguous Decision Making: Deciding how to segment an object.
- Spatial Reasoning: Ensuring consistency across 4 different camera angles.
- Semantic Inpainting: Hallucinating hidden geometry based on context.
While limitations exist—the system is computationally heavy due to multiple diffusion steps, and it relies on the quality of the underlying 3D reconstruction model—PartGen paves the way for a future where we can generate fully rigged, editable, and articulated 3D characters from a single sentence.