Dream3DVG: Bridging the Gap Between Text-to-3D and Vector Graphics

In the world of digital design, vector graphics are the gold standard for clarity and scalability. Unlike pixel-based images (raster graphics), which get blurry when you zoom in, vector graphics are defined by mathematical paths—lines, curves, and shapes—that remain crisp at any resolution. They are the backbone of logos, icons, and conceptual art.

However, vector graphics have traditionally been shackled to a 2D plane. If you draw a vector sketch of a car, you can’t simply rotate it to see the back bumper; the drawing is fixed from that specific viewpoint. While recent advancements in AI have enabled “Text-to-3D” generation, applying these techniques to the sparse, abstract world of vector strokes has been notoriously difficult. When you try to force standard 3D generation methods to create line drawings, you often end up with a “tangle of wires”—messy, inconsistent lines that don’t look like a cohesive drawing when rotated.

Enter Dream3DVG, a novel framework presented by researchers from the University of Chinese Academy of Sciences and Chongqing University. This paper proposes a method to generate 3D Vector Graphics (3DVG) that are not only high-quality and consistent from any viewing angle but also smart enough to know which lines should be hidden behind others.

In this deep dive, we will explore how Dream3DVG empowers vector graphics to step into the third dimension.

The Core Problem: Why is 3D Sketching So Hard?

To understand the innovation here, we first need to look at why this is a difficult problem.

  1. The Domain Gap: Most modern Text-to-3D models (like DreamFusion) work by “distilling” knowledge from 2D image generators (like Stable Diffusion). These 2D models are trained mostly on photos. They struggle to guide the creation of abstract line drawings because vector graphics are sparse—they are mostly empty space with a few thin lines.
  2. View Consistency: A 3D object needs to look like the same object from every angle. Previous attempts to generate 3D sketches often resulted in “view-dependent” artifacts, where lines would jitter or disappear illogically as the camera moved.
  3. Occlusion (The “X-Ray” Effect): In a 2D drawing of a sphere, you only draw the front. In a 3D wireframe, you see the front and the back simultaneously. To make a 3D sketch look like a real drawing, the system needs to hide the lines that are supposed to be blocked by the object’s body. Standard 3D rendering handles this easily for solid surfaces, but vector curves don’t have “surfaces” in the traditional sense—they are just floating lines.

The Solution: Dream3DVG Architecture

The researchers propose a dual-branch optimization framework. Instead of trying to generate the vector graphics directly from text, they use a “buddy system.” They optimize a dense 3D model (specifically, 3D Gaussian Splatting) alongside the vector graphics to act as a guide.

Figure 2. The overall architecture. The method takes a text prompt as input and outputs rendered 2D vector graphics (2DVG). The network consists of two branches: a 3DGS optimization branch (top row) that optimizes a 3DGS model with the text prompt and samples coarse-to-fine guidance images, and a 3D vector graphics (3DVG) optimization branch (bottom row) that generates the 3DVG and renders 2DVG with reasonable occlusion via a Visibility-Awareness Rendering module.

As shown in Figure 2, the architecture consists of two parallel processes:

  1. The Auxiliary Branch (Top): This branch optimizes a 3D Gaussian Splatting (3DGS) model based on the text prompt. 3DGS is excellent at representing solid geometry and textures.
  2. The 3DVG Branch (Bottom): This branch optimizes the actual 3D vector curves.

The key insight is that the 3DGS branch acts as a bridge. It provides a stable, consistent 3D structure that the vector branch can try to mimic.

1. Representing 3D Vector Graphics

Before we can optimize anything, we need a mathematical way to describe a sketch in 3D space. The authors use 3D Cubic Bézier Curves.

A standard 2D Bézier curve is defined by control points on a flat plane. A 3D Bézier curve extends this concept using control points in \((x, y, z)\) space.

Equation 1: 3D Bézier Curve parametrization.

Here, \(B^{3D}(t)\) represents the curve, and \(p^i\) are the 3D control points. By adjusting the positions of these control points, the AI can change the shape of the curves.
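For reference, the standard cubic Bézier parametrization (which is what Equation 1 expresses, with four control points \(p^0, \dots, p^3\)) is

\[
B^{3D}(t) = \sum_{i=0}^{3} \binom{3}{i} (1-t)^{3-i}\, t^{i}\, p^{i}, \qquad t \in [0, 1].
\]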

The system supports two styles:

  • Sketches: Single curves representing strokes.
  • Iconography: Closed shapes formed by curves (usually 4, joined end-to-end into a loop) that can be filled with color.

When these 3D curves are projected onto a 2D camera view, the projected control points still define standard 2D Bézier curves, so the result can be rendered by standard differentiable vector rasterizers.
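To make this concrete, here is a minimal NumPy sketch (not the authors' code) that evaluates one 3D cubic Bézier stroke and projects its control points with a toy pinhole camera; the projected control points then define the 2D curve a differentiable rasterizer would draw:

```python
import numpy as np

def cubic_bezier(p, t):
    """Evaluate a cubic Bezier curve with control points p of shape (4, d) at parameters t of shape (n,)."""
    t = t[:, None]
    return ((1 - t) ** 3 * p[0] + 3 * (1 - t) ** 2 * t * p[1]
            + 3 * (1 - t) * t ** 2 * p[2] + t ** 3 * p[3])

def project(points, focal=1.0):
    """Toy pinhole projection: camera at the origin looking down +z."""
    return focal * points[..., :2] / points[..., 2:3]

ctrl_3d = np.array([[0.0, 0.0, 2.0],
                    [0.3, 0.5, 2.2],
                    [0.7, 0.4, 2.5],
                    [1.0, 0.0, 2.8]])   # one stroke's 3D control points
t = np.linspace(0.0, 1.0, 32)

curve_3d = cubic_bezier(ctrl_3d, t)     # points along the 3D stroke
ctrl_2d = project(ctrl_3d)              # projected control points
curve_2d = cubic_bezier(ctrl_2d, t)     # the 2D Bezier curve a differentiable rasterizer would draw
```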

2. Coarse-to-Fine Guidance Strategy

How do we train these curves? We can’t just ask the AI to “draw a cat.” We need to provide a target image for it to match.

The researchers use a technique called Interval Score Matching (ISM) to update the 3DGS model. ISM helps extract a consistent “trajectory” from the diffusion model, reducing the noise that often plagues Text-to-3D generation.

Equation 2: ISM Loss function.
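The exact formulation is given in Equation 2; as a rough reference, the ISM gradient introduced in LucidDreamer takes a form along the lines of the following, where \(x_s\) is obtained by DDIM inversion to an earlier timestep \(s < t\), \(y\) is the text condition, and \(\varnothing\) denotes the unconditional input:

\[
\nabla_{\theta}\mathcal{L}_{\mathrm{ISM}}(\theta) \approx \mathbb{E}_{t,c}\Big[\,\omega(t)\,\big(\epsilon_{\phi}(x_t, t, y) - \epsilon_{\phi}(x_s, s, \varnothing)\big)\,\tfrac{\partial g(\theta, c)}{\partial \theta}\Big].
\]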

But here is the clever part: they don’t just take the final rendered image from the 3DGS branch. They leverage the optimization process itself.

In the early stages of generation, you want the model to focus on the overall shape (the silhouette). Later, you want it to focus on details (texture, fur, eyes). The authors implement a Coarse-to-Fine (C2F) strategy by manipulating the Classifier-Free Guidance (CFG) scale of the diffusion model.

Equation 3: Guidance Sampling with scheduled CFG.

By scheduling the CFG and the timestep \(t\), they can generate guidance images that evolve from smooth, general shapes to detailed depictions.
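As a rough illustration (the placeholder values below, and even the direction of the ramps, are assumptions rather than the paper's exact schedule), the idea is simply that both the guidance scale and the sampled noise level change over the course of optimization:

```python
# Illustrative coarse-to-fine schedule (placeholder values, not the paper's exact numbers):
# early steps use a high noise level and a gentle guidance scale (smooth, shape-level targets);
# later steps use a low noise level and a stronger guidance scale (sharp, detailed targets).
def c2f_schedule(step, total_steps, cfg_range=(7.5, 30.0), t_range=(0.98, 0.2)):
    alpha = step / max(total_steps - 1, 1)               # training progress in [0, 1]
    cfg = cfg_range[0] + alpha * (cfg_range[1] - cfg_range[0])
    t = t_range[0] + alpha * (t_range[1] - t_range[0])   # normalized diffusion timestep
    return cfg, t

for step in (0, 500, 999):
    print(step, c2f_schedule(step, 1000))
```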

Figure 3. Guidance samples during the 3DGS optimization. Note the bottom row “Ours” which shows smooth transitions.

As seen in Figure 3, the “Ours” row (bottom) produces guidance images that start very smooth and abstract (left) and become sharp and detailed (right). This prevents the vector generation from getting distracted by noise early on, ensuring distinct, clean strokes.

The loss function for the vector graphics (\(\mathcal{L}_{VG}\)) then tries to match the vector render to these guidance images using both perceptual loss (LPIPS) and semantic loss (CLIP).

Equation 4: The Vector Graphics Loss Function.
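A sketch of what such a combined objective could look like in PyTorch is below; the weights, the `encode_image` callable standing in for a CLIP image encoder, and the exact term definitions are assumptions rather than the paper's implementation (only the `lpips` package usage is standard):

```python
import torch
import lpips

lpips_fn = lpips.LPIPS(net="vgg")  # perceptual distance, expects (N, 3, H, W) images in [-1, 1]

def vector_graphics_loss(vg_render, guidance_img, encode_image,
                         w_lpips=1.0, w_clip=1.0):
    # Perceptual term: the vector render should "look like" the guidance image.
    l_perc = lpips_fn(vg_render, guidance_img).mean()
    # Semantic term: their CLIP embeddings should point in the same direction.
    f_vg = torch.nn.functional.normalize(encode_image(vg_render), dim=-1)
    f_gd = torch.nn.functional.normalize(encode_image(guidance_img), dim=-1)
    l_clip = 1.0 - (f_vg * f_gd).sum(dim=-1).mean()
    return w_lpips * l_perc + w_clip * l_clip
```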

3. Visibility-Awareness Rendering (VAR)

Perhaps the most significant contribution of this paper is how it handles the “X-Ray” problem. When you look at a 3D wireframe of a cat, you shouldn’t see the curves that make up the back legs if the body is blocking them. But since curves are thin lines, standard depth buffers don’t work well.

The researchers devised a two-step module called Visibility-Awareness Rendering.

Figure 4. Illustration of Visibility-Awareness Rendering.

Step A: Importance Filtering

First, the system needs to decide which curves are actually important for the current view. They train a small neural network (an MLP) that takes a 3D point and the camera view as input and outputs an “importance” score.

Equation 5: Importance Function.

If a curve’s importance score is too low (meaning it’s in a region that contributes little to the visual appearance, or is heavily occluded), it gets filtered out early.
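A minimal PyTorch sketch of such an importance predictor is shown below; the network width, inputs, and threshold are illustrative assumptions rather than the paper's exact design:

```python
import torch
import torch.nn as nn

class ImportanceMLP(nn.Module):
    """Predict an importance score in [0, 1] for a 3D point given the viewing direction."""
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(6, hidden), nn.ReLU(),      # input: point (x, y, z) + view direction (3)
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),   # importance score in [0, 1]
        )

    def forward(self, points, view_dirs):
        return self.net(torch.cat([points, view_dirs], dim=-1)).squeeze(-1)

mlp = ImportanceMLP()
pts = torch.randn(128, 3)                                          # sampled points along curves
views = torch.nn.functional.normalize(torch.randn(128, 3), dim=-1) # unit view directions
scores = mlp(pts, views)
keep = scores > 0.5   # curves whose points score too low would be filtered out (threshold is a placeholder)
```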

Step B: Antipodal-Depth Visibility Voting

This is the geometric check. Even if a curve is “important,” it might still be behind the object. To check this, the system compares the curve’s position against the depth map of the dense 3DGS model.

The system uses a “voting” mechanism. It projects points on the curve into the current camera view and reads the depth \(D\) from the 3DGS depth map, and it does the same for the “antipodal” view (the camera placed on the exact opposite side of the object).

Equation 6: Depth comparison inequality.

Essentially, this equation asks: “Is this curve point closer to the front surface of the 3D object, or the back surface?” If the curve is closer to the back surface (the antipodal depth), it implies the curve is on the far side of the object and should be hidden (culled).
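In code, the vote for a single curve point might look roughly like this (a hedged sketch; the paper's precise inequality and thresholds may differ):

```python
# Compare how far a curve point lies behind the front surface vs. behind the
# back (antipodal-view) surface of the 3DGS proxy; hide it if it sits closer to the back.
def visible_by_antipodal_vote(d_point_front, d_surface_front,
                              d_point_back, d_surface_back):
    """Scalar depths for one curve point.
    d_point_front / d_point_back: depth of the point in the current / antipodal camera.
    d_surface_front / d_surface_back: 3DGS depth-map values at the projected pixel."""
    gap_front = d_point_front - d_surface_front   # distance behind the visible front surface
    gap_back = d_point_back - d_surface_back      # distance behind the antipodal surface
    return gap_front <= gap_back                  # closer to the front surface -> visible

# Example: a point 0.4 behind the front surface but only 0.1 behind the back surface
# is judged to lie on the far side of the object, so it is culled.
print(visible_by_antipodal_vote(2.4, 2.0, 1.6, 1.5))  # False -> hidden
```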

This combination of learning-based importance and geometry-based depth voting allows Dream3DVG to produce clean, occlusion-aware renderings.

Experiments and Results

The researchers compared Dream3DVG against several state-of-the-art methods, including Diff3DS (a text-to-3D sketch method), 3Doodle (which requires multi-view images as input), and CLIPasso (a 2D vectorization method applied to 3D renders).

Qualitative Comparisons

The visual results highlight the strength of the dual-branch approach.

Figure 5. Qualitative results of 3D sketch. Comparison between Diff3DS, 3Doodle, and Ours.

In Figure 5, look at the “Benz car” and the “Saber” character.

  • Diff3DS often produces “hairy” or messy lines that don’t clearly define the shape.
  • 3Doodle is cleaner but relies on input images, whereas Dream3DVG works from text.
  • Ours (Dream3DVG) produces clear outlines, defined wheels, and facial features. Crucially, notice how clean the interiors are—the occlusion handling is working.

Figure 1 illustrates the capability to generate both sketches and filled iconography.

Figure 1. Examples of multiview vector graphics generated by our method conditioned on text prompts.

The top two rows show sketches (clocks, handbags, shoes), while the bottom row shows “iconography”—filled vector shapes (dragon, llama, airplane). The consistency across different views is remarkable for vector data.

Quantitative Analysis

To measure performance mathematically, the authors used CLIP-Text (how well the image matches the prompt) and ALPIPS (how consistent the structure is across adjacent views).
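Assuming ALPIPS is computed as the mean LPIPS between renders from neighbouring camera poses (lower meaning more view-consistent), a simple sketch of the metric would be:

```python
import torch
import lpips

lpips_fn = lpips.LPIPS(net="vgg")   # perceptual distance network

def adjacent_view_lpips(renders):
    """renders: tensor (V, 3, H, W) in [-1, 1], one render per camera along an orbit."""
    with torch.no_grad():
        dists = [lpips_fn(renders[i:i+1], renders[i+1:i+2])
                 for i in range(renders.shape[0] - 1)]
    return torch.stack(dists).mean().item()
```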

Table 1. Quantitative results for sketch and Iconography.

As shown in Table 1, Dream3DVG achieves the best results on both metrics for sketches and iconography: the highest CLIP-Text score and the lowest ALPIPS score (lower is better), indicating superior 3D consistency compared to competitors.

Single View Quality

The team also compared their method against purely 2D vector generation tools like DiffSketcher and VectorFusion.

Figure 6. Qualitative results of sketch and iconography vs 2D methods.

Even though Dream3DVG is a 3D method, its 2D projections (Figure 6) compete favorably with native 2D generators. For example, look at the “Yellow Schoolbus.” The 2D methods often produce abstract blobs or blocky shapes. Dream3DVG, because it understands the underlying 3D geometry of a bus, produces a more structured and realistic vector illustration.

Ablation Studies: Do We Need All Parts?

The authors performed “ablation studies”—removing parts of the system to see if they are necessary.

Figure 7. Visual ablations by gradually adding components.

Figure 7 visually demonstrates the necessity of each component:

  • (a) SDS Only: Using standard Text-to-3D loss results in a mess.
  • (b) 3DGS Only: Better, but still noisy.
  • (c) + Sampling: Improves consistency.
  • (d) + C2F (Coarse-to-Fine): The shape becomes distinct (look at the cat).
  • (e) + Importance: Unnecessary lines are faded.
  • (f) + Vis. Voting: The final result. The lines behind the object are removed, leaving a clean sketch.

Figure 15. Guidance visualization in optimization.

Figure 15 further visualizes the “Guidance” process. You can see how the importance map (bottom rows) evolves, learning to focus on the silhouette and key features of the telephone and car as training progresses.

Conclusion and Implications

Dream3DVG represents a significant step forward in generative design. By successfully bridging the gap between dense 3D models (Gaussian Splatting) and sparse vector graphics, it allows for the creation of assets that are:

  1. Editable: Because they are vectors, designers can easily tweak curves.
  2. Scalable: They can be rendered at any resolution.
  3. 3D-Consistent: They can be used in 3D environments or animations.

The introduction of Visibility-Awareness Rendering solves the long-standing occlusion problem in 3D sketching, ensuring that the generated assets look like professional drawings rather than wireframe scans.

While the method currently focuses on sketches and simple iconography, the principles laid out in this paper—specifically the use of a dense 3D proxy to guide sparse generation—could pave the way for fully automated, production-ready 3D vector art in the future.

For students and researchers in computer graphics, Dream3DVG is a perfect example of how combining different representations (3DGS and Bézier curves) can solve the weaknesses inherent in using either one alone.