Imagine typing “a bear dressed in medieval armor” into a computer, and seconds later, receiving a fully rotatable, high-quality 3D asset ready for a video game. This is the dream of Text-to-3D generation.

While we have mastered 2D image generation (thanks to tools like Midjourney and Stable Diffusion), lifting this capability into three dimensions remains surprisingly difficult. A common failure mode is the “Janus problem”—named after the two-faced Roman god—where a generated model might have a face on both the front and the back of its head, because the model doesn’t understand that the back view shouldn’t look like the front view.

Today, we are diving deep into a CVPR paper that proposes a robust solution to this consistency problem. The paper is titled “CoSER: Towards Consistent Dense Multiview Text-to-Image Generator for 3D Creation.”

The authors introduce a novel architecture that combines the precision of Attention mechanisms with the efficiency of State Space Models (Mamba) to generate dense, consistent views of an object. If you are a student of computer vision or generative AI, this paper offers a masterclass in how to balance computational efficiency with high-fidelity output.

Figure 1. CoSER aims to generate detailed and diverse 3D objects from text prompts.

The Problem: Why is 3D so Hard?

To generate a 3D object from text, modern approaches often try to generate multiple 2D images of the object from different angles (multiview generation) and then stitch them together using reconstruction algorithms (like NeRFs or NeuS).

The challenge lies in Consistency.

If you generate a view of a car from the front, and a view from the side, the color, style, and geometry must match perfectly. If the front view shows a red car and the side view shows a dark crimson car, the 3D reconstruction will fail or look blurry.

Previous methods faced a dilemma:

  1. Dense Attention: You can use “Cross-View Attention” to force every pixel in every view to talk to every other pixel. This ensures consistency but is computationally explosive (quadratic complexity, \(O(N^2)\), where \(N\) is the total number of tokens across all views), which limits how many views you can generate (see the quick cost sketch after this list).
  2. Sparse Attention: You can limit interactions to make generation faster, but then you lose global context, leading to the dreaded Janus problem or drifting textures.
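
To get a feel for why dense cross-view attention explodes, here is a minimal back-of-the-envelope sketch in Python. The latent resolution and view counts are illustrative assumptions, not the paper’s exact settings:

```python
# Rough cost of dense cross-view attention vs. a linear-time scan.
# Illustrative assumption: a 32x32 latent grid per view.
latent_hw = 32 * 32          # tokens per view in the diffusion latent space

for num_views in (4, 8, 12, 16):
    tokens = num_views * latent_hw
    dense_pairs = tokens ** 2        # O(N^2): every token attends to every other token
    linear_steps = tokens            # O(N): a sequential scan touches each token once
    print(f"{num_views:2d} views: {tokens:6d} tokens, "
          f"{dense_pairs:>13,} attention pairs vs {linear_steps:,} scan steps")
```

Doubling the number of views quadruples the attention cost, which is exactly why earlier methods cap the view count.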

CoSER (Consistent Dense Multiview Text-to-Image Generator) proposes a hybrid approach that lets you have your cake and eat it too.

The CoSER Framework

The core philosophy of CoSER is simple yet profound: Treat local neighbors differently than global context.

  1. Local Consistency: Adjacent views (e.g., \(0^\circ\) and \(10^\circ\)) share a lot of visual information. They need high-precision, dense interaction.
  2. Global Consistency: Distant views (e.g., \(0^\circ\) and \(180^\circ\)) don’t look alike, but they must represent the same object. They need a mechanism that understands the whole picture without getting bogged down in pixel-perfect matching.

To implement this, CoSER modifies a standard Latent Diffusion Model (LDM) by adding specific modules for these tasks.

Figure 2. Illustration of our CoSER framework.

As shown in the architecture diagram above, the model takes a text prompt and typically generates 12 or more views. It processes these views through two distinct pathways (a simplified wiring sketch follows the list):

  1. Green Path (Neighbors): Appearance Awareness (AA) and Detail Refinement (DR).
  2. Yellow Path (Whole): Rapid Glance (RG) and Accumulated Inconsistency Rectification (AIR).
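
To keep the four module names straight, here is a highly simplified, hypothetical sketch of how they might be wired into one denoising block. The names (AA, DR, RG, AIR) come from the paper; the ordering, interfaces, and residual connections below are assumptions for illustration only:

```python
import torch.nn as nn

class CoSERBlockSketch(nn.Module):
    """Hypothetical wiring of CoSER's four modules inside one UNet block.

    Placeholder modules (nn.Identity) stand in for the real operators;
    only the overall local-then-global structure is meant to be conveyed.
    """
    def __init__(self):
        super().__init__()
        self.aa = nn.Identity()   # Appearance Awareness: adjacent-view attention
        self.dr = nn.Identity()   # Detail Refinement: trajectory attention on a 3x3 window
        self.rg = nn.Identity()   # Rapid Glance: spiral-scan Mamba over all views
        self.air = nn.Identity()  # Accumulated Inconsistency Rectification: sparse global attention

    def forward(self, z):
        # z: latent features for all views, shape (num_views, channels, H, W)
        z = z + self.aa(z)   # local: borrow appearance from neighboring views
        z = z + self.dr(z)   # local: align fine details along rotation trajectories
        z = z + self.rg(z)   # global: lightweight whole-object context
        z = z + self.air(z)  # global: rectify remaining cross-view disagreements
        return z
```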

Let’s break these down step-by-step.

Part 1: Mastering Local Neighbors

The first step is ensuring that if we rotate the camera slightly, the image changes predictably. CoSER achieves this through two sub-modules.

Appearance Awareness (AA)

This module acts as a “sanity check” for the basic look of the object. It uses a modified self-attention mechanism called Adjacent Attention. Instead of just looking at itself, the current view \(z^i\) looks at the previous view \(z^{i-1}\) and the next view \(z^{i+1}\).

The mathematical formulation for the Key (\(K\)) and Value (\(V\)) in this attention block effectively concatenates the features of the three views:

Equation for Adjacent Attention

This allows the model to “borrow” texture and shape information from immediate neighbors, ensuring a smooth transition between frames.
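
To make that concrete, here is one plausible form of the Adjacent Attention keys and values, written from the description above. The projection matrices \(W_Q, W_K, W_V\) and the concatenation order are assumptions, not taken from the paper:

\[
Q^{i} = W_Q\, z^{i}, \qquad
K^{i} = W_K \left[ z^{i-1};\, z^{i};\, z^{i+1} \right], \qquad
V^{i} = W_V \left[ z^{i-1};\, z^{i};\, z^{i+1} \right],
\]
\[
\mathrm{AdjAttn}(z^{i}) = \operatorname{softmax}\!\left( \frac{Q^{i} (K^{i})^{\top}}{\sqrt{d}} \right) V^{i}.
\]

Each view’s queries still come only from itself; only the keys and values are widened to include the two neighbors, which keeps the cost close to ordinary self-attention.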

Detail Refinement (DR)

While Appearance Awareness handles the general vibe, it isn’t precise enough for pixel-perfect alignment. This is where Detail Refinement comes in.

The authors leverage the physics of 3D rotation. If you rotate an object, a pixel at location \((x, y)\) moves to a new location \((x', y')\) based on the rotation angle.

The authors use a simplified rotation formula (assuming small angles and unknown depth):

Simplified Rotation Equation

Here, \(W\) is the image width and \(\Delta\alpha\) is the rotation angle. Because we don’t know the exact depth \(d\) of every pixel during generation, the authors look at a \(3 \times 3\) window around the calculated target coordinate in the neighboring frame. This “Trajectory Attention” allows the model to align specific details—like the button on a shirt or the eye of a character—across views based on geometric logic rather than just semantic similarity.
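
The paper’s exact rotation formula is not reproduced here, but the mechanical part, looking up a \(3 \times 3\) window around the predicted coordinate in the neighboring view, can be sketched as follows. The `approx_target_x` prior below is a hypothetical stand-in for the paper’s simplified rotation equation:

```python
import math
import torch

def gather_window(neighbor_feat, x_t, y_t, k=3):
    """Gather a k x k window of features around a target coordinate
    in the neighboring view (the core of the 3x3 lookup described above).

    neighbor_feat: (C, H, W) feature map of the adjacent view.
    x_t, y_t: integer target coordinates predicted by the rotation prior.
    """
    C, H, W = neighbor_feat.shape
    r = k // 2
    xs = torch.clamp(torch.arange(x_t - r, x_t + r + 1), 0, W - 1)
    ys = torch.clamp(torch.arange(y_t - r, y_t + r + 1), 0, H - 1)
    window = neighbor_feat[:, ys][:, :, xs]          # (C, k, k)
    return window.reshape(C, -1).t()                 # (k*k, C): candidate keys/values

def approx_target_x(x, d_alpha, W):
    """Hypothetical rotation prior: with unknown depth, approximate the
    horizontal shift of a pixel under a small yaw rotation d_alpha (radians).
    This exact formula is an illustrative assumption, not the paper's equation."""
    return int(round(x + (W / (2 * math.pi)) * d_alpha))
```

The key idea is that the attention candidates are chosen by geometry (where the pixel should have moved) rather than by semantic similarity alone.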

Figure 3. Visualization of interaction schemes. Middle panel shows Detail Refinement.

Part 2: Understanding the Whole

If we only looked at neighbors, we would suffer from “drift.” View 1 looks like View 2, and View 2 looks like View 3, but by the time you get to View 12, the object might have morphed into something else entirely. We need a global supervisor.

Standard attention across all views is too slow. The authors propose a brilliant alternative using Mamba, a selective State Space Model (SSM) that offers linear complexity.

Rapid Glance (RG) with Spiral Mamba

State Space Models process data as a sequence. The challenge is: How do you turn 12 images into a single sequence that makes sense?

Standard methods might scan row-by-row. However, in 3D datasets, the object is usually in the center of the image. Scanning row-by-row breaks the object into disconnected chunks separated by background.

CoSER introduces the Spiral Bidirectional Scan.

Figure 16. Ablation of scanning strategy showing the spiral path.

As seen in the figure above (top right), the scan starts from the center of the image (where the object is) and spirals outwards. This keeps the semantically important “object” tokens close together in the sequence. The Mamba block then processes this sequence across all views rapidly.
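
A minimal sketch of such a center-outward spiral ordering is shown below. The exact traversal and its bidirectional variant in the paper may differ; this is only meant to illustrate how center (object) tokens end up first in the sequence:

```python
import numpy as np

def spiral_order(h, w):
    """Return (row, col) indices of an h x w grid visited in a spiral that
    starts at the center and moves outward."""
    r, c = h // 2, w // 2
    order, seen = [], set()
    directions = [(0, 1), (1, 0), (0, -1), (-1, 0)]   # right, down, left, up
    step, d = 1, 0
    while len(order) < h * w:
        for _ in range(2):                 # two legs share the same step length
            dr, dc = directions[d % 4]
            for _ in range(step):
                if 0 <= r < h and 0 <= c < w and (r, c) not in seen:
                    order.append((r, c)); seen.add((r, c))
                r, c = r + dr, c + dc
            d += 1
        step += 1
    return order

# Flatten a (C, H, W) feature map into a center-first token sequence.
feat = np.random.randn(4, 8, 8)
idx = spiral_order(8, 8)
tokens = np.stack([feat[:, r, c] for r, c in idx])   # (H*W, C), object tokens first
```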

Equation for SSM Rapid Glance

This “Rapid Glance” gives the model a quick, lightweight understanding of the global structure: “Okay, this is a red car, it has four wheels, and it’s facing left.”
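
For reference, the discretized state-space recurrence that Mamba builds on processes each token of the spiral sequence in constant time (CoSER’s exact, input-dependent parameterization may differ):

\[
h_t = \bar{A}\, h_{t-1} + \bar{B}\, x_t, \qquad y_t = C\, h_t,
\]

where \(x_t\) is the \(t\)-th token in the spiral sequence, \(h_t\) is the hidden state, and \(\bar{A}, \bar{B}\) are the discretized (and, in Mamba, input-dependent) state matrices. Because each update only touches the previous state, the cost grows linearly with sequence length rather than quadratically.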

Accumulated Inconsistency Rectification (AIR)

Finally, to fix any remaining disagreements between views, the model uses a heavy-duty attention mechanism but applies it sparsely.

The model generates a Score Map based on the text prompt. It asks, “Which parts of this image actually correspond to the text?” It assigns high scores to the object and low scores to the background.

Using this score map, the model down-samples the image, keeping only the important features (the object) and discarding the empty background. It then performs a global attention operation on these reduced features.

Equation for Weighted Pooling

This allows the model to perform global reasoning without the usual computational cost, because it only processes the pixels that matter.
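
A minimal sketch of this text-guided sparsification is shown below. The exact weighted-pooling operator in the paper may differ; here, tokens are simply ranked by a per-pixel relevance score and only the top fraction is kept for global attention:

```python
import torch

def score_weighted_pool(feat, score, keep_ratio=0.25):
    """Down-sample features by keeping only the highest-scoring tokens.

    feat:  (N, C)  flattened per-pixel features of one view
    score: (N,)    text-relevance score per pixel (e.g. from cross-attention maps)
    """
    k = max(1, int(keep_ratio * feat.shape[0]))
    w = torch.softmax(score, dim=0)                  # normalize scores to weights
    topk = torch.topk(w, k).indices                  # most text-relevant pixels
    pooled = feat[topk] * w[topk].unsqueeze(-1)      # weight kept features by their scores
    return pooled                                    # (k, C): global attention runs on k << N tokens

# Example: 4096 latent pixels reduced to 1024 object-centric tokens
feat = torch.randn(64 * 64, 320)
score = torch.randn(64 * 64)
print(score_weighted_pool(feat, score).shape)        # torch.Size([1024, 320])
```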


Experiments and Results

Does this complex architecture actually work? The results suggest a resounding yes.

Qualitative Comparison

The authors compared CoSER against state-of-the-art methods like VideoMV, GaussianDreamer, and Hash3D.

In the comparison below, look at the bottom row (CoSER).

  • The Apples: Notice how the wireframe and the texture remain perfectly consistent as the apple rotates.
  • The Bear: The armor on the bear’s back is consistent with the front design, avoiding the Janus problem.

Figures 4 & 5. Qualitative comparison against VideoMV (top) and GaussianDreamer/Hash3D (bottom).

Compared to VideoMV (Top of Figure 4), CoSER produces sharper textures and better geometry. VideoMV often produces “blurry” or inconsistent shapes when the view angle becomes extreme.

Compared to GaussianDreamer and Hash3D (Figure 5), CoSER shows significantly higher realism. Look at the “Porcelain Dragon” (Row 2) or the “Race Car” (Row 4)—the reflections and material properties in CoSER’s output are distinct and high-fidelity.

Quantitative Metrics

Visually appealing images are great, but numbers tell the truth about consistency. The authors used CLIP Score (to measure how well the image matches the text) and user studies.

Table 1. Quantitative comparison.

CoSER achieves the highest Quality (33.07) and Alignment (37.7) scores. More importantly, in human user studies (“User Study” columns), participants preferred CoSER’s consistency and texture details by a wide margin over competitors.

Ablation Studies: Do we need all these modules?

You might wonder if the “Spiral Scan” or “Score Map” are actually necessary. The authors tested this by removing modules one by one.

Figure 6. Ablation of proposed modules.

  • AA Only (First column): The basic shape is there, but details are fuzzy.
  • AA + DR: Neighbors look better, but global consistency is weak.
  • AA + DR + RG: The model understands the object better, reducing ambiguity.
  • Full Model (Right): Sharpest details and best consistency.

A closer look at the Score Map (Figure 8) reveals its importance. Without the score map (left), the texture on the fox bust is flat. With the score map (right), the model focuses its computational power on the statue itself, resulting in intricate marble textures.

Conclusion

The CoSER paper represents a significant step forward in Generative 3D. By acknowledging that not all views need the same type of attention, the authors designed a system that is both efficient and highly effective.

Key Takeaways for Students:

  1. Hybrid Architectures: The future of Deep Learning isn’t just “Transformers for everything.” It’s about combining tools—Attention for precision, Mamba/SSMs for efficiency.
  2. Physics Priors: Embedding physical knowledge (like rotation formulas) into the network (Detail Refinement) often works better than letting the network learn everything from scratch.
  3. Data Structure Matters: The “Spiral Scan” proves that how you feed data into a model (sequence order) changes how well the model learns.

CoSER moves us closer to a world where anyone can be a 3D artist, turning simple text into rich, consistent, digital assets.


For those interested in the mathematical details, the full training objective of the Latent Diffusion Model used as the backbone is provided below:

Equation 1: LDM Loss Function
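
The figure for the equation is not reproduced here, but the standard text-conditioned LDM objective, which the backbone inherits, is typically written as:

\[
\mathcal{L}_{\mathrm{LDM}} = \mathbb{E}_{z,\, y,\, \epsilon \sim \mathcal{N}(0,1),\, t}\!\left[ \big\lVert \epsilon - \epsilon_\theta\!\left(z_t,\, t,\, \tau_\theta(y)\right) \big\rVert_2^2 \right],
\]

where \(z_t\) is the noised latent at timestep \(t\), \(\tau_\theta(y)\) is the text encoder applied to prompt \(y\), and \(\epsilon_\theta\) is the denoising UNet into which the CoSER modules are inserted.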