Introduction

We are currently witnessing a golden age of generative video. Models like Sora, Runway, and Stable Video Diffusion can hallucinate breathtaking scenes from a simple text prompt or a single image. However, if you look closely, cracks begin to appear—specifically when the camera starts moving.

Imagine generating a video of a room. As the camera pans left, new objects appear. If you pan back to the right, do those original objects reappear exactly as they were? Often, they don’t. The vase on the table might change color, or a window might vanish entirely. Furthermore, trying to tell a video model to “move the camera 2 meters forward and pan 30 degrees right” is notoriously difficult. Most models treat camera parameters as abstract numbers, struggling to translate them into geometrically accurate pixel shifts.

The fundamental issue is that video generation models operate in 2D pixel space, but the world they are simulating exists in 3D geometry.

In this post, we are doing a deep dive into GEN3C, a research paper that proposes a hybrid solution. Instead of relying solely on a neural network to hallucinate geometry, GEN3C introduces a 3D-informed approach. It builds a crude but effective “3D cache” of the scene and uses it to guide the generative model. The result? Videos that are not only photorealistic but also spatially consistent and controllable.

Figure 1. GEN3C allows for diverse applications including dynamic video generation, cinematic effects like dolly zooms, and driving simulations.

Background: The Consistency Problem

To understand why GEN3C is necessary, we need to look at how current Novel View Synthesis (NVS) and video generation usually work.

Novel View Synthesis (NVS)

Traditional NVS methods, like NeRF (Neural Radiance Fields) or 3D Gaussian Splatting, are excellent at maintaining 3D consistency. They reconstruct a scene so you can look at it from any angle. However, they typically require dense inputs—dozens or hundreds of images of the same object to build a representation. If you only have one image, these methods fail to hallucinate the unseen sides of an object realistically.

Video Diffusion Models

On the other hand, Video Diffusion Models are trained on massive amounts of internet data. They are incredible at hallucinating missing information (like what’s behind a person). However, they lack an explicit “memory” of the scene’s geometry.

When researchers try to add camera control to these diffusion models, they usually feed camera parameters (like position and rotation) directly into the network. The network must then implicitly learn how perspective works. As shown below, this often leads to “drift.” When a camera moves forward and then backward to the starting point, standard “CameraCtrl” methods produce blurry or distorted results because they’ve forgotten the initial state. GEN3C (“Ours”) maintains crisp detail.

Figure 2. Motivation: Comparison between standard Camera Control methods and GEN3C. Note how GEN3C maintains stability and sharpness when the camera returns to previous positions.

The Core Method: 3D-Informed Generation

The brilliance of GEN3C lies in its refusal to choose between explicit 3D geometry and generative AI. It uses both.

The system functions on a simple premise: Don’t ask the AI to guess the geometry from scratch. Give it a scaffold.

The workflow consists of three main stages:

  1. Building a Spatiotemporal 3D Cache: Creating a 3D point cloud from input images.
  2. Rendering the Cache: Projecting that point cloud into the new camera view to create a “guide” video.
  3. 3D-Informed Generation: Using a diffusion model to refine that guide into a photorealistic video.

Figure 3. Overview of GEN3C architecture. From input to 3D cache, to rendering, to the final diffusion model.
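Read as pseudocode, the pipeline is roughly the composition of these three stages. The helper callables below are hypothetical placeholders, not the paper's API; the sections that follow flesh out what each one does.

```python
def gen3c_pipeline(inputs, camera_trajectory, build_cache, render_cache, diffuse):
    """High-level sketch of the GEN3C workflow; the three callables are
    hypothetical stand-ins for the stages described below."""
    cache = build_cache(inputs)                           # 1. lift pixels into a 3D point cloud
    guide, mask = render_cache(cache, camera_trajectory)  # 2. render the cache along the new path
    return diffuse(guide, mask)                           # 3. refine the guide into a clean video
```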

1. Building the Spatiotemporal 3D Cache

Everything starts with the input, which could be a single image, a few sparse images, or a video. The goal is to lift these 2D pixels into 3D space.

The researchers use off-the-shelf monocular depth estimators (specifically Depth Anything V2) to predict a depth map for the input image. With the color (RGB) and the depth (D), they can “unproject” the pixels into a colored point cloud.

  • For a single image: They create one point cloud and duplicate it across time.
  • For multiple views: They create point clouds for each view.
  • For dynamic video: They create point clouds for frames over time, effectively making a 4D representation (3D space + time).

This collection of points is called the 3D Cache. It represents what the model knows about the scene’s geometry.
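To make the lifting step concrete, here is a minimal sketch of the unprojection, assuming a standard pinhole camera model and a metric depth map; the function and variable names are ours, not the paper's.

```python
import numpy as np

def unproject_to_point_cloud(rgb, depth, K, cam_to_world):
    """Lift an RGB-D frame into a colored 3D point cloud (world coordinates).

    rgb:           (H, W, 3) color image
    depth:         (H, W) depth map, e.g. from a monocular estimator
    K:             (3, 3) pinhole intrinsics
    cam_to_world:  (4, 4) camera-to-world extrinsics
    """
    H, W = depth.shape
    # Pixel grid in homogeneous image coordinates.
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)

    # Back-project: X_cam = depth * K^-1 [u, v, 1]^T
    rays = pix @ np.linalg.inv(K).T
    pts_cam = rays * depth.reshape(-1, 1)

    # Move the points into world coordinates.
    pts_h = np.concatenate([pts_cam, np.ones((pts_cam.shape[0], 1))], axis=1)
    pts_world = (pts_h @ cam_to_world.T)[:, :3]

    colors = rgb.reshape(-1, 3)
    return pts_world, colors
```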

2. Rendering the Cache

Now, let’s say the user wants the camera to pan to the right. The system takes the user-provided camera trajectory and renders the 3D cache from these new viewpoints.

Because the 3D cache is just a point cloud derived from a single perspective, this rendering will look imperfect. It will have:

  • Holes (Disocclusions): Areas that were hidden in the original view (e.g., the wall behind a chair).
  • Artifacts: Stretching or distortions from imperfect depth estimation.

Critically, the system also generates a Mask (\(M\)). This mask tells the subsequent network, “These pixels came from the 3D cache (valid), and these black pixels are holes that you need to hallucinate (invalid).”
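Conceptually, this rendering step is a point splat with a z-buffer plus a record of which pixels received any point at all. A rough sketch under that assumption (the paper's actual renderer may differ):

```python
import numpy as np

def render_cache(points, colors, K, world_to_cam, H, W):
    """Splat a colored point cloud into a target view, returning an image and a validity mask."""
    # Transform points into the target camera frame.
    pts_h = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    pts_cam = (pts_h @ world_to_cam.T)[:, :3]

    # Keep only points in front of the camera.
    in_front = pts_cam[:, 2] > 1e-6
    pts_cam, cols = pts_cam[in_front], colors[in_front]

    # Project with the pinhole model.
    proj = pts_cam @ K.T
    uv = (proj[:, :2] / proj[:, 2:3]).round().astype(int)
    z = pts_cam[:, 2]

    image = np.zeros((H, W, 3), dtype=colors.dtype)
    mask = np.zeros((H, W), dtype=bool)          # True where the cache has coverage
    zbuf = np.full((H, W), np.inf)

    inside = (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)
    for (u, v), zi, ci in zip(uv[inside], z[inside], cols[inside]):
        if zi < zbuf[v, u]:                      # z-buffer: the nearest point wins
            zbuf[v, u] = zi
            image[v, u] = ci
            mask[v, u] = True

    return image, mask                           # mask == False marks disocclusion holes
```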

3. Fusing and Injecting into Video Diffusion

This is the most technical and innovative part of the paper. We now have a “guide” video (the rendered point cloud) and a mask. How do we tell the video diffusion model to respect the guide where it exists, but be creative where there are holes?

The authors modify a pre-trained image-to-video model (Stable Video Diffusion). They introduce a specialized adapter to inject the 3D information.

The Fusion Strategy

When dealing with multiple input images (e.g., a photo of a room from the left and one from the right), you might have conflicting 3D points or slight misalignments. The authors explored several ways to combine this data:

  1. Explicit Fusion: Merging the point clouds in 3D space before rendering. This often leads to artifacts if the depth estimation isn’t perfect (double edges, ghosting).
  2. Latent Concatenation: Stacking the features together. This limits the number of views the model can handle.
  3. Max-Pooling (The Chosen Solution): They process each view’s rendering independently through the encoder and then take the element-wise maximum across the views’ features.

Figure 4. Comparison of fusion approaches. (c) The proposed Max-Pooling strategy is permutation invariant and robust to misalignment.

The Max-Pooling strategy (shown in Figure 4c) is elegant because it is permutation invariant—it doesn’t matter what order you feed the views in, and it handles overlapping geometry gracefully by letting the strongest feature win.
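In code, this fusion reduces to an element-wise maximum over the view axis of the encoded features. A minimal PyTorch sketch, with the tensor layout assumed for illustration:

```python
import torch

def fuse_views_maxpool(view_latents: torch.Tensor) -> torch.Tensor:
    """Fuse per-view latent features by element-wise max-pooling across views.

    view_latents: (V, T, C, H, W) — one encoded, masked rendering per view v.
    Returns:      (T, C, H, W)    — a single fused latent, independent of view order.
    """
    fused, _ = view_latents.max(dim=0)   # permutation invariant across the V views
    return fused
```

Because the maximum is commutative and associative, the fused latent is identical for any ordering of the views, and the number of views can vary freely at inference time.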

The Injection Mechanism

The rendered video (\(I^v\)) is encoded into a latent representation (\(z^v\)). This latent is multiplied by the mask (\(M^v\)) to zero out empty areas. It is then concatenated with the noisy latent of the diffusion process.

The equation governing the injection of a specific view \(v\) looks like this:

\[
\tilde{z}^{\,v} \;=\; \big[\, M^v \odot z^v \;;\; z_\tau \,\big]
\]

where \(\odot\) denotes element-wise multiplication by the mask and \([\,\cdot\,;\,\cdot\,]\) denotes concatenation along the channel dimension.

Here, the model takes the rendered guide \(z^v\), masks it, and combines it with the current generation step \(z_\tau\). This forces the diffusion model to “paint over” the 3D guide. It essentially acts as a highly sophisticated texture completion tool. It trusts the geometry provided by the 3D cache but uses its learned priors to fix artifacts and fill in the holes marked by the mask.
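Putting the pieces together, here is a hedged sketch of the conditioning step as described above: mask each view’s latent, fuse across views, then concatenate with the noisy latent along the channel dimension. Shapes and names are our assumptions, not the paper’s exact implementation.

```python
import torch

def build_denoiser_input(z_views, m_views, z_noisy):
    """Condition the diffusion model on the rendered 3D cache.

    z_views: (V, T, C, H, W) encoded renderings of the cache, one per view
    m_views: (V, T, 1, H, W) validity masks (1 = covered by cache, 0 = hole)
    z_noisy: (T, C, H, W)    current noisy latent z_tau of the diffusion process
    """
    masked = z_views * m_views                 # zero out disoccluded (hole) regions
    fused, _ = masked.max(dim=0)               # max-pool across views (see above)
    return torch.cat([fused, z_noisy], dim=1)  # channel-wise concat -> denoiser input
```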

4. Autoregressive Updates for Long Videos

One of the biggest limitations of video generation is length. Models usually generate only 2-4 seconds of video at a time. If you try to extend them by chaining generations together, they drift.

GEN3C solves this by updating the 3D cache on the fly.

  1. Generate a short chunk of video.
  2. Take the last frame of that chunk.
  3. Estimate its depth.
  4. Align this new depth map to the existing 3D cache (solving for scale and shift).
  5. Add the new points to the cache.

By continuously adding to the 3D cache, the model “remembers” what it generated 10 seconds ago. If the camera circles back, the geometry is still there.
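One simple way to handle step 4 (aligning the newly estimated depth map to the cache) is a least-squares fit for scale and shift over the pixels where the cache already has geometry. A minimal sketch, assuming we compare the new depth against cache depths rendered at the same pixels; this is our illustration, not necessarily the paper’s exact solver.

```python
import numpy as np

def align_depth_to_cache(new_depth, cache_depth, valid_mask):
    """Solve for scale s and shift t minimizing || s * new_depth + t - cache_depth ||^2
    over pixels where the cache already has geometry."""
    d_new = new_depth[valid_mask].ravel()
    d_ref = cache_depth[valid_mask].ravel()

    # Least squares on [d_new, 1] @ [s, t]^T = d_ref
    A = np.stack([d_new, np.ones_like(d_new)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, d_ref, rcond=None)

    # Aligned depth, ready to be unprojected and added to the cache.
    return s * new_depth + t
```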

Experiments and Results

The researchers tested GEN3C across a variety of difficult tasks, comparing it to state-of-the-art methods like MotionCtrl, CameraCtrl, and sparse-view reconstruction techniques.

Single-View to Video

The most common use case is animating a single static photo. The results show that GEN3C adheres much more strictly to the requested camera path than competitors.

In the figure below, look at the green boxes. Notice how GEN3C preserves the fine details of the industrial pipes and the text on the train (“713”). Other methods blur these details or lose structural coherence as the camera moves.

Figure 5. Qualitative results for single-view novel view synthesis. The green boxes show GEN3C’s superior detail preservation compared to baselines.

Two-View Novel View Synthesis

A particularly challenging scenario is “Sparse NVS”—generating a video moving between two photos that are far apart.

Pure reconstruction methods (like 3D Gaussian Splatting) struggle here because they don’t have enough data to fill the gaps. Pure generative methods struggle to keep the two end-points consistent. GEN3C excels by using the 3D cache to enforce the geometry of the start and end images, while the diffusion model smoothly interpolates the texture.

Figure 6. Qualitative results on two-view NVS. GEN3C handles large gaps between views (top row) better than PixelSplat or MVSplat.

Quantitative metrics (PSNR/SSIM) confirm that GEN3C significantly outperforms competitors in both interpolation (filling the gap) and extrapolation (going beyond the input views).

Table 2. Quantitative results showing GEN3C outperforming PixelSplat and MVSplat in PSNR and SSIM metrics.

Robustness to “Bad” Geometry

A major concern with this approach is: “What if the depth estimation is wrong?”

The authors performed an ablation study where they intentionally misaligned the depth or added noise. Because the diffusion model is trained to translate “imperfect 3D renders” into “perfect video,” it is surprisingly robust. It learns to ignore minor geometric errors and correct lighting inconsistencies.

In the comparison below, “Explicit Fusion” (merging point clouds directly) results in ghosting and seams when inputs are imperfect. GEN3C’s latent fusion strategy smooths over these errors seamlessly.

Figure 10. Ablation study showing GEN3C’s ability to handle misaligned depth and different lighting conditions, unlike explicit fusion strategies.

Driving Simulation and Editing

One of the most promising applications is in autonomous driving simulation. Simulators need to generate realistic videos from novel angles to train cars.

GEN3C can take a driving video, build a cache, and then render the scene from a different lane or a higher angle. Because the scene is represented explicitly in 3D, it also allows for object manipulation: the authors demonstrate removing a car from the road or editing its trajectory, and the model convincingly fills in the road behind the removed car.

Figure 8 & 9. Top: 3D Editing results showing car removal. Bottom: Monocular dynamic NVS showing the model handling dynamic scenes.

Notice in the bottom half of the image above (Figure 9) that the model even handles dynamic scenes (a swimming hippo). It successfully separates the static background from the moving subject.

Improving the Base Model

Finally, because GEN3C is a method rather than a specific architecture, it improves as base models improve. The authors swapped Stable Video Diffusion (SVD) for a more advanced model, Cosmos.

The visual improvement is immediate. The Cosmos-based GEN3C produces sharper textures and better lighting, suggesting that this 3D-informed technique will remain relevant as base generative models continue to scale up.

Figure 11. Comparison showing that plugging in a better base model (Cosmos vs SVD) instantly improves GEN3C’s output quality.

Conclusion & Implications

GEN3C represents a significant step forward in bridging the gap between Computer Graphics (which is consistent but hard to create) and Generative AI (which is creative but hard to control).

By explicitly modeling a 3D Cache, the authors have given the neural network a “working memory.” This allows the model to:

  1. Look back: Remembering geometry it has already seen.
  2. Look forward: Hallucinating new geometry consistent with the old.
  3. Stay on track: Strictly following camera paths provided by the user.

While the method depends on the quality of off-the-shelf depth estimators and pays the computational cost of maintaining and rendering point clouds, it solves the “wobbly world” problem that plagues current video generation. As we move toward creating interactive virtual worlds and realistic simulators, hybrid approaches like GEN3C, which combine explicit geometry with implicit neural rendering, will likely become the standard.