A collage showing GS-LRM’s ability to reconstruct objects and complex scenes from just a few input images, including generating 3D models from text prompts.

Fig. 1: Novel-view renderings predicted by GS-LRM from object captures (top left), text-conditioned generated object images (top right), scene captures (bottom left) and text-conditioned generated scene images (bottom right, from Sora with the prompt “Tour of an art gallery with many beautiful works of art in different styles”). GS-LRM handles both objects and complex scenes with remarkable fidelity.

Creating a digital 3D model of a real-world object or scene is a cornerstone of computer vision and graphics. For decades, this meant a laborious process called photogrammetry, requiring dozens or even hundreds of photos and slow, complex software. But what if you could create a high-quality 3D reconstruction from just a handful of images — in less than a second?

This is the promise of a new class of AI called Large Reconstruction Models (LRMs). These transformer-based models are trained on vast datasets of 3D content, learning a generalized “prior” for shapes and structures. This lets them intelligently reconstruct complete 3D geometry from as few as two to four images. Yet early LRMs hit a bottleneck: they relied on a “triplane NeRF” representation, which struggled with speed, detail preservation, and scaling up to complex scenes.

A recent paper from Adobe Research and Cornell University, GS-LRM: Large Reconstruction Model for 3D Gaussian Splatting, introduces a powerful new approach that overcomes these limits. By combining a simple, scalable Transformer architecture with the fast and high-quality 3D Gaussian Splatting representation, GS-LRM reaches state-of-the-art quality for both standalone objects and entire scenes — and does it instantly.

In this article, we’ll explore how GS-LRM works, why it matters, and how it is poised to reshape 3D content creation.


Background: The Journey to Instant 3D

Before diving into GS-LRM, let’s quickly recap two key technologies: 3D Gaussian Splatting and Large Reconstruction Models.

1. Beyond NeRF: The Rise of Gaussian Splatting

For several years, the go-to neural representation for 3D scenes was the Neural Radiance Field (NeRF). NeRF uses a neural network to learn a function mapping 3D coordinates and viewing angles to color and density. Images are rendered by evaluating this function along millions of rays.
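To make this concrete, here is a minimal sketch of the volume-rendering quadrature NeRF relies on, with the neural network abstracted behind a generic `field` callable; the function name, sampling scheme, and bounds are illustrative rather than taken from any particular NeRF codebase:

```python
import torch

def render_ray(field, origin, direction, near=0.1, far=4.0, n_samples=64):
    """Composite color along one camera ray, NeRF-style.

    `field(points, view_dirs)` is any callable returning (rgb, sigma)
    per sample; the neural network itself is omitted here.
    """
    t = torch.linspace(near, far, n_samples)                  # sample depths along the ray
    points = origin + t[:, None] * direction                  # (N, 3) sample positions
    rgb, sigma = field(points, direction.expand_as(points))   # query color and density
    delta = t[1] - t[0]                                        # uniform sample spacing
    alpha = 1.0 - torch.exp(-sigma * delta)                    # per-sample opacity
    trans = torch.cumprod(1.0 - alpha + 1e-10, dim=0)          # accumulated transmittance
    trans = torch.cat([torch.ones(1), trans[:-1]])             # light reaching each sample
    weights = alpha * trans
    return (weights[:, None] * rgb).sum(dim=0)                 # final pixel color
```

Repeating this evaluation for every pixel of every image is exactly where the cost comes from.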

The problem? NeRF training and rendering are slow, and scaling to high resolution is challenging.

In 2023, 3D Gaussian Splatting emerged as a faster alternative. Instead of a deep network, the scene is represented as millions of colorful, semi-transparent 3D ellipsoids (“Gaussians”), each with parameters like position, scale, rotation, color, and opacity. Rendering simply splats these Gaussians onto the image plane, enabling real-time high-quality rendering while preserving fine detail.
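For intuition, here is roughly what one Gaussian carries. This is a sketch with illustrative field names; full 3D Gaussian Splatting stores view-dependent color as spherical harmonics rather than a plain RGB triple:

```python
from dataclasses import dataclass
import torch

@dataclass
class Gaussian3D:
    """A single splat; a scene is simply a large collection of these."""
    position: torch.Tensor  # (3,) ellipsoid center in world space
    scale: torch.Tensor     # (3,) extent along the ellipsoid's principal axes
    rotation: torch.Tensor  # (4,) unit quaternion orienting those axes
    color: torch.Tensor     # (3,) RGB (real 3DGS uses spherical harmonics)
    opacity: torch.Tensor   # (1,) alpha used when splatting onto the image plane
```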

2. Large Reconstruction Models (LRMs)

Traditional pipelines train a NeRF or Gaussian Splatting model per scene, requiring many views and substantial computation.

LRMs flip this paradigm: a large, pretrained transformer learns reconstruction priors from massive datasets like Objaverse, so no per-scene training is needed. Feed it a few images with camera poses, and it predicts a complete 3D model in one forward pass.

Earlier LRMs were NeRF-based, so they inherited NeRF bottlenecks. GS-LRM’s breakthrough is pairing the LRM approach with Gaussian Splatting, combining speed and quality.


The Core Method: How GS-LRM Works

At its core, GS-LRM is a clean, elegant Transformer model that takes 2–4 posed images and directly outputs per-pixel 3D Gaussians representing the scene.

A diagram of the GS-LRM architecture. Input images are patchified, processed by a Transformer, and then unpatchified to predict per-pixel Gaussians, which are merged into a final 3D model.

Fig. 2: GS-LRM architecture. Posed images are patchified, processed by Transformer blocks, and unpatchified into per-pixel Gaussian parameters, later merged into the final 3D model.

Let’s break it down:

Step 1: Tokenizing Posed Images

Inputs: 2–4 RGB images with known camera intrinsics and extrinsics.

  1. Pose Conditioning:
    For each pixel, compute its Plücker ray coordinates — a 6D vector mathematically representing the light ray from the camera through that pixel.
  2. Concatenation:
    Append these 6 channels to the pixel’s RGB values, creating a 9-channel feature map encoding both appearance and geometry.
  3. Patchify:
    Divide the 9-channel map into non-overlapping patches (e.g., 8×8 pixels).
  4. Linear Projection:
    Flatten each patch and project it into a token embedding for the Transformer.

Formally:

\[ \{\mathbf{T}_{ij}\}_{j=1}^{HW/p^2} = \operatorname{Linear}\left(\operatorname{Patchify}_p\left(\operatorname{Concat}(\mathbf{I}_i, \mathbf{P}_i)\right)\right) \]

This encoding naturally includes positional and view information, eliminating the need for separate embeddings.
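Here is a minimal PyTorch sketch of this tokenization, assuming pinhole intrinsics `K` and a camera-to-world matrix `c2w`; the helper names, shapes, and embedding width are illustrative rather than taken from the authors' code:

```python
import torch
import torch.nn as nn

def plucker_rays(K, c2w, H, W):
    """Per-pixel Plücker coordinates (direction, moment) -> (H, W, 6)."""
    v, u = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32), indexing="ij")
    pix = torch.stack([u + 0.5, v + 0.5, torch.ones_like(u)], dim=-1)  # homogeneous pixel coords
    dirs = pix @ torch.linalg.inv(K).T          # ray directions in camera space
    dirs = dirs @ c2w[:3, :3].T                 # rotate into world space
    dirs = dirs / dirs.norm(dim=-1, keepdim=True)
    origin = c2w[:3, 3].expand_as(dirs)         # camera center, broadcast to every pixel
    moment = torch.cross(origin, dirs, dim=-1)  # Plücker moment o × d
    return torch.cat([dirs, moment], dim=-1)    # (H, W, 6)

class ImageTokenizer(nn.Module):
    def __init__(self, patch=8, dim=1024):
        super().__init__()
        # a strided conv patchifies and linearly projects in one step
        self.proj = nn.Conv2d(9, dim, kernel_size=patch, stride=patch)

    def forward(self, image, K, c2w):
        """image: (3, H, W) -> (H*W / patch^2, dim) tokens."""
        H, W = image.shape[1:]
        rays = plucker_rays(K, c2w, H, W).permute(2, 0, 1)  # (6, H, W)
        x = torch.cat([image, rays], dim=0).unsqueeze(0)    # 9-channel input
        return self.proj(x).flatten(2).transpose(1, 2)[0]   # (num_patches, dim)
```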

Step 2: Transformer Backbone

Concatenate all tokens from all views and feed them through L Transformer blocks:

\[ \{\mathbf{T}_{ij}\}^l = \operatorname{TransformerBlock}^l\left(\{\mathbf{T}_{ij}\}^{l-1}\right),\quad l = 1,\dots,L \]

Multi-head self-attention lets any patch “see” every other patch — across all input views — enabling powerful multi-view correspondence matching.
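Conceptually, the backbone is a stack of standard pre-norm Transformer blocks run over the tokens of all views concatenated into one sequence. A sketch follows; the width, head count, and layer details are placeholder choices, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """One pre-norm Transformer block: self-attention + MLP, both residual."""
    def __init__(self, dim=1024, heads=16):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # every token attends to all views
        return x + self.mlp(self.norm2(x))

def run_backbone(view_tokens, blocks):
    """view_tokens: list of (N_i, dim) tensors, one per input image."""
    x = torch.cat(view_tokens, dim=0).unsqueeze(0)  # one long multi-view sequence
    for blk in blocks:
        x = blk(x)
    return x[0]
```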

Step 3: Pixel-Aligned Gaussian Decoding

After the last block:

\[ \{\mathbf{G}_{ij}\} = \operatorname{Linear}(\{\mathbf{T}_{ij}\}^L) \]

An unpatchify operation reverses patching, producing Gaussian parameters for every single input pixel. Each pixel predicts:

  • RGB color (3)
  • Scale along x, y, z (3)
  • Rotation as quaternion (4)
  • Opacity (1)
  • Ray distance (1)

The 3D Gaussian center is positioned along the pixel’s camera ray using the predicted distance. All Gaussians across all views are merged into the scene. High-resolution inputs yield more Gaussians and finer reconstructions — something fixed-resolution triplane methods cannot natively handle.
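A sketch of this decoding step, assuming the per-pixel ray origins and directions from Step 1 are at hand; the channel layout is illustrative, and the usual output activations (e.g., softplus on scale, sigmoid on opacity, quaternion normalization) are omitted for brevity:

```python
import torch
import torch.nn as nn

class GaussianDecoder(nn.Module):
    """Map output tokens back to 12 Gaussian parameters per pixel."""
    def __init__(self, dim=1024, patch=8):
        super().__init__()
        self.patch = patch
        # 12 = 3 rgb + 3 scale + 4 quaternion + 1 opacity + 1 ray distance
        self.head = nn.Linear(dim, 12 * patch * patch)

    def forward(self, tokens, ray_o, ray_d, H, W):
        """tokens: (H*W/p^2, dim); ray_o, ray_d: (H, W, 3) for the same view."""
        p = self.patch
        x = self.head(tokens)                             # (num_patches, 12*p*p)
        x = x.view(H // p, W // p, p, p, 12)              # unpatchify ...
        x = x.permute(0, 2, 1, 3, 4).reshape(H, W, 12)    # ... to per-pixel parameters
        rgb, scale, quat = x[..., :3], x[..., 3:6], x[..., 6:10]
        opacity, dist = x[..., 10:11], x[..., 11:12]
        centers = ray_o + dist * ray_d                    # place each Gaussian along its pixel ray
        return centers, rgb, scale, quat, opacity
```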

Step 4: Training Objective

During training, the predicted Gaussians are rendered from novel supervision views, and the loss combines Mean Squared Error (MSE) with a perceptual similarity term:

\[ \mathcal{L} = \frac{1}{M} \sum_{i'=1}^M \left[ \mathrm{MSE}( \hat{\mathbf{I}}_{i'}^{*}, \mathbf{I}_{i'}^{*}) + \lambda\,\mathrm{Perceptual}( \hat{\mathbf{I}}_{i'}^{*}, \mathbf{I}_{i'}^{*}) \right] \]
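A sketch of this objective, using the off-the-shelf lpips package as a stand-in for the paper's perceptual term; the perceptual network and the default weight below are assumptions, not values confirmed by the source:

```python
import torch
import lpips  # pip install lpips

perceptual = lpips.LPIPS(net="vgg")  # stand-in perceptual metric (assumption)

def reconstruction_loss(pred, target, lam=0.5):
    """pred, target: (M, 3, H, W) renders at the M supervision views, in [0, 1]."""
    mse = ((pred - target) ** 2).mean()
    perc = perceptual(pred * 2 - 1, target * 2 - 1).mean()  # LPIPS expects [-1, 1]
    return mse + lam * perc  # lam plays the role of λ above
```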

Experiments and Results: GS-LRM in Action

Two GS-LRM variants were trained independently:

  • Object-level: on the Objaverse dataset
  • Scene-level: on the RealEstate10K dataset (indoor/outdoor videos)

By the Numbers

A table comparing GS-LRM against baselines for both object and scene reconstruction.

Table 1: GS-LRM surpasses prior state-of-the-art methods in both object-level and scene-level benchmarks across PSNR, SSIM, and LPIPS metrics.

Highlights:

  • Objects (GSO dataset): PSNR 30.52 — nearly 4 dB higher than Instant3D’s Triplane-LRM baseline.
  • Scenes (RealEstate10K): 2.2 dB PSNR above pixelSplat, with notable SSIM and LPIPS improvements.

Visual Comparisons

Against Triplane-LRM:

Side-by-side comparison with Triplane-LRM.

Fig. 3: GS-LRM preserves fine detail like text and thin structures that Triplane-LRM blurs or distorts.

Against LGM:

Comparison against LGM.

Fig. 4: LGM reconstructions show distortion and broken geometry; GS-LRM remains close to ground truth.

Against pixelSplat (Scene-level):

Comparison against pixelSplat for scene reconstruction.

Fig. 5: For real-world scenes, GS-LRM yields sharper results with fewer artifacts (“floaters”) than pixelSplat.

High-resolution capability:

High-resolution reconstructions from GS-LRM.

Fig. 6: GS-LRM reconstructs readable text, transparent glass, and complex outdoor geometry from high-res inputs.


Applications: Fast 3D for Generative Pipelines

GS-LRM’s speed and flexibility make it ideal for integration into creative workflows.

1. Text/Image-to-3D Objects
GS-LRM can be chained with multi-view generators such as:

  • Instant3D (text-to-multi-view)
  • Zero123++ (image-to-multi-view)

The sparse views these models generate are fed straight to GS-LRM, producing a 3D object almost instantly.

Gallery of 3D objects generated from text prompts and images using GS-LRM.

Fig. 7: Text-to-3D (top rows) and image-to-3D (bottom rows) objects reconstructed via GS-LRM.

2. Text-to-3D Scenes
Using Sora (text-to-video), frames are sampled from a generated video, camera poses are estimated for them, and GS-LRM reconstructs the result into an immersive environment.

3D coastal scene reconstructed from Sora video.

Fig. 8: A coastal landscape reconstructed from a generated video, with novel-view and depth renders from GS-LRM.


Conclusion and Future Directions

GS-LRM marks a notable advance in 3D reconstruction:

  • Simple, scalable architecture
  • Pixel-aligned Gaussian prediction
  • State-of-the-art quality for objects and scenes
  • Instant, high-resolution output

It isn’t without limits: the current model tops out at roughly 512×904 input resolution, requires known camera poses, and cannot reconstruct surfaces that fall entirely outside the input view frustums. Removing the pose requirement and extending resolution are natural next steps.

Still, GS-LRM exemplifies the future of accessible 3D content creation — collapsing timelines from hours to seconds, lowering expertise barriers, and opening creative possibilities from game worlds to virtual retail, cultural preservation, and beyond. As research progresses, models like GS-LRM will make 3D creation as easy as taking a few photos or writing a sentence.