In March 2020, “NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis” introduced a deceptively simple idea that reshaped how we think about 3D scene representation. From a set of posed 2D photos, a compact neural network could learn a continuous, view-consistent model of scene appearance and geometry, then synthesize photorealistic novel views. Over the next five years NeRF inspired a torrent of follow-up work: faster training, better geometry, robust sparse-view methods, generative 3D synthesis, and application-focused systems for urban scenes, human avatars, and SLAM.

Then, in 2023, 3D Gaussian Splatting arrived and rapidly claimed the spotlight for many novel-view-synthesis tasks thanks to dramatic gains in speed and rendering quality. That prompted an important question: is NeRF obsolete? The short answer is no. NeRF’s implicit, continuous representations remain uniquely useful in memory-constrained settings, for volumetric phenomena, and in tightly integrated 3D vision pipelines (e.g., SLAM, articulated avatars, and language-grounded scene understanding). This article is a guided tour of that five-year arc — foundations, the big milestones pre-2023, what changed afterwards, and where NeRF-style neural fields still shine.

Figure 1 gives a high-level timeline of influential papers from the NeRF era through the rise of Gaussian Splatting.


Fig. 1 — A timeline of important and influential NeRF and neural rendering methods from 2020–2025. The vertical annotation marks the emergence of Gaussian Splatting in late 2023.


The core idea in one paragraph

NeRF represents a scene as a continuous 5D function implemented by a neural network: given a 3D position x = (x, y, z) and a viewing direction d (a 3D unit vector), the network predicts volume density σ(x) and color c(x, d). That is,

\[ F(\mathbf{x}, \mathbf{d}) \mapsto (\mathbf{c}, \sigma). \]

To synthesize a pixel, we cast a camera ray r(t) = o + t d, sample along t, query the network at the sample locations, and integrate the colors by differentiable volume rendering. Training needs no 3D supervision: rendered pixels are compared to the ground-truth photos via a photometric loss, and gradients flow through the entire volume-rendering process into the network weights.
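
To make the ray-casting step concrete, here is a minimal NumPy sketch of pinhole-camera ray generation and stratified depth sampling. The function names, shapes, and the camera convention (x right, y up, camera looking down −z) are illustrative assumptions, not code from the paper.

```python
import numpy as np

def get_rays(H, W, focal, c2w):
    """One ray per pixel for a pinhole camera; c2w is a 3x4 camera-to-world matrix.
    Assumes the common convention of x right, y up, camera looking down -z."""
    i, j = np.meshgrid(np.arange(W, dtype=np.float32),
                       np.arange(H, dtype=np.float32), indexing="xy")
    dirs = np.stack([(i - 0.5 * W) / focal,        # camera-space directions
                     -(j - 0.5 * H) / focal,
                     -np.ones_like(i)], axis=-1)
    rays_d = dirs @ c2w[:3, :3].T                   # rotate into world space
    rays_o = np.broadcast_to(c2w[:3, 3], rays_d.shape)  # shared camera origin
    return rays_o, rays_d

def stratified_samples(t_near, t_far, n_samples, rng=None):
    """Stratified sampling of depths t along a ray: one jittered sample per bin."""
    if rng is None:
        rng = np.random.default_rng()
    edges = np.linspace(t_near, t_far, n_samples + 1)
    return edges[:-1] + (edges[1:] - edges[:-1]) * rng.random(n_samples)
```

The sample positions fed to the network are then simply r(t) = rays_o + t * rays_d for each sampled t.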

Figure 2 summarizes the pipeline.


Fig. 2 — The NeRF pipeline: sample points along rays, predict color and density with an MLP, integrate via volume rendering, and supervise rendered pixels against ground truth.


A concise primer on the math

Volume rendering integrates radiance contributions along a ray:

\[ C(\mathbf{r}) = \int_{t_1}^{t_2} T(t)\ \sigma(\mathbf{r}(t))\ \mathbf{c}(\mathbf{r}(t),\mathbf{d})\ dt, \qquad T(t)=\exp\Big(-\int_{t_1}^{t}\sigma(\mathbf{r}(u))du\Big). \]

In practice we discretize the interval into samples and approximate

\[ \hat{C}(\mathbf{r}) = \sum_{i=1}^N \alpha_i T_i\ \mathbf{c}_i, \qquad \alpha_i = 1 - \exp(-\sigma_i \delta_i), \]

where δi is the distance between adjacent samples and Ti = ∏_{j<i} (1 − αj) is the transmittance accumulated up to sample i. The expected depth along a ray can be computed from the same weights wi = αi Ti (replacing colors with sample depths) and used for geometry regularization.
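
The discrete estimator maps directly to a few lines of code. Below is a hedged PyTorch sketch of the compositing step for a single ray; the function name and tensor shapes are illustrative, and real implementations batch this over many rays.

```python
import torch

def composite(sigmas, colors, ts, deltas):
    """Alpha-composite per-sample (sigma, color) along one ray, following the
    quadrature above. sigmas: (N,), colors: (N, 3), ts/deltas: (N,) sample
    depths and spacings. Returns rendered color, expected depth, and weights."""
    alphas = 1.0 - torch.exp(-sigmas * deltas)                 # alpha_i
    # Transmittance T_i = prod_{j<i} (1 - alpha_j), with T_1 = 1.
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alphas[:1]), 1.0 - alphas + 1e-10])[:-1], dim=0)
    weights = alphas * trans                                   # w_i = alpha_i * T_i
    rgb = (weights[:, None] * colors).sum(dim=0)               # C_hat(r)
    depth = (weights * ts).sum()                               # expected depth
    return rgb, depth, weights
```

The returned weights are also what hierarchical sampling reuses: the coarse pass's weights define a distribution along the ray from which the fine pass draws additional samples.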

Two practical tricks that made NeRF work well in practice are (1) positional encoding — mapping coordinates with sinusoidal bases so the MLP can represent high-frequency detail — and (2) hierarchical sampling — a coarse network guides finer sampling along rays.
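
The first trick is short enough to show in full. Below is a sketch of the sinusoidal encoding γ(p); the frequency count and whether the raw coordinate is concatenated vary between implementations, so treat the defaults as assumptions.

```python
import math
import torch

def positional_encoding(x, num_freqs=10):
    """Map each coordinate to sin/cos features at exponentially growing frequencies
    so a small MLP can represent high-frequency detail. x: (..., D), roughly in [-1, 1]."""
    freqs = (2.0 ** torch.arange(num_freqs)) * math.pi   # 2^l * pi, l = 0..L-1
    angles = x[..., None] * freqs                         # (..., D, L)
    enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
    return enc.flatten(start_dim=-2)                      # (..., 2 * D * L)
```

The original NeRF uses L = 10 frequencies for positions and L = 4 for viewing directions; hierarchical sampling then uses the coarse network's compositing weights as a sampling distribution for the fine network.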


Datasets and metrics (short practical guide)

NeRF researchers use a mix of synthetic and real-world datasets:

  • NeRF Synthetic (Blender) — dense, controlled views of object-scale scenes (commonly used for prototyping and benchmarks).
  • LLFF (forward-facing real scenes) — handheld captures for real-world evaluation.
  • DTU — calibrated multi-view object scans (higher resolution, depth supervision available).
  • ScanNet, Replica, Matterport, KITTI / Waymo — large-scale indoor/outdoor and autonomous-driving datasets used for scene-scale NeRF and SLAM work.
  • Human datasets: ZJU-MoCap, Nerfies / HyperNeRF datasets for dynamic and deformable humans.

Benchmarking metrics are standard image-quality measures: PSNR, SSIM, and LPIPS (a learned perceptual distance). Comparisons between models commonly report these alongside training and inference speed.
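
As a quick reference, PSNR is just a log-scaled mean squared error; a minimal NumPy sketch is below. SSIM and LPIPS typically come from libraries such as scikit-image and the lpips package rather than hand-rolled code.

```python
import numpy as np

def psnr(pred, gt, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two images scaled to [0, max_val]."""
    mse = np.mean((pred.astype(np.float64) - gt.astype(np.float64)) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)
```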


The pre-Gaussian-Splatting era (2020–2023): key directions and milestones

The early years after NeRF can be grouped into several major thrusts:

  • improving photometric quality and geometry,
  • accelerating training and rendering,
  • reducing required input views (few-shot / sparse),
  • generative 3D synthesis (GANs & later diffusion),
  • composing and editing scenes (unbounded scenes, transient objects),
  • joint pose estimation and SLAM,
  • specialized applications: urban scenes, human avatars, surface reconstruction.

Figure 3 (taxonomy) shows these branches and many representative works.


Fig. 3 — Taxonomy of selected NeRF innovation papers. Categories include photometric/geometric quality, speed, sparse-view methods, generative models, composition, and pose estimation.

Below I summarize the most influential ideas and representative methods.

Improving photometric quality and view-dependent effects

  • Mip-NeRF (2021): Tackled aliasing by modeling pixel footprints as conical frustums rather than infinitesimal rays. It introduced Integrated Positional Encoding (IPE), which encodes a sample’s spatial extent and therefore anti-aliases inherently. The effect is especially noticeable when rendering at scales different from the training images, e.g., zoomed-out or low-resolution views (a short IPE sketch follows this list).


    Fig. 4 — Mip-NeRF: cone-based sampling (IPE) reduces aliasing and produces multi-scale-consistent renderings.

  • Ref-NeRF (2021): Reparameterized radiance to explicitly model reflection and specular behavior, recovering surface normals and producing much better results on shiny, reflective scenes.


    Fig. 5 — Ref-NeRF improves modeling of specular reflections and recovers normals for reflective surfaces.

  • Other geometry-aware works used SDFs or occupancy representations (e.g., NeuS, UNISURF) to produce cleaner surfaces for reconstruction tasks (see the Surface Reconstruction section).
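
To make the IPE idea concrete, here is a hedged PyTorch sketch of the diagonal-covariance form used in Mip-NeRF: each frequency's sinusoid is damped according to how much the Gaussian approximating a conical-frustum sample spreads at that frequency. Names and shapes are illustrative.

```python
import torch

def integrated_positional_encoding(mu, var, num_freqs=10):
    """Expected sinusoidal encoding of a Gaussian sample with mean `mu` and
    per-dimension variance `var` (diagonal approximation):
    E[sin(2^l x)] = sin(2^l mu) * exp(-0.5 * 4^l var), and similarly for cos."""
    freqs = 2.0 ** torch.arange(num_freqs)                 # 2^l
    scaled_mu = mu[..., None] * freqs                       # (..., D, L)
    damp = torch.exp(-0.5 * var[..., None] * freqs ** 2)    # shrink high frequencies
    enc = torch.cat([torch.sin(scaled_mu) * damp,
                     torch.cos(scaled_mu) * damp], dim=-1)
    return enc.flatten(start_dim=-2)
```

When a sample's footprint is large, its high-frequency features fade smoothly toward zero, which is exactly the anti-aliasing behaviour described above.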

Speed: from hours to seconds

Two complementary approaches dominated acceleration efforts:

  1. Baked representations: train a NeRF and then bake its outputs into efficient data structures (sparse voxel grids, octrees, spherical harmonic caches). Examples: SNeRG, PlenOctree, FastNeRF.

  2. Hybrid / explicit representations or better encodings: represent most of the scene in a compact grid of learned features and use a small decoder MLP for final predictions. The watershed work here is Instant-NGP (2022), which introduced a multi-resolution learned hash-grid encoding; this single idea reduced training from hours to minutes (or seconds for small scenes) while improving quality. Other approaches (Plenoxels, TensoRF, DVGO) pushed the explicit end further, in some cases removing the MLP entirely.
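
The hash-grid idea is easy to sketch. Below is a simplified, single-level lookup in PyTorch: points are snapped to a virtual grid, each of the eight surrounding corners is hashed into a learned feature table, and the corner features are trilinearly interpolated. The spatial-hash primes are those quoted in the Instant-NGP paper; everything else (names, shapes, the single level) is an illustrative assumption, since the real system stacks many resolutions and feeds the concatenated features to a tiny MLP with fused CUDA kernels.

```python
import torch

PRIMES = torch.tensor([1, 2654435761, 805459861])  # spatial-hash primes from the paper

def hash_grid_lookup(x, table, resolution):
    """Single-level hash-grid encoding for points x in [0, 1]^3.
    x: (N, 3), table: (T, F) learned features, resolution: grid cells per axis."""
    T = table.shape[0]
    scaled = x * resolution
    base = torch.floor(scaled).long()                 # lower corner of the cell
    frac = scaled - base                              # position inside the cell
    feats = torch.zeros(x.shape[0], table.shape[1])
    for corner in range(8):                           # 8 corners of the cell
        offset = [(corner >> d) & 1 for d in range(3)]
        idx = base + torch.tensor(offset)
        # Spatial hash: XOR of coordinate * prime, modulo table size.
        h = (idx[:, 0] * PRIMES[0]) ^ (idx[:, 1] * PRIMES[1]) ^ (idx[:, 2] * PRIMES[2])
        h = h % T
        # Trilinear interpolation weight for this corner.
        w = torch.ones(x.shape[0])
        for d in range(3):
            w = w * (frac[:, d] if offset[d] else 1.0 - frac[:, d])
        feats = feats + w[:, None] * table[h]
    return feats
```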


Fig. 6 — Instant-NGP’s multi-resolution hash encoding + small MLP made NeRF training dramatically faster while retaining high quality.

A compact comparison of representative models shows the magnitude of speedups and quality improvements (many papers report orders-of-magnitude faster inference with similar or improved PSNR/LPIPS).

Fig. 7 — Example comparison of NeRF variants (from the slow baseline NeRF to faster methods such as Instant-NGP and TensoRF): quality vs. training time and inference speed on a standard synthetic benchmark.

Sparse / few-shot view synthesis

NeRF originally required tens to hundreds of views. Two main strategies emerged to reduce this requirement:

  • Injecting learned 2D priors (pixelNeRF, MVSNeRF, GeoNeRF): extract CNN features from input images and use them to condition 3D queries or produce 3D feature volumes (via plane sweep cost volumes) that guide NeRF predictions.
  • Architectural and regularization techniques (RegNeRF): use depth and color regularizers, patch-based likelihood models, and geometry priors to prevent degeneracies when training from very few views.

RegNeRF, for instance, introduced depth-smoothness and patch-likelihood priors, allowing plausible reconstructions from as few as 3–9 views.
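
A hedged sketch of the depth-smoothness part of that recipe: render small patches from unobserved viewpoints and penalize depth differences between neighbouring pixels. This captures the spirit of RegNeRF's regularizer, not its exact published form.

```python
import torch

def depth_smoothness_loss(depth_patch):
    """Penalize depth differences between horizontally and vertically adjacent
    pixels of a rendered (P, P) depth patch from an unobserved viewpoint."""
    dx = depth_patch[:, 1:] - depth_patch[:, :-1]
    dy = depth_patch[1:, :] - depth_patch[:-1, :]
    return (dx ** 2).mean() + (dy ** 2).mean()
```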

Fig. 8 — RegNeRF produces plausible reconstructions from as few as 3, 6, or 9 input views, outperforming methods that rely on pretrained features in sparse-view settings.

Generative 3D (GANs → diffusion)

Initial generative attempts used GANs (GRAF, π-GAN, EG3D) to produce 3D-aware images. Later, diffusion models unlocked powerful text-to-3D and image-conditioned 3D generation:

  • DreamFusion (2022) used a pre-trained text-to-image diffusion model to provide gradients (Score Distillation Sampling, SDS) and trained a NeRF from scratch for a text prompt.
  • Magic3D improved resolution with a coarse-to-fine pipeline (low-res NeRF → mesh extraction → high-res refinement with latent diffusion).

These methods turned 2D diffusion priors into useful signals for creating or editing 3D content.
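
A schematic of how SDS turns a 2D diffusion prior into a 3D training signal, with heavy hedging: `diffusion_model` is a placeholder for any frozen noise-prediction network, the timestep range and weighting are common choices rather than the exact published ones, and `rendered_rgb` is assumed to be the differentiable output of a NeRF render.

```python
import torch

def sds_step(rendered_rgb, diffusion_model, text_embedding, alphas_cumprod):
    """One Score Distillation Sampling step: noise the rendering, ask the frozen
    text-conditioned diffusion model to predict the noise, and backpropagate the
    weighted noise residual into the NeRF parameters (via rendered_rgb)."""
    t = torch.randint(20, 980, (1,))                      # random diffusion timestep
    a_bar = alphas_cumprod[t]                             # cumulative alpha at t
    noise = torch.randn_like(rendered_rgb)
    noisy = a_bar.sqrt() * rendered_rgb + (1.0 - a_bar).sqrt() * noise
    with torch.no_grad():                                 # the 2D prior stays frozen
        eps_hat = diffusion_model(noisy, t, text_embedding)
    grad = (1.0 - a_bar) * (eps_hat - noise)              # SDS skips the U-Net Jacobian
    rendered_rgb.backward(gradient=grad)                  # treat grad as d(loss)/d(image)
```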


Fig. 9 — DreamFusion: text-to-3D by distilling a text-to-image diffusion model into a NeRF.

Fig. 10 — Magic3D: a coarse-to-fine pipeline for higher-resolution text-to-3D generation, with refinement under edited text prompts.

Composition, unbounded scenes, and transient appearance

Real-world capture introduces transient objects (cars, people) and unbounded backgrounds (sky, infinite distances). NeRF-W introduced per-image appearance and transient embeddings to handle varying lighting and transient content. NeRF++ and mip-NeRF 360 addressed unbounded scenes with special parameterizations and multi-scale sampling strategies. FiG-NeRF and object-compositional models supported editing and amodal segmentation by learning separate neural fields for foreground and background or per-object fields.

Pose estimation and SLAM

NeRFs need camera poses. Early approaches relied on COLMAP, but several works jointly optimized poses and radiance fields:

  • iNeRF solved for camera poses by optimizing the photometric error w.r.t. pose, given a pre-trained NeRF (a minimal sketch of this idea follows the list).
  • BARF and SCNeRF performed bundle-adjustment-style joint optimization of poses and NeRF weights with curriculum strategies to handle tricky initializations.
  • iMAP and NICE-SLAM pushed towards online SLAM: neural implicit mapping + real-time tracking, enabling dense mapping and pose estimation in live setups.
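
A minimal sketch of the iNeRF/BARF-style idea, under the assumption that `render_fn(R, t)` differentiably renders the radiance field from a given rotation and translation; the small-angle rotation update and the optimizer settings are illustrative choices.

```python
import torch

def skew(w):
    """Skew-symmetric matrix of a 3-vector, used for a small-angle rotation update."""
    zero = torch.zeros((), dtype=w.dtype)
    return torch.stack([torch.stack([zero, -w[2], w[1]]),
                        torch.stack([w[2], zero, -w[0]]),
                        torch.stack([-w[1], w[0], zero])])

def refine_pose(render_fn, target, init_R, init_t, steps=200, lr=1e-2):
    """Gradient-descend a 6-DoF pose correction so renderings match a target image."""
    w = torch.zeros(3, requires_grad=True)               # rotation update (axis-angle)
    v = torch.zeros(3, requires_grad=True)               # translation update
    opt = torch.optim.Adam([w, v], lr=lr)
    for _ in range(steps):
        R = init_R @ (torch.eye(3) + skew(w))            # first-order rotation update
        t = init_t + v
        loss = ((render_fn(R, t) - target) ** 2).mean()  # photometric error
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        return init_R @ (torch.eye(3) + skew(w)), init_t + v
```

BARF adds a coarse-to-fine schedule on the positional encoding so that early optimization sees only low frequencies, which helps avoid bad local minima when poses start far from the truth.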

Applications that drove innovation

The NeRF machinery found real applications that demanded new techniques and representations.

  • Urban reconstruction: Mega-NeRF and Block-NeRF scaled NeRF ideas to city-scale datasets by partitioning scenes and handling transient objects and lighting variability.
  • Human avatars and dynamic scenes: Nerfies, HyperNeRF, Neural Body and many follow-ups built deformation fields and canonical spaces to model non-rigid and articulated humans for photorealistic, animatable avatars.
  • Image processing tasks: HDR view synthesis (RawNeRF / HDR-NeRF), deblurring (DeblurNeRF), denoising (NaN), and super-resolution (NeRF-SR).
  • Surface reconstruction: NeuS and UNISURF changed the geometry representation from volumetric densities to SDFs or occupancies for cleaner mesh extraction.


Fig. 11 — Applications of NeRF: urban modeling, generative models, surface reconstruction, human avatars, SLAM, editing, and image restoration.

Fig. 12 — NeuS: representing geometry as a signed distance function yields higher-quality surface reconstruction and mesh extraction than the fuzzy density geometry of standard NeRF.


The rise of Gaussian Splatting (post-2023) and what changed

3D Gaussian Splatting (3DGS) represents scenes as hundreds of thousands to millions of anisotropic 3D Gaussians (position, covariance, color, opacity). Rendering projects and blends these primitives (splatting) in screen space using a differentiable rasterization-like pipeline. The immediate practical benefits were:

  • Much faster training convergence for many capture setups.
  • Real-time or near-real-time rendering with high visual quality.
  • Straightforward conversion to point-cloud-like 3D outputs.

Because of these benefits, 3DGS rapidly dominated many novel-view-synthesis leaderboards and practical pipelines. That said, Gaussian Splatting is an explicit, point-cloud-like representation and has trade-offs: larger memory/storage and less natural representation of continuous volumetric effects (fog, dust). Many post-2023 works hybridized ideas: combining Gaussians with view-dependent neural encoders or combining neural fields with splatting to get the best of both worlds.
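
For intuition on how splatting differs from ray marching, here is a hedged NumPy sketch of the two core steps: projecting a 3D Gaussian to a 2D screen-space Gaussian with the standard EWA-style linearization, and front-to-back alpha blending of depth-sorted splats at one pixel. The real pipeline adds tile-based rasterization, culling, per-Gaussian spherical-harmonic color, and adaptive densification; names and conventions here are illustrative.

```python
import numpy as np

def project_gaussian(mean_w, cov_w, view, K):
    """Project a world-space Gaussian (mean, 3x3 covariance) into screen space:
    Sigma2D = J R cov R^T J^T, with J the Jacobian of the perspective projection."""
    R, t = view[:3, :3], view[:3, 3]                    # world-to-camera transform
    x, y, z = R @ mean_w + t                             # camera-space mean
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    J = np.array([[fx / z, 0.0, -fx * x / z**2],
                  [0.0, fy / z, -fy * y / z**2]])
    cov2d = J @ R @ cov_w @ R.T @ J.T                    # 2x2 screen-space covariance
    mu2d = np.array([fx * x / z + cx, fy * y / z + cy])
    return mu2d, cov2d, z

def composite_pixel(pixel, splats):
    """Front-to-back alpha blending of splats at one pixel.
    splats: iterable of (mu2d, cov2d, depth, color, opacity)."""
    C, T = np.zeros(3), 1.0                              # accumulated color, transmittance
    for mu, cov, _, color, opacity in sorted(splats, key=lambda s: s[2]):
        d = pixel - mu
        alpha = opacity * np.exp(-0.5 * d @ np.linalg.inv(cov) @ d)
        C += T * alpha * color
        T *= 1.0 - alpha
        if T < 1e-3:                                     # early termination
            break
    return C
```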


Where NeRF-style neural fields still matter

Even as Gaussian Splatting claims much of the novel-view-synthesis spotlight, implicit and hybrid neural fields retain significant advantages in several areas:

  1. Memory- and storage-efficient representations: an MLP-based NeRF can be tiny on disk compared to an explicit Gaussian model storing millions of primitives — important for embedded or distributed systems.

  2. Continuous queries and differentiability everywhere: neural fields are naturally continuous and differentiable with respect to spatial queries. This is valuable for tasks that require smooth gradients (e.g., differentiable optimization), per-point attribute queries, or continuous-space reasoning (a short autograd sketch follows this list).

  3. Volumetric phenomena: modeling participating media (fog, smoke) is more straightforward with a volumetric density field and volume rendering than with splatting.

  4. Tight integration into robotics and SLAM: many SLAM and mapping systems prefer compact, queryable implicit fields for on-device mapping and loop closure deformation (several recent neural SLAM systems adopt hybrid implicit representations for these reasons).

  5. Human avatars and articulated models: building deformation fields conditioned on skeletal parameters or canonical spaces fits naturally with neural fields; a continuous deformation model is often easier to integrate than point-based splats.

  6. Language grounding and semantic fields: embedding multi-scale semantic or language-aware features into continuous 3D fields (LERF and follow-ups) yields robust 3D relevancy maps and is a promising direction for 3D vision + language systems.
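
Point 2 is easy to make concrete: because the field is differentiable everywhere, quantities such as surface normals can be queried at arbitrary points with autograd. A small sketch, assuming `density_fn` is any differentiable density (or SDF) network:

```python
import torch

def field_normals(density_fn, x):
    """Normals as the negative normalized gradient of density w.r.t. position.
    x: (N, 3) query points; density_fn(x) -> (N,) densities."""
    x = x.clone().requires_grad_(True)
    sigma = density_fn(x).sum()            # sum so a single backward pass suffices
    (grad,) = torch.autograd.grad(sigma, x)
    return -torch.nn.functional.normalize(grad, dim=-1)
```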

Figure 13 shows LERF mapping text queries into a 3D field to produce semantically meaningful relevancy heatmaps.

Fig. 13 — LERF: language-embedded radiance fields map text queries to 3D-consistent relevancy heatmaps, outperforming 2D baselines by embedding language directly into the 3D scene representation.

Figure 14 shows reconstructions from modern neural implicit SLAM systems.

Fig. 14 — CP-SLAM and related neural implicit SLAM systems exploit hybrid point/field representations for detailed real-time mapping and multi-agent consistency.

Figure 15 shows a representative hybrid avatar (BakedAvatar) that bakes neural-field structure into real-time-friendly proxies.

Fig. 15 — BakedAvatar: layered mesh proxies and a learned manifold derived from a neural field enable real-time, photorealistic 4D head avatar synthesis.


Recent fronts and the synthesis of ideas (2023–2025)

After Gaussian Splatting’s emergence, the field diversified rather than collapsed. Key modern themes include:

  • Grid + MLP hybrids continue to evolve (hash grids, factorized tensors, radial basis functions, multiplicative filters) to improve spectral capacity and robustness.
  • Diffusion priors are heavily used for 3D editing, inpainting, super-resolution, and single-view lifting via Score Distillation or related schemes.
  • Language and grounding methods are extending 3D scene understanding by combining CLIP / vision-language models with 3D fields, enabling open-vocabulary 3D segmentation and language-driven editing (OV-NeRF, LERF, HNR).
  • SLAM systems adopt hybrid implicit representations for efficient mapping, loop-closure deformation, and multi-agent consistency.
  • Avatar systems converge on hybrid architectures that combine mesh priors (SMPL, 3DMM) with per-vertex or per-point learned features and small decoding MLPs for photorealistic, controllable avatars.

This evolution reflects a broader pattern: ideas that accelerated NeRF (efficient encodings, hybrid storage, distillation from 2D models) are being combined with new primitives (Gaussian splats, diffusion priors, language models) to produce practical, high-quality 3D systems.


Practical takeaways for researchers and practitioners

  • If you need the fastest possible high-quality novel view synthesis and rendering throughput (for many common capture setups), Gaussian Splatting and baked explicit representations are the current practical winners.
  • If you care about compactness, continuous and differentiable queries, or modeling volumetric effects — or you are building systems that require tight integration with optimization (SLAM, physics, differentiable editing) — implicit neural fields (NeRF-like models) still provide compelling advantages.
  • For sparse-view reconstruction, leverage pretrained 2D feature extractors or powerful regularizers (semantic or depth priors). RegNeRF-style regularization and feature-conditioned hybrid models are strong starting points.
  • For generative or text-driven 3D synthesis, diffusion-based priors (DreamFusion, Magic3D variants, latent diffusion) give controllable, high-quality outputs. Expect iterative, coarse-to-fine pipelines for best results.
  • For human avatars and animation, canonical-space + deformation-field architectures built on top of skeleton or morphable models remain a robust, flexible paradigm.

Concluding perspective

NeRF was a conceptually simple yet profound idea: represent a scene with a continuous neural field and render it with differentiable volume rendering. Between 2020 and 2023 the community dramatically improved its quality, speed, and data efficiency. From 2023 onward, Gaussian Splatting introduced a powerful explicit alternative that won many practical battles, but it did not erase the conceptual, practical, and theoretical legacy of NeRF.

NeRF-style neural fields remain indispensable for applications that require continuous representations, compact storage, volumetric modeling, or tight integration with other optimization-based systems (SLAM, robotics, neural avatars, and vision-language grounding). The field today is an ecosystem: hybrid methods, splatting, diffusion priors, language grounding, and implicit SLAM are cross-pollinating. Studying NeRF’s evolution gives practical recipes (hash encodings, hybrid grids, SDF-based geometry, diffusion distillation) and a mindset for building future 3D systems.

If you’re getting started, build from a few key reproducible baselines: NeRF/Mip-NeRF for theory, Instant-NGP for speed and encoding tricks, NeuS/UNISURF for geometry, and DreamFusion/Magic3D for generative pipelines. From there, decide whether your problem favors explicit, hybrid, or implicit representations and pick the appropriate toolbox.

For further reading, the survey “NeRF: Neural Radiance Field in 3D Vision: A Comprehensive Review” (the paper this post summarizes) is an excellent detailed roadmap, with references to representative implementations, datasets, and follow-up systems across 2020–2025.