Introduction
Imagine a robot navigating a bustling city or a complex underground tunnel. To operate autonomously, it doesn’t just need to know where obstacles are; it needs to know where it is on a global map. GPS is often unreliable or unavailable in these environments (think urban canyons or indoor spaces). This is where LiDAR Place Recognition (LPR) comes in. The robot scans its surroundings with a laser scanner and asks, “Have I seen this pattern of geometry before?”
For years, the robotics community has treated LPR as a purely geometric problem, building specialized deep learning models trained from scratch on 3D point clouds. Meanwhile, the computer vision world has been undergoing a revolution with Vision Foundation Models (VFMs) like DINOv2. These massive models, trained on hundreds of millions of images, possess an incredible generalized understanding of the visual world.
But there is a problem: A LiDAR point cloud is not an image. It is a sparse, unordered set of 3D coordinates. You cannot simply feed a point cloud into a model designed for RGB photographs.
This blog post explores ImLPR, a groundbreaking research paper that bridges this “modality gap.” The researchers propose a clever pipeline that converts 3D LiDAR data into a format that Vision Foundation Models can understand, allowing robots to leverage the massive pre-trained “knowledge” of DINOv2 for highly accurate localization.

Background: The Modality Gap
Before diving into the architecture, it is essential to understand why this is a difficult problem.
Traditional LPR
Traditionally, deep learning for LiDAR has taken two paths:
- Point-based methods: These consume raw 3D points (e.g., PointNetVLAD). They are computationally heavy and require training on large 3D datasets, which are scarce compared to image datasets.
- Projection-based methods: These squash the 3D world into a 2D image. The most common is Bird’s Eye View (BEV), which looks like a top-down map. While popular, BEV compresses the vertical axis (z-axis), causing a significant loss of information. For example, a flat wall and a fence might look identical from above.
Vision Foundation Models (VFMs)
VFMs like DINOv2 are transformers trained on massive datasets (e.g., 142 million images). They are excellent at identifying features in complex scenes. In Visual Place Recognition (VPR), using VFMs is now the gold standard.
The challenge for ImLPR was to figure out how to use a model expecting a standard RGB image (Red, Green, Blue channels) when the input is actually distance measurements from a laser.
The ImLPR Methodology
The ImLPR pipeline consists of three distinct stages: Input Processing, Feature Extraction (using the VFM), and Feature Aggregation. Let’s break them down.

1. Input Processing: Creating the “LiDAR Image”
To use a vision model, the 3D data must look like an image. The authors choose a Range Image View (RIV) projection over BEV. A Range Image is essentially a spherical projection—imagine standing in the center of the scan and unrolling the 360-degree view onto a flat 2D plane. This preserves vertical details that BEV destroys.
However, DINOv2 expects 3 input channels (like RGB). A standard range image only has one channel (Range). The authors cleverly construct a 3-channel image to mimic the richness of RGB data:
- Reflectivity: How “shiny” the surface is to the laser. This acts like texture or color.
- Range: The distance to the object. This provides depth geometry.
- Normal Ratio: A measure of local surface geometry (e.g., is this a flat wall or a rough bush?).
The projection from a 3D point \(p_i = (x_i, y_i, z_i)\) to a 2D pixel \((u_i, v_i)\) is the standard spherical (range-image) projection:

$$
u_i = \frac{1}{2}\left(1 - \frac{\arctan(y_i, x_i)}{\pi}\right) W, \qquad
v_i = \left(1 - \frac{\arcsin(z_i / r_i) + f_{\text{down}}}{f}\right) H
$$

Here, \(r_i = \lVert p_i \rVert\) is the measured range, \(W\) and \(H\) are the image dimensions, \(f\) is the vertical field of view, and \(f_{\text{down}}\) is its downward portion. By encoding physical properties into image channels, the system gives the vision model “visual-like” patterns to latch onto.
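To make this concrete, here is a minimal NumPy sketch of the three-channel RIV construction. It assumes a 64×1024 image and a symmetric ±22.5° vertical field of view, and its normal-ratio channel is a simplified stand-in (a vertical range gradient) rather than the paper's exact definition:

```python
import numpy as np

def build_riv_image(points, reflectivity, W=1024, H=64,
                    fov_up_deg=22.5, fov_down_deg=-22.5):
    """Project one LiDAR scan into a 3-channel Range Image View (RIV).

    points:       (N, 3) array of x, y, z coordinates (meters)
    reflectivity: (N,)  array of per-point reflectivity values
    Returns an (H, W, 3) float image: [reflectivity, range, normal-ratio proxy].
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1) + 1e-8        # range per point

    fov_up, fov_down = np.deg2rad(fov_up_deg), np.deg2rad(fov_down_deg)
    fov = fov_up - fov_down                           # total vertical FoV

    # Spherical projection: azimuth -> column u, elevation -> row v
    u = 0.5 * (1.0 - np.arctan2(y, x) / np.pi) * W
    v = (1.0 - (np.arcsin(z / r) - fov_down) / fov) * H
    u = np.clip(np.floor(u), 0, W - 1).astype(np.int64)
    v = np.clip(np.floor(v), 0, H - 1).astype(np.int64)

    img = np.zeros((H, W, 3), dtype=np.float32)
    img[v, u, 0] = reflectivity                       # channel 0: "texture"
    img[v, u, 1] = r                                  # channel 1: depth geometry

    # Channel 2: crude local-surface cue from vertical range differences
    # (a simplified stand-in for the paper's normal-ratio channel).
    rng = img[:, :, 1]
    img[:, :, 2] = np.abs(np.diff(rng, axis=0, prepend=rng[:1])) / (rng + 1e-8)
    return img
```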
2. The Backbone: Adapting DINOv2
Once the “LiDAR image” is created, it is fed into DINOv2. However, you cannot use DINOv2 “out of the box,” because LiDAR images look very different from the internet photos DINOv2 was trained on. At the other extreme, if you fine-tune the entire model, you risk catastrophic forgetting: the model learns the new data but loses the robust features it learned from those 142 million images.
The solution? MultiConv Adapters.
The authors freeze most of the DINOv2 parameters. They then insert small, lightweight adapter layers between the transformer blocks. These adapters process the intermediate features using Convolutional Neural Networks (CNNs).
The adapter logic takes the following residual form:

$$
\hat{x}^{\text{patch}} = x^{\text{patch}} + \text{MultiConv}\big(x^{\text{patch}}\big)
$$
This equation essentially says: take the patch features (\(x^{\text{patch}}\)), run them through a small adapter network, add them back to the original features (residual connection), and pass them to the next block. This allows the model to adjust slightly to the “dialect” of LiDAR data without forgetting the “language” of vision.
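A minimal PyTorch sketch of this idea follows; the real MultiConv adapter's kernel sizes, hidden widths, and patch-grid shape are paper-specific details that this simplified version does not reproduce:

```python
import torch
import torch.nn as nn

class ConvAdapter(nn.Module):
    """Lightweight residual adapter between frozen transformer blocks.
    The patch-grid shape (16 x 64) and hidden width are illustrative."""
    def __init__(self, dim=768, hidden=64, grid=(16, 64)):
        super().__init__()
        self.grid = grid
        self.down = nn.Conv2d(dim, hidden, kernel_size=1)
        self.conv = nn.Conv2d(hidden, hidden, kernel_size=3, padding=1)
        self.up = nn.Conv2d(hidden, dim, kernel_size=1)
        self.act = nn.GELU()

    def forward(self, x_patch):                 # x_patch: (B, N, dim) patch tokens
        B, N, D = x_patch.shape
        h, w = self.grid                        # requires N == h * w
        feat = x_patch.transpose(1, 2).reshape(B, D, h, w)
        feat = self.up(self.act(self.conv(self.act(self.down(feat)))))
        delta = feat.reshape(B, D, N).transpose(1, 2)
        return x_patch + delta                  # residual keeps the frozen features

# Usage sketch: freeze the backbone, attach one adapter per block.
# for p in dinov2.parameters():
#     p.requires_grad = False
# adapters = nn.ModuleList(ConvAdapter(dim=768) for _ in dinov2.blocks)
```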
3. Patch-Level Contrastive Learning
Global descriptors (a single vector representing the whole scene) are great, but for high precision, the model needs to understand local details. The authors introduce Patch-InfoNCE loss.
This technique forces the model to ensure that specific patches (small squares of the image) in the current scan match the corresponding patches in a historical scan of the same location.

To do this, they align two scans with the Iterative Closest Point (ICP) algorithm to find the precise overlap. If patch A in scan 1 physically overlaps with patch B in scan 2, the model is penalized if their feature representations are not similar.
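As a rough sketch of that alignment step (assuming the Open3D library; the authors' actual registration settings are not reproduced here):

```python
import numpy as np
import open3d as o3d

def align_scans(src_pts, tgt_pts, init=np.eye(4)):
    """Estimate the rigid transform that maps scan `src_pts` onto `tgt_pts`
    using point-to-point ICP, so overlapping regions (and hence corresponding
    patches) of the two scans can be identified. Thresholds are illustrative."""
    src = o3d.geometry.PointCloud()
    src.points = o3d.utility.Vector3dVector(src_pts)
    tgt = o3d.geometry.PointCloud()
    tgt.points = o3d.utility.Vector3dVector(tgt_pts)
    result = o3d.pipelines.registration.registration_icp(
        src, tgt, 1.0, init,                    # 1.0 m max correspondence distance
        o3d.pipelines.registration.TransformationEstimationPointToPoint())
    return result.transformation                # 4x4 matrix, src frame -> tgt frame
```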
The loss is a patch-level InfoNCE:

$$
\mathcal{L}_{\text{patch}} = -\frac{1}{|\mathcal{P}|} \sum_{(i,j) \in \mathcal{P}} \log \frac{\exp\!\big(\mathrm{sim}(f_i, f_j)/\tau\big)}{\sum_{k} \exp\!\big(\mathrm{sim}(f_i, f_k)/\tau\big)}
$$

where \(\mathcal{P}\) is the set of corresponding (positive) patch pairs, \(f_i\) and \(f_j\) are their features, \(\mathrm{sim}(\cdot,\cdot)\) is cosine similarity, \(\tau\) is a temperature, and \(k\) ranges over the candidate patches in the other scan.
This loss function ensures that the model isn’t just looking at the general “vibe” of the scene, but is actually recognizing specific landmarks, like a particular tree or building corner.
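A simplified PyTorch version of such a patch-level contrastive term, with positive pairs assumed to come from the ICP-aligned overlap above, might look like this (an illustration, not the authors' implementation):

```python
import torch
import torch.nn.functional as F

def patch_info_nce(feats_a, feats_b, pos_pairs, temperature=0.1):
    """Illustrative patch-level InfoNCE.

    feats_a, feats_b: (N, D) patch features of two scans of the same place
    pos_pairs:        (M, 2) long tensor of indices (i, j) of patches that
                      physically overlap after ICP alignment
    """
    a = F.normalize(feats_a, dim=-1)
    b = F.normalize(feats_b, dim=-1)
    logits = a @ b.t() / temperature            # (N, N) patch similarity matrix
    i, j = pos_pairs[:, 0], pos_pairs[:, 1]
    # For anchor patch i the positive is column j; every other patch in the
    # second scan acts as a negative inside the cross-entropy.
    return F.cross_entropy(logits[i], j)
```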
4. Feature Aggregation (SALAD)
Finally, the features extracted by the transformer must be condensed into a single vector (descriptor) for fast searching in a database. ImLPR uses SALAD (Sinkhorn Algorithm for Locally Aggregated Descriptors), an aggregation method from visual place recognition that uses optimal transport to softly assign local features to clusters and pool them into a compact global representation.
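In spirit, the aggregation works like the sketch below: patch-to-cluster affinities are normalized with a few Sinkhorn (optimal transport) iterations and then used to pool features cluster by cluster. The real SALAD head also learns score and projection layers and a “dustbin” cluster, which are omitted here:

```python
import torch
import torch.nn.functional as F

def sinkhorn(log_scores, n_iters=3):
    """A few Sinkhorn iterations: alternately normalize rows and columns in
    log-space to obtain a soft transport plan between patches and clusters."""
    log_p = log_scores
    for _ in range(n_iters):
        log_p = log_p - torch.logsumexp(log_p, dim=1, keepdim=True)
        log_p = log_p - torch.logsumexp(log_p, dim=0, keepdim=True)
    return log_p.exp()

def aggregate(patch_feats, cluster_centers, temperature=0.1):
    """Pool local patch features into one global descriptor.

    patch_feats:     (N, D) local features from the backbone
    cluster_centers: (K, D) learnable cluster prototypes
    """
    scores = patch_feats @ cluster_centers.t() / temperature   # (N, K) affinities
    assignment = sinkhorn(scores)                              # soft OT assignment
    clusters = assignment.t() @ patch_feats                    # (K, D) pooled features
    return F.normalize(clusters.reshape(-1), dim=0)            # (K*D,) descriptor
```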
Experiments and Results
The authors evaluated ImLPR on multiple public datasets (HeLiPR, MulRan, NCLT) covering varied environments like cities, campuses, and highways.
Intra-Session Performance
In “intra-session” tests, the robot revisits a location shortly after mapping it. The sensors and conditions are relatively similar.

As shown in Table 1, ImLPR achieves nearly perfect scores (0.992 Recall@1), consistently beating state-of-the-art methods like BEVPlace++ and MinkLoc3Dv2.
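For context, Recall@1 measures how often the single nearest database descriptor comes from the correct place; a minimal evaluation sketch (with an assumed 5 m threshold for “correct”) could look like this:

```python
import numpy as np

def recall_at_1(query_desc, db_desc, query_pos, db_pos, dist_thresh=5.0):
    """A query counts as correct if its single nearest database descriptor
    was captured within `dist_thresh` meters of the query's true position."""
    q = query_desc / np.linalg.norm(query_desc, axis=1, keepdims=True)
    d = db_desc / np.linalg.norm(db_desc, axis=1, keepdims=True)
    nearest = (q @ d.T).argmax(axis=1)                 # top-1 match per query
    errors = np.linalg.norm(query_pos - db_pos[nearest], axis=1)
    return float((errors < dist_thresh).mean())
```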
We can visually see the improvement in Figure 4 below. The red lines indicate “false positives”—times when the robot got confused about its location. ImLPR (bottom) has significantly fewer red lines than the competitors.

Inter-Session and Generalization
The real test of a localization system is Inter-session (revisiting days or months later) and Generalization (training on one sensor, testing on another).
When trained on high-resolution Ouster LiDAR data and tested on lower-resolution Velodyne data, traditional methods often fail because the point cloud density changes drastically.

The precision-recall curves above illustrate this robustness. While BEVPlace++ (orange) performs well on some datasets, its performance collapses on the HeLiPR-V dataset. ImLPR (red), however, maintains high performance across the board.
Why does BEV fail here?
The authors provide a compelling visualization to explain why the Range Image View (RIV) used by ImLPR is superior to Bird’s Eye View (BEV) for this task.

In Figure 6(a), look at the BEV representation. Because the scan is sparse, the BEV projection creates “empty” pixels that the model mistakenly thinks are actual features (orange boxes). In contrast, the RIV projection in Figure 6(b) remains consistent even when dynamic objects (like a car) disappear, or when the sensor density changes.
Robustness to Rotation (Yaw)
One major headache in robotics is rotation. If a robot approaches a place at a slightly different angle, the LiDAR scan looks different.

Figure 12 demonstrates ImLPR’s stability. While other methods fluctuate wildly as the robot rotates (yaw change), ImLPR’s performance remains nearly flat. This is partly because the DINOv2 architecture, combined with the cylindrical nature of the Range Image, handles horizontal shifts (which correspond to rotation) naturally.
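A quick way to build intuition: rotating the sensor in yaw amounts to circularly shifting the columns of the range image, as this small sketch (reusing the hypothetical `build_riv_image` from earlier) suggests:

```python
import numpy as np

def rotate_yaw(points, yaw_rad):
    """Rotate a point cloud about the vertical (z) axis."""
    c, s = np.cos(yaw_rad), np.sin(yaw_rad)
    R = np.array([[c, -s, 0.0],
                  [s,  c, 0.0],
                  [0.0, 0.0, 1.0]])
    return points @ R.T

# Sketch: project a scan with build_riv_image (W columns), rotate the cloud
# by 90 degrees, and project again. Up to the sign convention of the azimuth,
# the result is roughly a circular column shift of the original image:
#   img_rotated ≈ np.roll(img, shift=-W // 4, axis=1)
```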
Conclusion and Implications
ImLPR represents a significant step forward in robotic localization. By successfully adapting a Vision Foundation Model (DINOv2) to the LiDAR domain, the researchers have unlocked a new way to process 3D data without needing massive, domain-specific 3D datasets.
Key Takeaways:
- Don’t reinvent the wheel: We can leverage the “knowledge” stored in massive vision models for 3D tasks if we can bridge the modality gap.
- Representation matters: How you project your data (RIV vs. BEV) dictates what your model can learn. RIV preserves the geometric details necessary for foundation models to work.
- Adapters are powerful: You don’t need to retrain huge models from scratch. Small, trainable adapters can retarget a giant model’s capabilities to a new sensor type.
This research paves the way for “General Purpose” robotic perception, where a single foundation model could potentially handle inputs from cameras, LiDARs, and perhaps even radar, creating more robust and intelligent autonomous systems.