In the rapidly evolving world of Computer Vision, teaching machines to understand 3D spaces is a monumental challenge. We want robots to navigate construction sites, augmented reality glasses to overlay information on furniture, and digital assistants to understand complex spatial queries like “Find the kitchen with the island counter.”
To do this, AI systems typically rely on multi-modal learning. They combine different types of data—RGB images, 3D point clouds, CAD models, and text descriptions—to build a robust understanding of the world. However, existing methods have a significant Achilles’ heel: they often assume the data is perfect. They require every object to be fully aligned across all modalities, with complete semantic labels.
But the real world is messy. A robot might have a point cloud but no camera image. A floorplan might exist without a corresponding 3D scan.
Enter CrossOver, a novel framework presented by researchers from Stanford, Microsoft, ETH Zurich, and HUN-REN SZTAKI. This paper introduces a method for flexible, scene-level alignment that doesn’t need perfect data pairs or explicit object labels during inference.
In this post, we will deconstruct how CrossOver works, how it learns a “modality-agnostic” language for 3D scenes, and why its “emergent behavior” is a game-changer for 3D scene understanding.

The Problem: The “Perfect Data” Trap
Traditional multi-modal approaches usually operate at the object level. They take a triplet of data—say, an image of a chair, a point cloud of that same chair, and the text “a wooden chair”—and learn to map them to the same feature space. Methods like PointCLIP or ULIP have been successful here.
However, this approach faces two major limitations when scaled up to entire scenes:
- Context Blindness: Aligning individual objects ignores the relationship between them. A “chair” in a vacuum is different from “a chair next to a dining table.”
- Rigid Requirements: These methods assume you have aligned data for every instance. If you want to match a video of a room to a CAD model, but the video is missing a specific lamp that exists in the CAD model, traditional algorithms often fail.
Real-world applications need scene-level understanding that is robust to missing modalities and does not rely on pre-labeled object segments (semantic segmentation) during inference.
The Solution: CrossOver
CrossOver addresses these issues by learning a unified embedding space for entire scenes. It aligns five distinct modalities:
- RGB Images (\(\mathcal{I}\))
- Point Clouds (\(\mathcal{P}\))
- CAD Meshes (\(\mathcal{M}\))
- Floorplans (\(\mathcal{F}\))
- Text Descriptions (\(\mathcal{R}\))
The core innovation is that CrossOver does not require all these modalities to be present simultaneously. Instead, it uses a flexible training pipeline to learn how to translate any available input into a shared “language” (embedding space).
The Architecture Overview
The framework operates in a progressive manner, moving from understanding individual instances to understanding the holistic scene.

As shown in Figure 2, the architecture is split into three main conceptual blocks:
- Instance-Level Multimodal Interaction: Learning representations for specific objects using available data.
- Scene-Level Multimodal Interaction: Fusing these objects into a single scene descriptor (\(\mathbf{F}_S\)).
- Unified Dimensionality Encoders: The final inference engines that can take raw data (like a raw point cloud) and map it to the scene descriptor without needing to know where individual objects are.
Let’s break down the methodology step-by-step.
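To keep the data flow straight before diving into each step, here is a minimal sketch of how the three blocks fit together. The function names, tensor shapes, and string inputs are illustrative placeholders, not the authors' code, and the encoders are stubbed out with random features.

```python
import torch

# Illustrative sizes only: N objects per scene, D-dimensional shared embedding space.
N, D = 8, 512

def encode_instances(scene):
    """Block 1: per-object features from whatever modalities happen to be available."""
    # One D-dim feature per object and per available modality (random stand-ins here).
    return {m: torch.randn(N, D) for m in scene["available_modalities"]}

def fuse_scene(instance_feats):
    """Block 2: collapse per-object, per-modality features into one scene vector F_S."""
    pooled = torch.stack([f.mean(dim=0) for f in instance_feats.values()])
    return pooled.mean(dim=0)  # placeholder for the learned weighted fusion of Step 2

def unified_encoder(raw_scene_data):
    """Block 3: map raw, unsegmented data (point cloud, images, text) straight to F_S."""
    return torch.randn(D)  # in the real system this is a trained network per dimensionality

scene = {"available_modalities": ["rgb", "point_cloud", "text"]}
F_S = fuse_scene(encode_instances(scene))      # training-time target
F_S_hat = unified_encoder("raw_scan.ply")      # inference-time estimate from raw data
print(F_S.shape, F_S_hat.shape)  # torch.Size([512]) torch.Size([512])
```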
Step 1: Instance-Level Interaction
The first stage is about grounding specific objects. Even though the ultimate goal is scene understanding, the model first learns to represent a scene's constituent objects.
The researchers use different encoders for different data dimensions:
- 1D Encoder (Text): Uses a pre-trained BLIP encoder to process “object referrals”—descriptions of an object’s spatial relationship (e.g., “The chair is left of the table”).
- 2D Encoder (Images): Uses DinoV2 to extract visual features from multiple views of an object.
- 3D Encoder (Point Clouds/Meshes): Uses a pre-trained 3D encoder (I2P-MAE) to encode geometry. Importantly, for point clouds, the model also encodes each object's spatial location and pairwise relationships (how far is this object from the others?), injecting scene context early on (see the sketch after this list).
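Concretely, each modality passes through a frozen backbone and a small projection head into the same embedding width. The sketch below is a deliberately simplified assumption: the `*_proj` layers stand in for the real projection heads, the input dimensions are made up, and the pre-trained BLIP / DinoV2 / I2P-MAE backbones are assumed to have already produced the per-object features fed in here.

```python
import torch
import torch.nn as nn

EMBED_DIM = 512  # assumed width of the shared embedding space

class InstanceEncoders(nn.Module):
    """Per-modality projection heads mapping pre-extracted object features
    (from BLIP, DinoV2, I2P-MAE) into one shared space. Dimensions are illustrative."""

    def __init__(self, text_dim=768, img_dim=1024, pcl_dim=384):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, EMBED_DIM)    # 1D: object-referral text features
        self.img_proj = nn.Linear(img_dim, EMBED_DIM)      # 2D: multi-view image features
        self.pcl_proj = nn.Linear(pcl_dim + 3, EMBED_DIM)  # 3D: geometry + object centroid

    def forward(self, text_feat, img_feat, pcl_feat, centroid):
        # Inject spatial context for point clouds: concatenate the object's location
        # (the paper also adds pairwise relations) to its geometric feature.
        pcl_in = torch.cat([pcl_feat, centroid], dim=-1)
        return {
            "text": self.text_proj(text_feat),
            "image": self.img_proj(img_feat),
            "point": self.pcl_proj(pcl_in),
        }

enc = InstanceEncoders()
feats = enc(torch.randn(4, 768), torch.randn(4, 1024), torch.randn(4, 384), torch.randn(4, 3))
print({k: v.shape for k, v in feats.items()})
```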
The Contrastive Loss
To align these modalities, the model uses a contrastive loss function. The idea is simple: the embedding of a “chair” in an image should be mathematically close to the embedding of that same “chair” in a point cloud, and far away from a “table.”
The loss function for an object instance \(\mathcal{O}_i\) is defined as:
\[\mathcal{L}_{\mathcal{O}_i} = \sum_{m \in \{\mathcal{P},\, \mathcal{M},\, \mathcal{R}\}} \ell\!\left(\mathbf{f}_i^{\,m},\ \mathbf{f}_i^{\,\mathcal{I}}\right)\]

where \(\mathbf{f}_i^{\,m}\) is the embedding of instance \(\mathcal{O}_i\) in modality \(m\), and \(\ell(\cdot,\cdot)\) is the contrastive (InfoNCE) term defined at the end of this section.
Here, the model aligns the Point Cloud (\(\mathcal{P}\)), Mesh (\(\mathcal{M}\)), and Text (\(\mathcal{R}\)) features specifically with the Image (\(\mathcal{I}\)) features.
Why align to images? The authors choose RGB images as the “base” modality because they are the most abundant and rich source of data. By anchoring everything to the image space, the model avoids the combinatorial explosion of trying to align every pair (e.g., Mesh-to-Text, Floorplan-to-Point Cloud) explicitly during this phase. This efficiency is key to the model’s scalability.
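A compact way to see the anchoring trick is in code. The sketch below assumes batched, already-projected instance features per modality; `info_nce` is a generic batch-wise contrastive term (spelled out in more detail later in the post). The key point is that only three terms are needed, one per non-image modality, rather than one per modality pair.

```python
import torch
import torch.nn.functional as F

def info_nce(query, key, tau=0.07):
    """Minimal batch-wise contrastive term: row i of `query` and row i of `key`
    are a positive pair; every other row in the batch acts as a negative."""
    q, k = F.normalize(query, dim=-1), F.normalize(key, dim=-1)
    logits = q @ k.t() / tau
    return F.cross_entropy(logits, torch.arange(q.size(0)))

def instance_loss(feats):
    """Align point cloud, mesh, and text features to the image anchor only,
    instead of aligning every pair of modalities explicitly."""
    anchor = feats["image"]
    return sum(info_nce(feats[m], anchor) for m in ("point", "mesh", "text"))

feats = {m: torch.randn(16, 512) for m in ("image", "point", "mesh", "text")}
print(instance_loss(feats).item())
```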
Step 2: Scene-Level Fusion
Once the model has features for the individual instances, it needs to aggregate them into a Scene Embedding.
Simply averaging the features of all objects in a room might wash out important details. Instead, CrossOver employs a Weighted Multimodal Scene Fusion. It pools features from all available modalities for all instances in a scene and computes a weighted sum.
\[\mathbf{F}_S = \sum_{q \in \{\mathcal{I},\, \mathcal{P},\, \mathcal{M},\, \mathcal{R}\}} w_q\, \mathbf{F}_q\]
In this equation, \(w_q\) represents trainable attention weights. The network learns which modalities provide the most reliable signal for describing the scene and weighs them accordingly to produce a master scene vector, \(\mathbf{F}_S\).
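A sketch of what such a fusion module could look like, assuming per-instance features and per-instance availability masks for each modality. The softmax-normalized weight vector is a stand-in for the paper's learned attention weights \(w_q\); the modality names and shapes are illustrative.

```python
import torch
import torch.nn as nn

class WeightedSceneFusion(nn.Module):
    """Pool instance features per modality, then combine the modalities with
    learned weights to form a single scene embedding F_S."""

    def __init__(self, modalities=("image", "point", "mesh", "text")):
        super().__init__()
        self.modalities = modalities
        self.logits = nn.Parameter(torch.zeros(len(modalities)))  # trainable w_q (pre-softmax)

    def forward(self, instance_feats, masks):
        pooled = []
        for m in self.modalities:
            mask = masks[m].unsqueeze(-1)            # (N, 1): which objects have this modality
            denom = mask.sum().clamp(min=1)
            pooled.append((instance_feats[m] * mask).sum(dim=0) / denom)
        w = torch.softmax(self.logits, dim=0)        # normalized modality weights
        return (w.unsqueeze(-1) * torch.stack(pooled)).sum(dim=0)  # F_S

fusion = WeightedSceneFusion()
feats = {m: torch.randn(8, 512) for m in fusion.modalities}
masks = {m: (torch.rand(8) > 0.3).float() for m in fusion.modalities}  # some objects lack a modality
print(fusion(feats, masks).shape)  # torch.Size([512])
```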
Step 3: Unified Dimensionality Encoders
This is arguably the most critical contribution of the paper.
The mechanism described in Step 2 requires us to know where the objects are (instance segmentation) to extract their features. But in a real-world application—like a robot entering a new room—we don’t have labeled object bounding boxes. We just have a raw video feed or a raw LiDAR scan.
To solve this, CrossOver trains Unified Dimensionality Encoders. These are stand-alone networks that take raw, unsegmented scene data and try to predict the rich \(\mathbf{F}_S\) representation created in Step 2.
- 1D Encoder: Processes a set of text descriptions about the scene.
- 2D Encoder: Processes raw multi-view images or floorplans using DinoV2.
- 3D Encoder: Processes the raw scene point cloud with sparse convolutions (Minkowski Engine).
The training objective here is essentially a form of distillation: we want the raw 3D encoder to produce an embedding as good as the one from the complex, segmentation-dependent fusion model.
\[\mathcal{L}_{\text{scene}} = \alpha\, \ell\big(\mathbf{F}_{1D}, \mathbf{F}_S\big) + \beta\, \ell\big(\mathbf{F}_{2D}, \mathbf{F}_S\big) + \gamma\, \ell\big(\mathbf{F}_{3D}, \mathbf{F}_S\big)\]
Here, \(\alpha, \beta, \gamma\) are hyperparameters balancing the alignment of the 1D, 2D, and 3D encoders with the master scene embedding \(\mathbf{F}_S\).
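A minimal sketch of that objective, assuming batched scene embeddings from each unified encoder and a pre-computed \(\mathbf{F}_S\) per scene. The `align` term below is a simple cosine-distance stand-in for the contrastive loss \(\ell\) (given in full just below), and treating \(\mathbf{F}_S\) as a detached target is an assumption about the training setup, not a detail taken from the paper.

```python
import torch
import torch.nn.functional as F

def align(pred, target):
    """Cosine-distance stand-in for the contrastive term l(q, k) used in the paper."""
    return 1.0 - F.cosine_similarity(pred, target, dim=-1).mean()

def scene_level_loss(f_1d, f_2d, f_3d, f_scene, alpha=1.0, beta=1.0, gamma=1.0):
    """Pull each unified dimensionality encoder's output towards the fused scene
    embedding F_S. alpha/beta/gamma mirror the hyperparameters in the text;
    their actual values are not assumed here."""
    target = f_scene.detach()  # distillation-style: the fused embedding acts as a fixed teacher
    return (alpha * align(f_1d, target)
            + beta * align(f_2d, target)
            + gamma * align(f_3d, target))

B, D = 16, 512  # batch of scenes, embedding width (illustrative)
loss = scene_level_loss(torch.randn(B, D), torch.randn(B, D), torch.randn(B, D), torch.randn(B, D))
print(loss.item())
```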
The total loss combines the scene-level alignment with the instance-level alignment:
\[\mathcal{L} = \mathcal{L}_{\text{scene}} + \sum_{i} \mathcal{L}_{\mathcal{O}_i}\]
Finally, the contrastive loss used throughout these steps to measure similarity between query \(q\) and key \(k\) is the standard InfoNCE loss:
\[\ell(q, k) = -\log \frac{\exp\!\big(\mathrm{sim}(q, k^{+}) / \tau\big)}{\sum_{j=1}^{N} \exp\!\big(\mathrm{sim}(q, k_j) / \tau\big)}\]

where \(k^{+}\) is the positive key paired with \(q\), the sum runs over all keys in the batch, \(\mathrm{sim}\) is cosine similarity, and \(\tau\) is a temperature.
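Mapped to code, this is the same helper used in the earlier sketches, written out with the formula's pieces labeled. The temperature value and the use of in-batch negatives are common defaults rather than details taken from the paper.

```python
import torch
import torch.nn.functional as F

def info_nce(queries, keys, tau=0.07):
    """queries[i] and keys[i] form the positive pair (q, k+); all other keys in the
    batch serve as negatives. tau is the temperature (0.07 is a common default)."""
    q = F.normalize(queries, dim=-1)   # cosine similarity = dot product of unit vectors
    k = F.normalize(keys, dim=-1)
    logits = q @ k.t() / tau           # sim(q_i, k_j) / tau for every pair in the batch
    labels = torch.arange(q.size(0))   # the positive key for row i sits at column i
    # cross_entropy on row i computes -log( exp(logits[i, i]) / sum_j exp(logits[i, j]) )
    return F.cross_entropy(logits, labels)

print(info_nce(torch.randn(32, 512), torch.randn(32, 512)).item())
```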
Inference Pipeline
Once trained, the system is incredibly flexible. You can input a query in one modality (e.g., a Point Cloud) and retrieve matches in a completely different modality (e.g., a Floorplan).

As shown in Figure 3, the inference does not require object detection or segmentation. The query scene passes through its specific encoder (e.g., 3D Encoder for point clouds) to get a vector. The database contains vectors from other encoders (e.g., 2D Encoder for floorplans). A simple similarity check retrieves the best match.
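In code, inference reduces to a nearest-neighbour search over scene embeddings. The sketch below assumes the encoders have already been run offline; the embedding size, database size, and function names are illustrative.

```python
import torch
import torch.nn.functional as F

def retrieve(query_embedding, database_embeddings, top_k=5):
    """Rank database scenes by cosine similarity to a query scene embedding.
    The query may come from the 3D encoder (a raw point cloud) and the database
    from the 2D encoder (floorplans); both live in the same space, so plain
    similarity search is enough."""
    q = F.normalize(query_embedding, dim=-1)
    db = F.normalize(database_embeddings, dim=-1)
    scores = db @ q                                  # one similarity score per database scene
    return torch.topk(scores, k=min(top_k, db.size(0)))

# Illustrative stand-ins for encoder outputs (512-dim scene embeddings).
query = torch.randn(512)               # e.g. 3D encoder applied to a raw LiDAR scan
floorplan_db = torch.randn(100, 512)   # e.g. 2D encoder applied to 100 floorplans
values, indices = retrieve(query, floorplan_db)
print(indices.tolist())                # indices of the best-matching floorplans
```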
Experiments and Results
The researchers evaluated CrossOver on two major datasets: ScanNet and 3RScan. They tested the model on two primary tasks: Instance Retrieval (finding specific objects) and Scene Retrieval (finding specific rooms).
Emergent Behavior
One of the most fascinating results is the “emergent behavior.” Recall that during instance training, the model primarily aligned modalities to Images. It was not explicitly trained to map Point Clouds directly to Text.
However, because both Point Clouds and Text were mapped to the Image space, they end up aligned with each other through transitivity.
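A toy example (not from the paper) makes the transitivity argument concrete: if a point-cloud embedding and a text embedding each land close to the same image anchor, they are necessarily close to each other.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

anchor = torch.randn(512)                      # shared "image" embedding for one object
point_cloud = anchor + 0.1 * torch.randn(512)  # simulated: trained to sit near the anchor
text = anchor + 0.1 * torch.randn(512)         # simulated: trained to sit near the anchor
unrelated = torch.randn(512)                   # embedding of some unrelated object

cos = lambda a, b: F.cosine_similarity(a, b, dim=0).item()
print("point cloud vs. text:     ", round(cos(point_cloud, text), 3))       # close to 1.0
print("point cloud vs. unrelated:", round(cos(point_cloud, unrelated), 3))  # near 0.0
```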

Figure 4(a) shows this clearly. The green area represents CrossOver. Even for pairs like Point Cloud \(\to\) Mesh or Point Cloud \(\to\) Text (where explicit pairwise training was absent or indirect), CrossOver significantly outperforms baselines like ULIP-2 and PointBind. This proves the shared embedding space is robust and semantically meaningful.
Cross-Modal Scene Retrieval
Can CrossOver find a matching floorplan given a point cloud?
The quantitative results are striking. In Figure 6 below, we see the recall rates (the percentage of times the correct match was found in the top K results).

CrossOver (green dotted line) dominates the charts. Whether mapping Images to Point Clouds (\(\mathcal{I} \to \mathcal{P}\)) or Point Clouds to Text (\(\mathcal{P} \to \mathcal{R}\)), it achieves significantly higher recall than state-of-the-art baselines.
The qualitative results bring these numbers to life.

In Figure 5, the query is a Floorplan (bottom left). The goal is to find the corresponding 3D Scan.
- PointBind (an existing method) fails, retrieving unrelated scenes.
- CrossOver correctly identifies the exact room as its Top-1 match (green checkmark).
The visualization shows that CrossOver understands the layout—the arrangement of furniture and structure—rather than just matching isolated object textures.
Handling Missing Modalities
In real-world scenarios, data is rarely complete. The researchers simulated this by training with non-overlapping data chunks—for example, training Image-to-PointCloud on one half of the dataset and Image-to-Mesh on the other half.

Table 4 shows that even when the model has never seen a specific PointCloud-Mesh pair sharing the same image during training (non-overlapping), the performance remains high. This “modality bridging” capability implies that CrossOver effectively learns the underlying geometry and semantics of the scene, independent of the specific sensor used to capture it.
Conclusion
CrossOver represents a significant step forward in 3D scene understanding. By moving away from rigid, object-level alignment and embracing a flexible, scene-level embedding space, it solves several practical problems in computer vision.
Key Takeaways:
- Modality Agnosticism: It treats Images, Point Clouds, Floorplans, and Text as different views of the same underlying reality.
- No Segmentation Needed: The unified dimensionality encoders allow the model to work on raw sensor data during inference.
- Emergent Alignment: Aligning diverse data types to a common anchor (Images) allows the model to “learn” relationships it wasn’t explicitly taught, such as connecting text directly to 3D geometry.
For students and researchers, CrossOver demonstrates the power of contrastive learning when applied with thoughtful architectural choices. It enables applications where digital twins (CAD/Floorplans) can be seamlessly linked with physical reality (Video/LiDAR), opening new doors for augmented reality, robotics, and automated construction monitoring.