Imagine you are a robot navigating a kitchen. You scan the room and perfectly identify a refrigerator, a cabinet, and a sink. You know exactly where they are located in 3D space. But now, you are given a command: “Open the fridge.”
Suddenly, your perfect geometric map is insufficient. You know where the fridge is, but do you know how to interact with it? Do you know which specific handle belongs to the fridge door? Do you understand that pulling that handle causes the door to swing open? Or consider a more complex command: “Turn on the ceiling light.” You can see the light fixture, but the switch is on a wall three meters away. To a standard 3D perception system, there is no physical link between that switch and that light.
This is the gap between spatial understanding (knowing where things are) and functional understanding (knowing how things work).
In a new paper titled “Open-Vocabulary Functional 3D Scene Graphs for Real-World Indoor Spaces,” researchers from Tsinghua University, ETH Zürich, and MPI for Informatics introduce a groundbreaking method called OpenFunGraph. Their work moves beyond static maps, proposing a way to generate “Functional 3D Scene Graphs” that capture the interactive dynamics of real-world environments.

The Problem: Why Spatial Maps Aren’t Enough
For years, the gold standard in machine perception has been the 3D Scene Graph (3DSG). A traditional 3DSG is a data structure where nodes represent objects (e.g., chair, table) and edges represent spatial relationships (e.g., “chair is next to table”).
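To make that structure concrete, here is a minimal Python sketch of a purely spatial scene graph; the class names are invented for illustration and are not taken from the paper or any library:

```python
from dataclasses import dataclass, field

@dataclass
class ObjectNode:
    """One detected object and where it sits in the room."""
    label: str       # e.g. "chair", "fridge"
    center: tuple    # 3D position (x, y, z) in metres

@dataclass
class SpatialEdge:
    """A purely geometric relationship between two objects."""
    source: ObjectNode
    target: ObjectNode
    relation: str    # e.g. "next to", "on top of"

@dataclass
class SceneGraph3D:
    nodes: list = field(default_factory=list)
    edges: list = field(default_factory=list)

# A toy kitchen: the graph tells us where things are...
chair = ObjectNode("chair", (1.0, 0.0, 0.4))
table = ObjectNode("table", (1.2, 0.1, 0.7))
graph = SceneGraph3D(nodes=[chair, table],
                     edges=[SpatialEdge(chair, table, "next to")])
# ...but nothing in it says what any of these objects can do.
```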
While impressive, these graphs have two major limitations:
- They ignore small interactive parts: They might detect a “door,” but they often miss the “handle.”
- They lack functional logic: They describe proximity (“the switch is on the wall”) rather than causality (“the switch controls the light”).
The researchers argue that for an agent to truly be useful—whether it’s a household robot or an advanced AI assistant—it needs to perceive affordances (possibilities for action). It needs to understand that a knob turns, a button presses, and a handle pulls, and importantly, what effect those actions have on the environment.
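A functional scene graph keeps those spatial nodes but adds a second kind of edge that encodes cause and effect rather than proximity. Continuing the illustrative sketch above, one way to represent this might be:

```python
# Reuses ObjectNode and the dataclass import from the sketch above.
@dataclass
class FunctionalEdge:
    """What operating `element` does to `target`."""
    element: ObjectNode   # e.g. a wall switch
    target: ObjectNode    # e.g. the ceiling light, possibly metres away
    effect: str           # e.g. "turns on", "opens"

switch = ObjectNode("wall switch", (0.0, 1.2, 1.1))
light = ObjectNode("ceiling light", (2.5, 1.5, 2.4))
functional_edges = [FunctionalEdge(switch, light, "turns on")]
# A spatial edge would only say the switch is "on the wall";
# this edge says what happens when you flip it.
```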
The Challenge of Functionality
Creating a functional map is significantly harder than creating a spatial one. Why?
- Data Scarcity: We have massive datasets of 3D objects, but very few datasets explicitly label functional relationships (e.g., lines connecting switches to lights).
- Small Objects: Interactive elements like buttons and knobs are tiny compared to furniture, making them hard for standard computer vision models to detect.
- Invisible Connections: Remote relationships (like a remote control operating a TV) have no visual link. You can’t “see” the connection; you have to infer it based on common sense or observation.
To solve this, the authors introduce OpenFunGraph, an open-vocabulary pipeline that leverages the massive knowledge embedded in modern Foundation Models.
The Solution: OpenFunGraph
The core insight of this paper is that we don’t need to train a model from scratch to learn physics or common sense. Large Language Models (LLMs) like GPT-4 and Visual Language Models (VLMs) already “know” that switches usually control lights and handles usually open doors.
The OpenFunGraph architecture, illustrated below, harnesses this pre-existing knowledge to build functional graphs from standard RGB-D (color + depth) video sequences.

The method operates in three distinct stages:
1. Adaptive Node Detection (Finding the Parts)
The first step is to find everything in the scene. The researchers use a progressive detection strategy.
Instead of trying to find everything at once, they start big. They use a model called RAM++ to identify large objects (e.g., “cabinet”). Then, they ask an LLM (GPT-4) a crucial question: “What interactive elements usually belong to a cabinet?” The LLM might suggest “knob” or “handle.”
Armed with these specific suggestions, they prompt a detection model (GroundingDINO) to hunt for those specific small parts within the area of the large object. This “zoom-in” approach allows them to detect tiny interactive elements that generic detectors would usually miss. These detections are then fused into 3D space to create the “nodes” of the graph.
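A rough sketch of this progressive loop might look like the following. Note that `ram_plus_tag`, `ask_llm`, `grounding_dino_detect`, `crop_image`, and `lift_to_3d` are hypothetical wrapper functions standing in for RAM++, GPT-4, GroundingDINO, and RGB-D fusion; this is not the authors' actual code or API:

```python
# Illustrative sketch of the progressive "zoom-in" detection loop.
def detect_nodes(rgb_frame, depth_frame, camera_pose):
    nodes = []

    # 1. Detect the large, easy-to-see objects first (RAM++-style tagging).
    large_objects = ram_plus_tag(rgb_frame)   # detections with .label, .bbox, .mask

    for obj in large_objects:
        nodes.append(lift_to_3d(obj.mask, depth_frame, camera_pose))

        # 2. Ask an LLM which interactive parts this kind of object usually has.
        candidate_parts = ask_llm(
            f"What interactive elements (handles, knobs, buttons, ...) "
            f"usually belong to a {obj.label}? Answer as a short list."
        )

        # 3. Zoom in: search for those parts only inside the object's region
        #    with an open-vocabulary detector (GroundingDINO-style).
        crop = crop_image(rgb_frame, obj.bbox)
        for part in candidate_parts:
            for det in grounding_dino_detect(crop, text_prompt=part):
                nodes.append(lift_to_3d(det.mask, depth_frame, camera_pose))

    return nodes
```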
2. Node Description (Describing the Parts)
Once the system has identified a node (say, a specific switch), it needs to understand it. A simple label isn’t enough.
The system generates multi-view captions for each object. For small elements, it uses a clever trick: it crops the image around the element and draws a red box around it to focus the VLM’s attention. The VLM then describes the object in natural language (e.g., “a white rocker switch mounted on a beige wall”). An LLM summarizes these descriptions into a concise, informative caption for every node in the graph.
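In pseudocode-style Python, the captioning step might look like this; all helper functions (`draw_red_box`, `expand_box`, `crop_image`, `vlm_caption`, `ask_llm`) are hypothetical stand-ins for simple image annotation, a VLM, and an LLM summarizer:

```python
# Sketch of the node-description step, not the paper's implementation.
def describe_node(node, keyframes):
    captions = []
    for frame in keyframes:                   # views in which the node is visible
        box = node.bbox_2d(frame)             # where the element appears in this view
        annotated = draw_red_box(frame, box)  # visual prompt to focus the VLM
        crop = crop_image(annotated, expand_box(box, margin=0.5))  # zoom in, keep context
        captions.append(vlm_caption(
            crop, "Describe the element inside the red box and its surroundings."))

    # Condense the per-view captions into one informative node description.
    return ask_llm(
        "Summarize these observations of the same object into one caption:\n"
        + "\n".join(captions)
    )
```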
3. Functional Relationship Reasoning (Connecting the Parts)
This is the most innovative part of the pipeline. The system needs to draw “edges” between nodes representing functional links. It does this using Sequential Reasoning:
Step A: Local Relationships (Physical Connections). First, the system looks for objects that are physically touching. If a “handle” spatially overlaps with a “drawer,” the system feeds the descriptions of both to an LLM and asks: “Is it likely that this handle opens this drawer?” The LLM uses its common sense to confirm or reject the relationship.
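A minimal sketch of this local check, assuming hypothetical `boxes_overlap` and `ask_llm` helpers and an illustrative prompt:

```python
# Sketch of the local (physical-contact) reasoning step; the prompt
# wording is illustrative, not the paper's.
def infer_local_edges(interactive_elements, objects):
    edges = []
    for elem in interactive_elements:
        for obj in objects:
            if not boxes_overlap(elem.bbox_3d, obj.bbox_3d):
                continue  # only consider pairs that are physically in contact
            question = (
                f"Element: {elem.caption}\nObject: {obj.caption}\n"
                "Is it likely that operating this element affects this object? "
                "Answer yes or no, then name the action (e.g. 'opens', 'turns on')."
            )
            answer = ask_llm(question)
            if answer.lower().startswith("yes"):
                edges.append((elem, obj, answer))
    return edges
```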
Step B: Remote Relationships (Invisible Connections). Next, the system tackles the harder problem: things that aren’t touching. This is the Confidence-Aware Remote Reasoning step (sketched in code after the list below).
- The system takes an unassigned interactive element (e.g., a wall switch).
- It asks the LLM to identify potential targets in the room (e.g., ceiling light, table lamp, fan).
- It then uses a VLM to verify visual clues (e.g., “Is the lamp plugged into an outlet near this switch?”).
- Finally, the LLM assigns a confidence score to each possible connection. If the confidence is high enough (e.g., “This is the only ceiling light in the room, so the wall switch likely controls it”), the edge is created.
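Putting the steps above together, a simplified version of this loop might look like the following; the helper names, prompts, and confidence threshold are all illustrative assumptions rather than the paper's implementation:

```python
# Sketch of confidence-aware remote reasoning for elements that touch
# nothing they could plausibly control. ask_llm and vlm_verify are
# hypothetical helpers.
CONFIDENCE_THRESHOLD = 0.7   # assumed value for illustration

def infer_remote_edges(unassigned_elements, scene_nodes, keyframes):
    edges = []
    for elem in unassigned_elements:   # e.g. a wall switch with no local match
        # 1. Let the LLM propose plausible targets from the whole scene.
        candidates = ask_llm(
            f"Element: {elem.caption}\n"
            f"Objects in the room: {[n.caption for n in scene_nodes]}\n"
            "Which of these objects could this element plausibly control?"
        )
        for target in candidates:
            # 2. Ask a VLM to check visual clues (wiring, outlets, placement).
            evidence = vlm_verify(
                keyframes, elem, target,
                "Do the images support this element controlling this object?")
            # 3. Have the LLM turn the evidence into a confidence score in [0, 1].
            confidence = float(ask_llm(
                f"Given the evidence '{evidence}', how confident (0 to 1) are you "
                f"that {elem.caption} controls {target.caption}? Answer with a number."))
            if confidence >= CONFIDENCE_THRESHOLD:
                edges.append((elem, target, confidence))
    return edges
```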
A New Benchmark: The FunGraph3D Dataset
To test their method, the researchers couldn’t rely on existing datasets—they simply didn’t have the necessary functional labels. So, they built their own.
They introduced FunGraph3D, a dataset captured with high-fidelity laser scanners paired with photorealistic color imagery.

What makes this dataset special is the ground truth. The researchers didn’t just label objects; they labeled the interaction graphs. They even collected egocentric videos (using an Apple Vision Pro) of people actually interacting with the scenes—flipping switches and opening doors—to ensure the ground truth labels were accurate.
As seen below, the dataset covers a variety of complex real-world environments, from kitchens to living rooms, all annotated with rich functional data.

Experimental Results
So, how well does OpenFunGraph work? The researchers compared it against adapted versions of state-of-the-art baselines like Open3DSG and ConceptGraph.
The results were decisive. OpenFunGraph significantly outperformed the baselines in both detecting interactive elements and correctly predicting functional relationships.
- Node Detection: Baselines like ConceptGraph are great at finding big furniture but fail miserably at finding small knobs and switches. OpenFunGraph’s progressive prompting strategy allowed it to recall vastly more interactive elements.
- Relationship Prediction: Because standard scene graphs focus on spatial edges (proximity), they struggle to infer function. OpenFunGraph’s use of LLM-driven common sense reasoning allowed it to correctly link switches to lights and handles to drawers with much higher accuracy.
Qualitatively, the difference is stark. In the image below, you can see OpenFunGraph correctly inferring that a specific switch controls the ceiling light (indicated by the orange line), and distinguishing between different handles for different storage units.

Why Does This Matter? Downstream Applications
The ultimate goal of this research isn’t just to make pretty graphs; it’s to enable robots to act. The researchers demonstrated this utility through two downstream tasks: 3D Question Answering and Robotic Manipulation.
Because the functional scene graph is essentially a structured database of how the room works, an LLM can query it to answer complex user questions.
- User: “How do I turn on the light?”
- System: “You can use the switch plate located by the door.” (The system knows which specific switch is linked to the light).
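To see why the graph makes this easy, here is a toy, self-contained example that answers the question by traversing invented functional edges; the actual system hands the structured graph to an LLM rather than using hand-written traversal code:

```python
from dataclasses import dataclass

@dataclass
class Node:
    caption: str

# A tiny hand-built functional graph (invented data for illustration).
ceiling_light = Node("ceiling light")
wall_switch = Node("switch plate by the door")
functional_edges = [(wall_switch, ceiling_light, "turns on")]

def find_controller(edges, target_keyword):
    """Return the interactive element whose functional edge points at the target."""
    for elem, obj, relation in edges:
        if target_keyword in obj.caption:
            return elem, relation
    return None, None

elem, relation = find_controller(functional_edges, "light")
if elem is not None:
    print(f"You can use the {elem.caption}; it {relation} the {ceiling_light.caption}.")
```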
Furthermore, this graph can guide robotic motion planning.

In the demonstration above, a user gives a command: “Turn on the light.” The robot queries the functional graph, identifies the specific switch connected to the ceiling light, navigates to its location, and manipulates it. Without the functional edge predicted by OpenFunGraph, the robot might know where the light is, but it would have no idea how to activate it.
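As a final illustration, the same lookup could feed a robot's skill primitives. In the hypothetical glue code below, `find_controller` is the toy lookup from the QA sketch above, and `navigate_to` and `actuate` are placeholder robot skills, not part of the paper's stack:

```python
def turn_on_light(functional_edges):
    elem, relation = find_controller(functional_edges, "light")
    if elem is None:
        return "No known control for the light in this scene."
    navigate_to(elem)   # drive to the switch, not to the light fixture
    actuate(elem)       # press or flip, chosen from the element's description
    return f"Done: used the {elem.caption}, which {relation} the ceiling light."
```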
Conclusion
OpenFunGraph represents a significant step forward in 3D scene understanding. By moving from purely spatial representations to functional ones, it equips AI agents with a deeper, more human-like understanding of their environment.
The method cleverly bypasses the need for massive labeled datasets by distilling the common-sense physics knowledge already present in foundation models. While challenges remain—such as handling ambiguous scenarios where multiple switches look identical—this work lays the foundation for a future where robots don’t just occupy our space, but truly understand how to live and work within it.