Introduction
In the world of robotics, answering the question “Where am I relative to you?” is surprisingly difficult. This problem, known as visual relative pose estimation, is fundamental for multi-robot systems. Whether it’s a swarm of drones coordinating a light show or warehouse robots avoiding collisions, robots need to know the position and orientation (pose) of their peers.
Traditionally, teaching a robot to estimate pose from a camera image requires a heavy dose of supervision. You usually have two expensive options:
- Motion Capture Systems: You set up a room with expensive infrared cameras (like a Vicon system) to track the exact position of the robots and use that data to train your neural networks. This is costly and restricted to the lab.
- CAD Models: You use a 3D digital model of the robot to generate synthetic training images. While cheaper, this suffers from the “sim-to-real” gap—robots rarely look exactly like their perfect digital twins due to lighting, wear and tear, or messy cables.
But what if a robot could learn to see its peers without any external cameras, human labels, or 3D models? What if two robots could simply drive around a room, flash some lights at each other, and teach themselves pose estimation from scratch?
That is exactly the premise of the paper “Self-supervised Learning of Visual Pose Estimation Without Pose Labels by Classifying LED States.” The researchers propose a clever self-supervised method where a neural network learns the complex geometry of a robot not by being told where the robot is, but by trying to guess which LEDs on the robot are currently lit up.
In this deep dive, we will explore how a simple “pretext task”—predicting light states—forces a model to learn high-level concepts like distance, orientation, and position, all without a single ground-truth pose label.
Background: The Challenge of Self-Supervision
Before understanding the method, we need to clarify the learning paradigm. Most standard computer vision tasks use Supervised Learning. You show the computer an image of a robot and tell it, “This robot is at coordinates \((x, y)\) and rotated 30 degrees.” Do this thousands of times, and the computer learns.
Self-Supervised Learning (SSL) changes the game. In SSL, the data provides its own supervision. A common technique is the Pretext Task. You ask the model to solve a made-up problem (the pretext) that forces it to learn features useful for the actual problem (the downstream task).
In this paper, the downstream task is Pose Estimation (finding the robot). The pretext task is LED State Classification (is the front light on or off?).
The intuition is brilliant in its simplicity:
- To know if the front LED is visible, the model must understand the robot’s orientation.
- To know if any LED is on, the model must find the robot’s position in the image.
- To distinguish the LEDs from background noise, the model must understand the robot’s scale (distance).
The Core Concept: Learning from Blinking Lights
The researchers’ setup involves two robots. One is the observer (equipped with a camera), and the other is the target (equipped with LEDs).
The robots move randomly around a room. The target robot randomly toggles its LEDs (Front, Back, Left, Right) on and off. Crucially, the target robot broadcasts its LED states via radio to the observer.
This creates a perfectly synchronized dataset. The observer has an image, and it has a label (e.g., “Front LED is ON, Back LED is OFF”). It does not know where the robot is. It only knows the lights.
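To make that concrete, here is a minimal sketch of what one training sample could contain. The field names and the four-LED layout follow the description above; they are illustrative, not the authors' actual data format:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class TrainingSample:
    """One synchronized observation: a camera frame plus the broadcast LED states.

    Note: there is no pose field at all. Position, distance, and orientation
    never appear in the training labels.
    """
    image: np.ndarray            # H x W x 3 camera frame from the observer robot
    led_states: dict[str, bool]  # ground-truth LED states broadcast by the target

# Example: the target reports "front ON, back OFF, left OFF, right ON".
sample = TrainingSample(
    image=np.zeros((360, 640, 3), dtype=np.uint8),  # placeholder frame
    led_states={"front": True, "back": False, "left": False, "right": True},
)
print(sample.led_states)
```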

As shown in Figure 1, the model takes the image as input. By trying to classify the state of the LEDs (blue for off, red for on), the model implicitly learns the variables needed to solve that problem: the position (\(u, v\)), the distance (\(d\)), and the bearing/orientation (\(\psi\)).
Why does this work?
If you ask a neural network, “Is the back light on?”, and the robot is facing the camera, the network will struggle to see the back light because the robot’s body blocks it. The only way to minimize the error is to understand that when the robot faces the camera, the back light is occluded. Therefore, the network learns the concept of orientation to predict the visibility of the lights.
Similarly, if the robot is far away, the LEDs are tiny clusters of pixels. If it is close, they are large blobs. To correctly identify them, the network must understand scale and distance.
The Method: Architecture and Loss Function
The researchers designed a Fully Convolutional Network (FCN) that outputs several “maps” rather than single values. Let’s break down the architecture and the logic step-by-step.
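As a rough mental model, here is a minimal PyTorch-style sketch of such a network with map-shaped heads. The layer sizes, the downsampling factor, and the cosine/sine orientation encoding are assumptions made for illustration, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class PoseFromLEDsFCN(nn.Module):
    """Illustrative fully convolutional network with map-shaped outputs.

    The real architecture differs; the point is that every head produces a
    spatial map over a coarse grid (H', W'), not a single vector of numbers.
    """
    def __init__(self, num_leds: int = 4):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.presence_head = nn.Conv2d(64, 1, 1)      # -> P_hat (robot location map)
        self.orientation_head = nn.Conv2d(64, 2, 1)   # -> (cos psi, sin psi) per cell
        self.led_head = nn.Conv2d(64, num_leds, 1)    # -> one ON/OFF map per LED

    def forward(self, x: torch.Tensor):
        feats = self.backbone(x)
        p_map = self.presence_head(feats)                # B x 1 x H' x W'
        psi_map = self.orientation_head(feats)           # B x 2 x H' x W'
        led_maps = torch.sigmoid(self.led_head(feats))   # B x K x H' x W'
        return p_map, psi_map, led_maps

model = PoseFromLEDsFCN()
p_map, psi_map, led_maps = model(torch.randn(1, 3, 360, 640))
print(p_map.shape, psi_map.shape, led_maps.shape)
```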
1. Localization: Where is the robot?
The model outputs a spatial map called \(\hat{P}\). This is a grid covering the image where each cell represents a probability of the robot being there.
However, we don’t have labels for where the robot is. We only have LED labels. The researchers use a spatial attention mechanism. The model learns that to correctly guess the LED state, it should “look” at the pixels where the robot actually is.
- If the model looks at the wall, it can’t predict if the LED is on. The loss (error) will be high.
- If the model looks at the robot, it sees the LED. The loss is low.
Mathematically, the model learns to increase the values in the \(\hat{P}\) map at the robot’s location to minimize its classification error.
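One way to picture this mechanism is as attention-weighted pooling: the per-cell LED predictions are collapsed into a single image-level prediction using \(\hat{P}\) as the weights. A minimal sketch, assuming the map-shaped outputs from the network sketch above:

```python
import torch

def attention_pooled_led_prediction(led_maps: torch.Tensor,
                                    p_map: torch.Tensor) -> torch.Tensor:
    """Pool per-cell LED predictions into one image-level prediction per LED.

    led_maps: B x K x H' x W'  per-cell probability that each LED is ON
    p_map:    B x 1 x H' x W'  unnormalized robot-location scores (P_hat)

    The location map is softmax-normalized over all cells and used as an
    attention weight: cells the model believes contain the robot dominate the
    final LED prediction. Getting the LEDs right therefore requires putting
    attention weight on the robot itself.
    """
    b, k, h, w = led_maps.shape
    attn = torch.softmax(p_map.view(b, 1, h * w), dim=-1)      # B x 1 x H'W'
    preds = (led_maps.view(b, k, h * w) * attn).sum(dim=-1)    # B x K
    return preds
```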
2. Orientation: Which way is it facing?
The model also outputs an orientation map \(\hat{\Psi}\). Each cell contains a predicted angle.
Here is the clever part: The researchers use a Visibility Function. They assume they know the approximate direction each LED faces relative to the robot.

Look at Figure 3. This graph shows the visibility of different LEDs based on the robot’s yaw angle (heading).
- If the robot is at \(0^{\circ}\) (facing away), the Back LED (solid black line) is most visible.
- If the robot rotates to \(90^{\circ}\), the Right LED (dashed line) becomes visible.
The model uses its predicted orientation \(\hat{\psi}\) to look up these visibility weights. It effectively says: “I think the robot is facing 90 degrees, so I should trust the Right LED’s status and ignore the Left LED’s status because it’s probably blocked.”
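A small sketch of what such a visibility function could look like, following the convention from Figure 3 (at \(0^{\circ}\) the robot faces away, so the back LED peaks). The exact peak angles and the clipped-cosine model are illustrative assumptions; the method only needs to know roughly which way each LED points on the robot body:

```python
import numpy as np

# Yaw angle at which each LED is most visible (convention as above:
# 0 deg = robot facing away from the camera). Illustrative values.
LED_PEAK_YAW = {"back": 0.0, "right": 90.0, "front": 180.0, "left": 270.0}

def led_visibility(psi_deg: float, led: str) -> float:
    """Approximate visibility weight in [0, 1] for an LED at relative yaw psi.

    ~1 when the LED faces the camera, ~0 when the robot body occludes it.
    """
    delta = np.deg2rad(psi_deg - LED_PEAK_YAW[led])
    return float(max(0.0, np.cos(delta)))

# Robot at 0 deg (facing away): the back LED is fully visible, the front is occluded.
print(led_visibility(0.0, "back"))    # 1.0
print(led_visibility(0.0, "front"))   # 0.0
print(led_visibility(90.0, "right"))  # 1.0
```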
3. Distance: The Multi-Scale Approach
Estimating distance from a single camera (monocular vision) is notoriously hard because of the ambiguity between “small object close up” and “large object far away.” However, since the robot’s physical size is known and constant, its apparent size in the image is inversely related to its distance: the smaller it looks, the farther away it is.
The researchers use a multi-scale strategy. They feed the image into the network at three different sizes (scales).

As visualized in Figure 2:
- The network has a fixed “Receptive Field” (RF)—think of this as the size of the magnifying glass it uses to look at the image.
- If the robot is far away, it fits inside the RF when the image is full size.
- If the robot is close (huge in the frame), it might only fit inside the RF when the image is downscaled (shrunk).
By checking which image scale yields the most confident detection, the model can estimate the distance. If the robot is detected best in the tiny image, it must be close (large). If it’s detected best in the large image, it must be far (small).
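A minimal sketch of that readout, assuming hypothetical scale factors and a hypothetical reference distance at which the robot exactly fills the receptive field at full resolution:

```python
import numpy as np

# Assumed downscaling factors and reference distance (not values from the paper).
SCALES = [1.0, 2.0, 4.0]   # 1.0 = full resolution, 4.0 = image shrunk 4x
REF_DISTANCE_M = 3.0       # distance at which the robot fills the RF at scale 1.0

def estimate_distance(confidences: list[float]) -> float:
    """Pick the scale with the most confident detection and map it to a distance.

    A robot detected best in the heavily downscaled image must be close (it
    appears large); one detected best at full resolution must be far away.
    """
    best = int(np.argmax(confidences))
    # Apparent size scales inversely with distance, so a 4x-shrunk detection
    # corresponds to a robot roughly 4x closer than the reference distance.
    return REF_DISTANCE_M / SCALES[best]

# Peak confidence at scale 4 (the tiny image) -> the robot is close.
print(estimate_distance([0.2, 0.5, 0.9]))  # 0.75 m
# Peak confidence at scale 1 (full resolution) -> the robot is far.
print(estimate_distance([0.9, 0.4, 0.1]))  # 3.0 m
```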
4. Putting it together: The Loss Function
The training process optimizes a composite loss function. The goal is to minimize the error between the predicted LED states and the broadcasted (true) LED states.

This equation might look intimidating, but it summarizes the logic above:
- It sums over all LEDs (\(K\)), all scales (\(S\)), and all pixels (\(H', W'\)).
- \(\mathcal{L}_{\mathrm{ms}}^{k,s}\) computes the error between the prediction and the reality, weighted by where the model thinks the robot is (\(\hat{P}\)) and how visible that LED should be based on orientation.
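Here is a hedged PyTorch sketch of how a loss with those ingredients could be assembled: per-pixel binary cross-entropy against the broadcast LED states, weighted by the location map and the orientation-derived visibility, and summed over LEDs, scales, and cells. It follows the description above rather than the authors' exact formulation:

```python
import torch
import torch.nn.functional as F

def composite_led_loss(led_maps_per_scale, p_maps_per_scale,
                       visibility_per_scale, led_targets):
    """Sketch of a loss built from the ingredients described above.

    led_maps_per_scale:   list over scales of B x K x H' x W' LED-ON probabilities
    p_maps_per_scale:     list over scales of B x 1 x H' x W' location scores (P_hat)
    visibility_per_scale: list over scales of B x K x H' x W' visibility weights
                          derived from the predicted orientation map
    led_targets:          B x K true LED states broadcast over the radio (0/1)
    """
    total = 0.0
    for led_maps, p_map, vis in zip(led_maps_per_scale,
                                    p_maps_per_scale,
                                    visibility_per_scale):
        b, k, h, w = led_maps.shape
        attn = torch.softmax(p_map.view(b, 1, h * w), dim=-1).view(b, 1, h, w)
        target = led_targets.float().view(b, k, 1, 1).expand_as(led_maps)
        # Per-pixel binary cross-entropy against the broadcast LED state...
        bce = F.binary_cross_entropy(led_maps, target, reduction="none")
        # ...down-weighted where the model believes the robot is absent (attn)
        # and where the LED should be occluded anyway (vis).
        total = total + (attn * vis * bce).sum(dim=(2, 3)).mean()
    return total
```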
Experimental Setup
To prove this works, the authors let two robots (DJI RoboMaster S1) loose in a laboratory, a gym, a classroom, and a break room.
- Data Collection: The robots drove randomly. 77% of the time, they didn’t even see each other! This is realistic—robots won’t always be in the frame.
- Validation: For testing purposes only, they used a motion capture system to get the “ground truth” to verify their model’s accuracy. The model never saw this data during training.

Figure 5 shows what the robot sees. Notice how difficult some of these shots are—cluttered backgrounds, different lighting, and the robot appears at various distances. The labels on the right (e.g., F: blue, B: red) are the only information the model gets.
Results: How well does it work?
The results are striking. The self-supervised model performs nearly as well as fully supervised approaches that require expensive tracking systems.
1. Comparison with Baselines
Let’s look at the numbers.

In Table 1, compare “Ours” with “Upperbound” and “CNOS”:
- \(E_{uv}\) (Position Error): Our method misses by about 17 pixels. The Upperbound (supervised) misses by 18. Our self-supervised method is actually slightly better here.
- \(E_{\psi}\) (Orientation Error): Our error is \(17^{\circ}\) versus the Upperbound’s \(14^{\circ}\). Extremely close.
- \(E_d\) (Distance Error): Here, the gap is wider (24% vs 11%). This is due to the “step-function” nature of the multi-scale distance estimation, which we will discuss next.
Crucially, compare “Ours” to “Mean Predictor” (guessing the average). The improvement is massive. It also outperforms “CNOS,” a state-of-the-art method that uses CAD models, in orientation and distance accuracy.
2. The Distance Limitation
While position and orientation are highly accurate, distance estimation shows a specific quirk.

Figure 6 (bottom right graph) reveals a “staircase” pattern in the distance (\(d\)) predictions. This happens because the model estimates distance based on discrete image scales (Scale 1, Scale 2, Scale 4). It categorizes the robot into “Close,” “Medium,” or “Far.” It struggles to predict continuous values between those steps. The authors note this can be fixed by using more scales at inference time.
3. Generalization and Multi-Robot Scenarios
One of the strongest arguments for this method is that it doesn’t overfit to a specific room. Because it learns the visual features of the robot (wheels, chassis, lights) rather than memorizing the background, it works in new environments.
Furthermore, even though it was trained on images with only one robot, it can handle images with multiple robots at inference time.

Figure 8 shows qualitative results:
- 1-4: Standard lab setting.
- 5-8: Out of domain (gyms, classrooms). The model still finds the boxy outline of the robot.
- 9-10: Multi-robot scenarios. The probability map \(\hat{P}\) simply develops two peaks instead of one, allowing the system to track multiple peers simultaneously.
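A minimal sketch of how multiple peers could be read out of \(\hat{P}\) with simple local-maximum detection. The threshold and neighborhood size are illustrative choices, not values from the paper:

```python
import numpy as np
from scipy.ndimage import maximum_filter

def find_robot_peaks(p_map: np.ndarray, threshold: float = 0.5,
                     neighborhood: int = 5):
    """Return (row, col) cells of P_hat that are local maxima above a threshold.

    With one robot in view this yields a single peak; with several robots the
    map develops several peaks, each handled independently.
    """
    local_max = (p_map == maximum_filter(p_map, size=neighborhood))
    peaks = np.argwhere(local_max & (p_map > threshold))
    return [(int(r), int(c)) for r, c in peaks]

# Toy map with two distinct blobs -> two detected peers.
toy = np.zeros((45, 80))
toy[10, 20] = 0.9
toy[30, 60] = 0.8
print(find_robot_peaks(toy))  # [(10, 20), (30, 60)]
```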
Inference: The Magic Trick
You might be asking, “Does the robot need to blink its LEDs forever for this to work?”
No. This is the most important takeaway.
The LED classification is just the training task. Once the network is trained, it has learned to interpret the appearance of the robot’s body to infer its pose. At inference time (deployment), the LEDs can be off, on, or broken—it doesn’t matter. The model looks at the image, generates the probability maps, and extracts the pose.
The authors verified this by testing on datasets where all LEDs were turned off, and the performance drop was negligible. The model had successfully learned the “concept” of the robot.
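To make the readout concrete, here is a hedged sketch of how a pose could be extracted from the trained maps at deployment: take the argmax of \(\hat{P}\) for the image position, read the orientation map at that cell, and use the best-responding scale as a coarse distance estimate. The cell size and the degree-valued orientation map are simplifying assumptions for illustration:

```python
import numpy as np

def extract_pose(p_map: np.ndarray, psi_map: np.ndarray,
                 scale_confidences: list[float],
                 cell_size_px: int = 8):
    """Read a single pose estimate out of the predicted maps (illustrative).

    p_map:   H' x W' robot-location map
    psi_map: H' x W' predicted orientation per cell (here in degrees for simplicity)
    scale_confidences: peak detection confidence at each image scale
    cell_size_px: image pixels covered by one map cell (assumed, tied to the
                  network's downsampling factor)
    """
    row, col = np.unravel_index(np.argmax(p_map), p_map.shape)
    u, v = col * cell_size_px, row * cell_size_px   # image position (u, v)
    psi = psi_map[row, col]                         # relative orientation
    scale_idx = int(np.argmax(scale_confidences))   # coarse distance bucket
    return u, v, psi, scale_idx
```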
Conclusion
This paper presents a compelling step forward for autonomous robotics. By leveraging Self-Supervised Learning, the researchers eliminated the need for expensive motion capture systems or brittle CAD model matching.
Here are the key takeaways:
- Labels are everywhere if you look: We don’t always need humans to annotate data. The internal state of a robot (like its lights) can serve as a powerful training signal.
- Geometry emerges from classification: By forcing a neural network to predict visibility, the network implicitly learns 3D geometry (occlusion, perspective, and scale).
- Scalability: This method allows robots to “learn in the wild.” You could deploy a swarm of rovers on Mars, have them drive around flashing lights at each other, and they would learn to recognize one another without any prior training data.
This approach turns the constraint of “no ground truth” into a feature, paving the way for robots that learn to see the world simply by interacting with it.