Introduction

Imagine you are teaching a teenager to drive. You typically don’t give them a differential equation describing the distance to the curb or a physics formula for the friction coefficient of the road. Instead, you offer intuitive feedback: “That was a bit too close to the parked car,” or “Good job slowing down for that pedestrian.”

This human intuition is incredibly powerful, yet translating it into the mathematical language of robotics is notoriously difficult.

In traditional autonomous navigation, engineers often rely on hand-crafted reward functions. These are rigid mathematical rules (e.g., “Receive +10 points for moving forward, -100 points for hitting a wall”). While this works in structured environments, it often fails in the messy, unpredictable real world. Furthermore, these systems usually depend on expensive, power-hungry sensors like LiDAR to measure geometry perfectly.

But what if we could teach a robot to navigate using only a cheap camera and human-like intuition?

This is the premise behind a fascinating new paper titled HALO: Human Preference Aligned Offline Reward Learning for Robot Navigation. The researchers propose a method to capture human visual intuition and distill it into a reward model that a robot can use to navigate complex environments—from crowded sidewalks to glass-walled offices—using only RGB images.

In this post, we will break down how HALO works, the clever architecture behind it, and why “preference learning” might be the key to the next generation of social robots.

Background: The Challenge of the “Reward”

To understand why HALO is significant, we first need to understand the bottleneck in Reinforcement Learning (RL) for robotics.

In RL, an agent (the robot) learns to make decisions by trying to maximize a cumulative Reward. If the reward function is good, the robot learns great behavior. If the reward function is flawed, you get “reward hacking,” where the robot does something technically correct but practically useless (like spinning in circles to avoid dying, but never reaching the goal).

The Limits of Hand-Engineering and LiDAR

Traditionally, roboticists design these rewards manually. They also rely heavily on LiDAR (Light Detection and Ranging) to detect obstacles.

  1. LiDAR is expensive: it adds hardware cost, power draw, and integration complexity to every robot.
  2. Hand-crafting is hard: How do you write a mathematical equation for “don’t be rude to that pedestrian”? Or “drive on the sidewalk, not the grass”?

Offline RL and Preference Learning

HALO leverages Offline RL, which allows robots to learn from a static dataset of previously collected experiences (trajectories) rather than learning by trial-and-error in the real world (which is dangerous and slow).

However, Offline RL still needs a way to evaluate how “good” a specific action is. HALO solves this by using Human Preferences. Instead of defining a perfect score, HALO asks: Given these options, which one would a human prefer?

The Core Method: How HALO Works

HALO stands for Human Preference ALigned Offline Reward Learning. The goal is to train a neural network that takes an image (what the robot sees) and a proposed action (where the robot wants to go) and outputs a scalar score representing how safe and “human-like” that action is.

Let’s break down the architecture and the training process.

1. The Architecture: Action-Conditioned Vision

The core of HALO is a reward model that needs to look at an image and decide if a specific movement is a good idea. The researchers designed a clever architecture to achieve this.

Figure 1: Architecture of the proposed reward model.

As shown in the architecture diagram above, the process involves two parallel streams that merge:

  1. Visual Processing (Top Left): The robot’s view (\(I_t^{RGB}\)) is passed through a DINO-v2 encoder. DINO-v2 is a powerful, pre-trained Vision Transformer that is excellent at understanding semantic features in images (like distinguishing a road from a wall) without needing labeled training data.
  2. Action Masking (Bottom Left): This is the unique part. The model takes a candidate action \(a_t\) (a combination of linear velocity \(v\) and angular velocity \(\omega\)). It projects this action into the future to create a “trajectory mask”—essentially drawing a line on the image where the robot would go if it took that action.

Why is the mask important? If you look at a busy street, parts of the image are irrelevant to your immediate safety. By generating a mask of the intended path, the model can tell the visual encoder: “Focus specifically on this strip of pixels. Is there a rock here? A person? A hole?”
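
To make this concrete, here is a minimal sketch of how such a mask could be generated, assuming a simple unicycle motion model and a known ground-plane-to-image homography. The paper mentions a homography-projected mask, but the function name, default values, and rasterization below are my own illustration, not the authors' implementation.

```python
import numpy as np

def trajectory_mask(v, omega, H_ground_to_image, image_hw=(224, 224),
                    horizon_s=2.0, dt=0.1, radius_px=6):
    """Rasterize the path a unicycle robot would follow under action
    (v, omega) into a binary image mask, via a ground-to-image homography.
    Function name, defaults, and rasterization are illustrative only."""
    h, w = image_hw

    # 1. Roll out a simple unicycle model in the robot's local ground frame.
    x, y, theta = 0.0, 0.0, 0.0
    ground_pts = []
    for _ in range(int(horizon_s / dt)):
        x += v * np.cos(theta) * dt
        y += v * np.sin(theta) * dt
        theta += omega * dt
        ground_pts.append((x, y))
    ground_pts = np.asarray(ground_pts)                # (T, 2), metres

    # 2. Project the ground points into the image with the homography.
    pts_h = np.hstack([ground_pts, np.ones((len(ground_pts), 1))])
    pix = pts_h @ H_ground_to_image.T
    pix = pix[:, :2] / pix[:, 2:3]                     # (T, 2), pixels

    # 3. Rasterize: paint a small disc around every projected point.
    mask = np.zeros((h, w), dtype=np.float32)
    rows, cols = np.mgrid[0:h, 0:w]
    for u, v_px in pix:
        if 0 <= u < w and 0 <= v_px < h:
            mask[(cols - u) ** 2 + (rows - v_px) ** 2 <= radius_px ** 2] = 1.0
    return mask
```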

This mask is processed by a small CNN and used to weigh the visual features from DINO-v2. Finally, an MLP (Multi-Layer Perceptron) outputs the single reward score \(R(s_t, a_t)\).
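
Putting the pieces together, a rough PyTorch-style sketch of this action-conditioned design might look like the following. The layer sizes, the way the mask weights the DINO-v2 patch tokens, and how the action enters the MLP head are all assumptions for illustration, not the authors' exact architecture; the encoder is assumed to return patch tokens of shape (B, N, D).

```python
import torch
import torch.nn as nn

class ActionConditionedReward(nn.Module):
    """Sketch of a reward model in the spirit of Figure 1: a frozen visual
    encoder, a small CNN over the trajectory mask that weights the visual
    patch tokens, and an MLP head producing a scalar reward. All layer
    sizes and the fusion scheme are illustrative guesses."""

    def __init__(self, visual_encoder, feat_dim=384, num_patches=256):
        super().__init__()
        self.encoder = visual_encoder              # e.g. a frozen DINO-v2 ViT
        for p in self.encoder.parameters():        # backbone stays frozen
            p.requires_grad = False
        # Turn the 1-channel trajectory mask into one weight per patch token.
        self.mask_cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(16),              # (B, 32, 16, 16)
            nn.Flatten(),                          # (B, 32 * 256)
            nn.Linear(32 * 256, num_patches),      # (B, num_patches)
        )
        # MLP head over mask-weighted features plus the raw action (v, w).
        self.head = nn.Sequential(
            nn.Linear(feat_dim + 2, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, image, mask, action):
        # image: (B, 3, H, W); mask: (B, 1, H, W); action: (B, 2) = (v, w).
        with torch.no_grad():
            patch_feats = self.encoder(image)              # assumed (B, N, D)
        weights = self.mask_cnn(mask).softmax(dim=-1)      # (B, N)
        pooled = (weights.unsqueeze(-1) * patch_feats).sum(dim=1)   # (B, D)
        reward = self.head(torch.cat([pooled, action], dim=-1))
        return reward.squeeze(-1)                          # R(s_t, a_t)
```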

2. Capturing Human Intuition (The Data)

You cannot train a neural network without data. The researchers used the SCAND dataset (Socially CompliAnt Navigation Dataset) but added a twist. They manually annotated scenes with human preferences.

Instead of asking a human to give a score from 0 to 10 (which is subjective and noisy), they asked five binary questions about each annotated frame:

  1. Can the robot turn left?
  2. Can the robot turn right?
  3. Can the robot decelerate?
  4. Can the robot accelerate?
  5. Is the robot in danger/behaving sub-optimally?

This approach converts complex navigation intuition into simple Yes/No data points.
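
For illustration, one annotated frame could be stored as something like the record below. The field names are hypothetical, not the paper's schema; they simply mirror the five questions above.

```python
# Hypothetical annotation record for one frame (field names are my own).
frame_annotation = {
    "frame_id": "scand_000123",
    "can_turn_left":  True,
    "can_turn_right": False,
    "can_decelerate": True,
    "can_accelerate": False,
    "danger":         False,   # in danger / behaving sub-optimally
}
```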

3. From Binary Answers to Probability Distributions

How do you turn a “Yes” or “No” into a training signal? The authors convert these boolean responses into a probability distribution over the action space.

First, they define a local set of possible actions \((v, \omega)\):

Local Action Set Equation

They then assume the probability of an action being “good” depends on the user’s answers. They use a Boltzmann distribution. If a user says “Turn Left,” the distribution shifts to assign higher probabilities to actions with a positive angular velocity.

The probability of a specific velocity \(v\) and angle \(\omega\) given the user’s feedback \(\mathcal{U}\) is calculated as:

Probability distribution equations

Here, the “temperature” parameter \(\tau\) controls how sharp the preference is: the lower the temperature, the more probability mass concentrates on the preferred actions. If the user indicates only a left turn is acceptable, the distribution becomes sharply peaked around left-turning actions.

Finally, to handle safety, they introduce a scaling factor \(\lambda\). If the user marked the scene as “Danger,” the score is inverted to be negative, heavily penalizing the action.

Lambda scaling equation

This results in a final “Preference Score” for every possible action in that frame:

Final Preference Score Equation
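
As a rough end-to-end illustration, the sketch below turns one frame's Yes/No answers (the hypothetical frame_annotation record from earlier) into preference scores over a small local action set: a hand-rolled utility, a Boltzmann softmax with temperature \(\tau\), and a \(\lambda\)-scaled sign flip for the Danger flag. The utility definition and constants are my own simplifications of the idea, not the paper's exact equations.

```python
import numpy as np

def preference_scores(ann, v0, w0, tau=0.2, lam=1.0,
                      dv=(-0.2, 0.0, 0.2), dw=(-0.3, 0.0, 0.3)):
    """Map one frame's Yes/No answers to a preference score for each action
    in a small local set around the current command (v0, w0). The utility
    terms and constants here are simplified stand-ins for the paper's
    equations; only the overall recipe (Boltzmann weighting with tau,
    lambda-scaled sign flip on danger) follows the text above."""
    actions, utils = [], []
    for a in dv:
        for b in dw:
            v, w = v0 + a, w0 + b
            u = 0.0
            # Credit the directions the annotator marked as acceptable.
            if ann["can_turn_left"]:   u += max(w, 0.0)
            if ann["can_turn_right"]:  u += max(-w, 0.0)
            if ann["can_accelerate"]:  u += max(v - v0, 0.0)
            if ann["can_decelerate"]:  u += max(v0 - v, 0.0)
            actions.append((v, w))
            utils.append(u)

    # Boltzmann distribution over the local action set; lower tau = sharper.
    probs = np.exp(np.asarray(utils) / tau)
    probs /= probs.sum()

    # A "Danger" frame flips the sign, penalizing every action in the frame.
    sign = -1.0 if ann["danger"] else 1.0
    return list(zip(actions, sign * lam * probs))
```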

4. Training with Plackett-Luce Loss

At this point, every annotated frame comes with a set of candidate actions ranked by how much a human would prefer them. To train the neural network to replicate this ranking, the researchers use the Plackett-Luce model.

In simple terms, Plackett-Luce is a way to calculate the probability of a specific ranking of items. The loss function tries to maximize the likelihood that the neural network ranks the actions in the exact same order as the human preferences derived above.

Plackett-Luce Loss Equation

By minimizing this loss, the network learns to assign high rewards to safe, human-approved trajectories and low rewards to dangerous ones.
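
For reference, here is a minimal sketch of the standard Plackett-Luce negative log-likelihood applied to the network's predicted rewards. The paper's exact loss, including the regularization terms discussed next, may differ in its details.

```python
import torch

def plackett_luce_nll(pred_rewards, ranking):
    """Negative log-likelihood of a human-derived ranking under the standard
    Plackett-Luce model, using the network's predicted rewards as scores.

    pred_rewards: (K,) tensor of predicted rewards for K candidate actions.
    ranking: indices of those actions, most-preferred first."""
    scores = pred_rewards[ranking]            # reorder: best action first
    nll = torch.zeros(())
    for k in range(scores.shape[0]):
        # Probability that the k-th ranked action is chosen next, out of all
        # actions not yet ranked: a softmax over the remaining scores.
        nll = nll - (scores[k] - torch.logsumexp(scores[k:], dim=0))
    return nll
```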

To keep training stable, they add regularization terms, including a diversity loss (which pushes different actions toward different reward values) and a focal loss (which focuses training on hard examples):

Total Loss Equation

Experimental Results

Does this actually work? Can a robot navigate using only this learned reward function, without a LiDAR sensor?

The researchers tested HALO on a Clearpath Husky robot in diverse environments: outdoors, low-light scenarios, and indoors. They compared two ways of using HALO:

  1. HALO-MPC: Using the reward model as a cost function inside a Model Predictive Control planner.
  2. HALO-IQL/BC: Using the reward model to train an Offline RL policy (Implicit Q-Learning or Behavioral Cloning).

They compared these against standard methods like DWA (Dynamic Window Approach, which uses LiDAR) and HER (Hand-Engineered Rewards).
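
To give a feel for the HALO-MPC variant, here is a toy sampling-based planning step that scores candidate actions with a learned reward function and executes the best one. It is a deliberately simplified stand-in: the paper's MPC formulation presumably also accounts for dynamics, goals, and a longer horizon. Here, reward_fn and mask_fn are assumed wrappers around the trained reward model and the earlier trajectory-mask sketch.

```python
import numpy as np

def halo_mpc_step(reward_fn, mask_fn, image, H,
                  v_samples=np.linspace(0.0, 1.0, 11),
                  w_samples=np.linspace(-0.8, 0.8, 17)):
    """One step of a toy sampling-based controller driven by a learned
    reward: rasterize each candidate (v, w), score it, execute the best.
    A real MPC formulation would add dynamics, goal, and horizon terms.

    reward_fn(image, mask, action) -> float : wraps the trained reward model.
    mask_fn(v, w, H) -> mask                : e.g. the trajectory_mask sketch."""
    best_action, best_reward = (0.0, 0.0), -np.inf
    for v in v_samples:
        for w in w_samples:
            mask = mask_fn(v, w, H)              # where this action would go
            r = reward_fn(image, mask, (v, w))   # how good does it look
            if r > best_reward:
                best_reward, best_action = r, (v, w)
    return best_action
```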

Quantitative Success

The results were impressive. HALO generally outperformed baselines, even those that had the unfair advantage of using LiDAR.

Table 1: Performance comparison of navigation methods

Looking at Table 1:

  • Success Rate: In Scenario 1 (Outdoors), HALO-IQL achieved an 80% success rate, matching the LiDAR-based DWA.
  • Fréchet Distance: This metric measures how similar the robot’s path was to a human expert’s path. Lower is better. HALO-MPC achieved a score of 0.892 in Scenario 1, significantly better than the LiDAR-based DWA (1.677). This suggests HALO drives more naturally.

Qualitative Analysis: Seeing is Believing

The qualitative results really highlight the difference in behavior.

Figure 2: Trajectory comparisons

In Figure 2, you can see distinct behaviors:

  • Scenario 1 (Top Row): The DWA (Cyan line) tries to cut a straight line, ignoring the context of the sidewalk. HALO-MPC (Solid Red) follows the sidewalk curve, mimicking human social norms.
  • Scenario 3 (Bottom Row): This is an indoor hallway with glass walls. LiDAR often shoots right through glass, failing to detect it. Consequently, DWA fails here (0% success in Table 1). HALO, relying on vision (RGB), sees the frame/reflection of the glass and successfully navigates the corridor.

We can also look at specific policy decisions:

Figure 3: Qualitative Analysis for Behavioral Cloning and IQL

In Figure 3, the images show the robot’s perspective. The colored lines represent the planned path. Notice how in crowded scenes (middle images), the HALO-trained policies (Blue) often choose conservative, safe paths that avoid pedestrians, aligning closely with the Green (Human) ground truth.

Why Did Hand-Engineered Rewards Fail?

The experiments showed that policies trained on Hand-Engineered Rewards (HER) often performed poorly (e.g., 0% success in some scenarios). This confirms the hypothesis: it is incredibly difficult to tune a mathematical formula that generalizes across grass, pavement, glass hallways, and night-time lighting. HALO’s data-driven approach generalizes much better because it learns features of safety rather than geometric rules.

Conclusion

HALO represents a significant step forward in robotic navigation. By moving away from brittle, hand-crafted reward functions and expensive sensors, and moving toward human-preference aligned learning, we can build robots that are:

  1. Cheaper: Functioning on RGB cameras rather than LiDAR.
  2. More Natural: Moving in smooth, socially compliant ways.
  3. More Robust: Capable of handling “invisible” obstacles like glass that trick geometric sensors.

The core innovation—using a homography-projected mask to focus the vision model on the robot’s future path—is an elegant way to combine action and perception.

While the authors note limitations, such as sensitivity to extreme lighting changes or the “object impermanence” problem (forgetting an obstacle once it leaves the camera frame), the results clearly show that teaching robots through preference and intuition is a viable path to autonomy.

Just like teaching that teenager to drive, sometimes the best instruction isn’t a formula—it’s just knowing what “good” looks like.


This blog post explains the research presented in “HALO: Human Preference Aligned Offline Reward Learning for Robot Navigation” by Seneviratne et al. at the University of Maryland.