Introduction

Imagine trying to sprint through a crowded room, dodging furniture and people, while looking through a narrow paper towel tube. This is effectively how many state-of-the-art legged robots operate today. While we have seen incredible videos of robot dogs backflipping or hiking, much of that agility relies on “proprioception”—the robot’s internal sense of its joint positions and balance. They are essentially moving blindly, relying on their ability to recover from stumbles rather than avoiding obstacles in the first place.

To make robots truly autonomous in unstructured environments—like search and rescue sites or busy offices—they need to see and react to the world in 3D. This includes avoiding hanging wires, glass walls, and moving people.

In a recent paper titled “Omni-Perception: Omnidirectional Collision Avoidance for Legged Locomotion in Dynamic Environments,” researchers from The Hong Kong University of Science and Technology (Guangzhou) propose a groundbreaking solution. They introduce Omni-Perception, a framework that allows a robot to navigate complex 3D spaces by processing raw LiDAR data directly within a Reinforcement Learning (RL) policy.

Figure 1: Validation scenarios for the Omni-Perception framework. Effective omnidirectional collision avoidance is demonstrated on the left, where the robot reacts to obstacles from various approach vectors. Robustness against diverse environmental features is shown on the right, including successful negotiation of aerial, transparent, slender, and ground obstacles. These results highlight the capacity of Omni-Perception to achieve collision-free locomotion in challenging 3D settings directly from raw LiDAR input.

As shown in Figure 1 above, this system allows the robot to handle everything from uneven terrain to “slender” obstacles like poles and even transparent glass—obstacles that typically confuse standard depth cameras. In this post, we will tear down the architecture of this paper, explaining how they taught a robot to “see” using raw point clouds and how they built a custom simulator to make it happen.

The Problem: Why is “Seeing” So Hard?

Before diving into the solution, we need to understand why this hasn’t been solved yet. Legged robots typically use one of two methods to sense the world:

  1. Depth Cameras: These work like our eyes but often have a limited field of view (FOV). If an obstacle is to the side or behind the robot, the camera misses it. Furthermore, depth cameras struggle with lighting changes (like bright sunlight or darkness) and transparent surfaces.
  2. Elevation Maps: Robots often take sensor data and build a 2.5D map of the floor height. While useful for stepping on rocks, these maps are terrible at representing overhanging obstacles (like a table you might bump your head on) or dynamic objects moving quickly.

The researchers argue that LiDAR (Light Detection and Ranging) is the superior sensor for this task. LiDAR provides a 360-degree view of the environment and is immune to lighting conditions. However, LiDAR produces a massive amount of data in the form of “point clouds”—thousands of individual 3D dots.

Processing these point clouds in real-time to make split-second motor control decisions is computationally heavy. This is why most previous approaches separated the “seeing” (mapping/planning) from the “moving” (locomotion). The result? Slow, conservative movement.

Omni-Perception changes this by feeding the raw LiDAR data directly into the robot’s brain (the neural network) end-to-end.

The Omni-Perception Framework

To achieve agile, safe movement, the robot needs to know two things: “Where is my body?” (Proprioception) and “What is around me?” (Exteroception).

The researchers highlight the difference in sensor coverage in the image below. Notice how the LiDAR (bottom) covers a much wider range than the depth camera (top), eliminating blind spots.

Figure 2: Proposed System Framework. (a) Visualization of differing sensor coverage: the typically narrow, forward-directed field of view of a depth camera (top) contrasted with the broader, longer-range coverage of a LiDAR sensor (bottom), shown on the Unitree Go2 robot. (b) Detailed diagram of the perception and control pipeline.

The Architecture: PD-RiskNet

The core innovation of this paper is a neural network module called PD-RiskNet (Proximal-Distal Risk-Aware Hierarchical Network).

Processing every single point from a LiDAR scan is too slow. The network needs to be smart about what it pays attention to. The researchers realized that the robot needs different types of information depending on how far away an object is. Therefore, they split the LiDAR data into two streams:

  1. Proximal (Near-Field): This data is crucial for immediate foothold selection and collision avoidance. It requires high precision.
  2. Distal (Far-Field): This data helps the robot plan its path in the general direction of the goal. It can be coarser.

1. Proximal Processing (The “Here and Now”)

For points close to the robot (the dense blue points in Figure 2b), the system uses Farthest Point Sampling (FPS). This technique selects a representative subset of points that best preserves the shape of nearby obstacles. These points are fed into a GRU (Gated Recurrent Unit), a type of neural network good at remembering sequences, to extract features about the immediate terrain.
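The paper does not publish its sampling code, but a minimal NumPy sketch of farthest point sampling gives a feel for the operation. The point counts, the `farthest_point_sampling` name, and the random placeholder cloud below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def farthest_point_sampling(points: np.ndarray, num_samples: int) -> np.ndarray:
    """Greedy FPS: repeatedly pick the point farthest from the set chosen so far.

    points: (N, 3) array of proximal LiDAR points in the robot frame.
    Returns an index array of shape (num_samples,).
    """
    n = points.shape[0]
    selected = np.zeros(num_samples, dtype=np.int64)
    min_dist = np.full(n, np.inf)   # distance to the nearest already-selected point
    selected[0] = 0                 # start from an arbitrary point
    for i in range(1, num_samples):
        diff = points - points[selected[i - 1]]
        min_dist = np.minimum(min_dist, np.einsum("ij,ij->i", diff, diff))
        selected[i] = int(np.argmax(min_dist))
    return selected

# Example: keep 256 shape-preserving points from a dense near-field scan.
proximal_cloud = np.random.rand(4096, 3)            # placeholder point cloud
sampled = proximal_cloud[farthest_point_sampling(proximal_cloud, 256)]
```

Because FPS always grabs the point farthest from everything chosen so far, thin edges and corners of nearby obstacles survive the downsampling, which is exactly what foothold selection needs.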

2. Distal Processing (The “Look Ahead”)

For points far away (the sparse red points in Figure 2b), the system uses Average Downsampling. Since the robot doesn’t need to know the exact millimeter shape of a wall 5 meters away, averaging the points reduces noise and computational load. This stream also goes into its own GRU.
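Average downsampling is much simpler. Here is a hedged sketch using voxel-grid averaging; the paper does not specify its exact binning, so the `voxel_size` and grouping scheme are assumptions:

```python
import numpy as np

def average_downsample(points: np.ndarray, voxel_size: float = 0.5) -> np.ndarray:
    """Average all points that fall into the same voxel (one common way to
    coarsen a far-field cloud; the paper's exact binning is not specified).

    points: (N, 3) distal LiDAR points.  Returns (M, 3) voxel centroids.
    """
    keys = np.floor(points / voxel_size).astype(np.int64)
    _, inverse, counts = np.unique(keys, axis=0, return_inverse=True, return_counts=True)
    inverse = inverse.ravel()
    sums = np.zeros((counts.shape[0], 3))
    np.add.at(sums, inverse, points)          # accumulate points per voxel
    return sums / counts[:, None]             # centroid of each occupied voxel

distal_cloud = np.random.rand(8192, 3) * 20.0                # placeholder far-field points
coarse = average_downsample(distal_cloud, voxel_size=0.5)    # fed to the distal GRU
```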

These two streams are combined (concatenated) with the robot’s proprioception (joint history) and the user’s velocity command. The final output is the specific motor commands for the robot’s legs.
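To make the data flow concrete, here is a rough PyTorch sketch of how the two GRU streams might be fused with proprioception and the velocity command. The layer sizes, the `PDRiskNetPolicy` name, and the output head are placeholders rather than the paper's exact architecture:

```python
import torch
import torch.nn as nn

class PDRiskNetPolicy(nn.Module):
    """Illustrative fusion of proximal/distal LiDAR features with
    proprioception; all dimensions are placeholders."""

    def __init__(self, prox_dim=256 * 3, dist_dim=128 * 3, proprio_dim=48,
                 cmd_dim=3, hidden=128, num_joints=12):
        super().__init__()
        self.prox_gru = nn.GRU(prox_dim, hidden, batch_first=True)   # near-field stream
        self.dist_gru = nn.GRU(dist_dim, hidden, batch_first=True)   # far-field stream
        self.actor = nn.Sequential(
            nn.Linear(hidden * 2 + proprio_dim + cmd_dim, 256), nn.ELU(),
            nn.Linear(256, 128), nn.ELU(),
            nn.Linear(128, num_joints),   # per-joint motor commands
        )

    def forward(self, prox_seq, dist_seq, proprio, cmd):
        # prox_seq, dist_seq: (batch, time, flattened sampled points)
        _, h_prox = self.prox_gru(prox_seq)
        _, h_dist = self.dist_gru(dist_seq)
        fused = torch.cat([h_prox[-1], h_dist[-1], proprio, cmd], dim=-1)
        return self.actor(fused)
```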

Teaching the Robot: The Reward System

How do you teach a robot not to crash? In Reinforcement Learning, you design a “reward function”—a scoring system where the robot gets points for good behavior and loses points for bad behavior.

The goal is to track a command (e.g., “move forward at 1 m/s”) while avoiding obstacles. The researchers created a clever sector-based avoidance mechanism.

Imagine the robot is surrounded by a 360-degree radar divided into sectors. If an obstacle appears in a specific sector, the system calculates a “repulsive” force pushing the robot in the opposite direction.

Figure 3: The calculation of the sector-based avoidance velocity.

As seen in Figure 3, if obstacles (O1, O2, O3) are detected, the system computes an Avoidance Velocity (\(V_{avoid}\)). This isn’t a physical force, but a mathematical adjustment to the target velocity the robot thinks it should have.

The magnitude of this avoidance velocity in each sector is calculated using an exponential function based on distance. The closer the obstacle (\(d_{j,t}\)), the stronger the repulsion:

\[ \left| V_{t,j}^{avoid} \right| = \exp\left( -d_{j,t} \cdot \alpha_{avoid} \right) \]
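A small NumPy sketch of this sector-based repulsion follows. The sector count, the `alpha_avoid` value, and the choice to sum the per-sector vectors into one avoidance velocity are assumptions made for illustration:

```python
import numpy as np

def avoidance_velocity(sector_dists, alpha_avoid=2.0, num_sectors=16):
    """Combine per-sector repulsions into a single planar avoidance velocity.

    sector_dists: (num_sectors,) distance to the closest obstacle in each
    sector (d_{j,t}); sectors are centered at evenly spaced angles.
    """
    angles = np.linspace(0.0, 2.0 * np.pi, num_sectors, endpoint=False)
    # |V_avoid| = exp(-d * alpha): closer obstacles repel more strongly.
    magnitudes = np.exp(-sector_dists * alpha_avoid)
    # Each sector pushes the robot away from its obstacle direction.
    push_dirs = -np.stack([np.cos(angles), np.sin(angles)], axis=1)
    return (magnitudes[:, None] * push_dirs).sum(axis=0)   # (vx, vy)
```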

The robot is then trained to match a modified target velocity that includes this avoidance vector. If the user says “go straight,” but there is a wall, the modified command effectively becomes “go straight but veer left.” The reward function penalizes the robot if it doesn’t follow this safe trajectory:

\[ r_{vel\_avoid} = \exp\left( -\beta_{va} \cdot \left\| v_t - \left( v_t^{cmd} + V_{avoid,t} \right) \right\|^2 \right) \]
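A direct transcription of this reward as a Python function (with an illustrative `beta_va`, not the paper's tuned value) might look like:

```python
import numpy as np

def vel_avoid_reward(v_t, v_cmd, v_avoid, beta_va=4.0):
    """r_vel_avoid = exp(-beta_va * ||v_t - (v_cmd + V_avoid)||^2).

    v_t, v_cmd, v_avoid: planar velocity vectors (vx, vy).
    """
    err = np.asarray(v_t) - (np.asarray(v_cmd) + np.asarray(v_avoid))
    return float(np.exp(-beta_va * np.dot(err, err)))
```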

Additionally, a “ray reward” encourages the robot to seek open spaces, maximizing the distance of LiDAR rays to obstacles:

\[ r_{rays} = \sum_{i=1}^{n} \frac{\hat{d}_{t,i}}{n \cdot d_{max}} \]
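And the ray reward, again as a hedged sketch with an assumed clipping range `d_max`:

```python
import numpy as np

def ray_reward(ray_dists, d_max=5.0):
    """r_rays = sum_i d_hat_{t,i} / (n * d_max): rewards long (clipped)
    LiDAR returns, i.e. open space around the robot."""
    d_hat = np.clip(ray_dists, 0.0, d_max)
    return float(d_hat.sum() / (d_hat.size * d_max))
```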

The Simulation Toolkit: Bridging the Reality Gap

Training a robot in the real world is slow and dangerous. Training in simulation is fast, but often inaccurate (“Sim-to-Real gap”). This is especially true for LiDAR, which has complex noise patterns and reflection physics.

Standard simulators like Isaac Sim or Gazebo often struggle to simulate LiDAR efficiently for thousands of parallel environments (which is required for RL). To solve this, the authors built a Custom LiDAR Rendering Framework using NVIDIA Warp and Taichi.

Efficiency and Fidelity

This custom tool allows for massive parallelism. It simulates non-repetitive scan patterns (common in modern solid-state LiDARs like the Livox Mid-360) and self-occlusion (the robot seeing its own body).
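The authors' renderer is not reproduced here, but a minimal NVIDIA Warp kernel for batched ray casting against a scene mesh, following the output-parameter form of `wp.mesh_query_ray` used in Warp's own ray-cast example, gives a feel for the approach. The kernel name, array layout, and max range below are assumptions:

```python
import warp as wp

wp.init()

@wp.kernel
def lidar_raycast(mesh: wp.uint64,
                  origins: wp.array(dtype=wp.vec3),
                  dirs: wp.array(dtype=wp.vec3),
                  max_range: float,
                  ranges: wp.array(dtype=float)):
    tid = wp.tid()
    t = float(0.0)       # hit distance along the ray
    u = float(0.0)       # barycentric u of the hit
    v = float(0.0)       # barycentric v of the hit
    sign = float(0.0)    # which side of the triangle was hit
    n = wp.vec3()        # surface normal at the hit
    f = int(0)           # triangle index
    if wp.mesh_query_ray(mesh, origins[tid], dirs[tid], max_range,
                         t, u, v, sign, n, f):
        ranges[tid] = t
    else:
        ranges[tid] = max_range   # no return: report max range
```

One thread per ray per environment can then be dispatched with `wp.launch`, with a Livox-style non-repetitive direction table rotated into each robot's body frame every step. If the robot's own collision mesh is part of the scene, the self-occlusion "shadow" visible in Figure 4 emerges naturally; the authors do not detail their exact arrangement, so treat this as one plausible setup.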

The visual comparison below demonstrates the high fidelity of their simulator. Panel (c) is their simulation, which looks remarkably similar to the real-world scan in panel (b), including the “shadow” cast by the robot’s own body.

Figure 4: Comparison of simulated and real point clouds for the Unitree G1 robot. (a) The physical Unitree G1 robot setup. (b) Real-world LiDAR scan captured by the onboard Livox Mid-360 sensor. (c) (Ours) Point cloud generated using the authors' Livox Mid-360 sensor model within Isaac Gym. (d) Point cloud generated using the official sensor model within Isaac Sim. The authors' model captures the self-occlusion effect seen in the real-world LiDAR scan.

Speed is critical for RL training. As shown in Table 1 below, the custom framework (Ours) renders scenes significantly faster than Isaac Sim, especially as the number of environments scales up to 4096.

Table 1: Rendering time (ms) for static scenes across configurations. Ours is much more efficient than Isaac Sim.

Experiments and Results

The team validated Omni-Perception through extensive simulation and real-world deployment on a Unitree robot.

Real-World Agility

The robot was tested in scenarios that are notoriously difficult for perception systems:

  1. Aerial Obstacles: Objects hanging at head height.
  2. Transparent Obstacles: Glass walls (LiDAR can often detect the frame or partial reflections better than depth cameras).
  3. Dynamic Humans: People walking into the robot’s path.
  4. Complex Terrain: Combinations of stairs, slopes, and clutter.

Figure 5: Robot obstacle avoidance performance was assessed in varied scenarios, including complex terrain and dynamic human interference.

In the image above, you can see the robot navigating rocky terrain and successfully avoiding moving humans. In their comparison tests, the native Unitree system (which uses depth cameras) failed 100% of the time against aerial and moving human obstacles, while Omni-Perception achieved success rates of 70% and 90% respectively.

Ablation Studies

Does the PD-RiskNet architecture really matter? The researchers compared their method against simpler approaches, such as feeding raw points into a standard MLP (Multi-Layer Perceptron) or just using a GRU without the proximal/distal split.

Table 3: PD-RiskNet Ablation Results (30 Trials)

The results in Table 3 are telling. A direct MLP approach ran out of memory (OOM). Simpler sampling methods resulted in much lower success rates (33%). The Omni-Perception method (“Ours”) achieved the highest success rate, proving that the hierarchical processing of near and far points is essential.

Limitations and Failure Cases

No system is perfect. The authors candidly discuss where Omni-Perception struggles.

1. The “Dense Grass” Problem: In unstructured environments, tall grass can look like a solid wall to a LiDAR sensor. The robot might treat a patch of grass as a dangerous obstacle and refuse to walk through it, or get “trapped” because it sees obstacles on all sides.

Figure 7: Failure case in dense grass.

2. Sparse Objects: Because the “Distal” (far) stream uses average downsampling, very thin or small objects at a distance might be averaged out of existence. The robot might not realize a thin pole is there until it gets close enough for the “Proximal” stream to catch it, which might be too late for smooth avoidance.
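A toy numeric example (hypothetical numbers) shows how averaging can erase a thin, distant object:

```python
import numpy as np

# Hypothetical far-field sector: a thin pole at 5 m returns only two of the
# ten rays; the rest see the background wall at 12 m.
sector_ranges = np.array([12.0, 12.0, 5.0, 5.0, 12.0, 12.0, 12.0, 12.0, 12.0, 12.0])
print(sector_ranges.mean())   # 10.6 m -> the averaged sector no longer shows the pole
```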

Conclusion

The Omni-Perception paper represents a significant step forward in robotic autonomy. By moving away from slow, map-based planning and limited depth cameras, and instead embracing end-to-end learning with raw LiDAR, the researchers have created a legged robot that is genuinely aware of its 3D surroundings.

This work highlights two important trends in robotics:

  1. Sensor-First Design: Using the right sensor (LiDAR) for the task (3D geometry) is more effective than trying to force cameras to do everything.
  2. Simulation is Key: You cannot train robust policies without high-fidelity, high-performance simulation tools that accurately model sensor noise and physics.

As this technology matures—perhaps by integrating semantic understanding to tell the difference between “grass” and “concrete wall”—we can expect to see robot dogs leaving the lab and reliably navigating the messy, chaotic world we live in.