Imagine walking through a crowded office or hiking down a forest trail with a robot assistant carrying your gear. For the robot to be useful, it needs to do one thing flawlessly: follow you.
This task, known as Embodied Visual Tracking (EVT), sounds simple to us humans. We effortlessly track a friend in a crowd, predict where they will step next, and navigate around obstacles without losing sight of them. But for robots, this is a nightmare. It requires two distinct skills: recognition (identifying who to follow) and trajectory planning (deciding how to move).
Traditionally, roboticists treated these as separate problems. One module handles the camera vision, and another handles the movement. But what happens if the vision module gets confused by a passerby wearing a similar shirt? The movement module blindly follows the wrong person.
In a recent paper, researchers introduced TrackVLA, a unified model that bridges this gap. By combining vision, language, and action into a single framework, TrackVLA achieves state-of-the-art tracking performance, even in chaotic real-world environments.

The Problem: When Recognition Meets Planning
The core challenge in Embodied Visual Tracking is the “synergy” gap. Existing approaches usually decouple the process:
- A Tracker/Detector finds the target in the image.
- A Planner takes that bounding box and calculates velocity commands.
This separation leads to error accumulation. If the detector flickers or fails under occlusion (the target walks behind a pillar), the planner has no context with which to recover. Furthermore, most existing planners either use Reinforcement Learning (RL), which can produce jerky, unnatural motion, or rely on simple visual servoing that breaks down in complex scenes.
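To make the error-accumulation problem concrete, here is a minimal sketch of such a decoupled loop; every name in it is hypothetical, standing in for whatever detector and planner a given system uses:

```python
from typing import Optional, Tuple

BBox = Tuple[int, int, int, int]  # (x, y, width, height) in image pixels

def plan_velocity(bbox: Optional[BBox], image_width: int = 640) -> Tuple[float, float]:
    """A planner that only ever sees a bounding box, never the image itself."""
    if bbox is None:
        # Occlusion or a detector flicker: with no visual context,
        # the planner can only stop (or blindly repeat its last command).
        return 0.0, 0.0
    x, _, w, _ = bbox
    # Naive visual servoing: turn to center the box, speed up if the box is small (target far away).
    yaw_rate = 0.002 * (image_width / 2 - (x + w / 2))
    forward_speed = 1.0 if w < 100 else 0.3
    return forward_speed, yaw_rate

# If the detector latches onto a look-alike passerby, plan_velocity happily follows
# the wrong box -- the planner has no way to question what it was given.
```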
To fix this, we need a “brain” that understands visual context and motion simultaneously.
Enter TrackVLA: A Vision-Language-Action Model
The “VLA” in TrackVLA stands for Vision-Language-Action. It builds upon the recent explosion of Large Language Models (LLMs) and Vision-Language Models (VLMs). The insight here is clever: if LLMs can reason about the world, why not teach them to control a robot’s movement too?
The researchers designed TrackVLA to handle two tasks with one shared backbone:
- Visual Question Answering (VQA): “What is the person wearing?”
- Tracking: “Follow the man in the black suit.”
By training on both tasks, the model learns a deeper understanding of the target’s appearance and behavior.
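To make the dual-task setup concrete, here is a hedged illustration of what the two kinds of training samples might look like; the field names and values are assumptions for clarity, not the paper’s actual data schema:

```python
# A recognition (VQA) sample supervises the language head with a text answer.
vqa_sample = {
    "video": "office_clip_0001.mp4",                  # hypothetical file name
    "instruction": "What is the person wearing?",
    "answer": "A black suit and white sneakers.",
}

# A tracking sample supervises the action head with future waypoints instead of text.
tracking_sample = {
    "video": "office_clip_0002.mp4",                  # hypothetical file name
    "instruction": "Follow the man in the black suit.",
    "trajectory": [(0.4, 0.0), (0.8, 0.1), (1.2, 0.3)],  # (x, y) waypoints in the robot frame
}
```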
The Architecture
Let’s break down the pipeline, which is illustrated in the figure below.

The pipeline consists of three main stages:
- Observation Encoding: The robot captures a stream of camera frames. These are processed by a Vision Encoder (specifically, EVA-CLIP). To keep things fast, the system uses a Grid Pooling strategy: instead of keeping every frame at full token resolution, it pools the feature maps to different sizes depending on how recent the frame is.
The researchers realized that you need high resolution to see the target now, but low resolution is fine for remembering history. They implement this mathematically as:

This equation essentially says they pool the visual features into different sizes: fine-grained (\(64/N\)) for the current frame to capture details, and coarse-grained (\(4/N\)) for history frames to capture motion context without blowing up memory usage. A code sketch after this pipeline breakdown illustrates the idea.
- LLM Forwarding: These visual tokens are fed into a Large Language Model (Vicuna-7B) along with the text instruction (e.g., “Follow the person”).
- Dual Heads: Here is where the magic happens. The model splits into two branches based on what it needs to do:
- The Language Head: If the task is recognition, it outputs text (just like ChatGPT).
- The Action Head: If the task is tracking, it activates a specialized Anchor-based Diffusion Model to predict the robot’s path.
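Here is a minimal PyTorch-style sketch of the encoding-and-routing logic described in this breakdown. The concrete grid sizes (8×8 for the current frame and 2×2 for history, i.e., 64 and 4 tokens) and all module interfaces are assumptions for illustration, not the authors’ implementation:

```python
import torch
import torch.nn.functional as F

def grid_pool(patch_features: torch.Tensor, grid: int) -> torch.Tensor:
    """Pool an (H, W, C) patch-feature map from the vision encoder down to grid*grid tokens."""
    x = patch_features.permute(2, 0, 1).unsqueeze(0)         # -> (1, C, H, W)
    x = F.adaptive_avg_pool2d(x, output_size=(grid, grid))   # -> (1, C, grid, grid)
    return x.flatten(2).squeeze(0).permute(1, 0)             # -> (grid*grid, C) tokens

def encode_observations(frame_features):
    """frame_features: per-frame (H, W, C) encoder outputs, ordered oldest to newest."""
    tokens = []
    for i, feats in enumerate(frame_features):
        is_current = i == len(frame_features) - 1
        # Fine-grained pooling (8x8 = 64 tokens) for the current frame,
        # coarse pooling (2x2 = 4 tokens) for each history frame -- assumed sizes.
        tokens.append(grid_pool(feats, grid=8 if is_current else 2))
    return torch.cat(tokens, dim=0)                          # visual token sequence for the LLM

def trackvla_forward(llm, language_head, action_head, visual_tokens, instruction_tokens, task):
    """One shared backbone; the output is routed to one of two heads depending on the task."""
    hidden = llm(torch.cat([visual_tokens, instruction_tokens], dim=0))
    if task == "recognition":
        return language_head(hidden)     # text answer (decoded autoregressively in practice)
    return action_head(hidden[-1:])      # e.g. the E_T^pred embedding that conditions the diffusion head
```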
The Action Head: Anchor-Based Diffusion
Standard diffusion models (like those used in DALL-E or Stable Diffusion) are great at generating data, but they are slow—often requiring hundreds of steps to remove noise and create an output. A robot moving at 1.5 meters per second cannot wait seconds for a decision.
TrackVLA solves this with Anchor-based Diffusion. Instead of starting from pure random noise, the model starts with a set of pre-calculated “anchor” trajectories—common movement patterns derived from the training data.

As shown above, the model takes these noisy anchors and the context from the LLM (\(E^{pred}_T\)) and “denoises” them to find the correct path. Because the anchors are already close to valid trajectories, the model only needs two denoising steps during inference. This allows the system to run at 10 FPS (Frames Per Second) on a server, which is sufficient for real-time tracking.
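Here is a minimal sketch of that inference path, including the score-based selection described just below; the denoiser and scorer interfaces, the noise scale, and the anchor shapes are all assumptions:

```python
import torch

@torch.no_grad()
def infer_trajectory(denoiser, scorer, anchors, cond, sigma=0.1, steps=2):
    """
    anchors: (K, T, 2) pre-computed anchor trajectories (K candidates, T waypoints, x/y).
    cond:    conditioning embedding from the LLM (the E_T^pred token).
    Returns the highest-scoring denoised trajectory with shape (T, 2).
    """
    # Start from lightly noised anchors rather than pure Gaussian noise ...
    trajs = anchors + sigma * torch.randn_like(anchors)
    # ... so only a couple of denoising steps are needed.
    for _ in range(steps):
        trajs = denoiser(trajs, cond)
    scores = scorer(trajs, cond)          # (K,) confidence per candidate
    return trajs[scores.argmax()]
```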
The output isn’t just one path: the model predicts multiple candidate trajectories and scores them. The final output is selected based on the highest score:

The model is trained by optimizing a combined loss function that looks at both the accuracy of the trajectory (MSE) and the classification score (BCE):

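The exact objective and weighting are defined in the paper; the snippet below is only a generic sketch of how an MSE-plus-BCE loss over scored anchor trajectories can be wired up, with an assumed weighting term `lam`:

```python
import torch
import torch.nn.functional as F

def tracking_loss(pred_trajs, pred_scores, gt_traj, lam=1.0):
    """
    pred_trajs:  (K, T, 2) denoised candidate trajectories.
    pred_scores: (K,) raw classification logits, one per candidate.
    gt_traj:     (T, 2) ground-truth future trajectory.
    Regress the candidate closest to the ground truth (MSE) and train the
    scores to identify that candidate (BCE) -- a generic anchor-style objective.
    """
    per_candidate_mse = ((pred_trajs - gt_traj) ** 2).mean(dim=(1, 2))   # (K,)
    best = per_candidate_mse.argmin()
    score_target = torch.zeros_like(pred_scores)
    score_target[best] = 1.0
    mse_term = per_candidate_mse[best]
    bce_term = F.binary_cross_entropy_with_logits(pred_scores, score_target)
    return mse_term + lam * bce_term
```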
Fueling the Brain: The EVT-Bench Dataset
A complex model requires complex data. Existing datasets were too small, lacked language descriptions, or didn’t feature realistic crowd dynamics. The researchers built their own benchmark: EVT-Bench.
They used the Habitat simulator to create a massive training set. They didn’t just drop random mannequins into a room; they created a sophisticated pipeline:
- 100+ Custom Avatars: Using SMPL-X models with diverse clothing and textures.
- Natural Motion: Avatars move at realistic human walking speeds (1.0–1.5 m/s) and use the ORCA collision-avoidance algorithm so they don’t walk through walls or each other.
- Diverse Scenarios:
- Single-Target Tracking: Basic following.
- Distracted Tracking: “Follow the man in the blue shirt” (ignoring the one in red).
- Ambiguous Tracking: “Follow the first person you see.”

To ensure the model recognizes objects as well as it tracks them, they mixed this tracking data with open-world recognition data (VQA samples). This combination of 855K tracking samples and 855K recognition samples is crucial: as shown in their ablation studies (below), performance peaks with a balanced 1:1 ratio of tracking to recognition data.
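One simple way to realize such a 1:1 mix during training is to assemble each batch half-and-half; the sampler below is an illustrative assumption, not the authors’ recipe:

```python
import random

def mixed_batches(tracking_data, recognition_data, batch_size=8, tracking_ratio=0.5):
    """Yield batches in which a fixed fraction of samples come from each data source."""
    n_track = int(batch_size * tracking_ratio)
    while True:
        batch = random.sample(tracking_data, n_track) + \
                random.sample(recognition_data, batch_size - n_track)
        random.shuffle(batch)             # avoid a fixed ordering of task types within the batch
        yield batch
```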

Experimental Results
Does it actually work? The researchers tested TrackVLA in both simulation and the real world.
Simulation Benchmarks
In the Gym-UnrealCV benchmark, TrackVLA achieved State-of-the-Art (SOTA) results in a zero-shot manner. This means it was evaluated on environments it had never seen during training.
- Single Target: It tracked targets successfully for entire episodes.
- Distractors: It significantly outperformed previous methods (like AD-VAT and standard VLA models) in identifying the correct target among look-alikes.
- Unseen Objects: It could even follow animals (horses, dogs, sheep) despite primarily being trained on humans, proving its visual generalization.
(Note: while the model excels at tracking humans, its ability to generalize to animals highlights the robustness of the underlying visual encoder.)
Real-World Deployment
Simulation is safe; the real world is messy. The team deployed TrackVLA on a Unitree Go2 quadruped robot equipped with a RealSense camera. The computation was offloaded to a server with an RTX 4090 GPU via Wi-Fi.
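Such a setup typically amounts to a simple client loop on the robot: stream a frame, receive a trajectory, execute it. Everything below (the endpoint, message format, and the camera and robot APIs) is a hypothetical sketch; the paper does not spell out the transport details:

```python
import time
import requests  # assuming a plain HTTP transport; the actual protocol is not specified

SERVER_URL = "http://192.168.1.10:8000/track"   # hypothetical inference-server address

def control_loop(camera, robot, instruction="Follow the person in the black suit."):
    """Robot-side loop: send the latest frame, execute the returned waypoints, repeat at ~10 Hz."""
    while True:
        frame_jpeg = camera.capture_jpeg()                       # hypothetical camera API
        response = requests.post(
            SERVER_URL,
            files={"frame": ("frame.jpg", frame_jpeg, "image/jpeg")},
            data={"instruction": instruction},
            timeout=0.5,
        )
        waypoints = response.json()["trajectory"]                # e.g. [[x, y], ...] in the robot frame
        robot.follow_waypoints(waypoints)                        # hypothetical velocity controller
        time.sleep(0.1)                                          # matches the reported ~10 FPS
```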

The results were visually impressive. The robot successfully tracked users in:
- Cluttered Environments: Forests with trees obscuring the view.
- Low Illumination: Dimly lit rooms.
- Pursuit-Evasion: Chasing a target running away.

Perhaps the most telling comparison was against a commercial tracking drone (DJI Flip). In “Easy” scenarios (open fields), both achieved 100% success. However, in “Hard” scenarios (fast movement, heavy occlusion), the commercial drone’s success rate dropped to 50%, while TrackVLA maintained 70%.
The visualization below highlights a case where the commercial drone (UAV view) loses the target, while TrackVLA maintains the trajectory lock.

Why This Matters
TrackVLA represents a significant step forward for embodied AI. By unifying recognition (Language/Vision) and planning (Action/Diffusion) into a single model, it eliminates the “broken telephone” effect of modular systems.
The key takeaways for students and researchers are:
- Architecture Synergy: Using the same tokens for VQA and Motion allows the model to “understand” what it is tracking.
- Efficient Diffusion: You don’t need hundreds of steps for diffusion if you use smart anchors. This makes generative policies viable for real-time robotics.
- Data Diversity: Mixing recognition datasets with tracking datasets is essential for robust performance.
As robots move out of factories and into our homes and streets, the ability to robustly follow and interact with us—without getting confused by a crowd or a shadow—will be the defining feature of useful embodied AI. TrackVLA shows us a promising path to getting there.