Introduction

Imagine trying to navigate a slippery, twisting tunnel while looking through a tiny camera. Now, imagine you have to locate a specific lesion, track its movement as the tunnel breathes and deforms, and precisely manipulate a tool to treat it. This is the daily reality of endoscopic surgery: a procedure that imposes immense cognitive load and demands steady hands and years of training.

For years, roboticists have tried to automate parts of this process to relieve the burden on surgeons. However, the environment inside the human body is notoriously difficult for robots. It is unstructured, dynamic, and wet. Traditional automation methods are often “brittle”—they rely on complex mathematical models that break down the moment the tissue deforms unexpectedly or a reflection obscures the camera.

But what if a robot could understand the surgery the way a human does? What if it could look at an image, read a command like “track the polyp,” and instinctively know how to move?

This is the promise of EndoVLA, a new framework presented in the paper “EndoVLA: Dual-Phase Vision-Language-Action for Precise Autonomous Tracking in Endoscopy.” By combining the reasoning power of Large Language Models (LLMs) with robotic control, the researchers have created a system that doesn’t just “process” pixels—it understands tasks.

In this post, we will dissect how EndoVLA works, specifically focusing on its novel Dual-Phase Fine-Tuning (DFT) strategy that teaches a generic AI model to become a surgical specialist.

The Challenge: Why Not Just Use GPT-4?

We have seen the rise of Multimodal Large Language Models (MLLMs) like GPT-4V or open-source variants that can describe images with uncanny accuracy. In robotics, we call the application of these models Vision-Language-Action (VLA). The idea is simple: the model takes an image and a text instruction as input, and outputs a robotic action (like “move arm left”).

However, applying standard VLAs to endoscopy presents three massive hurdles:

  1. The Domain Gap: General VLA models are trained on internet data—pictures of cats, cars, and coffee cups. They have never seen the inside of a stomach. They lack the medical context to distinguish a polyp from a fold in the mucosa.
  2. Data Scarcity: Training a robot requires thousands of examples. In medical robotics, collecting high-quality data where a robot successfully tracks a tumor is expensive and difficult.
  3. Precision vs. Creativity: LLMs are designed to be creative and open-ended. Surgery requires deterministic, reliable, and precise action. You don’t want a “creative” robot inside a patient.

EndoVLA addresses these issues not just by training a model, but by fundamentally changing how the model learns.

The EndoVLA Framework

At its core, EndoVLA is built upon the Qwen2-VL backbone, a powerful open-source vision-language model. The system is designed for continuum robots—flexible, snake-like endoscopes that can bend in multiple directions.

Figure 2: Overview of the setup of robotic endoscope and the DFT architecture of EndoVLA

As shown in Figure 2, the architecture is deceptively simple. The robot captures an image (\(O_t\)) and receives a text instruction (\(I\)). These inputs are processed by a vision encoder and a language tokenizer, respectively. The data is fused and passed through the Large Language Model (LLM).

The output is two-fold:

  1. Perception: A bounding box \([x, y, w, h]\) locating the target.
  2. Action: A discrete motion command (e.g., “upper-right”, “stop”).

This simultaneous output is critical. By forcing the model to identify where the target is before deciding how to move, the system grounds its actions in visual reality.
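To make the dual output concrete, here is a minimal sketch of how a controller might parse such a response. The exact output grammar (`[x, y, w, h] action`) is inferred from the paper's description; the function name and regex are my own, and a real deployment would need more robust handling.

```python
import re

def parse_endovla_output(text):
    """Parse a model response of the form '[x, y, w, h] action' into
    a bounding-box tuple and a discrete action string.

    Assumed grammar based on the paper's described output format.
    Returns None for malformed responses so the caller can fall back
    to a safe default such as 'stop'.
    """
    match = re.match(r"\[(\d+),\s*(\d+),\s*(\d+),\s*(\d+)\]\s*([\w-]+)", text.strip())
    if match is None:
        return None
    x, y, w, h = (int(g) for g in match.groups()[:4])
    action = match.group(5)
    return (x, y, w, h), action
```

For example, `parse_endovla_output("[120, 80, 40, 30] upper-right")` yields the box `(120, 80, 40, 30)` and the action `"upper-right"`.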

The Training Data: EndoVLA-Motion

Since no dataset existed for this specific problem, the researchers built their own: EndoVLA-Motion.

Figure 5: (a) Data collection using two phantoms. (b) Robotic setup.

They used realistic stomach phantoms (silicone models of the stomach) and a robotic endoscope to record 6,000 image-action pairs. They defined three specific tasks that mimic real surgical needs, illustrated below:

Figure 1: EndoVLA enables robust autonomous tracking in endoscopic procedures, demonstrating effective zero-shot generalization capabilities across general scenes and sequential tracking tasks.

  1. Polyp Tracking: Keeping a protruding lesion in the center of the view.
  2. Abnormal Region Tracking: Following a flat, discolored area of tissue (harder to see).
  3. Circular Marker Following: A complex sequential task where the robot must follow a ring of markers, simulating a circumferential cutting guide used in tumor removal.

The Core Method: Dual-Phase Fine-Tuning (DFT)

The true innovation of this paper is the training strategy. If you just train an LLM on the data once (standard Supervised Fine-Tuning), it tends to learn the “average” behavior. It might get the general idea but lacks the razor-sharp precision needed for surgery.

The authors propose a Dual-Phase Fine-Tuning (DFT) approach. Think of it like training a medical student: first, they learn from textbooks (Supervised Learning), and then they learn by practicing and getting grades on their performance (Reinforcement Learning).

Phase 1: Supervised Fine-Tuning (SFT)

In the first phase, the model builds a baseline understanding. Using Low-Rank Adaptation (LoRA)—a technique for fine-tuning huge models efficiently—the model is trained on the EndoVLA-Motion dataset. It learns to predict the bounding box and the action that match the human demonstration.

While SFT provides a good starting point, the researchers found it wasn’t enough. The model often struggled with the complex “Circular Marker” task, where it needed to understand a sequence of movements rather than just spotting a single object.
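Since LoRA is central to this phase, the mechanics are worth a quick look. The sketch below is a pure-Python illustration of the LoRA idea—a frozen weight matrix plus a trainable low-rank update—not EndoVLA's actual training code; the function names and the tiny matrices are illustrative only.

```python
# Minimal LoRA sketch: instead of updating the full weight matrix W,
# train a low-rank update B @ A. Shapes and scaling follow the original
# LoRA formulation; this is a toy illustration, not the paper's code.

def matmul(X, Y):
    """Naive matrix multiply for lists of lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def lora_forward(W, A, B, x, alpha=1.0):
    """y = (W + (alpha / r) * B @ A) @ x, where r is the adapter rank.
    A: (r, d_in), B: (d_out, r). B starts at zero, so training begins
    from the frozen base model's behavior."""
    r = len(A)
    delta = matmul(B, A)  # low-rank update, shape (d_out, d_in)
    y = []
    for i in range(len(W)):
        row = [W[i][j] + (alpha / r) * delta[i][j] for j in range(len(W[0]))]
        y.append(sum(row[j] * x[j] for j in range(len(x))))
    return y
```

With `B` initialized to zeros (the standard LoRA init), `lora_forward` reduces to the base model's `W @ x`, which is exactly why fine-tuning starts from a sane baseline.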

Phase 2: Reinforcement Fine-Tuning (RFT)

To push the model from “good” to “expert,” the researchers employed Reinforcement Learning (RL). Specifically, they used a technique involving Verifiable Rewards.

In standard RL, you might need a human to say “good robot” or “bad robot.” That doesn’t scale. Instead, EndoVLA uses mathematical rules to automatically grade the model’s output.

The optimization is handled by an algorithm called Group Relative Policy Optimization (GRPO).

Equation: The GRPO objective function

GRPO is sophisticated because it doesn’t judge a single attempt in isolation; it samples a group of attempts for the same input and optimizes the policy using each attempt’s reward relative to the group average. This stabilizes the training, preventing the model from learning erratic behaviors.
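The group-relative scoring at the heart of GRPO can be sketched in a few lines. This shows only the advantage computation; the full objective also involves a clipped policy ratio and a KL penalty, which are omitted here.

```python
def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled response by how much better it is than the
    group mean, normalized by the group's spread. Simplified sketch of
    the GRPO advantage term; clipping and KL regularization omitted."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

For a group of rewards `[1.0, 0.0, 1.0, 0.0]`, the successful attempts receive an advantage near +1 and the failures near -1, so the policy is pushed toward the behaviors that beat their own group's average.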

The Three Verifiable Rewards

How does the system grade itself? The researchers designed three specific reward functions that guide the model toward perfection:

1. IoU Reward (Perception Accuracy) This measures how well the model sees the target. It calculates the Intersection over Union between the predicted bounding box and the ground truth.

Equation 5: IoU Reward Formula

If the model says the polyp is in the top-left but it’s actually in the center, this reward collapses toward zero. This forces the model to be an accurate observer.
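The IoU computation itself is standard. A minimal sketch for boxes in the `[x, y, w, h]` format, assuming `(x, y)` is the top-left corner (the paper's exact corner-vs-center convention is an assumption here):

```python
def iou_xywh(box_a, box_b):
    """Intersection over Union for boxes given as (x, y, w, h),
    with (x, y) the top-left corner. 1.0 = perfect overlap, 0.0 = none."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))  # intersection width
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))  # intersection height
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0
```

Identical boxes score 1.0, disjoint boxes score 0.0, and a half-overlapping pair lands in between, which is what makes IoU a smooth, verifiable grading signal.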

2. Motion Angle Reward (Action Accuracy) Did the robot move in the right direction? This is a binary reward. If the ground truth says “move upper-right” and the model predicts “move upper-right,” it gets a 1.0. Otherwise, it gets nothing.

Equation 6: Motion Angle Reward Formula

The robot’s movement logic is defined by discrete motor increments. For example, “upper-right” corresponds to specific adjustments in the tendon motors:

Equation 1: Motor control definitions
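The action side can be sketched as a lookup plus a binary check. The motion-angle reward follows the paper's description; the tendon-motor increments below are purely hypothetical placeholder values (the paper's actual Equation 1 defines the real mapping), and the full direction vocabulary is an assumption.

```python
# Hypothetical mapping from discrete directions to tendon-motor
# increments (delta_motor_1, delta_motor_2). Placeholder values only;
# the paper's Equation 1 defines the actual increments.
MOTOR_DELTAS = {
    "up":          (+1,  0),
    "down":        (-1,  0),
    "left":        ( 0, -1),
    "right":       ( 0, +1),
    "upper-right": (+1, +1),
    "upper-left":  (+1, -1),
    "lower-right": (-1, +1),
    "lower-left":  (-1, -1),
    "stop":        ( 0,  0),
}

def motion_angle_reward(predicted, ground_truth):
    """Binary reward: 1.0 if the predicted discrete direction matches
    the labeled direction exactly, 0.0 otherwise."""
    return 1.0 if predicted == ground_truth else 0.0
```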

3. Format Reward (Structural Integrity) LLMs can sometimes hallucinate or output text in the wrong format (e.g., writing a paragraph when you asked for coordinates). The format reward ensures the output strictly follows the required syntax: [x, y, w, h] action.

Equation 7: Format Reward Formula
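A format reward like this is typically just a grammar check. A minimal sketch, assuming the `[x, y, w, h] action` syntax described above (the exact regex used in the paper is an assumption):

```python
import re

# Required output syntax: '[x, y, w, h] action'. The reward is 1.0 when
# the response matches this grammar, 0.0 otherwise. The pattern below is
# an assumed approximation of the paper's actual format check.
FORMAT_PATTERN = re.compile(r"^\[\d+,\s*\d+,\s*\d+,\s*\d+\]\s+[a-z-]+$")

def format_reward(response):
    """1.0 for a well-formed '[x, y, w, h] action' string, else 0.0."""
    return 1.0 if FORMAT_PATTERN.match(response.strip()) else 0.0
```

Because this check is a pure function of the output string, it can grade millions of rollouts with no human in the loop, which is what makes the reward "verifiable."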

By combining these three signals, the RFT phase fine-tunes the model to be accurate in vision, correct in action, and reliable in format.

Experiments and Results

Does this complex two-phase training actually work better? The results are compelling.

Quantitative Analysis

The researchers compared three versions of their model:

  1. SFT Only: Trained just on the dataset.
  2. RFT Only: Trained just with reinforcement learning (starting from a base model).
  3. DFT (SFT + RFT): The proposed method.

The table below reveals a massive difference, particularly in the difficult “CC” (Circular Cutting) task.

Table 2: IoU of bounding box and PR (%) of action prediction

Look at the CC column. The SFT model only achieved an IoU (visual accuracy) of 11.0%. The RFT model was even worse at 2.8%. But the DFT model jumped to 48.5%. This proves that neither supervision nor reinforcement is enough on its own; they must be combined. The SFT phase provides the necessary “priors” (basic knowledge), and the RFT phase refines that knowledge to handle complex scenarios.

Visualizing the Improvement

The qualitative results show why the numbers are so different.

Figure 8: Qualitative results of bounding box prediction and motion output

In Figure 8, look at the predictions (red boxes) versus the ground truth (green boxes).

  • SFT (Top Row): Predicts boxes reasonably well for simple tasks but fails completely on the third column (CC), predicting a tiny box in the wrong place.
  • RFT (Middle Row): Tends to predict massive, incorrect bounding boxes. Without the supervised foundation, the RL agent struggles to learn what the object actually looks like.
  • SFT+RFT (Bottom Row): The tracking is tight and accurate across all tasks. The red box overlaps the green box almost perfectly.

Real-World Robot Performance

Simulation is one thing, but does it work on a physical robot? The team deployed EndoVLA on a real Olympus-style endoscope inside stomach phantoms.

Table 3: Performance in real-world endoscopic tracking tasks.

In the Polyp Tracking (PP) and Abnormal Region (AR) tasks, the combined model achieved a 100% success rate in moving toward the target. In the highly complex Circular Cutting (CC) task, which involves following a sequence of markers in a specific order, it was the only model to complete the full circle. Its 10% completion rate underscores the extreme difficulty of the task, but it still significantly outperformed the 0% of every other method.

Figure 3: Example of successful rollouts on the real-world endoscopic tasks

Surprising Generalization

Perhaps the most fascinating result of the paper is how well EndoVLA generalized to things it had never seen before. Remember, this model was fine-tuned on stomach phantoms.

The researchers tested it on:

  1. Fruit Sequences: Tracking a banana, apple, and watermelon in order.
  2. Outdoor Scenes: Tracking a hole in the ground.
  3. Icons: Tracking letters and symbols.

Figure 4: Generalization examples: CORL characters, fruit, and outdoor holes.

Despite never training on fruit or trees, EndoVLA performed remarkably well.

Table 4: SR (%) in general scene tracking tasks

As shown in Table 4, the SFT+RFT model achieved 100% success on the fruit sequence and 90% success on the hole tracking. This suggests that the “Dual-Phase” training didn’t just teach the model to memorize stomach textures; it taught the model the fundamental concept of following visual instructions. It learned the semantic link between “track the target” and the corresponding motor actions, regardless of what the target actually looks like.

Conclusion and Future Implications

EndoVLA represents a significant step forward for medical robotics. It moves us away from rigid, hand-coded tracking algorithms toward flexible, semantic agents that understand the task at hand.

The key takeaways are:

  1. Vision-Language-Action works for surgery: We can control surgical robots using natural language prompts and visual feedback.
  2. Dual-Phase Fine-Tuning is essential: You cannot rely on Supervised Learning or Reinforcement Learning alone. Combining them (SFT for structure, RFT for precision) unlocks state-of-the-art performance.
  3. Generalization is possible: A robustly trained VLA agent can apply its tracking logic to completely new environments, hinting at a future where surgical robots could adapt to different patient anatomies without retraining.

While the system is currently tested on phantoms and has limitations regarding speed (running at ~2Hz), the foundation is laid. As these models become faster and are trained on real clinical video data, we may soon see AI agents acting as intelligent, reliable assistants to surgeons, keeping a watchful eye on pathology even in the most chaotic environments.