Imagine standing in a park and pointing to a distant tree, telling a friend, “Go over there.” Your friend sees your finger, estimates the distance, and walks toward it, adjusting their path to avoid a bench along the way. This interaction is intuitive, relying on visual understanding and common sense.

Now, imagine trying to get a drone to do the same thing. Traditionally, this has been a nightmare. You either need a joystick to manually pilot it, or you need to train a complex neural network on thousands of hours of flight data just to recognize a “tree.”

But what if drones could understand the world like we do? What if they could just see, point, and fly?

In a recent paper from National Yang Ming Chiao Tung University and National Taiwan University, researchers propose See, Point, Fly (SPF), a framework that allows Unmanned Aerial Vehicles (UAVs) to navigate purely from natural language instructions, with no task-specific training. By leveraging the power of modern Vision-Language Models (VLMs), SPF achieves state-of-the-art performance in both simulation and the real world.

In this post, we will deconstruct how SPF works, why it outperforms trained policies, and how it turns 2D pixels into 3D flight.

Three scenarios of drone navigation: following a dynamic target, navigating a long hallway, and searching for a person.

The Problem: Why is Drone Navigation So Hard?

Autonomous aerial navigation sits at the intersection of three difficult fields:

  1. Visual Reasoning: The drone must understand unstructured environments (clutter, people, obstacles).
  2. Language Understanding: “Fly to the red car” is different from “Find a safe place to land.”
  3. Control: The system must output precise motor commands (yaw, pitch, throttle).

The Limits of End-to-End Learning

Conventional methods typically use “end-to-end” policy learning. Researchers collect massive datasets of expert pilots flying drones and train a neural network to map images directly to motor controls. While this works in the lab, it is brittle. If you train a drone in a forest, it will likely crash in a warehouse. It cannot generalize to new environments or understand complex, free-form instructions.

The Problem with Text-Based VLMs

With the rise of Large Language Models (LLMs) and VLMs (like GPT-4V or Gemini), a new approach emerged: show the VLM an image and ask it what to do.

However, VLMs are designed to generate text, not flight dynamics. Asking a VLM to output “Throttle: 0.5, Yaw: 10 degrees” usually fails because language models struggle to produce precise, continuous numeric values. Previous attempts tried to simplify this by asking the VLM to choose from a list of skills (e.g., “Move Forward,” “Turn Left”), but the resulting jerky, discrete movement lacks the smoothness and precision required for real-world flight.

The Core Insight: Navigation as “Pointing”

The researchers behind See, Point, Fly (SPF) had a brilliant insight: Don’t ask the VLM to fly the drone. Ask the VLM to point at the target.

VLMs are exceptionally good at understanding images and answering questions like “Where is the person in the green shirt?” by providing a bounding box or a point on the image. This is a 2D spatial grounding task.

If the VLM can identify where on the 2D image the drone should go, we can use simple geometry to calculate how the drone should move in 3D space to get there. This decouples the high-level reasoning (handled by the VLM) from the low-level control (handled by geometry).

The Method: How SPF Works

The SPF framework operates in a continuous loop: See the environment, Point to the target on the image, and Fly towards it. Because it uses a pre-trained VLM, it requires zero training on flight data.

Figure 2: A diagram showing the pipeline: Drone camera -> VLM -> 2D waypoint -> Action to Control -> Flight.

As shown in Figure 2 above, the pipeline consists of four stages. Let’s break them down.

Stage 1: VLM-Based Action Planning (See & Point)

At every timestep \(t\), the drone captures an image (\(I_t\)). This image, along with the user’s text instruction (\(l\)), is fed into a VLM.

The VLM is prompted to output a “waypoint plan.” Instead of free-form text, it outputs a 2D coordinate \((u, v)\) on the image, essentially a pixel location representing the immediate target. It also outputs a discretized depth label (\(d_{VLM}\)), which is the model’s guess of how far away the target is (on a scale of 1 to \(L\)).

Mathematically, the system looks for the most likely waypoint sequence \(w\) given the instruction and image:

\[ w^{\ast} = \arg\max_{w} \; p\left(w \mid l, I_t\right) \]

If the instruction includes constraints like “avoid obstacles,” the VLM also detects bounding boxes for obstacles and selects a waypoint that guides the drone around them. This effectively turns a complex navigation problem into a visual annotation task.
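To make this concrete, here is a minimal sketch of what the “See & Point” query could look like in Python. The prompt wording, the JSON schema, and the `query_vlm` helper are illustrative assumptions, not the paper’s exact interface; the point is simply that the VLM returns a pixel waypoint, a depth label, and optional obstacle boxes that are easy to parse.

```python
import json
from dataclasses import dataclass

@dataclass
class WaypointPlan:
    u: int            # horizontal pixel coordinate of the target waypoint
    v: int            # vertical pixel coordinate of the target waypoint
    depth_label: int  # discretized depth guess d_VLM, on a scale of 1..L
    obstacles: list   # optional obstacle bounding boxes [x1, y1, x2, y2]

# Hypothetical prompt: the wording and JSON schema are assumptions,
# not the paper's verbatim prompt.
PROMPT_TEMPLATE = (
    "You are piloting a drone. Instruction: {instruction}\n"
    "Look at the attached camera frame ({width}x{height} pixels).\n"
    "Return JSON: {{\"waypoint\": [u, v], \"depth\": d (1-{L}), "
    "\"obstacles\": [[x1, y1, x2, y2], ...]}}"
)

def plan_waypoint(query_vlm, image, instruction, width, height, L=10):
    """Ask the VLM to 'point' at the next waypoint on the image.

    `query_vlm(image, prompt) -> str` is a stand-in for whatever VLM API
    (e.g. Gemini or GPT-4V) is used; it must return the model's text reply.
    """
    prompt = PROMPT_TEMPLATE.format(
        instruction=instruction, width=width, height=height, L=L
    )
    reply = query_vlm(image, prompt)
    data = json.loads(reply)  # in practice you would validate and retry here
    u, v = data["waypoint"]
    return WaypointPlan(u=u, v=v,
                        depth_label=int(data["depth"]),
                        obstacles=data.get("obstacles", []))
```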

Stage 2: Adaptive Step Size (The “Intuition” Layer)

A raw depth guess from a VLM isn’t accurate enough for safe flight. If the drone blindly followed the predicted depth, it might overshoot or crash.

To solve this, SPF introduces an Adaptive Travel Distance Scaling mechanism. The idea is simple: if the target is far away, take big steps. If the target (or an obstacle) is close, take small, cautious steps.

The system converts the VLM’s discrete depth label (\(d_{VLM}\)) into an adjusted physical distance (\(d_{adj}\)) using a non-linear scaling curve:

Equation for adaptive distance adjustment d_adj.

Here, \(s\) is a global scaling factor and \(p\) controls the non-linearity. This allows the drone to move efficiently in open spaces while slowing down for precision maneuvers near targets.
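The paper gives the exact scaling curve; the sketch below uses one plausible power-law form, \(d_{adj} = s \cdot (d_{VLM}/L)^{p}\), purely as an assumed stand-in to illustrate the behaviour: large steps when the target reads as far away, small steps when it reads as close.

```python
def adaptive_travel_distance(d_vlm: int, L: int = 10,
                             s: float = 3.0, p: float = 2.0) -> float:
    """Map the VLM's discrete depth label (1..L) to a travel distance in meters.

    Assumed power-law form (an illustration, not the paper's exact curve):
        d_adj = s * (d_vlm / L) ** p
    With p > 1, a far-away label (d_vlm near L) yields a large step,
    while a nearby label (d_vlm near 1) yields a small, cautious step.
    """
    d_vlm = max(1, min(L, d_vlm))   # clamp to the valid label range
    return s * (d_vlm / L) ** p

# Example: with s = 3 m and p = 2, label 10 -> 3.0 m, label 3 -> 0.27 m.
```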

Stage 3: 2D-to-3D Unprojection (Point to Fly)

Now comes the “Fly” part. We have a 2D pixel \((u, v)\) and an adjusted distance \(d_{adj}\). How do we turn this into motor commands?

The system uses the pinhole camera model. This is a standard geometric model that relates 3D coordinates in the world to 2D pixels on an image sensor. By “unprojecting” the 2D pixel using the camera’s field of view (FOV), the system calculates a 3D displacement vector \((S_x, S_y, S_z)\).

Figure 3: The geometry of converting 2D waypoints to 3D vectors and then to Yaw, Pitch, and Throttle.

As illustrated in Figure 3(b) above, the 2D point is lifted into 3D space relative to the drone’s body. The equations for this transformation are:

Equations for calculating Sx, Sy, and Sz based on u, v, and angles alpha and beta.

Here, \(\alpha\) and \(\beta\) represent the camera’s horizontal and vertical field-of-view angles. \(S_y\) represents the forward motion, while \(S_x\) and \(S_z\) represent lateral and vertical displacements.
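The paper gives the exact expressions for \(S_x\), \(S_y\), and \(S_z\); the sketch below shows one standard pinhole-style unprojection under assumed conventions (x = right, y = forward, z = up, with the pixel’s offset from the image centre treated as an angular offset within the FOV), which captures the same idea.

```python
import math

def unproject_waypoint(u, v, d_adj, width, height, alpha_deg, beta_deg):
    """Lift a 2D waypoint (u, v) into a 3D displacement (Sx, Sy, Sz) in the
    drone's body frame (x = right, y = forward, z = up).

    alpha_deg / beta_deg are the camera's horizontal / vertical FOV in degrees.
    This is one standard pinhole-style unprojection, not necessarily the
    paper's exact formulation.
    """
    # Angular offset of the pixel from the optical axis.
    theta_x = math.radians(((u / width) - 0.5) * alpha_deg)   # positive = right of centre
    theta_z = math.radians((0.5 - (v / height)) * beta_deg)   # positive = above centre

    # Travel d_adj along the forward axis, offset laterally and vertically
    # by the tangent of those angular offsets.
    s_y = d_adj                      # forward
    s_x = d_adj * math.tan(theta_x)  # lateral
    s_z = d_adj * math.tan(theta_z)  # vertical
    return s_x, s_y, s_z
```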

Stage 4: Reactive Control

Finally, this 3D vector \((S_x, S_y, S_z)\) is decomposed into the drone’s native control primitives: Yaw (rotation), Pitch (forward/backward tilt), and Throttle (up/down).

Equations converting the S vector into Delta Theta, Delta Pitch, and Delta Throttle.

The drone executes these velocity commands, the camera captures a new frame, and the loop repeats. This closed-loop system allows the drone to correct its path continuously, making it robust to moving targets or wind.
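As a final sketch, the displacement vector can be decomposed into incremental yaw, pitch (forward), and throttle commands along the lines below. The gains and the exact mapping are assumptions for illustration; a real deployment would send the resulting commands through the drone’s SDK and then re-plan from the next camera frame.

```python
import math

def displacement_to_commands(s_x, s_y, s_z,
                             yaw_gain=1.0, fwd_gain=1.0, thr_gain=1.0):
    """Convert a body-frame displacement (Sx right, Sy forward, Sz up) into
    incremental yaw / pitch / throttle commands.

    One simple decomposition (gains and mapping are illustrative):
      - yaw toward the waypoint's horizontal bearing,
      - pitch forward in proportion to the in-plane distance,
      - adjust throttle in proportion to the vertical offset.
    """
    delta_yaw = yaw_gain * math.degrees(math.atan2(s_x, s_y))  # turn to face the waypoint
    delta_pitch = fwd_gain * math.hypot(s_x, s_y)              # forward travel
    delta_throttle = thr_gain * s_z                            # climb or descend
    return delta_yaw, delta_pitch, delta_throttle

# Closed-loop use: capture a frame, ask the VLM for a waypoint (Stage 1),
# scale the step (Stage 2), unproject it (Stage 3), send these commands,
# then repeat with the next frame.
```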

Experimental Results

The researchers compared SPF against state-of-the-art baselines, including:

  • TypeFly: Uses an LLM to select discrete skills (e.g., “Move Left”).
  • PIVOT: Generates candidate paths on the image and asks a VLM to pick the best one.

They tested in both a high-fidelity simulator (DRL Simulator) and the real world using DJI Tello drones.

Simulation Performance

The results were stark. In simulation, SPF achieved a 93.9% success rate across diverse tasks, compared to just 28.7% for PIVOT and nearly 0% for TypeFly.

Table 1: Success rates. SPF achieves 93.9% in simulation and 92.7% in the real world, vastly outperforming baselines.

As seen in Table 1, SPF excelled in every category, including complex reasoning tasks (“Fly to the object that helps me when I’m thirsty”) and long-horizon navigation.

Visualizing the Flight

The qualitative difference is visible in the flight trajectories. In Figure 4 below, you can see the paths taken by the different models in the simulator. The green path (SPF) moves smoothly around obstacles to the target. The blue path (PIVOT) often gets stuck or takes inefficient routes, while the purple path (TypeFly) fails to generate valid commands.

Figure 4: Top-down view of flight trajectories in simulation. Green paths (SPF) are smooth and successful; the others are erratic.

Real-World Success

Real-world environments are messy. Lighting changes, sensors are noisy, and aerodynamics are unpredictable. Despite this, SPF achieved a 92.7% success rate in real-world experiments.

The system proved capable of “Dynamic Target Following”—keeping pace with a walking human—and “Reasoning-Driven Search,” where it had to find a specific object based on a vague description.

Figure 5: Real-world flight trajectories. Green represents takeoff; magenta represents the task trajectory.

Figure 5 shows the real-world trajectories. The drone effectively identifies targets (like a person) and navigates toward them smoothly.

Speed and Efficiency

It’s not just about reaching the goal; it’s about getting there efficiently. The researchers analyzed completion time and found that SPF was significantly faster than the baselines.

Bar chart showing completion times. SPF is consistently faster across various tasks.

Does the Adaptive Step Size Matter?

One might wonder if the complex math for the “Adaptive Step Size” is necessary. Could we just use a fixed speed?

The ablation study in Table 3 proves its value. Compared with a fixed step size, the adaptive controller cut completion time by more than half (from 61s to 28s in one task) while maintaining a 100% success rate.

Table 3: Fixed vs. adaptive step sizes. Adaptive is much faster.

This confirms that the “intuition” of slowing down near targets and speeding up in open space is critical for efficient autonomous flight.

Why This Matters

See, Point, Fly represents a significant shift in robotics.

  1. Generalization: Because it relies on general-purpose VLMs (like Gemini or GPT-4), it inherits their common sense. You can ask it to “Find the backpack” in a room it has never seen before, and it works.
  2. Training-Free: There is no expensive data collection or policy training phase. You can deploy this code on a drone today.
  3. Modularity: As VLMs get better, SPF gets better. If a new, faster VLM is released next week, you can simply plug it into the framework to improve the drone’s perception.

Conclusion

Navigating the physical world is one of the hardest challenges for AI. While end-to-end deep learning has made strides, it often lacks the flexibility to handle the infinite variety of the real world.

SPF bridges the gap by treating navigation as a visual grounding problem. By letting the VLM do what it does best (interpreting images and language) and letting geometry do what it does best (calculating 3D vectors), we get a system that is robust, versatile, and surprisingly capable.

As we look toward a future of delivery drones, search-and-rescue UAVs, and personal aerial assistants, frameworks like SPF suggest that the key to autonomy might not be training harder, but modeling smarter.


References & Credits

This post is based on the paper “See, Point, Fly: A Learning-Free VLM Framework for Universal Unmanned Aerial Navigation” by Chih Yao Hu, Yang-Sen Lin, et al. (National Yang Ming Chiao Tung University and National Taiwan University).