Imagine standing in a park and pointing to a distant tree, telling a friend, “Go over there.” Your friend sees your finger, estimates the distance, and walks toward it, adjusting their path to avoid a bench along the way. This interaction is intuitive, relying on visual understanding and common sense.
Now, imagine trying to get a drone to do the same thing. Traditionally, this has been a nightmare. You either need a joystick to manually pilot it, or you need to train a complex neural network on thousands of hours of flight data just to recognize a “tree.”
But what if drones could understand the world like we do? What if they could just see, point, and fly?
In a recent paper from National Yang Ming Chiao Tung University and National Taiwan University, researchers propose See, Point, Fly (SPF), a framework that lets Unmanned Aerial Vehicles (UAVs) navigate purely from natural language instructions, without any task-specific training. By leveraging the power of modern Vision-Language Models (VLMs), SPF achieves state-of-the-art performance in both simulation and the real world.
In this post, we will deconstruct how SPF works, why it outperforms trained policies, and how it turns 2D pixels into 3D flight.

The Problem: Why is Drone Navigation So Hard?
Autonomous aerial navigation sits at the intersection of three difficult fields:
- Visual Reasoning: The drone must understand unstructured environments (clutter, people, obstacles).
- Language Understanding: “Fly to the red car” is different from “Find a safe place to land.”
- Control: The system must output precise motor commands (yaw, pitch, throttle).
The Limits of End-to-End Learning
Conventional methods typically use “end-to-end” policy learning. Researchers collect massive datasets of expert pilots flying drones and train a neural network to map images directly to motor controls. While this works in the lab, it is brittle. If you train a drone in a forest, it will likely crash in a warehouse. It cannot generalize to new environments or understand complex, free-form instructions.
The Problem with Text-Based VLMs
With the rise of Large Language Models (LLMs) and VLMs (like GPT-4V or Gemini), a new approach emerged: show the VLM an image and ask it what to do.
However, VLMs are designed to generate text, not flight dynamics. Asking a VLM to output “Throttle: 0.5, Yaw: 10 degrees” usually fails because language models struggle with high-precision continuous numbers. Previous attempts tried to simplify this by asking the VLM to choose from a list of skills (e.g., “Move Forward,” “Turn Left”), but this jerky, discrete movement lacks the smoothness and precision required for real-world flight.
The Core Insight: Navigation as “Pointing”
The researchers behind See, Point, Fly (SPF) had a brilliant insight: Don’t ask the VLM to fly the drone. Ask the VLM to point at the target.
VLMs are exceptionally good at understanding images and answering questions like “Where is the person in the green shirt?” by providing a bounding box or a point on the image. This is a 2D spatial grounding task.
If the VLM can identify where on the 2D image the drone should go, we can use geometric mathematics to calculate how the drone should move in 3D space to get there. This decouples the high-level reasoning (handled by the VLM) from the low-level control (handled by geometry).
The Method: How SPF Works
The SPF framework operates in a continuous loop: See the environment, Point to the target on the image, and Fly towards it. Because it uses a pre-trained VLM, it requires zero training on flight data.

As shown in Figure 2 above, the pipeline consists of a few main stages. Let’s break them down.
Stage 1: VLM-Based Action Planning (See & Point)
At every timestep \(t\), the drone captures an image (\(I_t\)). This image, along with the user’s text instruction (\(l\)), is fed into a VLM.
The VLM is prompted to output a “waypoint plan.” Instead of free-form text or raw control values, it outputs a 2D coordinate \((u, v)\) on the image—essentially a pixel location representing the immediate target. It also outputs a discretized depth label (\(d_{VLM}\)), which is the model’s guess of how far away the target is (on a scale of 1 to \(L\)).
Mathematically, the system looks for the most likely waypoint sequence \(w\) given the instruction and image:
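Written out, that objective takes roughly the following form (a reconstruction from the description above; the paper’s exact notation may differ):

\[
w^{*} = \arg\max_{w} \; P_{\text{VLM}}\big(w \mid I_t,\, l\big), \qquad w = \big((u, v),\, d_{VLM}\big)
\]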

If the instruction includes constraints like “avoid obstacles,” the VLM also detects bounding boxes for obstacles and selects a waypoint that guides the drone around them. This effectively turns a complex navigation problem into a visual annotation task.
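Concretely, the VLM’s answer can be requested as a small structured object and parsed into a waypoint. The JSON schema and field names below are illustrative assumptions for this post, not the authors’ actual prompt format:

```python
import json

# Example VLM reply for "fly to the red car, avoid obstacles".
# The schema (keys "point", "depth", "obstacles") is a hypothetical illustration.
vlm_reply = """
{
  "point": [512, 288],
  "depth": 7,
  "obstacles": [[300, 200, 420, 360]]
}
"""

def parse_waypoint(reply: str):
    """Turn the VLM's structured answer into a pixel target, depth label, and obstacle boxes."""
    data = json.loads(reply)
    u, v = data["point"]                    # 2D pixel the drone should head toward
    d_vlm = data["depth"]                   # discretized depth label, 1..L
    obstacles = data.get("obstacles", [])   # bounding boxes to steer around
    return (u, v), d_vlm, obstacles

if __name__ == "__main__":
    (u, v), d_vlm, obstacles = parse_waypoint(vlm_reply)
    print(f"waypoint=({u}, {v}), depth label={d_vlm}, obstacles={obstacles}")
```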
Stage 2: Adaptive Step Size (The “Intuition” Layer)
A raw depth guess from a VLM isn’t accurate enough for safe flight. If the drone blindly followed the predicted depth, it might overshoot the target or crash into it.
To solve this, SPF introduces an Adaptive Travel Distance Scaling mechanism. The idea is simple: if the target is far away, take big steps. If the target (or an obstacle) is close, take small, cautious steps.
The system converts the VLM’s discrete depth score (\(d_{VLM}\)) into an adjusted physical distance (\(d_{adj}\)) using a non-linear scaling curve:

Here, \(s\) is a global scaling factor and \(p\) controls the non-linearity. This allows the drone to move efficiently in open spaces while slowing down for precision maneuvers near targets.
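As a minimal sketch, one plausible realization is a power-law curve; the functional form and constants below are assumptions for illustration, not the paper’s exact scaling function:

```python
def adaptive_distance(d_vlm: int, L: int = 10, s: float = 5.0, p: float = 2.0) -> float:
    """Map the VLM's discrete depth label (1..L) to a travel distance in meters.

    A minimal sketch assuming a power-law curve d_adj = s * (d_vlm / L) ** p;
    the paper's exact curve and constants may differ.
    """
    ratio = max(1, min(d_vlm, L)) / L   # clamp the label and normalize to (0, 1]
    return s * ratio ** p

# Far targets produce long steps; near targets produce short, cautious ones.
print(adaptive_distance(10))  # ~5.0 m when the target looks far away
print(adaptive_distance(2))   # ~0.2 m when the target (or an obstacle) is close
```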
Stage 3: 2D-to-3D Unprojection (Point to Fly)
Now comes the “Fly” part. We have a 2D pixel \((u, v)\) and an adjusted distance \(d_{adj}\). How do we turn this into motor commands?
The system uses the pinhole camera model. This is a standard geometric model that relates 3D coordinates in the world to 2D pixels on an image sensor. By “unprojecting” the 2D pixel using the camera’s field of view (FOV), the system calculates a 3D displacement vector \((S_x, S_y, S_z)\).

As illustrated in Figure 3(b) above, the 2D point is lifted into 3D space relative to the drone’s body. The equations for this transformation are:

Here, \(\alpha\) and \(\beta\) represent the camera’s horizontal and vertical field-of-view angles. \(S_y\) represents the forward displacement, while \(S_x\) and \(S_z\) represent the lateral and vertical displacements.
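A sketch of that unprojection in code. It assumes the pixel offset from the image center is converted linearly into view angles (with \(\alpha\) and \(\beta\) appearing as `fov_h_deg` and `fov_v_deg`) and that \(d_{adj}\) is treated as the forward range; the image size and FOV values are illustrative, and the paper’s exact formulation may differ slightly:

```python
import math

def unproject(u: float, v: float, d_adj: float,
              width: int = 1280, height: int = 720,
              fov_h_deg: float = 82.6, fov_v_deg: float = 52.3):
    """Lift a 2D pixel (u, v) and travel distance d_adj into a 3D displacement.

    Minimal sketch: the pixel offset from the image center is converted to
    horizontal/vertical view angles, and d_adj is taken as the forward range
    along the camera axis. FOV and resolution values here are illustrative.
    """
    # Signed offsets in [-0.5, 0.5] from the image center.
    dx = u / width - 0.5
    dy = 0.5 - v / height            # image v grows downward; flip so "up" is positive

    # Angles subtended by the pixel offset (alpha = horizontal FOV, beta = vertical FOV).
    theta_x = dx * math.radians(fov_h_deg)
    theta_y = dy * math.radians(fov_v_deg)

    s_y = d_adj                        # forward displacement
    s_x = d_adj * math.tan(theta_x)    # lateral displacement
    s_z = d_adj * math.tan(theta_y)    # vertical displacement
    return s_x, s_y, s_z

# Pixel right of and above the image center -> move right, up, and forward.
print(unproject(960, 300, d_adj=4.0))
```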
Stage 4: Reactive Control
Finally, this 3D vector \((S_x, S_y, S_z)\) is decomposed into the drone’s native control primitives: Yaw (rotation), Pitch (forward/backward tilt), and Throttle (up/down).

The drone executes these velocity commands, the camera captures a new frame, and the loop repeats. This closed-loop system allows the drone to correct its path continuously, making it robust to moving targets or wind.
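One plausible way to perform that decomposition, sketched below: yaw toward the target’s horizontal bearing, move forward at a speed proportional to the remaining horizontal distance, and climb or descend proportionally to the vertical offset. The mapping and gains are assumptions for illustration, not the paper’s exact controller:

```python
import math

def to_controls(s_x: float, s_y: float, s_z: float,
                k_forward: float = 0.5, k_vertical: float = 0.5):
    """Decompose a body-frame displacement into yaw / forward (pitch) / throttle commands.

    A hedged sketch: yaw rotates toward the target's bearing, forward speed is
    proportional to the horizontal distance, throttle to the vertical offset.
    Gains and units are illustrative.
    """
    yaw_deg = math.degrees(math.atan2(s_x, s_y))   # rotate toward the target
    forward = k_forward * math.hypot(s_x, s_y)     # pitch-forward speed
    throttle = k_vertical * s_z                    # climb (+) or descend (-)
    return yaw_deg, forward, throttle

# Slight right yaw, moderate forward speed, gentle climb.
print(to_controls(1.5, 4.0, 0.3))
```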
Experimental Results
The researchers compared SPF against state-of-the-art baselines, including:
- TypeFly: Uses an LLM to select discrete skills (e.g., “Move Left”).
- PIVOT: Generates candidate paths on the image and asks a VLM to pick the best one.
They tested in both a high-fidelity simulator (DRL Simulator) and the real world using DJI Tello drones.
Simulation Performance
The results were stark. In simulation, SPF achieved a 93.9% success rate across diverse tasks, compared to just 28.7% for PIVOT and nearly 0% for TypeFly.

As seen in Table 1, SPF excelled in every category, including complex reasoning tasks (“Fly to the object that helps me when I’m thirsty”) and long-horizon navigation.
Visualizing the Flight
The qualitative difference is visible in the flight trajectories. In Figure 4 below, you can see the paths taken by the different models in a simulator. The green line (SPF) moves smoothly around obstacles to the target. The blue line (PIVOT) often gets stuck or takes inefficient paths, while the purple line (TypeFly) fails to generate valid commands.

Real-World Success
Real-world environments are messy. Lighting changes, sensors are noisy, and aerodynamics are unpredictable. Despite this, SPF achieved a 92.7% success rate in real-world experiments.
The system proved capable of “Dynamic Target Following”—keeping pace with a walking human—and “Reasoning-Driven Search,” where it had to find a specific object based on a vague description.

Figure 5 shows the real-world trajectories. The drone effectively identifies targets (like a person) and navigates toward them smoothly.
Speed and Efficiency
It’s not just about reaching the goal; it’s about getting there efficiently. The researchers analyzed completion time and found that SPF was significantly faster than the baselines.

Does the Adaptive Step Size Matter?
One might wonder if the complex math for the “Adaptive Step Size” is necessary. Could we just use a fixed speed?
The ablation study in Table 3 proves its value. Comparing a fixed step size against the adaptive controller, the adaptive method cut completion time nearly in half (from 61s to 28s in one task) while maintaining a 100% success rate.

This confirms that the “intuition” of slowing down near targets and speeding up in open space is critical for efficient autonomous flight.
Why This Matters
See, Point, Fly represents a significant shift in robotics.
- Generalization: Because it relies on general-purpose VLMs (like Gemini or GPT-4), it inherits their common sense. You can ask it to “Find the backpack” in a room it has never seen before, and it works.
- Training-Free: There is no expensive data collection or policy training phase. You can deploy this code on a drone today.
- Modularity: As VLMs get better, SPF gets better. If a new, faster VLM is released next week, you can simply plug it into the framework to improve the drone’s perception.
Conclusion
Navigating the physical world is one of the hardest challenges for AI. While end-to-end deep learning has made strides, it often lacks the flexibility to handle the infinite variety of the real world.
SPF bridges the gap by treating navigation as a visual grounding problem. By letting the VLM do what it does best (interpreting images and language) and letting geometry do what it does best (calculating 3D vectors), we get a system that is robust, versatile, and surprisingly capable.
As we look toward a future of delivery drones, search-and-rescue UAVs, and personal aerial assistants, frameworks like SPF suggest that the key to autonomy might not be training harder, but modeling smarter.
References & Credits
This post is based on the paper “See, Point, Fly: A Learning-Free VLM Framework for Universal Unmanned Aerial Navigation” by Chih Yao Hu, Yang-Sen Lin, et al. (National Yang Ming Chiao Tung University & National Taiwan University).