Imagine you are in a taxi. You tell the driver, “Please park to the left of that red car.” The driver looks around, sees a blue truck and a white sedan, but no red car. The driver turns to you and says, “There is no red car.”

This interaction seems trivial for humans. We possess an intuitive grasp of language, spatial relationships, and object permanence. However, for an autonomous vehicle (AV), this is a monumental challenge. Most current AI systems operate under the assumption that if you give a command, the target object must be there. If you tell a standard vision-language model to “find the red car,” and there isn’t one, it will often hallucinate—desperately selecting the closest match (like the red fire hydrant or the maroon truck) just to satisfy the request.

In this post, we are diving deep into a paper titled “GENNAV: Polygon Mask Generation for Generalized Referring Navigable Regions.” This research introduces a novel architecture designed to understand natural language instructions for navigation, specifically tackling the tricky scenarios where targets might be absent, multiple targets might exist, or the target is an ambiguous region of space (like “the grass”) rather than a specific object.

The Problem: When “Go There” Isn’t Enough

The intersection of Computer Vision and Natural Language Processing (NLP) is one of the most exciting fields in AI. In the context of autonomous driving, this is often formalized as the Referring Navigable Regions (RNR) task. The goal is to identify a specific region in a camera feed based on a linguistic command.

However, the authors of GENNAV identified a critical gap in existing RNR research. Previous methods largely focused on “Thing-type” targets—countable objects with clear boundaries, like “car” or “pedestrian.” They also assumed the target always existed.

Real-world driving is messier. We often refer to “Stuff-type” regions—areas with ambiguous boundaries, like “the road next to the curb” or “the grassy patch.” Furthermore, commands can be invalid (“Stop behind the bus” when there is no bus) or refer to multiple locations (“Park next to any of those trees”).

The researchers defined a more robust task called Generalized Referring Navigable Regions (GRNR). As illustrated below, this task requires the model to handle three distinct scenarios:

  1. Single-target: A standard command with one clear destination.
  2. Multi-target: A command that could apply to several valid regions.
  3. No-target: A command referencing a missing landmark.

Figure 5: Typical examples of the GRNR task. Left: single-target. Center: multi-target. Right: no-target. The goal is to generate zero or more segmentation masks (shown in green). Unlike RNR, GRNR accommodates instructions that specify an arbitrary number of landmarks, including cases where multiple target regions exist or no target region exists.
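
To make the task interface concrete, here is a minimal, hypothetical sketch of how a GRNR sample could be represented in code. The field names are my own shorthand (they are not taken from the paper or the GRiN-Drive benchmark); the important point is that the label is a list of zero, one, or several polygons.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# A polygon is a list of (x, y) vertices in image coordinates.
Polygon = List[Tuple[float, float]]

@dataclass
class GRNRSample:
    """Hypothetical GRNR sample layout; field names are illustrative only."""
    image_path: str                     # front-camera frame
    instruction: str                    # e.g., "Park to the left of the red car"
    target_polygons: List[Polygon] = field(default_factory=list)

    @property
    def scenario(self) -> str:
        if not self.target_polygons:
            return "no-target"
        return "single-target" if len(self.target_polygons) == 1 else "multi-target"
```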

Why Current Methods Fail

Before we get into the solution, it helps to understand why previous attempts struggle here.

  1. Pixel-based Segmentation: Many existing models try to classify every single pixel in an image as “target” or “background.” This is computationally heavy and slow. In driving, milliseconds matter.
  2. Assumption of Existence: Traditional models are trained to output something. When faced with a “no-target” scenario, they force a prediction, leading to dangerous errors (e.g., identifying a drivable lane as a parking spot because the instruction mentioned a non-existent car).
  3. Visual Grounding Limits: Multimodal Large Language Models (MLLMs) like GPT-4V or Gemini are powerful, but they are often optimized for bounding boxes (rectangles) rather than precise navigable polygons. A rectangle might include a curb or a pedestrian, whereas a polygon can hug the safe driving zone tightly.

Enter GENNAV: The Architecture

To solve these problems, the researchers proposed GENNAV. It is designed to predict the existence of targets explicitly and generate efficient polygon masks rather than heavy pixel maps.

Let’s look at the high-level workflow:

Figure 1: Overview of GENNAV. The model predicts target regions from a natural language instruction and a front camera image captured by a moving vehicle.

The system takes two inputs: a front-camera image and a natural language instruction. It processes these through three specialized modules to produce a final output that says, “Yes, there is a target (or multiple),” and “Here are the exact coordinates.”

The architecture is composed of three key modules:

  1. Landmark Distribution Patchification Module (LDPM)
  2. Visual-Linguistic Spatial Integration Module (VLSiM)
  3. Existence Aware Polygon Segmentation Module (ExPo)

Let’s break down each module.

Figure 2: Overall architecture of GENNAV. DA represents Depth Anything [45]. The green, red, and blue regions in this figure represent the LDPM, VLSiM, and ExPo modules, respectively.
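
Before breaking down each module, here is a hedged, end-to-end sketch of how the three pieces might compose in a forward pass. The function names and tensor shapes are placeholders of my own (the stubs return zeros so the sketch executes); they mirror the data flow in Figure 2 rather than the authors' actual code.

```python
import numpy as np

# Placeholder stand-ins (returning zeros) so the sketch executes end to end;
# in GENNAV these are learned networks, with Depth Anything providing the
# pseudo-depth and PIDNet the road mask, per the paper.
def ldpm(image):
    return np.zeros((64, 256))            # h_ldp: fine-grained patch features

def visual_encoder(image):
    return np.zeros((64, 256))            # f_vis

def depth_estimator(image):
    return np.zeros(image.shape[:2])      # f_depth: pseudo-depth map

def road_segmenter(image):
    return np.zeros(image.shape[:2])      # f_road: drivable-road mask

def text_encoder(instruction):
    return np.zeros(256)                  # h_inst: linguistic features

def vlsim(f_vis, f_depth, f_road, h_inst):
    return np.zeros((64, 256))            # fused multimodal features

def expo(h_ldp, h_fused):
    return "no-target", []                # existence label, polygon list

def gennav_forward(image, instruction):
    """Schematic GENNAV data flow: image + instruction -> existence + polygons."""
    h_ldp = ldpm(image)                                   # LDPM (green in Fig. 2)
    h_fused = vlsim(visual_encoder(image),
                    depth_estimator(image),
                    road_segmenter(image),
                    text_encoder(instruction))            # VLSiM (red in Fig. 2)
    existence, polygons = expo(h_ldp, h_fused)            # ExPo (blue in Fig. 2)
    return existence, polygons

existence, polygons = gennav_forward(np.zeros((720, 1280, 3)), "Park next to the tree")
```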

1. Landmark Distribution Patchification Module (LDPM)

Represented by the green region in Figure 2.

Standard image processing often resizes images to a smaller square (e.g., 224x224 pixels) to save processing power. However, in driving scenarios, important landmarks (like a distant stop sign or a pedestrian down the block) become tiny, blurry blobs at that resolution.

Instead of uniformly chopping up the image, the LDPM uses a smarter strategy. It divides the image into patches based on the spatial distribution of landmarks. The researchers analyzed where landmarks typically appear in driving datasets (usually along the horizon and road edges, rarely in the sky).

By focusing high-resolution processing power on these “hotspot” areas, GENNAV extracts fine-grained visual features (\(h_{ldp}\)) where they matter most, without wasting resources on the sky or the hood of the car.
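
The exact patch layout comes from the paper's analysis of landmark distributions; the sketch below is only my own illustration of the general idea, under the assumed split that a horizontal band around the horizon is cut into fine patches while the sky above and the near road/hood area are kept coarse.

```python
import numpy as np

def landmark_aware_patches(image: np.ndarray,
                           horizon_band=(0.35, 0.65),
                           fine=64, coarse=128):
    """Illustrative (not the paper's) non-uniform patchification.

    The band around the horizon, where landmarks concentrate in driving
    scenes, is cut into small `fine` patches; the sky above and the
    road/hood below are cut into large `coarse` patches.
    """
    h, w = image.shape[:2]
    top, bottom = int(h * horizon_band[0]), int(h * horizon_band[1])
    patches = []

    def cut(region, size):
        rh, rw = region.shape[:2]
        for y in range(0, rh - size + 1, size):
            for x in range(0, rw - size + 1, size):
                patches.append(region[y:y + size, x:x + size])

    cut(image[:top], coarse)        # sky: few large patches
    cut(image[top:bottom], fine)    # horizon band: many small patches
    cut(image[bottom:], coarse)     # near road / hood: few large patches
    return patches

patches = landmark_aware_patches(np.zeros((720, 1280, 3)))
```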

2. Visual-Linguistic Spatial Integration Module (VLSiM)

Represented by the red region in Figure 2.

Understanding a command like “Park on the road next to the car” requires three types of understanding:

  • Visual: What do objects look like? (The car).
  • Depth: How far away are things? (Next to).
  • Semantic: Which areas are actually driveable? (The road).

VLSiM fuses these concepts. It takes the image features (\(f_{vis}\)) and combines them with a pseudo-depth image (\(f_{depth}\)) generated by a depth estimation model (Depth Anything V2). It also incorporates a road segmentation mask (\(f_{road}\)) generated by a semantic segmentation model (PIDNet).

This fusion prevents the model from selecting non-navigable areas. For example, even if the “sky” visually matches a description of “blue open space,” the road mask ensures the car never tries to drive there.

The integration is mathematically represented as:

Equation 1: Multimodal feature integration formula

Here, \(h_{inst}\) is the linguistic feature (the text command). Notice how the text is multiplied (Hadamard product \(\odot\)) with both the visual-depth features and the road-depth features. This forces the language to “attend” to both the physical layout and the semantic driveability of the scene.
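
The placeholder above stands in for the paper's Equation 1. Read purely from the description in this section (the output symbol \(h_{vlsi}\), the concatenation brackets, and the \(\oplus\) combination are my own notation, not necessarily the paper's), the fusion has roughly this shape:

\[
h_{vlsi} = \big(h_{inst} \odot [f_{vis}; f_{depth}]\big) \oplus \big(h_{inst} \odot [f_{road}; f_{depth}]\big)
\]

The structural point is the pair of Hadamard products: the same linguistic feature gates both a visual-depth stream and a road-depth stream, so the language must agree with both the geometry and the driveability of a region before that region can light up.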

3. Existence Aware Polygon Segmentation Module (ExPo)

Represented by the blue region in Figure 2.

This is the decision-making core. Unlike pixel-based methods that paint an image, ExPo is a Transformer-based decoder. It takes the rich features from the previous modules and performs two simultaneous tasks:

  1. Classification Head: It predicts a probability distribution over three states: {no-target, single-target, multi-target}. This is the “brain” that allows the car to say, “I can’t do that, the target isn’t here.”
  2. Regression Head: If targets exist, it predicts the precise \((x, y)\) coordinates of the vertices of a polygon that outlines the target region.

Why polygons? Defining a region by 6 to 12 points is computationally instant compared to classifying 50,000 pixels. This efficiency is what allows GENNAV to run in real-time.
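
As a rough illustration of what a dual-head output of this kind could look like, here is a minimal PyTorch sketch under assumed dimensions (the layer sizes, vertex count, and head structure are my own simplifications, not the authors' Transformer-based implementation):

```python
import torch
import torch.nn as nn

class ExistenceAwarePolygonHead(nn.Module):
    """Minimal sketch of an ExPo-style output head (illustrative only).

    From a fused feature vector it predicts (1) a distribution over
    {no-target, single-target, multi-target} and (2) polygon vertices.
    """
    def __init__(self, feat_dim=256, max_polygons=4, vertices=12):
        super().__init__()
        self.cls_head = nn.Linear(feat_dim, 3)                            # existence states
        self.reg_head = nn.Linear(feat_dim, max_polygons * vertices * 2)  # (x, y) per vertex
        self.max_polygons, self.vertices = max_polygons, vertices

    def forward(self, h_fused):                      # h_fused: (batch, feat_dim)
        existence_logits = self.cls_head(h_fused)    # (batch, 3)
        verts = self.reg_head(h_fused).sigmoid()     # normalized coords in [0, 1]
        polygons = verts.view(-1, self.max_polygons, self.vertices, 2)
        return existence_logits, polygons

head = ExistenceAwarePolygonHead()
logits, polys = head(torch.randn(2, 256))
print(logits.shape, polys.shape)   # torch.Size([2, 3]) torch.Size([2, 4, 12, 2])
```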

The loss function used to train this module is a combination of classification error and geometric error:

Equation 2: Loss function combining classification and regression

The term \(\mathbb{I}\) is an indicator function. It essentially says: “Only try to minimize the shape error (\(\ell_1\)) if a target actually exists.” This prevents the model from getting confused by trying to draw shapes for non-existent objects.
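
The placeholder above stands in for the paper's Equation 2. Schematically (the weight \(\lambda\) and the symbols \(\hat{p}\), \(\hat{V}\) are my own placeholders), it has the form:

\[
\mathcal{L} = \mathcal{L}_{cls}(\hat{p}, p) + \lambda \, \mathbb{I}[\text{target exists}] \, \ell_1(\hat{V}, V)
\]

where \(\hat{p}\) is the predicted existence distribution, \(p\) the true one, and \(\hat{V}, V\) the predicted and ground-truth polygon vertices. For no-target samples the indicator zeroes out the geometric term, so only the classification error is penalized.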

The GRiN-Drive Benchmark

To test this architecture, the authors faced a hurdle: there were no good datasets that combined “stuff” targets (roads/grass), multi-targets, and no-targets for autonomous driving.

So, they built their own: GRiN-Drive. They combined data from existing datasets (Talk2Car and Refer-KITTI) and augmented them.

  • No-target generation: They swapped instructions between images (e.g., putting a “find the truck” instruction on an image with only cars) and verified the absence of the target using GPT-4o and manual checks (see the sketch after this list).
  • Multi-target generation: They identified frames with multiple instances of the same object (e.g., two pedestrians) and annotated polygons for both.
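
Here is a hedged sketch of the instruction-swapping idea. The dictionary fields and the simple category lookup are my own placeholders; the real pipeline relied on GPT-4o plus manual verification rather than a category check.

```python
import random

def make_no_target_samples(samples, num_samples, seed=0):
    """Pair images with instructions whose landmark they do not contain.

    Each element of `samples` is a dict with 'image', 'instruction',
    'landmark', and the set of object 'categories' visible in the image
    (illustrative fields only).
    """
    rng = random.Random(seed)
    no_target = []
    for _ in range(num_samples * 50):                       # bounded attempts
        if len(no_target) == num_samples:
            break
        img_s, instr_s = rng.sample(samples, 2)
        if instr_s["landmark"] not in img_s["categories"]:  # landmark absent
            no_target.append({"image": img_s["image"],
                              "instruction": instr_s["instruction"],
                              "target_polygons": []})       # empty label = no-target
    return no_target

toy = [
    {"image": "a.jpg", "instruction": "Stop behind the bus", "landmark": "bus",
     "categories": {"car", "pedestrian"}},
    {"image": "b.jpg", "instruction": "Park next to the car", "landmark": "car",
     "categories": {"car"}},
]
print(make_no_target_samples(toy, 1))
```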

Figure 6: Annotation interface for multi-target samples in the GRiN-Drive benchmark. Annotators were instructed to provide polygons for an arbitrary number of target regions in the given image, corresponding to the navigation instruction.

This resulted in a robust benchmark of over 17,000 samples, providing a rigorous testing ground for the Generalized RNR task.

Experimental Results

How did GENNAV perform? The researchers compared it against state-of-the-art pixel-based methods (like LAVT and TNRSM) and massive MLLMs (GPT-4o, Gemini, Qwen2-VL).

The Metric: msIoU

Standard Intersection over Union (IoU) is a bad metric for this task. If a dataset has many “no-target” samples, a lazy model that always predicts “no-target” would get a perfect score for those samples, inflating its average.

The authors proposed mean stuff IoU (msIoU). This metric normalizes the score so that “no-target” predictions are balanced against segmentation accuracy.

Equations 3 and 4: Definition of msIoU

Quantitative Performance

The results were decisive.

Table 1: Quantitative comparison between GENNAV and the baseline methods on the test sets of the GRiN-Drive benchmark. The best score for each metric is in bold.

Looking at Table 1, GENNAV achieved an msIoU of 46.35%, significantly outperforming the best baseline (TNRSM), which scored 37.90%.

Perhaps most surprisingly, GENNAV crushed the massive commercial models. GPT-4o only achieved 23.41%, and Gemini scored 6.98%. This highlights that while Foundation Models are generalists, specialized architectures like GENNAV are still superior for spatial-geometric tasks requiring precise localization.

Speed is also a major factor. GENNAV runs at 31.31 ms per sample—roughly 30 frames per second. In comparison, GPT-4o takes over 3.5 seconds (3525 ms) per sample, which is far too slow for a moving vehicle.

Qualitative Analysis

Let’s look at the model in action compared to baselines.

Figure 3: Qualitative results of GENNAV and baseline methods on the GRiN-Drive benchmark. Columns (a), (b), (c) and (d) show the prediction by LAVT, TNRSM, Qwen2-VL (bbox) and GENNAV, respectively. The green and red regions indicate the predicted and ground-truth regions, respectively; yellow indicates their overlap.

In row (i), the instruction is “Park my car next to the rack.”

  • LAVT (a) and Qwen2-VL (c) hallucinate regions that don’t make sense.
  • TNRSM (b) predicts “no target” (fails to see the rack).
  • GENNAV (d) correctly identifies the specific parking spot next to the bike rack.

In row (ii), “Pull up right next to pedestrian on the left.”

  • There are multiple pedestrians. GENNAV is the only model that correctly highlights regions next to both pedestrians, handling the multi-target requirement perfectly.

Real-World “Zero-Shot” Testing

Benchmarks are great, but can it drive? The researchers took GENNAV out of the lab. They used four different cars and smartphones in five different urban areas to record video. They fed these videos to GENNAV without any additional training (Zero-Shot transfer).

Figure 4: Qualitative results of GENNAV in the real-world experiment. The colors of the regions are the same as in Fig. 3.

The results held up.

  • Left Image: “Please go near the blue car traveling in the same direction.” GENNAV ignores the oncoming traffic and highlights the lane behind the lead vehicle.
  • Right Image: “Stop to the right of the pedestrian on the left side.” It correctly identifies the safety zones near the pedestrians.

Table 7 confirms that even in the wild, GENNAV maintains a significant lead over baselines.

Table 7: Quantitative comparison between the proposed method and baseline methods in the real-world experiment. The best score for each metric is in bold.

Where Does It Fail?

No model is perfect. The error analysis reveals that GENNAV struggles most with Reduced Visibility Conditions (RVC).

Figure 10: Qualitative analysis of failure cases of the proposed method categorized as Reduced Visibility Conditions.

In the images above, rain, glare, and darkness confuse the model.

  • In (a), the glare on the wet road likely obscures the texture of the “grass,” causing the model to miss the target.
  • In (b), the reflection and darkness make it difficult to distinguish the specific “car in front.”

Additionally, the authors noted that Multimodal Language Understanding (MLU) remains a bottleneck. Sometimes the model sees the object but misunderstands complex phrasing about it.

Conclusion

GENNAV represents a significant step forward for autonomous mobility. By moving away from pixel-heavy computation and explicitly modeling the “existence” of targets, it achieves a balance of speed and accuracy that real-time driving demands.

Key takeaways for students and researchers:

  1. Don’t assume existence: Real-world systems must handle “null” results gracefully.
  2. Polygons > Pixels: For geometric navigation tasks, predicting vertices is far more efficient than segmenting raster images.
  3. Specialization matters: Despite the hype around Large Language Models, specialized architectures that fuse depth, segmentation, and language features still dominate in specific, high-stakes tasks like autonomous driving.

As we move toward Level 5 autonomy, systems like GENNAV that can handle the ambiguity of human language (“Park over by that thing… no, the other thing”) will be essential for creating cars that don’t just drive, but actually cooperate with us.