Imagine you are asking a robot to help you in the kitchen. You say, “Pass me the red mug.” In a perfect world—or at least in the world of classic computer vision benchmarks—there is exactly one red mug on the counter. The robot identifies it, boxes it, and grabs it.
But what if real life happens? What if the dishwasher is empty and there are no red mugs? Or what if you just had a party and there are three red mugs?
Most state-of-the-art Referring Expression Comprehension (REC) models fail spectacularly in these scenarios. They are built on a rigid “one-to-one” assumption: one text query must map to exactly one bounding box. If you ask for a mug that doesn’t exist, the model will hallucinate one. If you refer to three mugs, it will arbitrarily pick just one of them.
This blog post explores a solution to this problem: RECANTFormer. Presented by researchers from Heriot-Watt University and the University of Edinburgh, this paper introduces a streamlined, transformer-based architecture designed for Generalized Referring Expression Comprehension (GREC).
By the end of this article, you will understand how RECANTFormer breaks the single-object barrier, how it uses a “decoder-free” transformer design to learn efficiently, and how it handles the complex math of matching text to varying numbers of objects.
The Problem: From REC to GREC
To appreciate the innovation of RECANTFormer, we first need to understand the limitations of the status quo.
Classic REC: The “One-to-One” Trap
In classic REC, the task is simple: given an image and a sentence, find the specific region the sentence describes. The evaluation metric is usually Top-1 accuracy. The model must output a box, and it is graded on whether that single box overlaps with the single ground truth.
This limits real-world applications significantly. A photo editing tool that can only “brighten the person in the blue shirt” is useless if there are two people in blue shirts, or if the user makes a typo and asks for a “green shirt” when none exists.
Generalized REC (GREC)
The Generalized REC task extends the classic definition to handle varying numbers of targets:
- Zero Targets: The object described does not exist in the image.
- One Target: The classic scenario.
- Multiple Targets: The description applies to a group (e.g., “the three cars on the left”).
This is exponentially harder. If an image contains \(n\) objects, classic REC searches for exactly 1 target, while GREC has to consider all \(2^n\) possible subsets of objects, including the empty set. Furthermore, the model needs to be smart enough to say “I don’t see that” rather than forcing a prediction.
The RECANTFormer Solution
The researchers propose RECANTFormer, a one-stage method that combines a multimodal transformer encoder with a matching loss function inspired by object detection models like DETR.
Unlike many modern architectures that use a heavy Encoder-Decoder structure (which can be slow to train), RECANTFormer is decoder-free. It relies entirely on a powerful encoder to fuse vision and language features.
Architecture Overview
Let’s break down the architecture step-by-step. It consists of four main components:
- Language Stream
- Vision Stream
- Multimodal Fusion Module (The Core)
- Prediction Heads

As shown in Figure 1, the model takes two inputs: the image and the text query (“referring expression”).
1. The Input Streams
- Vision Stream (Purple): The image is fed into a ResNet-50 backbone to extract features. These features are then passed through a 6-layer visual transformer encoder. This is crucial for capturing long-range dependencies—understanding that a pixel in the top left might relate to a pixel in the bottom right (e.g., “the car furthest from the tree”).
- Language Stream (Green): The text is processed by a pre-trained BERT model to extract word embeddings.
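To make this concrete, here is a minimal PyTorch sketch of the two streams. The layer sizes and hyperparameters (hidden width 256, 8 attention heads) are illustrative assumptions rather than the paper’s exact configuration, and positional embeddings are omitted for brevity.

```python
import torch.nn as nn
from torchvision.models import resnet50
from transformers import BertModel

class VisionStream(nn.Module):
    """ResNet-50 backbone followed by a 6-layer transformer encoder (illustrative sizes)."""
    def __init__(self, d_model=256, num_layers=6, nhead=8):
        super().__init__()
        backbone = resnet50(weights="IMAGENET1K_V1")
        # Keep everything up to the last conv stage; drop avgpool and fc.
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])
        self.proj = nn.Conv2d(2048, d_model, kernel_size=1)        # reduce channels
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, images):                                     # (B, 3, H, W)
        feat = self.proj(self.backbone(images))                    # (B, d, h, w)
        tokens = feat.flatten(2).transpose(1, 2)                   # (B, h*w, d)
        return self.encoder(tokens)                                # visual tokens

class LanguageStream(nn.Module):
    """Pre-trained BERT produces one embedding per word-piece token."""
    def __init__(self):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        return out.last_hidden_state                               # (B, L, 768)
```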
2. The Multimodal Fusion Module
This is where the magic happens (the yellow block in Figure 1). The visual features and text embeddings are projected linearly and then fed into a Multimodal Transformer Encoder.
This encoder allows the text and image features to “talk” to each other. The word embeddings attend to specific image patches, and image patches attend to specific words. This cross-modality reasoning is essential for understanding queries like “the man sitting next to the dog.”
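A sketch of this fusion step, under the same illustrative assumptions: both modalities are linearly projected to a shared width, concatenated into one sequence, and passed through a standard transformer encoder so that self-attention runs across words and image patches together.

```python
import torch
import torch.nn as nn

class MultimodalFusion(nn.Module):
    """Project both modalities to a shared width, then fuse them with a
    transformer encoder; sizes are illustrative, not the paper's exact values."""
    def __init__(self, d_model=256, text_dim=768, num_layers=6, nhead=8):
        super().__init__()
        self.vis_proj = nn.Linear(d_model, d_model)
        self.txt_proj = nn.Linear(text_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, visual_tokens, text_tokens):
        # Concatenate projected visual patches and word embeddings into one sequence,
        # so self-attention lets each modality attend to the other.
        fused_input = torch.cat(
            [self.vis_proj(visual_tokens), self.txt_proj(text_tokens)], dim=1
        )                                                          # (B, P + L, d)
        return self.encoder(fused_input)                           # fused tokens
```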
3. Learnable Localization Tokens
This is the paper’s key innovation for handling GREC.
In standard transformers, you often have a [CLS] token that summarizes the whole input. RECANTFormer introduces a set of Learnable Localization Tokens (denoted as \(N_{loc}\)).
Think of these tokens as “detective slots.” If we set \(N_{loc} = 10\), we are giving the model 10 dedicated “detectives” that search the fused image-text representation. Each token is responsible for potentially finding one target.
- If the query refers to 3 objects, 3 tokens should light up with coordinates.
- If the query refers to 0 objects, all tokens should report “invalid.”
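One plausible way to implement these tokens is as a learned parameter matrix that is broadcast across the batch and concatenated with the fused sequence. Exactly where the paper injects them relative to the fusion layers is an implementation detail; this sketch simply prepends them.

```python
import torch
import torch.nn as nn

class LocalizationTokens(nn.Module):
    """N_loc learnable query tokens ("detective slots") joined to the fused
    image-text sequence. N_loc = 10 follows the example in the text above."""
    def __init__(self, n_loc=10, d_model=256):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(n_loc, d_model) * 0.02)

    def forward(self, fused_tokens):                       # (B, P + L, d)
        B = fused_tokens.size(0)
        loc = self.tokens.unsqueeze(0).expand(B, -1, -1)   # (B, N_loc, d)
        return torch.cat([loc, fused_tokens], dim=1)       # (B, N_loc + P + L, d)
```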
4. Prediction Heads
Finally, the output of these localization tokens is fed into two parallel Multi-Layer Perceptrons (MLPs):
- Box Prediction Head: Predicts bounding box coordinates (\(x, y, w, h\)) for every token.
- Validity Prediction Head: Predicts a binary score—is this box a valid target for the text query, or should we ignore it?
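A minimal sketch of the two heads, assuming boxes are predicted in normalized \((x, y, w, h)\) format and validity is a two-class output; the MLP depths and widths are illustrative.

```python
import torch.nn as nn

class PredictionHeads(nn.Module):
    """Two parallel MLPs applied to every localization token's output embedding."""
    def __init__(self, d_model=256, hidden=256):
        super().__init__()
        self.box_head = nn.Sequential(                 # (x, y, w, h) in [0, 1]
            nn.Linear(d_model, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),
        )
        self.validity_head = nn.Linear(d_model, 2)     # logits for {invalid, valid}

    def forward(self, loc_outputs):                    # (B, N_loc, d)
        boxes = self.box_head(loc_outputs).sigmoid()   # (B, N_loc, 4)
        validity_logits = self.validity_head(loc_outputs)   # (B, N_loc, 2)
        return boxes, validity_logits
```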
Training: The Hungarian Matching Game
Training a model to output a variable number of boxes is tricky. If the model predicts 10 boxes, but there are only 2 ground-truth targets, which predicted box corresponds to which target? And which 8 boxes should be penalized for being “hallucinations”?
To solve this, the authors use Hungarian Matching, a technique popularized by the DETR (DEtection TRansformer) object detector.
Step 1: Bipartite Matching
The goal is to find the optimal one-to-one assignment between the predicted boxes and the ground-truth boxes.
Let’s say we have \(N_{loc}\) predictions and a set of ground truth targets. If there are fewer targets than predictions (which is almost always the case), the authors pad the ground truth with “no-object” labels.
They define a cost function that considers how similar a predicted box is to a target box (overlap and position) and how confident the model is. The matching process tries to find a permutation \(\sigma\) that minimizes this cost:
\[\hat{\sigma} = \underset{\sigma \in \mathfrak{S}_{N_{loc}}}{\arg\min} \; \sum_{i=1}^{N_{loc}} \mathcal{L}_{match}\big(y_i, \hat{y}_{\sigma(i)}\big)\]
Here, \(\mathcal{L}_{match}\) calculates the “cost” of assigning a specific prediction to a specific ground truth. The Hungarian algorithm efficiently solves this assignment problem, ensuring the best predictions are matched to the real targets.
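Here is a hedged sketch of how such a matching step can be implemented with SciPy’s `linear_sum_assignment`. The cost weights (`w_l1`, `w_giou`, `w_cls`) are placeholders rather than the paper’s values, and instead of explicitly padding with “no-object” labels the sketch matches only the real targets, which yields the same assignment.

```python
import torch
from scipy.optimize import linear_sum_assignment
from torchvision.ops import generalized_box_iou, box_convert

@torch.no_grad()
def hungarian_match(pred_boxes, valid_probs, gt_boxes,
                    w_l1=5.0, w_giou=2.0, w_cls=1.0):
    """Match N_loc predictions to the ground-truth targets of one sample.

    pred_boxes : (N_loc, 4) as (cx, cy, w, h) in [0, 1]
    valid_probs: (N_loc,)   probability that the token is a valid target
    gt_boxes   : (M, 4)     ground-truth boxes in the same format
    Returns (pred_indices, gt_indices) for the optimal assignment.
    """
    if gt_boxes.numel() == 0:        # no-target sample: nothing to match
        return torch.empty(0, dtype=torch.long), torch.empty(0, dtype=torch.long)

    # Classification cost: prefer tokens that already claim to be valid.
    cost_cls = -valid_probs.unsqueeze(1)                    # (N_loc, 1), broadcast

    # L1 cost between box parameters.
    cost_l1 = torch.cdist(pred_boxes, gt_boxes, p=1)        # (N_loc, M)

    # GIoU cost on corner-format boxes.
    pred_xyxy = box_convert(pred_boxes, "cxcywh", "xyxy")
    gt_xyxy = box_convert(gt_boxes, "cxcywh", "xyxy")
    cost_giou = -generalized_box_iou(pred_xyxy, gt_xyxy)    # (N_loc, M)

    cost = w_cls * cost_cls + w_l1 * cost_l1 + w_giou * cost_giou
    pred_idx, gt_idx = linear_sum_assignment(cost.cpu().numpy())
    return torch.as_tensor(pred_idx), torch.as_tensor(gt_idx)
```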
Step 2: The Loss Function
Once the matching is done (i.e., we know that Predicted Box #3 matches Target A, and Predicted Box #7 matches Target B), the model computes the loss.
The total loss, \(\mathcal{L}_{Hungarian}\), has two parts:
- Classification Loss: Did the validity head correctly predict “valid” or “invalid”?
- Bounding Box Loss: For the valid matches, how accurate were the coordinates?
\[\mathcal{L}_{Hungarian}(y, \hat{y}) = \sum_{i=1}^{N_{loc}} \Big[ -\log \hat{p}_{\hat{\sigma}(i)}(c_i) + \mathbb{1}_{\{c_i \neq \varnothing\}}\, \mathcal{L}_{bbox}\big(b_i, \hat{b}_{\hat{\sigma}(i)}\big) \Big]\]
The bounding box loss itself (\(\mathcal{L}_{bbox}\)) is a combination of L1 loss (absolute error in coordinates) and Generalized IoU (GIoU) loss, which ensures the shapes overlap nicely.
\[\mathcal{L}_{bbox}\big(b_i, \hat{b}_{\hat{\sigma}(i)}\big) = \lambda_{L1}\, \big\lVert b_i - \hat{b}_{\hat{\sigma}(i)} \big\rVert_1 + \lambda_{giou}\, \mathcal{L}_{giou}\big(b_i, \hat{b}_{\hat{\sigma}(i)}\big)\]
This training setup forces the localization tokens to specialize. Over time, they learn that for a query like “two cats,” exactly two tokens should yield high validity scores with accurate boxes, while the rest should output low validity scores.
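Putting the pieces together, here is a sketch of the per-sample loss given the matching above. The loss weights are again placeholders, and torchvision’s `generalized_box_iou_loss` stands in for the GIoU term.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import generalized_box_iou_loss, box_convert

def hungarian_loss(pred_boxes, valid_logits, gt_boxes, pred_idx, gt_idx,
                   w_l1=5.0, w_giou=2.0):
    """Loss for one sample, given the matching from hungarian_match().

    pred_boxes  : (N_loc, 4) as (cx, cy, w, h) in [0, 1]
    valid_logits: (N_loc, 2) logits for {invalid, valid}
    gt_boxes    : (M, 4)
    """
    n_loc = pred_boxes.size(0)

    # Classification target: matched tokens are "valid" (1), the rest "invalid" (0).
    cls_target = torch.zeros(n_loc, dtype=torch.long, device=pred_boxes.device)
    cls_target[pred_idx] = 1
    loss_cls = F.cross_entropy(valid_logits, cls_target)

    if gt_boxes.numel() == 0:          # no-target sample: classification loss only
        return loss_cls

    matched_pred = pred_boxes[pred_idx]
    matched_gt = gt_boxes[gt_idx]
    loss_l1 = F.l1_loss(matched_pred, matched_gt)
    loss_giou = generalized_box_iou_loss(
        box_convert(matched_pred, "cxcywh", "xyxy"),
        box_convert(matched_gt, "cxcywh", "xyxy"),
        reduction="mean",
    )
    return loss_cls + w_l1 * loss_l1 + w_giou * loss_giou
```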
Experiments and Results
Does this architecture actually work? The authors tested RECANTFormer on three GREC datasets: gRefCOCO, Ref-ZOM, and VQD. They also tested it on classic REC datasets to ensure they didn’t break the standard functionality.
Quantitative Performance
The results are summarized in Table 1. The authors compare their model against baselines that use the same amount of data (no massive pre-training) and against massive Multimodal Large Language Models (MLLMs) like KOSMOS-2.

Key Takeaways from Table 1:
- Beating the Specialists: RECANTFormer significantly outperforms models like RESC-Large and VLT on the GREC datasets.
- Beating the Giants: Interestingly, the massive MLLM KOSMOS-2 struggles with GREC. Despite having zero-shot capabilities, it performs poorly (scores in the 20s) compared to RECANTFormer (scores in the 50s). This highlights that generalized grounding is a specialized skill that generalist models don’t automatically possess.
- Efficiency: The model achieves these results with “only” 4000 GPU hours of training, compared to other pre-trained baselines that require orders of magnitude more compute.
Qualitative Success
It’s easier to understand the model’s capability by looking at what it sees. Figure 2 shows the model handling the three core scenarios: 0, 1, and \(N\) targets.

- Zero Targets (Row 1, a): The query asks for a “red airplane.” The model sees planes, but none are red. It correctly predicts no boxes.
- Counting (Row 1, b & c): The model distinguishes between “leftmost three airplanes” and “four flying airplanes.” This proves the tokens are communicating with each other to resolve relative positions.
- Spatial Reasoning (Row 2, e): “Two individuals on the outermost sides.” This is a difficult query requiring global context. The model successfully picks the two people on the edges and ignores the group in the middle.
How Validity Filtering Works
A unique feature of RECANTFormer is the split between the Box Head and the Validity Head. We can actually visualize this process.
In Figure 6 below, the second column (Yellow boxes) shows what the box head wants to predict—essentially every possible candidate. The third column (Green boxes) shows what happens after the Validity Head filters the results.

Look at row (b): “every male individual.” The box head initially detects everyone, including the woman. However, the validity head understands the semantic nuance of “male” and filters out the female subject, leaving only the four men. This two-stage internal mechanism allows for high recall (find everything plausible) followed by high precision (select only what fits the text).
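At inference time, this filtering can be as simple as thresholding the validity probability. The threshold below is an arbitrary illustration, not a value from the paper.

```python
def filter_by_validity(pred_boxes, valid_logits, threshold=0.7):
    """Keep only boxes whose validity probability exceeds a threshold.
    Returning zero boxes is the model's way of saying "no such object".

    pred_boxes  : (N_loc, 4)
    valid_logits: (N_loc, 2)
    """
    valid_prob = valid_logits.softmax(dim=-1)[:, 1]   # probability of "valid"
    keep = valid_prob > threshold                     # 0.7 is illustrative only
    return pred_boxes[keep], valid_prob[keep]
```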
Does the number of tokens (\(N_{loc}\)) matter?
The authors performed an ablation study to see how the number of localization tokens affects performance.

As seen in Table 4, simply adding more tokens doesn’t always help.
- For gRefCOCO, increasing \(N_{loc}\) from 5 to 20 actually hurts performance slightly (57.73 down to 54.27).
- Why? The authors hypothesize that having too many tokens dilutes the loss signal. Since most images have only 1 or 2 targets, if you have 20 tokens, 18 of them are being trained to predict “nothing” (background). This overwhelms the positive training signals from the few active tokens. \(N_{loc}=5\) or \(N_{loc}=10\) seems to be the sweet spot.
Visualizing Attention: What is the model thinking?
One of the benefits of transformer architectures is interpretability through attention maps. We can ask: “When this localization token predicted this box, which parts of the image was it looking at?”

In Figure 7, we see the attention weights for specific tokens.
- Top Left: For the “back half of elephant,” the attention map (heatmap) lights up exactly over the rear of the elephant.
- Top Right: For “giraffe on right,” the attention focuses on the specific giraffe’s body.
This confirms that the localization tokens aren’t just memorizing coordinates; they are semantically attending to the visual features that correspond to the text query.
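If you have access to the encoder’s attention weights, a heatmap like the ones in Figure 7 can be produced with a few lines of matplotlib. The sketch assumes the visual tokens occupy the first `feat_h * feat_w` positions of the key sequence, which depends on how the implementation orders its tokens.

```python
import matplotlib.pyplot as plt

def show_token_attention(attn_weights, token_idx, feat_h, feat_w, image):
    """Visualize where one localization token attends in the image.

    attn_weights: (num_queries, num_keys) attention from the last encoder layer,
                  where the first feat_h * feat_w keys are assumed to be visual tokens.
    image       : HxWx3 numpy array of the original image.
    """
    vis_attn = attn_weights[token_idx, : feat_h * feat_w]
    heatmap = vis_attn.reshape(feat_h, feat_w).detach().cpu().numpy()

    plt.imshow(image)
    # Stretch the coarse attention grid over the full image and blend it on top.
    plt.imshow(heatmap, cmap="jet", alpha=0.5,
               extent=(0, image.shape[1], image.shape[0], 0))
    plt.axis("off")
    plt.show()
```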
Extension: Segmentation (RECANTFormer+)
While the main paper focuses on bounding boxes (comprehension), the authors also demonstrate that their architecture is flexible. They extended the model to perform Generalized Referring Expression Segmentation (GRES)—generating pixel-perfect masks instead of boxes.

By adding a mask prediction head that utilizes the attention maps between localization tokens and visual tokens (as shown in Figure 8), they created “RECANTFormer+.” This extension utilizes a Feature Pyramid Network (FPN) style upsampler to recover high-resolution details required for segmentation.
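As a rough sketch of what such an FPN-style mask head could look like: gate the coarsest backbone features with a token’s attention map, then repeatedly upsample and fuse with higher-resolution features before predicting mask logits. The channel counts and fusion order here are assumptions, not the paper’s exact design.

```python
import torch.nn as nn
import torch.nn.functional as F

class MaskHead(nn.Module):
    """FPN-style upsampler: refine a coarse, attention-gated feature map with
    progressively higher-resolution backbone features (illustrative channels)."""
    def __init__(self, channels=(2048, 1024, 512), d_model=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, d_model, 1) for c in channels)
        self.refine = nn.ModuleList(nn.Conv2d(d_model, d_model, 3, padding=1)
                                    for _ in channels)
        self.predict = nn.Conv2d(d_model, 1, 1)

    def forward(self, attn_map, backbone_feats):
        # attn_map: (B, 1, h, w) attention of one localization token over visual tokens
        # backbone_feats: feature maps, coarsest first, matching `channels`
        x = self.lateral[0](backbone_feats[0]) * attn_map       # gate by attention
        x = self.refine[0](x)
        for lat, ref, feat in zip(self.lateral[1:], self.refine[1:], backbone_feats[1:]):
            x = F.interpolate(x, size=feat.shape[-2:], mode="bilinear",
                              align_corners=False)
            x = ref(x + lat(feat))                              # top-down fusion
        return self.predict(x)                                  # (B, 1, H', W') mask logits
```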
Conclusion and Future Implications
RECANTFormer represents a significant step toward more realistic computer vision systems. By abandoning the “one-to-one” assumption, it moves us closer to robots and AI assistants that can handle the ambiguity and variety of the real world.
Key Takeaways:
- Generalized Capability: The model handles 0, 1, or many targets effectively using a unified architecture.
- Decoder-Free Efficiency: By using an encoder-only transformer, it converges faster and trains more efficiently than DETR-like encoder-decoder models.
- Learnable Tokens: The use of fixed localization tokens combined with Hungarian matching provides a robust way to learn variable-number object detection without complex heuristics.
Limitations: The authors note that the model still struggles with “hard negatives”—samples where the object is almost there but not quite (e.g., a “red mug” when there is a “red bowl”). The performance on no-target samples (N-acc) in complex datasets like gRefCOCO remains lower than in simpler datasets (see Table 5 below), indicating room for improvement in negative mining and discrimination.

Despite this, RECANTFormer offers a blueprint for the next generation of multimodal AI: models that don’t just find objects, but understand the presence, absence, and quantity of the world around them.