Imagine you are shopping online. You see a photo of a woman wearing a stunning floral dress, but you’d prefer it in a solid red color. You can’t just upload the image because the search engine will give you the floral dress again. You can’t just type “red dress” because you’ll lose the specific cut and style of the original image.
This problem—combining a reference image with a textual modification to find a target image—is known as Composed Image Retrieval (CIR). It is one of the most practical yet challenging frontiers in computer vision.
Traditionally, solving this required training massive models on specific datasets of “before and after” image triplets. But what if we could solve this without any specific training? What if we could use the power of modern Multimodal Large Language Models (MLLMs) like GPT-4o to “reason” through the user’s request?
In this post, we are diving deep into OSrCIR (One-Stage reflective Composed Image Retrieval), a novel method proposed in the paper “Reason-before-Retrieve.” This approach introduces a “Reflective Chain-of-Thought” process that allows an AI to explicitly think about visual modifications before executing a search, setting a new state-of-the-art for training-free zero-shot retrieval.
The Core Problem: The “Telephone Game” of Retrieval
To understand why OSrCIR is significant, we first need to understand the limitations of previous approaches.
In Zero-Shot Composed Image Retrieval (ZS-CIR), the goal is to perform this retrieval without fine-tuning the model on CIR datasets. The most common “training-free” strategy has been a Two-Stage process:
- Captioning Stage: An image captioning model converts the reference image into text (e.g., “A photo of a dog”).
- Reasoning Stage: A Large Language Model (LLM) takes that caption and the user’s text (e.g., “make it a cat”) to generate a new target caption (e.g., “A photo of a cat”).
- Retrieval: This target caption is used to search the database.
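To make the two-stage flow concrete, here is a minimal Python sketch. The `caption_image` and `rewrite_caption` helpers are hypothetical stand-ins for an off-the-shelf captioner and an LLM; this illustrates the baseline recipe, not code from any of the cited papers.

```python
# Hypothetical sketch of the two-stage training-free baseline:
# 1) caption the reference image, 2) let an LLM rewrite the caption,
# 3) use the rewritten caption as the retrieval query.

def caption_image(reference_image) -> str:
    """Stand-in for an off-the-shelf captioning model."""
    raise NotImplementedError

def rewrite_caption(caption: str, manipulation_text: str) -> str:
    """Stand-in for an LLM that merges the caption with the user's edit."""
    raise NotImplementedError

def two_stage_query(reference_image, manipulation_text: str) -> str:
    caption = caption_image(reference_image)                       # "A photo of a dog"
    target_caption = rewrite_caption(caption, manipulation_text)   # "A photo of a cat"
    return target_caption                                          # fed to CLIP-based search
```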
This sounds logical, but it suffers from a critical flaw: Information Loss.

As shown in Figure 1 (a) above, the two-stage baseline (CIReVL) fails because of the separation between seeing and reasoning. The user wants to “Remove the human; change background to a blurry human.” However, the initial Image Captioner might generate a generic caption like “A big dog holding a small dog in a home.” It might miss the specific visual details that the user cares about. By the time the LLM sees the text, the visual context of the “human” might be lost or misrepresented.
It’s like a game of telephone. The captioner whispers to the LLM, and the LLM whispers to the search engine. By the end, the nuance is gone.
The authors of this paper propose a solution: One-Stage Reasoning. Instead of separating vision and language, why not feed the image and the text into a Multimodal LLM (MLLM) simultaneously? This allows the model to see the image while reading the instructions, ensuring no visual detail is left behind.
The Solution: OSrCIR and Reflective Chain-of-Thought
The proposed method, OSrCIR, utilizes a unified framework where an MLLM acts as a reasoning engine. However, simply showing an image to GPT-4V and saying “modify this” isn’t enough. MLLMs can hallucinate or misunderstand implicit user intents.
To solve this, the researchers developed a structured reasoning process called Reflective Chain-of-Thought (Reflective CoT).
The Architecture Overview

Figure 2 illustrates the complete pipeline. Let’s break down the flow:
- Input: The system takes a Reference Image (\(I_r\)) and Manipulation Text (\(T_m\)).
- The Brain (\(\Psi_M\)): An MLLM (like GPT-4o) processes both inputs simultaneously.
- The Process: The model executes the Reflective CoT to generate a Target Image Description (\(T_t\)).
- The Retrieval: A standard pre-trained model (CLIP) searches for the image that matches this description.
The mathematical formulation for generating the target description is elegant in its simplicity:
\[
T_t = \Psi_M(I_r, T_m, p_c)
\]
Here, \(p_c\) represents the specific prompt template that enforces the Chain-of-Thought structure.
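As a rough illustration of the one-stage call \(T_t = \Psi_M(I_r, T_m, p_c)\) (not the authors' implementation), the whole reasoning step maps onto a single multimodal chat request. The sketch below assumes the OpenAI Python SDK, a prompt template with a `{manipulation}` placeholder (sketched later in this post), and a deliberately naive way of pulling the final section out of the reply.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_target_description(image_path: str, manipulation_text: str,
                                prompt_template: str) -> str:
    """One-stage reasoning: the MLLM sees the image and the edit request together."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": prompt_template.format(manipulation=manipulation_text)},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    # The prompt asks for four sections; here we simply assume the last one is
    # labeled "Target Image Description:" and take whatever follows that label.
    reply = response.choices[0].message.content
    return reply.split("Target Image Description:")[-1].strip()
```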
Deep Dive: The Reflective Chain-of-Thought
The “Reflective CoT” is the heart of this paper. It forces the model to slow down and think like a human editor. The prompt guides the MLLM through four distinct steps, all within a single inference pass:
1. Original Image Description
First, the model must prove it sees the image correctly. It is prompted to capture all visible objects, attributes, and elements relevant to the manipulation text. This ensures that fine-grained details (like the breed of a dog or the texture of a couch) are retained in the model’s working memory.
2. Thoughts (Intention Understanding)
Next, the model analyzes the user’s intent. User requests in CIR are often implicit. If a user says “change background to a blurry human,” they imply they want the foreground object to stay the same but the setting to shift. In the “Thoughts” step, the model explains its understanding of the manipulation intent and discusses how this intent influences which elements of the original image should be focused on.
3. Reflections
This is the critical quality-control step. MLLMs can sometimes hallucinate or be too literal. In the “Reflections” step, the model filters out incorrect intentions.
- Example: If the text says “remove the human,” a naive model might erase every trace of a person. A reflective model recognizes that the user means the person holding the dog, while the requested blurry human figure should still appear in the background.
- It highlights key decisions made to preserve the coherence of the image.
4. Target Image Description
Finally, based on the previous three steps, the model generates the final text query (\(T_t\)) that will be sent to the search engine. This description contains only the content of the desired target image.
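To make this structure concrete, here is an illustrative prompt template, paraphrased rather than copied from the paper; the fixed section headers give the MLLM an output format that is easy to parse afterwards.

```python
# Illustrative Reflective CoT prompt (paraphrased; not the paper's exact wording).
REFLECTIVE_COT_PROMPT = """\
You are given a reference image and a manipulation request: "{manipulation}"

Answer in exactly four sections:

Original Image Description:
Describe all visible objects, attributes, and elements relevant to the request.

Thoughts:
Explain the user's manipulation intent and which parts of the image it affects.

Reflections:
Re-examine the thoughts, discard misinterpretations, and note the key decisions
needed to keep the target image coherent.

Target Image Description:
Write a single description containing only the content of the desired target image.
"""
```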
Vision-by-Language In-Context Learning
How does the model know how to follow these four steps without being fine-tuned? The authors use a technique called Vision-by-Language In-Context Learning (ICL).
Standard In-Context Learning involves giving the model examples (image + text + answer). However, uploading multiple images as examples increases cost and latency. Instead, the authors provide text-only examples where the “reference image” is described in text. This teaches the MLLM the format of the reasoning process (Description \(\to\) Thoughts \(\to\) Reflections \(\to\) Target) without the computational overhead of processing example images.
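Here is a rough sketch of what such a text-only demonstration might look like (the example content is invented for illustration); it is simply prepended to the Reflective CoT template from the previous sketch before the real query and image are attached.

```python
# Hypothetical text-only in-context example: the "reference image" is
# described in words, so no extra image has to be uploaded.
ICL_EXAMPLE = """\
Example
Reference image (described in text): A golden retriever lying on a blue couch
in a sunlit living room.
Manipulation text: "make it a cat and set it at night"

Original Image Description: A golden retriever lying on a blue couch in a
sunlit living room.
Thoughts: The user wants to swap the dog for a cat and shift the time of day;
the couch and room layout should stay the same.
Reflections: Only the animal and the lighting change; keep the blue couch and
the living-room setting.
Target Image Description: A cat lying on a blue couch in a dimly lit living
room at night.
"""

def build_prompt(manipulation_text: str) -> str:
    # The demonstration teaches the output format; the real reference image is
    # attached separately as a multimodal input.
    return ICL_EXAMPLE + "\n" + REFLECTIVE_COT_PROMPT.format(manipulation=manipulation_text)
```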
The Retrieval Mechanism
Once the MLLM has reasoned through the problem and produced the Target Image Description (\(T_t\)), the “Reason-before-Retrieve” phase is complete. Now comes the retrieval.
The system uses CLIP (Contrastive Language-Image Pre-Training), a model that maps images and text into a shared mathematical space.
First, the Target Image Description (\(T_t\)) is converted into a normalized text embedding (\(\hat{e}_{T}\)) using CLIP’s text encoder (\(\Psi_T\)):

\[
\hat{e}_{T} = \frac{\Psi_T(T_t)}{\lVert \Psi_T(T_t) \rVert}
\]

Simultaneously, every candidate image \(I_i\) in the database is converted into a normalized image embedding (\(\hat{e}_{I_{i}}\)) using CLIP’s image encoder (\(\Psi_I\)):

\[
\hat{e}_{I_{i}} = \frac{\Psi_I(I_i)}{\lVert \Psi_I(I_i) \rVert}
\]

The final retrieval is a simple maximization of cosine similarity: the system returns the candidate image whose embedding is closest to the target description:

\[
\arg\max_{I_i} \; \hat{e}_{T} \cdot \hat{e}_{I_{i}}
\]
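Below is a minimal sketch of this retrieval step, assuming the Hugging Face `transformers` implementation of CLIP and a list of PIL candidate images; batching, indexing, and model choice are all simplified.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

@torch.no_grad()
def retrieve(target_description: str, candidate_images, top_k: int = 5):
    """Rank candidate images by cosine similarity to the target description."""
    text_inputs = processor(text=[target_description], return_tensors="pt",
                            padding=True, truncation=True)
    e_t = model.get_text_features(**text_inputs)
    e_t = e_t / e_t.norm(dim=-1, keepdim=True)        # normalized \hat{e}_T

    image_inputs = processor(images=candidate_images, return_tensors="pt")
    e_i = model.get_image_features(**image_inputs)
    e_i = e_i / e_i.norm(dim=-1, keepdim=True)        # normalized \hat{e}_{I_i}

    sims = (e_t @ e_i.T).squeeze(0)                   # cosine similarities
    return sims.topk(min(top_k, sims.numel())).indices.tolist()
```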
This modularity is a massive advantage. You can swap out the MLLM for a better one, or the retrieval model (CLIP) for a different one (like BLIP or OpenCLIP), without retraining the whole system.
Experimental Results
The researchers evaluated OSrCIR across four standard datasets: CIRR and CIRCO (object manipulation), GeneCIS (object/attribute composition), and FashionIQ (attribute manipulation).
Quantitative Performance
The results show that OSrCIR significantly outperforms existing methods.

Looking at Table 1 (in the image above), let’s focus on the ViT-L/14 backbone on the CIRCO dataset.
- CIReVL (The Baseline): Achieved an mAP@5 of 18.57%.
- OSrCIR (Ours): Achieved 23.87%.
That is a jump of more than five percentage points (23.87 vs. 18.57) in a field where improvements are usually measured in fractions of a point. Even when compared to textual inversion methods (which require training), OSrCIR often performs better simply by being smarter about how it constructs the query.
We see similar dominance in the fashion domain:

In Table 3 (above), specifically for FashionIQ, OSrCIR consistently outperforms the baselines across “Shirt,” “Dress,” and “Toptee” categories. The “Reflective” capability helps the model understand nuanced fashion requests like “make it darker with a floral pattern” without losing the structural details of the garment.
Qualitative Analysis: Seeing the Difference
Numbers are great, but in visual tasks, seeing is believing.
The bottom half of Figure 3 compares the object manipulation results.
- Row 1: The user asks for “Chihuahuas” in a “poster” style. The baseline (CIReVL) retrieves generic dogs. OSrCIR correctly retrieves a vibrant poster with Chihuahuas.
- Row 2: The user asks for a swimming dog. OSrCIR nails the specific action and setting.
Similarly, Figure 5 shows the “Thoughts” process in action: before searching, the model explicitly writes out its intent, e.g., “I need to change the context from bedroom to living room.”

Revisiting Figure 2 (Right side), notice the “Thoughts” section. The model explicitly writes: "…Changing the background implies the human presence is now via the blurry figure…" This level of explicit reasoning prevents the model from searching for a clear human when the user asked for a blurry one.
Ablation Studies: Do we really need all these steps?
You might wonder, “Is the ‘Reflection’ step actually necessary? Can’t we just ask for the description?”
The authors tested this by removing parts of the prompt.

Table 4 reveals the importance of every component:
- w/o One-stage reasoning: Drops performance by ~2%. This proves that seeing the image directly is better than reading a caption of it.
- w/o Reflections: Performance drops. Without reflecting, the model accepts its first (potentially flawed) thought.
- w/o Thoughts: Performance drops significantly. If the model doesn’t try to “understand” the intent, it acts like a basic keyword matcher.
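As a rough idea of how such prompt ablations can be constructed (this is not the authors' exact protocol), one can drop a single section from the template and re-run the otherwise unchanged pipeline:

```python
# Hypothetical ablation: remove one reasoning section from the prompt at a time
# and re-run retrieval with everything else held fixed.
FULL_SECTIONS = ["Original Image Description", "Thoughts",
                 "Reflections", "Target Image Description"]

def ablated_prompt(drop: str, manipulation_text: str) -> str:
    kept = [s for s in FULL_SECTIONS if s != drop]
    body = "\n\n".join(f"{s}:\n..." for s in kept)  # per-section instructions elided
    return (f'Manipulation request: "{manipulation_text}"\n\n'
            f"Answer in exactly {len(kept)} sections:\n\n{body}")

# e.g. ablated_prompt("Reflections", "remove the human") for the "w/o Reflections" row
```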
Failure Cases: When Reasoning Fails
No model is perfect. The authors provide an honest look at where OSrCIR struggles.

Figure 6 highlights two main failure modes:
- Reasoning Terms: The retrieval model (CLIP) sometimes struggles with comparative language like “less flowy” or “darker.” While the MLLM understands this, CLIP might not grasp the relative difference.
- Domain Misalignment: The MLLM might use fancy fashion terms (“Hawaiian style”) that the retrieval model (CLIP) doesn’t strongly associate with the visual features, leading to mismatches.
This suggests that while the Reasoning (MLLM) is strong, the Retrieval (CLIP) is currently the bottleneck.
Conclusion: The Future is Reflective
The “Reason-before-Retrieve” paper presents a compelling argument for the future of computer vision systems. By moving from a pipeline of disconnected tools (captioner \(\to\) LLM \(\to\) search) to a unified, reflective system (MLLM \(\to\) search), we gain significant accuracy and flexibility.
Key Takeaways:
- One-Stage is Superior: Keeping the image and text together during the reasoning phase prevents information loss.
- Reflection Matters: Forcing an AI to “think” and “reflect” before answering significantly improves the quality of the output.
- Training-Free Viability: We don’t always need expensive training on niche datasets. A smart prompt with a powerful general-purpose model can beat specialized trained models.
OSrCIR sets a new standard for Zero-Shot Composed Image Retrieval. As MLLMs continue to get faster and smarter, and as retrieval models like CLIP improve their understanding of nuanced text, we are moving closer to a world where you can talk to a search engine as naturally as you would to a human shop assistant.