Introduction
Imagine you are shopping online for a shirt. You find a photo of a shirt that has the perfect cut and fabric, but it’s blue, and you really wanted it in grey. In a standard text search, describing exactly what you want is difficult (“shirt like this but grey”). This scenario is where Composed Image Retrieval (CIR) shines.
CIR allows users to search using a combination of a reference image (the blue shirt) and a text instruction (“change blue to grey”). Ideally, the system understands that it should keep the shape and fabric from the image but swap the color based on the text.
However, there is a fundamental problem lurking in this process: Compositional Conflict.
The reference image contains strong visual signals (Blue! Long sleeves!). The text instruction provides contradictory signals (Grey! Short sleeves!). When a neural network tries to process both simultaneously, these signals clash. Does the model retrieve a blue shirt? A grey shirt? Or a muddy confusion of both?
In this post, we will dive deep into a CVPR paper titled “CCIN: Compositional Conflict Identification and Neutralization for Composed Image Retrieval.” This research proposes a novel framework that doesn’t just smash image and text together. Instead, it uses Large Language Models (LLMs) to intelligently identify exactly which parts of the image conflict with the text and “neutralize” them before performing the search.

As shown in Figure 1, the reference image is a blue, long-sleeve shirt. The text asks for “grey” and “short sleeve.” Traditional methods struggle here because the visual features of “blue” and “long sleeve” are still present in the reference embedding. The CCIN framework aims to solve this by sequentially identifying and removing these conflicts.
Background: The Challenge of Composed Image Retrieval
To understand the innovation of CCIN, we first need to understand how CIR usually works.
In standard approaches, a model (often based on Vision-Language Pre-training, or VLP) extracts features from the reference image and features from the text. It then fuses them—using mechanisms like concatenation, attention, or gating—to create a single “query representation.” This query is then used to search a database for the target image.
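To make this concrete, here is a minimal PyTorch sketch of that generic pipeline. This is a baseline pattern, not the paper's architecture: the linear encoders stand in for a real VLP backbone, and all dimensions are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NaiveFusionCIR(nn.Module):
    """Generic CIR baseline: encode image and text, fuse them into one query,
    and score it against a gallery of candidate image embeddings."""

    def __init__(self, dim=256):
        super().__init__()
        # Stand-ins for a real VLP backbone (e.g. CLIP/BLIP features).
        self.image_encoder = nn.Linear(2048, dim)
        self.text_encoder = nn.Linear(768, dim)
        self.fusion = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, image_feats, text_feats, candidate_feats):
        # candidate_feats: (N, dim) pre-computed gallery embeddings in the same space.
        f_img = self.image_encoder(image_feats)                   # (B, dim)
        f_txt = self.text_encoder(text_feats)                     # (B, dim)
        query = self.fusion(torch.cat([f_img, f_txt], dim=-1))    # fused query representation
        query = F.normalize(query, dim=-1)
        gallery = F.normalize(candidate_feats, dim=-1)
        return query @ gallery.T                                  # (B, N) similarity scores
```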
The “Conflict” Problem
The issue arises when the text explicitly contradicts the image.
- Reference: A dog sitting.
- Text: “Make the dog stand up.”
If the model blindly fuses the features, the resulting representation contains “sitting” (from the image) and “standing” (from the text). In the high-dimensional feature space, this hybrid representation might drift away from the true target (a standing dog) and settle somewhere in the middle, or prioritize the strong visual signal of the sitting dog, ignoring the text entirely.
Previous works have attempted to solve this by using “learnable masks” on the image features—essentially trying to dim the parts of the image that might be irrelevant. However, these methods operate in a “black box” fashion. They don’t explicitly know what the conflict is (e.g., “sitting vs. standing”); they just try to suppress noise. This often leads to imprecise results.
The Core Method: CCIN
The researchers propose a framework called Compositional Conflict Identification and Neutralization (CCIN). The philosophy here is simple but powerful: You cannot fix a conflict you haven’t identified.
The framework operates in two distinct stages:
- Identify: Use an LLM to analyze the image caption and the instruction to pinpoint exactly what attributes are in conflict.
- Neutralize: Use a dual-instruction mechanism to extract visual features that explicitly exclude the conflicting attributes while keeping everything else.
Let’s break down the architecture.

As illustrated in Figure 2, the architecture consists of the Compositional Conflict Identification (CCI) module (top half) and the Compositional Conflict Neutralization (CCN) module (bottom half).
Module 1: Compositional Conflict Identification (CCI)
The first challenge is figuring out what needs to change. Visual features are abstract numbers; they don’t tell us “this is the color blue.” Text, however, is semantic.
The CCI module bridges this gap using the following steps:
- Captioning: An image captioning model (like BLIP-2) generates a detailed description of the reference image (\(T_{ref}\)).
- Synthesis: The system combines the reference caption with the user’s modified instruction (\(T_{mod}\)) using the conjunction “However.”
- Example: “A blue long-sleeve shirt.” + “However” + “make it grey and short sleeve.”
- LLM Analysis: This combined text is fed into a Large Language Model (like GPT-4) with a specific prompt (\(P_{con}\)) asking it to identify the conflicting attributes (\(T_{con}\)).
The process can be formalized as:

By converting the visual content into text, the system leverages the massive reasoning capabilities of LLMs to perform logic checks. The LLM outputs the specific attributes that are causing trouble (e.g., “shirt color,” “sleeve length”).
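A minimal Python sketch of this identification step might look like the following, assuming a generic `llm` callable that maps a prompt string to a text completion (the exact prompt \(P_{con}\) used in the paper is not reproduced here, so the wording is illustrative):

```python
def identify_conflicts(reference_caption: str, modification_text: str, llm) -> str:
    """Sketch of the CCI step: join the caption and the instruction with 'However'
    and ask an LLM to name the attributes that contradict each other.
    `llm` is any prompt-to-text callable (e.g. a GPT-4-class model)."""
    combined = f"{reference_caption} However, {modification_text}"
    prompt = (
        "The first sentence describes an image; the clause after 'However' describes "
        "how the user wants to change it. List only the attributes of the image that "
        "conflict with the requested change.\n\n"
        f"{combined}"
    )
    return llm(prompt)

# Example (hypothetical output):
# identify_conflicts("A blue long-sleeve shirt.", "make it grey and short sleeve.", llm)
# -> "shirt color, sleeve length"
```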

Figure 6 shows this in action. Look at the top example: The reference is a tie-dye shirt. The text asks for a black band t-shirt. The CCI module correctly identifies that the “pattern” and “graphic” are the conflicting attributes.
Module 2: Compositional Conflict Neutralization (CCN)
Once the conflicts are identified, how do we stop the model from seeing them in the image? This is where the CCN module comes in. It uses a strategy involving Dual Instructions.
Step A: Generating the “Kept Instruction”
First, the system asks the LLM to rewrite the original reference caption by removing the conflicting attributes identified in the previous step. This creates a new text called the Kept Instruction (\(T_{kept}\)).

- Original Caption: “A blue long-sleeve shirt with a collar.”
- Conflict: “Color,” “Sleeve length.”
- Kept Instruction: “A shirt with a collar.”
This \(T_{kept}\) is crucial. It describes exactly what we want to preserve from the reference image.
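A small sketch of this rewriting step, reusing the same hypothetical `llm` callable from the previous snippet (again, the prompt wording is illustrative rather than the paper's exact prompt):

```python
def generate_kept_instruction(reference_caption: str, conflicts: str, llm) -> str:
    """Sketch of Step A: ask the LLM to rewrite the caption with the
    conflicting attributes removed, keeping everything else."""
    prompt = (
        f"Rewrite this caption, removing any mention of the following attributes "
        f"({conflicts}) while keeping everything else unchanged: {reference_caption}"
    )
    return llm(prompt)

# generate_kept_instruction("A blue long-sleeve shirt with a collar.",
#                           "color, sleeve length", llm)
# -> "A shirt with a collar."
```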
Step B: Dual-Path Feature Extraction
The researchers utilize a Q-Former (a component from the BLIP-2 architecture) to extract features from the reference image. However, they do something unique here. They run the Q-Former twice, in parallel, with different text guidance:
- Path 1 (Preservation): The Q-Former looks at the reference image guided by the Kept Instruction (\(T_{kept}\)). This extracts visual features of the non-conflicting parts (e.g., the collar, the fabric texture).
- Path 2 (Modification): The Q-Former looks at the reference image guided by the Modified Instruction (\(T_{mod}\)). This helps the model align visual features with the requested changes.

Here, \(\mathcal{F}_Q\) represents the Q-Former. The result is two sets of visual features: \(f_{kept}\) (what stays) and \(f_{mod}\) (what changes).
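In code, the dual-path extraction boils down to calling the same text-conditioned Q-Former twice. The sketch below assumes a hypothetical `q_former` callable that returns query-token features conditioned on both the image embeddings and the guiding text:

```python
def dual_path_extraction(q_former, image_embeds, kept_instruction, modified_instruction):
    """Sketch of Step B: run one shared, text-conditioned Q-Former over the
    reference image's patch embeddings twice, once per instruction.
    `q_former` is a stand-in for a BLIP-2-style Q-Former."""
    f_kept = q_former(image_embeds, text=kept_instruction)     # features to preserve
    f_mod = q_former(image_embeds, text=modified_instruction)  # features aligned with the change
    return f_kept, f_mod
```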
Step C: Adaptive Fusion
Now the system has separated the “good” visual traits from the “bad” ones. It needs to recombine them. It uses a learnable gating mechanism (an MLP with a Sigmoid activation) to weigh how much of the kept features versus the modified features to use.

Finally, these neutralized visual features (\(f_{neu}\)) are fused with the text features (\(f_{t}\)) of the modified instruction to create the final query representation.

This final vector, \(f_{query}\), represents the “Platonic ideal” of the target image: it contains the visual structure of the original image (minus the conflicts) plus the semantic meaning of the new instruction.
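A minimal PyTorch sketch of this gating and final fusion, assuming the feature dimensions are already aligned (layer sizes and shapes are illustrative, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    """Sketch of Step C: a sigmoid gate weighs the kept features against the
    modified features, and the neutralized result is fused with the text
    features of the modification instruction to form the query."""

    def __init__(self, dim=256):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.query_proj = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, f_kept, f_mod, f_text):
        g = self.gate(torch.cat([f_kept, f_mod], dim=-1))   # per-dimension weights in (0, 1)
        f_neu = g * f_kept + (1.0 - g) * f_mod               # neutralized visual features
        f_query = self.query_proj(torch.cat([f_neu, f_text], dim=-1))
        return f_query
```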
Training Objectives
To train this complex architecture, the researchers use a composite loss function consisting of three parts:

- \(\mathcal{L}_{ITC}\) (Image-Text Contrastive Loss): The standard loss that pulls the query representation close to the target image and pushes it away from non-target images.
- \(\mathcal{L}_{WRT}\) (Weighted Regularization Triplet Loss): This uses relative distances to ensure the positive pair (query + correct target) is significantly closer than negative pairs.
- \(\mathcal{L}_{OPR}\) (Orthogonal Projection Regularization): This is a clever addition. It forces the “conflict” features to be mathematically orthogonal (perpendicular) to the target features. If two vectors are orthogonal, their dot product is zero, so the conflicting attributes contribute nothing to the similarity with the target. This explicitly pushes the model to “forget” the conflicting information (one plausible form is sketched below).
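Here is one plausible form of such an orthogonality penalty, written as a PyTorch function; the paper's exact formulation may differ:

```python
import torch
import torch.nn.functional as F

def orthogonal_projection_loss(f_conflict: torch.Tensor, f_target: torch.Tensor) -> torch.Tensor:
    """Penalize the squared cosine similarity between conflict features and
    target features; the penalty is zero exactly when the two are orthogonal."""
    cos = F.cosine_similarity(f_conflict, f_target, dim=-1)  # (B,)
    return (cos ** 2).mean()
```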
Experiments and Results
The researchers evaluated CCIN on three standard datasets: FashionIQ (fashion images), CIRR (real-life images), and Shoes (footwear).
Quantitative Performance
The results are compelling. As seen in Table 1 below, CCIN outperforms state-of-the-art methods, including SPRC and TG-CIR.

For example, on the FashionIQ dataset (Dress category), CCIN achieves a Recall@10 of 49.38%, surpassing the previous best of 48.83%. While the margins might seem small in percentages, in the world of dense retrieval, these are significant gains, especially given the difficulty of the task.
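For readers less familiar with the metric, Recall@K is simply the fraction of queries whose ground-truth target appears among the top K retrieved images:

```python
import torch

def recall_at_k(similarities: torch.Tensor, target_indices: torch.Tensor, k: int = 10) -> float:
    """similarities: (num_queries, num_gallery) score matrix;
    target_indices: (num_queries,) index of each query's ground-truth image."""
    topk = similarities.topk(k, dim=-1).indices                  # (num_queries, k)
    hits = (topk == target_indices.unsqueeze(-1)).any(dim=-1)    # did the target make the top K?
    return hits.float().mean().item()
```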
Similarly, on the Shoes dataset (Table 2), the method shows consistent superiority.

Qualitative Analysis
Numbers are great, but visual results tell the real story. Let’s look at a comparison between CCIN and a competitor, SPRC.

Look at the bottom example in Figure 4 involving the dog.
- Reference: A dog standing up.
- Instruction: “Make the dog lie down and look to right.”
- Result: The competitor (SPRC) retrieves dogs that are still standing or in ambiguous poses. Why? Because the “standing” feature from the reference image wasn’t neutralized. CCIN, however, successfully retrieves a dog that is lying down. It successfully identified that “posture” was a conflict and neutralized the “standing” visual cue.
Does the LLM matter?
The CCIN framework relies heavily on the LLM to identify conflicts. Does it matter which LLM you use? The researchers tested Llama2-70B, GPT-3.5-Turbo, and GPT-4.

As shown in Figure 7 (Row 1), when asked to change a t-shirt to “long sleeves,” GPT-3.5 failed to identify the sleeve length conflict, and Llama2 misidentified the intention. GPT-4 (in purple) was the most robust in accurately spotting the subtle logical contradictions between image captions and user instructions. This highlights that as LLMs get smarter, this retrieval framework will likely get even better.
Ablation Studies
The team also performed ablation studies to prove that every part of the system is necessary.

Table 3 shows that adding the specific loss functions (\(\mathcal{L}_{WRT}\) and \(\mathcal{L}_{OPR}\)) incrementally improves performance.
Furthermore, Table 4 (below) validates the Dual-Instruction strategy. Using only the Kept Instruction or only the Modified Instruction performs significantly worse than using both. You need the “Kept” path to maintain identity and the “Modified” path to incorporate changes.

Conclusion and Implications
The CCIN paper represents a shift in how we think about multi-modal tasks. Rather than treating neural networks as black boxes that magically fuse data, this approach introduces a structured, logical step: Identify the problem first.
By explicitly identifying compositional conflicts using the reasoning powers of LLMs, the model can perform “surgery” on the feature vectors—removing the specific attributes that contradict the user’s intent while preserving the rest of the image’s rich visual detail.
Key Takeaways:
- Conflict is the enemy: In Composed Image Retrieval, the biggest hurdle is the clash between what the image shows and what the text requests.
- LLMs as Reasoners: We can use LLMs not just for chat, but as logical processors to guide computer vision tasks.
- Explicit Neutralization: Targeted removal of conflicting features works better than general “masking” or “fusion.”
As we move toward more interactive AI systems—where we converse with our computers to edit and find content—methods like CCIN that understand the nuance of “keep this, but change that” will become the standard.