Imagine you have a digital photo album containing thousands of images. You want to find a picture of your specific pet, “Fido,” catching a frisbee. You type “Fido catching a frisbee” into the search bar.
Vision-Language Models (VLMs) like CLIP differ from standard object detectors in that they understand open-ended text. However, they have a major limitation: they know what a dog looks like, but they don’t know what your dog, Fido, looks like.
To solve this, we need Personalized Vision-Language Retrieval (PerVL). This task involves teaching a pre-trained model a new, specific concept from just a few reference images (usually 3 to 5) so that it can recognize that concept in various contexts.
The challenge is a classic dilemma in machine learning: plasticity vs. stability.
- Plasticity: The model must be flexible enough to learn the new concept (“Fido”).
- Stability: The model must not forget its general knowledge (what “catching a frisbee” looks like).
If the model focuses too much on Fido, it might retrieve every picture of Fido, ignoring the “frisbee” part of the query. If it focuses too much on general knowledge, it might retrieve random dogs catching frisbees.
In this post, we will dive deep into a research paper that introduces POLAR (PersOnalized Low-rank Adaptation for Retrieval). This method proposes a surgical approach: instead of jamming new words into the model’s dictionary, it gently tweaks the model’s internal brain wiring using a regularized, low-rank update.
The Status Quo: Textual Inversion
Before understanding POLAR, we need to understand the prevailing method it attempts to dethrone: Textual Inversion.
In approaches like PALAVRA or Textual Inversion, the researchers don’t touch the internal weights of the neural network. Instead, they freeze the model and “learn” a new word vector. They find a pseudo-word token (often denoted as \(V^*\)) that, when fed into the text encoder, produces an embedding close to the images of the personal concept.
Basically, they are trying to find a magic word that means “Fido” to the model.

As shown in Figure 1 (Left), textual inversion inserts a learned pseudo-token into the query. The prompt becomes “A photo of [purple vector].”
The Problem:
- Interference: This learned token interacts with every layer of the text encoder. It can dominate the sentence, causing the model to ignore the context (“catching a frisbee”).
- Optimization Speed: Finding this token requires backpropagating through the entire text encoder, which can be slow.
- Storage: These tokens can be high-dimensional and hard to optimize without large-scale pre-training.
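To make the contrast concrete, here is a rough PyTorch sketch of what textual-inversion training looks like. Everything in it is illustrative: `text_encoder`, `prompt_embeds`, `slot`, and `image_embeds` are hypothetical stand-ins, not PALAVRA’s actual code.

```python
import torch
import torch.nn.functional as F

# Hypothetical setup: `text_encoder` is the frozen CLIP text encoder run on
# token *embeddings*, `prompt_embeds` is the embedded prompt
# "A photo of <slot>", `slot` is the pseudo-token position, and
# `image_embeds` holds the concept images' embeddings.
v_star = torch.randn(1, 512, requires_grad=True)  # 512 = CLIP ViT-B text width
optimizer = torch.optim.Adam([v_star], lr=1e-3)

for step in range(500):
    embeds = prompt_embeds.clone()
    embeds[:, slot] = v_star                    # splice the "magic word" in
    text_embed = text_encoder(embeds)           # backprop through ALL layers
    loss = F.mse_loss(text_embed.expand_as(image_embeds), image_embeds)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```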
The POLAR Method: Internal Parameter Updates
The researchers behind POLAR asked a different question: What if, instead of finding a magic word, we just slightly adjusted how the model thinks?
They drew inspiration from Low-Rank Adaptation (LoRA), a technique popular in fine-tuning Large Language Models (LLMs) and diffusion models. POLAR applies a rank-1 parameter update to the text encoder.
1. The Architecture
POLAR focuses on the CLIP architecture, which consists of an image encoder and a text encoder. The goal is to maximize the similarity between the text embedding of a query (containing the personal concept) and the image embedding of the target image.

As illustrated in Figure 2, POLAR freezes the Image Encoder entirely. It also freezes almost all of the Text Encoder. The only thing that changes is a specific set of weights in the final layer of the text encoder.
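Concretely, retrieval in this setup means embedding the query with the (lightly adapted) text encoder and ranking the gallery by cosine similarity. Below is a minimal sketch using the Hugging Face `transformers` CLIP API; the `retrieve` helper is our own, and `image_embeds` is assumed to be precomputed with the frozen image encoder:

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def retrieve(query: str, image_embeds: torch.Tensor, top_k: int = 5):
    """Rank precomputed gallery image embeddings against a text query."""
    tokens = processor(text=[query], return_tensors="pt", padding=True)
    with torch.no_grad():
        text = model.get_text_features(**tokens)
    text = text / text.norm(dim=-1, keepdim=True)
    gallery = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
    sims = gallery @ text.squeeze(0)        # cosine similarity per image
    return sims.topk(top_k).indices         # best-matching gallery indices
```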
2. The Math of Low-Rank Adaptation (LoRA)
Standard fine-tuning involves updating a weight matrix \(W\).
\[ W_{new} = W + \Delta W \]
In deep learning, \(W\) is huge. Updating the whole thing on a handful of images leads to massive overfitting and catastrophic forgetting. LoRA assumes that the change in weights (\(\Delta W\)) doesn’t need to be full-rank: it can be approximated by the product of two very small matrices, \(B\) and \(A\):
\[ h = Wx + \Delta W x = Wx + BAx \]
Here:
- \(W\): The frozen pre-trained weights.
- \(x\): The input to the layer.
- \(A\): A matrix that projects the input down to a low rank (\(r\)).
- \(B\): A matrix that projects it back up to the original dimension.
POLAR takes this to the extreme: it sets the rank \(r=1\), so \(A\) and \(B\) collapse into a single row vector and a single column vector. This minimizes the number of learned parameters, acting as a bottleneck that forces the model to capture only the most essential features of the personal concept.
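Here is a minimal PyTorch sketch of such a rank-1 adapter wrapped around a frozen linear layer (our own module, not the authors’ code):

```python
import torch
import torch.nn as nn

class Rank1LoRALinear(nn.Module):
    """A frozen nn.Linear plus a learnable rank-1 update: h = Wx + B(Ax)."""

    def __init__(self, base: nn.Linear):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False           # pre-trained weights stay frozen

        d_in, d_out = base.in_features, base.out_features
        # A (1 x d_in) "detects" the concept in the incoming features.
        self.A = nn.Parameter(torch.randn(1, d_in) / d_in ** 0.5)
        # B (d_out x 1) writes the update; zero-init so the wrapped layer
        # starts out exactly equal to the pre-trained one.
        self.B = nn.Parameter(torch.zeros(d_out, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.A.t()) @ self.B.t()
```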
3. Surgical Precision: The Value Transform
A Transformer block consists of Attention mechanisms and Feed-Forward networks. Inside the Attention mechanism, inputs are transformed into Queries (\(Q\)), Keys (\(K\)), and Values (\(V\)).
The researchers performed ablations (tests to see which component matters most) and found that updating the Value (\(V\)) matrix in the final layer (Layer 12) was the optimal strategy.

Why the final layer? Early layers process syntax and basic word relationships. The final layer aggregates this information into a high-level semantic representation. By tweaking the Value matrix here, POLAR injects the concept of “Fido” right before the model makes its final decision, minimizing the disruption to the rest of the sentence processing.
The update to the value projection of the final layer looks like this:
\[ W_V' = W_V + BA, \qquad A \in \mathbb{R}^{1 \times d},\; B \in \mathbb{R}^{d \times 1} \]
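In the Hugging Face CLIP implementation, attaching this adapter to the final layer’s value projection might look like the following, reusing the `model` and `Rank1LoRALinear` from the sketches above:

```python
# Freeze the entire model: image encoder and text encoder alike.
for p in model.parameters():
    p.requires_grad = False

# Swap the rank-1 adapter into the value projection of the final
# (12th) text-encoder layer; every other weight stays untouched.
last_layer = model.text_model.encoder.layers[-1]
last_layer.self_attn.v_proj = Rank1LoRALinear(last_layer.self_attn.v_proj)
adapter = last_layer.self_attn.v_proj   # only adapter.A / adapter.B will train
```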
4. Training the Concept
To train the model, the user provides \(\sim 5\) images of the concept (\(I^c\)) and a placeholder text like “An image of sks” (\(q^c\)). The model minimizes the Mean Squared Error (MSE) between the text embedding and the image embeddings:
\[ \mathcal{L}_{MSE} = \frac{1}{|I^c|} \sum_{i \in I^c} \big\| E_T(q^c) - E_I(i) \big\|_2^2 \]
where \(E_T\) and \(E_I\) are the text and image encoders.
This standard loss ensures the model recognizes “sks” as the specific visual concept provided.
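A condensed training loop under these definitions might look like this; `reference_pixels` (the ~5 preprocessed concept images) is a placeholder, and the step count and learning rate are illustrative rather than the paper’s settings:

```python
import torch
import torch.nn.functional as F

tokens = processor(text=["An image of sks"], return_tensors="pt", padding=True)

# Embed the ~5 reference images once with the frozen image encoder.
with torch.no_grad():
    image_embeds = model.get_image_features(pixel_values=reference_pixels)

optimizer = torch.optim.Adam([adapter.A, adapter.B], lr=1e-3)

for step in range(200):                      # a few hundred steps per concept
    text_embed = model.get_text_features(**tokens)   # shape (1, d)
    loss = F.mse_loss(text_embed.expand_as(image_embeds), image_embeds)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```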
The Secret Sauce: Regularization
If you only use the MSE loss above, the model will learn “Fido” perfectly, but it will likely forget everything else. It might start thinking every dog is Fido, or it might lose the ability to understand contexts like “sitting on a couch.”
To prevent this catastrophic forgetting, POLAR introduces a clever regularization strategy directly on the LoRA matrices:
\[ \mathcal{L} = \mathcal{L}_{MSE} + \lambda \|B\|^2, \quad \text{subject to } \|A\| = 1 \]
The regularization has two parts:
- Constraining \(A\): The vector \(A\) is constrained to have unit norm (\(\|A\| = 1\)).
  - Interpretation: \(A\) acts as a “detector.” It looks at the incoming data \(x\) and decides how relevant the personal concept is. By fixing its scale, we force it to focus on direction (similarity) rather than magnitude.
- Penalizing \(B\): The loss includes an L2 penalty on \(B\) (\(\|B\|^2\)).
  - Interpretation: \(B\) determines the “strength” of the update. By penalizing \(B\), the model is encouraged to keep the update vector as close to zero as possible. It only increases the weights if absolutely necessary to minimize the MSE loss.
This combination ensures that the model defaults to its pre-trained general knowledge (\(B \approx 0\)) unless the input strongly triggers the personal concept detector (\(A\)).
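In code, both constraints add only a few lines to the loop above. This is our sketch: the constraint on \(A\) is enforced here by re-normalizing after each step, and `lam` is a hypothetical regularization weight:

```python
lam = 0.1                                 # hypothetical regularization weight

for step in range(200):
    # Enforce ||A|| = 1 by re-projecting onto the unit sphere each step.
    with torch.no_grad():
        adapter.A.div_(adapter.A.norm())

    text_embed = model.get_text_features(**tokens)
    mse = F.mse_loss(text_embed.expand_as(image_embeds), image_embeds)
    # The L2 penalty keeps the update strength B near zero unless the
    # MSE term genuinely needs it.
    loss = mse + lam * adapter.B.pow(2).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```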
Handling Multi-Concept Queries
What if you want to search for “Fido playing with my favorite ball”? You have two personal concepts in one query.
Textual inversion struggles here because multiple learned tokens can interact unpredictably. POLAR, however, leverages the linearity of its updates. Because the updates are low-rank additions to the same weight matrix, you can simply add the parameters for concept 1 and concept 2 during the forward pass:
\[ W_V' = W_V + B_1 A_1 + B_2 A_2 \]
This allows users to compose multiple personal concepts dynamically without re-training the model.
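A sketch of this composition, reusing `Rank1LoRALinear` from earlier (the wrapper below is our own construction):

```python
import torch
import torch.nn as nn

class ComposedLoRALinear(nn.Module):
    """Applies several independently trained rank-1 updates at once."""

    def __init__(self, base: nn.Linear, adapters: list):
        super().__init__()
        self.base = base
        self.adapters = nn.ModuleList(adapters)  # e.g. [fido_lora, ball_lora]

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.base(x)
        # Linearity: the rank-1 deltas are simply summed at inference
        # time; no re-training is needed to combine concepts.
        for ad in self.adapters:
            out = out + (x @ ad.A.t()) @ ad.B.t()
        return out
```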
Experimental Results
The researchers evaluated POLAR on two challenging benchmarks: DeepFashion2 (clothing items) and ConCon-Chi (diverse objects in complex contexts).
Quantitative Performance
POLAR achieves state-of-the-art results. On DeepFashion2, it significantly outperforms PALAVRA and SEARLE (a zero-shot baseline).

In Table 1, we see POLAR achieving a Mean Reciprocal Rank (mRR) of 34.82, compared to PALAVRA’s 28.4. When scaled up to the larger ViT-L/14 architecture, the performance jumps to 40.72.
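For readers unfamiliar with the metric, mean reciprocal rank averages the reciprocal of the rank at which the correct image appears, over all queries:

```python
def mean_reciprocal_rank(ranks):
    """ranks[i] is the 1-indexed position of the correct image for query i."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# Correct images ranked 1st, 3rd, and 2nd across three queries:
print(mean_reciprocal_rank([1, 3, 2]))   # 0.611...
```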
Qualitative Performance
Numbers are great, but visual retrieval results tell the real story. In Figure 3, we see a comparison between POLAR and SEARLE.

Look at the top row (the floral shorts).
- The Query: “An image of V*” (where V* is the shorts).
- SEARLE (Red boxes): Retrieves images that have similar colors or vibes but gets the object wrong (retrieving a skirt or just a person in a room).
- POLAR (Green boxes): Correctly identifies the specific floral pattern and cut of the shorts across different poses.
This demonstrates POLAR’s ability to distinguish fine-grained details, which is crucial for personalized search.
Preserving General Knowledge
One of the paper’s main claims is that POLAR preserves general knowledge better than token-based methods. To test this, they devised a clever metric: VLM Caption Recall.
They took the personalized model (tuned for “Fido”) and asked it to retrieve images based on generic captions generated by a Vision Language Model (like LLaVA) that had nothing to do with Fido (e.g., “A kitchen with a stove”).
If the model is “broken” by the personalization, it should fail to retrieve these generic images.
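The paper defines the metric; the implementation below is our guess at a minimal version, reusing the `retrieve` helper from earlier:

```python
def vlm_caption_recall(captions, gallery_embeds, target_idx, k=10):
    """Fraction of generic (non-personal) captions whose ground-truth
    image still shows up in the personalized model's top-k results."""
    hits = sum(
        int(t in retrieve(c, gallery_embeds, top_k=k))
        for c, t in zip(captions, target_idx)
    )
    return hits / len(captions)
```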

Figure 4 shows that even with the personal parameters loaded, the model perfectly retrieves generic concepts like “A kitchen with a stove top” (Right column). Simultaneously, it correctly processes context queries for the personal concept, like “V* upside down on a wooden surface” (Left column).
Ablation Studies: Design Choices
Why did they choose Rank-1? Why Layer 12? The authors justify these choices through extensive testing.
1. Rank of Update: You might think a higher rank (more parameters) implies better learning. However, Table 11 (below) shows that increasing the rank from 1 to 16 yields negligible gains in context accuracy but hurts parameter efficiency. Rank-1 is the “sweet spot” for learning a single concept without overfitting.

2. Which Layers? Applying the update to all layers actually hurts performance compared to just the final layers.

As seen in Table 12 (referenced as Table 5 in the text), updating layers 11 & 12 or just Layer 12 yields the best results. This supports the hypothesis that personalization is a high-level semantic adjustment, not a low-level feature extraction change.
3. Which Parameters? Table 13 confirms that the Value (V) projection is the most effective place to inject the update.

Updating the Query (Q) or Key (K) matrices results in poor performance (mRR of ~16-23 vs 52 for V). This suggests that we shouldn’t change where the model looks (Attention weights), but rather what information it extracts (Values) from that location.
Limitations
No method is perfect. POLAR relies on the frozen image encoder of CLIP. If CLIP’s image encoder cannot “see” the difference between two very similar textures, POLAR cannot magically learn it because it only updates the text side.

Figure 5 highlights these failure cases. In the top row, the model retrieves a blue striped shirt instead of a blue polka-dot shirt. To the frozen CLIP image encoder, these might look nearly identical. This is a fundamental limitation of “Text-Encoder-Only” fine-tuning.
Conclusion
POLAR represents a shift in how we think about personalizing large pre-trained models. Instead of treating the model as a fixed black box and searching for the perfect input (Textual Inversion), POLAR treats the model itself as malleable.
By applying regularized, rank-1 updates to the final value projection, POLAR achieves a “Goldilocks” state:
- Flexible enough to learn “Fido” from 5 images.
- Stable enough to remember what a “frisbee” is.
- Lightweight enough to store one tiny vector per concept.
For students and researchers interested in efficient fine-tuning, this paper offers a masterclass in how targeted, minimal architectural changes can often outperform complex, heavy-handed approaches.