Artificial Intelligence has witnessed a visual revolution in recent years. Models like CLIP and SigLIP can look at a photograph and instantly categorize it, distinguishing a “Golden Retriever” from a “Labrador” with superhuman accuracy. They achieve this through Zero-Shot Learning, meaning they can recognize categories they weren’t explicitly trained on, provided they saw similar concepts during their massive pre-training on the internet.

But there is a catch. These models thrive on data abundance. They excel at recognizing things that appear frequently on the web—dogs, cars, sunsets, and celebrities. What happens, though, when we ask them to identify a rare plant species, a complex electronic circuit diagram, or a specific type of skin lesion?

These are low-resource domains. In these niche areas, general-purpose Vision-Language Models (VLMs) often fail because they simply haven’t seen enough examples during pre-training.

The standard solution is usually “more training.” Engineers might fine-tune the model or use generative AI (like Stable Diffusion) to create synthetic training data. However, fine-tuning requires expertise and data that might not exist, and synthetic data can often be hallucinated or physically incorrect.

In this post, we are diving into a research paper that proposes a clever, training-free alternative: CoRE (Combination of Retrieval Enrichment). Instead of forcing the model to learn new weights, the researchers ask: What if we just let the model “look up” the context it needs from a database?

The Problem: When General Knowledge Falls Short

To understand why CoRE is necessary, we first need to look at the limitations of current Vision-Language Models (VLMs).

VLMs operate by mapping images and text into a shared “embedding space.” Ideally, a photo of a circuit amplifier and the text “a circuit diagram of an amplifier” should land right next to each other in this mathematical space.
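
To make this concrete, here is a minimal sketch of standard zero-shot classification with a CLIP-style model via the Hugging Face transformers API. The model name, image path, and class prompts are illustrative placeholders, not taken from the paper.

```python
# Minimal sketch of standard zero-shot classification in a shared embedding space.
# Model name, image path, and prompts are illustrative placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("query.jpg")                      # the image we want to classify
class_names = ["amplifier", "LED driver"]            # hypothetical circuit classes
prompts = [f"a circuit diagram of an {c}" for c in class_names]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the (scaled) image-text similarities in the shared space
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(class_names, probs[0].tolist())))
```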

However, in low-resource domains, the model’s internal representation is weak. It might know what a “circuit” is generally, but it struggles to differentiate an amplifier from an LED driver because those specific concepts were drowned out by billions of pictures of cats and coffees during pre-training.

The Pitfalls of Synthetic Data

A popular recent approach to fix this is generating synthetic data. If you only have 5 pictures of a rare disease, you might ask an image generator to make 100 more, then train your classifier on those.

The paper highlights a major flaw in this approach. When you generate synthetic data for rare domains, the generator often doesn’t “know” the physics or biology of the subject either.

Table 7: Synthetic images from the baseline of Zhang et al. (2024). We show original samples, the "positive" augmentation, and the "negative" augmentation.

As shown in the figure above, synthetic augmentation can go wrong. The “Positive” columns show augmentations that are too similar to the original (adding no new info), while the “Negative” columns show augmentations that break the rules of the domain—creating circuits that don’t work or biological features that don’t exist. This noise confuses the classifier rather than helping it.

The Solution: CoRE (Combination of Retrieval Enrichment)

The researchers propose CoRE, a method that improves classification accuracy without generating fake images or retraining the model.

The intuition is simple: If you (a human) were asked to identify a rare engine part, and you didn’t know what it was, you wouldn’t hallucinate an answer. You would search for similar images or descriptions in a database to gain context.

CoRE does exactly this. It uses a Retrieval-Augmented strategy. It enriches the mathematical representation of both the query image (the picture we want to classify) and the class prototypes (the text labels we are choosing from) by retrieving relevant real-world captions from a massive external database (like CC12M or COYO-700M).
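
As a rough sketch of what such a retrieval step could look like in practice, the snippet below builds a FAISS index over pre-computed caption embeddings and returns the top-k captions for a query embedding. The file names, embedding dimension, and the assumption that embeddings are L2-normalized are mine, not the paper's.

```python
# Sketch of a caption retrieval index over a web-scale caption set (e.g. CC12M),
# assuming caption embeddings were pre-computed with the VLM's text encoder
# and L2-normalized. File names and dimension are hypothetical.
import numpy as np
import faiss

dim = 512
caption_embs = np.load("caption_embeddings.npy").astype("float32")  # shape (N, dim)
captions = open("captions.txt", encoding="utf-8").read().splitlines()

index = faiss.IndexFlatIP(dim)   # inner product == cosine similarity on normalized vectors
index.add(caption_embs)

def retrieve(query_emb: np.ndarray, k: int = 16):
    """Return the k most similar captions and their similarity scores."""
    scores, ids = index.search(query_emb[None, :].astype("float32"), k)
    return [captions[i] for i in ids[0]], scores[0]
```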

Figure 1: Our retrieval-based solution enriches both images and textual descriptors with real-world captions which contain domains and classes. Even when the captions are generic (third row for each example), they can still restrict the focus to the correct domain.

As illustrated in Figure 1, even if the model doesn’t know the exact class, retrieving captions like “Electronics that control LED patterns” helps steer the model toward the correct domain (Circuits) and away from irrelevant concepts.

How CoRE Works: A Deep Dive

The architecture of CoRE is symmetric. It performs retrieval on two sides: the Text Side (Class Enrichment) and the Image Side (Query Enrichment). Let’s look at the full architecture:

Figure 2: Our CoRE enriches both the image embedding \(z_q\) and the class prompts \(p\) with retrieved captions from a large-scale web-crawled database \(\mathbb{D}\). We weight the retrieved captions \(T\) with their similarity scores \(S^T\), which we skew with controllable temperatures \(\tau_{i2t}\) and \(\tau_{t2t}\). By combining the retrieved caption embeddings with the original representations \(W\) and \(q\) through \(\alpha\) and \(\beta\), we obtain enriched representations \(W^+\) and \(z_q^+\), which we employ for zero-shot classification.

Let’s break down the mathematics and logic of each branch.

Part 1: Class Representation Enrichment

In standard zero-shot classification, we take a class name (e.g., “Amplifier”) and wrap it in a prompt (e.g., “A photo of an Amplifier”). We encode this text into a vector \(W\). In low-resource domains, this vector \(W\) is often too generic.

CoRE enriches this vector by finding related context:

  1. Prompting: The system generates a prompt for the class \(c_n\) (e.g., “A circuit diagram of an amplifier”).
  2. Retrieval: It uses a Large Language Model (LLM) encoder to search a massive database \(\mathbb{D}\) for the \(k\) most similar captions.
  3. Weighting: Not all retrieved captions are equally useful. The system calculates a weight \(\sigma_n^T\) for each retrieved caption based on its similarity score \(S^T\). It uses a “softmax” function with a temperature parameter \(\tau_{t2t}\) to control how sharp this weighting is:

\[
\sigma_n^T = \operatorname{softmax}\!\left(\frac{S^T}{\tau_{t2t}}\right)
\]

If \(\tau_{t2t}\) is low, the model focuses only on the very best matches. If high, it considers a broader range of captions.

  4. Enrichment: The retrieved captions are encoded and combined into a single "retrieved context" vector \(W^T\). Finally, this context is merged with the original class prototype \(W\) using a balancing parameter \(\alpha\):

\[
W^+ = (1 - \alpha)\, W + \alpha\, W^T
\]

Here, \(W^+\) is the new, super-charged class representation. It contains the original class name plus all the rich context found in the database.
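
Below is a minimal sketch of this class-enrichment branch, under my reading of the post: the retrieved caption embeddings are weighted by a temperature-scaled softmax of their scores and mixed with the original prototype through \(\alpha\) as a convex combination. The function names and the final renormalization are assumptions, not the authors' code.

```python
import numpy as np

def softmax(scores, tau):
    """Temperature-scaled softmax: low tau sharpens toward the best matches."""
    x = np.asarray(scores, dtype=np.float64) / tau
    x -= x.max()                       # numerical stability
    e = np.exp(x)
    return e / e.sum()

def enrich_class_prototype(W_n, caption_embs, scores, tau_t2t=0.1, alpha=0.2):
    """Enrich one class prototype W_n with k retrieved caption embeddings.

    caption_embs: (k, dim) embeddings of captions retrieved for this class
    scores:       (k,) text-to-text similarity scores S^T
    """
    sigma = softmax(scores, tau_t2t)                       # weights sigma_n^T
    W_T = (sigma[:, None] * caption_embs).sum(axis=0)      # retrieved-context vector W^T
    W_plus = (1 - alpha) * W_n + alpha * W_T               # assumed convex mix via alpha
    return W_plus / np.linalg.norm(W_plus)                 # renormalize for cosine similarity
```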

Part 2: Image Query Enrichment

The process is mirrored for the image we are trying to classify. Standard models just encode the image into a vector \(z_q\). CoRE asks: What text captions are usually associated with images that look like this?

  1. Visual Encoding: The query image \(q\) is encoded into vector \(z_q\).
  2. Image-to-Text Retrieval: The system searches the database for captions that are visually similar to the query image.
  3. Weighting: Similar to the text side, the retrieved captions are weighted based on their similarity scores, controlled by temperature \(\tau_{i2t}\):

\[
\sigma_q^I = \operatorname{softmax}\!\left(\frac{S^I}{\tau_{i2t}}\right)
\]

where \(S^I\) are the image-to-text similarity scores of the retrieved captions.

  4. Enrichment: These retrieved captions form a context vector \(z^T\). This is merged with the original image embedding using a balancing parameter \(\beta\):

\[
z_q^+ = (1 - \beta)\, z_q + \beta\, z^T
\]

Now, \(z_q^+\) is an enriched representation of the image. It’s not just pixel data anymore; it’s pixel data supported by descriptive text found in the external world.
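
The query branch mirrors the class branch. Here is a sketch reusing the `softmax` helper from the previous snippet, with the same caveat that the \(\beta\)-mix is written as a convex combination.

```python
def enrich_query_embedding(z_q, caption_embs, scores, tau_i2t=0.1, beta=0.5):
    """Enrich the query image embedding with captions retrieved via image-to-text search.

    caption_embs: (k, dim) embeddings of captions whose content resembles the query image
    scores:       (k,) image-to-text similarity scores
    """
    sigma = softmax(scores, tau_i2t)                       # temperature-skewed weights
    z_T = (sigma[:, None] * caption_embs).sum(axis=0)      # retrieved textual context z^T
    z_plus = (1 - beta) * z_q + beta * z_T                 # assumed convex mix via beta
    return z_plus / np.linalg.norm(z_plus)
```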

Part 3: The Final Prediction

With both sides enriched, the classification becomes a simple comparison. The model calculates the similarity between the enriched image \(z_q^+\) and all the enriched class prototypes \(W^+\).

The predicted class \(\hat{c}\) is the one that maximizes this similarity:

\[
\hat{c} = \arg\max_{c_n} \; \operatorname{sim}\!\left(z_q^+,\, W_n^+\right)
\]

where \(W_n^+\) is the enriched prototype of class \(c_n\).

This method is elegant because it is training-free. The parameters \(\alpha\) and \(\beta\) are hyperparameters, not learned weights. You don’t need backpropagation, GPUs for training, or a training split of your rare data. You just need a database and a pre-trained VLM.
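
Putting the sketches together, the whole pipeline reduces to a similarity argmax over the enriched vectors. This is again a sketch built on the hypothetical helpers above, not the authors' implementation.

```python
def classify(z_q_plus, W_plus):
    """Predict the class whose enriched prototype best matches the enriched query.

    z_q_plus: (dim,) enriched, normalized query embedding
    W_plus:   (num_classes, dim) enriched, normalized class prototypes
    """
    sims = W_plus @ z_q_plus           # cosine similarities (vectors are normalized)
    return int(np.argmax(sims))        # index of the predicted class c_hat

# Usage sketch (embeddings and retrievals come from the frozen VLM and the caption index):
# W_plus   = np.stack([enrich_class_prototype(W[n], T_n, S_n)
#                      for n, (T_n, S_n) in enumerate(class_retrievals)])
# z_q_plus = enrich_query_embedding(z_q, T_q, S_q)
# pred     = classify(z_q_plus, W_plus)
```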

Experiments and Results

The authors validated CoRE on three distinct low-resource datasets:

  1. Circuits: 1,332 images of electronic diagrams (32 classes).
  2. iNaturalist 2021 (LT100): Rare plants and animals (100 classes).
  3. HAM10000: Dermatoscopic medical images of skin lesions (7 classes).

They compared CoRE against top-tier models like CLIP, SigLIP, and ImageBind, as well as the state-of-the-art approach that fine-tunes these models on synthetically generated data.

Key Findings

The results were impressive across the board. CoRE consistently outperformed standard zero-shot baselines and, more importantly, outperformed the complex fine-tuning methods in most scenarios.

  • Beating Fine-Tuning: On the Circuits dataset, CoRE achieved 43.88% top-1 accuracy (using COYO-700M), beating the fine-tuned ImageBind model (24.10%) and fine-tuned SigLIP (19.53%) by massive margins.
  • Medical Imaging: On HAM10000, CoRE achieved 62.21% accuracy, far above standard zero-shot models like CLIP (roughly 7-8%) and nearly double the fine-tuned ImageBind baseline (31.60%).

Why did fine-tuning fail? In these “rare” domains, training data is so scarce (sometimes five shots or fewer) that the models overfit rapidly or learn noise from synthetic data. Retrieval, on the other hand, introduces robust external knowledge without the risk of overfitting.

Tuning the Knobs: Alpha and Beta

An interesting part of the analysis was understanding how much “enrichment” is actually needed. The parameters \(\alpha\) and \(\beta\) control the mix between the original model knowledge and the retrieved knowledge.

Figure 3: Top-1 accuracy of CoRE CC12M on Circuits with varying \(\alpha\) and \(\beta\). CoRE achieves the best performance with a balanced merge of image-retrieved captions (\(\beta \approx 0.5\)), while for class-relevant captions the best weighting is slightly lower (\(\alpha \approx 0.2\)).

The heatmap above illustrates the “sweet spot” for accuracy on the Circuits dataset.

  • \(\beta \approx 0.5\): For the image side, a balanced mix (50/50) of original image signal and retrieved text signal works best. The retrieved text provides vital context that the pixels alone lack.
  • \(\alpha \approx 0.2\): For the class side, the original class name is still the most important feature. The retrieved context helps, but shouldn’t overpower the specific class label.
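
Since \(\alpha\) and \(\beta\) are the only knobs, a sensitivity sweep like the one behind Figure 3 is cheap to reproduce on a labeled evaluation set. Below is a hedged sketch of such a grid search, built on the hypothetical helpers from the earlier snippets; the data structures it expects are my assumptions.

```python
def sweep_alpha_beta(queries, labels, W, class_retrievals,
                     alphas=np.linspace(0.0, 1.0, 11), betas=np.linspace(0.0, 1.0, 11)):
    """Grid-search top-1 accuracy over (alpha, beta) on a labeled evaluation set.

    queries:          list of (z_q, caption_embs, scores) per test image
    labels:           ground-truth class indices
    W:                (num_classes, dim) original class prototypes
    class_retrievals: list of (caption_embs, scores) per class
    """
    acc = np.zeros((len(alphas), len(betas)))
    labels = np.asarray(labels)
    for i, a in enumerate(alphas):
        W_plus = np.stack([enrich_class_prototype(W[n], T_n, S_n, alpha=a)
                           for n, (T_n, S_n) in enumerate(class_retrievals)])
        for j, b in enumerate(betas):
            preds = [classify(enrich_query_embedding(z_q, T_q, S_q, beta=b), W_plus)
                     for z_q, T_q, S_q in queries]
            acc[i, j] = (np.asarray(preds) == labels).mean()
    return acc  # rows index alpha, columns index beta, as in the heatmap
```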

Prompt Engineering Matters

The authors also discovered that how you phrase the prompts matters, and that the zero-shot weights and the retrieval step benefit from different levels of specificity.

Table 5: Accuracy of our CoRE CC12M with different prompting strategies for zero-shot weights and text-to-text retrieval. Employing a domain-specific prefix for zero-shot weights and a domain-agnostic prefix for retrieval leads to generally better results across all the benchmarks.

As shown in Table 5, using a domain-specific prompt (like “A circuit diagram of…”) for the zero-shot weights, combined with a generic prompt for retrieval, yielded the best results. This highlights that while retrieval broadens the context, the classifier still needs to know specifically what “domain” it is operating in to make the final decision.
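
As a tiny illustration of that split, one could build the two prompt sets like this; the prefix wording and class names are placeholders, not the exact templates from the paper.

```python
# Two prompt sets following the strategy above: a domain-specific prefix for the
# zero-shot class weights, and a domain-agnostic prefix for text-to-text retrieval.
# Prefix wording and class names are illustrative placeholders.
class_names = ["amplifier", "LED driver"]

zero_shot_prompts = [f"a circuit diagram of an {c}" for c in class_names]  # domain-specific
retrieval_queries = [f"a photo of an {c}" for c in class_names]            # domain-agnostic
```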

Conclusion

The CoRE paper presents a compelling argument for bringing retrieval augmentation, the idea behind Retrieval-Augmented Generation (RAG), to image classification. In the era of massive foundation models, we often assume that “bigger is better” or “more training is the solution.”

However, for low-resource domains—the long tail of the distribution where rare plants, specific electronic components, and unique medical conditions live—training is often not an option. CoRE demonstrates that we can bridge this gap by simply connecting our models to a database of external knowledge.

By enriching both the image and the class label with retrieved real-world context, CoRE turns a confused VLM into a domain expert, all without updating a single model parameter. It is a win for efficiency, accuracy, and the democratization of AI for niche fields.