Introduction
In the era of deep learning, data is the new oil. But there is a catch: refining that oil—training models on massive datasets—is incredibly expensive and computationally demanding. For many students and researchers, training a state-of-the-art model on the full ImageNet or Food-101 dataset is simply out of reach due to hardware limitations.
This brings us to subset selection (also known as coreset selection). The goal is simple yet ambitious: can we identify a small, informative fraction of the training data (say, 10% or 30%) that allows a model to learn almost as well as if it had seen the whole dataset?
Traditionally, solving this problem faced a “chicken and egg” dilemma. To know which data points are important, you usually need a trained model to evaluate them. But if you have to train a model to select the data to train the model… you haven’t saved any time.
A new paper, “Foundation Model Insights and a Multi-Model Approach for Superior Fine-Grained One-shot Subset Selection,” proposes a way out of this loop. The researchers investigate using pre-trained Foundation Models (FMs)—like CLIP and DINOv2—as “judges” to select data. They uncover surprising insights about when these models work (and when they don’t) and introduce a novel method called RAM-APL that combines multiple foundation models to achieve state-of-the-art results on fine-grained datasets.
The Bottleneck of Traditional Selection
To understand why this new approach is significant, we first need to look at how “One-Shot Subset Selection” is usually done.
In a typical pipeline, you need an Information Extractor (IE). This is a neural network that looks at your data and extracts features (mathematical representations of the images). These features are then measured for importance—perhaps based on how unique they are or how unsure the model is about them.

As shown in Figure 1 (a) above, the traditional pipeline requires training a model on the target dataset before selection begins. This creates a dataset dependency. Every time you have a new dataset, you must spend time and compute pre-training a proxy model just to figure out which data to keep.
The researchers propose a shift to the pipeline shown in Figure 1 (b). Instead of training a new proxy model from scratch, why not use Foundation Models (FMs)? These massive models (like CLIP, SigLIP, or DINOv2) have already been trained on billions of images. They possess “general knowledge.” Theoretically, we should be able to plug them in as Information Extractors immediately, skipping the pre-training phase entirely.
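To make the "plug-in" idea concrete, here is a minimal sketch of using two off-the-shelf FMs as Information Extractors via Hugging Face `transformers`. The checkpoint names and the helper function are illustrative assumptions, not necessarily the exact models or code used in the paper.

```python
# Minimal sketch: pre-trained FMs as plug-in Information Extractors,
# with no proxy-model training on the target dataset.
# Checkpoint names are illustrative; swap in any CLIP/DINOv2 variant.
import torch
from transformers import AutoImageProcessor, AutoModel, CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

# CLIP image encoder: language-aligned, semantic features.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# DINOv2 encoder: self-supervised, structure-oriented features.
dino = AutoModel.from_pretrained("facebook/dinov2-base").to(device).eval()
dino_proc = AutoImageProcessor.from_pretrained("facebook/dinov2-base")

@torch.no_grad()
def extract_features(images: list) -> dict:
    """Return one feature matrix per foundation model for a batch of PIL images."""
    clip_in = clip_proc(images=images, return_tensors="pt").to(device)
    dino_in = dino_proc(images=images, return_tensors="pt").to(device)
    return {
        "clip": clip.get_image_features(**clip_in),          # (N, 512)
        "dinov2": dino(**dino_in).last_hidden_state[:, 0],   # CLS token, (N, 768)
    }
```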
But does this actually work?
Part 1: The Single-Model Investigation
The authors didn’t just assume FMs would be better; they conducted a rigorous “Single-Model Study” to test the hypothesis. They compared traditional proxy models against various FMs across different types of datasets:
- Coarse-grained: Identifying generic objects (e.g., CIFAR-10: airplanes vs. birds).
- Fine-grained: Identifying specific sub-categories (e.g., Oxford-IIIT Pet: distinct dog breeds).
- Noisy: Datasets where some labels are incorrect.
Insight 1: FMs Shine on Fine-Grained Data
The study revealed a stark divide in performance. On coarse-grained datasets, especially those with noisy labels, Foundation Models offered limited advantages. Sometimes, a simple model trained on the target data performed better.
However, on fine-grained datasets, FMs were dominant. As shown in Figure 6 below, specifically chart (d), FMs (the colored bars) frequently outperformed traditional methods (the blue/gray bars) on the Oxford-IIIT Pet dataset.

Because FMs have seen such a vast variety of objects during their own pre-training, they are exceptionally good at distinguishing subtle features—like the texture of a terrier’s fur versus a retriever’s—which is crucial for fine-grained tasks.
Insight 2: Not All FMs Are Created Equal
Here is the second surprise: a “better” Foundation Model doesn’t necessarily make a better data selector. You might assume that if Model A has higher accuracy on a classification task than Model B, it should also be better at selecting data.

Figure 2 shows this isn’t the case. The scatter plots map the model’s accuracy on the full task (x-axis) against its performance as a data selector (y-axis). If the correlation were perfect, all points would form a diagonal line. Instead, we see that models like EVA-CLIP might be great classifiers but suboptimal selectors for certain algorithms.
This leads to a dilemma: If we want to use FMs, which one do we choose? If we have to test them all to find the best one, we are back to wasting time.
Part 2: The Multi-Model Approach (RAM-APL)
To solve the selection dilemma and maximize performance on fine-grained datasets, the authors propose a new method: RAM-APL.
The core philosophy is consensus. Different Foundation Models see the world differently. DINOv2 might focus on object structure, while CLIP might focus on semantic associations. By combining them, we can get a robust estimate of data importance without knowing which specific model is “best” for the task.
The authors use a pool of models (specifically CLIP and DINOv2 in their main experiments) to calculate two key metrics: RAM (for intra-class ranking) and APL (for inter-class distinction).
1. RAnking Mean (RAM)
RAM focuses on Representativeness. Within a specific class (e.g., “Siamese Cat”), we want to pick images that are arguably the “best examples” of that class.
First, for every class \(c\) and every foundation model \(i\), the method calculates a “centroid”—the average feature vector of all images in that class:
\[ \mu_c^{i} = \frac{1}{|S_c|} \sum_{x_j \in S_c} f_i(x_j) \]

where \(S_c\) is the set of images in class \(c\) and \(f_i(\cdot)\) denotes the feature extractor of foundation model \(i\).
Next, it calculates the Euclidean distance between every image \(j\) and its class center. A small distance means the image is very typical of that class (a prototype).
\[ d_j^{i} = \left\lVert f_i(x_j) - \mu_{y_j}^{i} \right\rVert_2 \]

where \(y_j\) is the class of image \(x_j\).
The images are ranked based on this distance. Finally, the Ranking Mean (RAM) is the average rank of an image across all the different Foundation Models used.
\[ \mathrm{RAM}(x_j) = \frac{1}{M} \sum_{i=1}^{M} r_j^{i} \]

where \(r_j^{i}\) is the within-class rank of image \(x_j\) under model \(i\) and \(M\) is the number of foundation models.
Visualizing RAM: The authors provide a visualization of what this metric actually captures. In Figure 9, images on the left have a small RAM value, meaning they rank near the top of their class. Notice how clear and centered the subjects are. As RAM grows larger (toward the right), the images become more cluttered, obscured, or atypical.
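Putting the centroid, distance, and rank-averaging steps together, a minimal NumPy sketch of RAM could look like the following. It assumes `features` maps each FM name to an (N, d) feature matrix (for instance, the output of an extractor like the one sketched earlier) and `labels` is an (N,) array of integer class ids; the names are illustrative, not the paper's reference code.

```python
import numpy as np

def ram_scores(features: dict, labels: np.ndarray) -> np.ndarray:
    """Average within-class rank of each image across all foundation models."""
    per_model_ranks = []
    for feats in features.values():
        dists = np.empty(len(labels))
        ranks = np.empty(len(labels))
        for c in np.unique(labels):
            idx = np.where(labels == c)[0]
            centroid = feats[idx].mean(axis=0)                     # class centroid
            dists[idx] = np.linalg.norm(feats[idx] - centroid, axis=1)
            order = np.argsort(dists[idx])                         # closest first
            ranks[idx[order]] = np.arange(len(idx))                # rank 0 = prototype
        per_model_ranks.append(ranks)
    return np.mean(per_model_ranks, axis=0)                        # the RAM score
```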

2. Accuracy of Pseudo-class Labels (APL)
While RAM finds representative images, we also need to know how distinct an image is from other classes. APL measures Distinctiveness.
For a given image, each Foundation Model attempts to guess its label based purely on feature distances (a “pseudo-label”).
\[ \hat{y}_j^{i} = \arg\min_{c} \left\lVert f_i(x_j) - \mu_c^{i} \right\rVert_2 \]
If the model guesses correctly (the image’s feature is closest to its own class center), it gets a score of 1. If it guesses wrong (confusing a cat for a dog), it gets a 0. The APL score is the average accuracy across all Foundation Models.
\[ \mathrm{APL}(x_j) = \frac{1}{M} \sum_{i=1}^{M} \mathbb{1}\left[ \hat{y}_j^{i} = y_j \right] \]
If an image has a low APL score, it means multiple powerful Foundation Models are confused by it. It is likely an outlier or hard sample.
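Under the same assumptions as the RAM sketch (a `features` dict of per-model feature matrices and integer `labels`), the nearest-centroid pseudo-labeling and averaging can be sketched as:

```python
import numpy as np

def apl_scores(features: dict, labels: np.ndarray) -> np.ndarray:
    """Fraction of foundation models whose nearest-centroid pseudo-label is correct."""
    classes = np.unique(labels)
    per_model_hits = []
    for feats in features.values():
        centroids = np.stack([feats[labels == c].mean(axis=0) for c in classes])
        # Distance from every image to every class centroid: shape (N, C).
        d = np.linalg.norm(feats[:, None, :] - centroids[None, :, :], axis=2)
        pseudo = classes[np.argmin(d, axis=1)]                 # nearest-centroid label
        per_model_hits.append((pseudo == labels).astype(float))
    return np.mean(per_model_hits, axis=0)                     # the APL score
```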
3. Fusing the Scores
The final selection score combines RAM (Representativeness) and APL (Distinctiveness).

The weights \(W_1\) and \(W_2\) are not static. The authors use a dynamic weighting mechanism based on the sampling rate (\(p\)).

Why dynamic weights?
- Low Budget (e.g., 1% data): You can’t afford confusing data. You need the most representative examples to learn the basics. The weight shifts toward RAM.
- High Budget (e.g., 50% data): You have the basics covered. Now you need “hard” examples to refine the decision boundary. The weight allows for more consideration of APL.
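The paper's exact weighting function of \(p\) is not reproduced here; the sketch below only illustrates the intuition from the list above, using a deliberately simple, hypothetical linear schedule and a min-max flip so that small RAM values translate into high scores.

```python
import numpy as np

def select_subset(ram: np.ndarray, apl: np.ndarray, p: float) -> np.ndarray:
    """Pick the top p-fraction of images by a weighted RAM/APL score (illustrative)."""
    n_keep = int(round(p * len(ram)))
    w_ram, w_apl = 1.0 - p, p              # hypothetical schedule, not the paper's
    # Small RAM is good (prototypical), so flip it into a "higher is better" score.
    ram_score = 1.0 - (ram - ram.min()) / (ram.max() - ram.min() + 1e-12)
    score = w_ram * ram_score + w_apl * apl
    return np.argsort(-score)[:n_keep]     # indices of the selected subset
```

Feeding the outputs of `ram_scores` and `apl_scores` into `select_subset` with, say, `p = 0.1` would return the indices of a 10% subset.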
Experiments and Results
The authors tested RAM-APL against a wide range of baselines, including classic methods like K-Center Greedy, Herding, and newer deep learning selection methods.
Performance Comparison
The results on fine-grained datasets are impressive. Figure 3 shows the performance curves on the Oxford-IIIT Pet dataset.

The red line (Ours) consistently stays at the top. On the Caltech-UCSD Birds dataset (CUB-200-2011), RAM-APL achieved an average improvement of 6.4% over random selection across all sampling rates. This is a massive margin in the context of subset selection.
Why Multiple Models?
Is it really necessary to use multiple models? Couldn't we simply concatenate the features from CLIP and DINOv2 into one representation?
The authors analyzed the “cosine similarity” between the features extracted by different FMs.

Figure 10 shows that the similarity is near zero (the dark blue squares). This indicates that CLIP, SigLIP, and DINOv2 extract fundamentally different types of information. By combining them, RAM-APL gets a much richer view of the data than any single model could provide.
Furthermore, Table 2 demonstrates that adding models generally improves performance. The combination of CLIP (C) and DINOv2 (D) provided the best balance of accuracy and efficiency.

Cross-Architecture Generalization
A common failure point in subset selection is that data selected for one model (e.g., ResNet) might not work well for another (e.g., MobileNet). RAM-APL proves robust here as well.

Table 6 shows that even when the target model is MobileNet-V3 (completely different from the FMs used for selection), RAM-APL still outperforms other methods.
Conclusion
The research paper “Foundation Model Insights and a Multi-Model Approach for Superior Fine-Grained One-shot Subset Selection” makes a compelling case for modernizing how we curate training data.
The key takeaways are:
- Foundation Models are viable Information Extractors, particularly for fine-grained tasks, eliminating the need for costly pre-training of proxy models.
- No single FM is perfect, and reliance on just one can lead to inconsistent results.
- RAM-APL successfully bridges this gap by aggregating insights from multiple FMs, balancing representativeness (via RAM) and distinctiveness (via APL).
By leveraging the “collective wisdom” of Foundation Models, we can select high-quality subsets of data that allow us to train powerful AI models at a fraction of the computational cost. This approach not only democratizes access to efficient training but also highlights a new utility for Foundation Models beyond just generation and classification: they can act as curators for the next generation of AI.