The development of Large Vision-Language Models (LVLMs) like LLaVA and GPT-4V has revolutionized how machines understand the world. These models are typically trained in two stages: first, massive pretraining on image-caption pairs, and second, Visual Instruction Tuning (VIT). The second stage is crucial—it teaches the model to actually follow user instructions, answer questions, and reason about visual content.
However, we have hit a bottleneck. To make these models generalize well, we keep feeding them larger and larger datasets. Training on millions of instruction pairs is prohibitively expensive for most academic labs and smaller organizations. This raises a critical question: do we really need all of this data?
In this post, we dive deep into a research paper titled “Concept-skill Transferability-based Data Selection for Large Vision-Language Models.” The researchers introduce COINCIDE, a novel method that selects a small, highly effective subset of training data (a “coreset”) that allows a model to achieve performance comparable to training on the full dataset—but with a 70% reduction in training time.
The Problem with Current Data Selection
The idea of selecting a subset of data (coreset selection) isn’t new. In standard text-based LLMs, techniques like perplexity filtering or EL2N scores are used to identify “high-quality” or “difficult” examples.
However, Vision-Language tasks are unique. They are multimodal and highly diverse. A dataset might contain simple object recognition tasks, complex optical character recognition (OCR), spatial reasoning puzzles, or conversational queries.
When researchers applied traditional selection metrics to these diverse datasets, they found a significant issue: Bias.

As shown in Figure 1, different tasks (represented by different colors) exhibit vastly different score distributions.
- Top Plot (EL2N): If you select data by error score (EL2N), a band such as the “Mid 20%” turns out to be dominated by tasks like A-OKVQA and OCR-VQA, so other tasks barely make it into the selection.
- Bottom Plot (Self-Filter): If you use a metric like Self-Filter, you bias heavily toward ShareGPT and GQA (the “Top 20%”).
By relying on a single metric, you accidentally strip away the diversity of the dataset. You might get a model that is great at VQA but terrible at reasoning, simply because the selection metric preferred one type of data distribution over another.
The Insight: Concept-Skill Compositions
The authors of COINCIDE propose that we shouldn’t look at data just by “Task ID” (e.g., this is a VQA image, this is a Caption image). Instead, we should look at the underlying Concept-Skill Compositions.
Different datasets often test the same underlying capabilities. Consider Figure 2:

- Top Row: Both VQAv2 and GQA ask about the color of a dog. Even though they are different datasets, they share the composition: “Dog playing in water / Color attribute.”
- Bottom Row: LLaVA-Conv and LLaVA-Reason both ask about a snowboarder. They share: “Jumping with snowboard / Reasoning.”
If we can identify these underlying clusters of skills, we can sample data that covers all distinct compositions, rather than over-sampling one specific dataset.
The COINCIDE Method
COINCIDE stands for COre INstruction Concept-skIll Data Election. It is a pipeline that uses a small reference model to curate data for a large target model.
The method operates on three main pillars:
- Clustering to find concepts and skills.
- Transferability to prioritize useful clusters.
- Density to avoid redundancy.
Here is the high-level overview of the pipeline:

Let’s break down each step mathematically and conceptually.
1. Discovering Concepts via Clustering
To group data by “skills,” we need a rich representation of the image and text. The authors found that using the final output of a model isn’t enough. Instead, they use a small, off-the-shelf LVLM (like TinyLLaVA-2B) and extract neuron activations from multiple intermediate layers.
Why multiple layers? Early layers in a neural network might capture simple edges or colors, while deeper layers capture complex reasoning or semantic meaning. By combining them, we get a holistic “fingerprint” of the concept-skill composition.
The feature vector \(u^m\) for a data point is created by normalizing and concatenating features from selected layers:
\[ u^m = \left[ \bar{v}^m_{l_1} \,;\, \bar{v}^m_{l_2} \,;\, \dots \,;\, \bar{v}^m_{l_n} \right], \qquad \bar{v}^m_{l} = \frac{v^m_{l}}{\lVert v^m_{l} \rVert_2} \]
where \(v^m_{l}\) is the activation vector of layer \(l\) for data point \(m\), and \([\,\cdot\,;\,\cdot\,]\) denotes concatenation.
Using these rich feature vectors, the algorithm performs K-Means clustering (with \(K\) set very high, e.g., 10,000) to group the training data. Each cluster \(C_i\) represents a specific “Concept-Skill Composition.”
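To make this step concrete, here is a minimal sketch in NumPy/scikit-learn. The random arrays stand in for activations that would, in practice, be pulled from several intermediate layers of a small LVLM such as TinyLLaVA-2B (e.g., via forward hooks), and \(K\) is kept small so the toy run finishes quickly; the normalize-concatenate-cluster flow is the part that mirrors the method.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)

# Placeholder activations: in practice these come from several intermediate
# layers of a small reference LVLM (one array per selected layer).
num_points, layer_dims = 5000, [256, 256, 512]
layer_feats = [rng.standard_normal((num_points, d)) for d in layer_dims]

# L2-normalize each layer's features, then concatenate them per example.
normed = [f / np.linalg.norm(f, axis=1, keepdims=True) for f in layer_feats]
u = np.concatenate(normed, axis=1)       # shape: (num_points, sum(layer_dims))

# K-means over the concatenated features. The paper uses a very large K
# (e.g., 10,000); a small K keeps this toy example fast.
K = 50
kmeans = MiniBatchKMeans(n_clusters=K, batch_size=1024, random_state=0).fit(u)
labels, centroids = kmeans.labels_, kmeans.cluster_centers_
print(labels.shape, centroids.shape)     # (5000,) (50, 1024)
```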
2. Measuring Transferability
This is the most novel contribution of the paper. Not all data clusters are created equal. Some clusters contain “foundational” knowledge that helps the model learn other tasks. This is called Transferability.
If learning Cluster A helps the model perform well on Cluster B, then Cluster A has high transferability. Ideally, we want to measure this directly:
\[ T_i \;=\; \frac{1}{K-1} \sum_{j \neq i} \Big( \mathcal{L}(C_j;\, \theta) - \mathcal{L}(C_j;\, \theta_i) \Big), \]
where \(\theta\) is the model before and \(\theta_i\) the model after training on cluster \(C_i\).
In this equation, \(T_i\) represents how much training on cluster \(i\) improves the loss on other target clusters (\(j\)).
However, calculating this directly is intractable: you would have to train and evaluate a separate model for every cluster pair. The authors needed a computationally cheap proxy. They hypothesized that clusters that are close to each other in feature space transfer well to each other.
They tested this hypothesis and found a strong correlation:

As Figure 4 demonstrates, there is a significant positive correlation (Pearson \(r \approx 0.7\)) between the Average Cosine Similarity (\(S\)) of a cluster’s centroid and its actual Transferability (\(T\)).
This allows the authors to use a simple proxy metric, \(S_i\), which is cheap to compute:
\[ S_i \;=\; \frac{1}{K-1} \sum_{j \neq i} \frac{c_i \cdot c_j}{\lVert c_i \rVert \, \lVert c_j \rVert} \]
where \(c_i\) is the centroid of cluster \(C_i\).
Higher \(S_i\) means the cluster is centrally located in the “skill space” and shares features with many other clusters, making it a high-priority candidate for training.
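As a rough illustration, this proxy can be computed directly from the centroids produced by the clustering sketch above; the exact averaging and normalization used in the paper may differ slightly.

```python
import numpy as np

def transferability_proxy(centroids: np.ndarray) -> np.ndarray:
    """Average cosine similarity of each cluster centroid to all other
    centroids: the cheap stand-in for transferability."""
    c = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    sims = c @ c.T                   # pairwise cosine similarities, (K, K)
    np.fill_diagonal(sims, 0.0)      # exclude each centroid's self-similarity
    return sims.sum(axis=1) / (len(c) - 1)

# S = transferability_proxy(centroids)   # `centroids` from the clustering sketch
```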
3. Measuring Density
We also need to consider efficiency. If a cluster is very dense, it means the data points inside it are extremely similar to each other. We don’t need to sample heavily from it because the information is redundant.
The density \(D_i\) is calculated as the average pairwise Gaussian (RBF) kernel similarity between data points in the cluster:
\[ D_i \;=\; \frac{1}{|C_i|^2} \sum_{u^m,\, u^n \in C_i} \exp\!\left( -\frac{\lVert u^m - u^n \rVert^2}{2\sigma^2} \right) \]
A low \(D_i\) indicates the cluster is diverse; a high \(D_i\) indicates redundancy.
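One plausible way to compute this per-cluster density is with an RBF kernel over the concatenated features, as sketched below; the bandwidth `sigma` is an illustrative free parameter, not a value from the paper.

```python
import numpy as np

def cluster_density(features: np.ndarray, sigma: float = 1.0) -> float:
    """Average pairwise Gaussian (RBF) kernel value inside one cluster.
    High values mean the points sit close together, i.e. redundant data."""
    sq = (features ** 2).sum(axis=1)
    sq_dists = np.maximum(sq[:, None] + sq[None, :] - 2.0 * features @ features.T, 0.0)
    kernel = np.exp(-sq_dists / (2.0 * sigma ** 2))
    return float(kernel.mean())

# D = np.array([cluster_density(u[labels == i]) for i in range(K)])
```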
4. The Sampling Strategy
COINCIDE determines the number of samples to pick from each cluster (\(P_i\)) using a categorical distribution derived from the two metrics above:
\[ P_i \propto \exp\left(\frac{S_i}{\tau D_i}\right) \]
where \(\tau\) is a temperature hyperparameter.
- High Transferability (\(S_i\)): Increase sampling probability.
- High Density (\(D_i\)): Decrease sampling probability (because we divide by \(D_i\)).
Once the number of samples for a cluster is decided, specific examples are selected to minimize the Maximum Mean Discrepancy (MMD). This ensures the selected subset statistically resembles the full cluster.
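Putting the two signals together, here is a rough sketch of the sampling stage: a softmax-style budget over clusters, then a greedy, MMD-style pick inside each cluster. The greedy routine is a simple stand-in for whatever MMD minimizer the authors actually use, and `tau`, `sigma`, and `total_budget` are illustrative parameters.

```python
import numpy as np

def sampling_budget(S, D, tau, total_budget):
    """Per-cluster sample counts from a softmax over S_i / (tau * D_i)."""
    logits = S / (tau * D)
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return np.floor(p * total_budget).astype(int)

def select_mmd(features, budget, sigma=1.0):
    """Greedily pick `budget` points whose kernel statistics stay close to the
    full cluster's (a simple MMD-style criterion, not the authors' exact solver)."""
    n = len(features)
    budget = min(budget, n)
    sq = (features ** 2).sum(axis=1)
    kern = np.exp(-np.maximum(sq[:, None] + sq[None, :]
                              - 2.0 * features @ features.T, 0.0) / (2.0 * sigma ** 2))
    row_mean = kern.mean(axis=1)          # mean kernel value of each point vs. the cluster
    chosen, k_to_chosen, sum_aa = [], np.zeros(n), 0.0
    for t in range(1, budget + 1):
        # MMD^2 (up to a constant) if each candidate were added as the t-th point
        cand_aa = sum_aa + 2.0 * k_to_chosen + kern.diagonal()
        cand_cross = (row_mean[chosen].sum() if chosen else 0.0) + row_mean
        scores = cand_aa / t ** 2 - 2.0 * cand_cross / t
        if chosen:
            scores[chosen] = np.inf       # never re-pick a selected point
        j = int(np.argmin(scores))
        chosen.append(j)
        sum_aa = cand_aa[j]
        k_to_chosen += kern[j]
    return chosen

# Example wiring with the arrays from the previous sketches:
# budgets = sampling_budget(S, D, tau=0.1, total_budget=int(0.2 * len(u)))
# coreset = [idx for i in range(K)
#            for idx in np.flatnonzero(labels == i)[select_mmd(u[labels == i], budgets[i])]]
```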

Experimental Results
The researchers tested COINCIDE on two major datasets: LLaVA-1.5 and Vision-Flan. They compared it against 8 strong baselines, including Random sampling, CLIP-Score, and recent methods like Self-Filter.
Performance on LLaVA-1.5
In a head-to-head comparison using only 20% of the training data, COINCIDE outperformed all baselines.

In Table 1, you can see that COINCIDE (bottom row) achieves the highest relative performance (97.4%) compared to the full-finetune model. It beats sophisticated methods like “Self-Filter” and “SemDeDup” across almost every benchmark, particularly in diverse tasks like VQAv2 and ScienceQA (SQA-I).
The consistency of this performance is evident across different sampling ratios:

Figure 5 shows that even at very low sampling ratios (5-10%), COINCIDE (the black line) maintains a lead over the other methods. Notably, some methods, like Self-Filter (dark blue dotted line), perform very poorly in low-data regimes.
Performance on Vision-Flan
The results on the Vision-Flan dataset are even more impressive. This dataset contains 191 distinct tasks, making it highly heterogeneous.

As seen in Table 2, COINCIDE actually outperforms the full-finetune model (101.0% relative performance) while using only 16.7% of the data. This suggests that the full dataset contains noise or conflicting information that COINCIDE effectively filters out.
Efficiency: The Pareto Frontier
One of the most critical aspects of this research is computational efficiency. Coreset selection is useless if the selection process takes longer than training on the full dataset.

Figure 7 plots the relative performance against the total wall-clock time (selection + training).
- The Goal: Top-left corner (High performance, low time).
- The Result: COINCIDE (the black line) dominates the Pareto frontier.
- Comparison: Methods like Self-Filter (blue line) require training a separate scoring network, pushing their time cost way to the right (70-80 hours). COINCIDE achieves ~99% performance in about 30 hours, whereas the full fine-tune takes 50 hours.
Why Does It Work? A Look at the Clusters
To verify that the clustering actually grouped semantic skills meaningfully, the authors visualized the clusters.

Figure 11 shows that the method successfully groups data by complex compositional skills:
- Top Group: “Store sign & OCR + Counting.”
- Second Group: “Waiting for public transportation.”
- Bottom Group: “Child with animals & Reasoning.”
Because COINCIDE samples from all these clusters (weighted by transferability), it ensures that no specific skill is left behind.
We can see this impact on diversity in Figure 13.

The x-axis represents different tasks.
- Biased Methods: Look at “D2-Pruning” or “Self-Filter.” They have massive spikes, over-sampling certain tasks while neglecting hundreds of others.
- COINCIDE (Bottom): It maintains a much more balanced distribution, similar to Random sampling but optimized for utility.
Conclusion
The COINCIDE paper provides a compelling solution to the “data hunger” of Large Vision-Language Models. By cleverly using a small, efficient model to map the “Concept-Skill” landscape of a dataset, we can identify which data points act as the best teachers.
Key Takeaways:
- Diversity Matters: Metrics that look at difficulty alone (like loss) often bias the data, hurting generalization.
- Transferability is Key: Data that sits in the “center” of the concept space helps the model learn adjacent tasks.
- Efficiency: We don’t need massive compute to select data. A small 2B model can effectively curate data for a 7B or 13B model.
- Less is More: On complex datasets like Vision-Flan, removing 83% of the data actually improved the model’s performance by reducing noise.
As models grow larger, techniques like COINCIDE will be essential for sustainable, efficient AI development, allowing researchers to train capable multimodal systems without needing industrial-scale compute clusters.