If you have been following the explosion of Large Vision-Language Models (LVLMs) like LLaVA, GPT-4V, or Gemini, you know that their ability to understand and reason about images is nothing short of impressive. However, behind every capable model lies a massive, expensive bottleneck: Visual Instruction Tuning (VIT).
To train these models, researchers compile massive datasets of images paired with complex textual instructions (Question-Answer pairs). Creating these datasets usually involves feeding thousands of images into expensive proprietary models like GPT-4 to generate descriptions and QA pairs. This creates a dilemma for students and researchers with limited budgets: to build a high-quality dataset, you need money. To save money, you often have to settle for lower-quality data.
But what if you didn’t need to generate instructions for every image? What if you could look at a pile of unlabeled images, pick only the most useful ones, and then spend your budget labeling only those?
In this post, we are diving deep into a paper titled “Filter Images First, Generate Instructions Later,” which introduces PreSel. This new method flips the data selection workflow on its head, achieving state-of-the-art performance while slashing instruction generation costs by 85%.
The Problem: The High Cost of Redundancy
Visual Instruction Tuning (VIT) is the process that teaches a model to follow human instructions based on visual inputs. A typical VIT dataset might contain hundreds of thousands of image-instruction pairs.
The standard industry practice for optimizing this process involves Data Selection. The logic is sound: not all data is equal. Some images are redundant; others are too simple to teach the model anything new. By selecting a high-quality subset, you can speed up training.
However, there is a catch. Most existing data selection methods work like the “Top Method” in the image below:

As shown in Figure 1 (Top), traditional methods require you to start with 100% Unlabeled Images and then generate 100% Instructions (using costly APIs or humans) before you can select the best subset. You are essentially paying for expensive labels for thousands of images that you are just going to throw away.
The authors of this paper propose the Bottom Method (PreSel). Their approach performs the selection directly on the unlabeled images first. By identifying the most promising images based on visual features and task importance, they only need to generate instructions for the selected subset (e.g., 15%).
The Solution: Pre-Instruction Data Selection (PreSel)
The core challenge PreSel addresses is: How do we know an image is important if we don’t know what the instruction will be?
The researchers solved this by breaking the problem into two stages:
- Task-Importance Estimation: Determining which types of vision tasks (e.g., OCR, Visual Question Answering, Captioning) are most valuable to the model.
- Task-Wise Cluster-Based Selection: Selecting the most representative visual examples within those tasks.
Let’s look at the overall architecture of PreSel:

As Figure 3 illustrates, the process starts with a large pool of unlabeled images divided into tasks (\(T_1, \dots, T_M\)). The system takes a tiny, random Reference Set (about 5% of the data) and generates instructions for just that small slice. It uses this reference set to learn which tasks are important. Then, it uses a visual encoder (DINOv2) to cluster the remaining unlabeled images and select the best ones to fill the budget.
Let’s break down the mathematics and logic behind these two stages.
Stage 1: Task-Importance Estimation
In a massive dataset, you might have 50,000 images for “general conversation” and only 5,000 for “complex reasoning.” Simply selecting images proportionally based on size is a bad strategy; the conversation data is likely redundant, while the reasoning data is dense with information.
To solve this, PreSel introduces the Instruction Relevance Score (IRS).
First, let’s define what a data sample looks like.

As seen in Figure 2, a sample consists of an Image (\(I\)), a Question (\(Q\)), and a Response (\(R\)).
The goal of IRS is to measure how much the textual question actually helps the model generate the response, compared to just seeing the image.
- If the model can guess the response (\(R\)) just by looking at the image (\(I\)), the question (\(Q\)) wasn’t very necessary.
- If the model struggles to guess (\(R\)) with the image alone, but performs perfectly when given the question (\(Q\)), then that instruction is highly relevant and important.
The Math Behind IRS
To calculate this, the authors first train a temporary “Reference Model” on that tiny 5% reference set. They then calculate two specific losses.
1. Loss with Context (Q + I): This equation measures the model’s error when predicting the response (\(R\)) given both the Image (\(I\)) and the Question (\(Q\)).
\[
\mathcal{L}_{\text{w/}Q} = -\frac{1}{|R|} \sum_{t=1}^{|R|} \log p_{\theta_{\text{ref}}}\!\left(R_t \mid I, Q, R_{<t}\right)
\]
2. Loss without Context (I only): This equation measures the error when predicting the response (\(R\)) given only the Image (\(I\)), effectively forcing the model to guess the context.
\[
\mathcal{L}_{\text{w/o}Q} = -\frac{1}{|R|} \sum_{t=1}^{|R|} \log p_{\theta_{\text{ref}}}\!\left(R_t \mid I, R_{<t}\right)
\]
3. The IRS Ratio: The Instruction Relevance Score is simply the ratio of these two losses.
\[
\text{IRS} = \frac{\mathcal{L}_{\text{w/}Q}}{\mathcal{L}_{\text{w/o}Q}}
\]
Interpretation:
- Low IRS: The numerator (loss with Q) is much lower than the denominator (loss without Q). This means the Question provided crucial information. These tasks are important.
- High IRS: Adding Q didn’t change the loss much. The instruction is likely redundant or the task is too simple.
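To make this concrete, here is a minimal sketch of the IRS computation. It assumes you have already extracted per-token log-probabilities of the response from the fine-tuned reference model under both conditions; the helper operates on those log-probabilities directly, so it illustrates the formula rather than reproducing the authors' code.

```python
import numpy as np

def token_nll(logprobs):
    """Mean negative log-likelihood over the response tokens."""
    return -float(np.mean(logprobs))

def instruction_relevance_score(logprobs_with_q, logprobs_without_q):
    """IRS = loss(R | I, Q) / loss(R | I).

    logprobs_with_q    -- per-token log-probs of R given the image AND the question
    logprobs_without_q -- per-token log-probs of R given the image only
    Both come from the same reference model fine-tuned on the ~5% reference set.
    """
    return token_nll(logprobs_with_q) / token_nll(logprobs_without_q)

# Toy example: the question makes the response much easier to predict,
# so the loss with Q is far lower and the IRS is low (= important instruction).
with_q = np.log([0.90, 0.80, 0.85])     # confident next-token predictions
without_q = np.log([0.30, 0.20, 0.25])  # the model has to guess
print(instruction_relevance_score(with_q, without_q))  # ~0.12
```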
Finally, the authors calculate the average importance for each task (\(T_i\)) and normalize it to create a sampling weight (\(w\)).


This weight \(w(T_i)\) tells the system exactly what percentage of the final dataset should come from each task.
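In code, the weighting step might look like the sketch below. The specific normalization (inverse of the mean IRS, rescaled to sum to 1) is my own simple stand-in that matches the interpretation above, where lower IRS earns a task more budget; the paper's exact formula may differ.

```python
import numpy as np

def task_sampling_weights(irs_by_task):
    """Turn per-sample IRS values, grouped by task, into sampling weights.

    Assumption: weight each task by the inverse of its mean IRS, so tasks whose
    instructions carry more information (lower IRS) get a larger share of the
    budget, then normalize the weights to sum to 1.
    """
    inverse_mean = {task: 1.0 / np.mean(scores) for task, scores in irs_by_task.items()}
    total = sum(inverse_mean.values())
    return {task: value / total for task, value in inverse_mean.items()}

irs_by_task = {
    "llava_conversation": [0.95, 0.90, 0.97],  # Q adds little -> high IRS
    "a_okvqa": [0.35, 0.40, 0.30],             # Q is crucial  -> low IRS
}
print(task_sampling_weights(irs_by_task))  # A-OKVQA gets the larger share
```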
Stage 2: Task-Wise Cluster-Based Selection
Now that PreSel knows how many images to take from each task, it needs to decide which specific unlabeled images to pick.
Since there are no text labels for the remaining 95% of the data, the method relies on visual features. The authors use DINOv2, a self-supervised vision encoder that is far cheaper to run than the LVLM itself, to extract feature vectors from the images.
Within each task, PreSel performs K-Means clustering. The idea is to group visually similar images together. If a cluster contains 1,000 images of cats sitting on sofas, you probably only need a few of them to teach the concept.
The number of images to select from each cluster (\(n_c\)) is determined by the task weight we calculated earlier.

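Here is a rough sketch of this step, assuming `features` already holds pre-extracted DINOv2 embeddings for one task's unlabeled images. Splitting the task budget across clusters in proportion to cluster size is an assumption on my part, since the exact rule for \(n_c\) is not reproduced above.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_task_images(features, task_budget, n_clusters=50, seed=0):
    """Cluster one task's unlabeled images and assign a per-cluster quota n_c.

    features    -- (N, D) array of DINOv2 embeddings for this task's images
    task_budget -- w(T_i) * total image budget, i.e. how many images this task gets
    The quota here is proportional to cluster size, which is an assumption rather
    than the paper's exact rule.
    """
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = kmeans.fit_predict(features)

    quotas = {}
    for c in range(n_clusters):
        cluster_size = int(np.sum(labels == c))
        quotas[c] = round(task_budget * cluster_size / len(features))
    return labels, quotas
```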
Neighbor Centrality (NC)
Once inside a cluster, which image is the best? PreSel avoids outliers (weird, non-representative images) and instead picks images that are “central” to their neighbors. They use the Neighbor Centrality (NC) score:
\[
\text{NC}(I) = \frac{1}{k} \sum_{I' \in \mathcal{N}_k(I)} \cos\!\left(f(I), f(I')\right)
\]
This formula calculates the average cosine similarity between an image (\(I\)) and its \(k\) nearest neighbors. An image with a high NC score is highly representative of that cluster’s visual concept.
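A minimal implementation of the NC score and the within-cluster pick could look like this, again on pre-extracted DINOv2 features; the neighbor count `k` and the per-cluster quota `n_c` come from the previous steps.

```python
import numpy as np

def neighbor_centrality(features, k=20):
    """NC(I): average cosine similarity between each image and its k nearest neighbors."""
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    sims = normed @ normed.T                  # pairwise cosine similarities
    np.fill_diagonal(sims, -np.inf)           # never count an image as its own neighbor
    k = max(1, min(k, len(features) - 1))
    top_k = np.sort(sims, axis=1)[:, -k:]     # similarities to the k closest neighbors
    return top_k.mean(axis=1)

def select_from_cluster(features, cluster_indices, n_c, k=20):
    """Pick the n_c most representative (highest-NC) images inside one cluster."""
    cluster_feats = features[cluster_indices]
    nc_scores = neighbor_centrality(cluster_feats, k=k)
    best = np.argsort(nc_scores)[::-1][:n_c]
    return [cluster_indices[i] for i in best]
```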
Experimental Results
The theory sounds solid, but does it work in practice? The researchers tested PreSel on the LLaVA-1.5 and Vision-Flan instruction-tuning datasets, comparing it against random selection and other state-of-the-art selection methods (such as TypiClust, IFD, and COINCIDE).
Performance on LLaVA-1.5
The table below shows the performance on the LLaVA-1.5 dataset. The goal is to maximize the “Relative Performance” (Rel %) compared to a model trained on 100% of the data.

Key Takeaways from Table 1:
- PreSel (Bottom Row) achieves 97.9% of the full model’s performance while using only 15% of the data.
- Look at the “Req. Inst.” (Required Instructions) column. Almost every other high-performing method requires 100% of the instructions to be generated before selection can happen. PreSel only required 15%.
- PreSel outperforms Random selection by over 2%, which is a significant margin in this field.
Consistency Across Sampling Ratios
One might wonder whether PreSel only works at a 15% budget. The graph below tracks performance as the sampling ratio increases from 10% to 50%.

As shown in Figure 4, PreSel (the solid black line) consistently sits at the top of the pack. Even at very low budgets (10-15%), it maintains high performance, whereas random selection (red circles) actually degrades in relative effectiveness. Interestingly, at a 50% sampling ratio, PreSel even slightly outperforms the full fine-tuning baseline (crossing the 100% mark), suggesting that filtering out “bad” data can actually produce a better model than using “all” data.
Why Does It Work? The Power of Balancing
A major reason for PreSel’s success is how it re-balances the dataset. It doesn’t just trust the original dataset’s distribution.

Figure 5 compares “Size-Balanced” sampling (gray) vs. “PreSel” sampling (blue).
- Notice the first three bars (LLaVA-Conv, Detail, Reason). The Size-Balanced approach allocates huge chunks of the budget here because these tasks have many images.
- PreSel drastically reduces the allocation for these tasks.
- Instead, PreSel re-allocates that budget to harder, more information-dense tasks like A-OKVQA and GQA (towards the right side of the chart).
The IRS metric correctly identified that the conversational tasks were redundant and that the model needed more practice on complex visual question answering.
Efficiency and Robustness
The authors also validated the method on the Vision-Flan dataset, which covers a far more diverse set of tasks (191 in total).

In Table 2, PreSel achieves 100.1% relative performance with only 15% of the data. It actually beat the model trained on the full dataset! This confirms that vast amounts of the Vision-Flan dataset are redundant, and PreSel effectively isolates the signal from the noise.
Furthermore, the selected data transfers well to different model architectures.

Table 7 shows that data selected using a small Vicuna-7B model improves performance when training larger models (Vicuna-13B) or different architectures (Llama-3-8B). This means you can perform the cheap selection process on a small model and apply the savings to your massive model training run.
Conclusion and Implications
The “Pre-Instruction Data Selection” paradigm presented in this paper is a significant step toward democratizing LVLM research.
By shifting the selection process to happen before the expensive labeling stage, PreSel addresses the elephant in the room: cost.
- For Researchers: It enables the creation of custom, high-performance datasets on a fraction of the budget.
- For the Industry: It reduces the computational overhead of training and the financial overhead of using APIs like GPT-4 for data generation.
The method demonstrates that we don’t need more data; we need better data. And crucially, we can identify that “better” data by looking at the images and a small sample of instructions, rather than paying to label the entire haystack just to find the needle.
If you are working on Visual Instruction Tuning, implementing a “Filter First, Generate Later” pipeline might be the highest ROI decision you can make for your next project.