Multi-modal Large Language Models (MLLMs) are reshaping how we interact with AI. Models like LLaVA can look at an image and hold a conversation about it—combining the seeing ability of computer vision with the reasoning power of large language models (LLMs). They’re like high-performance sports cars: incredible on the track, but they burn through fuel—in this case, computational resources—at a staggering rate.
The main fuel drain? The sheer number of visual tokens. While a text prompt might be dozens of tokens, a single image is often broken into hundreds of them, and high-resolution images or multi-frame videos can explode this count further. This data flood creates a computational bottleneck—slowing inference speed and hogging memory.
A natural fix is to use fewer visual tokens, a process called token compression. Some methods drop or merge tokens without retraining—fast to deploy, but performance often nosedives. More advanced approaches retrain to adapt to fewer tokens, sometimes by adding new modules or altering the model architecture.
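As a minimal illustration (not any specific paper's method), training-free compression can be as simple as randomly keeping a subset of visual tokens before they reach the language model. The tensor shapes and the `keep_ratio` parameter below are assumptions made for the sketch:

```python
import torch

def drop_visual_tokens(visual_tokens: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Training-free compression sketch: randomly keep a fraction of visual tokens.

    visual_tokens: (batch, num_tokens, hidden_dim) features from the vision encoder.
    keep_ratio:    fraction of tokens passed on to the language model.
    """
    batch, num_tokens, _ = visual_tokens.shape
    num_keep = max(1, int(num_tokens * keep_ratio))
    # Sample a random subset of token positions per image (original order preserved).
    scores = torch.rand(batch, num_tokens, device=visual_tokens.device)
    keep_idx = scores.topk(num_keep, dim=1).indices.sort(dim=1).values
    return torch.gather(
        visual_tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, visual_tokens.size(-1))
    )

# Example: keep 40% of 576 visual tokens (i.e., a 60% compression ratio).
tokens = torch.randn(2, 576, 4096)
compressed = drop_visual_tokens(tokens, keep_ratio=0.4)  # -> shape (2, 230, 4096)
```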
But here’s the hidden challenge: compressing tokens aggressively during training is like asking a student to learn calculus without first mastering algebra. It’s too big of a leap, and the student—like the model—gets lost. When an MLLM trained on full tokens is suddenly forced to work with a tiny fraction, its internal representation is thrown off balance. The training process stumbles, often landing in suboptimal solutions.
A new paper, “Efficient Multi-modal Large Language Models via Progressive Consistency Distillation”, proposes a clever fix to this learning difficulty. Instead of a giant leap, it applies a gradual, step-by-step training strategy—proving that in the race to efficiency, the slow-and-steady tortoise can indeed beat the hare.
The Challenge: Navigating a Shifting Landscape
Visualize training as walking across a loss landscape—a hilly terrain where altitude represents error (loss). The goal is to find the lowest valley: the optimum.
When we train without compression, the landscape has a specific shape with its own optimum. Compressing tokens reshapes this terrain—moving the location of the optimum. The more compression, the greater the shift.
Figure 1: Loss landscapes under different compression ratios. Direct jumps from 0% to high compression easily trap the optimizer in poor local minima, but progressive adaptation follows smoother, reachable paths.
Jumping directly from the 0% optimum to the 60% optimum is treacherous. The model can get stuck in a local minimum—a small dip that is not the true valley. Direct training with heavy compression often leads exactly here.
EPIC’s core idea: avoid the giant leap. First adapt to light compression (e.g., 20%), then 40%, and so on. Each step is manageable, guiding the model to high-compression optima without losing its way.
EPIC: A Progressive Training Curriculum
EPIC implements Progressive Consistency Distillation—a training framework for existing MLLMs like LLaVA, with no architectural changes. It combines a unique shared-weight teacher–student setup with two progressive learning strategies.
The Teacher in the Machine
In EPIC, one MLLM—same weights—acts as both teacher and student:
- Student: processes the image with a higher (more aggressive) compression ratio—the harder task.
- Teacher: gets the same image with a slightly lower (easier) compression ratio.
The student learns from two signals:
- Supervised fine-tuning (SFT) loss — match ground truth outputs.
- Distillation loss — align its output probability distribution to the teacher’s, which is more stable.
Figure 3: EPIC’s shared-weight teacher–student setup for consistency distillation, with progressive token-wise and layer-wise strategies.
This “consistency distillation” gives the student a reliable intermediate target—like learning from your future self who’s had an easier practice round.
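Here is a minimal PyTorch sketch of what a shared-weight consistency-distillation step could look like. The `compression_ratio` argument, the batch fields, and the loss weighting are illustrative assumptions, not the paper's actual code:

```python
import torch
import torch.nn.functional as F

def consistency_distillation_step(model, batch, r_student, r_teacher, lam=0.5):
    """One EPIC-style training step (sketch): same weights act as student and teacher."""
    # Student: harder task with more aggressive compression. Gradients flow here.
    student_logits = model(**batch, compression_ratio=r_student).logits

    # Teacher: identical weights, slightly easier (lower) compression, no gradients.
    with torch.no_grad():
        teacher_logits = model(**batch, compression_ratio=r_teacher).logits

    # Supervised fine-tuning loss against ground-truth answer tokens
    # (label shifting and masking details omitted for brevity).
    sft_loss = F.cross_entropy(
        student_logits.flatten(0, 1), batch["labels"].flatten(), ignore_index=-100
    )

    # Consistency distillation loss: align the student's output distribution
    # with the more stable teacher's via KL divergence.
    kd_loss = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )

    return (1.0 - lam) * sft_loss + lam * kd_loss
```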
1. Token Consistency Distillation (TCD)
TCD applies progression in a token-wise manner, building an easy-to-hard curriculum over training:
- Compression starts mild (5–10%), increasing steadily to high (e.g., 90%).
- Teacher–student gap also increases: initially small for close guidance, later larger to challenge the student.
Formally:
\[ r_t^{\mathsf{stu}} \sim \mathcal{U}\left(R_{\min,t}^{\mathsf{stu}}, \; R_{\max,t}^{\mathsf{stu}}\right) \]

\[ r_t^{\mathsf{tea}} = \max\left(0, \; r_t^{\mathsf{stu}} - \Delta_t\right) \]

The total loss combines the SFT loss with a KL-divergence distillation loss:

\[ \mathcal{L}_{\text{total}}(\theta) = (1 - \lambda) \cdot \mathcal{L}_{\text{SFT}}(\theta) + \lambda \cdot \mathcal{L}_{\text{TCD}}(\theta) \]
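To make the curriculum concrete, the sketch below shows one plausible instantiation of the token-wise schedule, with the student's upper compression bound and the student–teacher gap both growing linearly over training. The specific bounds (5% to 90%, gap up to 30%) are assumptions consistent with the description above:

```python
import random

def tcd_ratios(step: int, total_steps: int,
               r_start: float = 0.05, r_end: float = 0.90,
               gap_start: float = 0.05, gap_end: float = 0.30):
    """Token Consistency Distillation schedule (sketch): easy-to-hard curriculum."""
    progress = step / total_steps
    # Upper bound of the student's sampling range ramps up from mild to heavy.
    r_max = r_start + progress * (r_end - r_start)
    # Student ratio is sampled uniformly from the current range.
    r_student = random.uniform(r_start, r_max)
    # Teacher uses a slightly easier (lower) ratio; the gap widens over time.
    delta = gap_start + progress * (gap_end - gap_start)
    r_teacher = max(0.0, r_student - delta)
    return r_student, r_teacher

# Early in training: small ratios, teacher close to the student.
# Late in training: e.g., tcd_ratios(step=9000, total_steps=10000) can give a
# student near 90% compression with a teacher roughly 30 points easier.
```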
2. Layer Consistency Distillation (LCD)
LCD applies progression layer-wise, leveraging the observation that deeper layers depend less on visual tokens:
- Start deep: compress tokens only in final layers—minimal disruption.
- Shift shallow: progressively move compression to earlier layers—harder for the model, but by then it’s prepared.
Layer selection:
\[ \ell_t = \operatorname{Round}\left(L - \beta_t\left(L - \ell_{\min}\right)\right) \quad \text{with} \quad \beta_t = t/T \]

The teacher still uses slightly less compression at the same layer, creating a stable path from easy (deep) to hard (shallow) compression.
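A small sketch of the layer-selection schedule follows; the 32-layer depth and the minimum layer are illustrative assumptions:

```python
def lcd_layer(step: int, total_steps: int, num_layers: int = 32, min_layer: int = 2) -> int:
    """Layer Consistency Distillation schedule (sketch).

    Returns the transformer layer at which visual tokens are compressed:
        ell_t = Round(L - beta_t * (L - ell_min)),  beta_t = t / T
    Early in training compression happens only in deep layers (easy);
    it gradually shifts toward shallow layers (hard).
    """
    beta = step / total_steps
    return round(num_layers - beta * (num_layers - min_layer))

# Example with a 32-layer LLM: start of training -> layer 32 (deepest),
# halfway -> layer 17, end of training -> layer 2 (shallow).
```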
Putting EPIC to the Test
The authors trained LLaVA-v1.5-7B with EPIC—no architecture changes—and ran extensive evaluations.
Performance & Efficiency
Table 1: EPIC performance versus baselines across 10 visual understanding benchmarks. EPIC matches or surpasses full-token LLaVA using far fewer tokens.
With 128 tokens (78% fewer), EPIC matches the original LLaVA’s performance. At 192 tokens, it even outperforms LLaVA on average—showing significant redundancy in visual tokens.
Figure 2: EPIC-trained models achieve high accuracy at low token count, with massive FLOP and KV cache savings.
At 64 tokens, EPIC cuts the KV cache by 88.9% and FLOPs by 83.9%, delivering up to a 1.6× inference speedup:
Table 2: Major reductions in memory use, compute, and latency with EPIC-trained models.
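As a rough sanity check on where a number like 88.9% comes from (an assumption about the accounting, not the paper's exact methodology), the visual-token share of the KV cache shrinks in proportion to the tokens removed; FLOPs also depend on text tokens and attention, so that figure is not a simple ratio:

```python
FULL_VISUAL_TOKENS = 576  # LLaVA-v1.5: 24 x 24 patch grid from the vision encoder
kept = 64

kv_reduction = 1 - kept / FULL_VISUAL_TOKENS
print(f"Visual-token KV cache reduced by {kv_reduction:.1%}")  # -> 88.9%
```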
Ablations: Why Teacher Guidance & Progression Matter
Table 3: Token Consistency Distillation ablation—loss of teacher or schedule reduces performance.
Table 4: Layer Consistency Distillation ablation—same trend as TCD; teacher and progression are both vital.
Findings:
- Removing teacher guidance ("w/o Distillation Loss") hurts stability and output quality.
- Fixing compression from the start ("w/o Progressive Compression") makes training jump into the hardest regime immediately, and performance drops even more.
Generalization Across Compression Strategies
Does EPIC overfit to one token-drop pattern? To test this, the authors trained with the DART compression strategy and then evaluated with Random dropping and FastV at inference time.
Figure 4: EPIC-trained models generalize: training with one compression method yields gains with others.
Result: improvements hold across strategies. EPIC equips models with a principle—robust reasoning under missing visual information—rather than memorizing a pattern.
Is Extreme Compression Necessarily Better?
Many methods push visual tokens down to 1–2. EPIC’s analysis warns: beyond a point, less isn’t more.
Figure 5: Reducing tokens from full count to ~64 yields high ROI. Further drops offer little speed gain but large accuracy loss.
Two Zones:
- High ROI: 576 → 64 tokens — big speed/memory wins, performance stays high.
- Low ROI: Below 64 tokens — GPU compute underutilized, latency dominated by memory access, performance crashes.
Optimal compression balances efficiency and information retention—extreme cuts risk starving the model.
Key Takeaways
EPIC’s impact comes not from altering architectures, but from how we train:
- Training matters: Compression alone isn’t enough—progressive adaptation enables models to handle perturbations gracefully.
- Slow and steady wins: Easy-to-hard curricula avoid poor local minima.
- Self-teaching helps: Shared-weight teacher–student setup stabilizes learning.
- Find the balance: Aim for the High ROI zone; avoid unnecessary extreme compression.
The EPIC framework is versatile and plug-and-play, making powerful MLLMs more practical on limited hardware—all through smart, steady learning. In efficiency races, it reminds us: the best route isn’t always a leap—it’s a series of deliberate, well-paced steps.