In the world of artificial intelligence, there’s a constant arms race. Tech giants are building ever-larger models with hundreds of billions—or even trillions—of parameters, pushing the boundaries of what’s possible. But this relentless pursuit of scale comes at a cost—literally. These colossal models require immense computational power, making them expensive to train and deploy, and often locking them away behind proprietary APIs.
This creates a fundamental tension: how can we achieve state-of-the-art AI reasoning without a state-of-the-art budget? Can a smaller, more accessible model compete with the giants?
A new research paper from SLAM Lab at ServiceNow, Apriel-1.5-15B-Thinker: Mid-training is all you need, offers a compelling answer. The researchers present a 15-billion-parameter multimodal model that punches far above its weight class, achieving performance comparable to much larger systems. Their secret isn’t just more data or more parameters, but a smarter, more deliberate training process. They argue that the mid-training phase—the crucial steps between initial pre-training and final fine-tuning—is the key to unlocking elite reasoning capabilities in a compact package.
In this article, we’ll break down their innovative three-stage methodology, explore the impressive results, and discuss why this work could be a game-changer for making top-tier AI accessible to more people.
The Challenge: Capability vs. Accessibility
Before diving into the solution, let’s set the stage. Most organizations face two major hurdles when trying to adopt cutting-edge AI:
- Infrastructure Constraints: Many need to run models on-premises or in air-gapped environments for privacy and security. This rules out relying on cloud APIs and requires models that can run efficiently on limited hardware—sometimes on a single GPU.
- Cost: The financial investment for training and running massive models is prohibitive for all but the largest companies.
This is where Apriel-1.5-15B-Thinker comes in. It’s an open-weights model designed to deliver frontier-level reasoning while being small enough for practical, single-GPU deployment. The core innovation is a progressive training pipeline demonstrating that how you train a model can matter more than how large it is.
The Three-Stage Recipe for a Compact Genius
The journey to create Apriel-1.5-15B-Thinker starts with an existing open-source model, Pixtral-12B, and transforms it through a carefully orchestrated three-stage process.
Stage 1: Efficiently Scaling the Architecture
Instead of training a new 15B-parameter model from scratch (incredibly expensive), the researchers used depth upscaling:
- Start with a Foundation: They began with Pixtral-12B, which already combines vision and language capabilities using the popular LLaVA architecture—a vision encoder connected to a language decoder via a projection network.
- Add More Layers: To increase reasoning capacity, they expanded the decoder from 40 to 48 hidden layers, giving it deeper “thinking” without changing its fundamental structure. This upscaling used a massive corpus of high-quality text: web content, technical literature, math problem sets, and code.
- Realign the Modalities: After enlarging the language part, the projection network—bridging vision and language—was retrained. The vision encoder and decoder stayed frozen while this connector was trained on multimodal datasets such as image captioning and document comprehension.
This upscaling approach delivered a capable 15B base model at far lower computational cost than starting from zero.
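To make the idea concrete, here is a minimal, hypothetical PyTorch sketch of depth upscaling: growing a 40-block decoder to 48 blocks by duplicating existing layers. This is not the authors' code; in particular, how the new layers are initialized and where they are inserted (the even interleaving below) is an assumption made purely for illustration.

```python
import copy
import torch.nn as nn

def depth_upscale(layers: nn.ModuleList, target_depth: int) -> nn.ModuleList:
    """Grow a decoder from len(layers) to target_depth blocks by duplicating
    existing blocks, so the upscaled network starts close to the original's
    behavior and is then refined with continued training on text data."""
    extra = target_depth - len(layers)
    assert extra >= 0, "target depth must be at least the current depth"
    step = max(1, len(layers) // max(extra, 1))  # spread duplicates evenly (illustrative heuristic)
    new_layers, duplicated = [], 0
    for i, layer in enumerate(layers):
        new_layers.append(layer)
        if duplicated < extra and i % step == step - 1:
            new_layers.append(copy.deepcopy(layer))  # new block initialized from an existing one
            duplicated += 1
    return nn.ModuleList(new_layers)

# Toy usage: 40 identical transformer blocks grown to 48.
blocks = nn.ModuleList([nn.TransformerEncoderLayer(d_model=64, nhead=4) for _ in range(40)])
upscaled = depth_upscale(blocks, target_depth=48)
print(len(upscaled))  # 48
```

The appeal of this style of initialization is that the duplicated blocks already compute something sensible, so continued pre-training has a much better starting point than random initialization would provide.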
Stage 2: The Heart of the Method — Staged Continual Pre-Training (CPT)
Here the “mid-training is all you need” philosophy truly shines. After upscaling, the model undergoes two CPT phases designed to systematically enhance reasoning.
CPT Stage 1: Building a Broad Foundation
This stage strengthens core competencies in both text and vision using a diverse training mix:
- 50% Text-only: Reasoning-heavy domains like math, science, and coding.
- 30% Multimodal: Tasks like document/chart understanding, OCR, long-form image descriptions, and visual math reasoning.
- 20% Replay Data: Samples from the upscaling stage to prevent knowledge forgetting.
During CPT Stage 1, the entire model (vision encoder, projection network, decoder) was trained for holistic multimodal understanding.
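At the data-loader level, a mixture like this can be realized by sampling each example's source according to the stated ratios. The sketch below is purely illustrative (the category names and the per-example sampling scheme are assumptions, not the authors' implementation), but it shows how the 50/30/20 split translates into code:

```python
import random

# CPT Stage 1 mixing ratios as described above; the category names are illustrative.
MIXTURE = {
    "text_reasoning": 0.50,  # math, science, coding
    "multimodal":     0.30,  # documents, charts, OCR, visual math
    "replay":         0.20,  # samples replayed from the upscaling stage
}

def sample_source(rng: random.Random) -> str:
    """Pick the data source for the next training example according to the mixture weights."""
    r, cumulative = rng.random(), 0.0
    for name, weight in MIXTURE.items():
        cumulative += weight
        if r < cumulative:
            return name
    return name  # floating-point edge case: fall back to the last source

# Sanity check: the empirical counts track the target ratios.
rng = random.Random(0)
counts = {name: 0 for name in MIXTURE}
for _ in range(10_000):
    counts[sample_source(rng)] += 1
print(counts)  # roughly 5000 / 3000 / 2000
```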
CPT Stage 2: Sharpening Visual Reasoning with Synthetic Data
The second CPT stage targets advanced visual reasoning. The researchers built a synthetic data generation pipeline from raw images to create task-specific training samples:
- Image Reconstruction: Learn scene priors and part–whole relationships by masking image regions.
- Visual Matching: Improve fine-grained detail recognition by matching image crops to candidates.
- Object Detection: Learn grounding/localization by identifying object presence and location.
- Counting: Count objects in total or by category.
Difficulty was modulated via augmentation strategies. In this stage, the vision encoder was frozen while the projection network and decoder were trained—efficiently honing visual interpretation.
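To give a feel for what one of these synthetic tasks might look like in practice, here is a small sketch that turns per-image object annotations into counting questions. It assumes annotations are available for each raw image (for example, from an off-the-shelf detector), which the paper does not specify; the function names and question templates are invented for illustration.

```python
import random
from collections import Counter
from dataclasses import dataclass

@dataclass
class Annotation:
    label: str    # e.g. "car", "person"
    box: tuple    # (x1, y1, x2, y2) in pixels

def make_counting_sample(image_path: str, annotations: list[Annotation], rng: random.Random) -> dict:
    """Turn object annotations for one image into a counting question.
    Difficulty can be modulated by asking about a single category
    versus the total number of objects."""
    by_label = Counter(a.label for a in annotations)
    if by_label and rng.random() < 0.5:
        label = rng.choice(sorted(by_label))
        question = f"How many instances of '{label}' are visible in the image?"
        answer = str(by_label[label])
    else:
        question = "How many objects are visible in the image in total?"
        answer = str(len(annotations))
    return {"image": image_path, "question": question, "answer": answer}

# Toy usage with made-up annotations.
rng = random.Random(0)
anns = [Annotation("car", (10, 10, 50, 40)),
        Annotation("car", (60, 12, 95, 45)),
        Annotation("person", (100, 20, 120, 80))]
print(make_counting_sample("street.jpg", anns, rng))
```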
Did it work? The team compared downstream SFT performance for models continually pre-trained through CPT Stage 1 only versus through both CPT stages.
CPT Stage 2 delivers consistent, significant gains—nearly +10 points on the vision-dominant MathVerse benchmark—demonstrating the synthetic data strategy’s effectiveness.
Stage 3: Supervised Fine-Tuning (SFT) for Polished Reasoning
With a powerful base from upscaling and CPT, the final step was Supervised Fine-Tuning (SFT), turning the model into a helpful assistant that follows instructions and shows its work.
The focus here: data quality.
- Efficient Annotation: Strong open-source models acted as “annotators.” An ablation showed minimal performance differences between candidates, so they chose the more compute-efficient one.
- Rigorous Filtering: The SFT dataset—millions of instruction–response pairs—underwent strict cleaning: de-duplication, content filtering, LLM-as-Judge verification, rejection sampling, and decontamination from benchmark overlaps (a rough sketch of this filtering chain follows the list below).
- Explicit Reasoning: Every response included reasoning steps (“chain of thought”) before the final answer, teaching the model both what the answer is and how to get there.
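Structurally, the filtering described above is a chain of passes applied to the raw instruction–response pool. The sketch below shows that shape with exact-match de-duplication and decontamination plus a pluggable judge; real pipelines (including, presumably, the authors') rely on fuzzier matching and stronger verification, so treat this only as an outline.

```python
import hashlib
from typing import Callable, Iterable

Example = dict  # {"prompt": str, "response": str}

def dedupe(examples: Iterable[Example]) -> list[Example]:
    """Drop exact duplicates by hashing prompt + response."""
    seen, kept = set(), []
    for ex in examples:
        key = hashlib.sha256((ex["prompt"] + "\n" + ex["response"]).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(ex)
    return kept

def decontaminate(examples: list[Example], benchmark_prompts: set) -> list[Example]:
    """Remove examples whose prompt appears in an evaluation benchmark.
    (Real decontamination typically uses n-gram or fuzzy overlap, not exact match.)"""
    return [ex for ex in examples if ex["prompt"] not in benchmark_prompts]

def judge_filter(examples: list[Example], is_good: Callable[[Example], bool]) -> list[Example]:
    """Keep only examples an LLM-as-Judge (here an arbitrary callable) accepts."""
    return [ex for ex in examples if is_good(ex)]

def clean_sft_data(examples, benchmark_prompts, is_good):
    """Compose the filters: de-duplicate, decontaminate, then judge."""
    return judge_filter(decontaminate(dedupe(examples), benchmark_prompts), is_good)
```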
Training Strategy: They first ran a large SFT for four epochs. Then they ran two smaller, focused SFTs—one on a high-quality subset, the other on longer sequences—and averaged the weights, boosting overall and long-context performance without an expensive full retraining.
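That final merging step is essentially a uniform "model soup": the parameters of checkpoints sharing one architecture are averaged element-wise. Here is a minimal PyTorch sketch of the idea; the checkpoint file names are made up, and the authors' exact merging recipe (for example, any weighting between runs) may differ.

```python
import torch

def average_checkpoints(paths: list[str]) -> dict:
    """Uniformly average the parameters of several fine-tuned checkpoints that
    share the same architecture. Assumes every state-dict entry is a
    floating-point tensor."""
    state_dicts = [torch.load(p, map_location="cpu") for p in paths]
    averaged = {}
    for name in state_dicts[0]:
        stacked = torch.stack([sd[name].float() for sd in state_dicts], dim=0)
        averaged[name] = stacked.mean(dim=0)
    return averaged

# Hypothetical usage: merge the two focused SFT runs described above.
# merged = average_checkpoints(["sft_high_quality.pt", "sft_long_context.pt"])
# model.load_state_dict(merged)
```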
Putting Apriel-1.5-15B-Thinker to the Test
After this meticulous training process, the results confirm that a 15B model can stand shoulder to shoulder with much larger competitors.
Text-Based Reasoning: Topping the Charts
To measure general intelligence, the team used the Artificial Analysis Intelligence Index, a respected third-party metric that aggregates ten benchmarks spanning competitive math (AIME 2025), coding (LiveCodeBench), and graduate-level STEM problems (GPQA Diamond).
Apriel scores 52, tying DeepSeek-R1-0528 and outpacing many open-weight systems. Against all models, including proprietary giants, it remains highly competitive.
Perhaps the most telling visualization: model intelligence vs. size.
Apriel sits in the “most attractive quadrant”: exceptional performance relative to its size—ideal for practical deployment.
Detailed benchmarks highlight strengths in math reasoning (AIME 2025: 87%), instruction following (IF-Bench), and specialized domains (τ²-Bench Telecom).
Vision & Multimodal Reasoning: A Solid Contender
Apriel was evaluated on a wide range of multimodal image benchmarks.
It holds its own against giants like Llama 4 Maverick (400B) and outperforms several larger proprietary models on overall average score.
Apriel excels at visual–text reasoning tasks—document/chart understanding (CharXiv descriptive: 88.2%) and math problems with strong text components (MathVerse Text-Dominant: 76.4%). On pure visual logic tasks (LogicVista) and heavily vision-dominant challenges (MMMU-PRO Vision), performance is solid but leaves room for growth. This pattern aligns with many models: stronger on descriptive/structural tasks than on deeply abstract visual reasoning.
Conclusion: Smarter Training Trumps Sheer Scale
The story of Apriel-1.5-15B-Thinker demonstrates that frontier-level AI reasoning is not reserved for trillion-parameter behemoths. By executing a deliberate, data-centric mid-training pipeline—staged CPT followed by high-quality SFT without RLHF—the team built a model that is both powerful and practical.
Key Takeaways:
- Strategic training matters: A staged CPT plus high-signal SFT can close capability gaps between small and large models.
- Efficiency is achievable: Techniques like depth upscaling and synthetic curriculum design create powerful models without vast compute.
- Accessibility is crucial: By delivering frontier-level performance in a single-GPU, open-source package, this work democratizes AI research and deployment.
While text reasoning is currently Apriel’s strongest suit, its solid multimodal foundation sets the stage for further improvements. This work challenges the “bigger is better” narrative and points toward a future where efficient, open AI models are the norm, not the exception.