Figure: The official K2-THINK logo from the Institute of Foundation Models at MBZUAI.
In the world of artificial intelligence, there’s a common belief: bigger is better. Large Language Models (LLMs) have ballooned to hundreds of billions, or even trillions, of parameters. These colossal systems have achieved astounding feats—but they come with trade-offs: they are expensive to train, hard to deploy, and often inaccessible for most researchers.
But what if a smaller, more agile model could challenge these giants? What if clever engineering could matter more than brute-force scale?
That is the central premise behind a new system from the Institute of Foundation Models at MBZUAI: K2-THINK — a 32-billion parameter reasoning model that achieves performance competitive with, or even surpassing, far larger AI systems. In complex mathematical reasoning, K2-THINK stands out as the leading open-source model.
How did the team pull this off? By crafting a six-pillar recipe that combines advanced post-training methods with strategic computation at inference time. This article breaks down that recipe step-by-step — showing how K2-THINK was built, why it is so effective, and what it means for the future of accessible, high-performance AI.
Figure 1: K2-THINK’s high Math Composite scores despite a relatively small parameter count — challenging the “bigger is better” narrative.
Background: The Building Blocks of a Reasoning System
Before diving into K2-THINK’s core innovations, let’s review some foundational ideas that are key to understanding its design.
Chain-of-Thought (CoT) Reasoning: Much like how humans solve problems step-by-step, CoT prompting trains models to “think out loud,” generating intermediate reasoning before giving the answer. This boosts performance in math, logic, and code.
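To make this concrete, here is a minimal sketch of the difference between a direct prompt and a chain-of-thought prompt. The instruction wording is illustrative only, not K2-THINK's actual prompt.

```python
# Minimal sketch of chain-of-thought prompting. The instruction wording
# is illustrative, not K2-THINK's actual system prompt.

question = "A train travels 120 km in 1.5 hours. What is its average speed?"

# Direct prompting asks for the answer alone.
direct_prompt = f"{question}\nAnswer:"

# CoT prompting asks the model to show its intermediate reasoning first.
cot_prompt = (
    f"{question}\n"
    "Think step by step, showing your reasoning, then state the final answer."
)

# A CoT-trained model would respond along these lines:
#   Step 1: average speed = distance / time = 120 km / 1.5 h
#   Step 2: 120 / 1.5 = 80
#   Final answer: 80 km/h
print(cot_prompt)
```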
Supervised Fine-Tuning (SFT): Starting from a pretrained “base model,” SFT adapts it using a curated dataset of prompts and high-quality answers (often with explicit chains of thought). The goal here is specialization.
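A key mechanical detail of SFT for reasoning is that the loss is usually computed only on the answer tokens, so the model learns to produce the chain of thought rather than to predict the prompt. Here is a minimal sketch, assuming a toy whitespace tokenizer (real pipelines use the model's own subword tokenizer):

```python
# Sketch of building one SFT training example with the loss masked to the
# answer. SimpleTokenizer is a toy stand-in for a real subword tokenizer.

IGNORE_INDEX = -100  # label value that standard cross-entropy losses skip

class SimpleTokenizer:
    """Toy whitespace tokenizer, for illustration only."""
    def __init__(self):
        self.vocab = {}
    def encode(self, text):
        return [self.vocab.setdefault(t, len(self.vocab)) for t in text.split()]

def build_sft_example(tokenizer, prompt, cot_answer):
    prompt_ids = tokenizer.encode(prompt)
    answer_ids = tokenizer.encode(cot_answer)
    return {
        # The model sees prompt + answer as one sequence...
        "input_ids": prompt_ids + answer_ids,
        # ...but only answer positions contribute to the loss.
        "labels": [IGNORE_INDEX] * len(prompt_ids) + answer_ids,
    }

tok = SimpleTokenizer()
print(build_sft_example(tok, "Solve: 2 + 2 =", "Step 1: 2 + 2 = 4. Final answer: 4"))
```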
Reinforcement Learning (RL): Post-SFT, models can be further refined using feedback signals. K2-THINK uses a variant called Reinforcement Learning with Verifiable Rewards (RLVR) — perfect for domains with objectively checkable answers like math or code.
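What makes a reward "verifiable" is that it comes from a programmatic check rather than a learned reward model. Here is a minimal sketch for math-style outputs, assuming final answers are wrapped in \boxed{...} (a common convention; the actual verifier behind K2-THINK may differ):

```python
import re

def extract_boxed(text):
    """Pull the last \\boxed{...} answer out of a reasoning trace."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def verifiable_reward(model_output, reference_answer):
    """Binary reward: 1.0 if the final answer matches the reference, else 0.0.

    Real verifiers also normalize equivalent expressions (e.g. 1/2 vs 0.5);
    exact string match is a simplification for this sketch.
    """
    predicted = extract_boxed(model_output)
    return 1.0 if predicted == reference_answer.strip() else 0.0

trace = r"Step 1: speed = 120 / 1.5. Step 2: that is 80. \boxed{80}"
print(verifiable_reward(trace, "80"))  # 1.0
```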
Test-Time Computation: You can make a model smarter during inference by giving it more “thinking time” — generating multiple answers, planning solutions, or verifying outputs to improve final accuracy.
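The simplest version of this idea is majority voting (self-consistency): sample several answers and keep the most common one. A minimal sketch, where `generate` is a hypothetical stand-in for any sampling-based LLM call:

```python
from collections import Counter

def majority_vote(generate, prompt, n=8):
    """Sample n answers and return the most frequent one.

    generate(prompt, temperature=...) is a hypothetical stand-in for a
    stochastic LLM call that returns a final-answer string.
    """
    answers = [generate(prompt, temperature=0.8) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```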
The Core Method: K2-THINK’s Six Pillars of Power
K2-THINK transforms the open-weight Qwen2.5-32B base model into a specialized reasoning expert via six integrated pillars — split between post-training and test-time computation.
Pillar 1 — Supervised Fine-Tuning with Long Chains of Thought
The team began by extending the model’s reasoning capabilities using the AM-Thinking-v1-Distilled dataset, which is rich with multi-domain, long-form chain-of-thought examples.
Goals of this phase:
- Teach structured, step-by-step solution writing.
- Instill a clear, standardized output format for reasoning traces.
The resulting checkpoint, K2-THINK-SFT, showed significant gains over the base model.
Figure 2: Left — Pass@1 scores across multiple benchmarks during SFT training. Gains plateau after ~0.5 epoch, showing rapid acquisition of reasoning abilities.
Figure 3: Pass@k on AIME 2024. K2-THINK-SFT (blue) saturates at ~95% by k=16; the base model (red) struggles even with k=128.
Pillar 2 — Reinforcement Learning with Verifiable Rewards (RLVR)
With reasoning skills in place, RLVR further refines correctness. Using the Guru dataset (~92,000 prompts across Math, Code, Science, Logic, Simulation, and Tabular domains), K2-THINK learns to optimize accuracy directly.
Key lessons learned:
- Strong SFT limits RL gains: starting from a powerful SFT model left less room for RL improvement (roughly a 5% gain, versus ~40% when starting RL directly from the base model).
- Context length matters: a multi-stage RL schedule that restricted the initial context window to 16k tokens before expanding to 32k caused a dramatic, lasting drop in performance.
Figure 4: Top — RL from base achieves faster gain than RL from SFT. Bottom — restricting initial context length causes lasting performance drop.
Pillars 3 & 4 — “Plan-Before-You-Think” and Best-of-N Sampling
Reasoning quality isn’t just about training — it’s also about how the model is used at test time.
Figure 5: K2-THINK’s inference pipeline: High-level planning precedes reasoning; multiple solutions are sampled before picking the best.
Pillar 3 — “Plan-Before-You-Think”:
A lightweight planning agent outlines key concepts and steps before the query reaches the reasoning model. This “meta-thinking” guides the reasoning process — like a human making an outline before writing.
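In pipeline form, this is simply two model calls: one to draft a plan, one to solve with the plan in context. A minimal sketch, where `call_llm` is a hypothetical wrapper around any chat model and the prompt wording paraphrases the idea rather than quoting the paper:

```python
# Sketch of the "Plan-Before-You-Think" flow. call_llm is a hypothetical
# wrapper around any chat model; the prompts paraphrase the idea.

PLAN_PROMPT = (
    "Before solving, list the key concepts involved and outline a short, "
    "high-level plan of attack for this problem:\n\n{question}"
)

SOLVE_PROMPT = (
    "Question:\n{question}\n\n"
    "Suggested plan:\n{plan}\n\n"
    "Using the plan as guidance, reason step by step and give a final answer."
)

def plan_before_you_think(call_llm, question):
    # 1) A lightweight planning agent drafts the outline.
    plan = call_llm(PLAN_PROMPT.format(question=question))
    # 2) The reasoning model solves with the plan prepended to the query.
    return call_llm(SOLVE_PROMPT.format(question=question, plan=plan))
```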
Pillar 4 — Best-of-N (BoN) Sampling:
Instead of returning a single answer, K2-THINK generates N=3 candidates, and a separate LLM judge compares them pairwise and returns the strongest. This small extra compute cost yields substantial accuracy gains.
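Here is a minimal sketch of that loop, with `generate` and `judge_prefers_first` as hypothetical wrappers; the single-elimination aggregation below is one simple way to combine pairwise judgments:

```python
def best_of_n(generate, judge_prefers_first, prompt, n=3):
    """Best-of-N: sample n candidates, keep the pairwise-judged winner.

    generate(prompt) and judge_prefers_first(prompt, a, b) are hypothetical
    stand-ins for the reasoning model and the LLM judge.
    """
    candidates = [generate(prompt) for _ in range(n)]
    best = candidates[0]
    for challenger in candidates[1:]:
        # Keep whichever answer the judge prefers head-to-head.
        if not judge_prefers_first(prompt, best, challenger):
            best = challenger
    return best
```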
Pillars 5 & 6 — Blazing-Fast Deployment
K2-THINK’s reasoning chains and BoN sampling demand speed. The authors tackled this in two ways:
Pillar 5 — Speculative Decoding:
A smaller “draft” model generates tokens in batches; the main model verifies them, avoiding slow token-by-token generation.
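A toy sketch of the accept/reject logic follows, with `draft_next` and `target_next` as hypothetical next-token functions. In real systems the main model verifies all k draft tokens in a single batched forward pass (which is where the speedup comes from) and acceptance is probabilistic; the greedy check below only illustrates the control flow:

```python
def speculative_decode(draft_next, target_next, prefix, k=4, max_new=32):
    """Greedy speculative decoding sketch (illustration, not production)."""
    out = list(prefix)
    while len(out) - len(prefix) < max_new:
        # 1) The small draft model cheaply proposes k tokens ahead.
        ctx = list(out)
        proposal = []
        for _ in range(k):
            tok = draft_next(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # 2) The target model keeps the longest agreeing prefix and emits
        #    its own token at the first disagreement.
        for tok in proposal:
            if target_next(out) == tok:
                out.append(tok)               # draft token accepted
            else:
                out.append(target_next(out))  # target's correction
                break
    return out
```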
Pillar 6 — Inference-Optimized Hardware:
Deployment on the Cerebras Wafer-Scale Engine (WSE) keeps all model weights in massive on-chip memory, removing bandwidth bottlenecks. Result: ~2,000 tokens/sec.
Example: at that rate, a 32,000-token proof (typical for math or code reasoning) completes in just 16 seconds.
Experiments and Results: Small Model, Big Impact
K2-THINK was benchmarked against leading proprietary and open-source models.
Benchmark columns group as Math (AIME 2024, AIME 2025, HMMT25, Omni-HARD, Micro-Avg.), Code (LCBv5, SciCode), and Science (GPQA-D, HLE):

| Model | AIME 2024 | AIME 2025 | HMMT25 | Omni-HARD | Micro-Avg. | LCBv5 | SciCode (sub/main) | GPQA-D | HLE |
|---|---|---|---|---|---|---|---|---|---|
| K2-Think | 90.83 | 81.24 | 73.75 | 60.73 | 67.99 | 63.97 | 39.2 / 12.0 | 71.08 | 9.95 |
| GPT-OSS 120B | 89.58 | 84.59 | 81.88 | 57.76 | 67.20 | 74.53 | 38.8 / 11.0 | 77.04 | 18.58 |
| DeepSeek V3.1† | 91.87 | 82.49 | 83.54 | 53.22 | 64.43 | 66.59 | 38.2 / 11.7 | 79.46 | 8.40 |
Table 1 excerpt: K2-THINK leads all open-source models in math micro-average, even surpassing some much larger proprietary systems.
Component Analysis
| Configuration | AIME 2024 | AIME 2025 | HMMT25 | Omni-HARD |
|---|---|---|---|---|
| SFT+RL Checkpoint | 86.26 | 77.72 | 66.46 | 56.74 |
| + Plan only | 85.21 | 81.04 | 71.87 | 58.97 |
| + Bo3 only | 90.77 | 81.22 | 71.16 | 59.47 |
| + Plan + Bo3 (K2-Think) | 90.83 | 81.24 | 73.75 | 60.73 |
Table 2: Bo3 delivers the largest individual gain; combining with planning yields the strongest overall performance.
Planning Reduced Verbosity — Unexpectedly
| Model | AIME 2024 | AIME 2025 | HMMT25 | Omni-HARD | LCBv5 | GPQA-D |
|---|---|---|---|---|---|---|
| SFT+RL Checkpoint | 21,482 | 25,262 | 29,136 | 34,042 | 13,589 | 14,998 |
| K2-Think | 20,040 | 24,266 | 27,030 | 30,050 | 12,166 | 14,680 |

Table 3 excerpt (average response length, in tokens): Planning before reasoning shortened responses by up to ~12%, yielding more concise outputs.
Conclusion: Small Models, Big Ideas
K2-THINK is a blueprint for delivering frontier AI performance without frontier AI scale.
Key takeaways:
- Smart engineering beats brute scale: A carefully enhanced 32B model can rival systems 10× bigger.
- Test-time compute is a huge lever: Techniques like planning and BoN can drive major gains without retraining.
- Efficiency boosts experience: Shorter, clearer answers improve usability and save compute.
Crucially, the team has gone beyond releasing weights and code — deploying K2-THINK as a public API at k2think.ai, inviting the community to interact with a live frontier reasoning system.
K2-THINK shows that the future of AI reasoning may hinge not on endlessly bigger models, but on the synergistic combination of better data, smarter post-training, and clever inference-time strategies — making cutting-edge AI more open, affordable, and accessible to all.