
In the world of artificial intelligence, there’s a common belief: bigger is better. Large Language Models (LLMs) have ballooned to hundreds of billions, or even trillions, of parameters. These colossal systems have achieved astounding feats—but they come with trade-offs: they are expensive to train, hard to deploy, and often inaccessible for most researchers.

But what if a smaller, more agile model could challenge these giants? What if clever engineering could matter more than brute-force scale?

That is the central premise behind a new system from the Institute of Foundation Models at MBZUAI: K2-THINK — a 32-billion parameter reasoning model that achieves performance competitive with, or even surpassing, far larger AI systems. In complex mathematical reasoning, K2-THINK stands out as the leading open-source model.

How did the team pull this off? By crafting a six-pillar recipe that combines advanced post-training methods with strategic computation at inference time. This article breaks down that recipe step-by-step — showing how K2-THINK was built, why it is so effective, and what it means for the future of accessible, high-performance AI.

A scatter plot showing K2-THINK’s remarkable parameter efficiency compared to other open-source and proprietary models on math benchmarks.

Figure 1: K2-THINK’s high Math Composite scores despite a relatively small parameter count — challenging the “bigger is better” narrative.


Background: The Building Blocks of a Reasoning System

Before diving into K2-THINK’s core innovations, let’s review some foundational ideas that are key to understanding its design.

  1. Chain-of-Thought (CoT) Reasoning: Much like how humans solve problems step-by-step, CoT prompting trains models to “think out loud,” generating intermediate reasoning before giving the answer. This boosts performance in math, logic, and code.

  2. Supervised Fine-Tuning (SFT): Starting from a pretrained “base model,” SFT adapts it using a curated dataset of prompts and high-quality answers (often with explicit chains of thought). The goal here is specialization.

  3. Reinforcement Learning (RL): Post-SFT, models can be further refined using feedback signals. K2-THINK uses a variant called Reinforcement Learning with Verifiable Rewards (RLVR) — perfect for domains with objectively checkable answers like math or code.

  4. Test-Time Computation: You can make a model smarter during inference by giving it more “thinking time” — generating multiple answers, planning solutions, or verifying outputs to improve final accuracy (a minimal sketch follows this list).
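To make ideas 1 and 4 concrete, here is a minimal sketch, assuming a generic text-generation endpoint, of chain-of-thought prompting combined with one simple form of test-time compute (majority voting over several sampled reasoning traces). The `generate` function, the prompt wording, and the sampling settings are illustrative placeholders, not K2-THINK's actual pipeline.

```python
# Minimal sketch: chain-of-thought prompting plus extra test-time sampling.
# `generate` is a placeholder for any LLM completion client; the prompt
# wording and sampling settings are illustrative assumptions only.
from collections import Counter

COT_TEMPLATE = (
    "Solve the problem below. Think step by step, then give the final "
    "answer on its own line prefixed with 'Answer:'.\n\nProblem: {problem}"
)

def generate(prompt: str, temperature: float = 0.8) -> str:
    """Placeholder for a call to an LLM inference endpoint."""
    raise NotImplementedError

def extract_answer(completion: str) -> str:
    """Pull the final answer line out of a chain-of-thought completion."""
    for line in reversed(completion.splitlines()):
        if line.startswith("Answer:"):
            return line.removeprefix("Answer:").strip()
    return completion.strip()

def solve_with_test_time_compute(problem: str, n_samples: int = 8) -> str:
    """Sample several reasoning traces and return the most common answer."""
    prompt = COT_TEMPLATE.format(problem=problem)
    answers = [extract_answer(generate(prompt)) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```

K2-THINK itself uses the more structured test-time techniques described under Pillars 3 and 4 below.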


The Core Method: K2-THINK’s Six Pillars of Power

K2-THINK transforms the open-weight Qwen2.5-32B base model into a specialized reasoning expert via six integrated pillars — split between post-training and test-time computation.


Pillar 1 — Supervised Fine-Tuning with Long Chains of Thought

The team began by extending the model’s reasoning capabilities using the AM-Thinking-v1-Distilled dataset — rich with multi-domain, long-form chain-of-thought examples.

Goals of this phase:

  1. Teach structured, step-by-step solution writing.
  2. Instill a clear, standardized output format for reasoning traces.

The resulting checkpoint — K2-THINK-SFT — showed significant gains.
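To picture what such training data looks like, here is a hypothetical single SFT record; the field names and the `<think>...</think>` delimiter are assumptions for illustration, not the documented schema of AM-Thinking-v1-Distilled.

```python
# Hypothetical shape of one long chain-of-thought SFT example.
# Field names and the <think>...</think> delimiter are assumptions for
# illustration; they are not taken from the dataset's actual spec.
sft_example = {
    "prompt": "Let a, b be positive reals with a + b = 10. Maximize ab.",
    "response": (
        "<think>\n"
        "By AM-GM, ab <= ((a + b) / 2)^2 = 25, with equality when a = b.\n"
        "Check: a = b = 5 satisfies a + b = 10 and gives ab = 25.\n"
        "</think>\n"
        "The maximum value of ab is 25, attained at a = b = 5."
    ),
}

# During SFT, the loss is typically applied only to the response tokens, so
# the model learns to emit the reasoning trace and the final answer in a
# standardized format.
```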

A line chart showing Pass@1 performance across multiple benchmarks over SFT training epochs.

Figure 2: Pass@1 scores across multiple benchmarks during SFT training. Gains plateau after ~0.5 epoch, showing rapid acquisition of reasoning abilities.

A line chart comparing the pass@k performance of K2-THINK-SFT against the original Qwen2.5-32B base model on the AIME2024 benchmark.

Figure 3: Pass@k on AIME2024. K2-THINK-SFT (blue) saturates at ~95% by k=16; base model (red) struggles even with k=128.


Pillar 2 — Reinforcement Learning with Verifiable Rewards (RLVR)

With reasoning skills in place, RLVR further refines correctness. Using the Guru dataset (~92,000 prompts across Math, Code, Science, Logic, Simulation, and Tabular domains), K2-THINK learns to optimize accuracy directly.
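To illustrate what “verifiable” means in practice, here is a minimal sketch of a rule-based math reward: binary credit for an exact match after light normalization. The extraction and normalization logic is an illustrative assumption, far simpler than a production verifier, and the code, logic, simulation, and tabular domains each need their own checkers.

```python
# Minimal sketch of a verifiable reward for math-style prompts: reward 1.0 if
# the model's final extracted answer matches the reference after light
# normalization, else 0.0. Illustrative assumption, not the authors' verifier.
import re

def extract_final_answer(completion: str) -> str:
    """Take the last \\boxed{...} expression, or fall back to the last line."""
    boxed = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    if boxed:
        return boxed[-1]
    return completion.strip().splitlines()[-1]

def normalize(ans: str) -> str:
    return ans.strip().rstrip(".").replace(" ", "").lower()

def math_reward(completion: str, reference: str) -> float:
    """Binary verifiable reward: exact match after normalization."""
    predicted = normalize(extract_final_answer(completion))
    return 1.0 if predicted == normalize(reference) else 0.0
```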

Key lessons learned:

  • Strong SFT limits RL gains: Starting from a powerful SFT model left less room for RL improvement — ~5% gain vs. 40% when starting RL directly from the base model.
  • Context length matters: Multi-stage RL that restricted the initial context length to 16k tokens before expanding to 32k dramatically hurt performance.

A 2x2 grid of line plots showing ablation studies on RL training, comparing RL from a base vs. SFT model and single-stage vs. multi-stage context length training.

Figure 4: Top — RL from base achieves faster gain than RL from SFT. Bottom — restricting initial context length causes lasting performance drop.


Pillars 3 & 4 — “Plan-Before-You-Think” and Best-of-N Sampling

Reasoning quality isn’t just about training — it’s also about how the model is used at test time.

A flow diagram illustrating the K2-THINK inference process: User Query -> Plan-Before-You-Think -> K2-Think Model -> Best-of-N -> Final Response.

Figure 5: K2-THINK’s inference pipeline: High-level planning precedes reasoning; multiple solutions are sampled before picking the best.

Pillar 3 — “Plan-Before-You-Think”:
A lightweight planning agent outlines key concepts and steps before the query reaches the reasoning model. This “meta-thinking” guides the reasoning process — like a human making an outline before writing.
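A minimal sketch of how such a planning step could be wired in front of the reasoner is shown below, assuming two separate model endpoints and an illustrative planner prompt (not the exact K2-THINK prompts).

```python
# Sketch of "Plan-Before-You-Think": a lightweight planner drafts an outline,
# which is prepended to the query before it reaches the reasoning model.
# `call_planner` and `call_reasoner` are placeholders for two LLM endpoints.

PLANNER_PROMPT = (
    "List the key concepts and a short step-by-step outline for solving the "
    "problem below. Do not solve it.\n\nProblem: {query}"
)

def call_planner(prompt: str) -> str:
    raise NotImplementedError  # small planning model

def call_reasoner(prompt: str) -> str:
    raise NotImplementedError  # K2-THINK-style reasoning model

def plan_then_think(query: str) -> str:
    plan = call_planner(PLANNER_PROMPT.format(query=query))
    reasoning_prompt = f"Plan:\n{plan}\n\nUsing this plan, solve:\n{query}"
    return call_reasoner(reasoning_prompt)
```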

Pillar 4 — Best-of-N (BoN) Sampling:
Instead of one answer, K2-THINK generates N=3 candidates and a separate LLM judges them via pairwise comparison, returning the strongest. This small compute cost yields substantial accuracy gains.
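Below is a hedged sketch of Best-of-3 selection via pairwise judging; the judge prompt and the simple knockout comparison are illustrative assumptions rather than the paper's exact protocol.

```python
# Sketch of Best-of-N (here N=3) with an LLM judge doing pairwise comparisons.
# `call_reasoner` and `call_judge` are placeholder endpoints; the knockout
# loop below is one simple way to reduce N candidates to a single winner.

JUDGE_PROMPT = (
    "Question:\n{query}\n\nResponse A:\n{a}\n\nResponse B:\n{b}\n\n"
    "Which response answers the question more correctly? Reply 'A' or 'B'."
)

def call_reasoner(prompt: str) -> str:
    raise NotImplementedError

def call_judge(prompt: str) -> str:
    raise NotImplementedError

def best_of_n(query: str, n: int = 3) -> str:
    candidates = [call_reasoner(query) for _ in range(n)]
    best = candidates[0]
    for challenger in candidates[1:]:
        verdict = call_judge(
            JUDGE_PROMPT.format(query=query, a=best, b=challenger)
        )
        if verdict.strip().upper().startswith("B"):
            best = challenger
    return best
```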


Pillars 5 & 6 — Blazing-Fast Deployment

K2-THINK’s reasoning chains and BoN sampling demand speed. The authors tackled this in two ways:

Pillar 5 — Speculative Decoding:
A smaller “draft” model generates tokens in batches; the main model verifies them, avoiding slow token-by-token generation.
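The accept/reject loop at the heart of greedy speculative decoding can be sketched as follows; `draft_next_tokens` and `target_argmax_tokens` are hypothetical stand-ins for the draft and main models, and production systems use probabilistic acceptance rather than this simplified exact-match check.

```python
# Simplified greedy speculative decoding: the draft model proposes a block of
# tokens, the target model scores the whole block in one forward pass, and we
# keep the longest prefix the target model agrees with (plus one correction).
# The helper functions are hypothetical stand-ins for the two models.
from typing import List

def draft_next_tokens(context: List[int], k: int) -> List[int]:
    raise NotImplementedError  # k tokens proposed cheaply by the draft model

def target_argmax_tokens(context: List[int], proposal: List[int]) -> List[int]:
    raise NotImplementedError  # target model's greedy choice at each position

def speculative_step(context: List[int], k: int = 4) -> List[int]:
    proposal = draft_next_tokens(context, k)
    verified = target_argmax_tokens(context, proposal)
    accepted: List[int] = []
    for drafted, checked in zip(proposal, verified):
        if drafted != checked:
            accepted.append(checked)  # take the target's token and stop
            break
        accepted.append(drafted)
    return context + accepted  # up to k tokens per target forward pass
```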

Pillar 6 — Inference-Optimized Hardware:
Deployment on the Cerebras Wafer-Scale Engine (WSE) keeps all model weights in massive on-chip memory, removing bandwidth bottlenecks. Result: ~2,000 tokens/sec.

Example: a 32,000-token proof — typical for math or code reasoning — completes in just 16 seconds.


Experiments and Results: Small Model, Big Impact

K2-THINK was benchmarked against leading proprietary and open-source models.

Benchmarks are grouped as Math (AIME 2024, AIME 2025, HMMT25, Omni-HARD, Micro-Avg.), Code (LCBv5, SciCode), and Science (GPQA-D, HLE).

| Models ↓ | AIME 2024 | AIME 2025 | HMMT25 | Omni-HARD | Micro-Avg. | LCBv5 | SciCode (sub/main) | GPQA-D | HLE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| K2-Think | 90.83 | 81.24 | 73.75 | 60.73 | 67.99 | 63.97 | 39.2 / 12.0 | 71.08 | 9.95 |
| GPT-OSS 120B | 89.58 | 84.59 | 81.88 | 57.76 | 67.20 | 74.53 | 38.8 / 11.0 | 77.04 | 18.58 |
| DeepSeek V3.1† | 91.87 | 82.49 | 83.54 | 53.22 | 64.43 | 66.59 | 38.2 / 11.7 | 79.46 | 8.40 |

Table 1 excerpt: K2-THINK leads all open-source models in math micro-average, even surpassing some much larger proprietary systems.

Component Analysis

| Configuration | AIME 2024 | AIME 2025 | HMMT25 | Omni-HARD |
| --- | --- | --- | --- | --- |
| SFT+RL Checkpoint | 86.26 | 77.72 | 66.46 | 56.74 |
| + Plan only | 85.21 | 81.04 | 71.87 | 58.97 |
| + Bo3 only | 90.77 | 81.22 | 71.16 | 59.47 |
| + Plan + Bo3 (K2-Think) | 90.83 | 81.24 | 73.75 | 60.73 |

Table 2: Bo3 delivers the largest individual gain; combining with planning yields the strongest overall performance.


Planning Reduced Verbosity — Unexpectedly

| Model (avg. response length, tokens) | AIME 2024 | AIME 2025 | HMMT25 | Omni-HARD | LCBv5 | GPQA-D |
| --- | --- | --- | --- | --- | --- | --- |
| SFT+RL Checkpoint | 21,482 | 25,262 | 29,136 | 34,042 | 13,589 | 14,998 |
| K2-Think | 20,040 | 24,266 | 27,030 | 30,050 | 12,166 | 14,680 |

Table 3 excerpt: Planning before reasoning shortened responses by up to ~12%, yielding more concise outputs.


Conclusion: Small Models, Big Ideas

K2-THINK is a blueprint for delivering frontier AI performance without frontier AI scale.

Key takeaways:

  1. Smart engineering beats brute scale: A carefully enhanced 32B model can rival systems 10× bigger.
  2. Test-time compute is a huge lever: Techniques like planning and BoN can drive major gains without retraining.
  3. Efficiency boosts experience: Shorter, clearer answers improve usability and save compute.

Crucially, the team has gone beyond releasing weights and code — deploying K2-THINK as a public API at k2think.ai, inviting the community to interact with a live frontier reasoning system.

K2-THINK shows that the future of AI reasoning may hinge not on endlessly bigger models, but on the synergistic combination of better data, smarter post-training, and clever inference-time strategies — making cutting-edge AI more open, affordable, and accessible to all.