Figure: The official K2-THINK logo from the Institute of Foundation Models at MBZUAI.
In the world of artificial intelligence, there’s a common belief: bigger is better. Large Language Models (LLMs) have ballooned to hundreds of billions, or even trillions, of parameters. These colossal systems have achieved astounding feats—but they come with trade-offs: they are expensive to train, hard to deploy, and often inaccessible for most researchers.
But what if a smaller, more agile model could challenge these giants? What if clever engineering could matter more than brute-force scale?
That is the central premise behind a new system from the Institute of Foundation Models at MBZUAI: K2-THINK — a 32-billion parameter reasoning model that achieves performance competitive with, or even surpassing, far larger AI systems. In complex mathematical reasoning, K2-THINK stands out as the leading open-source model.
How did the team pull this off? By crafting a six-pillar recipe that combines advanced post-training methods with strategic computation at inference time. This article breaks down that recipe step-by-step — showing how K2-THINK was built, why it is so effective, and what it means for the future of accessible, high-performance AI.
Figure 1: K2-THINK’s high Math Composite scores despite a relatively small parameter count — challenging the “bigger is better” narrative.
Background: The Building Blocks of a Reasoning System
Before diving into K2-THINK’s core innovations, let’s review some foundational ideas that are key to understanding its design.
Chain-of-Thought (CoT) Reasoning: Much like how humans solve problems step-by-step, CoT prompting trains models to “think out loud,” generating intermediate reasoning before giving the answer. This boosts performance in math, logic, and code.
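To make this concrete, here is a minimal sketch of the difference between a direct prompt and a chain-of-thought prompt. The instruction wording is illustrative only, not K2-THINK's actual prompt.

```python
# Minimal sketch of chain-of-thought prompting. The instruction wording
# is illustrative, not K2-THINK's actual system prompt.

question = "A train travels 120 km in 1.5 hours. What is its average speed?"

# Direct prompting asks for the answer alone.
direct_prompt = f"{question}\nAnswer:"

# CoT prompting asks the model to show its intermediate reasoning first.
cot_prompt = (
    f"{question}\n"
    "Think step by step, showing your reasoning, then state the final answer."
)

# A CoT-trained model would respond along these lines:
#   Step 1: average speed = distance / time = 120 km / 1.5 h
#   Step 2: 120 / 1.5 = 80
#   Final answer: 80 km/h
print(cot_prompt)
```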
Supervised Fine-Tuning (SFT): Starting from a pretrained “base model,” SFT adapts it using a curated dataset of prompts and high-quality answers (often with explicit chains of thought). The goal here is specialization.
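A key mechanical detail of SFT for reasoning is that the loss is usually computed only on the answer tokens, so the model learns to produce the chain of thought rather than to predict the prompt. Here is a minimal sketch, assuming a toy whitespace tokenizer (real pipelines use the model's own subword tokenizer):

```python
# Sketch of building one SFT training example with the loss masked to the
# answer. SimpleTokenizer is a toy stand-in for a real subword tokenizer.

IGNORE_INDEX = -100  # label value that standard cross-entropy losses skip

class SimpleTokenizer:
    """Toy whitespace tokenizer, for illustration only."""
    def __init__(self):
        self.vocab = {}
    def encode(self, text):
        return [self.vocab.setdefault(t, len(self.vocab)) for t in text.split()]

def build_sft_example(tokenizer, prompt, cot_answer):
    prompt_ids = tokenizer.encode(prompt)
    answer_ids = tokenizer.encode(cot_answer)
    return {
        # The model sees prompt + answer as one sequence...
        "input_ids": prompt_ids + answer_ids,
        # ...but only answer positions contribute to the loss.
        "labels": [IGNORE_INDEX] * len(prompt_ids) + answer_ids,
    }

tok = SimpleTokenizer()
print(build_sft_example(tok, "Solve: 2 + 2 =", "Step 1: 2 + 2 = 4. Final answer: 4"))
```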
Reinforcement Learning (RL): Post-SFT, models can be further refined using feedback signals. K2-THINK uses a variant called Reinforcement Learning with Verifiable Rewards (RLVR) — perfect for domains with objectively checkable answers like math or code.
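What makes a reward "verifiable" is that it comes from a programmatic check rather than a learned reward model. Here is a minimal sketch for math-style outputs, assuming final answers are wrapped in \boxed{...} (a common convention; the actual verifier behind K2-THINK may differ):

```python
import re

def extract_boxed(text):
    """Pull the last \\boxed{...} answer out of a reasoning trace."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def verifiable_reward(model_output, reference_answer):
    """Binary reward: 1.0 if the final answer matches the reference, else 0.0.

    Real verifiers also normalize equivalent expressions (e.g. 1/2 vs 0.5);
    exact string match is a simplification for this sketch.
    """
    predicted = extract_boxed(model_output)
    return 1.0 if predicted == reference_answer.strip() else 0.0

trace = r"Step 1: speed = 120 / 1.5. Step 2: that is 80. \boxed{80}"
print(verifiable_reward(trace, "80"))  # 1.0
```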
Test-Time Computation: You can make a model smarter during inference by giving it more “thinking time” — generating multiple answers, planning solutions, or verifying outputs to improve final accuracy.
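The simplest version of this idea is majority voting (self-consistency): sample several answers and keep the most common one. A minimal sketch, where `generate` is a hypothetical stand-in for any sampling-based LLM call:

```python
from collections import Counter

def majority_vote(generate, prompt, n=8):
    """Sample n answers and return the most frequent one.

    generate(prompt, temperature=...) is a hypothetical stand-in for a
    stochastic LLM call that returns a final-answer string.
    """
    answers = [generate(prompt, temperature=0.8) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```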
The Core Method: K2-THINK’s Six Pillars of Power
K2-THINK transforms the open-weight Qwen2.5-32B base model into a specialized reasoning expert via six integrated pillars — split between post-training and test-time computation.
Pillar 1 — Supervised Fine-Tuning with Long Chains of Thought
The team began by extending the model’s reasoning capabilities using the AM-Thinking-v1-Distilled dataset, which is rich with multi-domain, long-form chain-of-thought examples.
Goals of this phase:
- Teach structured, step-by-step solution writing.
- Instill a clear, standardized output format for reasoning traces.
The resulting checkpoint, K2-THINK-SFT, showed significant gains over the base model.
Figure 2: Left — Pass@1 scores across multiple benchmarks during SFT training. Gains plateau after ~0.5 epoch, showing rapid acquisition of reasoning abilities.
Figure 3: Pass@k on AIME 2024. K2-THINK-SFT (blue) saturates at ~95% by k=16; the base model (red) struggles even with k=128.
Pillar 2 — Reinforcement Learning with Verifiable Rewards (RLVR)
With reasoning skills in place, RLVR further refines correctness. Using the Guru dataset (~92,000 prompts across Math, Code, Science, Logic, Simulation, and Tabular domains), K2-THINK learns to optimize accuracy directly.
Key lessons learned:
- Strong SFT limits RL gains: starting from a powerful SFT model left less room for RL improvement (roughly a 5% gain, versus ~40% when starting RL directly from the base model).
- Context length matters: a multi-stage RL schedule that restricted the initial context window to 16k tokens before expanding to 32k caused a dramatic, lasting drop in performance.
Figure 4: Top — RL from base achieves faster gain than RL from SFT. Bottom — restricting initial context length causes lasting performance drop.
Pillars 3 & 4 — “Plan-Before-You-Think” and Best-of-N Sampling
Reasoning quality isn’t just about training — it’s also about how the model is used at test time.
Figure 5: K2-THINK’s inference pipeline: High-level planning precedes reasoning; multiple solutions are sampled before picking the best.
Pillar 3 — “Plan-Before-You-Think”:
A lightweight planning agent outlines key concepts and steps before the query reaches the reasoning model. This “meta-thinking” guides the reasoning process — like a human making an outline before writing.
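In pipeline form, this is simply two model calls: one to draft a plan, one to solve with the plan in context. A minimal sketch, where `call_llm` is a hypothetical wrapper around any chat model and the prompt wording paraphrases the idea rather than quoting the paper:

```python
# Sketch of the "Plan-Before-You-Think" flow. call_llm is a hypothetical
# wrapper around any chat model; the prompts paraphrase the idea.

PLAN_PROMPT = (
    "Before solving, list the key concepts involved and outline a short, "
    "high-level plan of attack for this problem:\n\n{question}"
)

SOLVE_PROMPT = (
    "Question:\n{question}\n\n"
    "Suggested plan:\n{plan}\n\n"
    "Using the plan as guidance, reason step by step and give a final answer."
)

def plan_before_you_think(call_llm, question):
    # 1) A lightweight planning agent drafts the outline.
    plan = call_llm(PLAN_PROMPT.format(question=question))
    # 2) The reasoning model solves with the plan prepended to the query.
    return call_llm(SOLVE_PROMPT.format(question=question, plan=plan))
```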
Pillar 4 — Best-of-N (BoN) Sampling:
Instead of returning a single answer, K2-THINK generates N=3 candidates, and a separate LLM judge compares them pairwise and returns the strongest. This small extra compute cost yields substantial accuracy gains.
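Here is a minimal sketch of that loop, with `generate` and `judge_prefers_first` as hypothetical wrappers; the single-elimination aggregation below is one simple way to combine pairwise judgments:

```python
def best_of_n(generate, judge_prefers_first, prompt, n=3):
    """Best-of-N: sample n candidates, keep the pairwise-judged winner.

    generate(prompt) and judge_prefers_first(prompt, a, b) are hypothetical
    stand-ins for the reasoning model and the LLM judge.
    """
    candidates = [generate(prompt) for _ in range(n)]
    best = candidates[0]
    for challenger in candidates[1:]:
        # Keep whichever answer the judge prefers head-to-head.
        if not judge_prefers_first(prompt, best, challenger):
            best = challenger
    return best
```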
Pillars 5 & 6 — Blazing-Fast Deployment
K2-THINK’s reasoning chains and BoN sampling demand speed. The authors tackled this in two ways:
Pillar 5 — Speculative Decoding:
A smaller “draft” model generates tokens in batches; the main model verifies them, avoiding slow token-by-token generation.
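A toy sketch of the accept/reject logic follows, with `draft_next` and `target_next` as hypothetical next-token functions. In real systems the main model verifies all k draft tokens in a single batched forward pass (which is where the speedup comes from) and acceptance is probabilistic; the greedy check below only illustrates the control flow:

```python
def speculative_decode(draft_next, target_next, prefix, k=4, max_new=32):
    """Greedy speculative decoding sketch (illustration, not production)."""
    out = list(prefix)
    while len(out) - len(prefix) < max_new:
        # 1) The small draft model cheaply proposes k tokens ahead.
        ctx = list(out)
        proposal = []
        for _ in range(k):
            tok = draft_next(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # 2) The target model keeps the longest agreeing prefix and emits
        #    its own token at the first disagreement.
        for tok in proposal:
            if target_next(out) == tok:
                out.append(tok)               # draft token accepted
            else:
                out.append(target_next(out))  # target's correction
                break
    return out
```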
Pillar 6 — Inference-Optimized Hardware:
Deployment on the Cerebras Wafer-Scale Engine (WSE) keeps all model weights in massive on-chip memory, removing bandwidth bottlenecks. Result: ~2,000 tokens/sec.
Example: at that rate, a 32,000-token proof (typical for math or code reasoning) completes in just 16 seconds.
Experiments and Results: Small Model, Big Impact
K2-THINK was benchmarked against leading proprietary and open-source models.
Benchmark columns group as Math (AIME 2024, AIME 2025, HMMT25, Omni-HARD, Micro-Avg.), Code (LCBv5, SciCode), and Science (GPQA-D, HLE):

| Model | AIME 2024 | AIME 2025 | HMMT25 | Omni-HARD | Micro-Avg. | LCBv5 | SciCode (sub/main) | GPQA-D | HLE |
|---|---|---|---|---|---|---|---|---|---|
| K2-Think | 90.83 | 81.24 | 73.75 | 60.73 | 67.99 | 63.97 | 39.2 / 12.0 | 71.08 | 9.95 |
| GPT-OSS 120B | 89.58 | 84.59 | 81.88 | 57.76 | 67.20 | 74.53 | 38.8 / 11.0 | 77.04 | 18.58 |
| DeepSeek V3.1† | 91.87 | 82.49 | 83.54 | 53.22 | 64.43 | 66.59 | 38.2 / 11.7 | 79.46 | 8.40 |
Table 1 excerpt: K2-THINK leads all open-source models in math micro-average, even surpassing some much larger proprietary systems.
Component Analysis
| Configuration | AIME 2024 | AIME 2025 | HMMT25 | Omni-HARD |
|---|---|---|---|---|
| SFT+RL Checkpoint | 86.26 | 77.72 | 66.46 | 56.74 |
| + Plan only | 85.21 | 81.04 | 71.87 | 58.97 |
| + Bo3 only | 90.77 | 81.22 | 71.16 | 59.47 |
| + Plan + Bo3 (K2-Think) | 90.83 | 81.24 | 73.75 | 60.73 |
Table 2: Bo3 delivers the largest individual gain; combining with planning yields the strongest overall performance.
Planning Reduced Verbosity — Unexpectedly
| Model | AIME 2024 | AIME 2025 | HMMT25 | Omni-HARD | LCBv5 | GPQA-D |
|---|---|---|---|---|---|---|
| SFT+RL Checkpoint | 21,482 | 25,262 | 29,136 | 34,042 | 13,589 | 14,998 |
| K2-Think | 20,040 | 24,266 | 27,030 | 30,050 | 12,166 | 14,680 |

Table 3 excerpt (average response length, in tokens): Planning before reasoning shortened responses by up to ~12%, yielding more concise outputs.
Conclusion: Small Models, Big Ideas
K2-THINK is a blueprint for delivering frontier AI performance without frontier AI scale.
Key takeaways:
- Smart engineering beats brute scale: A carefully enhanced 32B model can rival systems 10× bigger.
- Test-time compute is a huge lever: Techniques like planning and BoN can drive major gains without retraining.
- Efficiency boosts experience: Shorter, clearer answers improve usability and save compute.
Crucially, the team has gone beyond releasing weights and code — deploying K2-THINK as a public API at k2think.ai, inviting the community to interact with a live frontier reasoning system.
K2-THINK shows that the future of AI reasoning may hinge not on endlessly bigger models, but on the synergistic combination of better data, smarter post-training, and clever inference-time strategies — making cutting-edge AI more open, affordable, and accessible to all.