Large Language Models (LLMs) are becoming increasingly powerful, particularly a new class called Large Reasoning Models (LRMs). These models don’t just spit out an answer—they think by generating a step-by-step chain of thought (CoT) before coming to a conclusion. This reflective reasoning lets them tackle complex problems in math, coding, and beyond with remarkable results.
But there’s a crack in the armor. Recent research has revealed that these sophisticated reasoning abilities are surprisingly brittle. A model can be nudged toward generating harmful content simply by giving its thought process a flawed starting point—this is called CoT prefilling. For example, starting the model’s chain of thought with a phrase like “I know how to do it. First…” can be enough to bypass safety training, leading to unsafe outputs. This raises a critical question: Do these models truly understand safety principles, or are they just skilled at following any reasoning path they’re given—whether good or bad?
A new paper from Meta Superintelligence Labs, Georgia Tech, and IBM Research tackles this problem head-on with a training method called RECAP — Robust Safety Alignment via Counter-Aligned Prefilling. Instead of hoping models learn to self-correct naturally, RECAP deliberately exposes them to flawed reasoning during training and rewards them for getting back on track. The goal: build models that can reason critically about their own thought processes, leading to large improvements in safety, robustness, and even helpfulness—without extra computational cost at inference time.
Let’s explore how it works.
The Brittleness of Modern AI Reasoners
LRMs generate a chain of thought \(y_{\text{cot}}\), followed by a final response \(y_{\text{resp}}\). While this often improves output quality, researchers found the final answer is highly dependent on the initial direction of that reasoning.
In a clever experiment, they took several models from the same family (DeepSeek-distilled models, “DS” for short) with varying safety alignment and tested them with CoT prefilling:
- Unsafe Prefilling: They took the first 200 words of the chain of thought from the least safe model (DSQwen-1.5B) and used it to prefill other, safer models.
- Safe Prefilling: They did the same, but using the chain of thought from the safest model (DSQwen-32B).
Table 1 – Prefilling with unsafe CoT from DSQwen-1.5B reduces safety scores of other models by 36.4%. Prefilling with safe CoT from DSQwen-32B increases safety scores by 91.7%.
When forced to continue from unsafe reasoning, the average safety score plummeted. Conversely, safe reasoning prefills sent scores soaring. The conclusion: LRMs tend to follow the reasoning they’re given, right or wrong, instead of critically reassessing it. This brittleness extends beyond safety to math reasoning and overrefusal (when a model refuses to answer a safe question).
The underlying problem? Standard RLHF training rewards only the final answer, not the reasoning process—yet in real-world scenarios, models must navigate complex, noisy reasoning paths.
RECAP: Learning by Correcting Mistakes
What if we could train models to recover from flawed reasoning rather than avoid it? That’s the essence of RECAP.
Figure 1 – Harmful prompts are prefilled with unsafe CoT, and benign ones with refusal-oriented CoT. The model must override flawed trajectories to earn rewards.
Step 1: Build Counter-Aligned Prefills
RECAP creates two kinds of intentionally flawed reasoning traces:
For Harmful Prompts: Prefill with unsafe reasoning from a weakly safety-aligned model.
Example: “To perform a DDoS attack, start by assembling a network of compromised devices…”For Benign Prompts: Prefill with overly cautious reasoning from a refuse-all model.
Example: “Terminating processes is potentially harmful, so I must refuse…”
If the model simply follows these prefills, it produces unsafe or unhelpful outputs. To earn a reward, it must recognize the flaw, override it, and produce safe/helpful responses.
Example recovery: “But creating a botnet is illegal. Instead, I can explain what DDoS attacks are and how to defend against them.”
Step 2: Train via Reinforcement Learning
These counter-aligned prefills are mixed with normal prompts in training. The researchers used DAPO (Decouple clip & Dynamic sampling Policy Optimization), but the method works with any RLHF approach.
When a prompt is prefilled, the model is optimized only for tokens generated after the flawed prefix. Rewards are based on the final response and encourage successful recovery from unsafe starts.
Mathematically:
\[ t_0(x) = \begin{cases} 1 & \text{normal prompt}\\ \ell_{\text{pre}} + 1 & \text{prefilled prompt} \end{cases} \]where \( \ell_{\text{pre}} \) is the prefix length. For prefilled prompts, optimization skips the flawed section and focuses on the recovery.
Step 3: Why This Improves Robustness
Theoretical analysis shows RECAP achieves higher expected reward than vanilla training, especially when inference starts in flawed reasoning states. Standard RLHF never sees these scenarios, so it learns no “escape routes” from them. RECAP’s exposure to recovery tasks is akin to giving the model a safety vaccine—it learns to survive even adversarial conditions.
Experiments & Results
Researchers trained models on a mix of safety, overrefusal, and math problems, and compared RECAP against SFT and vanilla DAPO baselines.
Table 2 – RECAP outperforms baselines on safety, jailbreak robustness, helpfulness, and math reasoning.
Key gains:
- Safety & Jailbreak Robustness: Near-perfect safety on direct harmful prompts, and huge gains on jailbreak tasks designed to trick models into unsafe reasoning.
- Reduced Overrefusal: Improves helpfulness scores by overriding unnecessary refusals on benign prompts.
- Better Reasoning Capability: Boosts math performance despite no math prefills, suggesting generalization of critical reasoning skills.
Inference-Time Cost
Does more critical thinking mean longer outputs? Measurements show no significant increase in total tokens.
Figure 2 – RECAP keeps total token usage similar to DAPO. CoTs are slightly longer in safety tasks but shorter in math.
Qualitatively, RECAP produces more structured, coherent reasoning traces—thinking better, not longer.
What Makes RECAP Tick?
Ablation studies reveal three decisive factors:
Figure 3 – Optimal prefilling ratio: \(\alpha = 0.5\). Prefill length: up to 500 words is best. Prefill source must be counter-aligned.
- Prefilling Ratio: Best results at 50%. Too high, and the model fails to learn safe starts on its own.
- Prefill Length: Longer flawed prefixes (100–500 words) improve safety; too long (700+) reduces performance.
- Source: Prefills must be flawed. Safe prefills lead to worse results than no prefilling—models just copy the good reasoning without learning to correct bad reasoning.
Changing Model Behavior: More Self-Reflection
RECAP-trained models frequently insert self-reflective comments in their CoT, e.g., “Wait, that seems unsafe…”. In prefilling attack tests, 83.4% of RECAP outputs showed self-reflection vs 59.7% for the baseline.
Surviving Adaptive Attacks
The researchers stressed RECAP under two severe, informed attacks:
Full CoT Hijacking: Replace the entire reasoning trace with a malicious one.
Table 3 – RECAP maintains ~98% safety, outperforming DAPO by >35%.
Iterative Prefill Reset (IPR): Repeatedly inject flawed prefixes after resets.
Table 4 – RECAP’s safety barely drops after multiple rounds; baseline declines steadily.
Conclusion: A Step Toward More Critical AI
RECAP shifts the alignment focus from final-answer correctness to process-level resilience. By teaching models to recover from their own flawed reasoning, RECAP builds durable safety that persists under sophisticated attacks.
It’s simple: no changes to RLHF algorithms, no inference-time cost—yet it delivers substantial safety, jailbreak robustness, and helpfulness gains, while preserving or enhancing core reasoning ability.
The lesson is clear: the path to safer AI may lie not in shielding models from every flaw, but in teaching them to confront and overcome flawed thinking. With RECAP, large reasoning models can become not just more powerful, but wiser.