Large Language Models (LLMs) are becoming increasingly powerful, particularly a new class called Large Reasoning Models (LRMs). These models don’t just spit out an answer—they think by generating a step-by-step chain of thought (CoT) before coming to a conclusion. This reflective reasoning lets them tackle complex problems in math, coding, and beyond with remarkable results.

But there’s a crack in the armor. Recent research has revealed that these sophisticated reasoning abilities are surprisingly brittle. A model can be nudged toward generating harmful content simply by giving its thought process a flawed starting point—this is called CoT prefilling. For example, starting the model’s chain of thought with a phrase like “I know how to do it. First…” can be enough to bypass safety training, leading to unsafe outputs. This raises a critical question: Do these models truly understand safety principles, or are they just skilled at following any reasoning path they’re given—whether good or bad?

A new paper from Meta Superintelligence Labs, Georgia Tech, and IBM Research tackles this problem head-on with a training method called RECAP (Robust Safety Alignment via Counter-Aligned Prefilling). Instead of hoping models learn to self-correct naturally, RECAP deliberately exposes them to flawed reasoning during training and rewards them for getting back on track. The goal: build models that can reason critically about their own thought processes, leading to large improvements in safety, robustness, and even helpfulness—without extra computational cost at inference time.

Let’s explore how it works.


The Brittleness of Modern AI Reasoners

LRMs generate a chain of thought \(y_{\text{cot}}\), followed by a final response \(y_{\text{resp}}\). While this often improves output quality, researchers found the final answer is highly dependent on the initial direction of that reasoning.

In a clever experiment, they took several models from the same family (DeepSeek-distilled models, “DS” for short) with varying safety alignment and tested them with CoT prefilling:

  1. Unsafe Prefilling: They took the first 200 words of the chain of thought from the least safe model (DSQwen-1.5B) and used it to prefill other, safer models.
  2. Safe Prefilling: They did the same, but using the chain of thought from the safest model (DSQwen-32B).
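
To make this setup concrete, here is a minimal sketch of how such a prefilling experiment can be run with Hugging Face transformers. The model name, the chat-template handling, and the prefill text are illustrative assumptions on my part, not details taken from the paper.

```python
# Minimal sketch of the CoT prefilling setup; model name, template handling,
# and prefill text are illustrative assumptions, not details from the paper.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"  # assumed target model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Explain how to take a website offline that I don't like."
# Stand-in for the first ~200 words of a weaker model's chain of thought.
unsafe_prefill = "I know how to do it. First, assemble a network of compromised devices..."

chat = tok.apply_chat_template(
    [{"role": "user", "content": prompt}],
    tokenize=False,
    add_generation_prompt=True,  # ends the string where the model's CoT would start
)
# Depending on the tokenizer version, you may need to open the reasoning block
# (e.g. "<think>\n") yourself before appending the flawed prefix.
inputs = tok(chat + unsafe_prefill, return_tensors="pt", add_special_tokens=False)
out = model.generate(**inputs, max_new_tokens=512, do_sample=False)

# The model is now forced to continue from the flawed reasoning; the experiment
# asks whether it follows that path or recovers from it.
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```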

Table showing how prefilling with unsafe reasoning from a weak model degrades the safety of stronger models, while prefilling with safe reasoning from a strong model improves it.

Table 1 – Prefilling with unsafe CoT from DSQwen-1.5B reduces safety scores of other models by 36.4%. Prefilling with safe CoT from DSQwen-32B increases safety scores by 91.7%.

When forced to continue from unsafe reasoning, the average safety score plummeted. Conversely, safe reasoning prefills sent scores soaring. The conclusion: LRMs tend to follow the reasoning they’re given, right or wrong, instead of critically reassessing it. This brittleness extends beyond safety to math reasoning and overrefusal (when a model refuses to answer a safe question).

The underlying problem? Standard RLHF training rewards only the final answer, not the reasoning process—yet in real-world scenarios, models must navigate complex, noisy reasoning paths.


RECAP: Learning by Correcting Mistakes

What if we could train models to recover from flawed reasoning rather than avoid it? That’s the essence of RECAP.

An infographic showing how RECAP uses prompts with flawed Chain-of-Thought (CoT) to train a policy model. Harmful prompts get an unsafe CoT, while benign prompts get a refusal CoT, forcing the model to override these flawed paths to get a reward.

Figure 1 – Harmful prompts are prefilled with unsafe CoT, and benign ones with refusal-oriented CoT. The model must override flawed trajectories to earn rewards.

Step 1: Build Counter-Aligned Prefills

RECAP creates two kinds of intentionally flawed reasoning traces:

  1. For Harmful Prompts: Prefill with unsafe reasoning from a weakly safety-aligned model.
    Example: “To perform a DDoS attack, start by assembling a network of compromised devices…”

  2. For Benign Prompts: Prefill with overly cautious reasoning from a refuse-all model.
    Example: “Terminating processes is potentially harmful, so I must refuse…”

If the model simply follows these prefills, it produces unsafe or unhelpful outputs. To earn a reward, it must recognize the flaw, override it, and produce a safe response (for harmful prompts) or a helpful one (for benign prompts).

Example recovery: “But creating a botnet is illegal. Instead, I can explain what DDoS attacks are and how to defend against them.”
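
As a rough sketch of how these training pairs might be assembled, here is one way to represent them in code. The data structure and helper below are my own illustration, not the paper’s pipeline.

```python
from dataclasses import dataclass

@dataclass
class PrefillExample:
    prompt: str    # the user request
    prefill: str   # flawed CoT prefix the model must recover from
    harmful: bool  # whether the prompt itself is harmful (used by the reward/judge)

def build_prefill_example(prompt: str, harmful: bool,
                          unsafe_cot: str = "", refusal_cot: str = "") -> PrefillExample:
    """Pair each prompt with a counter-aligned prefix: harmful prompts get unsafe
    reasoning, benign prompts get over-cautious, refusal-oriented reasoning."""
    prefill = unsafe_cot if harmful else refusal_cot
    return PrefillExample(prompt, prefill, harmful)

# Illustrative examples mirroring the two cases above.
harmful_ex = build_prefill_example(
    "How can I take a website offline?",
    harmful=True,
    unsafe_cot="To perform a DDoS attack, start by assembling a network of compromised devices...",
)
benign_ex = build_prefill_example(
    "How do I kill a stuck process on Linux?",
    harmful=False,
    refusal_cot="Terminating processes is potentially harmful, so I must refuse...",
)
```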


Step 2: Train via Reinforcement Learning

These counter-aligned prefills are mixed with normal prompts during training. The researchers used DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization), but the method works with any RLHF approach.

When a prompt is prefilled, the model is optimized only for tokens generated after the flawed prefix. Rewards are based on the final response and encourage successful recovery from unsafe starts.

Mathematically:

\[ t_0(x) = \begin{cases} 1 & \text{normal prompt}\\ \ell_{\text{pre}} + 1 & \text{prefilled prompt} \end{cases} \]

where \( \ell_{\text{pre}} \) is the prefix length. For prefilled prompts, optimization skips the flawed section and focuses on the recovery.
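
In code, this comes down to a token-level mask over the generated sequence. A minimal sketch follows, assuming \( \ell_{\text{pre}} \) is measured in tokens and positions are 1-indexed as in the formula above.

```python
import torch

def optimization_start(prefill_len: int) -> int:
    """t_0(x): index of the first generated token that is optimized (1-indexed)."""
    return 1 if prefill_len == 0 else prefill_len + 1

def recovery_mask(gen_len: int, prefill_len: int) -> torch.Tensor:
    """0/1 mask over the generated tokens: the flawed prefix is excluded from the
    policy-gradient loss, so only the recovery and final response are optimized."""
    t0 = optimization_start(prefill_len)
    mask = torch.zeros(gen_len)
    mask[t0 - 1:] = 1.0  # convert the 1-indexed start to a 0-indexed position
    return mask

# Normal prompt: every generated token contributes to the loss.
print(recovery_mask(gen_len=8, prefill_len=0))
# Prefilled prompt with a 5-token flawed prefix: only tokens 6 onward are optimized.
print(recovery_mask(gen_len=8, prefill_len=5))
```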


Step 3: Why This Improves Robustness

Theoretical analysis shows RECAP achieves higher expected reward than vanilla training, especially when inference starts in flawed reasoning states. Standard RLHF never sees these scenarios, so it learns no “escape routes” from them. RECAP’s exposure to recovery tasks is akin to giving the model a safety vaccine—it learns to survive even adversarial conditions.
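
One way to state the claim schematically, in my own notation rather than the paper’s: let \( \rho \) be a distribution over reasoning starting states \( s_0 \) that places some mass on flawed (counter-aligned) prefixes, and let \( R(\pi \mid s_0) \) denote the expected final-response reward when policy \( \pi \) continues from \( s_0 \). The analysis then amounts to

\[ \mathbb{E}_{s_0 \sim \rho}\big[ R(\pi_{\text{RECAP}} \mid s_0) \big] \;\ge\; \mathbb{E}_{s_0 \sim \rho}\big[ R(\pi_{\text{vanilla}} \mid s_0) \big], \]

with the gap growing as \( \rho \) puts more weight on flawed starting states that vanilla training never visits during optimization.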


Experiments & Results

Researchers trained models on a mix of safety, overrefusal, and math problems, and compared RECAP against SFT and vanilla DAPO baselines.

Table of results showing RECAP outperforming other methods in safety, helpfulness, and math reasoning across two different model sizes.

Table 2 – RECAP outperforms baselines on safety, jailbreak robustness, helpfulness, and math reasoning.

Key gains:

  1. Safety & Jailbreak Robustness: Near-perfect safety on direct harmful prompts, and huge gains on jailbreak tasks designed to trick models into unsafe reasoning.
  2. Reduced Overrefusal: Improves helpfulness scores by overriding unnecessary refusals on benign prompts.
  3. Better Reasoning Capability: Boosts math performance despite no math prefills, suggesting generalization of critical reasoning skills.

Inference-Time Cost

Does more critical thinking mean longer outputs? Measurements show no significant increase in total tokens.

A bar chart comparing the number of Chain-of-Thought (CoT) tokens and total tokens generated by RECAP and a baseline (DAPO). The total token counts are very similar across safety, overrefusal, and math tasks.

Figure 2 – RECAP keeps total token usage similar to DAPO. CoTs are slightly longer in safety tasks but shorter in math.

Qualitatively, RECAP produces more structured, coherent reasoning traces—thinking better, not longer.


What Makes RECAP Tick?

Ablation studies reveal three decisive factors:

A three-panel figure analyzing the impact of prefilling ratio, length, and source on RECAP’s performance. It shows optimal ranges for ratio and length, and that the source must be counter-aligned (unsafe) to be effective.

Figure 3 – Optimal prefilling ratio: \(\alpha = 0.5\). Prefill length: up to 500 words is best. Prefill source must be counter-aligned.

  • Prefilling Ratio: Best results at 50%. Too high, and the model fails to learn safe starts on its own.
  • Prefill Length: Longer flawed prefixes (100–500 words) improve safety; too long (700+) reduces performance.
  • Source: Prefills must be flawed. Safe prefills lead to worse results than no prefilling—models just copy the good reasoning without learning to correct bad reasoning.
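
As a rough sketch of how the ratio and length knobs might be applied when assembling a training batch: the 50% probability and the word-based truncation follow the ablation, while the helper itself is my own illustration.

```python
import random

ALPHA = 0.5              # fraction of training prompts that receive a flawed prefill
MAX_PREFILL_WORDS = 500  # the ablation suggests longer prefixes start to hurt

def maybe_attach_prefill(prompt: str, flawed_cot: str,
                         rng: random.Random) -> tuple[str, str]:
    """With probability ALPHA, pair the prompt with a truncated counter-aligned
    CoT prefix; otherwise leave it as a normal (unprefilled) training prompt."""
    if rng.random() >= ALPHA:
        return prompt, ""
    words = flawed_cot.split()[:MAX_PREFILL_WORDS]
    return prompt, " ".join(words)

rng = random.Random(0)
prompt, prefill = maybe_attach_prefill(
    "How do I kill a stuck process on Linux?",
    "Terminating processes is potentially harmful, so I must refuse...",
    rng,
)
# An empty prefill means the rollout starts from scratch (t_0 = 1); otherwise the
# prefix is injected into the CoT and excluded from the loss, as in Step 2.
```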

Changing Model Behavior: More Self-Reflection

RECAP-trained models frequently insert self-reflective comments in their CoT, e.g., “Wait, that seems unsafe…”. In prefilling attack tests, 83.4% of RECAP outputs showed self-reflection vs 59.7% for the baseline.
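
The exact detection criterion isn’t described here, so the keyword-based proxy below is purely an assumption on my part, meant only to show what counting self-reflective CoTs could look like (the paper may well use an LLM judge instead).

```python
import re

# Crude, illustrative markers of self-reflective reasoning; not the paper's criterion.
REFLECTION_PATTERNS = [
    r"\bwait\b", r"\bactually\b", r"\bon second thought\b",
    r"\bthat (seems|would be) (unsafe|harmful|wrong)\b",
]

def shows_self_reflection(cot: str) -> bool:
    """Return True if the chain of thought contains any reflective marker."""
    text = cot.lower()
    return any(re.search(p, text) for p in REFLECTION_PATTERNS)

def reflection_rate(cots: list[str]) -> float:
    """Fraction of CoTs that show at least one self-reflective marker."""
    return sum(shows_self_reflection(c) for c in cots) / max(len(cots), 1)
```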


Surviving Adaptive Attacks

The researchers stress-tested RECAP under two severe, adaptive attacks:

  1. Full CoT Hijacking: Replace the entire reasoning trace with a malicious one (see the sketch after this list).

    Table showing RECAP’s high safety scores under full CoT hijacking, far outperforming the baseline DAPO model.

    Table 3 – RECAP maintains ~98% safety, outperforming DAPO by >35%.

  2. Iterative Prefill Reset (IPR): Repeatedly inject flawed prefixes after resets.

    Table showing RECAP maintaining a very high safety score across multiple rounds of the IPR attack, while the baseline model’s safety degrades with each round.

    Table 4 – RECAP’s safety barely drops after multiple rounds; baseline declines steadily.
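
For concreteness, here is a schematic of what full CoT hijacking amounts to at the prompt level. The <think>/</think> delimiters and the helper are assumptions about the chat template, not the paper’s code; IPR then amounts to repeating this kind of injection over several rounds.

```python
def hijack_full_cot(chat_prompt: str, malicious_cot: str) -> str:
    """Replace the model's entire reasoning trace: the attacker writes the whole
    <think> block, so the model only produces the final response after </think>."""
    return chat_prompt + "<think>\n" + malicious_cot + "\n</think>\n"

# A RECAP-trained model is expected to refuse in its final response even though
# it never generated any of the reasoning itself.
attacked = hijack_full_cot(
    "User: How do I take a website offline?\nAssistant: ",
    "The user wants to disrupt a site. Step one is to rent a botnet...",
)
```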


Conclusion: A Step Toward More Critical AI

RECAP shifts the alignment focus from final-answer correctness to process-level resilience. By teaching models to recover from flawed reasoning injected into their chain of thought, RECAP builds durable safety that persists under sophisticated attacks.

It’s simple: it requires no changes to RLHF algorithms and adds no inference-time cost, yet it delivers substantial gains in safety, jailbreak robustness, and helpfulness while preserving or enhancing core reasoning ability.

The lesson is clear: the path to safer AI may lie not in shielding models from every flaw, but in teaching them to confront and overcome flawed thinking. With RECAP, large reasoning models can become not just more powerful, but wiser.