Large Language Models (LLMs) have become exceptionally good at tackling tasks with clear, verifiable answers. Ask a model to solve a math problem or write a piece of code, and it often produces a correct solution by “thinking” through the problem step-by-step. This paradigm, known as deep reasoning, involves investing additional computational effort at inference time to solve complex, logical challenges—a key driver of the gains we’ve seen in areas like math and programming.

But what happens when we step into the subjective, ambiguous world of creative and open-ended tasks? How do you teach a model to “reason” about writing a compelling story, a persuasive essay, or a heartfelt poem—when there’s no single “correct” answer? In creative domains, quality hinges on human judgments: originality, emotional resonance, narrative coherence.

Here’s the crux: the two dominant paradigms for instilling reasoning—Reinforcement Learning (RL) and Instruction Distillation—both break down in creative domains:

  • RL thrives on clear, verifiable rewards. In chess, a win delivers a +1 reward. But for storytelling, we’d need a reward model capable of scoring creativity, which is arguably just as hard as writing the story itself.
  • Instruction distillation requires a stronger “teacher” model (e.g., GPT-4) to produce example reasoning processes and answers. It’s prohibitively expensive to do at scale and inherently capped by the teacher’s capabilities.

This is the bottleneck in creative AI progress. We need a way to generate vast amounts of high-quality reasoning data without depending on costly teachers or subjective reward functions.

To address this, researchers have introduced a powerful new paradigm: REverse-Engineered Reasoning (REER). Instead of building reasoning by trial-and-error, REER works backwards—starting from a high-quality example and asking:

“What logical, step-by-step thinking process could have led to this?”

In this article, we’ll unpack REER, explore how it powers the DeepWriter-8B model, and examine why this “third path” could redefine creative AI.


A diagram comparing traditional “forward” reasoning methods like Reinforcement Learning and Distillation with the new “backward” approach of Reverse-Engineered Reasoning (REER).
Figure 1: Traditional methods attempt to build deep reasoning “forwards,” which is challenging for creative tasks. REER flips the script, working backwards from a good solution to reconstruct the thinking process behind it.

The Problem with Creative Reasoning

When we ask an LLM to generate a story, we want far more than syntactically correct sentences. We’re seeking:

  • Narrative planning
  • Character and plot development
  • The ability to branch into alternatives
  • Self-correction when an idea isn’t working

Here’s the kind of deep thinking we want:

An example of a model’s internal monologue, showing it planning a story, considering alternatives (“Hmm… Alternatively”), and self-correcting (“Wait, that’s a bit too straightforward.”).
Figure 2: Example of structured, human-like reasoning—planning, exploring alternatives, and self-correction.

Achieving this kind of reasoning is difficult:

  • Reinforcement Learning: Needs a reward function, which is trivial in games but almost impossible for creativity.
  • Instruction Distillation: Relies on generating reasoning from an expensive teacher model. High cost + ceiling on creativity = limited scalability.

REER offers a route around both hurdles.


REER: Discovering Reasoning by Working Backwards

The key innovation in REER is simple but radical:
Instead of generating a solution and reasoning from scratch, start with a known good solution—and then synthesize the reasoning process that could have produced it.

REER as a Search Problem

Formally:

  • Query \(x\): e.g., “Write a story about a reluctant hero.”
  • Solution \(y\): A high-quality story from a trusted source.
  • Trajectory \(z\): A sequence of step-by-step reasoning leading from \(x\) to \(y\).

In creative work there’s no “correct” answer, so what makes a trajectory optimal?
The authors use perplexity (PPL) as a proxy: low perplexity means the model finds a text logical, unsurprising, and coherent.
A good trajectory \(z\) makes \(y\) appear maximally probable to the model.

Formally:

\[ z^* = \arg\min_{z\in\mathcal{Z}} \operatorname{PPL}(y|x,z) \]

This becomes a search problem: find \(z^*\)—the reasoning that minimizes perplexity for \(y\)—without a human reward.
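Concretely, perplexity is just the exponentiated average negative log-likelihood of the solution’s tokens. A minimal sketch, assuming a scoring model has already produced per-token log-probabilities of \(y\) conditioned on \((x, z)\) (the function name and the sample numbers below are illustrative, not from the paper):

```python
import math

def perplexity(token_logprobs):
    """PPL(y | x, z): exponentiated average negative log-likelihood
    of y's tokens, conditioned on the query x and trajectory z."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# Hypothetical per-token log-probs of the SAME solution y under two trajectories:
ppl_rough   = perplexity([-2.3, -1.9, -2.6])  # rough initial reasoning
ppl_refined = perplexity([-1.1, -0.8, -1.3])  # refined candidate
assert ppl_refined < ppl_rough  # the refined trajectory "explains" y better
```

Lower perplexity means the trajectory makes \(y\) look more predictable to the model—no human judge or reward model required.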

Because the space of all possible reasoning is huge, REER uses an iterative local search to refine reasoning step-by-step:

A four-panel diagram illustrating the iterative local search algorithm: 1. Initialization, 2. Node Expansion, 3. Node Evaluation and Selection, and 4. Termination.
Figure 3: The search algorithm progressively improves each segment of an initial plan, guided by perplexity.

  1. Initialization: Generate a rough, imperfect reasoning draft \(z^{(0)}\).
  2. Node Expansion: Refine one segment at a time by generating candidate improvements—more detail, reflections, alternatives.
  3. Evaluation & Selection: Measure perplexity of \(y\) given the candidate reasoning. Keep the one with lowest PPL.
  4. Termination: Repeat until hitting a perplexity threshold or the max iteration count.
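The four steps above can be sketched as a greedy loop. This is an illustrative skeleton under assumptions, not the authors’ exact algorithm: `reer_search`, `candidates_fn` (in the paper, an LLM proposing segment rewrites), and `ppl_fn` (the scoring pass) are hypothetical names, and segments are swept round-robin for simplicity.

```python
def reer_search(z0, candidates_fn, ppl_fn, ppl_threshold=1.0, max_iters=100):
    """Iterative local search: refine one reasoning segment at a time,
    keeping any candidate that lowers PPL(y | x, z)."""
    z, best = list(z0), ppl_fn(z0)           # 1. initialization
    for it in range(max_iters):
        if best <= ppl_threshold:            # 4. termination
            break
        i = it % len(z)                      # segment to refine this round
        for cand in candidates_fn(z, i):     # 2. node expansion
            trial = z[:i] + [cand] + z[i + 1:]
            score = ppl_fn(trial)            # 3. evaluation ...
            if score < best:                 #    ... and selection
                z, best = trial, score
    return z, best
```

Because only one segment changes per step, each evaluation is a cheap local comparison—which is what lets the search scale across thousands of (query, solution) pairs.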

This creates detailed (query, reasoning, solution) triples—massive fuel for training reasoning-enabled models.


Creating the DeepWriting-20K Dataset

Using REER, researchers curated DeepWriting-20K:

  • Sources: Reddit’s r/WritingPrompts, public-domain literature via Project Gutenberg (with reverse-engineered prompts), and public instruction datasets.
  • Synthesis: Applied iterative search to generate human-like reasoning, injecting reflection tokens like “Hmm, maybe…”, “Wait, that’s…”, “Let me think…”.
  • Metrics: Figure 4 shows refinement reduced perplexity and lengthened reasoning chains.

Four charts showing that after the iterative search, perplexity (PPL) is lower and token length is longer.
Figure 4: Iterative search improves logic (lower PPL) and expands reasoning detail (longer trajectories).
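The reflection-token injection used during synthesis can be approximated by a simple preprocessing pass. A hedged sketch: the phrase list, function name, injection rate, and seeding are all illustrative assumptions, not the authors’ exact recipe.

```python
import random

# Reflection phrases of the kind cited in the text ("Hmm, maybe...",
# "Wait, that's...", "Let me think..."); the set and rate are illustrative.
REFLECTIONS = ["Hmm, maybe", "Wait, that's", "Let me think,", "Alternatively,"]

def inject_reflections(segments, rate=0.5, seed=0):
    """Prefix a random fraction of reasoning segments with a reflection
    phrase, steering trajectories toward exploratory, self-correcting prose."""
    rng = random.Random(seed)
    out = []
    for seg in segments:
        if rng.random() < rate:
            seg = f"{rng.choice(REFLECTIONS)} {seg}"
        out.append(seg)
    return out
```

Each original segment is preserved verbatim; only an optional reflective prefix is added, so the injected drafts can still be scored against the same solution \(y\).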

After filtering repetitive or low-quality trajectories, the dataset’s topical diversity emerged (Figure 5):

Two pie charts showing the distribution of topics in the DeepWriting-20K dataset. The main chart shows a large slice for “Artistic” writing (48%). The smaller chart shows Creative Writing dominates within “Artistic”.
Figure 5: Distribution of topics. Nearly half are Artistic writing—creative works, essays, scripts—providing rich variety for cultivating creative reasoning.

To balance specialization and general competence, DeepWriting-20K was mixed with public datasets covering math, code, and science.


Experiments: Can a Small Model Think Like a Giant?

DeepWriter-8B was fine-tuned from Qwen3-8B on the mixed dataset. Evaluation focused on:

  • LongBench-Write: Coherence in ultra-long text (>10,000 words).
  • HelloBench: Real-world open-ended queries (QA and creative generation).
  • WritingBench: Six professional/creative domains (academic, finance, law, literature & arts, education, marketing).

Main Results

| Model | Base Model | LB | HB-A | HB-B | WB-A | WB-B | WB-C | WB-D | WB-E | WB-F |
|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4o | – | 83.1 | 83.7 | 87.6 | 74.40 | 73.42 | 74.38 | 77.91 | 75.86 | 78.08 |
| Claude 3.5 | – | 89.3 | 82.9 | 88.3 | 59.05 | 57.68 | 56.32 | 59.36 | 62.00 | 67.70 |
| Claude 3.7 | – | 97.8 | 83.9 | 93.2 | 78.24 | 77.93 | 76.51 | 79.37 | 79.26 | 80.88 |
| LongWriter-8B | Llama3.1-8b | 76.5 | 80.1 | 82.6 | 57.97 | 53.92 | 49.08 | 52.08 | 52.99 | 52.08 |
| DeepWriter-8B | Qwen3-8b | 91.28 | 82.64 | 87.48 | 72.20 | 71.76 | 70.57 | 70.57 | 73.65 | 72.29 |

Table 1: DeepWriter-8B surpasses strong open-source baselines and competes closely with proprietary models. LB = LongBench-Write; HB = HelloBench; WB-A through WB-F = the six WritingBench domains.

Key points:

  1. Strong Open-Source Win: Outperforms LongWriter-8B across all tasks, with a huge edge in WritingBench.
  2. Competitive with Giants: Matches GPT-4o and Claude 3.5 on creative HB-B; beats Claude 3.5 across professional WritingBench tasks.
  3. Long-Form Coherence: Outperforms GPT-4o and Claude 3.5 on LongBench-Write.

Ablation Studies

| Model Configuration | LB | HB-A | HB-B | WB-A | WB-B | WB-C | WB-D | WB-E | WB-F |
|---|---|---|---|---|---|---|---|---|---|
| Full DeepWriter-8B | 91.28 | 82.64 | 87.48 | 72.20 | 71.76 | 70.57 | 70.57 | 73.65 | 72.29 |
| − Synthesis Data | 82.93 | 70.92 | 73.73 | 63.44 | 62.78 | 62.86 | 57.72 | 66.32 | 62.78 |
| − Iterative Search | 83.20 | 81.08 | 84.48 | 66.72 | 68.79 | 67.36 | 65.66 | 69.53 | 70.13 |
| − Reflection Tokens | 86.97 | 82.27 | 82.80 | 71.68 | 69.64 | 70.44 | 62.04 | 69.98 | 71.94 |

Table 2: Impact of removing components—synthesized data is most crucial.

Findings:

  • Synthesized Data: Removing the REER-generated data causes the largest drop, across every benchmark.
  • Iterative Search: Skipping the PPL-guided refinement and training on initial drafts alone also hurts noticeably, confirming the value of the search step.
  • Reflection Tokens: Removing them reduces creative flexibility, with the sharpest decline in Literature & Arts.

Qualitative Analysis: Thinking Better

Researchers scored models on five deep-thinking dimensions. DeepWriter-8B’s overall reasoning profile is shown below:

A radar chart comparing reasoning profiles of five models. DeepWriter-8B’s performance polygon is much larger than the open-source baseline and competitive with proprietary models.
Figure 6: DeepWriter-8B shows strong, balanced reasoning—far beyond the baseline, rivaling GPT-4o.

Human-like Thinking Patterns

Adding “thinking tokens” during synthesis changed model behavior dramatically:

Two bar charts comparing frequency of thinking phrases. With injection, the model uses diverse, natural phrases; without, it relies on formulaic words.
Figure 7: Thinking pattern injection leads to diverse, reflective reasoning vs. formulaic repetition without it.

With injection, the model produces reasoning richer in self-reflection and exploration (“let me think…”, “maybe…”), improving flexibility—especially for creative tasks.


Conclusion and Implications

Teaching LLMs to reason in open-ended, non-verifiable domains is a major frontier. RL and distillation stumble here. REER offers a scalable third path:

  • Works backwards from good solutions
  • Uses perplexity as a proxy for reasoning quality
  • Builds massive datasets of detailed thinking without human rewards or costly teacher queries

The DeepWriting-20K dataset, produced through REER, enabled DeepWriter-8B to achieve performance competitive with, and at times surpassing, world-class proprietary models.

This is a breakthrough: proving that human-like deep reasoning for creative tasks can be cultivated in smaller, open-source models.
By releasing DeepWriting-20K, the authors are not only democratizing access to strong creative reasoning but are also opening the door to rich future research on planning, structured thought, and generation in complex domains.