Large Language Models (LLMs) have become exceptionally good at tackling tasks with clear, verifiable answers. Ask a model to solve a math problem or write a piece of code, and it often produces a correct solution by “thinking” through the problem step by step. This paradigm, known as deep reasoning, invests additional computational effort at inference time to solve complex logical challenges, and it is a major driver of the exceptional gains we’ve seen in areas like math and programming.
But what happens when we step into the subjective, ambiguous world of creative and open-ended tasks? How do you teach a model to “reason” about writing a compelling story, a persuasive essay, or a heartfelt poem—when there’s no single “correct” answer? In creative domains, quality hinges on human judgments: originality, emotional resonance, narrative coherence.
Here’s the crux: the two dominant paradigms for instilling reasoning, Reinforcement Learning (RL) and instruction distillation, both break down in this setting:
- RL thrives on clear, verifiable rewards. In chess, a win delivers a +1 reward. But for storytelling, we’d need a reward model capable of scoring creativity, which is arguably just as hard as writing the story itself.
- Instruction distillation requires a stronger “teacher” model (e.g., GPT-4) to produce example reasoning processes and answers. It’s prohibitively expensive to do at scale and inherently capped by the teacher’s capabilities.
This is the bottleneck in creative AI progress. We need a way to generate vast amounts of high-quality reasoning data without depending on costly teachers or subjective reward functions.
To address this, researchers have introduced a powerful new paradigm: REverse-Engineered Reasoning (REER). Instead of building reasoning by trial-and-error, REER works backwards—starting from a high-quality example and asking:
“What logical, step-by-step thinking process could have led to this?”
In this article, we’ll unpack REER, explore how it powers the DeepWriter-8B model, and examine why this “third path” could redefine creative AI.
Figure 1: Traditional methods attempt to build deep reasoning “forwards,” which is challenging for creative tasks. REER flips the script, working backwards from a good solution to reconstruct the thinking process behind it.
The Problem with Creative Reasoning
When we ask an LLM to generate a story, we want far more than syntactically correct sentences. We’re seeking:
- Narrative planning
- Character and plot development
- The ability to branch into alternatives
- Self-correction when an idea isn’t working
Here’s the kind of deep thinking we want:
Figure 2: Example of structured, human-like reasoning—planning, exploring alternatives, and self-correction.
Achieving this kind of reasoning is difficult:
- Reinforcement Learning: Needs a reward function, which is trivial in games but almost impossible for creativity.
- Instruction Distillation: Relies on generating reasoning from an expensive teacher model. High cost + ceiling on creativity = limited scalability.
REER offers a route around both hurdles.
REER: Discovering Reasoning by Working Backwards
The key innovation in REER is simple but radical:
Instead of generating a solution and reasoning from scratch, start with a known good solution—and then synthesize the reasoning process that could have produced it.
REER as a Search Problem
Formally:
- Query \(x\): e.g., “Write a story about a reluctant hero.”
- Solution \(y\): A high-quality story from a trusted source.
- Trajectory \(z\): A sequence of step-by-step reasoning leading from \(x\) to \(y\).
In creative work there’s no “correct” answer, so what makes a trajectory optimal?
The authors use perplexity (PPL) as a proxy: low perplexity means the model finds a text logical, unsurprising, and coherent.
A good trajectory \(z\) makes \(y\) appear maximally probable to the model.
The objective is:

\[ z^* = \arg\min_{z\in\mathcal{Z}} \operatorname{PPL}(y \mid x, z) \]

This turns reasoning synthesis into a search problem: find \(z^*\), the trajectory that minimizes the perplexity of \(y\), with no human-defined reward required.
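To make this concrete, here is a minimal sketch of how such a conditional perplexity could be computed with an off-the-shelf causal LM. The model choice and the `<think>` prompt template are illustrative assumptions, not the paper’s exact setup:

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative scorer; the paper's exact model and prompt template may differ.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", torch_dtype=torch.bfloat16)
model.eval()

def conditional_ppl(x: str, z: str, y: str) -> float:
    """PPL(y | x, z): perplexity of solution y given query x and reasoning z."""
    prefix = f"{x}\n<think>\n{z}\n</think>\n"  # hypothetical prompt template
    # Caveat: this assumes tokenizing `prefix` yields a prefix of tokenizing
    # `prefix + y`; a robust version would tokenize once and track offsets.
    prefix_len = tokenizer(prefix, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prefix + y, return_tensors="pt").input_ids

    with torch.no_grad():
        logits = model(full_ids).logits

    # Logits at position t predict token t+1, so score y's tokens by pairing
    # logits[prefix_len-1 : -1] with targets full_ids[prefix_len:].
    targets = full_ids[0, prefix_len:]
    log_probs = torch.log_softmax(logits[0, prefix_len - 1 : -1], dim=-1)
    token_lls = log_probs[torch.arange(len(targets)), targets]
    return math.exp(-token_lls.mean().item())
```

Lower values mean the model finds \(y\) more predictable given the reasoning, which is exactly the signal the search exploits.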
Iterative Local Search
Because the space of all possible reasoning trajectories is huge, REER uses an iterative local search to refine the reasoning step by step (a code sketch of the loop follows the list below):
Figure 3: The search algorithm progressively improves each segment of an initial plan, guided by perplexity.
- Initialization: Generate a rough, imperfect reasoning draft \(z^{(0)}\).
- Node Expansion: Refine one segment at a time by generating candidate improvements—more detail, reflections, alternatives.
- Evaluation & Selection: Measure perplexity of \(y\) given the candidate reasoning. Keep the one with lowest PPL.
- Termination: Repeat until hitting a perplexity threshold or the max iteration count.
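Under the assumptions of the previous sketch, the loop itself might look like this. `propose_revisions` stands in for an LLM call that rewrites one segment; it is a hypothetical helper, not an API from the paper:

```python
def reer_local_search(x, y, z0, propose_revisions, max_iters=8, ppl_stop=2.0, k=4):
    """Iterative local search over reasoning trajectories (sketch).

    propose_revisions(x, segment, context) -> list[str] is a hypothetical
    helper that asks an LLM for candidate rewrites of one segment (more
    detail, reflections, alternatives). Reuses conditional_ppl from above.
    """
    segments = z0.split("\n\n")  # treat paragraphs of the draft as search nodes
    best_ppl = conditional_ppl(x, "\n\n".join(segments), y)

    for _ in range(max_iters):
        for i in range(len(segments)):
            # Node expansion: candidate rewrites of the i-th segment.
            for cand in propose_revisions(x, segments[i], segments)[:k]:
                trial = segments[:i] + [cand] + segments[i + 1:]
                ppl = conditional_ppl(x, "\n\n".join(trial), y)
                if ppl < best_ppl:  # greedy selection: keep the lowest-PPL rewrite
                    best_ppl, segments = ppl, trial
        if best_ppl <= ppl_stop:  # termination: perplexity threshold reached
            break
    return "\n\n".join(segments), best_ppl
```

The greedy, segment-at-a-time design keeps each evaluation cheap: every candidate costs one forward pass to score, and only improvements are kept.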
This creates detailed (query, reasoning, solution) triples: massive fuel for training reasoning-enabled models.
Creating the DeepWriting-20K Dataset
Using REER, researchers curated DeepWriting-20K:
- Sources: Reddit’s r/WritingPrompts, public-domain literature via Project Gutenberg (with reverse-engineered prompts), and public instruction datasets.
- Synthesis: Applied iterative search to generate human-like reasoning, injecting reflection tokens like “Hmm, maybe…”, “Wait, that’s…”, “Let me think…” (a toy version of this injection appears after Figure 4).
- Metrics: Figure 4 shows refinement reduced perplexity and lengthened reasoning chains.
Figure 4: Iterative search improves logic (lower PPL) and expands reasoning detail (longer trajectories).
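As a toy illustration of that injection step (the marker list and probability below are made up for this sketch, not taken from the paper):

```python
import random

REFLECTION_MARKERS = ["Hmm, maybe", "Wait, that's", "Let me think", "Actually,"]

def inject_reflection(segment: str, p: float = 0.3) -> str:
    """Sometimes prefix a reasoning segment with a reflective cue (toy sketch).

    In practice the cues are woven in during candidate generation by the LLM;
    this version just prepends a sampled marker with probability p.
    """
    if random.random() < p:
        return f"{random.choice(REFLECTION_MARKERS)}... {segment}"
    return segment
```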
After filtering out repetitive or low-quality trajectories, the resulting dataset spans a diverse range of topics (Figure 5):
Figure 5: Distribution of topics. Nearly half are Artistic writing—creative works, essays, scripts—providing rich variety for cultivating creative reasoning.
To balance specialization and general competence, DeepWriting-20K was mixed with public datasets covering math, code, and science.
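As a rough illustration of this recipe, the mixing and supervised fine-tuning step could be wired up as follows; the file names, the assumption that each row carries a single `text` field, and all hyperparameters are placeholders rather than the paper’s actual configuration:

```python
from datasets import concatenate_datasets, load_dataset
from trl import SFTConfig, SFTTrainer

# Placeholder paths: DeepWriting-20K triples plus general-domain SFT data.
# Each row is assumed to hold a "text" field with prompt + reasoning + solution.
writing = load_dataset("json", data_files="deepwriting_20k.jsonl", split="train")
general = load_dataset("json", data_files="math_code_science.jsonl", split="train")
mixed = concatenate_datasets([writing, general]).shuffle(seed=42)

trainer = SFTTrainer(
    model="Qwen/Qwen3-8B",                 # base model used in the paper
    train_dataset=mixed,
    args=SFTConfig(
        output_dir="deepwriter-8b",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        learning_rate=1e-5,                # illustrative hyperparameters
        num_train_epochs=2,
        bf16=True,
    ),
)
trainer.train()
```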
Experiments: Can a Small Model Think Like a Giant?
DeepWriter-8B was fine-tuned from Qwen3-8B on the mixed dataset. Evaluation focused on:
- LongBench-Write: Ultra-long text (>10k words) coherence.
- HelloBench: Real-world open-ended queries (QA and creative generation).
- WritingBench: Six professional/creative domains (academic, finance, law, literature & arts, education, marketing).
Main Results
| Model | Base Model | LB | HB-A | HB-B | WB-A | WB-B | WB-C | WB-D | WB-E | WB-F |
|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4o | - | 83.1 | 83.7 | 87.6 | 74.40 | 73.42 | 74.38 | 77.91 | 75.86 | 78.08 |
| Claude 3.5 | - | 89.3 | 82.9 | 88.3 | 59.05 | 57.68 | 56.32 | 59.36 | 62.00 | 67.70 |
| Claude 3.7 | - | 97.8 | 83.9 | 93.2 | 78.24 | 77.93 | 76.51 | 79.37 | 79.26 | 80.88 |
| LongWriter-8B | Llama-3.1-8B | 76.5 | 80.1 | 82.6 | 57.97 | 53.92 | 49.08 | 52.08 | 52.99 | 52.08 |
| DeepWriter-8B | Qwen3-8B | 91.28 | 82.64 | 87.48 | 72.20 | 71.76 | 70.57 | 70.57 | 73.65 | 72.29 |
Table 1: DeepWriter-8B surpasses strong open-source baselines and competes closely with proprietary models. (LB = LongBench-Write; HB-A/HB-B = HelloBench subsets; WB-A through WB-F = the six WritingBench domains.)
Key points:
- Strong Open-Source Win: Outperforms LongWriter-8B across all tasks, with a huge edge in WritingBench.
- Competitive with Giants: Matches GPT-4o and Claude 3.5 on creative HB-B; beats Claude 3.5 across professional WritingBench tasks.
- Long-Form Coherence: Outperforms GPT-4o and Claude 3.5 on LongBench-Write.
Ablation Studies
| Model Configuration | LB | HB-A | HB-B | WB-A | WB-B | WB-C | WB-D | WB-E | WB-F |
|---|---|---|---|---|---|---|---|---|---|
| Full DeepWriter-8B | 91.28 | 82.64 | 87.48 | 72.20 | 71.76 | 70.57 | 70.57 | 73.65 | 72.29 |
| w/o Synthesis Data | 82.93 | 70.92 | 73.73 | 63.44 | 62.78 | 62.86 | 57.72 | 66.32 | 62.78 |
| w/o Iterative Search | 83.20 | 81.08 | 84.48 | 66.72 | 68.79 | 67.36 | 65.66 | 69.53 | 70.13 |
| w/o Reflection Tokens | 86.97 | 82.27 | 82.80 | 71.68 | 69.64 | 70.44 | 62.04 | 69.98 | 71.94 |
Table 2: Impact of removing each component; the synthesized data matters most. (Column abbreviations as in Table 1.)
Findings:
- Synthesized Data: Removing REER-generated data devastates performance.
- Refinement Search: Eliminating iterative PPL-guided refinement causes notable drops across the benchmarks, underscoring its importance.
- Reflection Tokens: Removing them reduces creative flexibility, with the largest drop in Literature & Arts.
Qualitative Analysis: Thinking Better
Researchers scored models on five deep-thinking dimensions. DeepWriter-8B’s overall reasoning profile is shown below:
Figure 6: DeepWriter-8B shows strong, balanced reasoning—far beyond the baseline, rivaling GPT-4o.
Human-like Thinking Patterns
Adding “thinking tokens” during synthesis changed model behavior dramatically:
Figure 7: Thinking pattern injection leads to diverse, reflective reasoning vs. formulaic repetition without it.
With injection, the model produces reasoning richer in self-reflection and exploration (“let me think…”, “maybe…”), improving flexibility—especially for creative tasks.
Conclusion and Implications
Teaching LLMs to reason in open-ended, non-verifiable domains is a major frontier. RL and distillation stumble here. REER offers a scalable third path:
- Works backwards from good solutions
- Uses perplexity as a proxy for reasoning quality
- Builds massive datasets of detailed thinking without human rewards or costly teacher queries
The DeepWriting-20K dataset, produced through REER, enabled DeepWriter-8B to achieve performance competitive with, and at times surpassing, world-class proprietary models.
This is a breakthrough: it shows that human-like deep reasoning for creative tasks can be cultivated in smaller, open-source models.
By releasing DeepWriting-20K, the authors are not only democratizing access to strong creative reasoning but are also opening the door to rich future research on planning, structured thought, and generation in complex domains.