If you have ever asked ChatGPT or Llama to write a story, you have likely encountered a specific problem. The output is usually coherent; the grammar is perfect, the sequence of events makes sense, and the characters do what they are supposed to do. But it is often… boring. It lacks the spark, the clever twist, or the vivid imagery that makes human writing gripping.

In the field of Natural Language Processing (NLP), this is a known trade-off. We have become very good at coherence (logic and flow), but we are still struggling with creativity (novelty, surprise, and emotional resonance).

A fascinating paper titled “Collective Critics for Creative Story Generation” proposes a solution inspired by how human writers actually work: collaboration. The researchers introduce a framework called CRITICS. Instead of asking one model to write a story from start to finish, they build a digital “writers’ room” where different AI agents take on specific roles—critics, leaders, and evaluators—to iteratively refine the plot and the prose.

In this deep dive, we will explore how CRITICS works, why “personas” are the secret sauce to AI creativity, and how this framework might change the way humans and machines co-write fiction.

The Problem: Coherence vs. Creativity

To understand why CRITICS is necessary, we first need to look at how Long-Form Story Generation has evolved. Early attempts at AI storytelling struggled to keep a story straight for more than a few sentences. Characters would change names, or a dead character would suddenly reappear.

Recent advancements, particularly hierarchical generation (planning before writing), largely solved the coherence issue. Frameworks like the DOC pipeline work by first generating a high-level plan (an outline) and then fleshing out the details. This keeps the logic tight.

However, these frameworks often prioritize safety and predictability. They optimize for the most likely next token, which is, almost by definition, the opposite of surprise. A creative story often requires taking a low-probability path—an unexpected plot twist or a weird, unique metaphor. Existing models tend to produce “vanilla” stories that follow standard tropes without adding anything new.

The CRITICS framework attempts to inject that missing creativity by introducing a revision mechanism. It doesn’t just generate; it critiques, argues, and improves.

The Solution: The CRITICS Framework

The core philosophy of CRITICS is that writing is rewriting. The framework is divided into two distinct stages:

  1. CRPLAN: The Planning Stage. Here, the AI refines the story outline to make the plot more original and the ending more surprising.
  2. CRTEXT: The Writing Stage. Here, the AI refines the actual sentences to improve “Voice” (style) and “Image” (sensory details).

Let’s look at the high-level architecture of this system.

Figure 1: The CRITICS framework architecture, showing the workflow of the CRPLAN and CRTEXT stages.

As shown in Figure 1, the process is cyclical. It doesn’t move in a straight line; it loops. In CRPLAN (top half), the system generates an initial plan, hands it to a group of critics, selects the best feedback, and refines the plan. This happens over multiple rounds. Once the plan is finalized, it moves to CRTEXT (bottom half), where specific lines of dialogue or description are polished using a similar critique-based loop.
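
To ground that loop before we unpack each stage, here is a minimal sketch of the control flow in Python. It is an illustration of the idea, not the authors’ code: `call_llm` is a hypothetical stub for whatever chat-model client you use, and every prompt string is a paraphrase.

```python
# A sketch of the two-stage CRITICS control flow (illustrative, not the
# authors' code). `call_llm` is a hypothetical stub for any chat model.

def call_llm(prompt: str) -> str:
    """Hypothetical stub; wire this to your chat-model client."""
    ...

def crplan(premise: str, rounds: int = 3) -> str:
    """Planning stage: a critique-and-refine loop over the outline."""
    plan = call_llm(f"Write a story outline for this premise:\n{premise}")
    drafts = [plan]
    for _ in range(rounds):
        # Collective critics: one focused critique per creativity criterion.
        critiques = [call_llm(f"Critique this outline for {criterion}:\n{plan}")
                     for criterion in ("an original theme/setting",
                                       "an unusual story structure",
                                       "an unusual ending")]
        # Leader: keep exactly one suggestion to avoid a muddled revision.
        chosen = call_llm("Return verbatim the one critique that best balances "
                          "creativity and logic:\n" + "\n---\n".join(critiques))
        plan = call_llm(f"Revise the outline per this critique:\n{chosen}\n\n{plan}")
        drafts.append(plan)
    # Evaluator: the best round wins, not necessarily the last one.
    return call_llm("Return verbatim the outline that is most creative while "
                    "staying coherent:\n" + "\n===\n".join(drafts))

def crtext(plan: str) -> str:
    """Writing stage: draft the story, then polish its Voice and Image."""
    story = call_llm(f"Write a story that follows this outline:\n{plan}")
    critique = call_llm(f"Critique this story's Voice and Image:\n{story}")
    return call_llm(f"Revise the story per this critique:\n{critique}\n\n{story}")
```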

Stage 1: CRPLAN (Refining the Plot)

The planning stage is where the narrative arc is defined. The goal here is to move away from clichés.

The Collective Critics

In a standard LLM approach, you might ask the model, “Make this story more creative.” The results are usually vague. CRITICS solves this by employing Collective Critics.

Three distinct AI critics review the draft plan based on specific creativity criteria:

  1. Original Theme/Setting: Is the world-building unique?
  2. Unusual Story Structure: Does it play with timelines or perspectives?
  3. Unusual Ending: Is the conclusion surprising yet satisfying?

The researchers found that using multiple criteria results in much richer stories than using a single, generic instruction.
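
Concretely, “multiple criteria” just means each critic receives a different, focused instruction. The wording below is my paraphrase of the three criteria, not the paper’s exact prompt text, and `call_llm` is the same hypothetical stub as before.

```python
# Paraphrased criteria; the paper's exact prompt wording may differ.
CRITERIA = {
    "original_theme": (
        "Judge whether the theme and setting are unique. Flag familiar "
        "world-building and suggest a fresher angle."),
    "unusual_structure": (
        "Judge whether the story plays with timelines or perspectives. If not, "
        "suggest a structural device such as a flashback or a frame story."),
    "unusual_ending": (
        "Judge whether the ending is surprising yet satisfying. If not, "
        "suggest a twist that is quietly set up earlier in the plan."),
}

def call_llm(prompt: str) -> str:
    """Hypothetical stub; wire this to your chat-model client."""
    ...

def collect_critiques(plan: str) -> dict[str, str]:
    """One focused critique per criterion, instead of a vague 'be creative'."""
    return {name: call_llm(f"{instruction}\n\nStory plan:\n{plan}")
            for name, instruction in CRITERIA.items()}
```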

Table 5: How a single-criterion critique differs from a three-criteria critique in story planning.

In Table 5 above, you can see the difference. The Single-Criterion Critique (left) suggests a minor thematic change—focusing on an “ethical dilemma.” It’s fine, but standard.

The Three-Criteria Critique (right), however, completely restructures the narrative. It introduces a flashback structure, revealing the protagonist’s past only after establishing his present success. It adds complexity to the timeline and depth to the character development. This is a structural leap that standard LLMs rarely take on their own.

The Power of Personas

One of the most innovative aspects of CRPLAN is the use of Adaptive Personas. If you ask a generic AI to “critique this story,” it gives generic advice. But what if you asked a “Sociologist” or a “Psychologist”?

In CRITICS, the system analyzes the story draft and dynamically assigns personas to the critics.

  • If the story is about a dystopian future, one critic might become a Sociologist focusing on societal dynamics.
  • Another might be a Futurist checking the plausibility of the tech.
  • A third might be a Psychologist analyzing the protagonist’s emotional state.

Figure 2: Win rates of persona critics versus non-persona critics.

The impact of personas is substantial. As Figure 2 illustrates, critics with personas (left side) consistently outperform those without personas (right side) across three key metrics: Interestingness, Coherence, and Creativity.

Why does this happen? Specificity. A “Psychologist” critic won’t just say “make him sadder.” They might say, “The protagonist’s reaction to the trauma is too immediate; he should display denial first to make the psychological arc realistic.” This specific guidance helps the model generate a better revision.
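
A sketch of how adaptive personas could be wired in: first ask the model which experts fit the draft, then have each critic critique in character. The two-step pattern follows the paper’s description, but the prompts and helper names are my own.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stub; wire this to your chat-model client."""
    ...

def assign_personas(plan: str, n: int = 3) -> list[str]:
    """Step 1: let the model propose expert personas suited to this draft,
    e.g. ["Sociologist", "Futurist", "Psychologist"] for a dystopian premise."""
    raw = call_llm(f"List {n} expert personas, one per line, best suited to "
                   f"critique this story plan:\n{plan}")
    return [line.strip() for line in raw.splitlines() if line.strip()][:n]

def persona_critique(plan: str, persona: str, criterion: str) -> str:
    """Step 2: critique in character; the persona grounds the feedback in a
    specific body of expertise instead of generic writing advice."""
    return call_llm(f"You are a {persona}. From that professional viewpoint, "
                    f"critique this story plan for {criterion}:\n{plan}")
```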

The Leader and The Evaluator

With three critics shouting suggestions, who decides what to change? If you try to incorporate everyone’s feedback at once, the story becomes a mess.

CRITICS introduces a Leader module. The Leader acts like a head writer or editor. It reviews the suggestions from the three critics, selects the one that best balances creativity and logic, and discards the others.

Finally, because this process happens over several rounds (iterations), an Evaluator looks at the plans produced in Round 1, Round 2, and Round 3, and picks the absolute best version to send to the writing stage.
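
Both the Leader and the Evaluator are selection steps, and in practice you would want them to return a machine-readable choice rather than free text. A minimal sketch under that assumption (the numbered-menu prompt format is mine, not the paper’s):

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stub; wire this to your chat-model client."""
    ...

def leader_selects(critiques: list[str]) -> str:
    """Keep exactly one critique; applying all of them at once muddles the plan."""
    menu = "\n".join(f"[{i}] {c}" for i, c in enumerate(critiques))
    reply = call_llm("You are the head writer. Reply with only the number of "
                     f"the critique that best balances creativity and logic:\n{menu}")
    return critiques[int(reply.strip().strip("[]"))]  # naive parse; validate in practice

def evaluator_picks_best(plans_by_round: list[str]) -> str:
    """After all rounds, keep the best plan overall, not necessarily the last."""
    menu = "\n".join(f"[{i}] {p}" for i, p in enumerate(plans_by_round))
    reply = call_llm("Reply with only the number of the plan that is most "
                     f"creative while staying coherent:\n{menu}")
    return plans_by_round[int(reply.strip().strip("[]"))]  # same naive parse
```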

Stage 2: CRTEXT (Polishing the Prose)

Once the plot is solid, the system needs to write the actual story. This is the CRTEXT stage.

While CRPLAN focuses on the macro (plot), CRTEXT focuses on the micro (sentences). The critics in this stage don’t use personas because the goals are more technical. They focus on two metrics derived from creative writing theory:

  1. Image: The vividness of the description. Does it evoke sight, sound, smell, or touch?
  2. Voice: The uniqueness of the writing style. Does it sound like a generic report, or does it have personality?

Table 2: Initial versus refined text, showing enhanced expressiveness.

Look at the example in Table 2.

  • Initial Text: “Jonathan raised an eyebrow.”
  • Refined Text: “Jonathan arched an incredulous eyebrow.”

It is a small change, but it does two things. It uses a more specific verb (“arched” vs. “raised”) and adds an adjective (“incredulous”) that conveys character emotion and intent. When applied across a 2,000-word story, these micro-improvements accumulate to create a much more engaging reading experience.
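
Mechanically, this kind of micro-editing is just another critique loop, run at a finer granularity. Here is one way it could be orchestrated, paragraph by paragraph, against the Image and Voice criteria; the helper names and prompts are again hypothetical.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stub; wire this to your chat-model client."""
    ...

def polish_paragraph(paragraph: str) -> str:
    """Run one Image pass and one Voice pass over a single paragraph."""
    for metric, description in [
        ("Image", "vivid sensory detail: sight, sound, smell, touch"),
        ("Voice", "a distinctive, characterful style rather than report-like prose"),
    ]:
        critique = call_llm(f"Critique this paragraph for {metric} "
                            f"({description}). Point at specific words:\n{paragraph}")
        paragraph = call_llm("Rewrite the paragraph applying this critique, "
                             f"changing as little as possible:\n{critique}\n\n{paragraph}")
    return paragraph

def polish_story(story: str) -> str:
    # Paragraph granularity keeps edits local, like swapping "raised an
    # eyebrow" for "arched an incredulous eyebrow", rather than rewriting
    # the whole piece.
    return "\n\n".join(polish_paragraph(p) for p in story.split("\n\n"))
```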

The Creativity-Coherence Trade-off

One of the most difficult challenges in AI generation is knowing when to stop. If you keep asking an AI to “make it more creative,” it eventually starts hallucinating nonsense.

The researchers analyzed how the number of revision rounds affects the story quality.

Figure 3: The trade-off between creativity and coherence over revision rounds.

Figure 3 reveals a classic optimization curve.

  • Dark Blue Line (Creativity): As the rounds progress (x-axis), creativity generally increases. The story gets wilder and more unique.
  • Light Blue Line (Coherence): However, coherence drops sharply after the first few rounds.

This data tells us that an infinite loop of critiques is dangerous. The sweet spot seems to be around 2 to 3 rounds. Beyond that, the critics might suggest changes that contradict earlier parts of the story, breaking the narrative logic. This is why the Evaluator role in CRPLAN is so critical—it can stop the process and say, “Round 2 was actually better than Round 5, let’s go with that.”
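
One pragmatic way to encode this finding is to score every round on both axes and keep the best one, rather than trusting the final round. The rating prompt and the 50/50 weighting below are my assumptions, not values from the paper:

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stub; wire this to your chat-model client."""
    ...

def score(plan: str, axis: str) -> float:
    """Ask for a 1-10 rating on one axis; real code should validate the reply."""
    return float(call_llm(f"Rate this story plan's {axis} from 1 to 10. "
                          f"Reply with only the number:\n{plan}"))

def pick_round(plans_by_round: list[str], creativity_weight: float = 0.5) -> str:
    """Keep the round with the best weighted creativity/coherence balance,
    which may well be round 2 rather than round 5."""
    def combined(plan: str) -> float:
        return (creativity_weight * score(plan, "creativity")
                + (1 - creativity_weight) * score(plan, "coherence"))
    return max(plans_by_round, key=combined)
```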

Human-Machine Collaboration

While CRITICS is designed to run automatically, one of its most exciting implications is interactive writing. Because the framework is modular (Critics -> Leader -> Revision), a human can step into any of those roles.

  • Human as Critic: You can let the AI generate the plot, but you provide the specific feedback on what to change.
  • Human as Leader: You let the AI generate three different critique options, and you select which path the story should take.

Figure 4: The human-machine interactive writing workflow.

Figure 4 maps out this interaction. The user creates a premise (e.g., “A baby skeleton riding a skateboard”). The system generates a draft. Then, the user can intervene during the “Critic Plan” or “Critic Text” phases.
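
Because each role is a clean function boundary, swapping a human in for the Leader is a small change: print the critiques and read a choice from stdin. A minimal terminal-flavored sketch of that idea (the paper’s actual interface is the web app described next):

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stub; wire this to your chat-model client."""
    ...

def human_as_leader(critiques: list[str]) -> str:
    """The human plays head writer: pick one critique, or type your own."""
    for i, critique in enumerate(critiques):
        print(f"[{i}] {critique}\n")
    choice = input("Pick a critique number, or write your own feedback: ").strip()
    if choice.isdigit() and int(choice) < len(critiques):
        return critiques[int(choice)]
    return choice  # free-form human feedback becomes the chosen critique

def interactive_round(plan: str, critiques: list[str]) -> str:
    """One CRPLAN round with the human standing in for the Leader module."""
    return call_llm("Revise this story plan per the critique:\n"
                    f"{human_as_leader(critiques)}\n\n{plan}")
```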

The researchers built a web application to test this. It allows a user to see the generated text and the AI’s proposed critiques side-by-side.

Figure 5: The web application used for interactive story generation.

In the interface shown in Figure 5, the user can see the critiques on the left and the story draft on the right. They can select specific feedback suggestions or write their own.

Does it work?

The researchers conducted a user experience experiment to see if this “Human-in-the-Loop” method actually helped.

Table 8: Pass rates for edited and accepted stories in the user experiment.

Table 8 shows the results. The “Edited” pass rate is 100%, meaning the system successfully changed the story every time a critique was given. The “Accepted” rate was 83.33%, meaning that in five out of six cases the human users felt the AI’s revision aligned with their vision. That is a high success rate for a creative tool, suggesting that CRITICS is a viable co-writing partner, not just a random text generator.

Key Takeaways and Future Implications

The CRITICS framework represents a significant step forward in making Large Language Models truly “creative.” By breaking the writing process down into planning and drafting, and then subjecting those steps to rigorous, persona-driven critique, the system forces the LLM out of its comfort zone.

Here is what we can learn from this research:

  1. Diversity of Opinion Matters: Just like a real writers’ room, an AI produces better work when it views a problem from multiple angles (Sociologist, Psychologist, Futurist) rather than a single generic viewpoint.
  2. Conflict Drives Quality: The tension between different creativity criteria (Originality vs. Structure) and the need for a “Leader” to resolve them helps produce a balanced narrative.
  3. Specific is Better than General: Instructing an AI to improve “Voice” and “Image” yields better prose than asking it to “write better.”
  4. The Human Role is Changing: As these frameworks evolve, the human role shifts from “writer” to “creative director”—the person who chooses the premise, selects the best critiques, and guides the AI through the trade-off between coherence and creativity.

CRITICS shows us that creativity isn’t necessarily a magical spark; it’s a process of iteration, critique, and refinement—a process that machines are beginning to master.