In the history of Artificial Intelligence, we mark progress by the fallen champions of humanity. We remember when Deep Blue defeated Garry Kasparov at chess. We remember when AlphaGo stunned Lee Sedol. These were pivotal moments where machines proved they could out-calculate the best human minds in closed systems of logic and strategy.

But art is not a closed system.

For years, we have comforted ourselves with the idea that while machines crunch numbers, humans create meaning. However, the rise of Large Language Models (LLMs) like GPT-4 has brought an uneasy question to the surface: Are we losing the creative frontier, too? We know AI can write competent emails and passable high school essays. We know it can outperform the average human writer.

But can it challenge a master?

A fascinating research paper titled “Pron vs Prompt” takes this question seriously. Instead of comparing AI to crowd-sourced workers or amateur writers, the researchers set up a heavyweight bout: GPT-4 Turbo (the reigning champion of LLMs at the time) versus Patricio Pron, an award-winning novelist considered one of the best of his generation.

This was not a casual test. It was a rigorous, scientifically controlled duel evaluated by literary critics. In this post, we are going to break down how this experiment was designed, what happened when the “perfect” statistical machine met a master of fiction, and what the results tell us about the future of creativity.

The Problem with “Average”

To understand why this paper is so significant, we first need to look at the flaw in previous research. Most studies evaluating AI creativity compare the model’s output to that of “average humans.”

If you compare GPT-4 to a random sample of people asked to write a story, GPT-4 will likely win. It has perfect grammar, a massive vocabulary, and impeccable structure. But this is a low bar. In the world of chess, we didn’t test Deep Blue against a casual park player; we tested it against the Grandmaster.

The authors of this paper argue that to truly assess AI’s creative potential, we must pit the best AI against the best Human.

The Contenders:

  • The Machine: GPT-4 Turbo (gpt-4-0125-preview), set to a temperature of 1 to maximize creativity while maintaining coherence (a minimal call sketch follows this list).
  • The Human: Patricio Pron, a distinguished Spanish-speaking writer, winner of the Alfaguara Award, and recognized by Granta magazine as one of the top young writers in Spanish.
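
For the curious, here is a minimal sketch of what a single generation call with this configuration might look like, assuming the OpenAI Python SDK. The paper does not publish its code, and the title below is a made-up placeholder; only the model id and temperature come from the study.

```python
# A minimal sketch (not the authors' actual pipeline): one synopsis
# generation call with the model and temperature reported in the paper.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4-0125-preview",  # the snapshot used in the study
    temperature=1,               # "maximize creativity while maintaining coherence"
    messages=[{
        "role": "user",
        "content": (
            "Write a creative, appealing synopsis with literary value, "
            "about 600 words, for the movie title: 'A Hypothetical Title'."
        ),
    }],
)
print(response.choices[0].message.content)
```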

The Setup: A Game of Titles and Synopses

The researchers designed a symmetrical experiment to ensure fairness. It wasn’t enough to just say “write a story.” They needed to control the spark of the story—the prompt.

Here is how the contest worked:

  1. Title Generation: Both Patricio Pron and GPT-4 were asked to generate 30 imaginary movie titles each.
  2. The Cross-Writing Task: This is the clever part. Both contenders had to write a 600-word synopsis for each of the 60 titles.
  • Pron wrote synopses for his own 30 titles and GPT-4’s 30 titles.
  • GPT-4 wrote synopses for its own 30 titles and Pron’s 30 titles.
  3. Language Variation: To test linguistic capabilities, GPT-4 performed the task in both Spanish (Pron’s native language) and English. Pron wrote in Spanish.

This resulted in a massive corpus of texts generated under identical constraints. The prompts given to GPT-4 were not complex “prompt engineered” scripts but simple instructions mirroring what was asked of the human author: Write a creative, appealing synopsis with literary value.
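
To make the symmetry concrete, here is a toy sketch of the corpus construction. The titles, the generate_synopsis helper, and the placeholder texts are hypothetical stand-ins; only the structure (60 titles, GPT-4 writing in two languages, Pron writing in Spanish) comes from the study.

```python
# Toy sketch of the symmetric design; names and helper are hypothetical.
def generate_synopsis(title: str, lang: str) -> str:
    """Hypothetical stand-in for a model call like the one sketched above."""
    return f"[{lang} synopsis for '{title}']"

pron_titles = [f"Pron title {i}" for i in range(30)]   # placeholders
gpt4_titles = [f"GPT-4 title {i}" for i in range(30)]  # placeholders

corpus = []
for title in pron_titles + gpt4_titles:       # all 60 titles
    source = "pron" if title in pron_titles else "gpt-4"
    for lang in ("es", "en"):                 # GPT-4 writes in both languages
        corpus.append({"writer": "gpt-4", "language": lang,
                       "title_source": source,
                       "text": generate_synopsis(title, lang)})
    # Pron writes in Spanish only, for every one of the 60 titles.
    corpus.append({"writer": "pron", "language": "es",
                   "title_source": source,
                   "text": "(human-written synopsis)"})

print(len(corpus))  # 120 machine texts + 60 human texts in this toy version
```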

Measuring the Immeasurable: How Do You Score Art?

You cannot calculate the quality of a story the way you calculate a checkmate. To solve this, the researchers turned to Margaret Boden’s definition of creativity.

Boden, a legendary cognitive scientist, argues that for something to be creative, it must be three things:

  1. New: It must be novel.
  2. Surprising: It must defy expectation.
  3. Valuable: It must be attractive or useful to the audience.

The researchers converted this theory into a rubric for their judges, a panel of literary critics and scholars. They didn’t just ask “is this good?” They asked specific questions rated on a scale of 0 to 3 (a small sketch of this rubric follows the list):

  • Attractiveness: Is the style enjoyable? Does the plot engage?
  • Originality: Is the text unique? Does it avoid clichés?
  • Anthology Potential: Would you include this in a published collection?
  • Own Voice: Does the writer have a recognizable style?
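
Here is a small sketch of how such a rubric could be encoded and aggregated. The dimension names and the 0-3 scale come from the study; the judges and their scores are invented for illustration.

```python
from statistics import mean

# Rubric dimensions from the study, each rated by judges on a 0-3 scale.
RUBRIC = ["attractiveness", "originality", "anthology_potential", "own_voice"]

# Hypothetical ratings for one synopsis from three judges.
ratings = [
    {"attractiveness": 2, "originality": 1, "anthology_potential": 1, "own_voice": 2},
    {"attractiveness": 3, "originality": 2, "anthology_potential": 1, "own_voice": 2},
    {"attractiveness": 2, "originality": 2, "anthology_potential": 0, "own_voice": 1},
]

# Average each dimension across judges to get the text's score profile.
profile = {dim: mean(r[dim] for r in ratings) for dim in RUBRIC}
print(profile)
```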

RQ1: The Main Event—Who Won?

The results of the duel were stark. When evaluated by experts, the human master completely dominated the machine.

Below is a heatmap summarizing the scores. Darker blue indicates a higher percentage of texts receiving that score. Look at the difference between the “Patricio Pron” column and the “GPT-4” columns.

Figure 1: Summary of expert assessments for each writer

Interpreting the Data:

  • The 0-1 Trap: Look at the GPT-4 columns (both English and Spanish). The vast majority of its scores cluster around 0 and 1. Experts found the AI texts to be “formulaic,” “conventional,” and lacking in deep appeal.
  • The 2-3 Excellence: Now look at Patricio Pron’s column (panel c). His scores cluster heavily around 2 and 3.
  • The “Creativity” Gap: Pron’s average creativity score was 1.94. GPT-4’s was 0.76 in English and 0.94 in Spanish. The human was roughly twice as creative as the machine.
  • Originality: This was the most lopsided category. GPT-4’s “Style Originality” scores were abysmal (0.36 in Spanish). The experts essentially viewed the AI’s writing style as a collection of clichés with no unique voice.

The conclusion for the first research question is clear: No, the current state of generative AI cannot yet compete with a prestigious human author. The AI produces clean, coherent text, but it lacks the intentionality, subversion, and stylistic depth of a master.

RQ2: The “Prompt” in Pron vs. Prompt

One of the most interesting findings of the study was not about who wrote the text, but who wrote the title.

Remember, both contenders wrote stories based on each other’s titles. The researchers wanted to know: Does a highly creative prompt (a title written by a human novelist) help the AI write better?

The answer is a resounding yes.

Figure 2: Comparison of the impact of using Pron’s titles versus GPT-4’s titles on text quality.

The Radar Chart Analysis: The chart above shows how the source of the title affected the quality of the final text.

  • Blue Line: GPT-4 writing from its own titles. This is the baseline performance, and it is the smallest shape on the chart (lowest quality).
  • Orange Line: GPT-4 writing from Pron’s titles. Notice how the shape expands outward.
  • Red Line: Pron writing from Pron’s titles. The gold standard.

When GPT-4 was forced to work with the creative, unusual titles invented by Patricio Pron, its performance improved significantly across almost all metrics. Statistical tests showed significant jumps in Style Originality and Anthology Potential.

Why does this matter? This suggests that “Co-Creation” is a more viable path than autonomous AI writing. The AI is a probability machine; left to its own devices, it drifts toward the average (the cliché). But when a human provides a high-entropy, creative constraint (a unique title), the AI is forced out of its comfort zone and produces better work.

Interestingly, the reverse was not true. When Patricio Pron had to write stories based on GPT-4’s boring, cliché titles, his performance didn’t drop. In fact, he often scored higher. As he told the researchers, he disliked the AI titles so much that he tried to take them in “completely different directions,” using the cliché as a constraint to rebel against.

RQ3: The Language Gap

We often assume that because LLMs are trained on the whole internet, they are equally brilliant in major languages. The study proved this wrong.

The researchers compared GPT-4’s performance in English vs. Spanish. Spanish is among the most widely spoken languages in the world and is rich in training data, yet the AI performed significantly worse in it.

Table 1: Paired statistical test results for attribute differences between English and Spanish

The table above shows the statistical significance of the gap. For almost every stylistic category—Style Attractiveness, Style Originality, Creativity, and Own Voice—the AI was significantly better in English (indicated by the “Yes” in the Significant column).
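
As an illustration of what such a paired test looks like in practice, here is a sketch using a Wilcoxon signed-rank test, a common choice for paired ordinal ratings. The paper’s exact procedure may differ, and the scores below are invented.

```python
from scipy.stats import wilcoxon

# Hypothetical paired expert scores for one attribute, rated on matched
# items: GPT-4's English synopses vs. its Spanish synopses.
english_scores = [2, 1, 3, 1, 2, 2, 1, 2, 1, 3]
spanish_scores = [1, 1, 1, 0, 2, 1, 1, 0, 0, 1]

# The test asks whether the paired differences are consistently one-sided.
stat, p_value = wilcoxon(english_scores, spanish_scores)
print(f"statistic={stat}, p={p_value:.4f}")  # p < 0.05 -> significant gap
```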

Figure 3: Means of differences (GPT-4 - Pron) in English and Spanish.

Note: Focus on the bar chart in the top half of the image above.

This chart visualizes the “gap” between the Human and the AI. A negative score means the AI is worse than the Human; a toy computation of these gaps follows the list below.

  • Blue Bars (Spanish): The negative bars are deeper. The AI is much worse than Pron in Spanish.
  • Red Bars (English): The AI is still worse than Pron (the bars are still negative), but the gap is smaller.
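
As promised, here is a toy computation of the quantity behind those bars: the mean per-attribute difference between GPT-4’s and Pron’s scores, computed separately per language. All numbers are invented for illustration.

```python
from statistics import mean

# Hypothetical per-text expert scores for one attribute (e.g., creativity).
pron = [2, 3, 2, 2, 3]
gpt4_es = [1, 1, 0, 1, 1]
gpt4_en = [1, 2, 1, 1, 2]

# Each bar plots mean(GPT-4) - mean(Pron): negative means the AI trails.
print("Spanish gap:", mean(gpt4_es) - mean(pron))  # deeper negative
print("English gap:", mean(gpt4_en) - mean(pron))  # smaller negative
```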

This highlights a bias in current AI models: they are Anglo-centric. If you are a creative writer in Spanish, the AI is even less of a threat to you than it is to your English-speaking counterparts.

RQ4: The Turing Test — Can Experts Spot the Bot?

If you read 60 synopses mixed up between a human and a machine, could you tell which is which?

At the start of the evaluation, the critics were decent at guessing, but not perfect. However, as they read more, something interesting happened. They started learning.

Look at the line graph in the bottom half of the image from the previous section (Figure 4).

The blue line represents the accuracy of identifying AI texts. Notice the upward slope. As the evaluators moved from the 1st text to the 60th, their ability to spot GPT-4 increased.

This suggests that GPT-4 has a “tell.” It has a specific style—likely a flatness of affect, a tendency toward certain tropes, or a specific rhythm—that becomes recognizable after prolonged exposure. It doesn’t write like a person with a voice; it writes like a machine maximizing probability.
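
To see how such a learning curve can be computed, here is a sketch with simulated guesses; the real curve, of course, comes from the critics’ actual judgments.

```python
import random

# Simulated guesses: 1 if a judge correctly flags an AI-written text,
# 0 otherwise, in the order the texts were read (invented data that
# gets gradually more accurate, mimicking the reported trend).
random.seed(0)
guesses = [1 if random.random() < 0.5 + i / 120 else 0 for i in range(60)]

# Running accuracy after each text; an upward trend means the judges are
# learning GPT-4's "tell" with exposure.
running = [sum(guesses[: i + 1]) / (i + 1) for i in range(60)]
print(f"after 10 texts: {running[9]:.2f}, after 60 texts: {running[59]:.2f}")
```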

RQ5: Validating the Definition of Creativity

Finally, the researchers wanted to check their math. They used Boden’s theory (Creativity = Novelty + Surprise + Value) to build their rubric. But did the judges’ intuitive sense of “Creativity” actually align with those metrics?

Table 2: Spearman correlation for the dimensions of attractiveness, originality, and creativity.

The correlation table above confirms the theory. There is a strong statistical correlation (values above 0.7) between Originality, Attractiveness, and the overall Creativity score.
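
The check itself is straightforward to reproduce in principle. Here is a sketch using SciPy’s Spearman correlation on invented 0-3 ratings; only the dimension names and the above-0.7 finding come from the paper.

```python
from scipy.stats import spearmanr

# Hypothetical paired ratings across a set of texts (0-3 scale).
creativity     = [0, 1, 1, 2, 2, 3, 0, 3, 2, 1]
originality    = [0, 1, 2, 2, 1, 3, 0, 3, 2, 1]
attractiveness = [1, 1, 1, 2, 2, 3, 0, 3, 3, 1]

# Rank correlation: does "creative" track "original" and "attractive"?
rho_o, p_o = spearmanr(creativity, originality)
rho_a, p_a = spearmanr(creativity, attractiveness)
print(f"creativity~originality:    rho={rho_o:.2f} (p={p_o:.3f})")
print(f"creativity~attractiveness: rho={rho_a:.2f} (p={p_a:.3f})")
```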

Figure 5: Correlation plots for creativity versus attractiveness and originality.

These scatter plots visually confirm the data. The dense clusters along the diagonal show that when a text is rated high in “Attractiveness” or “Originality,” it is almost always rated high in “Creativity.”

This is important because it validates the rubric. The fact that Patricio Pron scored higher wasn’t just a subjective preference; he scored higher because he delivered more Originality (Novelty/Surprise) and Attractiveness (Value), which are the building blocks of creativity.

Conclusion: The Cliché vs. The Intent

This paper provides a sobering reality check for the “AI is taking over art” narrative. When you strip away the hype and pit the machine against a true master, the difference is glaring.

The authors conclude that LLMs are currently “cliché machines.” They work by predicting the most probable next word. In creative writing, however, the most probable word is often the least creative one. A great writer makes choices that are low-probability but high-meaning. They subvert expectations. They break rules intentionally.

GPT-4, by contrast, smooths out the edges. It generates content that is high in coherence but low in “soul” or “voice.”

Key Takeaways for Students:

  1. Don’t settle for “Average”: AI beats the average, but it cannot touch the exceptional. If you want to survive in the creative age, you must hone a unique voice.
  2. The Prompt is the Co-Author: The machine is only as creative as the constraint you give it. Using AI for brainstorming (like Pron’s titles) is powerful; using it to do the work for you results in mediocrity.
  3. Language Matters: We must be aware of the Anglo-centric bias in these tools.
  4. Pattern Recognition: AI has a “style,” and once you see it, you can’t unsee it.

The duel between Pron and the Prompt was not a draw. The human won. But the fact that the machine could even step into the ring—and that it improved when guided by a human hand—shows that the future of writing may not be replacement, but collaboration. Just make sure you are the one writing the titles.