In the age of generative AI, it is easy to be impressed by a Large Language Model’s ability to write fluent text. Ask ChatGPT to write a news article, and it will churn out grammatically correct, structurally sound paragraphs in seconds. But journalism is not just about writing; it is about reporting.
Before a single sentence is written, a journalist engages in a complex planning process. They decide on an “angle” (the specific narrative focus) and figure out which “sources” (people or documents) to consult. This is especially critical when covering press releases—corporate announcements often designed to “spin” a story in a positive light. A good journalist doesn’t just repeat the spin; they challenge it.
Can LLMs replicate this high-level cognitive planning? Can they look at a corporate press release and think, “Wait a minute, I need to call a former employee to verify this”?
A fascinating research paper titled “Do LLMs Plan Like Human Writers?” by Alexander Spangher and colleagues explores this exact question. By analyzing hundreds of thousands of news articles, the researchers compare the planning capabilities of AI against professional human journalists. The results reveal a significant “creativity gap” that highlights the current limitations of AI in complex, investigative tasks.

The Problem: De-Spinning the News
To understand the study, we first have to understand the job of a financial journalist. Companies release press releases to announce earnings, product launches, or responses to scandals. These documents are inherently biased.
If a news article simply repeats the information in the press release, it is performing a summarization task. However, effective journalism involves “de-spinning.” This means:
- Contextualizing: Placing the news in the broader history of the company.
- Challenging: Contradicting false or misleading claims.
- Sourcing: Finding independent voices (former employees, regulators, victims) rather than just quoting the CEO.
The researchers set out to see if LLMs could act as creative aides in this specific workflow. If an LLM reads a press release from a company like Theranos, can it suggest a skeptical angle? Can it recommend interviewing a whistleblower rather than just a “business professor”?
Building the PressRelease Dataset
To teach an AI how to be a journalist, you first need examples of good journalism. The authors constructed a massive dataset called PressRelease, containing over 650,000 news articles linked to 250,000 press releases.
They collected this data in two directions to ensure diversity:
- News \(\rightarrow\) Press Release: They scraped major financial newspapers and looked for hyperlinks pointing to press release domains (like PR Newswire or BusinessWire).
- Press Release \(\rightarrow\) News: They started with press releases from S&P 500 companies and used backlink checkers to find news articles that discussed them.
This creates a “ground truth” of human behavior. For any given corporate announcement, the researchers have the actual resulting article written by a professional journalist. This allows them to see exactly what angle the human took and who they interviewed.
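As a concrete illustration of the first collection direction (News \(\rightarrow\) Press Release), here is a minimal sketch of the hyperlink-matching step, assuming the article HTML has already been downloaded. The wire-domain list and function name are illustrative choices, not the paper's actual pipeline.

```python
from urllib.parse import urlparse

from bs4 import BeautifulSoup  # pip install beautifulsoup4

# Illustrative list of press-release wire domains; the paper's actual list may differ.
PR_DOMAINS = {"prnewswire.com", "businesswire.com", "globenewswire.com"}

def extract_press_release_links(article_html: str) -> list[str]:
    """Return hyperlinks in a news article that point to press-release wires."""
    soup = BeautifulSoup(article_html, "html.parser")
    links = []
    for a in soup.find_all("a", href=True):
        host = urlparse(a["href"]).netloc.lower().removeprefix("www.")
        if host in PR_DOMAINS:
            links.append(a["href"])
    return links
```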
The Core Method: Contrastive Summarization
One of the most innovative contributions of this paper is how the researchers mathematically defined “effective coverage.” How do you tell an algorithm that an article is doing a good job of challenging a press release?
They introduced the concept of Contrastive Summarization.
In Natural Language Processing (NLP), there is a task called Natural Language Inference (NLI). NLI looks at two sentences and determines their relationship:
- Entailment: If Sentence A is true, Sentence B must also be true. (e.g., PR: “We made a profit.” Article: “The company reported a profit.”)
- Contradiction: If Sentence A is true, Sentence B must be false. (e.g., PR: “We protect user privacy.” Article: “The company sold user data.”)
- Neutral: Neither relationship holds; the sentences are logically independent.
The authors hypothesized that a “vanilla” summary would have high entailment. However, a high-quality, investigative article should have a mix of entailment (to cover the facts) and contradiction (to challenge the spin).
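To make these categories concrete, here is a minimal sketch that scores the example sentence pairs above with an off-the-shelf MNLI model from Hugging Face; roberta-large-mnli is an illustrative choice, not necessarily the model the authors used.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "roberta-large-mnli"  # illustrative choice of NLI model
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

def nli_scores(premise: str, hypothesis: str) -> dict[str, float]:
    """Probability of contradiction / neutral / entailment for a sentence pair."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = model(**inputs).logits.softmax(dim=-1).squeeze()
    # roberta-large-mnli label order: 0=contradiction, 1=neutral, 2=entailment
    return {"contradiction": probs[0].item(),
            "neutral": probs[1].item(),
            "entailment": probs[2].item()}

print(nli_scores("We made a profit.", "The company reported a profit."))
print(nli_scores("We protect user privacy.", "The company sold user data."))
```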
Visualizing the Metrics
To measure this, they built a system that compares every sentence in a press release against every sentence in the news article.

As shown in Figure 5 above, the model aggregates these sentence-level scores to give the whole document a score. The formula (shown in the image below) basically asks: “Does this article reference the press release enough to be relevant, but contradict it enough to be critical?”
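The paper's exact aggregation formula is not reproduced here, but the idea can be sketched roughly as follows: for each article sentence, take its strongest entailment and strongest contradiction against any press-release sentence, then average over the article. This reuses `nli_scores` from the previous snippet and is an illustrative aggregation, not the authors' precise definition.

```python
def contrastive_profile(pr_sentences: list[str],
                        article_sentences: list[str]) -> dict[str, float]:
    """Rough document-level entailment/contradiction profile.

    For each article sentence, keep its best entailment and best contradiction
    against any press-release sentence, then average across the article.
    """
    ent, con = [], []
    for art_sent in article_sentences:
        scores = [nli_scores(pr_sent, art_sent) for pr_sent in pr_sentences]
        ent.append(max(s["entailment"] for s in scores))
        con.append(max(s["contradiction"] for s in scores))
    return {"entailment": sum(ent) / len(ent),
            "contradiction": sum(con) / len(con)}
```

A "vanilla" summary should score high on entailment and low on contradiction; a critical article should score meaningfully on both.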

Insight: Contradiction Requires Resources
When the researchers analyzed their dataset using this method, they found a strong correlation between contradiction and effort.
Articles that contradicted the press release (high criticism) tended to use:
- More sources: The median number of sources for critical articles was 9, compared to 3 for uncritical ones.
- Harder sources: They were more likely to use “Quotes” (which require calling someone) rather than “Press Reports” (which just require reading other news).
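Once per-article contradiction scores and source counts exist, the comparison above reduces to a grouped median. A toy sketch with made-up numbers and hypothetical column names:

```python
import pandas as pd

# Hypothetical per-article table: contradiction score from the NLI step,
# plus a count of attributed sources in the article.
df = pd.DataFrame({
    "contradiction_score": [0.05, 0.62, 0.71, 0.10, 0.55],
    "num_sources": [3, 8, 11, 2, 9],
})

# Split on an arbitrary threshold and compare median sourcing effort.
df["critical"] = df["contradiction_score"] > 0.5
print(df.groupby("critical")["num_sources"].median())
```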
This confirms that challenging a narrative is resource-intensive. It requires planning. This brings us to the central experiment: Can LLMs do this planning?
Experiment: Human vs. Machine
The researchers designed an experiment to pit LLMs (specifically GPT-3.5, GPT-4, Mixtral, and Command-R) against human journalists. They selected 300 high-quality, critical news articles from their dataset to serve as the “Gold Standard.”
The experiment followed a three-step pipeline, illustrated below:

1. Generate LLM Plan: The LLM is given the press release and asked to act as a planner. “De-spin this. What angle should we take? What sources should we call?”
2. Assess Human Plan: Since the original journalists couldn’t be interviewed years after the fact, the researchers used an LLM to analyze the final human-written article and reverse-engineer the plan. “What angle did this journalist take? Who did they interview?”
3. Compare: Did the LLM suggest the same angle as the human? Did it suggest the same sources?
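A hedged sketch of what step 1 might look like with the OpenAI chat API; the prompt wording is paraphrased from the description above, not the paper's exact prompt.

```python
from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PLANNER_PROMPT = (
    "You are a financial journalist planning coverage of a press release. "
    "Do not simply summarize it: de-spin it. Propose (1) the angle the story "
    "should take and (2) the specific sources you would contact, and why."
)

def generate_llm_plan(press_release: str, model: str = "gpt-4") -> str:
    """Step 1 of the pipeline: ask the model to plan coverage of a press release."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": PLANNER_PROMPT},
            {"role": "user", "content": press_release},
        ],
    )
    return response.choices[0].message.content
```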
The Results
The comparison revealed that while LLMs are competent, they are significantly risk-averse and lack the “investigative nose” of a human.
1. LLMs are bad at sourcing
The models struggled significantly to recommend the specific types of sources humans used.
- Human: Might interview a “former safety inspector” or a “local union representative.”
- LLM: Tends to suggest generic “industry experts” or “company spokespeople.”
As the authors note, LLMs perform better at recommending angles (story directions) than they do at recommending sources. This suggests LLMs understand narrative better than they understand the investigative process.
2. The Creativity Gap
The most damning result came from evaluating creativity. The researchers recruited journalists to rate the plans on a 1-5 scale (1 being a simple summary, 5 being a novel investigative direction).

As Figure 3 shows, human plans were consistently rated as more creative than AI plans, regardless of whether the model was Zero-Shot (no examples), Few-Shot (given examples), or even Fine-Tuned.
3. LLMs miss the best ideas
It wasn’t just that LLMs were less creative on average; they specifically failed to predict the highly creative ideas.
When the researchers looked at the overlap—cases where the LLM successfully guessed the human’s plan versus cases where it missed—they found a striking pattern.

Look at the chart above (Figure 7). The orange bars represent the human ideas that the LLM missed. The blue bars are the ideas the LLM successfully recommended.
Notice that for “Source” planning (the right side of the chart), the orange bar is significantly higher. This means that when a human journalist came up with a highly creative sourcing strategy, the LLM almost always failed to predict it. The LLM only successfully matched the “easy,” low-creativity sourcing decisions.
In simple terms: AI is good at predicting the obvious, but bad at predicting the scoop.
4. Fine-tuning helps, but not enough
The researchers tried to fix this by fine-tuning GPT-3.5 on their dataset. While this improved the model’s ability to match the style of human plans, it didn’t solve the creativity deficit.

Figure 6 illustrates the distribution of scores. The red line (Human) has a healthy bump around scores of 3 and 4. The AI models, even after fine-tuning (orange line), cluster around 1 and 2. They are essentially learning to mimic the format of a plan without understanding the deeper investigative intuition required to generate a good plan.
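For context on what fine-tuning on this data involves, a training example would pair the press release with the plan recovered from the human-written article. The JSONL layout below follows OpenAI's chat fine-tuning format; the paper's exact training setup may differ.

```python
import json

# One illustrative training record: the press release as input, the plan
# reverse-engineered from the human article as the target output.
record = {
    "messages": [
        {"role": "system", "content": "Plan news coverage of the press release."},
        {"role": "user", "content": "<press release text>"},
        {"role": "assistant", "content": "<angle and source plan from the human article>"},
    ]
}

with open("press_release_plans.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
```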
Why Does This Happen?
The paper suggests a few reasons for this performance gap:
- Lack of External Knowledge: A journalist knows the history of a company. If Theranos releases a statement, a journalist remembers the rumors of failed tests. An LLM, unless specifically provided with that context, treats the press release in isolation. It lacks the “world model” of the specific beat.
- Safety and Alignment: LLMs are often trained to be helpful and harmless. “De-spinning” requires being critical, skeptical, and sometimes confrontational. The models may be biased toward “both-sidesism” rather than taking a hard stance, leading to lower creativity scores in contradiction tasks.
- The Nature of Planning: Planning is a latent process. We only see the final article, not the emails, phone calls, and rejected drafts that led to it. Trying to learn planning just by looking at the final output is an incredibly difficult machine learning task.
Conclusion: AI as a Tool, Not a Replacement
This research provides a sobering but important check on the hype surrounding AI in journalism. While LLMs can summarize text and fix grammar, they struggle with the core intellectual labor of journalism: investigation strategy.
The authors conclude that LLMs currently act as “safe” planners. They suggest the obvious angles and the standard sources. For an overworked journalist, this might still be useful—a “sanity check” to ensure the basics are covered. However, relying on an LLM to plan a story would likely result in coverage that is less critical, less diverse, and less informative than what a human would produce.
The future of AI in newsrooms likely lies in Human-in-the-Loop systems. Perhaps LLMs can be connected to external databases (using Retrieval Augmented Generation) to give them the history they currently lack. Or perhaps they should be used to play “Devil’s Advocate,” specifically prompted to find holes in a reporter’s plan rather than generating the plan themselves.
Until then, the “nose for news”—that human instinct to dig where others aren’t looking—remains an exclusively human trait.