Every researcher knows the feeling. You have a brilliant idea, you’ve run the experiments, and you’ve drafted the core methodology. Then, you hit the wall: the Related Work Section (RWS).
To write a good RWS, you cannot simply list papers that sound similar to yours. You must craft a coherent narrative. You have to explain the history of the problem, group existing solutions by their approach, point out their flaws, and seamlessly transition into how your work fills the gap. It is a task that requires deep domain expertise, high-level synthesis skills, and the time to read hundreds of papers.
But what if an AI could do it for you?
This is the promise of Related Work Generation (RWG), a field of Natural Language Processing (NLP) that has fascinated researchers for over a decade. In a comprehensive new survey titled “Related Work and Citation Text Generation: A Survey,” researchers Xiangci Li and Jessica Ouyang take us through the history, the failures, and the resurgence of this challenging task.
In this post, we will explore how computers have learned to read and write scientific literature, moving from simple copy-paste algorithms to complex Large Language Models (LLMs) that attempt to “think” like a scientist.
The Core Problem: Why RWG is Hard
Before we look at the solutions, we need to understand the difficulty of the task. Academic research is cumulative: to convince a reader that a new paper matters, the author must write a literature review that connects prior works to the current work.
As the authors of the survey note, “Writing an RWS is non-trivial; it is insufficient to simply concatenate generic summaries of prior works.” A good literature review is a story. It requires four stages, sketched as a skeleton pipeline after this list:
- Retrieval: Finding relevant papers (often from a massive, fast-growing feed of pre-prints).
- Understanding: Digesting the full text of these papers.
- Synthesis: Grouping them logically.
- Generation: Writing a cohesive text that positions the new work against the old.
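The survey does not prescribe a single architecture, but the four stages can be sketched as a skeleton pipeline. All function names and signatures below are illustrative placeholders, not components from the paper:

```python
# Skeleton of the four RWG stages; names and signatures are illustrative.

def retrieve(target_abstract: str, corpus: list[str]) -> list[str]:
    """Stage 1: find prior works relevant to the target paper."""
    ...

def understand(cited_papers: list[str]) -> list[str]:
    """Stage 2: digest each cited paper into a short, usable summary."""
    ...

def synthesize(summaries: list[str]) -> list[list[str]]:
    """Stage 3: group the summaries into thematic clusters, one per paragraph."""
    ...

def generate(groups: list[list[str]], target_abstract: str) -> str:
    """Stage 4: write a coherent section positioning the new work against the old."""
    ...

def generate_related_work(target_abstract: str, corpus: list[str]) -> str:
    cited = retrieve(target_abstract, corpus)
    return generate(synthesize(understand(cited)), target_abstract)
```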
The field of RWG has “waxed and waned” along with the capabilities of NLP models. It started with rule-based systems, moved to extractive summarization, and has now arrived at the era of abstractive generation via LLMs.
Defining the Task: A Moving Target
One of the most surprising findings in this survey is that researchers cannot agree on what “Related Work Generation” actually means. The definition of the task has shifted dramatically based on the technology available at the time.
The authors categorize the history of RWG into three distinct approaches:
- Extractive: Selecting and reordering existing sentences.
- Abstractive (Citation-Level): Generating a single sentence about a specific paper.
- Abstractive (Section-Level): Generating full paragraphs or sections.
We can visualize these differences in the table below:

1. Extractive Approaches
In the early days (around 2010), the task was defined as Extractive. The system would take a set of cited papers and the target paper, and essentially “copy-paste” salient sentences from the cited papers to form a summary.
As shown in Table 4 below, early works like Hoang and Kan (2010) required a “Topic hierarchy tree” as input—essentially a human outline—and filled it in with sentences extracted from cited papers. This approach ensures the text is factual (since it’s copied directly), but the result often lacks flow and coherence. It reads like a list of disconnected facts rather than a narrative.
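To make the extractive formulation concrete, here is a minimal sketch that scores sentences from the cited papers against the target paper’s abstract with TF-IDF similarity and copies the most salient ones. It is an illustrative baseline, not the actual system of Hoang and Kan (2010), which additionally relies on the human-provided topic hierarchy:

```python
# Minimal extractive RWG sketch: rank cited-paper sentences by TF-IDF
# similarity to the target abstract and copy the top ones verbatim.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def extractive_related_work(target_abstract: str,
                            cited_sentences: list[str],
                            top_k: int = 5) -> str:
    vectorizer = TfidfVectorizer(stop_words="english")
    # Fit on all text so the target and the candidates share one vocabulary.
    matrix = vectorizer.fit_transform([target_abstract] + cited_sentences)
    scores = cosine_similarity(matrix[0], matrix[1:]).ravel()
    # Keep the top-k most salient sentences, preserving their original order.
    best = sorted(scores.argsort()[::-1][:top_k])
    return " ".join(cited_sentences[i] for i in best)
```

The output is factual by construction, but, as noted above, it reads like a list of disconnected facts.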

2. Abstractive Approaches: The Neural Shift
With the advent of neural networks (like Transformers), the field moved toward Abstractive generation—writing new sentences from scratch. However, early neural models had a major limitation: they couldn’t read long documents. A scientific paper is thousands of words long; early models simply couldn’t hold that much information in memory.
To cope with this, the task definition shrank. Instead of generating a whole section, researchers focused on Citation Text Generation. The goal became to generate a single sentence (or part of a sentence) that describes one cited paper, given the surrounding context.
Table 5 highlights this shift. Notice how the “Target” for many of these works (like AbuRa’ed et al., 2020) is a “Citation sentence w/ single reference.”

This simplification made the problem solvable for neural networks but less useful for humans. A single sentence describing one paper doesn’t help a researcher structure a complex argument involving twenty different sources.
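To make the citation-level formulation concrete, here is a rough sketch that packs the local context of the citing paper and the cited paper’s abstract into one input for a BART-style seq2seq model. The checkpoint name is a generic placeholder, and a real system would be fine-tuned on citation data; this is not a specific model from the survey:

```python
# Sketch of citation-level generation: context + cited abstract in, one
# citation sentence out. Placeholder checkpoint; fine-tuning is assumed.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large")

def generate_citation_sentence(context_before: str,
                               cited_abstract: str,
                               context_after: str) -> str:
    source = (f"context: {context_before} </s> "
              f"cited: {cited_abstract} </s> "
              f"context: {context_after}")
    inputs = tokenizer(source, return_tensors="pt",
                       truncation=True, max_length=1024)
    output_ids = model.generate(**inputs, max_new_tokens=60, num_beams=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```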
3. The Return to Section-Level Generation
Recently, thanks to LLMs with massive context windows (like GPT-4), the pendulum has swung back. The task definition has returned to the original goal: generating a full, coherent Related Work Section. Modern approaches (like Li and Ouyang, 2024) attempt to generate multiple paragraphs, organize citations logically, and write transition sentences—mimicking the human writing process.
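A minimal sketch of this section-level setup, assuming an OpenAI-style chat API as the backend; the prompt wording is illustrative and not taken from Li and Ouyang (2024):

```python
# Sketch of section-level generation with a long-context LLM: all cited
# abstracts plus the target abstract go into a single prompt.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_related_work_section(target_abstract: str,
                                  cited: dict[str, str]) -> str:
    cited_block = "\n\n".join(f"[{key}] {abstract}"
                              for key, abstract in cited.items())
    prompt = (
        "You are writing the Related Work section of a paper.\n"
        f"Target paper abstract:\n{target_abstract}\n\n"
        f"Cited papers:\n{cited_block}\n\n"
        "Group the cited papers by approach, discuss their limitations, and "
        "end each paragraph by relating them to the target paper. "
        "Cite papers using their bracketed keys."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```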
Methodologies: How Machines Read Science
So, how do these systems actually work? The survey identifies several key components that vary across different approaches. We can see a summary of these approaches in Table 7 (from the paper’s appendix), which lists the inputs and specific models used by various researchers.

Representing the Paper
Because full papers are so long, almost all systems use the Abstract as a proxy for the paper’s content. The abstract is concise and usually contains the main contributions. However, some newer methods argue this isn’t enough. For example, Li et al. (2023) proposed finding “Cited Text Spans” (CTS)—the specific sentence in the body of the cited paper that supports the claim being made.
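A minimal sketch of CTS retrieval, assuming an off-the-shelf sentence encoder: find the body sentence of the cited paper most similar to the claim being made. Li et al. (2023) use dedicated retrieval models, so treat this as an approximation of the idea rather than their method:

```python
# Sketch of cited text span (CTS) retrieval via sentence embeddings.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def find_cited_text_span(claim: str, cited_body_sentences: list[str]) -> str:
    claim_emb = encoder.encode(claim, convert_to_tensor=True)
    body_embs = encoder.encode(cited_body_sentences, convert_to_tensor=True)
    # Return the body sentence that best supports the claim.
    scores = util.cos_sim(claim_emb, body_embs)[0]
    return cited_body_sentences[int(scores.argmax())]
```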
The Importance of Context
Context is everything. You describe a paper differently depending on why you are citing it. Are you citing it to critique it? To use its method? To contrast it with your own? Most abstractive systems use the “Target Paper Context”—the sentences surrounding the citation—to guide the generation. This ensures the generated text flows well with the rest of the paper.
Citation Graphs
Papers don’t exist in a vacuum. Some advanced models (like Ge et al., 2021) use Citation Graphs. They look at the network of papers—who cited whom—to understand the relationships between works. If Paper A and Paper B are often cited together, the model learns they are likely related topics and should be discussed in the same paragraph.
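A small sketch of the co-citation idea, built with networkx; the heuristic here (count how often two papers appear in the same bibliography) is illustrative, not the specific graph model of Ge et al. (2021):

```python
# Build a co-citation graph: papers cited together often get heavy edges
# and are candidates for the same Related Work paragraph.
from itertools import combinations
import networkx as nx

def co_citation_graph(bibliographies: dict[str, list[str]]) -> nx.Graph:
    """bibliographies maps a citing paper ID to the paper IDs it cites."""
    graph = nx.Graph()
    for cited in bibliographies.values():
        for a, b in combinations(sorted(set(cited)), 2):
            # Increment the edge weight each time two papers are co-cited.
            weight = graph.get_edge_data(a, b, default={"weight": 0})["weight"]
            graph.add_edge(a, b, weight=weight + 1)
    return graph
```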
Human-in-the-Loop
The most recent evolution acknowledges that AI still struggles with high-level organization. “Human-Assisted Generation” involves the user providing keywords, intent (e.g., “I want to contrast these papers”), or a rough grouping of citations. The AI then handles the drafting, following the human’s strategic lead.
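A sketch of how that strategic lead might be passed to a model: the user supplies groups of citations with an intent for each, and the prompt only asks the model to draft the prose. The field names here are hypothetical:

```python
# Sketch of human-assisted generation: the human decides the grouping and
# intent, the model only drafts the paragraphs.
def build_assisted_prompt(groups: list[dict]) -> str:
    """Each group looks like {"intent": "contrast with our method",
    "citations": {"Smith2020": "abstract...", "Lee2021": "abstract..."}}."""
    parts = ["Draft one Related Work paragraph per group, "
             "following each stated intent."]
    for i, group in enumerate(groups, start=1):
        cited = "\n".join(f"{key}: {abstract}"
                          for key, abstract in group["citations"].items())
        parts.append(f"Group {i} (intent: {group['intent']}):\n{cited}")
    return "\n\n".join(parts)
```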
Data: What Are Models Reading?
To train these models, you need massive datasets of scientific papers. The survey highlights that the vast majority of RWG research focuses on Computer Science and NLP papers. This is a bit of a “meta” situation: NLP researchers find it easiest to scrape and process papers from their own field.
Table 3 lists the common datasets used.

- ACL Anthology Network (AAN): Papers from computational linguistics conferences.
- S2ORC: The Semantic Scholar Open Research Corpus, a massive collection of open-access papers.
- CORWA: A subset of S2ORC specifically annotated for citation generation, labeling the “discourse role” of citations (e.g., is this citation the main point of the sentence, or just a reference?).
A major challenge noted by the authors is that datasets often miss cited papers. If a paper is behind a paywall, the AI can’t read it, meaning it’s trying to write a summary of a paper it has never seen.
Evaluation: How Do We Know It’s Good?
Evaluating generated text is notoriously difficult. In standard summarization tasks, we compare the AI’s output to a “gold standard” human summary. But in RWG, there are many valid ways to write a related work section.
Automatic Metrics
Most studies use ROUGE scores, which measure the overlap of words between the generated text and the actual human-written RWS. Table 9 summarizes the metrics used in abstractive works. While ROUGE is standard, it is a poor proxy for quality in this specific domain. An AI could write a fluent, accurate paragraph that gets a low ROUGE score simply because it used different synonyms than the original author.
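For reference, ROUGE is typically computed against the human-written section like this (assuming the `rouge-score` package); note how an accurate paraphrase can still score low:

```python
# Compute ROUGE between a generated sentence and the human reference.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"],
                                  use_stemmer=True)
scores = scorer.score(
    target="Prior work framed related work generation as extractive summarization.",
    prediction="Earlier studies treated the task as selecting sentences from cited papers.",
)
print(scores["rougeL"].fmeasure)  # low, even though the paraphrase is accurate
```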

Human Evaluation
Because automatic metrics fall short, human evaluation is critical. Judges (usually other researchers) rate the outputs on several criteria. Table 10 provides a fascinating look at what researchers value in an RWS.

Key perspectives include:
- Fluency: Does it read well?
- Coherence: Does it fit with the surrounding text?
- Informativeness: Does it actually tell us what the cited paper is about?
- Succinctness: Does it avoid rambling?
- Factuality: This is crucial. Abstractive models can hallucinate. A model might confidently state that “Smith et al. (2020) proposed a neural network,” when Smith actually proposed a rule-based system.
Extractive works (summarized in Table 8) had it easier regarding factuality—since they copied text, they rarely lied—but they struggled heavily with coherence and flow.

Current Challenges and Ethical Concerns
The survey concludes by identifying significant hurdles that remain before AI can write your literature review for you.
1. Lack of Comparability
Because every paper defines the task differently (single sentence vs. paragraph) and uses different datasets, it is almost impossible to compare Model A against Model B directly. The field needs a standard benchmark.
2. The “Missing Citation” Problem
Current systems assume you give them the list of papers to cite. But finding those papers is half the battle! The authors suggest that future work should combine Retrieval-Augmented Generation (RAG) with RWG. The AI should not just summarize the papers you found; it should tell you which papers you missed.
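A rough sketch of that retrieval-augmented direction, assuming an off-the-shelf sentence encoder: embed the target abstract, rank candidate papers by similarity, and surface relevant ones the author has not yet cited. This is a suggestion for future work in the survey, not an implemented system:

```python
# Sketch of "missing citation" retrieval: rank candidate papers by similarity
# to the target abstract and flag relevant ones that are not yet cited.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def suggest_missing_citations(target_abstract: str,
                              candidate_abstracts: dict[str, str],
                              already_cited: set[str],
                              top_k: int = 5) -> list[str]:
    ids = list(candidate_abstracts)
    query = encoder.encode(target_abstract, convert_to_tensor=True)
    corpus = encoder.encode([candidate_abstracts[i] for i in ids],
                            convert_to_tensor=True)
    scores = util.cos_sim(query, corpus)[0]
    ranked = [ids[int(i)] for i in scores.argsort(descending=True)]
    return [pid for pid in ranked if pid not in already_cited][:top_k]
```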
3. Narrative Flow
Humans are still better at “transitional” writing. We know how to weave a story that connects disparate ideas. AI tends to produce “citation salad”—a list of summaries glued together without a strong underlying argument.
4. Ethics and Education
Finally, the authors raise provocative ethical questions.
- Plagiarism: If an AI generates your RWS, is it plagiarism? Even abstractive models can accidentally reproduce large chunks of training data.
- Academic Integrity: If a PhD student uses AI to write their literature review, have they actually learned the material? Writing an RWS is a thinking process; bypassing it might hinder scientific development.
- Hallucination: An AI-generated RWS might cite non-existent papers or misrepresent real ones. Researchers must verify every claim, which might take as long as writing the section themselves.
Conclusion
The field of Automatic Related Work Generation has come a long way. We have moved from clunky systems that pasted sentences together to sophisticated LLMs that can generate fluent, paragraph-level critiques of scientific literature.
However, the “perfect” AI research assistant isn’t here yet. While models are great at summarizing individual papers, they still struggle with the high-level synthesis and storytelling that makes a literature review truly compelling. For now, the best workflow seems to be a partnership: the human provides the structure and critical thinking, and the AI helps handle the volume of reading and initial drafting.
As the authors summarize, this field acts as an excellent test bed for the capabilities of modern NLP. If an AI can successfully navigate the complex, factual, and narrative demands of scientific writing, it can likely handle almost anything else.