Introduction
In the rapidly evolving landscape of Generative AI, a major legal and ethical storm has been brewing around copyright. We know that Large Language Models (LLMs) are trained on massive datasets that include copyrighted books, articles, and creative writing. A central question for researchers, lawyers, and content creators is: To what extent do these models reproduce protected content?
Until recently, the community has largely focused on literal copying—instances where an AI spits out a passage of text word-for-word, identical to the source material. It is relatively easy to check if a model generates the exact opening paragraph of Harry Potter. However, this narrow focus misses a crucial nuance of copyright law and creative expression. Infringement isn’t just about the exact sequence of words; it is also about the “pattern of the work”—the unique arrangement of plots, events, and characters.
If you ask an AI to write a story about a boy wizard, and it generates a narrative about a kid named Harry living under the stairs who goes to a magic school with friends named Ron and Hermione, the AI hasn’t necessarily copied the text verbatim. However, it has copied the “expression” of the story. This is non-literal copying, and until now, we haven’t had a robust way to measure it at scale.
This is the gap addressed by COPYBENCH, a new benchmark introduced in a recent research paper. The researchers argue that to truly understand the legal risks of LLMs, we must evaluate them not just on their ability to memorize words, but on their tendency to reproduce the heart and soul of fictional works: their events and characters.

As illustrated in Figure 1, the researchers propose a dual-axis evaluation. On one side, we measure Copying (both literal and non-literal). On the other, we measure Utility (Fact Recall and Fluency). The goal is to understand the trade-offs: Does making a model smarter and more fluent inevitably make it more prone to copyright infringement?
Background: The Legal and Technical Context
To appreciate the significance of COPYBENCH, we need to understand the intersection of US copyright law and machine learning memorization.
The Legal Framework: Substantial Similarity
Under US copyright law, infringement occurs when a work is “substantially similar” to a protected original. Courts have long established that this goes beyond exact wording. A famous legal test (the Nichols case from 1930) established that copyright cannot be limited literally to the text, or else plagiarists could escape liability simply by making minor changes.
There is a distinction between ideas (which are not copyrightable) and expression (which is). A story about a wizard is an idea. A story about a specific wizard attending a specific school with specific friends and specific plot points is an expression. When an LLM generates a story, we need to know if it is creating new expression or merely paraphrasing existing expression.
The Technical Challenge: Memorization vs. Generalization
In the AI field, we often talk about “memorization.” Usually, this refers to the model’s ability to recite training data. Researchers have developed various attacks to “extract” this data to prove that models were trained on specific corpora.
However, standard memorization metrics usually check for n-gram overlaps (sequences of matching words). This approach fails to capture high-level semantic copying. If a model rewrites a scene from To Kill a Mockingbird using entirely different vocabulary but identical character actions and plot progression, n-gram metrics will report zero copying. COPYBENCH aims to fix this blind spot by introducing automated protocols for detecting semantic overlap.
Core Method: How COPYBENCH Works
The researchers curated a dataset using popular copyrighted fiction books (published post-1923). To avoid legal issues themselves, they utilized existing datasets like BookMIA and summaries from CliffsNotes.
The benchmark evaluates models across three distinct dimensions:
- Literal Copying
- Non-Literal Copying (Events and Characters)
- Model Utility (Fact Recall and Fluency)
1. Measuring Literal Copying
This is the traditional metric. The researchers provide the model with a 200-word prefix from a book and ask it to complete the passage. They then compare the generated output to the actual next 50 words of the book using the ROUGE-L score, which measures the longest common subsequence of words. If the score is high (above 0.8), it counts as literal copying.
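To make this protocol concrete, here is a minimal Python sketch of the check, assuming simple whitespace tokenization. The 0.8 threshold comes from the description above; the function names and tokenization are illustrative rather than the paper’s exact implementation.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, 1):
        for j, tok_b in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if tok_a == tok_b else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f1(reference: str, candidate: str) -> float:
    """ROUGE-L F1: harmonic mean of LCS-based precision and recall."""
    ref, cand = reference.split(), candidate.split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

def is_literal_copy(book_continuation: str, model_output: str, threshold: float = 0.8) -> bool:
    """Flag the generation as literal copying if ROUGE-L against the book's true continuation exceeds the threshold."""
    return rouge_l_f1(book_continuation, model_output) >= threshold
```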
2. Measuring Non-Literal Copying
This is the novel contribution of the paper. Since they cannot rely on string matching, the researchers developed a pipeline to detect if a model is “remixing” a book’s plot.
The “Creative Writing” Task: Instead of asking the model to complete a specific sentence, they provide a prompt that sets up a generic story beginning based on a specific event from a book. They then ask the model to write an “original story.”
Metric 1 (Event Copying): To measure if the model follows the book’s plot, the researchers first used GPT-4 to extract a list of key events from CliffsNotes summaries of the books.
- Evaluation: They use another model (Flan-T5-XL) as a “judge.” This judge compares the LLM’s generated story against the list of reference events. If the generated story contains a threshold number of events from the original book, it counts as event copying.
Metric 2 (Character Copying): They also extracted character names and aliases from the summaries.
- Evaluation: They check if the generated story includes specific character names from the book (excluding names already provided in the prompt). If the model spontaneously introduces “Ron Weasley” into a wizard story, it is a strong signal of non-literal copying.
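Below is a hedged sketch of how these two checks might be automated. The alias dictionary, the regex matching, and the event-count threshold are simplifying assumptions; the actual pipeline uses GPT-4 to extract reference events and Flan-T5-XL as the judge that decides whether each event appears in the generated story.

```python
import re

def characters_copied(story: str, prompt: str, character_aliases: dict[str, list[str]]) -> list[str]:
    """Return canonical names of book characters that appear in the story
    but were not already mentioned in the prompt."""
    copied = []
    for canonical, aliases in character_aliases.items():
        for name in [canonical, *aliases]:
            pattern = r"\b" + re.escape(name) + r"\b"
            if re.search(pattern, story, re.IGNORECASE) and not re.search(pattern, prompt, re.IGNORECASE):
                copied.append(canonical)
                break
    return copied

def events_copied(judge_matches: list[bool], min_events: int = 2) -> bool:
    """Event copying: a judge model marks which reference events appear in the
    generated story; the output is flagged if at least `min_events` are matched
    (the threshold value here is illustrative)."""
    return sum(judge_matches) >= min_events

# Example usage: the prompt only describes "a boy who discovers he is a wizard",
# so a story that spontaneously mentions "Hagrid" is flagged for character copying.
aliases = {"Ron Weasley": ["Ron"], "Rubeus Hagrid": ["Hagrid"]}
print(characters_copied("Hagrid knocked on the door...", "Write a story about a boy wizard.", aliases))
# -> ['Rubeus Hagrid']
```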

Figure 3 demonstrates this perfectly using Harry Potter. In the first example (left column), the model is given a prompt about a boy discovering he is a wizard. The output immediately introduces “Hagrid,” “Voldemort,” “Diagon Alley,” and “Ron.” Although the sentences aren’t identical to J.K. Rowling’s text, the events (learning about Voldemort, buying supplies) and characters are clearly lifted from the source. The second example (right column) takes the same prompt but generates a truly original story with a character named “Ms. Bellamy.”
3. Measuring Utility
To ensure that “safer” models aren’t just “broken” models, the benchmark also tests utility.
- Fact Recall: Can the model answer specific questions about the book? (e.g., “What does Voldemort drink in the woods?”). This is measured using F1 scores on QA pairs (a minimal scoring sketch follows this list).
- Fluency: Is the text readable and grammatical? This is evaluated using a grading model (Prometheus-v2).
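For context, fact-recall QA of this kind is typically scored with a token-level F1. The sketch below shows one plausible implementation, assuming simple lowercased whitespace tokenization; the authors’ exact normalization may differ.

```python
from collections import Counter

def qa_f1(prediction: str, gold_answer: str) -> float:
    """Token-level F1 between a predicted answer and the gold answer."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold_answer.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)  # per-token overlap counts
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# e.g. qa_f1("unicorn blood", "the blood of a unicorn") ≈ 0.57: partial credit
# for a correct but differently worded answer.
```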
Experiments and Results
The researchers tested a wide array of models, including the Llama2 and Llama3 families, Mistral, and proprietary models like GPT-4. The results revealed several critical insights about the behavior of modern LLMs.
Insight 1: Size Drives Copying
One of the most robust findings is the relationship between model scale and copying behavior. Smaller models (around 7 billion parameters) rarely engage in literal, word-for-word copying. However, they do exhibit non-literal copying. They might not remember the exact prose, but they remember the characters and the plot beats.
As models get larger (e.g., Llama3-70B), the rates of both literal and non-literal copying skyrocket.

Looking at Table 2, we see a clear progression. Llama2-7B has a literal copying rate of just 0.1%. But Llama3-70B jumps to 10.5%. Similarly, character copying jumps from roughly 1.7% in small models to over 15% in the largest Llama3 model. This suggests that as we scale models to make them more capable, we are essentially building more efficient “copyright copying machines.”
Insight 2: The Utility-Copying Correlation
There is a strong positive correlation between a model’s ability to recall facts and its tendency to copy text.

Figure 2(a) and 2(b) visualize this relationship. The charts show that as “Literal Copying” increases (x-axis), “Event Copying” and “Fact Recall” also increase. This poses a difficult dilemma for developers: the very mechanism that allows a model to correctly answer “Who is Harry Potter’s best friend?” is the same mechanism that leads it to reproduce the plot of the book when asked to write a story.
Insight 3: Mitigation Strategies Are Insufficient
The paper also evaluated current methods designed to stop models from plagiarizing. These generally fall into two categories: Instruction Tuning (training the model to behave nicely) and Inference Mitigation (using algorithms during text generation to block copying).
The Failure of Instruction Tuning
We often assume that “Chat” models (like Llama-2-Chat) are safer than base models. The results show that while instruction tuning often reduces literal copying (perhaps because the model is trained to be conversational rather than to complete text), it does not solve non-literal copying.

In Table 4, we see mixed results. For example, Mixtral-8x7B-Instruct reduced literal copying by 91%, but its Event Copying actually increased by 52%. This suggests that the alignment process might make the model better at following the user’s lead into a story, inadvertently causing it to lean more heavily on the plots it memorized during pre-training.
The Failure of Inference Constraints
The researchers also tested “MemFree Decoding,” a method that explicitly stops the model from generating n-grams (sequences of words) that appear in the training data.
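Conceptually, this kind of decoding constraint can be sketched in a few lines of Python. The snippet below is an illustration rather than the authors’ implementation: `step_fn` is a hypothetical callback that returns the model’s ranked next-token candidates, a plain set stands in for an efficient index of training-data n-grams, and the n-gram size is arbitrary.

```python
def memfree_generate(step_fn, prompt_tokens: list[str], banned_ngrams: set[tuple],
                     n: int = 6, max_new_tokens: int = 50) -> list[str]:
    """Generate tokens, refusing any candidate that would complete an n-gram
    found in the training data (`banned_ngrams`)."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        for candidate in step_fn(tokens):  # candidates ranked by model probability (hypothetical callback)
            window = tuple(tokens[-(n - 1):] + [candidate])
            if window not in banned_ngrams:  # allow the token only if it doesn't reproduce training text
                tokens.append(candidate)
                break
        else:
            break  # every candidate would reproduce a training n-gram; stop early
    return tokens
```

Because the constraint operates purely on token sequences, it cannot see overlap in plot or characters, which is exactly the gap the benchmark exposes.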

Table 5 reveals a stark reality. MemFree decoding is incredibly effective at stopping literal copying (bringing it down to nearly 0%). However, it has almost no effect on non-literal copying. The model simply finds different words to tell the exact same copyrighted story. This confirms that mechanical filters based on text matching are insufficient for semantic copyright protection.
Conclusion & Implications
The COPYBENCH paper serves as a wake-up call for the AI community. It demonstrates that our current definitions of “copying” are too narrow. By focusing solely on verbatim reproduction, we are ignoring the substantial risk of non-literal infringement—where models replicate the plots, characters, and “soul” of a creative work without necessarily matching the “body” (the exact text).
Key Takeaways:
- Small models are not safe: Even if they don’t quote books verbatim, they can still reproduce protected plots and characters.
- Scale exacerbates risk: The larger the model, the more it copies.
- Current defenses are brittle: Tools that stop literal copying (like n-gram filtering) do not stop models from paraphrasing copyrighted content.
- The Trade-off is real: There is a direct tension between a model’s utility (knowing facts) and its safety (not copying).
For students and future researchers, this highlights a massive open problem. How do we train models to understand and reason about cultural works (Fact Recall) without allowing them to reproduce those works in creative contexts? The solution likely requires new training paradigms that go beyond simple text prediction and incorporate a deeper understanding of attribution and content boundaries. COPYBENCH provides the measuring stick we need to start solving this problem.