Introduction: The Copy-Paste Dilemma

In the era of Generative AI, a single question looms larger than perhaps any other: Do Large Language Models (LLMs) actually create new content, or are they just sophisticated copy-paste machines?

This question isn’t just academic curiosity. It is the fulcrum upon which billion-dollar lawsuits (like The New York Times vs. OpenAI) and the future of copyright law pivot. If an LLM is merely regurgitating memorized chunks of text from its massive training corpus, the argument for “fair use” becomes shaky. Conversely, if models are synthesizing information to create truly novel sentences, they are fulfilling the promise of artificial intelligence.

To answer this scientifically, we need to measure “novelty.” But here lies a massive engineering hurdle. Modern LLMs are trained on datasets like “The Pile,” which contains hundreds of billions of tokens (over 800GB of text). To check whether a generated sentence is novel, you have to compare it, in principle, against every single document in that massive haystack.

Standard search methods fail at this scale. They are simply too slow.

In this deep dive, we explore a paper by Merrill, Smith, and Elazar that introduces a breakthrough tool: RUSTY-DAWG. By leveraging a data structure from genomics and high-performance engineering, the researchers built a system that can search a massive corpus in time that depends only on the length of the query, not the size of the corpus. What they found regarding the novelty of models like Pythia challenges our assumptions about how LLMs write, and highlights exactly when they are prone to plagiarism.

The Challenge of Scale

Before understanding the solution, we must understand the metric. The researchers focus on verbatim novelty. They aren’t looking for vague semantic similarities; they want to know if a specific sequence of words (an \(n\)-gram) generated by the model appears word-for-word in the training data.

This is usually measured by the \(n\)-novelty rate: the proportion of generated \(n\)-grams (sequences of length \(n\)) that did not appear in the training corpus.

If you are analyzing a small corpus, you can use a hash table or a simple search index. But when your training data is The Pile (334 billion tokens), checking every generated 50-token phrase against the full corpus is computationally prohibitive with naive methods. Previous studies were limited to small models trained on small datasets (like WebText, 40GB), which are not representative of today’s behemoth LLMs.
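To make the metric concrete, here is a toy sketch (in Rust, the language RUSTY-DAWG itself is written in) of the naive approach: collect every \(n\)-gram of a training text into a hash set, then count how many generated \(n\)-grams fall outside it. The helper names are made up for illustration; the point is that this index grows with the corpus, which is exactly what becomes unmanageable at Pile scale.

```rust
use std::collections::HashSet;

/// Collect every n-gram (window of tokens) from a corpus into a hash set.
/// Fine for toy corpora; at Pile scale this index would not fit in memory.
fn ngram_index(tokens: &[&str], n: usize) -> HashSet<Vec<String>> {
    tokens
        .windows(n)
        .map(|w| w.iter().map(|t| t.to_string()).collect())
        .collect()
}

/// n-novelty rate: fraction of generated n-grams that never occur in training.
fn n_novelty(train: &HashSet<Vec<String>>, generated: &[&str], n: usize) -> f64 {
    let windows: Vec<Vec<String>> = generated
        .windows(n)
        .map(|w| w.iter().map(|t| t.to_string()).collect())
        .collect();
    if windows.is_empty() {
        return 0.0;
    }
    let novel = windows.iter().filter(|g| !train.contains(*g)).count();
    novel as f64 / windows.len() as f64
}

fn main() {
    // Toy "training corpus" and "model output", whitespace-tokenized.
    let train_tokens: Vec<&str> = "the cat sat on the mat".split_whitespace().collect();
    let gen_tokens: Vec<&str> = "the cat sat on the rug".split_whitespace().collect();

    let n = 3;
    let index = ngram_index(&train_tokens, n);
    println!("3-novelty = {:.2}", n_novelty(&index, &gen_tokens, n));
}
```

Here only one of the four generated 3-grams (“on the rug”) is novel, so the 3-novelty rate is 0.25.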

Enter RUSTY-DAWG: Indexing the Internet

To solve the scale problem, the authors utilized a data structure known as a Compacted Directed Acyclic Word Graph (CDAWG).

A CDAWG is a finite-state machine—a complex graph of nodes and edges—that acts as a perfect index for a text corpus. Unlike a standard search that scans documents, a CDAWG represents every possible substring within the corpus as a path through the graph.

Figure 1: Illustration of CDAWG and resulting novelty curves.

As shown in Figure 1 above, the CDAWG compresses the corpus (in this toy example, “hello$world”) into a graph.

  • Nodes represent states in the matching process.
  • Edges represent characters or tokens.
  • Solid lines show valid transitions (matching text).
  • Dashed lines represent “failure arcs”—shortcuts that tell the algorithm where to back off if a match fails, without restarting from scratch.

Why use a CDAWG?

The magic of the CDAWG lies in its efficiency. Once constructed, you can take a query string (like a document generated by ChatGPT) and stream it through the graph. The time it takes to find the longest matching substring in the training data depends only on the length of your query. It does not depend on the size of the training data.

Whether the training set is 10 megabytes or 10 terabytes, the search speed is roughly the same.

The authors implemented this in Rust (hence the name RUSTY-DAWG) to ensure memory efficiency and speed. They managed to build a CDAWG for the entire Pile dataset, creating what is likely the largest such graph ever constructed.
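RUSTY-DAWG itself is a serious piece of engineering, but the core idea can be sketched with the CDAWG’s simpler cousin, the suffix automaton (an uncompacted DAWG). The toy Rust below works over bytes rather than BPE tokens and is not the paper’s implementation; it only illustrates the two ingredients described above: failure arcs (suffix links) and a query loop whose per-symbol cost does not depend on the corpus size.

```rust
use std::collections::HashMap;

/// One state of a byte-level suffix automaton (a DAWG over all suffixes).
struct State {
    len: usize,               // length of the longest substring ending in this state
    link: Option<usize>,      // suffix link ("failure arc" back to a shorter suffix)
    next: HashMap<u8, usize>, // outgoing transitions, one per byte
}

struct SuffixAutomaton {
    states: Vec<State>,
    last: usize,
}

impl SuffixAutomaton {
    fn build(corpus: &[u8]) -> Self {
        let mut sa = SuffixAutomaton {
            states: vec![State { len: 0, link: None, next: HashMap::new() }],
            last: 0,
        };
        for &b in corpus {
            sa.extend(b);
        }
        sa
    }

    /// Standard online suffix-automaton construction, one byte at a time.
    fn extend(&mut self, b: u8) {
        let cur = self.states.len();
        let cur_len = self.states[self.last].len + 1;
        self.states.push(State { len: cur_len, link: None, next: HashMap::new() });

        let mut p = Some(self.last);
        while let Some(v) = p {
            if self.states[v].next.contains_key(&b) { break; }
            self.states[v].next.insert(b, cur);
            p = self.states[v].link;
        }

        match p {
            None => self.states[cur].link = Some(0),
            Some(v) => {
                let q = self.states[v].next[&b];
                if self.states[v].len + 1 == self.states[q].len {
                    self.states[cur].link = Some(q);
                } else {
                    // Clone q so the automaton stays deterministic and minimal.
                    let clone = self.states.len();
                    let cloned = State {
                        len: self.states[v].len + 1,
                        link: self.states[q].link,
                        next: self.states[q].next.clone(),
                    };
                    self.states.push(cloned);
                    let mut w = Some(v);
                    while let Some(s) = w {
                        if self.states[s].next.get(&b) != Some(&q) { break; }
                        self.states[s].next.insert(b, clone);
                        w = self.states[s].link;
                    }
                    self.states[q].link = Some(clone);
                    self.states[cur].link = Some(clone);
                }
            }
        }
        self.last = cur;
    }

    /// For each query position, the length of the longest suffix ending there
    /// that occurs verbatim in the corpus. Cost per byte is independent of corpus size.
    fn match_lengths(&self, query: &[u8]) -> Vec<usize> {
        let (mut state, mut len) = (0usize, 0usize);
        let mut out = Vec::with_capacity(query.len());
        for &b in query {
            // Back off along suffix links until the match can be extended.
            while state != 0 && !self.states[state].next.contains_key(&b) {
                state = self.states[state].link.unwrap();
                len = self.states[state].len;
            }
            if let Some(&nxt) = self.states[state].next.get(&b) {
                state = nxt;
                len += 1;
            } else {
                len = 0; // no match even from the root
            }
            out.push(len);
        }
        out
    }
}

fn main() {
    let sa = SuffixAutomaton::build(b"hello world");
    println!("{:?}", sa.match_lengths(b"o worx"));
}
```

Streaming the query "o worx" through the automaton for "hello world" yields match lengths [1, 2, 3, 4, 5, 0]: the match grows while the query tracks a real substring of the corpus and resets to zero when it diverges. The compacted, token-level, memory-optimized version of this idea is what RUSTY-DAWG runs over all of The Pile.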

The Metrics: \(n\)-Novelty and NNSL

Using RUSTY-DAWG, the researchers calculate two vital metrics for every generated text:

  1. \(n\)-novelty: The percentage of \(n\)-grams in the generated text that do not exist in the training data: \( \text{novelty}_n = \frac{\#\{\text{novel generated } n\text{-grams}\}}{\#\{\text{generated } n\text{-grams}\}} \).
  2. Non-Novel Suffix Length (NNSL): At every token position in a generated document, what is the length of the longest suffix ending at that position that appears verbatim in the training data?

If the NNSL at a specific position is 100, it means the last 100 tokens the model generated were a direct copy from a training document.
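Given per-position match lengths like those produced by the earlier sketch, the summary statistics reported in the paper (mean and max NNSL) are simple aggregates. A minimal sketch, with hypothetical helper names:

```rust
/// `lengths[i]` is the longest suffix ending at position i found in the training data.
fn mean_nnsl(lengths: &[usize]) -> f64 {
    if lengths.is_empty() {
        return 0.0;
    }
    lengths.iter().sum::<usize>() as f64 / lengths.len() as f64
}

/// The longest verbatim copy anywhere in the document.
fn max_nnsl(lengths: &[usize]) -> usize {
    lengths.iter().copied().max().unwrap_or(0)
}

fn main() {
    let lengths = vec![1, 2, 3, 100, 0, 1];
    println!("mean = {:.2}, max = {}", mean_nnsl(&lengths), max_nnsl(&lengths));
}
```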

Experiment: How Novel is AI Writing?

The researchers analyzed the Pythia suite of models (ranging from 70M to 12B parameters), all trained on The Pile. They generated thousands of documents and compared them against human-written text baselines.

Baseline 1: The “Valid” Set (A Cautionary Tale)

Initially, one might compare AI generation to the “Validation” set of The Pile—text held out from training. However, the researchers discovered that the validation set was heavily “contaminated.” Because the internet is repetitive, many documents in the validation set actually appeared in the training set (e.g., duplicate news articles or boilerplate text).

Baseline 2: Dolma (The Real Human Standard)

To get a fair comparison, they used text from Dolma (specifically Reddit and scientific papers) created after The Pile’s cutoff date. This ensures the text definitely wasn’t in the training data. This represents “natural” human novelty.

Finding 1: Models Copy Long Phrases More Than Humans

The first major finding reveals a dichotomy in how LLMs generate text compared to humans.

Figure 2: n-novelty curve for Pythia-12B versus Dolma and Validation sets.

Take a look at Figure 2. The x-axis represents the \(n\)-gram size (log scale), and the y-axis is the percentage of novelty.

  • The Green Line (Pythia-12B): Notice that for small \(n\) (1 to 4), the model is actually more novel than the human baseline (Dolma, dark grey line). This suggests models are excellent at mixing and matching individual words and short phrases in unique ways.
  • The Cross-over: However, as \(n\) gets larger (\(n > 4\)), the trend flips. The model’s text becomes less novel than human text.

What does this mean? While models are creative with word choice, they are statistically more likely than humans to regurgitate long, verbatim sequences (like 10-grams or 50-grams). Human writing naturally avoids long exact overlaps unless quoting; models drift into memorization.

Some of this non-novelty is structural. The authors found the models copying software licenses (Apache License) and code imports (Linux headers) verbatim thousands of times.

Finding 2: Size Matters

Does making a model “smarter” (larger) reduce copying? Surprisingly, no.

Figure 3: n-novelty and Mean NNSL across model sizes.

Figure 3 paints a clear picture.

  • Left (a): As you move from lighter blue lines (70M parameters) to the darkest blue line (12B parameters), the novelty curve drops.
  • Right (b): The Mean NNSL (average length of copied strings) increases linearly with the log of the model size.

The verdict: Larger models have a higher capacity for memorization. While they are more capable, they are also more prone to outputting long sequences directly from their training data.

The Impact of Decoding Strategies

Perhaps the most practical insight for users of LLMs comes from the analysis of “decoding strategies.” When an LLM predicts the next token, it assigns a probability to every word in its vocabulary. The “decoding strategy” is the rule we use to pick the next word from that list.

  • Stochastic Sampling (High Temperature): We pick randomly, but weighted by probability. This adds variety.
  • Greedy / Beam Search (Low Temperature): We consistently pick the most likely word.
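To make the contrast concrete, here is a small, self-contained Rust sketch (hypothetical helper names, not tied to any real decoding library): greedy decoding takes the argmax of the next-token distribution, while temperature sampling divides the logits by a temperature \(T\) before the softmax and then draws randomly from the resulting distribution. The random draw \(u\) is passed in explicitly so the sketch needs no dependencies; as \(T\) approaches 0, sampling collapses into greedy decoding.

```rust
/// Softmax over logits after dividing by a temperature.
/// Higher temperature flattens the distribution; T -> 0 approaches greedy argmax.
fn softmax_with_temperature(logits: &[f64], temperature: f64) -> Vec<f64> {
    let scaled: Vec<f64> = logits.iter().map(|l| l / temperature).collect();
    let max = scaled.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = scaled.iter().map(|l| (l - max).exp()).collect();
    let sum: f64 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// Greedy decoding: always pick the highest-probability token.
fn greedy(logits: &[f64]) -> usize {
    logits
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.partial_cmp(b.1).unwrap())
        .map(|(i, _)| i)
        .unwrap()
}

/// Temperature sampling: draw a token index according to the tempered distribution.
/// `u` is a uniform random number in [0, 1), supplied by the caller.
fn sample(logits: &[f64], temperature: f64, u: f64) -> usize {
    let probs = softmax_with_temperature(logits, temperature);
    let mut cumulative = 0.0;
    for (i, p) in probs.iter().enumerate() {
        cumulative += p;
        if u < cumulative {
            return i;
        }
    }
    probs.len() - 1
}

fn main() {
    // Toy next-token logits over a 4-word vocabulary.
    let logits = [2.0, 1.5, 0.5, -1.0];
    println!("greedy pick: {}", greedy(&logits));
    println!("sampled (T = 1.0): {}", sample(&logits, 1.0, 0.73));
    println!("sampled (T = 0.1): {}", sample(&logits, 0.1, 0.73));
}
```

At \(T = 1.0\) the same random draw can land on a lower-probability token; at \(T = 0.1\) the distribution is so sharp that virtually every draw returns the argmax, which is exactly why low-temperature decoding behaves like greedy decoding in the experiments below.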

The researchers found that constrained decoding kills novelty.

Figure 4: Impact of decoding choices on n-novelty.

In Figure 4, look at the bottom-left and bottom-right graphs:

  • Temperature (bottom-left): As the temperature drops (darker lines), novelty plummets. At Temperature 0 (Greedy), the model is essentially reciting training data.
  • Beam Search (bottom-right): Beam search, often used to make text more “coherent” or “accurate,” is the worst offender. With a beam size of 8, the novelty is near zero even for large \(n\)-grams.

Table 2 confirms this numerically. Using standard sampling, the maximum copied string length (Max NNSL) was 376 tokens. With beam search (beam size 8), the maximum copied length rose to 408. The model effectively output a massive block of text verbatim.

Table 2: NNSL results for Pythia-12B with different decoding strategies.

This suggests that the “most likely” path through a language model is often just a path through its training data. If you want original content, you must use stochastic sampling (higher temperature).

Why Do Models Copy? The Frequency Factor

Why does the model default to copying? The researchers hypothesized that sequences appearing frequently in the training data are “easier” for the model to predict.

They defined a metric called Mean Completion Loss. Lower loss means the model is less “surprised” by a sequence and predicts it with higher confidence.
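The paper has its own precise formulation, but roughly, a completion loss of this kind is the average negative log-probability the model assigns to the gold tokens of an \(n\)-gram. A minimal sketch, assuming we already have the model’s probability for each token of the completion:

```rust
/// Mean completion loss: average negative log-probability (in nats) that the model
/// assigns to each gold token of a completion. Lower loss = less "surprised".
fn mean_completion_loss(token_probs: &[f64]) -> f64 {
    assert!(!token_probs.is_empty());
    let total: f64 = token_probs.iter().map(|p| -p.ln()).sum();
    total / token_probs.len() as f64
}

fn main() {
    // A phrase the model has effectively memorized vs. one it has not.
    let memorized = [0.95, 0.90, 0.98, 0.97];
    let unseen = [0.20, 0.05, 0.30, 0.10];
    println!("memorized: {:.3} nats/token", mean_completion_loss(&memorized));
    println!("unseen:    {:.3} nats/token", mean_completion_loss(&unseen));
}
```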

Figure 5: n-gram completion loss based on presence in train and frequency.

Figure 5 validates this hypothesis perfectly:

  • Graph (a): The blue line (sequences in the training set) has significantly lower loss than the red line (sequences not in training). The model is much more confident when it is traversing a path it has seen before.
  • Graph (b): The loss drops as the frequency of the \(n\)-gram increases. The more often a phrase appeared in The Pile, the more likely the model is to output it verbatim.

This creates a self-reinforcing loop. Boilerplate text, disclaimers, and famous quotes appear frequently in training. The model learns these with low loss. When you run the model with “Greedy” decoding, it gravitates toward these low-loss paths, resulting in verbatim copying.

Conclusion: The Memory vs. Creativity Trade-off

The introduction of RUSTY-DAWG allows us to monitor LLM behavior with a precision that was previously impossible. By indexing the 334 billion tokens of The Pile, the authors revealed the nuanced reality of AI generation.

LLMs are not simply stochastic parrots, nor are they purely creative engines. They exist in a superposition of both states, governed by their parameters and settings:

  1. Small scale: They are highly novel, mixing concepts uniquely at the 1-4 word level.
  2. Large scale: They are prone to regurgitating long sequences (50+ words), significantly more so than human writers.
  3. Settings matter: The more you constrain the model (via low temperature or beam search) to be “accurate,” the more it reverts to memorization.

For developers and researchers, this highlights a critical trade-off. If you need an LLM to be factual and coherent (often achieved via Beam Search), you risk copyright infringement and data regurgitation. If you want novelty, you must accept the randomness of high-temperature sampling.

As we move forward into an era of scrutiny regarding AI training data, tools like RUSTY-DAWG will be essential. They provide the “plagiarism checker” for the age of AI—one that can actually keep up with the scale of the internet.