Artificial Intelligence has stormed the castle of creativity. From DALL-E generating surrealist art to ChatGPT penning sonnets, the line between human and machine creativity is blurring. But when you ask an LLM to write a poem, is it actually being creative? Or is it simply a “stochastic parrot,” reshuffling lines it memorized during training?

For years, the gold standard for evaluating AI art has been the Turing Test: Can a human tell if this poem was written by a machine? If the answer is “no,” we assume the model is successful.

However, a new research paper titled “Evaluating Diversity in Automatic Poetry Generation” argues that this metric is dangerously incomplete. A model could pass a Turing Test by simply copying existing human poems or by producing safe, generic verses that sound “human enough” but lack originality.

In this post, we will dive deep into this research, which proposes a new way to judge AI poets: Diversity. We will look at how researchers analyzed 36 different models—from massive LLMs like LLaMA-3 to specialized poetry systems—to see if they can truly match the structural, lexical, and semantic variety of human creativity.

The Problem with “Good” Poetry

Imagine a poet who writes technically perfect sonnets, but every single one is about a “red rose” and uses the exact same rhyme scheme. Technically, they are good poems. Artistically, the poet is boring.

This is the trap of current AI evaluation. Most research focuses on quality (grammar, flow, meaning) but neglects diversity (variety in structure, vocabulary, and theme).

The researchers behind this paper set out to answer a fundamental question: Do AI models generate poetry that covers the full “bandwidth” of human creativity, or do they collapse into a narrow, repetitive range?

To answer this, they moved away from asking humans to rate single poems. Instead, they performed a distributional analysis. They compared the statistical distribution of thousands of generated poems against thousands of human poems across four key dimensions:

  1. Memorization: Is the model plagiarizing?
  2. Structure: Do the poems look right (length and meter)?
  3. Rhyme: Can the model actually rhyme, and does it vary its patterns?
  4. Lexical & Semantic: Is the vocabulary rich? Are the themes varied?

The Contenders: Models and Datasets

Before we look at the results, we need to understand who is competing in this “poetry slam.” The researchers tested a massive array of models, categorized into two main families:

  1. Poetry-Specific Models: These are older architectures (like LSTMs) explicitly designed with “rhyme modules” and “meter modules.” Examples include DeepSpeare and Structured Adversary (SA).
  2. General Purpose LLMs: These are the Transformer models we know today. The study looked at both Word-level models (GPT-2, GPT-Neo, LLaMA-2, LLaMA-3) and Character-level models (ByGPT5).

Crucially, the LLMs were tested in two modes:

  • Unconditioned: The model is just asked to generate text.
  • Style-Conditioned: The model is trained with special tags (like <RHYME>) so it knows it is supposed to be writing poetry.
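
To make this concrete, here is a minimal sketch of what style conditioning might look like in the training data. The exact tag vocabulary and formatting used in the paper are not spelled out here, so the helper below is purely illustrative.

```python
# Hypothetical illustration of style conditioning: prepend a control tag
# (such as <RHYME>) so the model learns to associate the tag with the form.
# The paper's actual tag set and formatting may differ.
def format_conditioned_example(quatrain: str, rhyme_scheme: str) -> str:
    """Wrap a training quatrain with a rhyme-scheme control tag."""
    return f"<RHYME={rhyme_scheme}>\n{quatrain}"

poem = "The rose is red,\nThe sky is blue,\nI lost my thread,\nAnd so did you."
print(format_conditioned_example(poem, "ABAB"))
```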

Table 2: Models used in this work. The ‘Smaller’ and ‘Larger’ columns display the sizes of the models considered.

As shown in Table 2 above, the study covered both English (EN) and German (DE), providing a robust cross-lingual analysis. The researchers used two datasets, QuaTrain (quatrains, i.e., four-line stanzas) and SonNet (sonnets), for training and evaluation.

Dimension 1: Memorization (The Copycat Test)

The first hurdle for any creative AI is originality. If a model simply regurgitates a stanza from Shakespeare or Goethe, it has failed at creativity.

The researchers used the Ratcliff-Obershelp similarity metric to catch plagiarism. If a generated quatrain was ≥70% similar to a training example, it was flagged as memorized.
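
Python's difflib implements a variant of Ratcliff-Obershelp, so a minimal version of this check might look like the sketch below. The preprocessing details (casing, tokenization, how training examples are indexed) are assumptions, not taken from the paper.

```python
from difflib import SequenceMatcher

def is_memorized(generated: str, training_quatrains: list[str], threshold: float = 0.7) -> bool:
    """Flag a generated quatrain as memorized if its Ratcliff-Obershelp
    similarity to any training quatrain reaches the threshold."""
    return any(
        SequenceMatcher(None, generated, original).ratio() >= threshold
        for original in training_quatrains
    )
```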

The Finding: Surprisingly, memorization is not a major issue. Most models had near-zero memorization rates at the quatrain level. However, a pattern emerged:

  • Larger models memorize more. A larger parameter count (like LLaMA-2 13B) allows the model to store more specific training data than a smaller model.
  • Conditioning helps. When models are explicitly trained to follow a style, they tend to copy less than when they are just predicting the next word unprompted.

Dimension 2: Structure and Length

A quatrain should look like a quatrain. It shouldn’t be two words long, nor should it be a paragraph.

To measure this, the researchers compared the length distribution (number of tokens) of generated poems against human poems. They used a metric called Histogram Intersection: a score of 1.0 means the model’s length distribution perfectly overlaps with humans; 0.0 means they are completely different.
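
A minimal sketch of histogram intersection over token-length distributions is shown below; the bin count and normalization are assumptions, not details from the paper.

```python
import numpy as np

def histogram_intersection(human_lengths, model_lengths, bins=30):
    """Overlap between two length distributions: 1.0 = identical, 0.0 = disjoint."""
    lo = min(min(human_lengths), min(model_lengths))
    hi = max(max(human_lengths), max(model_lengths))
    hist_h, edges = np.histogram(human_lengths, bins=bins, range=(lo, hi))
    hist_m, _ = np.histogram(model_lengths, bins=edges)
    # Normalize to probability mass so the intersection is bounded by 1.0.
    p = hist_h / hist_h.sum()
    q = hist_m / hist_m.sum()
    return float(np.minimum(p, q).sum())
```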

Figure 3: Length distribution of human poems (left), SA (middle) and GPTNeo_L (right) for English.

In Figure 3 above, we see three graphs representing the distribution of poem lengths:

  • (a) Human: Notice the bell curve. Humans write poems of varying lengths, usually centered around 25-30 tokens.
  • (b) SA (Poetry Specific): This model captures the human distribution almost perfectly. It understands the “shape” of a poem.
  • (c) GPTNeo (LLM): This model is too rigid. The spike is very narrow, meaning it produces poems of almost exactly the same length every time. It lacks structural diversity.

Key Takeaway: General LLMs often struggle to match the natural variance of human poem lengths, producing output that is too short or too uniform.

Dimension 3: Rhyme (The Achilles’ Heel of LLMs)

This is where the results get fascinating. Rhyming is notoriously difficult for standard LLMs (like GPT-4 or LLaMA) because they see text as “tokens” (chunks of words), not as individual letters/phonemes. They often don’t “know” that cat rhymes with hat just by looking at the token IDs.

The researchers classified the rhyme schemes of generated poems (e.g., AABB, ABAB, or ABCD—which means no rhyme).
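
The paper relies on a trained rhyme detector; as a rough intuition for how a scheme label like AABB or ABCD gets assigned, here is a deliberately naive stand-in that treats two lines as rhyming when their final words share a three-letter suffix.

```python
def naive_rhyme_scheme(quatrain_lines):
    """Assign a scheme label (e.g. AABB, ABAB, ABCD) from crude suffix matching.
    Illustrative only; the study uses a proper rhyme classifier."""
    endings = [line.strip().rstrip(".,!?;:").split()[-1].lower()[-3:] for line in quatrain_lines]
    labels, scheme = {}, ""
    for suffix in endings:
        if suffix not in labels:
            labels[suffix] = "ABCD"[len(labels)]
        scheme += labels[suffix]
    return scheme

print(naive_rhyme_scheme([
    "The bells are ringing",
    "the birds are singing",
    "the day is bright",
    "the sun gives light",
]))  # -> AABB
```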

The Human Baseline

First, let’s look at what humans do.

Figure 2: Distribution of rhyme schemes in (a) the human data, and the samples from the (b) best, (c) worst, and (d) average models.

In Figure 2(a) (Human), you can see that humans use a mix of schemes. There is a lot of AABB and ABAB. The “ABCD” bar (no rhyme) is relatively low.

The Failure of Unconditioned LLMs

Now, look at Figure 2(c), representing the “Worst” model (an unconditioned GPT model). The ABCD bar is massive. This means the model almost never rhymes. It just writes four lines of prose.

This failure is systemic across unconditioned LLMs. Look at the grid of charts below for English unconditioned models:

Figure 6: Rhyme distribution plots for samples generated by English unconditioned large language models.

In Figure 6, notice how almost every chart (GPT2, GPTNeo, LLaMA2) is dominated by the bar on the far right (ABCD). These powerful “intelligent” models are essentially incapable of spontaneous rhyming without specific fine-tuning.

The Success of Conditioning and DeepSpeare

However, not all models failed.

  1. DeepSpeare: As a poetry-specific model, it was explicitly designed to rhyme.
  2. Conditioned LLMs: When the researchers fine-tuned the LLMs with style tags, performance improved drastically.

Figure 8: Rhyme distribution plots for samples generated by English conditioned large language models.

Compare Figure 8 (above) to Figure 6. In Figure 8, the ABCD bars on the right have shrunk significantly. The models are now attempting patterns like AABB and ABAB.

The Surprise Winner: The Character-Level Model (ByGPT5). Because ByGPT5 processes text character-by-character rather than by whole words, it can “see” the spelling. It learns that words ending in “-ing” likely rhyme. Consequently, character-level models produced significantly better rhyming diversity than their word-level counterparts.
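
The difference is easy to see if you compare what each model "sees." The token IDs below are made up for illustration; they are not real vocabulary entries from any of the tested models.

```python
# Word-level view: rhyming words map to unrelated integer IDs (hypothetical values).
word_vocab = {"singing": 4821, "ringing": 9307}
print(word_vocab["singing"], word_vocab["ringing"])   # 4821 9307 -- no visible relation

# Character-level view: the shared "-ing" suffix is directly observable.
print(list("singing")[-3:], list("ringing")[-3:])     # ['i', 'n', 'g'] twice
```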

Dimension 4: Lexical and Semantic Diversity

Finally, the researchers asked: Is the AI using the same words over and over?

They used metrics like MATTR (Moving Average Type-Token Ratio) to measure vocabulary variety and Sentence-BERT to measure how similar the meanings of the poems were.
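
MATTR is simple enough to sketch in a few lines: slide a fixed-size window over the token stream, compute the type-token ratio in each window, and average. The window size of 50 below is an assumption, not necessarily the setting used in the paper.

```python
def mattr(tokens, window=50):
    """Moving-Average Type-Token Ratio: average the unique-token ratio over
    a sliding window so longer texts are not unfairly penalized."""
    if len(tokens) <= window:
        return len(set(tokens)) / len(tokens)
    ratios = [
        len(set(tokens[i:i + window])) / window
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)
```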

Table 5: Lexical diversity metrics for German and English models.

Table 5 (above) reveals a “Quality-Diversity Trade-off.”

  • Local Diversity: Inside a single poem, LLMs (like LLaMA-2) are actually more lexically diverse than humans (higher ATTR scores). They use a wide range of vocabulary.
  • Global Diversity: Across all generated poems, however, they fall short. They tend to recycle the same “creative” tropes.
  • Model Size: Larger models (Large vs. Small) generally had better vocabulary diversity.

The semantic analysis (Table 6 below) reinforced this. The goal here is a lower similarity score (which implies higher diversity).

Table 6: Average maximum semantic similarity values for German and English.

Notice that no model achieves the diversity of humans. The “Human” scores (top row) for ‘Within’ similarity are lower than any AI model. This means human poets write about a much wider range of topics and feelings than even the best AI, which tends to cluster around similar semantic spaces.
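
As a sketch of how such an "average maximum similarity" score can be computed with Sentence-BERT embeddings, see below; the specific encoder name is an assumption, and the paper's exact setup (which model, which poem pairs are compared) may differ.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def avg_max_semantic_similarity(poems, model_name="all-MiniLM-L6-v2"):
    """For each poem, find its most similar neighbour within the same set,
    then average those maxima. Lower values mean a more varied corpus."""
    model = SentenceTransformer(model_name)
    emb = model.encode(poems, convert_to_numpy=True, normalize_embeddings=True)
    sims = emb @ emb.T                   # cosine similarity (embeddings are unit-normalized)
    np.fill_diagonal(sims, -np.inf)      # exclude self-similarity
    return float(sims.max(axis=1).mean())
```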

The Trade-off: Quality vs. Diversity

The researchers didn’t just trust the numbers; they also asked humans to read the poems. They compared the most “diverse” models against human poetry.

Table 13: 5 selected English quatrains rated as best in our human evaluation.

Table 13 shows some of the best-rated quatrains. The evaluation revealed a harsh truth: The most diverse model (Structured Adversary - SA) was often the worst in quality. It achieved diversity by being incoherent.

Conversely, the massive LLMs (like LLaMA-3) wrote coherent, grammatically perfect text, but were less adventurous.

Summary of Results

The researchers aggregated all these rankings to find an overall “winner.”

Table 3: Average metrics for different model type aggregations.

Table 3 provides the snapshot of the entire study:

  • Rhyme: Poetry-specific and Character-level models dominate.
  • Semantic/Lexical: Larger LLMs have the edge in vocabulary size but still lack human-level thematic variety.
  • Conditioning: This is the magic bullet. Conditioned models (telling the AI “Write a Rhyme”) outperform unconditioned models across almost every diversity metric.

Conclusion: The “Stochastic Parrot” is Still Molting

This research is a wake-up call for Generative AI. While we marvel at the fluency of tools like ChatGPT, this study highlights that fluency is not creativity.

Current AI models are “under-diverse.” They:

  1. Struggle to match the structural variety of human poems.
  2. Fail to rhyme naturally without explicit conditioning or character-level architecture.
  3. Write about a narrower range of topics than humans.

The most promising path forward appears to be Character-level Style-Conditioned LLMs (like the ByGPT5 model tested here). These models combine the linguistic power of transformers with the granular control of seeing individual characters, allowing for high diversity in both form and content.

Until AI can break out of its statistical safe zones, it remains a talented mimic rather than a true poet. The next time you read an AI poem, don’t just ask “Does this make sense?” Ask yourself: “Is this something new?”