In the world of Artificial Intelligence, we have become very good at generating text. Models like GPT-4 and LLaMA-2 can write poetry, code, and short stories with ease. However, evaluating that text remains a massive hurdle. In objective tasks like translation or summarization, we have ground truths to compare against. But what about creative writing?
If I write a story with a tragic, ambiguous ending, is it “good”? One reader might call it “poignant and realistic,” while another dismisses it as “depressing and unsatisfying.”
This subjectivity is the core problem addressed in the research paper “Learning Personalized Story Evaluation.” The researchers introduce PERSE, a framework designed to move away from “one-size-fits-all” metrics and toward personalized, interpretable evaluation.
In this post, we will break down why traditional evaluation fails for open-ended tasks, how the PERSE framework models individual preferences, and why this matters for the future of generative AI.
The Subjectivity Problem
Traditional automated metrics, such as BLEU or ROUGE, rely on lexical similarity—checking how many words in the model’s output overlap with a human-written reference. In creative writing, this approach is fundamentally flawed. A story can use entirely different words than the reference and still be excellent.
More recently, researchers have started using Large Language Models (LLMs) as judges. You feed a story to GPT-4 and ask, “Is this good?” While better than word-counting, this introduces a new bias: the “Generic Reviewer.” LLMs, trained to be helpful and harmless, tend to provide safe, averaged-out feedback. They struggle to account for the diversity of human taste.
Consider the example below from the researchers’ study:

In Figure 1, an LLM generates two different plots based on the premise of an artist struggling with emotional aftermath.
- Alice prefers Plot A because she likes uplifting endings.
- Bob prefers Plot B because he values complexity and empathy, even if it’s sadder.
A generic evaluation metric cannot satisfy both Alice and Bob. To truly evaluate open-ended generation, an AI judge needs to understand who it is evaluating for. This is where PERSE enters the picture.
The PERSE Framework
PERSE is a personalized, interpretable evaluation framework. The goal is to create a model that can look at a user’s history (what they liked and disliked in the past) and predict how they would rate a new piece of content.
The researchers built PERSE by fine-tuning LLaMA-2 (both 7B and 13B parameter versions). The model operates in two specific modes: Scalar Rating and Pairwise Rating.
1. Scalar Rating
In this mode, the model is given a single piece of text (a query) and a Reviewer Profile. The profile consists of a few historical reviews (plots the user read, the comments they wrote, and the scores they gave).
PERSE must analyze the profile to infer the user’s implicit preferences (e.g., “This user hates clichés” or “This user loves horror”). It then generates:
- A personalized score (1-10).
- A detailed textual explanation justifying the score.
2. Pairwise Rating
Here, the model is given two different texts (Text A and Text B) and asked to compare them based on specific aspects like Interestingness, Surprise, or Character Development. Again, this is done through the lens of the specific reviewer’s profile.

As shown in Figure 2, the architecture is designed to make the evaluation interpretable. It doesn’t just output a number; it outputs reasoning. For example, in the bottom half of the figure, the model determines that for this specific user, Text A is more interesting, but Text B has better character development.
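To make the two modes concrete, here is a minimal sketch of how the inputs and outputs could be represented. The class and field names are illustrative assumptions rather than the paper’s actual code; only the overall shape (a reviewer profile built from past reviews, a scalar score with an explanation, and per-aspect pairwise preferences) follows the description above.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Review:
    plot: str      # a plot the reviewer previously read
    comment: str   # the free-text review they wrote about it
    score: int     # the 1-10 rating they gave

@dataclass
class ReviewerProfile:
    past_reviews: List[Review]   # a few historical reviews (the "profile")

@dataclass
class ScalarJudgment:
    explanation: str   # personalized textual justification
    score: int         # predicted personalized score (1-10)

@dataclass
class PairwiseJudgment:
    aspect: str        # e.g. "Interestingness" or "Character Development"
    preferred: str     # "A" or "B", judged through this reviewer's lens
```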
The Input Structure
To make this work, the prompt engineering is crucial. The model isn’t just asked to “guess the score.” It is fed a structured prompt containing the instruction, the reviewer profile (historical examples), and the new query.

Figure 9 illustrates the prompt format. Notice how the model is explicitly instructed to “discern the reviewer’s preferences” from the examples provided (in green) before generating the new review (in blue) and score (in orange).
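Based on that description of Figure 9, the scalar-rating prompt might be assembled roughly as follows, reusing the `ReviewerProfile` sketch from above. The exact instruction wording and delimiters here are assumptions for illustration; the paper’s real template will differ in detail.

```python
def build_scalar_prompt(profile: ReviewerProfile, new_plot: str) -> str:
    """Assemble the structured prompt: instruction, reviewer profile
    (historical examples), then the new query to be reviewed."""
    instruction = (
        "Below are several reviews written by one reviewer. Discern the "
        "reviewer's preferences from these examples, then write a review "
        "and a 1-10 score for the new plot as this reviewer would."
    )
    history = "\n\n".join(
        f"Plot: {r.plot}\nReview: {r.comment}\nScore: {r.score}"
        for r in profile.past_reviews
    )
    return (
        f"{instruction}\n\n"
        f"### Reviewer profile\n{history}\n\n"
        f"### New plot\n{new_plot}\n\n"
        f"### Review and score:"
    )
```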
The Data Challenge: Contamination and Memorization
One of the most interesting technical challenges discussed in this paper is data contamination.
The researchers wanted to train PERSE using movie reviews (the MPST dataset). However, LLMs like LLaMA-2 and GPT-4 have been pre-trained on the internet, which includes IMDb and Wikipedia. If you ask an LLM to rate the plot of The Godfather, it likely won’t actually “evaluate” the text you provide; it will simply recall that The Godfather is a masterpiece and give it a 10/10.
This memorization makes evaluation unreliable. The model isn’t learning to align with a user’s taste; it’s just retrieving facts from its training data.
The Solution: Anonymization and Summarization
To fix this, the researchers created a data processing pipeline to scrub the identity of the movies.

As outlined in Figure 8, the process involves two steps using an intermediate LLM (oasst-30b):
- Anonymization: Replacing specific character names (e.g., “Luke Skywalker”) with generic ones (e.g., “The young pilot”).
- Summarization: Condensing the plot to remove recognizable, minute details while keeping the narrative arc intact.
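A rough sketch of this two-step scrubbing pipeline is shown below. The `generate` argument stands in for whatever text-generation interface the intermediate LLM (oasst-30b in the paper) sits behind, and the prompt wording is an assumption; only the two-step structure mirrors the paper.

```python
def scrub_plot(plot: str, generate) -> str:
    """Strip a plot of its identity in two LLM passes so that an evaluator
    cannot rely on memorized knowledge of the movie."""
    # Step 1: Anonymization - replace specific character names with
    # generic descriptions (e.g., "Luke Skywalker" -> "the young pilot").
    anonymized = generate(
        "Rewrite the following plot, replacing every character name with a "
        "generic description of that character. Change nothing else.\n\n"
        + plot
    )
    # Step 2: Summarization - condense the plot, dropping recognizable
    # minute details while keeping the narrative arc intact.
    return generate(
        "Summarize the following plot in a few paragraphs, keeping the main "
        "narrative arc but omitting minor identifying details.\n\n"
        + anonymized
    )
```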
By transforming the dataset this way, they created Per-MPST (Personalized Movie Plot Synopses). In this new dataset, an LLM cannot rely on memory; it must read the plot and apply the user’s historical preferences to generate a score.
Experiments and Key Results
The team compared PERSE (based on LLaMA-2 7B and 13B) against several baselines, including:
- Reviewer Avg: Simply predicting the user’s average historical score.
- Vanilla LLaMA-2: The base model without specific instruction tuning.
- GPT-4: The industry standard for zero-shot reasoning.
Scalar Rating Performance
The results for predicting specific ratings (1-10) were compelling. We measure success using Pearson and Kendall correlations—statistical ways to check if the predicted scores move up and down in sync with the actual human scores.
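If you want to run the same checks on your own predictions, both correlations are one-liners in scipy; the snippet below also includes the “Reviewer Avg” baseline of simply predicting each user’s mean historical score (the data and variable names are illustrative).

```python
import numpy as np
from scipy.stats import pearsonr, kendalltau

def reviewer_avg_prediction(historical_scores):
    """'Reviewer Avg' baseline: predict the user's mean past score."""
    return float(np.mean(historical_scores))

def correlation_report(predicted, actual):
    """How well do predicted scores move in sync with human scores?"""
    return {
        "pearson": pearsonr(predicted, actual)[0],
        "kendall": kendalltau(predicted, actual)[0],
    }

# Toy example: four held-out reviews from one user.
human_scores = [8, 4, 9, 5]
model_scores = [7, 5, 9, 6]
print(correlation_report(model_scores, human_scores))
```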

Table 2 shows the results on the Per-MPST dataset.
- PERSE-13b achieves the highest correlation (0.345 Pearson), significantly outperforming GPT-4.
- The vanilla LLaMA models perform poorly, often worse than the simple “Reviewer Avg” baseline. This highlights that scale alone doesn’t help if the model isn’t tuned for personalization.
- GPT-4 performs decently but struggles to fully align with specific user idiosyncrasies compared to the fine-tuned PERSE.
Pairwise Rating Performance
In the second task, derived from the Per-DOC dataset (stories generated from outlines), the model had to judge which of two stories was better across five specific dimensions: Interestingness, Adaptability, Surprise, Character, and Ending.

Table 4 shows a near-clean sweep: PERSE-13b achieves the highest accuracy in almost every category. Particularly notable is “Interestingness”, a highly subjective dimension, where PERSE scores 62.1% accuracy compared to GPT-4’s 50.2%.
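Accuracy in this table is simply the fraction of pairs where the model’s preferred story matches the reviewer’s actual choice, computed separately for each aspect. A minimal sketch (the data layout is an assumption):

```python
from collections import defaultdict

ASPECTS = ["Interestingness", "Adaptability", "Surprise", "Character", "Ending"]

def pairwise_accuracy(judgments):
    """judgments: iterable of (aspect, model_choice, human_choice) tuples,
    where each choice is "A" or "B"."""
    correct, total = defaultdict(int), defaultdict(int)
    for aspect, model_choice, human_choice in judgments:
        total[aspect] += 1
        correct[aspect] += int(model_choice == human_choice)
    return {a: correct[a] / total[a] for a in ASPECTS if total[a]}
```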
Why Does PERSE Beat GPT-4?
The researchers suggest that RLHF (Reinforcement Learning from Human Feedback), which is used to train models like GPT-4, pushes models toward a “safe center.” GPT-4 is hesitant to give very low scores or harsh critiques because it is aligned to be polite.
Real human reviewers, however, can be grumpy, niche, or highly critical. PERSE, because it is instruction-tuned on specific reviewer profiles, is willing to be “mean” if the user profile suggests a critical personality.
Look at the example below (Figure 6) to see this difference in action:

In this case:
- The Reviewer Profile shows a user who likes “weird little thrillers” and gives varied scores (10 and 7).
- The Query is a plot about a financial whiz and a legal drama.
- The Actual Human (Reference) gave it a 6, calling it a “time-waster.”
- GPT-4 gives it a generic positive review (Score 6), praising the “strong storyline.”
- LLaMA-2-70b (Vanilla) gives it a huge 9.
- PERSE gives it an 8, but notice the text. It captures the nuance that the movie is “not happy… but thought-provoking,” attempting to mimic the reviewer’s analytical style. While the score is slightly off, the qualitative alignment in the text generation is much closer to a personalized critique than the generic praise of other models.
Analysis: What Makes Personalization Work?
The researchers conducted several ablation studies to understand the factors driving PERSE’s performance.
1. The Value of History
How many past reviews does the model need to read to understand a user?

Figure 3 shows that for PERSE-13b (the blue bars on the far right), performance generally improves as you add more historical reviews (from K=1 to K=5). Interestingly, for the vanilla LLaMA models, adding more history actually hurts performance (the bars go down). This suggests that standard models get confused by too much context, whereas PERSE has effectively learned how to utilize that history to refine its judgment.
2. Robustness
Does the order of the historical reviews matter? If I show the model a 1-star review first versus a 10-star review, will it change the prediction?

Figure 4 demonstrates that PERSE (the blue and purple lines) is highly stable. The shaded area represents variance; PERSE has very little variance regardless of how the profile data is shuffled. Vanilla models (green lines) are highly sensitive to order, showing massive instability.
Conclusion and Implications
The PERSE framework represents a significant step forward in automated evaluation. By moving away from lexical overlap and generic “AI judgment,” it embraces the reality that quality in text generation is subjective.
The key takeaways from this work are:
- Personalization is measurable: We can train LLMs to accurately predict individual human preferences, outperforming even much larger models like GPT-4.
- Instruction Tuning is powerful: A smaller model (13B parameters) fine-tuned on high-quality, personalized data can beat a massive generalist model.
- Data hygiene matters: Evaluating LLMs on popular concepts (like movies) requires rigorous anonymization to prevent the model from cheating via memorization.
As we look toward a future where AI generates everything from novels to personalized marketing copy, frameworks like PERSE will be essential. They allow us to move beyond asking “Is this text good?” to asking the more important question: “Is this text good for you?”