Introduction

In the rapidly evolving landscape of Large Language Models (LLMs), generating text is only half the battle. The other half—and arguably the harder half—is evaluating that text. How do we know if a response is harmful, helpful, fluent, or consistent?

Traditionally, we relied on metrics like BLEU or ROUGE, which simply count word overlaps between a model’s output and a human reference. But these metrics are rigid; they fail to capture nuance or semantic meaning. Recently, the industry has shifted toward “LLM-as-a-Judge,” where we ask a powerful model like GPT-4 to score a response. While effective, this approach is incredibly expensive, slow, and relies heavily on the model’s ability to articulate a critique.

But what if a model knows a piece of text is bad before it even generates a single word of critique? What if the “gut feeling” of the model—its internal mathematical representation—is more accurate than its output?

This is the core question behind RepEval, a new evaluation framework introduced by researchers from Shanghai Jiao Tong University. RepEval demonstrates that the internal representations (hidden states) of LLMs contain rich, decisive information about text quality. By extracting these representations and projecting them mathematically, we can achieve evaluation results that correlate better with human judgment than even GPT-4, all while using significantly smaller models and minimal training data.

In this post, we will decode how RepEval works, the mathematics behind its projection strategy, and why looking “under the hood” of an LLM might be the future of AI evaluation.


Background

To understand RepEval, we first need to categorize how we currently evaluate text.

The Two Types of Evaluation

Text evaluation generally falls into two scenarios:

  1. Absolute Evaluation: A model looks at a single piece of text and assigns it a score based on criteria like fluency, coherence, or consistency.
  2. Pair-wise Evaluation: A model looks at two responses to the same prompt and decides which one is better. This is crucial for training methods like Reinforcement Learning from Human Feedback (RLHF).

The Limitation of Current Metrics

Reference-based metrics (like BLEU) require a human-written “perfect answer” to compare against, which is often unavailable in real-world chat scenarios. Reference-free metrics usually involve prompting an LLM to generate a score (e.g., “Rate this 1-5”).

The problem with prompting an LLM for a score is that it relies on the model’s generation capability. A smaller model (like a 7B parameter model) might “know” a sentence is incoherent but struggle to generate a structured critique or consistently output the correct number format. RepEval bypasses the generation phase entirely, tapping directly into the model’s understanding.


The Core Method: RepEval

The central thesis of RepEval is that high-quality text and low-quality text look different in the vector space of an LLM. If we can find the specific “direction” in that space that points from “bad” to “good,” we can measure any piece of text against it.

1. Collecting Representations

The first step is to treat the LLM not as a chatbot, but as a feature extractor. When we feed text into a decoder-only LLM (like Llama or Mistral), every token passes through a stack of decoder layers, and each layer produces a hidden-state vector for each token.

RepEval uses a prompt template to contextualize the input. For example, if we are evaluating fluency, we might wrap the hypothesis text (hyp) in a prompt like: “Is the following Hyp fluent? Hyp: [Insert Text]…”.

Figure 1: Pipeline for collecting representations with a decoder-only LLM and constructing the projection direction.

As shown in Figure 1 above, the input sequence passes through the decoder blocks. At a specific layer \(i\) and token \(k\), the model produces a hidden state vector, denoted as \(rep\). This vector is a dense list of numbers representing the model’s semantic understanding of the text at that moment.
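
To make this concrete, here is a minimal sketch of how one could extract such a representation with the Hugging Face transformers library. The model name, prompt wording, and layer/token choices are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch: extracting a hidden-state representation from a decoder-only LLM.
# Model name, prompt wording, and layer/token indices are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
model.eval()

def get_rep(hyp: str, layer: int = -10, token: int = -1) -> torch.Tensor:
    """Return the hidden state at a chosen layer and token as the text's representation."""
    prompt = f"Is the following Hyp fluent? Hyp: {hyp}"
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    # outputs.hidden_states is a tuple: (embedding layer, layer 1, ..., final layer)
    return outputs.hidden_states[layer][0, token, :]   # shape: (hidden_dim,)

rep = get_rep("The cat sat on on the mat mat.")
print(rep.shape)   # e.g. torch.Size([4096]) for a 7B model
```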

2. Converting Evaluation to Geometry

Once we have these representation vectors, RepEval treats evaluation as a geometry problem.

Absolute Evaluation

In absolute evaluation, we want to assign the text a single scalar quality score (Equation 1).

The researchers hypothesize that in the vector space, there is a specific direction vector \(\vec{d}\) that represents the property we are measuring (e.g., “Fluency”). If we project our text’s representation onto this vector, the resulting value is our score.

\[ s = rep^{T}\,\vec{d} \quad \text{(Equation 2)} \]

Here, \(rep^T\) is the transpose of the representation vector, and \(\vec{d}\) is the projection direction. The dot product gives us a scalar value—the score.
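
In code, once the representation and the direction are available as vectors, the score is a single dot product. A minimal sketch (NumPy arrays assumed):

```python
import numpy as np

def absolute_score(rep: np.ndarray, d: np.ndarray) -> float:
    """Project the representation onto the quality direction: s = rep^T d."""
    return float(rep @ d)
```

Texts whose representations lie further along \(\vec{d}\) receive higher scores, giving a continuous ranking without any generation step.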

Pair-wise Evaluation

In pair-wise evaluation, we have two responses, A and B. We can construct a representation for the scenario “A is better than B” (\(rep_{AB}\)) and “B is better than A” (\(rep_{BA}\)).

Figure 2: Evaluation process of absolute evaluation and pair-wise evaluation.

As illustrated in Figure 2, the goal is to determine if the vector suggests A is superior. We calculate the projection for both permutations.

\[ rep_{AB}^{T}\,\vec{d} \;>\; rep_{BA}^{T}\,\vec{d} \;\;\Rightarrow\;\; \text{A is better} \quad \text{(Equation 3)} \]

If the projection of \(rep_{AB}\) is greater, the model predicts A is better.
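
A minimal sketch of this decision rule, assuming \(rep_{AB}\) and \(rep_{BA}\) have been collected with a prompt that presents the two responses in each order (the exact prompt wording is an assumption):

```python
import numpy as np

def prefer_a(rep_ab: np.ndarray, rep_ba: np.ndarray, d: np.ndarray) -> bool:
    """Predict that A is better when the 'A before B' representation projects
    further along the quality direction than the reversed ordering."""
    return float(rep_ab @ d) > float(rep_ba @ d)
```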

3. Finding the “Magic” Direction (\(\vec{d}\))

The most critical part of RepEval is finding the vector \(\vec{d}\). How do we know which direction in the high-dimensional space corresponds to “Fluency” or “Honesty”?

The authors use Principal Component Analysis (PCA).

  1. Gather Samples: They take a very small number of samples (as few as 5 pairs) of “Good” text and “Bad” text.
  2. Calculate Difference: They compute the difference vectors: \(\Delta rep = rep_{good} - rep_{bad}\). This vector represents the shift required to turn a bad representation into a good one.
  3. Apply PCA: They perform PCA on these difference vectors to find the principal components—the directions where the data varies the most.

\[ \vec{d} = \sum_{j=1}^{m} w_j \vec{v}_j \quad \text{(Equation 4)} \]

where \(\vec{v}_j\) are the principal components of the difference vectors and \(w_j\) are their weights.

By summing the weighted principal components, they create the master direction vector \(\vec{d}\). This vector acts like a compass pointing toward “high quality.”

This method is unsupervised in the sense that it doesn’t require training a neural network or fine-tuning the LLM. It simply analyzes the geometry of a few examples.
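
A minimal sketch of the three steps above, using scikit-learn's PCA. The number of components and the explained-variance weighting are illustrative assumptions; the paper's exact weighting scheme may differ.

```python
import numpy as np
from sklearn.decomposition import PCA

def build_direction(good_reps: np.ndarray, bad_reps: np.ndarray, n_components: int = 3) -> np.ndarray:
    """Estimate a quality direction d from a few (good, bad) representation pairs.

    good_reps, bad_reps: arrays of shape (n_pairs, hidden_dim); row i of each forms a pair.
    """
    diffs = good_reps - bad_reps               # step 2: delta_rep = rep_good - rep_bad
    pca = PCA(n_components=n_components)       # step 3: principal components of the shifts
    pca.fit(diffs)
    weights = pca.explained_variance_ratio_    # weighting by explained variance is an assumption
    d = (weights[:, None] * pca.components_).sum(axis=0)
    return d / np.linalg.norm(d)               # only the direction matters, so normalize
```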


Visualization: Seeing the Semantic Shift

Does this vector math actually map to reality? The researchers visualized the representations using t-SNE (a technique for visualizing high-dimensional data) to see if prompts actually change the location of the text in the vector space.

Figure 4: t-SNE visualization of the representations after dimensionality reduction.

In Figure 4, we see representations of the same text samples under different prompts (Fluency vs. Coherence). The clear separation between the clusters (orange vs. beige) shows that the prompt effectively moves the representation into a different semantic region suited to that specific evaluation criterion.
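
A rough sketch of how such a plot could be produced with scikit-learn's t-SNE, assuming the two sets of representations (the same texts encoded under the two prompts) have already been collected:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_prompt_shift(reps_fluency: np.ndarray, reps_coherence: np.ndarray) -> None:
    """Reduce both sets of representations to 2D and color them by prompt."""
    all_reps = np.vstack([reps_fluency, reps_coherence])
    coords = TSNE(n_components=2, perplexity=15, random_state=0).fit_transform(all_reps)
    n = len(reps_fluency)
    plt.scatter(coords[:n, 0], coords[:n, 1], label="Fluency prompt")
    plt.scatter(coords[n:, 0], coords[n:, 1], label="Coherence prompt")
    plt.legend()
    plt.show()
```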


Experiments and Results

The researchers tested RepEval against state-of-the-art metrics, including GPT-4, on 14 datasets covering summarization, data-to-text generation, and dialogue.

Absolute Evaluation Results

The results for absolute evaluation (Fluency, Consistency, Coherence) are compelling. They compared RepEval (using the Mistral-7B model) against reference-free metrics like GPTScore and BARTScore, and reference-based metrics like BERTScore.

Table 1: Absolute Evaluation Results.

Key Takeaways from the Data:

  • Outperforming Giants: RepEval (using PCA with just 5 or 20 samples) frequently outperforms GPT-4 and GPT-3.5. For example, in the SummEval Coherence task (COH), RepEval achieves a correlation of 0.534, significantly higher than GPT-4’s 0.263.
  • Efficiency: RepEval achieves these results using a 7B-parameter model, while GPT-4 is widely believed to be orders of magnitude larger (its exact size is undisclosed). The computational savings are massive.
  • Hyp-Only: Interestingly, the “Hyp-only” column shows that even without a specific prompt, the raw representation of the text contains significant quality information, though prompts (“Prompt” column) usually enhance performance.

Pair-wise Evaluation Accuracy

For pair-wise tasks (choosing the better response), RepEval was tested on complex alignment benchmarks like MT Bench and HHH Alignment.

The accuracy formula used is standard:

\[ \text{Accuracy} = \frac{N_{\text{correct}}}{N_{\text{total pairs}}} \quad \text{(Equation 5)} \]
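
As a sketch, computing this over a labeled set of pairs (predicted and human-preferred winners as lists of "A"/"B" labels, both assumed available):

```python
def pairwise_accuracy(predictions: list[str], human_labels: list[str]) -> float:
    """Fraction of pairs where the predicted winner matches the human preference."""
    correct = sum(p == h for p, h in zip(predictions, human_labels))
    return correct / len(human_labels)
```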

The study found that RepEval consistently achieves higher accuracy than standard prompting. For instance, on the “HHH Alignment” dataset, RepEval achieved accuracies above 90%, competing with or beating Claude-3 and GPT-4 prompts. This confirms that while a small model might struggle to articulate why Response A is better than Response B, it statistically knows the answer in its hidden layers.

Is PCA Actually Doing Anything?

One might wonder if the direction \(\vec{d}\) is just random noise. To test this, the authors compared the PCA-derived direction against random vectors.

Figure 5: Random Test Results

Figure 5 shows the meta-evaluation results (correlation with human judgment). The box plots represent random vectors, hovering around zero correlation (or 0.5 accuracy for pair-wise). The distinct dots represent RepEval with PCA. The PCA results are outliers in the best possible way—consistently achieving high correlation where random vectors fail. This proves the PCA method is successfully isolating the “quality” signal.

The Secret of the Middle Layers

An unexpected finding in the paper is where the information lives. Intuitively, one might think the very last token of the very last layer holds the final “thought” of the model.

Figure 3: Correlation results for the absolute evaluation of fluency using RepEval with different token and position selections.

However, the heatmaps in Figure 3 reveal a different story. The highest correlations (darkest red) are often found in the middle to late layers, not the final layer (Layer -1).

Why? The authors suggest that the final layers of a decoder-only model are heavily optimized for predicting the next token. The middle layers, however, are where the model integrates context and semantic understanding. Therefore, grabbing representations from layers -5 to -15 often yields better evaluation signals than the final output layer.
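
In practice this suggests a small layer sweep: collect hidden states at every layer for a handful of validation texts, build a direction per layer, and keep the layer whose scores agree best with human ratings. A sketch (Spearman correlation is an assumption for the agreement measure; all inputs are assumed precomputed):

```python
import numpy as np
from scipy.stats import spearmanr

def best_layer(reps_by_layer: np.ndarray, dirs_by_layer: np.ndarray, human_scores: np.ndarray) -> int:
    """Pick the layer whose projected scores correlate best with human ratings.

    reps_by_layer:  (n_layers, n_texts, hidden_dim) hidden states for a validation set.
    dirs_by_layer:  (n_layers, hidden_dim) a projection direction built per layer.
    human_scores:   (n_texts,) human quality ratings for the same texts.
    """
    correlations = []
    for layer_reps, d in zip(reps_by_layer, dirs_by_layer):
        scores = layer_reps @ d                    # project each text onto the layer's direction
        rho, _ = spearmanr(scores, human_scores)   # agreement with human judgments
        correlations.append(rho)
    return int(np.argmax(correlations))
```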


Conclusion & Implications

RepEval represents a shift in how we think about “AI Evaluation.” Instead of treating LLMs as black boxes that we must chat with, we can treat them as transparent mathematical instruments.

Key Takeaways:

  1. Hidden States hold the truth: An LLM’s inability to write a good critique doesn’t mean it doesn’t understand the text. The understanding is locked in the vector representations.
  2. Geometry over Generation: By projecting these vectors onto a “quality direction” found via PCA, we get precise, continuous scores.
  3. Efficiency: We can achieve GPT-4 level evaluation performance using open-source, 7B parameter models, drastically reducing the cost and latency of evaluation.

For students and researchers, this opens up exciting possibilities. It suggests that future evaluation metrics might not be new models, but rather better ways of probing the models we already have. RepEval effectively turns the “black box” of LLMs into a transparent glass house, allowing us to see—and measure—exactly what the model is thinking.