How to Grade LLM Prompts Without an Answer Key: Introducing GLaPE

In the rapidly evolving world of Large Language Models (LLMs), finding the perfect prompt is akin to casting a magic spell. A slight change in phrasing—shifting from “Let’s think step by step” to “Take a deep breath and work this out”—can dramatically alter the accuracy of the model’s output.

This has given rise to Prompt Optimization, where researchers treat the LLM itself as an optimizer to hunt for the best possible instructions. However, there is a massive bottleneck in this process: Gold Labels.

Traditionally, to know if a prompt is “good,” you need to run it on a dataset where you already know the answers (the gold labels). You compare the LLM’s output to the correct answer, calculate accuracy, and score the prompt. But what happens when you want to optimize a prompt for a new task where you don’t have the answers yet? What if you are working with private data that has never been labeled?

This is the problem tackled by a fascinating new paper titled “GLaPE: Gold Label-agnostic Prompt Evaluation for Large Language Models.” The researchers propose a method to evaluate and optimize prompts without needing a single answer key.

The Problem: The “Gold Label” Bottleneck

Current state-of-the-art prompt optimization methods, such as OPRO (Optimization by PROmpting), work in a cycle. They generate a prompt, test it against a dataset with known answers, calculate an accuracy score, and then ask the LLM to generate a better prompt based on that score.
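
To make the bottleneck concrete, here is a minimal sketch of an OPRO-style loop. Every function and value here (propose_prompt, ask_llm, the toy questions) is a hypothetical stand-in, not the paper's or any library's actual API; the point is only that the scoring step needs gold labels.

```python
import random

# Hypothetical stand-ins for the two LLM roles; in practice these are API calls.
def propose_prompt(history):
    """Optimizer LLM: propose a new instruction given past (prompt, score) pairs."""
    return f"Let's think step by step. (variant {len(history)})"

def ask_llm(prompt, question):
    """Scorer LLM: answer a question under the given prompt (random toy stand-in)."""
    return random.choice(["31", "36"])

questions = ["How many lollipops does Oscar have left?"] * 5
gold_labels = ["31"] * 5          # <-- the bottleneck: we must already know these

history = []
for _ in range(10):
    prompt = propose_prompt(history)
    preds = [ask_llm(prompt, q) for q in questions]
    accuracy = sum(p == g for p, g in zip(preds, gold_labels)) / len(gold_labels)
    history.append((prompt, accuracy))

best_prompt, best_score = max(history, key=lambda ps: ps[1])
print(best_prompt, best_score)
```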

Figure 1: Sketch of prompt optimization utilizing the LLM as an optimizer, contrasting accuracy-based evaluation with GLaPE.

As shown in Figure 1 above, the standard path (a) relies heavily on comparing the model’s output to a “Gold Label Answer.” If the model answers “31” and the gold label is “31,” the prompt gets a score of 100. If it answers “36,” it gets a 0.

But look at path (b). This is the GLaPE method. It attempts to assign a quality score to the prompt (e.g., 87.9 or 45.7) without peeking at the answer key. If we can reliably calculate this score, we can optimize prompts for real-world scenarios where data is unlabeled and messy.

Background: The Intuition of Consistency

To understand how GLaPE works, we first need to understand the concept of Self-Consistency (SC).

Proposed in previous research (Wang et al., 2022), Self-Consistency relies on a simple intuition: Correct answers are usually more consistent than incorrect ones.

If you ask an LLM a complex math problem once, it might hallucinate. But if you ask it the same question 10 times using the same prompt:

  • If it answers “42” eight times, “43” once, and “12” once, the answer is likely 42.
  • The “consistency” here is 80% (or 0.8).

We can mathematically define Self-Consistency (\(SC\)) as the frequency of the most common answer (\(a\)) appearing in a set of sampled responses (\(r\)):

Equation for Self-Consistency (SC).
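
The equation image is not reproduced above, but based on that description, one natural way to write it (the paper's exact notation may differ) is:

\[
SC \;=\; \frac{1}{n} \sum_{k=1}^{n} \mathbf{1}\left[\, a(r_k) = \hat{a} \,\right],
\qquad
\hat{a} \;=\; \arg\max_{a}\; \bigl|\{\, k : a(r_k) = a \,\}\bigr|,
\]

where \(r_1, \dots, r_n\) are the responses sampled with the prompt, \(a(r_k)\) is the answer extracted from response \(r_k\), and \(\hat{a}\) is the most common answer. In code this is just a frequency count; a minimal sketch:

```python
from collections import Counter

def self_consistency(answers):
    """SC = share of sampled answers equal to the most common answer."""
    top_answer, top_count = Counter(answers).most_common(1)[0]
    return top_answer, top_count / len(answers)

# The toy example above: eight "42"s, one "43", one "12".
samples = ["42"] * 8 + ["43", "12"]
print(self_consistency(samples))  # ('42', 0.8)
```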

The Flaw in Self-Consistency

The researchers initially considered just using this SC score as a proxy for accuracy. The hypothesis was simple: Prompts that generate high-consistency answers are better prompts.

However, they ran into a problem. LLMs can be confidently wrong.

Figure 3: Scatter plot showing the relationship between Self-Consistency and Accuracy.

Figure 3 illustrates the “SC-Accuracy Graph.” If Self-Consistency were a perfect proxy for accuracy, we would see a straight diagonal line. Instead, we see a jagged, fluctuating mess.

Some prompts (like Prompt 3 in our upcoming examples) might lead the LLM to output the wrong answer over and over again. The SC score is high, but the accuracy is zero. Relying on SC alone creates a “blind spot” where we overestimate the quality of bad prompts.

The Core Method: GLaPE

To solve this, the researchers developed GLaPE (Gold Label-agnostic Prompt Evaluation). The method combines two critical strategies:

  1. Self-Consistency (SC) Evaluation: Measuring how stable a single prompt is.
  2. Mutual-Consistency (MC) Refinement: Checking if different prompts agree with each other.

Think of it like a peer review process. If one student (Prompt A) is consistently shouting the wrong answer, they might seem confident (high SC). But if five other students (Prompts B, C, D, E, F) all agree on a different answer, we should lower our trust in Prompt A.
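
To make this peer-review intuition concrete before walking through the figure, here is a toy sketch (not the paper's algorithm) using the hypothetical Figure 2 numbers quoted later in this post. It simply groups prompts by the answer they converge on and flags any prompt that is far more confident than its peers.

```python
from statistics import mean

# Hypothetical (majority_answer, self_consistency) values for the five
# prompts of Figure 2, using the percentages quoted later in this post.
prompts = {
    "Prompt 1": ("31", 1.00),
    "Prompt 2": ("31", 0.70),
    "Prompt 3": ("36", 0.70),
    "Prompt 4": ("36", 0.40),
    "Prompt 5": ("36", 0.30),
}

# Group prompts by the answer they converge on ("peer groups").
groups = {}
for name, (answer, sc) in prompts.items():
    groups.setdefault(answer, []).append((name, sc))

# A prompt far more confident than the peers sharing its answer is suspect.
# (0.2 is an arbitrary threshold, purely for illustration.)
for answer, members in groups.items():
    group_avg = mean(sc for _, sc in members)
    for name, sc in members:
        flag = "  <- confident but unsupported by peers" if sc - group_avg > 0.2 else ""
        print(f'answer "{answer}": {name} SC={sc:.2f}, group average={group_avg:.2f}{flag}')
```

GLaPE itself does not hard-threshold like this; it folds the same idea into a loss function, as described next.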

Visualizing the Algorithm

Let’s break down the architecture of GLaPE using the schematic below.

Figure 2: Schematic representation of GLaPE integrating SC evaluation and MC refinement.

In Figure 2, we see five different prompts attempting to answer the same question about “Oscar’s lollipops.”

  1. Prompts 1 & 2 both arrive at the answer “31” (which happens to be correct). Prompt 1 is very consistent (100%), while Prompt 2 varies a bit (70%).
  2. Prompt 3 arrives at the answer “36” (Incorrect). However, notice the red marker. It has an SC of 70%. If we only looked at SC, we would think Prompt 3 is just as good as Prompt 2.
  3. Prompts 4 & 5 also arrive at the answer “36,” but with much lower consistency (40% and 30%).

The goal of GLaPE is to generate a metric—a final score—that recognizes Prompt 3 is actually bad, despite its high consistency.

The Mathematics of Refinement

GLaPE calculates the final score (\(f_i\)) for each prompt by minimizing a “Loss Function” (\(L_{total}\)). This loss function has two parts.

Part 1: Self-Consistency Loss (\(L_{self}\))

First, the method tries to keep the final score close to the original Self-Consistency score (\(c_i\)).

Equation for Self-Consistency Loss.
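
The equation image is not reproduced here; a squared-error form consistent with this description, summing over the candidate prompts, would be (the paper's exact form may differ in details such as normalization):

\[
L_{self} \;=\; \sum_{i} \left( f_i - c_i \right)^2 .
\]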

This equation simply says: “The final score (\(f\)) shouldn’t drift too far from the raw consistency (\(c\)).”

Part 2: Mutual-Consistency Refinement (\(L_{refine}\))

This is the “peer review” mechanism. It penalizes the score if prompts sharing the same answer have vastly different scores.

Equation for Mutual-Consistency Refinement Loss.
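
Again, the equation image is not reproduced; one plausible form, consistent with the description above and with the worked expansion later in this post, penalizes score differences between every pair of prompts that converge on the same answer (the paper's exact definition may differ):

\[
L_{refine} \;=\; \sum_{\substack{i < j \\ a_i = a_j}} \left( f_i - f_j \right)^2 ,
\]

where \(a_i\) denotes the most common answer produced by prompt \(i\).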

The logic here is subtle but powerful. If multiple prompts produce the same answer (e.g., answer “36”), they should ideally have similar quality scores.

  • For the answer “36”, Prompt 3 has high SC (70%), but Prompts 4 and 5 have low SC (40%, 30%). The average confidence for answer “36” is low.
  • Therefore, the algorithm pulls Prompt 3’s score down to align with its peers (Prompts 4 and 5).
  • Conversely, for answer “31”, the prompts have SCs of 100% and 70%. The group confidence is high, so the scores remain high.

The Total Calculation

The final calculation balances these two objectives using a weight parameter (\(\alpha\)), usually set to 0.5.

Equation for Total Loss combining SC and MC.
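
Assuming a simple weighted sum of the two terms above (the exact weighting scheme in the paper may differ), the total loss would look like:

\[
L_{total} \;=\; L_{self} \;+\; \alpha \, L_{refine}, \qquad \alpha = 0.5 .
\]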

A Worked Example

Let’s look at the actual numbers from the Figure 2 scenario to see the math in action.

First, we have the raw Self-Consistency (\(c\)) values:

Raw SC values for the five prompts, taken from Figure 2 (on a 0 to 100 scale): \(c_1 = 100\), \(c_2 = 70\), \(c_3 = 70\), \(c_4 = 40\), \(c_5 = 30\).

We set up the Self-Consistency Loss to minimize the distance between our final scores (\(f\)) and these raw values:

Expansion of the Self-Consistency Loss equation.
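
Under the squared-error form sketched earlier, plugging in those raw values gives:

\[
L_{self} = (f_1 - 100)^2 + (f_2 - 70)^2 + (f_3 - 70)^2 + (f_4 - 40)^2 + (f_5 - 30)^2 .
\]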

Next, we apply the Refinement Loss. This groups prompts by their answers.

  • Prompts 1 and 2 agree (Answer: 31).
  • Prompts 3, 4, and 5 agree (Answer: 36).

The math tries to minimize the difference between scores within these groups:

Expansion of the Refinement Loss equation.
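
Under the pairwise form assumed earlier, with Prompts 1 and 2 grouped on answer "31" and Prompts 3, 4, and 5 grouped on answer "36", the expansion would be:

\[
L_{refine} = (f_1 - f_2)^2 + (f_3 - f_4)^2 + (f_3 - f_5)^2 + (f_4 - f_5)^2 .
\]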

When we combine these and solve for the minimum loss (using gradient descent), we get the final GLaPE Scores:

Final GLaPE scores showing the adjustment.

The Result: Look at \(f_3\). Its raw consistency was 70.0, but its GLaPE score dropped to 50.0. The algorithm successfully identified that Prompt 3 was “confidently wrong” because its peers (Prompt 4 and 5) were struggling to reach that same answer consistently. Meanwhile, Prompt 1 stayed high at 87.9.
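
For readers who want to reproduce the mechanics, here is a minimal numerical sketch that minimizes the assumed quadratic loss with \(\alpha = 0.5\) by gradient descent. Because the exact loss form and weighting here are my assumptions rather than the paper's specification, the resulting numbers will not match 87.9 and 50.0 exactly, but the qualitative effect is the same: \(f_3\) is pulled well below its raw SC of 70, while \(f_1\) and \(f_2\) stay high.

```python
import numpy as np

# Raw self-consistency scores (0-100) for the five prompts in Figure 2.
c = np.array([100.0, 70.0, 70.0, 40.0, 30.0])

# Pairs of prompts that converge on the same answer (0-based indices):
# Prompts 1 & 2 -> "31"; Prompts 3, 4 & 5 -> "36".
pairs = [(0, 1), (2, 3), (2, 4), (3, 4)]
alpha = 0.5

f = c.copy()      # start the refined scores at the raw SC values
lr = 0.01         # gradient-descent step size

for _ in range(5000):
    grad = 2.0 * (f - c)              # gradient of L_self
    for i, j in pairs:                # gradient of alpha * L_refine
        g = 2.0 * alpha * (f[i] - f[j])
        grad[i] += g
        grad[j] -= g
    f -= lr * grad

print(np.round(f, 1))
# Prompt 3 is dragged down toward Prompts 4 and 5, which share its answer,
# while Prompts 1 and 2 keep high scores.
```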

Experiments and Results

Does this complex math actually result in better prompts? The researchers tested GLaPE on 8 widely recognized reasoning tasks, including GSM8K (Math), StrategyQA (Commonsense), and Big-Bench Date.

Performance Comparison

The researchers compared prompts optimized using GLaPE (No Gold Labels) against prompts optimized using OPRO (Uses Gold Labels).

Table 3: Optimization results comparing GLaPE and OPRO.

Table 3 shows the results. The “GLaPE-based” prompts achieved accuracy scores that were remarkably close to, and sometimes even better than, the baseline methods.

  • On GSM8K, the GLaPE prompt achieved 77.7% accuracy, beating both the baseline (74.8%) and the gold-label OPRO method (76.6%).
  • On MultiArith, it achieved 99.3%, essentially ceiling performance on that dataset.

This confirms that we can optimize prompts effectively even without knowing the correct answers.

Better Correlation with Accuracy

The ultimate test of an evaluation metric is how well it correlates with ground truth accuracy.

Table 4: Comparison of prompt optimization based on SC vs GLaPE.

Table 4 compares the optimal prompts found by pure SC versus GLaPE. The prompt selected by GLaPE (“After careful analysis, the conclusion is evident”) yielded 77.7% accuracy, while the SC-selected prompt only reached 75.1%.

Furthermore, let’s look at the correlation graphs again.

Figure 4: Side-by-side comparison of the SC-Accuracy and GLaPE-Accuracy graphs.

Figure 4 is the “mic drop” moment for the paper.

  • Graph (a) shows the raw SC vs. Accuracy. It’s noisy and erratic.
  • Graph (b) shows GLaPE vs. Accuracy. It is a much tighter, linear correlation. As the GLaPE score goes up, the actual accuracy goes up reliably.

Generalizability Across Models

The researchers didn’t just stick to GPT-3.5. They verified that GLaPE works on open-source models like Mistral-7B, Llama3-8B, and Gemma2-9B.

Table 5: Optimization results across various models on the GSM8K dataset.

As shown in Table 5, GLaPE consistently outperforms the baseline prompts on these models, proving that the method isn’t specific to just one architecture.

Conclusion and Implications

The GLaPE paper presents a significant step forward for the practical application of LLMs. By removing the dependency on gold labels, prompt engineering moves from a laboratory setting (where we have perfect test data) to the real world (where data is messy and unlabeled).

The method’s core innovation—Mutual-Consistency Refinement—provides a robust way to fact-check an LLM’s confidence. It reminds us that in the absence of an answer key, consensus (among different prompts) is often our best proxy for truth.

A Note on Limitations

The authors honestly note that GLaPE isn’t magic. It relies on the assumption that correct answers are generally more consistent than incorrect ones.

  • In datasets like StrategyQA, there were cases where the LLM was consistently wrong across almost all prompts due to inherent knowledge gaps.
  • In these “collective hallucination” scenarios, GLaPE (and even human consensus) can still be misled.

However, for the vast majority of reasoning tasks, GLaPE offers a powerful new tool for engineers and researchers: the ability to grade the exam without ever seeing the answer key.