Introduction

Imagine you are taking a difficult math exam. You solve a problem, but you aren’t sure if the answer is correct. What do you do? You might re-read the question, or perhaps you try to solve it again. But the most effective students often use a different strategy: they plug the answer back into the equation to see if it fits, or they devise a rigorous checklist to verify their steps.

Large Language Models (LLMs) like GPT-4 and Llama-3 are incredible at generating answers, but they are notoriously bad at this second step: Self-Correction. When an LLM makes a mistake—whether it’s a hallucination in a biography or a calculation error in a math problem—it often fails to spot the error even when asked to “double-check.” Worse, when prompted to correct themselves, models frequently change correct answers to incorrect ones due to uncertainty or poor verification logic.

This brings us to a fascinating new paper titled “ProgCo: Program Helps Self-Correction of Large Language Models.” The researchers propose that the ambiguity of natural language is the enemy of self-correction. Their solution? Code.

In this post, we will dive deep into ProgCo (Program-driven Self-Correction). We will explore how forcing an LLM to think in “pseudo-code” allows it to verify its own work with much higher accuracy than standard conversational prompts, effectively unlocking a “System 2” deliberate thinking process for AI.

The Problem with “Intrinsic” Self-Correction

Before understanding the solution, we must understand why LLMs struggle to correct themselves. Intrinsic self-correction refers to the model’s ability to refine its output without any external help—no human feedback, no Google Search, and no calculator.

The typical workflow looks like this:

  1. Generate an initial response.
  2. Verify the response (Self-Verification).
  3. Refine the response based on feedback (Self-Refinement).

Figure 1: Illustration of a typical workflow of LLM’s intrinsic self-correction.

As shown in Figure 1, this is a cyclical process. If the verification fails, the model generates feedback and tries again.
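
To make this loop concrete, here is a minimal Python sketch of it. The `llm(prompt)` helper is hypothetical (any chat-completion call would do), and the verification step is exactly the fragile natural-language check discussed next.

```python
# Minimal sketch of intrinsic self-correction. `llm(prompt)` is a hypothetical
# helper that sends a prompt to a model and returns its text reply.

def intrinsic_self_correct(llm, question: str, max_rounds: int = 3) -> str:
    answer = llm(f"Answer the question:\n{question}")              # 1. initial response
    for _ in range(max_rounds):
        verdict = llm(                                             # 2. self-verification
            f"Question: {question}\nAnswer: {answer}\n"
            "Is this answer correct? Reply 'yes' or explain the error."
        )
        if verdict.strip().lower().startswith("yes"):
            break
        answer = llm(                                              # 3. self-refinement
            f"Question: {question}\nPrevious answer: {answer}\n"
            f"Feedback: {verdict}\nGive a corrected answer."
        )
    return answer
```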

The failure point usually lies in the Self-Verification phase. Current methods typically ask the model, “Is this correct?” or “Check your steps.” The problem is that LLMs suffer from:

  • Overconfidence: They tend to trust their initial output.
  • Flawed Logic: If the model used broken logic to solve the problem, it often uses the same broken logic to verify it.
  • Ambiguity: Natural language checklists (“Check if the tone is professional”) are often too vague to enforce strict logical consistency.

The authors of ProgCo argue that we need a stricter, more structured medium for verification. We need programs.

The Core Method: ProgCo

ProgCo stands for Program-driven Self-Correction. The framework is built on the insight that code is less ambiguous than natural language and forces a structured, step-by-step evaluation.

ProgCo consists of two main components:

  1. ProgVe (Program-driven Verification): Using pseudo-code to verify answers.
  2. ProgRe (Program-driven Refinement): A dual-track system that refines not just the answer, but the verification code itself.

Figure 3: The overall framework of ProgCo, achieving self-correction through iterative ProgVe and ProgRe.

As illustrated in Figure 3, the process is iterative. The model generates a program to test its answer. It executes that program. If the test passes, we are done. If it fails, the model enters a refinement loop where it updates both its answer and its test code.
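
The skeleton below shows what this loop could look like in code. It reuses the hypothetical `llm(prompt)` helper from earlier; the prompt strings are paraphrases of the idea, not the paper's actual templates, and ProgRe's response refinement (detailed later) is collapsed into a single call.

```python
# Skeleton of the ProgCo loop (a sketch, not the paper's implementation).
# `llm(prompt)` is the same hypothetical completion helper as before.

def progco(llm, task: str, max_rounds: int = 3) -> str:
    answer = llm(f"Solve the task:\n{task}")                       # initial response y_0
    program = llm(                                                 # ProgVe: generate verification f
        f"Write a pseudo-code function verify(response) -> bool for this task:\n{task}"
    )
    for _ in range(max_rounds):
        trace = llm(                                               # ProgVe: execute f on y_i
            f"Execute this program step by step on the response and end with "
            f"'return: True' or 'return: False'.\n{program}\nTask: {task}\nResponse: {answer}"
        )
        if trace.strip().lower().endswith("true"):                 # verification passed
            break
        feedback = llm(f"Task: {task}\nResponse: {answer}\nTrace: {trace}\n"
                       "Explain what is wrong.")
        # ProgRe: refine the response (collapsed here; the full contrast-and-regenerate
        # version is sketched in the ProgRe section below).
        answer = llm(f"Task: {task}\nPrevious response: {answer}\n"
                     f"Feedback: {feedback}\nGive an improved response.")
        program = llm(f"Task: {task}\nVerification program:\n{program}\n"   # ProgRe: refine the program
                      f"Feedback: {feedback}\nFix the program if its own logic is flawed.")
    return answer
```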

Component 1: Program-driven Verification (ProgVe)

Standard prompting asks an LLM to “verify” text. ProgVe asks the LLM to write a function that returns True or False.

Step 1: Verification Program Generation

After the LLM produces an initial response (\(y_0\)), ProgCo prompts the model to generate a verification function (\(f\)). Crucially, this generation is often done independently of the initial answer to avoid bias.

\[ f = M(P_{func}^{gen} \,\|\, x) \]

Here \(M\) denotes the LLM, \(P_{func}^{gen}\) the prompt for generating the verification function, \(x\) the input task, and \(\|\) concatenation.

What does this verification program look like?

  • For Instruction Following: If the user asks for a poem in all caps with a specific title format, the program explicitly checks response.is_upper() and verifies the title regex.
  • For Math (Reverse Reasoning): This is where ProgVe shines. Instead of re-solving the problem, the code implements Reverse Reasoning. If the question asks for an original price given a discount, the verification code takes the answer, applies the discount, and checks if it matches the final price given in the prompt.

Figure 2: Illustration of generating verification pseudoprogram for input tasks.

Figure 2 demonstrates this versatility. Notice the bottom example regarding the discount calculation. The code doesn’t just ask “is this right?” It mathematically calculates: calculated_discounted_price = original_price - discount. If this doesn’t match the known discounted price ($19.50), the verification returns False.
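
To give a feel for what these generated programs look like, here are two illustrative verification functions written by hand. The title format, the discount amount, and the function names are assumptions made for the example; only the $19.50 discounted price comes from the example discussed above.

```python
import re

# Illustrative instruction-following check: an all-caps poem with a bracketed title.
# The <<...>> title format is an assumed convention, not taken from the paper.
def verify_instruction(response: str) -> bool:
    lines = response.strip().splitlines()
    has_title = bool(re.match(r"^<<.+>>$", lines[0])) if lines else False
    body = "\n".join(lines[1:])
    return has_title and body.isupper()

# Illustrative reverse-reasoning check for the discount example: apply the known
# discount to the proposed original price and compare against the stated $19.50.
def verify_original_price(original_price: float) -> bool:
    discount = 6.50                                   # hypothetical discount amount
    calculated_discounted_price = original_price - discount
    return abs(calculated_discounted_price - 19.50) < 0.01
```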

Step 2: Verification Program Execution

Here is the twist: The code doesn’t necessarily need to run on a computer.

While ProgCo can use a Python interpreter, its default mode uses the LLM itself as the Program Executor. The LLM reads the code and “executes” it step-by-step in its “mind” (context).

\[ r_i = M(P_{func}^{exec} \,\|\, x \,\|\, y_i), \qquad fb_i = M(P_{fb} \,\|\, x \,\|\, y_i \,\|\, r_i) \]

Here \(r_i\) is the execution result (the trace) for response \(y_i\), and \(fb_i\) is the feedback derived from it.

Why use the LLM as the executor?

  1. Pseudo-Code Flexibility: Sometimes verification requires abstract logic, like is_structured_as_letter(). A real Python interpreter would throw a NameError. An LLM understands the intent of that function and executes it based on world knowledge.
  2. Causal Tracing: By forcing the LLM to simulate the execution trace line-by-line, we force it to slow down and adhere to the causal logic of the code structure, reducing hallucinations.
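
A hypothetical executor prompt might look like the following. The wording is illustrative rather than the paper's actual prompt, but it captures the idea of forcing a line-by-line trace whose last line can be parsed mechanically.

```python
# Illustrative prompt for using the LLM itself as the program executor.
EXECUTOR_PROMPT = """You are a careful program executor.
Execute the verification program below line by line on the given response.
After each line, state the values of any variables that changed.
End with a single line: 'return: True' or 'return: False'.

Program:
{program}

Task:
{task}

Response to verify:
{response}"""

def llm_execute(llm, program: str, task: str, response: str) -> bool:
    """Run the pseudo-program 'in the model's head' and parse the final verdict."""
    trace = llm(EXECUTOR_PROMPT.format(program=program, task=task, response=response))
    return trace.strip().splitlines()[-1].strip().lower().endswith("true")
```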

Component 2: Program-driven Refinement (ProgRe)

If ProgVe returns “False,” the model must fix the error. However, a major issue in self-correction is Misleading Feedback. Sometimes the verification code is wrong, or the feedback is vague, causing the model to ruin a perfectly good answer.

To solve this, ProgRe introduces Dual Refinement: it refines the response and the verification program.

Reflecting on the Response

Instead of blindly applying feedback, ProgRe uses a “Contrast and Regenerate” strategy.

  1. Reflection: The model generates a temporary new answer (\(y_{temp}\)) based on the feedback.
  2. Contrast: The model compares the old answer (\(y_i\)) with the new temporary answer (\(y_{temp}\)) to identify the core differences.
  3. Insight Generation: These differences are synthesized into “Insights” (\(ins\)).
  4. Regeneration: The final refined answer is generated using these specific insights.

\[ y_{i+1}^{temp} = M(P_{reflex} \,\|\, x \,\|\, y_i \,\|\, fb_i) \]

\[ ins = M(P_{cont} \,\|\, y_i \,\|\, y_{i+1}^{temp}), \qquad y_{i+1} = M(ins \,\|\, x) \]

This extra step of “Contrasting” helps the model understand why the change happened, preventing it from flipping back and forth between wrong answers.
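
In code, the contrast-and-regenerate strategy could look roughly like this, again using the hypothetical `llm()` helper and paraphrased prompts:

```python
# Sketch of ProgRe's contrast-and-regenerate refinement, following the four steps above.

def refine_response(llm, task: str, answer: str, feedback: str) -> str:
    # 1. Reflection: draft a temporary new answer from the feedback (y_temp).
    temp = llm(f"Task: {task}\nPrevious answer: {answer}\n"
               f"Feedback: {feedback}\nWrite a revised answer.")
    # 2-3. Contrast and insight generation: distil the differences into insights (ins).
    insights = llm("Compare the two answers below and summarize, in short bullet "
                   "points, what changed and why the change matters.\n"
                   f"Old: {answer}\nNew: {temp}")
    # 4. Regeneration: produce the final answer from the insights, not the raw feedback.
    return llm(f"Keep these insights in mind:\n{insights}\nNow answer the task:\n{task}")
```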

Reflecting on the Program

What if the answer was right, but the test code was wrong? (e.g., the code checked for “less than 100 words” when the prompt asked for “more than 100”).

ProgRe includes a step to Refine the Verification Program. Using the current response and feedback, the model re-evaluates its test code. If the code logic is flawed, it updates the function (\(f_{i+1}\)) for the next round.

\[ f_{i+1} = M(P_{reflex}^{code} \,\|\, x \,\|\, f_i \,\|\, y_i \,\|\, fb_i) \]

This ensures the “judge” (the program) improves alongside the “candidate” (the response).
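
A toy illustration of program refinement, using the word-count mix-up mentioned above: the flawed program inverts the constraint, and the refinement step only has to flip the comparison.

```python
# Toy illustration of refining a flawed verification program.

def verify_word_count_flawed(response: str) -> bool:
    return len(response.split()) < 100    # wrong: the prompt asked for MORE than 100 words

def verify_word_count_refined(response: str) -> bool:
    return len(response.split()) > 100    # corrected after reflecting on the program
```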

Experiments and Results

The researchers evaluated ProgCo on three benchmarks:

  • IFEval: Instruction following (e.g., “write a story in more than 400 words”).
  • GSM8K: Grade school math word problems.
  • MATH: Challenging competition-level mathematics.

They compared ProgCo against standard baselines like Self-Refine (asking the model to generate feedback on its answer and refine accordingly), Self-Reflection, and CheckList (generating a list of questions to check against).

Main Performance

The results, shown in Table 1 below, are striking. ProgCo consistently outperforms all other self-correction methods across all datasets and models (Llama 3, GPT-3.5, GPT-4o).

Table 1: The result of different self-correction methods.

Take a look at the MATH dataset results for GPT-3.5.

  • The Initial Score was 36.2%.
  • Standard Self-Refine actually hurt performance (dropping to 34.8%), a common phenomenon where models second-guess themselves.
  • ProgCo increased accuracy to 44.2% after three rounds, a massive 8.0-point improvement.

This highlights that while standard prompting often fails on complex reasoning, the program-driven approach provides the scaffolding necessary for actual improvement.

Why Does It Work? Better Recall.

To improve an answer, you first have to realize it’s wrong. This metric is called Recall of Incorrect Responses.

Figure 4: Recall and F1 scores of self-verification methods for incorrect responses on GPT-3.5.

Figure 4 compares how well different methods identify errors.

  • Checklist (Gray): Very poor recall. It rarely spots mistakes.
  • CoT-Check (Blue-Gray): Better, but still misses many errors.
  • ProgVe (Blue): Significantly higher recall. By forcing the model to run a verification algorithm, it uncovers inconsistencies it would otherwise miss.

The Power of Iteration

One of the most interesting findings is how performance changes over time (iterations).

Figure 5: Score variation with the maximum number of self-correction rounds on GPT-3.5.

In Figure 5 (right side, GSM8K), look at the difference between the Orange line (Self-Reflection) and the Red line (ProgCo).

  • Self-Reflection performance dips initially and struggles to gain momentum.
  • ProgCo shows a steady, consistent climb. As the rounds progress, the dual refinement improves both the answer and the verification code, creating a positive feedback loop.

Pseudo-Code vs. Real Code

While ProgCo is designed to work with the LLM acting as the computer (Pseudo-Code), the researchers also tested what happens if you plug in a real Python interpreter (ProgVe + Python).

Table 3: Performance of ProgCo in integrating the Python executor tool during the ProgVe process.

Table 3 shows that while the pure LLM approach is strong, adding a real Python tool boosts performance even further (e.g., +3.51% on IFEval for GPT-4o). This confirms that the structure of the program is the key active ingredient, but exact execution precision (provided by Python) is the cherry on top.
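
For verification programs that happen to be valid Python (numeric or string checks rather than abstract pseudo-code), wiring in a real executor can be as simple as the sketch below. This is an assumed integration, not the paper's implementation, and executing model-generated code should of course be sandboxed in practice.

```python
# Sketch: execute a generated verification function with a real Python interpreter.
# Assumes the program defines `verify(response) -> bool` and runs in a sandbox.

def run_with_python(program_src: str, response: str) -> bool:
    namespace: dict = {}
    exec(program_src, namespace)              # defines verify() in an isolated namespace
    return bool(namespace["verify"](response))
```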

Conclusion

The ProgCo paper makes a compelling argument: if we want Large Language Models to correct their own mistakes, we need to change how they think about verification.

Conversational self-verification (“Is this right?”) is prone to the same biases that caused the error in the first place. By shifting the paradigm to Program-driven Verification, we force the model into a different mode of reasoning—one that is structural, logical, and rigorous.

Key takeaways for students and practitioners:

  1. Verification needs Structure: Pseudo-code is a powerful prompt engineering tool because it forces strict logic steps that natural language lacks.
  2. Reverse Reasoning is Key: Checking an answer often requires a different logic path than finding the answer (e.g., working backward). Programs are excellent at encapsulating this reverse logic.
  3. Dual Refinement: When an AI fails a test, you must question both the answer and the test. ProgCo’s ability to refine the verification code prevents the model from getting stuck in “false failure” loops.

As LLMs continue to evolve, methods like ProgCo act as a “System 2” cognitive layer—a slow, deliberate check that ensures the lightning-fast generations of “System 1” are actually correct.


The images and data used in this article are sourced from the research paper “ProgCo: Program Helps Self-Correction of Large Language Models” by Song et al. (2024).