Mathematics is often called the universal language. However, there is a distinct difference between the mathematics humans write in textbooks—informal, intuitive, and filled with natural language—and the mathematics computers can verify. The latter requires formalization, a rigorous translation of mathematical concepts into code that a theorem prover (like Isabelle or Lean) can execute and check for logical validity.

This translation process, known as autoformalization, is a massive bottleneck in the field. Building large libraries of formal mathematics usually requires a community of human experts working for years. But what if Large Language Models (LLMs) could do the heavy lifting?

In this post, we will dive deep into the paper “Consistent Autoformalization for Constructing Mathematical Libraries” by Zhang, Quan, and Freitas. We will explore a new framework designed to make LLMs not just capable of translating math, but capable of doing so consistently enough to build reliable mathematical libraries.

The Problem: Why Can’t LLMs Just Write Math Code?

You might assume that because LLMs like GPT-4 are excellent at writing Python or C++, they would be equally good at writing formal mathematical languages like Isabelle/ZF. However, formal mathematical reasoning presents unique challenges:

  1. Specialization and Drift: As you move deeper into specific fields (like topology or complex algebra), the concepts become increasingly “out-of-distribution” (OOD) for the model. The model might know basic calculus, but it struggles with specialized lemmas found in research papers.
  2. Library Consistency: This is the big one. In a software library, if you define a function add_numbers, you must call it exactly that way later. In math libraries, new theorems build upon previous definitions. If an LLM translates a statement but uses slightly different terminology or syntax than what exists in the current Knowledge Base (KB), the code is useless. It breaks semantic consistency.

The researchers propose a solution that coordinates three distinct mechanisms to solve these problems: Most-Similar Retrieval Augmented Generation (MS-RAG), Denoising, and Auto-Correction with Syntax Error Feedback (Auto-SEF).

Figure 1: The overall framework, consisting of three stages: Retrieval Augmented Generation, Denoising, and Auto-Correction with Syntax Error Feedback.

As shown in Figure 1 above, the framework operates in stages, moving from a natural language input to a refined, formally verified output. Let’s break down exactly how this works.


Phase 1: Context Matters (MS-RAG)

The standard way to use LLMs for this task is “few-shot prompting,” where you give the model a few fixed examples of “Natural Language \(\rightarrow\) Formal Code” pairs to teach it the pattern.

The problem? Fixed examples are static. If you are trying to formalize a statement about topology, but your fixed examples are about arithmetic, the model won’t know the specific syntax or definitions required for topology.

The researchers propose Most-Similar Retrieval Augmented Generation (MS-RAG). Instead of fixed examples, the system looks at the input statement \(s\) and searches a Knowledge Base (KB) for the most similar existing examples.

The transformation function changes from a standard prompt to one that relies on retrieved context:

Equation showing the transformation function using retrieved examples based on similarity.
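In symbols, this could be written roughly as follows (our own rendering based on the description in the paper, with \(\phi(s)\) denoting the generated formal code and \(f_{\mathrm{LLM}}\) the language model):

\[ \phi(s) = f_{\mathrm{LLM}}\big(p_{auto},\ \{(s_i, \phi_i)\}_s,\ s\big) \]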

Here:

  • \(s\) is your input (the math sentence).
  • \(\{(s_i, \phi_i)\}_s\) represents the set of similar examples (pairs of text and code) retrieved from the library.
  • \(p_{auto}\) is the instruction prompt.

By dynamically swapping the examples in the prompt to match the topic of the input, the LLM sees how similar concepts were formalized in the past. This helps ensure that the model uses the correct “dialect” and definitions found in the existing library, significantly improving the chances of generating valid code.
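To make this concrete, here is a minimal sketch of MS-RAG-style retrieval, assuming a TF-IDF retriever over the KB; the function names and prompt wording are illustrative, not the paper's implementation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def retrieve_similar(statement, kb_pairs, k=3):
    """Return the k (text, formal_code) pairs from the KB most similar to `statement`."""
    texts = [text for text, _ in kb_pairs]
    matrix = TfidfVectorizer().fit_transform(texts + [statement])
    # The last row is the query statement; score it against every KB statement.
    scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
    best = scores.argsort()[::-1][:k]
    return [kb_pairs[i] for i in best]

def build_prompt(instruction, statement, examples):
    """Splice the retrieved (text, code) pairs into the autoformalization prompt."""
    shots = "\n\n".join(
        f"Natural language: {text}\nIsabelle/ZF: {code}" for text, code in examples
    )
    return f"{instruction}\n\n{shots}\n\nNatural language: {statement}\nIsabelle/ZF:"
```

A dense-embedding retriever would slot in the same way; only the scoring step in `retrieve_similar` changes.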


Phase 2: Denoising the Output

LLMs are trained to be helpful assistants. This “helpfulness” is actually a hindrance in autoformalization. When asked to translate a theorem, an LLM often adds:

  • Conversational filler (“Here is the code you asked for…”)
  • Explanations of the logic (“I used the induction principle because…”)
  • Unsolicited proofs (trying to prove the theorem immediately, often incorrectly).

A theorem prover behaves like a strict compiler: it expects exact syntax. If the LLM dumps conversational text into the file, the check fails. Furthermore, the style of the code needs to match the library (e.g., using specific operators like \<in> vs \in).

The researchers introduced Denoising to strip away this “noise.” They explored two approaches:

  1. Code-Based Denoising (CBD): Using hard-coded rules (like Regular Expressions) to strip out known conversational patterns.
  2. Prompt-Based Denoising (PBD): Asking the LLM itself to clean up its own output.

It turns out that Prompt-Based Denoising is particularly powerful when you explicitly instruct the model to maintain stylistic alignment.
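As a rough illustration of the code-based variant, the sketch below strips a couple of common chat patterns with regular expressions; these rules are our own guesses, not the paper's actual rule set.

```python
import re

# Illustrative noise patterns (our guesses at common chat filler,
# not the paper's actual rule set).
NOISE_PATTERNS = [
    r"^\s*(Sure|Certainly|Here is|Here's)[^\n]*\n",     # chatty preamble lines
    r"\n\s*(Explanation|Note|Proof)s?\b[:.][\s\S]*\Z",  # trailing commentary or proof attempts
]

def code_based_denoise(raw_output: str) -> str:
    """Strip known conversational noise, keeping only the candidate formal code."""
    cleaned = raw_output
    for pattern in NOISE_PATTERNS:
        cleaned = re.sub(pattern, "", cleaned, flags=re.MULTILINE)
    return cleaned.strip()
```

Rules like these are inherently brittle, which is part of why asking the model to clean up its own output works so well when the instruction is explicit about style.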

Take a look at the table below to see the impact of denoising on a real example. Notice how the “Raw” output contains notes and incorrect syntax (like the phrase “neighborhood system of”), while the denoised versions (especially PBD 1D) strip the text and correct the syntax to match the library style.

Table showing an example using Mistral, comparing raw output against various denoising strategies.

The prompt used for this denoising step isn’t just “remove text.” It explicitly asks the model to act as an expert and strictly output only the code, while matching the operator style of the provided examples.
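A denoising prompt in that spirit might look something like the following template (illustrative wording only; the paper's exact prompt differs):

```python
# An illustrative prompt-based denoising (PBD) instruction.  The paper's exact
# wording is different; this only mirrors the intent described above.
PBD_PROMPT = (
    "You are an expert in the Isabelle/ZF proof assistant.\n"
    "The candidate formalization below may contain explanations, proofs, or other "
    "natural-language text.\n"
    "Return ONLY the formal statement, with no commentary, and use the same "
    "operators and notational style as the reference examples.\n\n"
    "Reference examples:\n{examples}\n\nCandidate:\n{candidate}\n"
)
```

At inference time, `{examples}` could be filled with the same retrieved pairs used by MS-RAG, which is what keeps the cleaned output aligned with the library's conventions.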


Phase 3: The Feedback Loop (Auto-SEF)

Even with RAG and Denoising, the generated code might still have bugs. Maybe a parenthesis is missing, or a variable is undefined. In standard software engineering, when your code fails to compile, you look at the error message and fix it.

The researchers automated this process using Auto-Correction with Syntax Error Feedback (Auto-SEF).

Here is the concept:

  1. The LLM generates formal code \(d(s)\) (the denoised output from the previous phase); this is the initial candidate.
  2. The system sends this code to the Isabelle Theorem Prover.
  3. If Isabelle accepts it, great! We are done.
  4. If Isabelle returns an error, the first reported error message \(e_{k,1}\) is fed back into the LLM as a new prompt.

The equation governing this iterative process is:

Equation showing the iterative generation process using error feedback.
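In the symbols explained below, this can be written roughly as (again our own rendering, with \(p_{sef}\) as a label for the correction instruction):

\[ g_{k+1}(s) = f_{\mathrm{LLM}}\big(p_{sef},\ g_k(s),\ e_{k,1}\big) \]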

Here, \(g_{k+1}(s)\) is the new code generated in iteration \(k+1\), based on the previous code \(g_k(s)\) and the specific error message \(e_{k,1}\).

This transforms the LLM from a “one-shot” translator into a developer that can debug its own work. It allows the model to leverage the rigorous logic of the symbolic theorem prover to guide its neural generation.
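Putting the pieces together, the refine-and-check loop could be sketched as below. Here `generate`, `denoise`, and `check_with_isabelle` are caller-supplied placeholders for the LLM call, the denoising step, and a call into an Isabelle checking service; none of these names come from the paper.

```python
def auto_sef(statement, generate, denoise, check_with_isabelle, max_iters=5):
    """Iteratively repair generated code using syntax error feedback.

    `generate(prompt)` calls the LLM, `denoise(text)` strips conversational noise,
    and `check_with_isabelle(code)` returns (passed, error_message).  All three
    are caller-supplied placeholders.
    """
    code = denoise(generate(f"Formalize the following in Isabelle/ZF:\n{statement}"))
    for _ in range(max_iters):
        passed, error = check_with_isabelle(code)
        if passed:
            return code, True   # accepted by the theorem prover
        # Feed the first error message back to the model and try again.
        repair_prompt = (
            f"The following Isabelle/ZF code fails with this error:\n{error}\n\n"
            f"Code:\n{code}\n\nReturn only the corrected code, nothing else."
        )
        code = denoise(generate(repair_prompt))
    return code, False          # still failing after max_iters attempts
```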


Experimental Results

To test this framework, the authors created a dataset called MathLibForm, based on IsarMathLib (a library for set theory and abstract algebra). They tested several models, including Mistral, Llemma, and GPT-3.5-Turbo.

They used several metrics to evaluate success, including BLEU and CodeBERTScore (similarity to ground truth), but the most critical metric was Pass: Did the code actually compile without errors?

Does RAG Help?

Absolutely. Looking at Table 1 (below), we can see that introducing retrieval (RAG) significantly boosts performance across almost all metrics compared to zero-shot or fixed 3-shot prompting.

Table 1: Autoformalization results. Settings with retrieval consistently outperform the baselines.

A fascinating finding here is that Mistral (7B)—a relatively small model—when equipped with MS-RAG, performs comparably to or even better than GPT-3.5 in its base state on some metrics. This suggests that providing the right context via retrieval is as important, if not more so, than the raw size of the model.

The Impact of Feedback Loops

The Auto-SEF mechanism (the feedback loop) showed clear benefits, though with diminishing returns.

Figure 2: Pass rate improving over Auto-SEF iterations.

As shown in Figure 2, the first few iterations of error correction yield the biggest jump in success rates. For GPT-3.5 (the blue line), the pass rate jumps from roughly 65% to nearly 71%. Mistral (the red line) also sees a bump, though it plateaus earlier. This confirms that LLMs can indeed “read” compiler error messages and fix syntax errors, provided the errors aren’t too conceptually deep.

A Note on Metrics

Measuring “correctness” in autoformalization is tricky. Just because code compiles (Pass) doesn’t mean it captures the correct mathematical meaning. Conversely, code might be semantically close but fail to compile due to a missing bracket.

The researchers analyzed the correlation between different metrics:

Figure 4: Heatmap of correlation coefficients between metrics such as BLEU, RUBY, and Pass.

The heatmap in Figure 4 shows a strong positive correlation between RUBY (an edit-distance metric) and the other metrics. This suggests that while no single metric is perfect, they generally agree on what constitutes “good” formalization. The authors recommend using Pass (syntactic correctness) alongside a similarity metric such as BLEU (surface overlap with the reference translation) to get a full picture.


Conclusion and Future Implications

The paper “Consistent Autoformalization for Constructing Mathematical Libraries” provides a blueprint for the future of AI-assisted mathematics. By moving beyond simple prompting and integrating Retrieval (RAG), Denoising, and Symbolic Feedback (Auto-SEF), the authors demonstrated a way to build formal libraries that are not just large, but consistent and functional.

Key Takeaways:

  • Context is King: Retrieving similar examples helps models adapt to specialized domains and maintain library consistency.
  • Clean Inputs, Clean Outputs: Denoising is essential to bridge the gap between “chatty” LLMs and strict compilers.
  • Neuro-Symbolic Collaboration: The most powerful systems aren’t just Neural Networks; they are Neural Networks integrated with Symbolic Logic solvers (Theorem Provers) that can check and guide the output.

As these techniques are refined, we move closer to a world where mathematical verification is accessible not just to a handful of experts, but to anyone with a research question and an AI assistant.