Mathematics is often called the universal language. A calculation like \(20 - 12 + 5\) yields the same result whether you describe the problem in English, Chinese, or Swahili. However, for Large Language Models (LLMs), this universality is not a given. While models like GPT-4 exhibit impressive reasoning capabilities in English, their performance often degrades significantly when prompted in low-resource languages.
The challenge lies in multi-step reasoning. Solving a word problem requires understanding the narrative, planning a logical sequence of steps, and executing calculations. When an LLM is forced to do this in a language it wasn’t heavily trained on, the cognitive load is often too high, leading to errors.
In this post, we will explore a research paper that proposes a novel solution to this problem: Cross-lingual Program-Aided Language Models (Cross-PAL). This approach leverages the strict logic of computer code to bridge the gap between languages, allowing models to “think” in high-resource languages (like English) while solving problems in their target language.
The Problem: The Multilingual Reasoning Gap
In-context learning—prompting a model with a few examples—has revolutionized how we interact with AI. Techniques like Chain-of-Thought (CoT) encourage models to generate intermediate reasoning steps (e.g., “First, I will calculate X, then Y…”) before giving a final answer. This significantly improves accuracy on math and logic tasks.
However, CoT has a limitation: it relies on natural language. If you ask a model to reason in Telugu or Bengali, and the model has seen limited training data in those languages, its internal “monologue” often becomes incoherent.
Previous attempts to fix this involved:
- Translation: Translating the problem to English, solving it, and translating back. This introduces translation errors.
- Native CoT: Forcing the model to think in the target language. This fails when the model's ability to carry out complex reasoning in that language is weak.
The researchers behind Cross-PAL identified a missing link: structure. Natural language is ambiguous; code is not. By forcing the model to structure its reasoning as a computer program, we can decouple the logic of the problem from the syntax of the spoken language.
Enter Cross-PAL: Reasoning with Code
Cross-PAL is a method for aligning reasoning programs across languages. Instead of asking the model to write a paragraph of text explaining the solution, Cross-PAL asks the model to write code (specifically Python-style pseudocode) to solve the problem.
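In the PAL style, a word problem like the one in the introduction (start with 20, remove 12, add 5) becomes a short program instead of a paragraph. A minimal illustration of the idea (the problem wording and variable names here are our own, not the paper's):

```python
# Word problem: "A jar holds 20 marbles. 12 are removed, then 5 are added.
# How many marbles are in the jar?"

initial_marbles = 20   # starting quantity from the problem statement
removed = 12           # quantity taken out
added = 5              # quantity put back in

answer = initial_marbles - removed + added
print(answer)  # 13
```

Because the reasoning is a program, the final answer comes from deterministic execution rather than from the model doing arithmetic in prose, which is where many CoT errors occur.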
The core innovation is a two-step prompting mechanism that acts as a bridge between the user’s language and the model’s strongest reasoning capabilities.
The Architecture
As illustrated in the figure below, the process is split into two distinct phases: the Understander and the Solver.

Let’s break down these two phases using the example from Figure 1, where the input problem is in Chinese (\(L_s\)).
Phase 1: The Cross-lingual Understander
In this first step, the goal is to comprehend the problem and plan a solution. The prompt asks the LLM to act as an “expert in multilingual understanding.”
Crucially, the prompt includes examples (few-shot demonstrations) where the question is in the target language, but the reasoning steps (the code comments and variable names) are in a high-resource language, typically English (\(L_t\)).
Why English? Because the vast majority of code on the internet—and therefore in the model’s training data—is written with English keywords, identifiers, and comments.
The model reads the Chinese question and generates a reasoning path. Formally, the generated plan \(\mathcal{A}\) is a sequence of steps \(s_1 \dots s_n\), chosen to maximize its probability given the input question \(Q\), the source language \(L_s\), and the target reasoning language \(L_t\):

\[
\mathcal{A} = \arg\max \, p(s_1, \dots, s_n \mid Q, L_s, L_t)
\]
By switching to English for the planning phase, the model leverages its strongest reasoning circuits. It outlines the logic: “define variable for initial lollipops,” “subtract given lollipops,” etc., using English code comments.
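Concretely, a Phase-1 output for the lollipop example might look like the following. This is a hypothetical sketch: the numbers and exact formatting are ours, not the paper's.

```python
# Hypothetical Understander output: the question was asked in Chinese,
# but the plan uses English variable names and comments.

# define variable for initial lollipops
initial_lollipops = 20
# subtract the lollipops given away
given_lollipops = 12
remaining_lollipops = initial_lollipops - given_lollipops
print(remaining_lollipops)  # 8
```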
Phase 2: The Language-Specific Solver
Once the plan is generated in English, the system moves to the Solver phase. Here, the prompt instructs the model to act as a programmer in the original language (Chinese).
The model takes the English-based plan generated in Phase 1 and converts it into a final executable program or a structured solution in the original language. This might seem counterintuitive—why go back to Chinese? The goal is to ensure the final answer aligns with the user’s request and to verify that the logic holds up when contextualized back into the source language.
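A hypothetical sketch of what the Solver's output might look like for the lollipop example, with the same logic re-expressed using comments and identifiers in the source language (the Chinese wording is our own illustration; Python 3 permits non-ASCII identifiers):

```python
# 定义最初的棒棒糖数量 (define the initial number of lollipops)
最初数量 = 20
# 减去送出的棒棒糖 (subtract the lollipops given away)
送出数量 = 12
剩余数量 = 最初数量 - 送出数量
print(剩余数量)  # 8
```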
The reasoning steps \(\mathcal{R}_t\) for the final solution are generated conditioned on the plan \(P\) produced in Phase 1:

\[
\mathcal{R}_t = \arg\max_{\mathcal{R}} \, p(\mathcal{R} \mid P, Q, L_s)
\]
Finally, the specific answer \(A_t\) (e.g., the number “11”) is derived by executing or parsing these reasoning steps:

\[
A_t = \arg\max_{A} \, p(A \mid \mathcal{R}_t)
\]
This “sandwich” method—Input (\(L_s\)) \(\rightarrow\) Plan (\(L_{English}\)) \(\rightarrow\) Output (\(L_s\))—allows the model to use English as a cognitive crutch without losing the context of the original request.
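Put together, the sandwich flow amounts to two prompted generations around a single completion function. A minimal orchestration sketch, where `llm` is a hypothetical prompt-to-text backend and the prompt wording is paraphrased rather than quoted from the paper:

```python
def cross_pal(question_src, llm):
    """Two-step Cross-PAL sketch. `llm(prompt) -> str` is any
    text-completion backend; few-shot demonstrations are omitted."""
    # Phase 1 (Understander): plan in English, regardless of the
    # language the question was asked in.
    plan = llm(
        "You are an expert in multilingual understanding.\n"
        f"Question: {question_src}\n"
        "Write a step-by-step solution plan as English-commented code."
    )
    # Phase 2 (Solver): convert the English plan into a final program
    # expressed back in the question's original language.
    return llm(
        "You are a programmer working in the question's original language.\n"
        f"Plan:\n{plan}\n"
        "Rewrite this plan as the final solution program."
    )
```

The two-call structure is the whole trick: the intermediate `plan` is the English "cognitive crutch", and only the second call produces output in the source language.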
Self-Consistency: The Power of Ensembling
The authors didn’t stop at a single reasoning path. They introduced Self-consistent Cross-PAL (SCross-PAL).
In complex reasoning, even humans might solve a problem two different ways to double-check their work. SCross-PAL does the same: it prompts the model to generate multiple reasoning paths across different languages and prompt variations.
For example, it might generate one plan thinking primarily in English, another thinking in Chinese, and another in German. It then looks at the final numerical answers produced by all these paths.
The system uses a voting mechanism to select the final answer: it picks the answer \(\hat{A}\) that appears most frequently across all generated paths \(A_t\) and languages \(L\). This majority voting filters out “hallucinations” or calculation errors that might occur in just one specific language path.

\[
\hat{A} = \arg\max_{a} \sum_{L} \sum_{t} \mathbb{1}\left[ A_t^{(L)} = a \right]
\]
This ensemble approach significantly increases robustness, ensuring that a linguistic nuance in one language doesn’t derail the entire calculation.
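The majority vote at the heart of SCross-PAL is simple to implement: collect the final numerical answer from each (language, path) run and keep the most frequent one. A minimal sketch (generation of the paths themselves is omitted):

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent final answer across reasoning paths.
    `answers` holds one final answer per (language, path) run."""
    counts = Counter(answers)
    best_answer, _ = counts.most_common(1)[0]
    return best_answer

# e.g. answers produced by English, German, and Chinese reasoning paths
print(majority_vote([11, 11, 13]))  # -> 11
```

A single path that miscalculates (the `13` above) is simply outvoted by the paths that agree.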
Experimental Results
The researchers evaluated Cross-PAL on two major multilingual math benchmarks: MGSM (Multilingual Grade School Math) and MSVAMP. They tested across various languages, ranging from high-resource (German, French) to lower-resource (Swahili, Telugu).
Outperforming the Baselines
The results were compelling. Cross-PAL consistently outperformed direct prompting and standard Chain-of-Thought (CoT) methods.
In the MSVAMP benchmark, the radar chart below visualizes the performance coverage. The further the line is from the center, the higher the accuracy.

Notice the red line (Cross-PAL) and the purple line (SCross-PAL). They encompass the inner lines (Direct prompting and other baselines), indicating superior performance across almost all languages tested, including Thai (th) and Bengali (bn).
Consistency Across Languages
The effectiveness of Cross-PAL is further highlighted when we compare it to “Native” versions. The authors ran an experiment comparing Cross-PAL (which uses English as the intermediate planning language) against a “Native” version where the intermediate planning happened in the target language itself.

As shown in Figure 3, the standard Cross-PAL (blue bars) generally outperforms or matches the Native version (green bars), particularly in lower-resource languages. This confirms the hypothesis: injecting English-based structural planning helps the model reason better in other languages.
Scalability to Smaller Models
One of the most important findings is that this method isn’t exclusive to massive models like GPT-4. The authors tested Cross-PAL on smaller, open-source models like Llama-2-7b, Llama-3-8b, and Phi-3.

The table above shows that even for smaller models, the Double-step (Cross-PAL) method yields significant improvements over single-step prompting. For example, on Llama-3, the accuracy jumps from roughly 51% (single-step) to 55.4% (double-step). This suggests that structured, program-aided prompting acts as a “reasoning amplifier,” allowing smaller models to punch above their weight class.
Analysis: Why Does This Work?
The paper provides several deep dives into the mechanics of why Cross-PAL is effective.
1. The English Pivot
The dominance of English in pre-training data cannot be ignored. By routing the logic component of the task through English, Cross-PAL minimizes the risk of the model getting confused by the syntax of a less familiar language. The “First-step” results in Table 2 (above) show that the planning phase alone contributes significantly to the success.
2. High-Resource vs. Low-Resource Integration
In the self-consistency experiments (SCross-PAL), the authors investigated which languages should be included in the voting ensemble.

The graph above reveals a critical insight:
- Green line (High Resource + English): Adding more high-resource languages generally maintains or boosts performance.
- Blue line (Low Resource \(\rightarrow\) High Resource): Starting with low-resource languages results in lower performance, which slowly improves as high-resource languages are added.
Essentially, the quality of the “voters” matters. Ensembling a bunch of low-resource reasoning paths (which are likely error-prone) is less effective than ensembling a few high-resource paths.
3. Bilingual Synergy
The authors also found that simply adding English as a secondary path to a Native prompt creates a massive boost, especially for low-resource languages.

In Figure 5, look at the difference between the orange bars (Native only) and the green bars (Native + English). For a language like Telugu (te), adding the English reasoning path nearly doubles the performance. This confirms that English acts as a stabilizer for the model’s reasoning process.
Conclusion and Implications
The Cross-PAL paper presents a significant step forward in making AI more equitable. By acknowledging the current limitations of LLMs—specifically their bias toward English—and engineering a prompt strategy that turns this bias into a feature (using English for planning), the authors have improved multilingual mathematical reasoning without the need for expensive model retraining.
Key Takeaways:
- Code is a Reasoning Anchor: Structured code demonstrations are more effective than natural language for guiding multi-step reasoning.
- Code-Switching Works: Planning in a high-resource language (English) and executing in a target language yields better results than staying monolingual in low-resource settings.
- Ensembling is Robust: Voting across different language paths (SCross-PAL) filters out errors and hallucinations.
- Small Models Benefit: This technique unlocks reasoning capabilities in smaller, open-source models, making powerful AI more accessible.
As we move toward more global AI adoption, techniques like Cross-PAL will be essential. They ensure that a user’s language preference does not dictate the quality of the intelligence they receive. While we wait for pre-training datasets to become truly balanced across all languages, Program-Aided Language Models provide a clever and effective bridge.