Why Thinking in Python Makes LLMs Smarter: The Power of Code Prompting
If you have ever tried to navigate a complex legal document or determine your eligibility for a visa, you know that the logic involved is rarely straightforward. It is a maze of conditional statements: “If you are over 18, AND you have lived here for 5 years, OR you are married to a citizen, THEN…”
This is known as conditional reasoning, and it is a fundamental component of human intelligence. For Large Language Models (LLMs), however, it can be a significant stumbling block. While models like GPT-4 are impressive, they often hallucinate or lose track of logic when faced with long chains of conditions buried in natural language text.
But what if we changed the language we use to ask the question?
A fascinating research paper, “Code Prompting Elicits Conditional Reasoning Abilities in Text+Code LLMs,” proposes a counter-intuitive solution. The researchers discovered that if you translate a natural language problem into code (specifically Python-like pseudocode) and feed it back to the LLM, the model’s reasoning performance skyrockets—even if the code is never actually executed by a computer.
In this post, we will break down how “Code Prompting” works, why it triggers specific reasoning circuits in LLMs, and the experiments that prove code is sometimes worth a thousand words.
The Problem: The Ambiguity of Text
Reasoning in natural language is messy. Words can be ambiguous, sentences can be convoluted, and tracking the state of multiple entities (like “the applicant,” “the spouse,” “the document”) across several paragraphs requires significant cognitive load.
Recent advancements in Text+Code LLMs (models trained on both natural language and programming languages, like GPT-3.5, Mistral, and Mixtral) have shown that exposure to code improves general reasoning. However, most techniques that leverage code, such as “Program of Thoughts,” rely on an external Python interpreter to run the code and get the answer. This treats the LLM as a parser, offloading the actual thinking to the computer.
The authors of this paper ask a different question: Can the code representation itself trigger better reasoning within the LLM, without external execution?
The Solution: Code Prompting
The researchers introduce a methodology called Code Prompting. The core idea is to transform a natural language task into a code-based representation, prompting the model to “think” in structured logic.
As illustrated in Figure 1 below, the process involves a two-step pipeline:
- Transformation: The model takes a Natural Language (NL) problem and converts it into a “Code Prompt.”
- Inference: This code is fed back into the LLM, which processes the code to generate the final answer in natural language.
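To make the two-step pipeline concrete, here is a minimal sketch of how it might be wired together in Python. The prompt wording and the call_llm helper are assumptions for illustration, not the paper's actual implementation.

```python
# Minimal sketch of the two-step code prompting pipeline.
# `call_llm` is a hypothetical helper standing in for any text+code LLM
# API call; replace it with your own client.

def call_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to a text+code LLM and return its reply."""
    raise NotImplementedError("Plug in your LLM client here.")

def code_prompting(document: str, question: str) -> str:
    # Step 1: Transformation -- ask the model to rewrite the NL problem
    # as Python-like pseudocode, keeping the original text as comments.
    transform_prompt = (
        "Translate the following document and question into Python-like "
        "pseudocode. Keep every original sentence as a comment and express "
        "the rules as variables and if/else statements.\n\n"
        f"Document:\n{document}\n\nQuestion:\n{question}"
    )
    code_prompt = call_llm(transform_prompt)

    # Step 2: Inference -- feed the generated code back to the same model
    # and ask for a natural language answer. The code is never executed.
    answer_prompt = (
        f"{code_prompt}\n\n"
        "# Based on the logic above, answer the original question "
        "in natural language."
    )
    return call_llm(answer_prompt)
```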

What Does the “Code” Look Like?
The generated code isn’t necessarily meant to be executed at all. It is a structured representation of the logic. It typically includes:
- Variable Assignments: Converting entities into variables (e.g., husband_pass_away = True).
- Logical Structures: Using if/else statements with and/or operators to map out the rules found in the text.
- Comments: Crucially, the code retains the original natural language text as comments, ensuring no semantic information is lost.
Figure 2 demonstrates this transformation on a question about funeral expenses. Notice how the messy text rules (“you are eligible if…”) are converted into clean, logical if blocks.

By forcing the problem into this format, the LLM is explicitly made to identify variables and the logical relationships between them before it attempts to answer the question.
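As a rough illustration of that format (invented for this post, not the paper's exact Figure 2 output), a funeral-expenses rule might be translated into something like this:

```python
# Illustrative code prompt for a made-up funeral expenses rule.
# The original sentences survive as comments; the rules become variables
# and conditionals. The model reads this code; nothing is executed.

# "You may be eligible for a Funeral Expenses Payment if your husband,
#  wife or civil partner passed away."
husband_passed_away = True          # stated in the user's scenario

# "You must be receiving a qualifying benefit to get the payment."
receives_qualifying_benefit = True  # stated in the user's scenario

# "You cannot claim if another close relative has already claimed."
other_relative_claimed = False      # not mentioned, assumed False

# Eligibility logic mapped from the rules above.
if husband_passed_away and receives_qualifying_benefit and not other_relative_claimed:
    eligible_for_funeral_payment = True
else:
    eligible_for_funeral_payment = False
```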
Does It Work? The Experimental Results
To test this, the researchers evaluated Text Prompts vs. Code Prompts across three challenging datasets:
- ConditionalQA: A reading comprehension dataset requiring answers based on complex scenarios.
- BoardgameQA: A dataset involving conflicting rules and preferences, highly dependent on logic.
- ShARC: A conversational dataset dealing with natural language rules.
They tested three prominent models: GPT-3.5, Mixtral 8x7B, and Mistral 7B.
Significant Performance Gains
The results were striking. As shown in Table 1, Code Prompting consistently outperformed standard Text Prompting.

Key Takeaways from the Results:
- Huge Boosts: GPT-3.5 saw a massive gain (up to 22.5 points) on the logic-heavy BoardgameQA (BGQA) datasets.
- Consistency: The improvement held true across different model sizes and architectures (Mistral and Mixtral also saw gains).
- Complexity Matters: The gap between Code and Text prompting was widest on the most difficult datasets (BGQA-3), suggesting that the harder the reasoning task, the more beneficial the code structure becomes.
Reduced Uncertainty
Why is Text Prompting failing? A look at the confusion matrices (Figure 4) reveals an interesting pattern.

In difficult datasets like BoardgameQA, text prompts frequently defaulted to predicting “not enough information” (the lighter/white squares in the text columns). The models were hesitant. Code Prompts, however, reduced this uncertainty, allowing the models to correctly identify “Yes” or “No” answers much more frequently.
Why Does Code Prompting Work?
This is the most scientifically interesting part of the paper. Is it just because code is shorter? Is it the syntax? Or is it something deeper about how LLMs “think”? The researchers performed extensive ablation studies on GPT-3.5 to find out.
1. It’s Not Just Text Simplification
You might argue that converting text to code simply removes fluff, making the problem easier to read. To test this, the authors compared Code Prompting against:
- Atomic Statements: Breaking text into simple, declarative bullet points.
- Back-Translated Code: Converting the generated code back into clear, structured natural language (e.g., “If variable X is true…”).

Table 2 shows the results. Both alternative text methods performed worse than the Code Prompts (indicated by the negative numbers). This suggests that the syntax of code itself—the brackets, the indentation, the == operators—triggers specific reasoning capabilities that natural language structure does not.
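To make the comparison concrete, here is a hand-written sketch of the same toy rule in the three formats; the wording is illustrative, not copied from the paper's prompts.

```python
# The same toy rule written in the three formats compared in the ablation.
# Only `code_prompt` is a code prompt; the first two are text baselines.

atomic_statements = """
- The applicant's husband passed away.
- The applicant receives a qualifying benefit.
- An applicant is eligible if both of the above are true.
"""

back_translated_code = """
Set husband_passed_away to true.
Set receives_qualifying_benefit to true.
If husband_passed_away is true and receives_qualifying_benefit is true,
then eligible is true; otherwise eligible is false.
"""

code_prompt = """
husband_passed_away = True
receives_qualifying_benefit = True
if husband_passed_away and receives_qualifying_benefit:
    eligible = True
else:
    eligible = False
"""
```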
2. Semantics Matter (You Can’t Just Fake It)
Does the code actually have to make sense? The researchers tried confusing the model by:
- Anonymizing Code: Renaming variables to var_1, var_2, etc.
- Random Code: Inserting random, irrelevant code logic while keeping the natural language comments.

As Table 3 reveals, performance drops significantly when the code is randomized or anonymized. This proves that the semantic link between the natural language concepts and the code variables is vital. The model relies on the meaningful variable names (like husband_passed_away) to track the logic.
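For instance, anonymization removes exactly the cue the model needs to connect the code back to the scenario. The toy snippet below (invented for illustration) contrasts the two versions:

```python
# Meaningful names: the identifiers and comments point at the same
# real-world facts, so the model can link code, question, and scenario.
husband_passed_away = True
receives_qualifying_benefit = True
eligible = husband_passed_away and receives_qualifying_benefit

# Anonymized ablation: identical logic, but the names no longer tell the
# model which fact each variable encodes, and performance drops.
var_1 = True
var_2 = True
var_3 = var_1 and var_2
```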
3. Sample Efficiency
One of the most practical benefits of Code Prompting is that it helps the model learn faster. In the world of LLMs, “learning” often means In-Context Learning (providing examples in the prompt).
Figure 3 illustrates that Code Prompting with just one demonstration (1 Dem./Class) often outperforms Text Prompting with three demonstrations. This makes Code Prompting highly efficient for scenarios where you have a limited context window or few examples to provide.
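One plausible way to lay out a 1-shot code prompt is sketched below; the formatting is an assumption about how such a demonstration could be structured, not the paper's verbatim prompt.

```python
# Sketch of a 1-shot code prompt: one solved demonstration followed by
# the new, unsolved instance in the same code representation.

demonstration = """
# Demonstration
is_over_18 = True
lived_here_5_years = False
married_to_citizen = True
if is_over_18 and (lived_here_5_years or married_to_citizen):
    can_apply = True
# Question: Can the applicant apply?
# Answer: Yes
"""

new_instance = """
# New instance
husband_passed_away = True
receives_qualifying_benefit = False
if husband_passed_away and receives_qualifying_benefit:
    eligible = True
# Question: Is the applicant eligible for the payment?
# Answer:
"""

one_shot_prompt = demonstration + new_instance  # sent to the LLM as one prompt
```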

4. The “State Tracking” Hypothesis
Perhaps the most profound insight is State Tracking.
In programming, you often define a variable at line 1 (x = 5), write a hundred lines of other code, and then reference x again. Code LLMs are heavily trained to track the “state” of these variables across long distances. Natural language models, however, tend to focus on local context (the words immediately surrounding the current sentence).
The researchers hypothesized that Code Prompting activates this “variable tracking” circuit. To prove it, they interrupted the models mid-generation and probed them: “What is the current value of variable X?”

The results in Table 4 are staggering.
- Look at the Correct Ans / Text column: Text prompts had a memory error rate of 71.08% on ConditionalQA. This means even when the text model got the answer right, it often couldn’t correctly recall the specific facts that led there (suggesting it might have been guessing).
- In contrast, Correct Ans / Code had an error rate of only 4.39%.
This confirms that Code Prompting allows the LLM to accurately track the state of entities and conditions throughout the reasoning process, essentially giving the model a better “working memory.”
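One way to picture the probing setup is the sketch below: pause the model mid-reasoning, ask it for a variable's current value, and compare the answer with the ground truth. The call_llm helper (the same hypothetical stub as in the earlier sketch) and the probe wording are assumptions, not the authors' evaluation code.

```python
# Hypothetical sketch of the state-tracking probe: interrupt the model's
# reasoning, ask for a variable's current value, and score the answer
# against the value implied by the scenario.

def probe_variable_state(call_llm, partial_reasoning: str,
                         variable: str, true_value: bool) -> bool:
    """Return True if the model correctly recalls the variable's value."""
    probe_prompt = (
        f"{partial_reasoning}\n\n"
        f"# Before continuing: what is the current value of {variable}? "
        "Answer True or False."
    )
    prediction = call_llm(probe_prompt).strip().lower()
    return prediction.startswith(str(true_value).lower())
```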
Conclusion and Implications
The paper “Code Prompting Elicits Conditional Reasoning Abilities in Text+Code LLMs” offers a compelling glimpse into the black box of Large Language Models. It turns out that for models trained on code, programming languages are not just a tool for building software—they are a tool for thinking.
By casting natural language problems into the rigid, state-based structure of code, we can unlock superior reasoning capabilities in LLMs. This approach:
- Outperforms standard text prompting on complex logical tasks.
- Improves the model’s ability to track variable states and facts.
- Requires valid syntax and semantics—you cannot simply dress text up as code; the logic must be sound.
For students and practitioners in AI, this suggests that the future of prompt engineering might not be in writing better English sentences, but in writing better pseudocode. If you need an LLM to solve a logic puzzle, don’t just ask it to “think step by step.” Ask it to write a program—even if you never intend to run it.