Introduction

In the race to achieve Artificial General Intelligence (AGI), reasoning capability is the holy grail. We want Large Language Models (LLMs) that don’t just regurgitate facts but can plan, deduce, reason through complex puzzles, and solve novel problems.

Currently, we face a paradox in training these models. We have massive amounts of data for specific tasks like solving math problems or writing code. Consequently, models are getting quite good at those narrow skills. However, general reasoning—encompassing logical deduction, scientific inference, and symbolic manipulation—suffers from a lack of high-quality, diverse training data. You can train a model on math, but that doesn’t necessarily help it solve a logic puzzle or understand a scientific hypothesis.

So, where can we find a universal “textbook” for reasoning?

A team of researchers from DeepSeek-AI, Shanghai Jiao Tong University, and HKUST proposes an elegant solution in their paper, CODEI/O. Their hypothesis is fascinating: computer code is essentially a crystallized form of reasoning. Every function, loop, and conditional statement represents a logic flow, a decision tree, or a state-space search.

But simply showing the model raw code isn’t enough. The researchers have developed a novel method called Code Input-Output Prediction. Instead of just asking models to write code, they ask models to act as the computer—predicting the output of a function given an input, or reverse-engineering the input given an output.

In this post, we will dive deep into CODEI/O. We’ll explore how transforming code into reasoning tasks can teach LLMs to “think” in natural language, leading to performance gains across a wide variety of benchmarks, from math to common sense.

The Motivation: Why Code?

Before understanding the how, we must understand the why. Why look at code if we want to improve general reasoning?

Traditional pre-training on raw code (like GitHub repositories) has already shown some benefits for reasoning. However, it is noisy. A raw code file contains license headers, comments, import errors, and messy syntax that distract from the core logic. Furthermore, standard “text-to-code” generation tasks constrain the model to focus heavily on syntactic correctness (indents, colons, variable definitions) rather than the underlying problem-solving process.

The authors of CODEI/O argue that contextually grounded code contains universal reasoning primitives:

  • Logic Flow Planning: Structuring a sequence of operations.
  • State-Space Search: Exploring different possibilities to find a solution.
  • Modular Decomposition: Breaking a big problem into small functions.
  • Decision Tree Traversal: Handling if-else conditions.

If we can strip away the syntactic heavy lifting and force the model to verbally explain the execution of code, we can transfer these reasoning skills to natural language tasks.

The Method: Constructing CODEI/O

The core contribution of this paper is a rigorous pipeline that transforms messy, raw code into a structured training dataset designed for reasoning.

Step 1: From Raw Code to Unified Format

The process begins by gathering over 800,000 raw code files from diverse sources, including “CodeMix” (an internal corpus) and “PyEdu-R” (educational reasoning code).

These raw files are not usable as-is. They need to be cleaned and standardized. The researchers use DeepSeek-V2.5 to refactor this code into a Unified Format. This format includes:

  1. Cleaned Reference Code: The core logic, stripped of print statements, plots, or file I/O.
  2. Main Entrypoint: A single function that encapsulates the logic.
  3. Input Generator: A separate Python function designed to create valid, non-trivial test cases for the code.
  4. Query: A natural language description of the problem the code solves.

To visualize this transformation, look at the example below. Notice how messy code dealing with physics calculations is structured into a clean function with a specific input generator.

Table 10: A complete example showing how we transform a raw code file into our designed unified format.
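
To make this concrete, here is a minimal sketch of what a unified-format sample might look like, assuming a simple physics-style function. The function names and schema are illustrative, not the paper’s exact format:

```python
import math
import random

# Query (natural language): "Given an initial speed and a launch angle,
# compute the horizontal range of a projectile with no air resistance."

def main_solution(velocity: float, angle_deg: float) -> float:
    """Main entrypoint: the cleaned reference code, with no printing, plotting, or file I/O."""
    angle_rad = math.radians(angle_deg)
    g = 9.81
    return round(velocity ** 2 * math.sin(2 * angle_rad) / g, 4)

def input_generator() -> dict:
    """Input generator: produces valid, non-trivial test cases for main_solution."""
    return {
        "velocity": round(random.uniform(1.0, 50.0), 2),
        "angle_deg": round(random.uniform(5.0, 85.0), 2),
    }

# Executing the pair yields one ground-truth input-output example:
inputs = input_generator()
outputs = main_solution(**inputs)
print(inputs, "->", outputs)
```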

Step 2: The Prediction Tasks

Once the code is clean and executable, the researchers generate the training data. This is where the “I/O” in CODEI/O comes in.

They execute the code using the Input Generator to create valid Input-Output Pairs. With the code, the query, and these pairs, they construct two distinct training tasks:

  1. Output Prediction: The model is given the code and an input. It must predict the result. This forces the model to simulate the code’s execution step-by-step.
  2. Input Prediction: The model is given the code and an output. It must deduce what input would produce that result. This is often harder and requires reverse-engineering or constraint satisfaction reasoning.

Crucially, the model is not just asked for the final answer. It is trained to produce a Chain-of-Thought (CoT) rationale in natural language.
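
A rough sketch of how the two tasks could be assembled from a single executed sample is shown below. The prompt wording is paraphrased, not the paper’s exact template, and in the real pipeline each answer is paired with a natural-language CoT collected from the teacher model:

```python
import json

def build_samples(query: str, reference_code: str, inputs: dict, outputs) -> list[dict]:
    """Turn one executed (input, output) pair into the two prediction tasks."""
    context = f"Problem: {query}\n\nReference code:\n{reference_code}\n"
    output_prediction = {
        "prompt": context + f"\nGiven the input {json.dumps(inputs)}, predict the output. "
                            "Reason step by step, then give your final answer as JSON.",
        "answer": json.dumps(outputs),
    }
    input_prediction = {
        "prompt": context + f"\nGiven the output {json.dumps(outputs)}, find a feasible input. "
                            "Reason step by step, then give your final answer as JSON.",
        "answer": json.dumps(inputs),
    }
    return [output_prediction, input_prediction]
```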

Figure 1: Overview of our training data construction: Raw code files are gathered from various sources and converted into a unified format. Input-output pairs are then generated by executing the code, while natural language CoTs for predictions are collected from DeepSeek-V2.5. The verified CoTs can undergo optional revisions to further enhance reasoning chains.

As shown in Figure 1, the pipeline is self-verifying. Raw code leads to input-output pairs, which prompt an AI to generate reasoning (CoT). Because we have the actual code, we can verify whether the AI’s prediction is correct simply by running it. This ground-truth verification is a massive advantage over standard text datasets, where verifying reasoning is difficult.
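
Because the reference code is executable, checking a model’s prediction reduces to re-running the code. A minimal sketch of the two checks (the equivalence test for input prediction is my simplification; any input that reproduces the output counts as correct):

```python
def verify_output_prediction(main_solution, inputs: dict, predicted_output) -> bool:
    """Output prediction is correct iff it matches the real execution result."""
    return main_solution(**inputs) == predicted_output

def verify_input_prediction(main_solution, predicted_inputs: dict, target_output) -> bool:
    """Input prediction is correct iff the predicted input reproduces the target output.
    It does not have to equal the original input; any feasible input is accepted."""
    try:
        return main_solution(**predicted_inputs) == target_output
    except Exception:
        return False  # invalid inputs (wrong types, crashes) count as wrong
```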

Step 3: What the Model Actually Sees

Let’s look at what these training samples look like. In Figure 2 below, you can see the difference between the two tasks.

  • Left (Output Prediction): The model traces the “Coin Change” algorithm. It breaks down the greedy approach vs. optimal combinations in natural language before giving the JSON output.
  • Right (Input Prediction): The model is given the result “4” and has to find an input (amount and coins) that requires exactly 4 coins. It tests hypotheses (“Suppose we have coins [1, 2, 5]…”) until it finds a match.

Figure 2: Two examples for the collected responses for input and output prediction respectively.

This “verbal execution” decouples the reasoning from the syntax. The model isn’t worrying about missing a semicolon; it’s worrying about the logic of the algorithm.
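
For concreteness, the “Coin Change” reference code being traced in Figure 2 might look something like this minimal dynamic-programming version (my reconstruction, not the paper’s actual file):

```python
def min_coins(amount: int, coins: list[int]) -> int:
    """Return the minimum number of coins needed to make `amount`, or -1 if impossible."""
    INF = float("inf")
    dp = [0] + [INF] * amount            # dp[a] = fewest coins needed to make amount a
    for a in range(1, amount + 1):
        for c in coins:
            if c <= a and dp[a - c] + 1 < dp[a]:
                dp[a] = dp[a - c] + 1
    return dp[amount] if dp[amount] != INF else -1

# Output prediction: trace min_coins(11, [1, 2, 5]) -> 3   (5 + 5 + 1)
# Input prediction: find (amount, coins) such that min_coins(...) == 4,
# e.g. amount=8, coins=[1, 2] gives 2 + 2 + 2 + 2 -> 4 coins.
```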

Step 4: CODEI/O++ and Multi-Turn Revision

The researchers didn’t stop at the initial predictions. Because code is executable, they can automatically check if the generated training data is correct.

If the teacher model (DeepSeek-V2.5) generates a wrong prediction during data creation, the system captures the error (e.g., “Your input resulted in 3, not 4”). It then feeds this error back to the model as a prompt: “This was wrong, try again.”

This process creates CODEI/O++, a dataset enriched with multi-turn revisions. It teaches the model not just how to reason, but how to self-correct based on feedback.
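
In pseudocode terms, the revision loop might look roughly like this sketch for input prediction, assuming a single revision turn and a hypothetical `generate` call that stands in for the teacher model:

```python
def collect_with_revision(generate, main_solution, prompt: str, target_output, max_turns: int = 1):
    """Collect a CoT for input prediction; if the predicted input does not reproduce
    the target output, feed the execution feedback back for one revision attempt.
    `generate` returns (natural-language CoT, predicted_inputs)."""
    response, predicted_inputs = generate(prompt)          # first attempt (Turn-0)
    turns = [response]
    for _ in range(max_turns):
        actual = main_solution(**predicted_inputs)
        if actual == target_output:
            return turns, True                             # verified correct
        feedback = (f"Your input produced {actual!r}, not {target_output!r}. "
                    "Please revise your answer.")           # execution feedback (Turn-1)
        response, predicted_inputs = generate(prompt + "\n" + feedback)
        turns.append(response)
    return turns, main_solution(**predicted_inputs) == target_output
```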

Figure 7: In multi-turn revision, we track the percentage (%) of each response type across the entire dataset after each revision turn.

Figure 7 illustrates this revision process. In the first turn (Turn-0), about 50% of the predictions are correct. By adding a revision step (Turn-1), a significant chunk of the “Wrong” answers are converted to “Correct.” Interestingly, the researchers found that doing this more than once (Turn-2) yielded diminishing returns, so CODEI/O++ primarily focuses on that first influential revision.

Experiments and Results

Does this method actually work? The researchers tested CODEI/O by using it as a “Stage 1” training step for several popular base models (Qwen 2.5 Coder, LLaMA 3.1, DeepSeek Coder V2, Gemma 2). They then followed up with standard instruction tuning.

They evaluated the models on 14 different benchmarks, covering math (GSM8K, MATH), reasoning (Big-Bench Hard), knowledge (MMLU-STEM), and code (CruxEval).

Main Performance

The results were impressive. CODEI/O consistently outperformed baseline methods.

Table 1: Main evaluation results on all benchmarks.

Take a look at Table 1. The green cells indicate performance improvements over the baseline (standard instruction tuning).

  • Consistency: Unlike other datasets like OpenMathInstruct2 (which is great at math but sometimes hurts performance in other areas, indicated by red cells), CODEI/O provides a balanced lift across almost all categories.
  • Symbolic Reasoning: Look at the BBH (Big-Bench Hard) scores. CODEI/O provides significant gains here, proving that the logic learned from code transfers to general symbolic tasks.
  • Scientific Reasoning: Improvements in GPQA (graduate-level science) and MMLU-STEM suggest that the structured thinking required for code helps in rigorous scientific domains.

Comparison to Other Methods

You might ask: “Is it just because DeepSeek-V2.5 is a smart model and we are distilling its knowledge?”

To test this, the researchers compared CODEI/O against WebInstruct, a massive dataset of 11.6 million samples (much larger than CODEI/O’s 3.5 million). They even created a version of WebInstruct where they regenerated answers using DeepSeek-V2.5 (WI-DS25) to make the comparison fair.

Figure 3: Average scores of Stage 1 training on CODEI/O, a 3.5M WebInstruct subset (WI) and an enhanced version distilled from DeepSeek-V2.5 directly (WI-DS25).

Figure 3 shows the result. Even when the “teacher” model is the same, CODEI/O (the purple and blue bars on the far left) outperforms WebInstruct. This proves that the source of the data—code input/output prediction—is inherently more valuable for learning reasoning than standard web data.

Scaling Laws

Another critical question for any dataset is: “Does more data equal better performance?”

The answer for CODEI/O appears to be yes.

Figure 4: The scaling effect of CODEI/O in the first-stage training.

Figure 4(a) on the left shows the performance of the Qwen model as the size of the CODEI/O dataset increases. The radar chart expands outward on almost all axes—Math, Symbolic, Logic—as the dataset grows from 0.32M to the full 3.52M samples. This suggests that the diversity of reasoning patterns found in code is vast, and the model continues to learn as it sees more examples.

Discussion: Why This Matters

The CODEI/O paper highlights a shift in how we think about training data. For a long time, the assumption was that to get better at math, you train on math. To get better at biology, you train on biology.

CODEI/O suggests that reasoning is a transferable skill. By training a model to trace the execution of a Python script that calculates a trajectory or sorts a list, we are teaching it the distinct mental primitives of:

  1. Variable tracking.
  2. Conditional logic.
  3. Iterative processing.

When the model later encounters a word problem about arranging seating for a wedding (a logic puzzle), it reuses those same neural pathways. It treats the word problem as a “program” to be executed mentally.

Data Leakage?

The authors were careful to check for “data leakage”—the possibility that the model is just memorizing code problems that appear in the test sets. They performed a strict 13-gram overlap analysis.

While they found some overlap in coding benchmarks (like LeetCode), the overlap in general reasoning benchmarks (MMLU, GSM8K, MATH) was near zero (0.1%). Yet, the performance gains in those zero-overlap areas were substantial. This confirms that the model is learning patterns, not just memorizing answers.
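
A 13-gram overlap check of this kind can be sketched in a few lines (a simplified version; the paper’s exact tokenization and thresholds may differ):

```python
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """All word-level n-grams in a text, lowercased."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_rate(train_texts: list[str], test_texts: list[str], n: int = 13) -> float:
    """Fraction of test samples sharing at least one n-gram with the training set."""
    train_grams = set()
    for t in train_texts:
        train_grams |= ngrams(t, n)
    flagged = sum(1 for t in test_texts if ngrams(t, n) & train_grams)
    return flagged / max(len(test_texts), 1)
```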

Conclusion

CODEI/O represents a significant step forward in training Large Language Models. By transforming the silent, structured logic of code into explicit, natural language input-output predictions, the researchers have unlocked a massive, scalable source of training data.

The key takeaways are:

  1. Code is Reasoning: Code is not just for computers; it contains the blueprints for logical thought.
  2. Verbal Execution: Forcing models to explain code execution in natural language (CoT) bridges the gap between syntax and logic.
  3. Broad Transfer: This training improves performance not just in coding, but in math, science, and common sense.
  4. Self-Correction: Using code execution to verify and revise training data (CODEI/O++) further boosts reliability.

As we look toward the future of AI, methods like CODEI/O suggest that the path to smarter models might not lie in reading more books, but in understanding the logic that runs our world—one function at a time.