For decades, the idea of an AI that could write its own code has been a holy grail of computer science. We’ve seen glimpses of this future in science fiction, but in reality, teaching a machine the logic, creativity, and precision required for programming has been an immense challenge.
When large language models (LLMs) like GPT-3 emerged, they revealed a surprising, albeit rudimentary, ability to generate simple code snippets from natural language prompts — even though they weren’t explicitly trained to code.
That led to a provocative question: what if we took a powerful language model and trained it specifically on code?
This is the central idea behind OpenAI’s landmark 2021 paper, Evaluating Large Language Models Trained on Code. The paper introduces Codex — a GPT model fine-tuned on a massive corpus of public code from GitHub. Codex powers the popular GitHub Copilot tool.
The researchers didn’t just build a model; they also created a new way to measure its capabilities, moving beyond fuzzy text-matching to what really matters: does the code actually work?
In this article, we’ll dive deep into the paper:
- How Codex was trained.
- How the researchers built the HumanEval benchmark for “functional correctness.”
- Why generating many possible solutions dramatically boosts problem-solving power.
- What limitations Codex still has.
- The profound broader impacts of unleashing such a tool into the world.
How Do You Grade an AI’s Code?
Before we discuss Codex itself, we need to address a fundamental problem: how do you evaluate AI-generated code?
Traditionally, generative models in NLP are measured using metrics like BLEU score — which compares the output against a reference text and measures overlap of words.
For code, this fails spectacularly.
Consider two Python functions to double a number (the snippets below are illustrative, not taken from the paper):
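```python
# Two illustrative implementations that behave identically for every input.
def double(x):
    return x * 2


def multiply_by_two(number):
    result = number + number
    return result
```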
These are functionally identical — both return the same result for all inputs.
But BLEU would rate them poorly because variable names and structure differ. Conversely, BLEU might reward code with high textual similarity that is functionally broken.
The Codex authors argue for a more practical metric: functional correctness — does the code pass the tests?
They evaluate this via unit tests, using a metric called pass@k, defined as the probability that, if the model generates k different solutions for a problem, at least one of them passes all unit tests.
Formally:
\[ \text{pass@k} := \mathbb{E}_{\text{Problems}} \left[ 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \right] \]

Here:
- \(n\) = total samples generated
- \(c\) = number of correct samples
- \(k\) = number of samples you allow per problem
This gives an unbiased estimate of the probability of solving a problem within k tries.
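Computing the binomial ratio directly can overflow for large n, so in practice the estimate is evaluated as a numerically stable product, along these lines (a minimal numpy sketch of the formula above):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for a single problem.

    n: total samples generated, c: samples that passed all unit tests,
    k: number of attempts allowed.
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    # 1 - C(n-c, k) / C(n, k), expanded as a stable running product
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))
```

Averaging this quantity over all problems gives the benchmark score.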
HumanEval: A Realistic Benchmark
To apply this metric, the team built HumanEval — 164 hand-written programming problems with:
- Function signatures,
- Docstrings (natural language problem descriptions),
- Unit tests (an illustrative example follows this list).
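Put together, a task looks roughly like this (an invented example in the HumanEval style, not one of the actual 164 problems):

```python
def sum_of_evens(numbers: list) -> int:
    """Return the sum of all even integers in numbers.

    >>> sum_of_evens([1, 2, 3, 4])
    6
    """
    ...  # the model sees everything above and must generate this body


# Hidden unit tests decide functional correctness of the completion:
def check(candidate):
    assert candidate([1, 2, 3, 4]) == 6
    assert candidate([]) == 0
    assert candidate([1, 3, 5]) == 0
```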
Why hand-written? Because Codex was trained on a huge slice of GitHub, and they wanted to avoid test problems already existing in the training set.
Problems span language comprehension, algorithms, and basic math — much like simple software interview questions.
All code was run in a secure sandbox built with gVisor, preventing malicious or unsafe code from harming the host or network.
Building and Training Codex
From GPT-3 to Codex
Codex starts from the GPT-3 model family. The idea: since HumanEval’s docstrings are in English, starting with a strong natural language model could help it understand prompts.
In May 2020, the team collected 159 GB of unique Python files from 54 million public GitHub repositories. They filtered out auto-generated files, abnormal formatting, and files with too little code.
For efficiency, they modified the GPT-3 tokenizer — adding special tokens for whitespace runs — reducing token count by about 30%.
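The intuition: Python indentation produces long runs of spaces, and representing each run as a single token shortens sequences considerably. A toy illustration of the idea (not the real BPE tokenizer; the token names are made up):

```python
import re

def encode_whitespace_runs(line: str) -> list:
    """Toy encoder: collapse each run of spaces into one '<ws_N>' token."""
    tokens = []
    for piece in re.split(r"( +)", line):
        if piece.startswith(" "):
            tokens.append(f"<ws_{len(piece)}>")  # whole run becomes one token
        elif piece:
            tokens.append(piece)
    return tokens

print(encode_whitespace_runs("        return x * 2"))
# ['<ws_8>', 'return', '<ws_1>', 'x', '<ws_1>', '*', '<ws_1>', '2']
```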
Scaling Laws Hold for Code
One of GPT-3’s major findings was that performance improves predictably with model size, following a power law. Codex shows the same pattern for code:
Bigger models trained on more code → lower loss → better performance.
Sampling: Codex’s Secret Weapon
While single-shot generation (pass@1) is important, the real boost comes from generating many solutions (pass@100).
This mirrors human programming — try multiple approaches, fix bugs, iterate.
Temperature Tuning
Temperature controls output randomness:
- Low temperature (e.g., 0.2) → deterministic, best single guess.
- High temperature (e.g., 0.8) → more creative, diverse outputs (see the sampling sketch below).
For Codex:
- pass@1 → best at T=0.2.
- pass@100 → best at T=0.8; diversity matters.
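Mechanically, temperature rescales the model’s next-token distribution before sampling. A minimal, generic sketch of the mechanism (standard softmax sampling, not Codex’s actual decoding code):

```python
import numpy as np

def sample_token(logits: np.ndarray, temperature: float, rng=None) -> int:
    """Sample a token index from logits divided by the temperature.

    Low temperature sharpens the distribution toward the top choice;
    high temperature flattens it, producing more diverse samples.
    """
    rng = rng or np.random.default_rng()
    scaled = logits / max(temperature, 1e-8)
    probs = np.exp(scaled - scaled.max())  # subtract max for stability
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```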
Size vs. Success
The largest Codex — 12B parameters — scores:
- 28.8% pass@1
- 72.3% pass@100
Choosing the Best Without Tests
In production (e.g., Copilot), you can’t show the user 100 outputs. Without the unit tests acting as an oracle to pick the winner, the team tried ranking heuristics:
- Highest mean log probability — model’s most confident solution wins.
- This beats random choice and is practical for deployment (a minimal sketch follows this list).
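A sketch of that heuristic (the data format here is assumed: pairs of generated code and per-token log probabilities from whatever sampling API you use):

```python
def mean_logprob(token_logprobs: list) -> float:
    """Average log probability per generated token."""
    return sum(token_logprobs) / max(len(token_logprobs), 1)

def pick_most_confident(samples: list) -> str:
    """Given (code, token_logprobs) pairs, return the candidate the model
    is most confident about, i.e. the highest mean token log probability."""
    best_code, _ = max(samples, key=lambda s: mean_logprob(s[1]))
    return best_code
```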
BLEU Score’s Final Defeat
Correct and incorrect solutions have overlapping BLEU score distributions, strong evidence that BLEU can’t reliably measure functional correctness.
Codex vs. Other Models
Codex dwarfs the competition:
Codex-300M scores similarly to GPT-J-6B — with 20× fewer parameters.
Codex-S: Supervised Fine-Tuning Pays Off
Raw GitHub contains mixed content. The team curated ~50,000 high-quality problems from:
- Competitive programming sites — strong specs, hidden tests.
- Open-source projects using continuous integration, where function inputs and outputs were traced during test runs to create new problems.
Fine-tuning Codex on these yielded Codex-S:
- 12B Codex-S → 37.7% pass@1, 77.5% pass@100.
Codex-S also benefits from higher sampling temperatures than base Codex when generating many samples.
APPS Dataset Generalization
Codex-12B performs comparably to GPT-Neo fine-tuned on APPS — despite Codex not being trained on APPS format problems.
Codex-D: Explaining Code with Docstrings
The reverse task, generating docstrings from code, is useful for documentation and code comprehension.
Codex-D, trained for this, had hand-graded accuracy slightly below Codex-S’s code generation performance.
Limitations
Data Inefficiency
Codex was trained on hundreds of millions of lines of code, yet it struggles with problems a CS student might solve easily.
Long Sequential Instructions
Performance drops roughly exponentially as the number of chained operations in a docstring grows.
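For example (my own illustrative chain, not a prompt from the paper), a docstring like the following strings several simple operations together, and each extra step lowers the chance that all of them are applied correctly:

```python
def transform(s: str) -> str:
    """Convert s to lowercase, remove all vowels, replace each space
    with an underscore, then reverse the result.
    """
    # Codex must compose four simple steps; its pass rate falls sharply
    # as more such steps are chained in the docstring.
```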
Variable Binding Errors
Complex prompts with multiple variables/operations can cause misbinding — applying the wrong transformation to the wrong variable.
Broader Impacts & Risks
The paper devotes a large section to hazards:
Over-Reliance
Novices may trust Codex blindly. Code can look correct but hide bugs or security flaws.
Misalignment
Codex predicts the next token like its training data — not necessarily aligned with user intent.
If shown buggy code, it may imitate that flawed style.
Security Issues
Codex inherits vulnerabilities from training data. For example, it often suggests insecure cryptographic configurations.
Bias & Representation
It can generate code/comments reflecting societal biases present in source data.
Economic & Labor Market Impacts
Potential productivity boosts — but also possible job displacement or workflow changes.
Conclusion: A New Era of Programming
Codex is a milestone toward AI-assisted programming, showing:
- Functional correctness measured with unit tests beats fuzzy text-matching scores like BLEU.
- Scaling laws apply to code — bigger models yield better results.
- Sampling many candidates at high temperature, combined with smart ranking, dramatically boosts performance on tough problems.
- Specialized fine-tuning (Codex-S) greatly improves efficiency.
Codex isn’t replacing programmers — it’s augmenting them. As a powerful collaborator, it can handle boilerplate, offer solutions, accelerate development, and even aid learning.
But the power comes with responsibility: review and verify outputs, guard against over-reliance, misalignment, and bias.
The journey toward a reliable AI coding partner has begun. How we integrate — and regulate — these tools will define the next chapter of software development.