For decades, the idea of an AI that could write its own code has been a holy grail of computer science. We’ve seen glimpses of this future in science fiction, but in reality, teaching a machine the logic, creativity, and precision required for programming has been an immense challenge.

When large language models (LLMs) like GPT-3 emerged, they revealed a surprising, albeit rudimentary, ability to generate simple code snippets from natural language prompts — even though they weren’t explicitly trained to code.

That led to a provocative question: what if we took a powerful language model and trained it specifically on code?

This is the central idea behind OpenAI’s landmark 2021 paper, Evaluating Large Language Models Trained on Code. The paper introduces Codex — a GPT model fine-tuned on a massive corpus of public code from GitHub. Codex powers the popular GitHub Copilot tool.

The researchers didn’t just build a model; they also created a new way to measure its capabilities, moving beyond fuzzy text-matching to what really matters: does the code actually work?

In this article, we’ll dive deep into the paper:

  • How Codex was trained.
  • How the researchers built the HumanEval benchmark for “functional correctness.”
  • Why generating many possible solutions dramatically boosts problem-solving power.
  • What limitations Codex still has.
  • The profound broader impacts of unleashing such a tool into the world.

How Do You Grade an AI’s Code?

Before we discuss Codex itself, we need to address a fundamental problem: how do you evaluate AI-generated code?

Traditionally, generative models in NLP are evaluated with metrics like the BLEU score, which rates an output by its word (n-gram) overlap with a reference text.

For code, this fails spectacularly.

Consider two Python functions to double a number:

# Reference Solution
def double(x):
    return x * 2

# AI-Generated Solution
def double_number(num):
    result = num + num
    return result

These are functionally identical — both return the same result for all inputs.
But BLEU would rate them poorly because variable names and structure differ. Conversely, BLEU might reward code with high textual similarity that is functionally broken.
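
A quick check makes the point concrete: the two implementations agree on every input we throw at them, even though their text barely overlaps.

# Sanity check: the reference and the AI-generated solution behave identically.
for x in [-3, 0, 1, 7, 1000]:
    assert double(x) == double_number(x) == 2 * x
print("All checks passed")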

The Codex authors argue for a more practical metric: functional correctness — does the code pass the tests?

They evaluate this via unit tests, using a metric called pass@k, defined as the probability that if the model generates k different solutions for a problem, at least one of them passes all unit tests.

Formally:

\[ \text{pass@k} := \mathbb{E}_{\text{Problems}} \left[ 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \right] \]

Equation for the unbiased estimator of pass@k.

Here:

  • \(n\) = total samples generated
  • \(c\) = number of correct samples
  • \(k\) = number of samples you allow per problem

This gives an unbiased estimate of the probability of solving a problem within k tries.

The paper also provides a numerically stable Python script for calculating this unbiased estimate, since computing the binomial coefficients directly produces very large numbers and numerical instability.
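
A sketch along those lines, rewriting the ratio of binomial coefficients as a running product so the computation stays stable (this mirrors the paper's approach; the variable names are ours):

import numpy as np

def pass_at_k(n, c, k):
    """Unbiased estimate of pass@k for a single problem.

    n: total number of samples generated
    c: number of samples that passed all unit tests
    k: number of samples the model is allowed per problem
    """
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    # 1 - C(n-c, k) / C(n, k), computed as a product to avoid huge factorials.
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 200 samples per problem, 18 of them correct.
print(pass_at_k(n=200, c=18, k=1))    # 0.09
print(pass_at_k(n=200, c=18, k=100))  # close to 1.0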


HumanEval: A Realistic Benchmark

To apply this metric, the team built HumanEval — 164 hand-written programming problems with:

  • Function signatures,
  • Docstrings (natural language problem descriptions),
  • Unit tests.
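
To make the format concrete, here is a hypothetical problem written in the HumanEval style (it is not from the actual dataset). The model sees the signature and docstring, produces the body, and hidden unit tests decide whether the completion counts as correct.

# Prompt: function signature plus a docstring describing the task.
def running_max(nums):
    """Given a list of integers, return a list where each element is the
    maximum of all elements seen so far.
    >>> running_max([1, 3, 2, 5, 4])
    [1, 3, 3, 5, 5]
    """
    # Model completion starts here.
    result, best = [], None
    for n in nums:
        best = n if best is None else max(best, n)
        result.append(best)
    return result

# Hidden unit tests used to judge functional correctness.
def check(candidate):
    assert candidate([1, 3, 2, 5, 4]) == [1, 3, 3, 5, 5]
    assert candidate([]) == []
    assert candidate([-2, -5, -1]) == [-2, -2, -1]

check(running_max)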

Why hand-written? Because Codex was trained on a huge slice of GitHub, and they wanted to avoid test problems already existing in the training set.

Problems span language comprehension, algorithms, and basic math — much like simple software interview questions.

Three example problems from the HumanEval dataset, showing the prompt (white) and a correct model completion (yellow). The problems vary in difficulty.

All code was run in a secure sandbox built with gVisor, preventing malicious or unsafe code from harming the host or network.


Building and Training Codex

From GPT-3 to Codex

Codex starts from the GPT-3 model family. The idea: since HumanEval’s docstrings are in English, starting with a strong natural language model could help it understand prompts.

In May 2020, the team collected 159 GB of unique Python files from 54 million public GitHub repositories. They filtered out auto-generated files, abnormal formatting, and files with too little code.

For efficiency, they modified the GPT-3 tokenizer — adding special tokens for whitespace runs — reducing token count by about 30%.
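
A toy sketch of the idea (a hypothetical tokenizer, not GPT-3's actual BPE): giving each run of spaces its own token means long indentation no longer costs one token per space.

import re

def space_run_tokens(line, max_run=32):
    """Toy tokenizer: one token per whitespace run, one per word/symbol chunk."""
    tokens = []
    for match in re.finditer(r" +|\S+", line):
        chunk = match.group()
        if chunk.startswith(" "):
            tokens.append(f"<space_{min(len(chunk), max_run)}>")
        else:
            tokens.append(chunk)
    return tokens

print(space_run_tokens("        return result"))
# ['<space_8>', 'return', '<space_1>', 'result']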


Scaling Laws Hold for Code

One of GPT-3’s major findings was that larger models improve predictably, with test loss following a power law in model size. Codex shows the same pattern for code:

A log-log plot showing that as the number of non-embedding parameters in Codex increases, the test loss on a held-out code dataset decreases in a smooth power-law trend.

Bigger models trained on more code → lower loss → better performance.
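
Schematically, the relationship has the familiar scaling-law form (a generic fit with constants \(N_c\) and \(\alpha\); these are not the paper's exact values):

\[ L(N) \approx \left( \frac{N_c}{N} \right)^{\alpha} \]

where \(N\) is the number of non-embedding parameters and \(L\) is the test loss on held-out code.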


Sampling: Codex’s Secret Weapon

While single-shot generation (pass@1) is important, the real boost comes from generating many solutions (pass@100).

This mirrors human programming — try multiple approaches, fix bugs, iterate.


Temperature Tuning

Temperature controls output randomness:

  • Low temperature (e.g., 0.2) → deterministic, best single guess.
  • High temperature (e.g., 0.8) → more creative, diverse outputs.

For Codex:

  • pass@1 → best at T=0.2.
  • pass@100 → best at T=0.8. Diversity matters.

The top panel shows that for a higher number of samples (k), a higher temperature leads to a better pass@k rate. The bottom panel plots the optimal temperature for each value of k, showing a clear upward trend.
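
A minimal sketch of what temperature does at decoding time (illustrative only, not Codex's actual decoder): logits are divided by the temperature before the softmax, so low temperatures sharpen the distribution and high temperatures flatten it.

import numpy as np

def sample_token(logits, temperature=0.8, rng=np.random.default_rng(0)):
    """Sample one token id from temperature-scaled logits."""
    scaled = np.asarray(logits, dtype=float) / max(temperature, 1e-8)
    probs = np.exp(scaled - scaled.max())   # numerically stable softmax
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

logits = [2.0, 1.0, 0.5, -1.0]
print([sample_token(logits, temperature=0.2) for _ in range(5)])  # almost always token 0
print([sample_token(logits, temperature=0.8) for _ in range(5)])  # more varied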


Size vs. Success

This plot shows the pass@1 and pass@100 rates for Codex models of different sizes. Both metrics improve with model size, but pass@100 (orange line) shows much higher performance, reaching over 70% for the largest model.

The largest Codex — 12B parameters — scores:

  • 28.8% pass@1
  • 72.3% pass@100

Choosing the Best Without Tests

In production (e.g., Copilot), you can’t show a user 100 candidates, and you don’t have unit tests to act as an “oracle.” The team therefore tried heuristics for picking a single sample:

  • Highest mean log probability — model’s most confident solution wins.
  • This beats random choice and is practical for deployment.

This chart compares different heuristics for selecting one sample out of k. The “Oracle” (blue) represents the theoretical maximum performance. Ranking by “Mean logp” (red) is significantly better than “Random” (purple).
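
A minimal sketch of the mean log-probability heuristic, assuming each sampled completion comes with its per-token log probabilities (the data layout here is hypothetical):

def rank_by_mean_logprob(samples):
    """samples: list of (code_str, token_logprobs) pairs.
    Sort so the model's most confident completion comes first."""
    def mean_logprob(item):
        _, logprobs = item
        return sum(logprobs) / len(logprobs)
    return sorted(samples, key=mean_logprob, reverse=True)

candidates = [
    ("return x * 2", [-0.1, -0.2, -0.1, -0.3]),
    ("return x ** 2", [-0.5, -1.2, -0.9, -1.1]),
]
best_code, _ = rank_by_mean_logprob(candidates)[0]
print(best_code)   # "return x * 2"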


BLEU Score’s Final Defeat

Correct vs. incorrect solutions have overlapping BLEU score distributions — proof that BLEU can’t reliably measure correctness.

These four plots show the distribution of BLEU scores for correct (blue) and incorrect (green) solutions on four different HumanEval problems. The significant overlap shows that BLEU score cannot reliably separate working code from broken code.


Codex vs. Other Models

Codex dwarfs the competition:

This table compares the pass@k rates for various sizes of Codex, GPT-Neo, GPT-J, and TabNine on the HumanEval dataset. Codex models consistently outperform others, and show strong scaling with size.

Codex-300M scores similarly to GPT-J-6B — with 20× fewer parameters.


Codex-S: Supervised Fine-Tuning Pays Off

Raw GitHub contains mixed content. The team curated ~50,000 high-quality problems from:

  1. Competitive programming sites — strong specs, hidden tests.
  2. CI projects — traced inputs/outputs to create new problems.

Fine-tuning Codex on these yielded Codex-S:

  • 12B Codex-S → 37.7% pass@1, 77.5% pass@100.

The top plot compares Codex-S (solid lines) with Codex (dashed lines), showing that Codex-S achieves higher pass rates across all model sizes. The bottom plot shows that ranking heuristics like mean logp are also more effective for Codex-S.

Codex-S benefits from higher sampling temperatures:

This plot shows the optimal sampling temperature for Codex and Codex-S. Codex-S (orange) consistently requires a higher temperature than Codex (blue) for any given number of samples k > 1.


APPS Dataset Generalization

Codex-12B performs comparably to GPT-Neo fine-tuned on APPS — despite Codex not being trained on APPS format problems.

This table shows the performance of Codex-12B on the APPS dataset across different difficulty levels. Sampling and filtering are shown to be highly effective, especially for introductory problems.


Codex-D: Explaining Code with Docstrings

The reverse task, generating a docstring from a function body, is useful for documentation and code comprehension.

Codex-D, trained for this, had hand-graded accuracy slightly below Codex-S’s code generation performance.

This table shows the hand-graded pass rates for the docstring-generating model, Codex-D.


Limitations

Data Inefficiency

Codex was trained on hundreds of millions of lines of code, yet it still struggles with problems that a CS student could solve easily.

Long Sequential Instructions

Performance drops exponentially with chained operations:

This chart shows that the pass rate of Codex-12B drops sharply as the number of chained components in a synthetic problem’s docstring increases.
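
The synthetic problems chain simple operations together in a single docstring; a hypothetical three-component example (not taken from the paper) looks like this:

def transform(s):
    """Remove all vowels from s, then reverse the result, then convert it to uppercase."""
    no_vowels = "".join(ch for ch in s if ch.lower() not in "aeiou")
    return no_vowels[::-1].upper()

print(transform("hello world"))   # "DLRW LLH"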

Variable Binding Errors

Complex prompts with multiple variables/operations can cause misbinding — applying the wrong transformation to the wrong variable.


Broader Impacts & Risks

The paper devotes a large section to hazards:

Over-Reliance

Novices may trust Codex blindly. Code can look correct but hide bugs or security flaws.

Misalignment

Codex predicts the next token to match the patterns in its training data, which is not necessarily aligned with the user’s intent.

If shown buggy code, it may imitate that flawed style.

This plot demonstrates misalignment. When the prompt contains examples with subtle bugs (orange), Codex’s performance drops compared to when it sees correct examples (green) or no examples (gray). This gap widens with model size, suggesting bigger models get better at imitating bad patterns.
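
A hypothetical illustration of the effect: when the surrounding context contains a subtly buggy example, an imitative model may continue the buggy pattern rather than what the user actually wants.

# Prompt context containing a subtle off-by-one bug (the loop misses the last element):
def sum_first_n(nums, n):
    total = 0
    for i in range(n - 1):      # BUG: should be range(n)
        total += nums[i]
    return total

# The user now asks for a similar function. A purely imitative model may
# reproduce the same flawed pattern instead of the intended behavior:
def product_first_n(nums, n):
    prod = 1
    for i in range(n - 1):      # imitated bug
        prod *= nums[i]
    return prod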


Security Issues

Codex inherits vulnerabilities from training data. For example, it often suggests insecure cryptographic configurations.

This plot shows that when asked to create encryption keys, Codex models suggest clearly insecure configurations (e.g., RSA keys shorter than 2048 bits) in a significant fraction of cases.
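
As a hypothetical illustration of the kind of completion the authors measure (not an output quoted from the paper; it uses the third-party cryptography package):

from cryptography.hazmat.primitives.asymmetric import rsa

# Insecure pattern often found in older code: a 1024-bit RSA key.
weak_key = rsa.generate_private_key(public_exponent=65537, key_size=1024)

# Current guidance: use at least 2048 bits.
strong_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)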


Bias & Representation

It can generate code/comments reflecting societal biases present in source data.

Economic & Labor Market Impacts

Potential productivity boosts — but also possible job displacement or workflow changes.


Conclusion: A New Era of Programming

Codex is a milestone toward AI-assisted programming, showing:

  1. Functional correctness measured with unit tests beats fuzzy text-matching scores like BLEU.
  2. Scaling laws apply to code — bigger models yield better results.
  3. Sampling many solutions at a high temperature, combined with smart ranking, is a breakthrough for tough problems.
  4. Specialized fine-tuning (Codex-S) delivers a further large jump in performance.

Codex isn’t replacing programmers — it’s augmenting them. As a powerful collaborator, it can handle boilerplate, offer solutions, accelerate development, and even aid learning.

But the power comes with responsibility: review and verify outputs, guard against over-reliance, misalignment, and bias.

The journey toward a reliable AI coding partner has begun. How we integrate — and regulate — these tools will define the next chapter of software development.