Large language models (LLMs) have shown astonishing capabilities: writing code, composing essays, and answering complex questions. Much of that success rests on few-shot learning—showing a model a few examples in the prompt and letting it generalize. But few-shot prompting has drawbacks: you need examples, and you often must engineer the prompt carefully.

What if we could simply tell a model, in plain English, what we want it to do—and have it do it well without any examples? That’s the core question of “Finetuned Language Models Are Zero-Shot Learners” (Google Research). The paper shows that a surprisingly simple trick—instruction tuning—turns large pretrained models into strong zero-shot learners. The instruction-tuned model, FLAN (Finetuned Language Net), improves zero-shot performance across many tasks and even beats zero-shot GPT-3 (175B) on 20 of the 25 datasets evaluated.

In this post I’ll walk through the idea, how the authors built FLAN, the evidence that instruction tuning works, and the key lessons you can take away.

Three paradigms for adapting language models

There are three common ways to turn a pretrained LM into a useful system for downstream tasks. The differences are important for understanding where instruction tuning fits.

A diagram comparing three approaches to adapting language models: Pretrain-Finetune, Prompting, and Instruction Tuning.

Figure 1: A comparison of the three main paradigms for adapting language models. Instruction tuning combines the strengths of finetuning and prompting.

  • Pretrain → finetune: take a pretrained model and finetune it on one labeled task (BERT, T5). Highly effective for that task, but you need a separate finetuned model per task and lots of labeled data.
  • Prompting: keep model weights fixed and steer behavior with text prompts (GPT-3). Zero-shot prompting describes the task in language; few-shot prompting adds a handful of examples. Few-shot often works much better than zero-shot, but it requires careful prompt design.
  • Instruction tuning (FLAN): finetune a single model on many tasks expressed as natural language instructions. The goal is to teach the model the skill of following instructions so it generalizes to new, unseen tasks without examples.

Instruction tuning is essentially a hybrid: it uses supervised labels from many datasets (like finetuning), but every task is presented as a natural-language instruction (like prompting). The experiments in the paper test whether this combination improves zero-shot generalization to unseen tasks.

The FLAN recipe — simple and practical

At a high level, FLAN is built by: (1) converting many existing datasets to instruction-style prompts, (2) finetuning a large pretrained decoder-only transformer on the mixture, and (3) evaluating on held-out task clusters so test tasks are truly unseen during instruction tuning.

Key building blocks:

1) A diverse mixture of tasks

The authors aggregated 62 public NLP datasets spanning 12 task clusters (NLI, reading comprehension, translation, sentiment, commonsense, struct-to-text, etc.). Diversity matters: seeing many task families teaches the model a broader instruction-following skill.

A table showing the 12 task clusters and the datasets within each, used for instruction tuning.

Figure 2: The diverse collection of over 60 datasets, grouped into 12 task clusters, used to train FLAN. Blue indicates language understanding tasks; teal indicates language generation tasks.

2) Natural-language instruction templates

For each dataset the team wrote ten different instruction templates that describe the task in natural language (not just a dataset tag). For example, the same NLI example can be phrased in multiple ways: “Does the premise imply the hypothesis?”; “Can we infer X from Y?”; etc. This variety helps prevent the model from overfitting to a single phrasing and trains it to respond to different instruction styles.

An example showing how a single NLI task (premise, hypothesis, target) can be formatted into four different natural language instruction templates.

Figure 3: A single NLI instance expressed in multiple instruction templates. The templates introduce diversity in phrasing, encouraging robust instruction following.
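To make the templating idea concrete, here is a minimal sketch of how a single labeled NLI example could be rendered into several instruction phrasings. The template strings and field names are illustrative stand-ins, not the paper’s exact templates.

```python
import random

# Hypothetical instruction templates for an NLI example (illustrative, not the
# paper's exact wording). {premise} and {hypothesis} are filled in per example.
NLI_TEMPLATES = [
    "Premise: {premise}\nHypothesis: {hypothesis}\nDoes the premise entail the hypothesis? OPTIONS: yes, no",
    "{premise}\nBased on the paragraph above, can we conclude that \"{hypothesis}\"? OPTIONS: yes, no",
    "Determine whether the hypothesis can be inferred from the premise.\nPremise: {premise}\nHypothesis: {hypothesis}\nOPTIONS: yes, no",
]

def render_example(example: dict) -> dict:
    """Turn one raw NLI example into an instruction-formatted (input, target) pair."""
    template = random.choice(NLI_TEMPLATES)  # vary the phrasing across training examples
    return {
        "input": template.format(premise=example["premise"], hypothesis=example["hypothesis"]),
        "target": example["label"],  # the label expressed as text, e.g. "yes" / "no"
    }

print(render_example({"premise": "A man is playing a guitar.",
                      "hypothesis": "A person is making music.",
                      "label": "yes"}))
```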

3) A large pretrained base model and pragmatic finetuning

FLAN is an instruction-tuned version of LaMDA-PT (a decoder-only transformer). The authors finetune the 137B-parameter model on the mixed instruction dataset using packed sequences, the Adafactor optimizer, and examples-proportional mixing with a per-dataset cap (roughly 30k examples) so no single dataset dominates. The whole run takes roughly 30k gradient steps, a modest finetune relative to pretraining.
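As a rough sketch of the mixing idea (a simplification of the paper’s examples-proportional mixing; the dataset names and sizes below are made up, and the cap is only approximately the value used in the paper):

```python
def mixing_weights(dataset_sizes: dict, cap: int = 30_000) -> dict:
    """Examples-proportional mixing with a per-dataset cap (a simplified sketch).

    Each dataset is sampled in proportion to its size, but sizes are clipped at
    `cap` so that very large datasets cannot dominate the finetuning mixture.
    """
    clipped = {name: min(size, cap) for name, size in dataset_sizes.items()}
    total = sum(clipped.values())
    return {name: size / total for name, size in clipped.items()}

# Toy example with made-up dataset sizes.
weights = mixing_weights({"anli": 160_000, "bool_q": 9_000, "wmt16_en_de": 4_500_000})
print(weights)  # large datasets are clipped to the cap before normalizing
```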

4) A rigorous “hold-a-cluster-out” evaluation

To test zero-shot generalization, they held out whole task clusters during instruction tuning. For example, to evaluate NLI zero-shot performance, they trained a FLAN checkpoint that had never seen any NLI datasets during instruction tuning. This conservative split tests whether instruction tuning teaches general instruction-following, not just dataset memorization.
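Here is a minimal sketch of the leave-one-cluster-out protocol. The cluster names and dataset assignments are an illustrative subset, not the paper’s full mapping.

```python
# Illustrative (partial) mapping of task clusters to datasets.
CLUSTERS = {
    "nli": ["anli", "rte", "cb"],
    "reading_comprehension": ["bool_q", "squad", "obqa"],
    "translation": ["wmt16_en_de", "wmt14_en_fr"],
    "sentiment": ["sst2", "imdb", "yelp"],
}

def leave_cluster_out(held_out: str):
    """Return (train_datasets, eval_datasets) for one held-out cluster."""
    train = [d for cluster, datasets in CLUSTERS.items() if cluster != held_out for d in datasets]
    evaluate = CLUSTERS[held_out]
    return train, evaluate

train_sets, eval_sets = leave_cluster_out("nli")
# Instruction-tune on `train_sets`, then measure zero-shot performance on `eval_sets`.
```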

You can see the overall pipeline in Figure 4 below—train on many instruction-formatted tasks, then evaluate on an unseen task cluster.

A diagram showing the instruction tuning process. The model is finetuned on a mixture of tasks, then evaluated on an unseen task type. A bar chart below shows FLAN’s zero-shot performance outperforming GPT-3’s zero-shot and few-shot performance on several unseen task types.

Figure 4: Overview of instruction tuning: finetune on a diverse instruction-formatted mix, then evaluate on a completely held-out task cluster. The bar chart highlights gains on several unseen task types.

Does instruction tuning work? — The headline results

Yes. The instruction-tuned FLAN model gives large zero-shot gains over the untuned base model and outperforms GPT-3 (175B) zero-shot on the majority of evaluated benchmarks.

The paper reports results across many tasks, but the most striking comparisons are on groups such as:

  • Natural Language Inference (NLI): large gains relative to the untuned model and GPT-3 zero-shot.
  • Reading comprehension and closed-book QA: FLAN improves zero-shot question answering and reading comprehension without examples.
  • Translation: FLAN outperforms GPT-3 zero-shot on many language-pair tasks (especially into English).

The scatter plot below summarizes zero-shot performance differences across several representative datasets.

A scatter plot comparing the zero-shot performance of FLAN, LaMDA-PT, GPT-3, and GLaM across four categories of tasks: Natural language inference, Reading comprehension, Closed-book QA, and Translation.

Figure 5: Zero-shot performance comparison. FLAN (blue stars) substantially improves over the untuned base (blue circles) and often beats GPT-3 (yellow) in zero-shot evaluation.

Two practical points about evaluation:

  • For each dataset, the authors report the mean performance across all of that dataset’s instruction templates, which reduces sensitivity to prompt wording; a small scoring sketch follows this list.
  • They also report performance under the best dev-set template as an upper bound, mimicking prior work that tuned prompts on a small dev set.
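To make the averaging concrete, here is a minimal sketch of template-averaged scoring. The scorer, template names, and numbers are placeholders for illustration.

```python
from statistics import mean

def evaluate_with_templates(model_score, dataset, templates):
    """Score a dataset under every instruction template and aggregate the results.

    `model_score(dataset, template)` stands in for whatever metric you compute
    (accuracy, BLEU, ...) after rendering the dataset with one template.
    """
    per_template = {t: model_score(dataset, t) for t in templates}
    return {
        "mean_over_templates": mean(per_template.values()),          # headline number
        "best_template": max(per_template, key=per_template.get),     # optimistic upper bound
        "per_template": per_template,
    }

# Toy usage with a dummy scorer that returns made-up accuracies.
dummy_scores = {"t1": 0.71, "t2": 0.68, "t3": 0.74}
result = evaluate_with_templates(lambda ds, t: dummy_scores[t], "rte", ["t1", "t2", "t3"])
print(result["mean_over_templates"], result["best_template"])
```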

What drives the gains? Ablations and deeper analysis

The authors ran several careful ablations to unpack why instruction tuning helps. Three findings are particularly important.

1) Task diversity matters: more clusters → better zero-shot

When they progressively added task clusters to the instruction-tuning mix, performance on held-out clusters generally rose. Seeing more task families during finetuning teaches a more general instruction-following capability, rather than idiosyncratic behavior tuned to a small collection of tasks.

A line chart showing that as more task clusters are used for instruction tuning, the performance on held-out clusters (like NLI and Commonsense) increases.

Figure 6: As instruction tuning incorporates more diverse clusters, zero-shot performance on held-out clusters improves. Diversity of training tasks is a key ingredient.

This tells us that instruction tuning benefits from breadth: adding more kinds of tasks continues to improve generalization.

2) Scale is crucial: instruction tuning helps only for large models

A surprising and important finding is that instruction tuning doesn’t uniformly help at all model sizes. The authors tested models from ~0.4B up to 137B parameters:

  • Small models (≤ 8B): finetuning on many instructions tended to hurt held-out performance. The hypothesis is that small models use their limited capacity to memorize the finetuning mixture without learning the generalizable instruction-following skill.
  • Large models (68B and 137B): instruction tuning provided large, consistent gains on unseen tasks.

A line chart showing the effect of instruction tuning across different model sizes. For small models, instruction tuning hurts performance on held-out tasks, but for large models, it provides a massive boost.

Figure 7: The benefit of instruction tuning emerges with scale. Small models may lose generalization when overloaded with many finetuning tasks; very large models can both absorb the task mixture and learn transferable instruction-following skills.

This result underscores an important practical point: instruction tuning is most effective when applied to very large models that have the capacity to learn both task-specific patterns and a higher-level instruction-following behavior.

3) It’s the natural-language instructions—not just multi-task finetuning

One might suspect the gains come merely from multi-task finetuning (learning to do many tasks), not from training with textual instructions per se. The authors tested two ablations:

  • No-template finetuning: feed only input-output pairs (no instruction text).
  • Dataset-name finetuning: prepend a brief dataset/task label (e.g., “[Translation: WMT’14]”) to the input.

Both ablations performed substantially worse than full instruction tuning where the finetuning examples are described with natural language instructions. Put another way, learning to map from English instructions to the desired behavior is critical to generalization.
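For intuition, here is roughly how a single translation example might be formatted under each condition. The exact strings are my own illustration, not the paper’s templates.

```python
example = {"source": "The weather is nice today.", "target": "Das Wetter ist heute schön."}

# 1) No template: the model only ever sees raw input/output pairs.
no_template_input = example["source"]

# 2) Dataset name: the input is tagged with the task/dataset, but the task is not described.
dataset_name_input = "[Translation: WMT'16 English-German] " + example["source"]

# 3) Full instruction (FLAN-style): the task is described in natural language.
instruction_input = "Translate the following sentence into German:\n" + example["source"]

# In all three conditions the target is the same; only the input framing differs.
```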

A bar chart comparing FLAN (FT: instruction; Eval: instruction) to models finetuned without instructions or with just dataset names. FLAN’s performance is substantially higher.

Figure 8: Finetuning with explicit natural-language instructions is crucial. Multi-task finetuning without instructions or with only dataset tags yields poorer zero-shot performance.

This confirms that the model is learning an instruction-to-behavior mapping, not merely memorizing supervised examples.

Instruction tuning plays nicely with other methods

Instruction tuning is not a competitor to few-shot prompting or prompt tuning—it’s complementary.

Few-shot + FLAN

When few-shot exemplars are added at inference time in FLAN’s instruction format, performance improves further across task clusters. Few-shot examples help especially when the output format or space is complex (e.g., structured generation, translation).
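As a rough illustration, a few-shot prompt for an instruction-tuned model simply concatenates a handful of solved, instruction-formatted exemplars before the query. The instruction wording and helper function below are my own sketch, not a paper template.

```python
def few_shot_prompt(instruction: str, exemplars: list, query: str) -> str:
    """Build a few-shot prompt in an instruction style: solved examples, then the query."""
    parts = []
    for x, y in exemplars:
        parts.append(f"{instruction}\n{x}\nAnswer: {y}")
    parts.append(f"{instruction}\n{query}\nAnswer:")
    return "\n\n".join(parts)

prompt = few_shot_prompt(
    "Is the following review positive or negative?",
    [("Great food and friendly staff.", "positive"),
     ("The battery died after two days.", "negative")],
    "The plot was predictable but the acting was superb.",
)
print(prompt)
```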

A bar chart comparing Zero-shot FLAN and Few-shot FLAN. Adding a few examples at inference time improves performance across all task clusters.

Figure 9: Adding a few in-context exemplars on top of FLAN’s instruction format improves performance, indicating that instruction tuning and few-shot prompting are complementary.

Prompt tuning on top of FLAN

Prompt tuning is a parameter-efficient technique where a small continuous “soft prompt” is optimized while the main model’s weights are frozen. FLAN provides a stronger base for prompt tuning: with FLAN as the base model, prompt tuning achieves markedly higher performance than when applied to the untuned pretrained model—especially in low-data settings.
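For readers unfamiliar with the mechanism, here is a minimal PyTorch-style sketch of the soft-prompt idea, under my own assumptions rather than the paper’s implementation: a small matrix of prompt embeddings is learned and prepended to the frozen model’s input embeddings.

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Learnable prompt embeddings prepended to the input; the LM itself stays frozen."""

    def __init__(self, prompt_length: int, embed_dim: int):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(prompt_length, embed_dim) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, embed_dim) token embeddings from the frozen LM.
        batch = input_embeds.shape[0]
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, input_embeds], dim=1)

# During training, only the soft prompt receives gradients:
#   for p in frozen_lm.parameters(): p.requires_grad_(False)
#   optimizer = torch.optim.Adam(soft_prompt.parameters(), lr=1e-3)
```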

A bar chart showing that an instruction-tuned model achieves much higher performance after prompt tuning compared to an untuned model, both in low-data and full-dataset settings.

Figure 10: Instruction-tuned models are more amenable to prompt tuning. When fine-tuning only a small continuous prompt, FLAN yields much better results than the untuned base, especially with few examples.

This suggests instruction tuning creates a checkpoint that is easier to steer both with text prompts and with learned soft prompts.

Practical takeaways

Here are concise lessons from the paper for practitioners and researchers:

  • Natural-language instructions are a powerful interface. Training (finetuning) models to respond to instructions improves zero-shot behavior more than multi-task finetuning without instructions.
  • Diversity in finetuning tasks helps. The more varied the instruction-formatted tasks the model sees, the better it generalizes to unseen tasks.
  • Scale matters. Instruction tuning’s benefits are emergent at large scale: big models can absorb both task-specific detail and the general skill of following instructions.
  • Instruction tuning complements few-shot prompting and prompt tuning. It produces a foundation model that is both better at zero-shot tasks and easier to adapt with few examples or soft prompts.
  • Instruction tuning isn’t a silver bullet for everything. Tasks already phrased like the pretraining objective (e.g., commonsense or coreference tasks formatted as sentence continuation) may not see gains, because the instruction adds little information beyond what raw language modeling already provides.

Short list of limitations and open questions

The paper is careful and methodical, but there are caveats worth noting:

  • The finetuned model (FLAN 137B) is large and costly to serve; instruction tuning is easiest to justify for large, shared checkpoints.
  • The evaluation uses “hold-a-cluster-out” splits, which is conservative, but there remains the possibility of pretraining overlap for some datasets. The authors perform contamination analyses to address that concern.
  • Template creation was manual. Can we automate generation of high-quality, diverse instruction templates at scale?
  • The instructions used in the paper are mostly short, single-sentence directions. There is room to explore longer, multi-step, or hierarchical instructions closer to what a human might actually write.
  • The beneficial behavior is scale-dependent. Understanding how instruction-following emerges with size is an open research area.

Final thoughts

“Finetuned Language Models Are Zero-Shot Learners” shows a simple and effective idea: teach a model to follow natural-language instructions by finetuning on many tasks phrased as instructions. For very large models, this yields substantial zero-shot gains and makes the model more adaptable to downstream steering (few-shot prompts, prompt tuning).

The takeaway for engineers and researchers is practical: if you want a single, generalist model that you can “tell” what to do in English, instruction tuning is a highly effective approach—provided you have a model with enough capacity. As LLMs continue to scale and as we collect more diverse instruction-style supervision, instruction tuning looks like a foundational tool for building more usable, general-purpose language systems.

If you want to dive deeper, the paper provides many experimental details, dataset templates, and ablation data that are useful starting points for reproducing and extending the work.