Modern deep learning models are astonishingly powerful, achieving superhuman performance in tasks ranging from image recognition to language translation. Yet they share one major weakness: an insatiable hunger for data. Training these systems requires massive, carefully curated datasets, which are often expensive, time-consuming, or even impossible to obtain. What if you want to develop a model for a rare disease, a niche product, or specialized legal documents? Gathering thousands of labeled samples may simply not be feasible.
Humans, on the other hand, are remarkably efficient learners. Show a child a single picture of a zebra, and they can usually recognize zebras for life. We learn quickly from limited examples by drawing on prior knowledge and experience. This ability to generalize from only a few examples inspires a burgeoning research frontier in AI: Few-Shot Learning (FSL).
Few-Shot Learning aims to design models that can learn new tasks using only a handful of training examples—sometimes just one (one-shot learning) or even none (zero-shot learning). If successful, FSL could redefine artificial intelligence by making it more adaptable, less data-dependent, and vastly more efficient.
In this article, we’ll unpack the fundamental ideas behind Few-Shot Learning using the survey “Learning From Few Examples: A Summary of Approaches to Few-Shot Learning” by Archit Parnami and Minwoo Lee as our guide. We’ll explore meta-learning, the idea of “learning to learn,” and walk through its three major families: metric-based, optimization-based, and model-based approaches.
Background: Learning How to Learn with Meta-Learning
Before diving into the specific FSL methods, we need to understand their backbone: Meta-Learning, often described as “learning to learn.”
Imagine you can ride a bicycle. When trying to ride a motorcycle for the first time, you don’t start from scratch—you already know about balance, steering, and braking. Your prior experience helps you learn the new, related skill faster. Similarly, a meta-learning system gains broad experience across many tasks so it can quickly adapt to new ones.
A conventional neural network is trained to solve one problem—for instance, classifying cats versus dogs—using a single, large dataset. A meta-learner, in contrast, is trained on a distribution of tasks. Each task is a miniature supervised learning problem with its own training and testing examples. Through exposure to many different tasks, the meta-learner develops a general strategy for learning that can be applied to unseen tasks.
This differs from related concepts like:
- Transfer Learning: A model is trained on a large source dataset and fine-tuned on a smaller target dataset. It transfers knowledge from one domain to a similar domain.
- Multi-Task Learning: The model learns several related tasks simultaneously to enhance joint performance.
- Meta-Learning: The model learns how to learn across many tasks so that it can adapt to entirely new ones.
The Meta-Learning Problem Setup
In ordinary supervised learning, we train a model \( f(x; \theta) \) on a dataset \( \mathcal{D}^{train} \) to minimize a loss function:
\[ \theta = \arg\min_{\theta} \sum_{(x, y) \in \mathcal{D}^{train}} \mathcal{L}(f(x; \theta), y) \]
In meta-learning, we no longer deal with a single dataset but rather a distribution of tasks \( p(\mathcal{T}) \). Each task \( \mathcal{T}_i \) comes with its own tiny training set \( \mathcal{D}_i^{train} \) and test set \( \mathcal{D}_i^{test} \). The goal is to learn parameters \( \theta \) that yield good performance when adapted to new tasks sampled from this distribution. The objective becomes:
\[ \theta = \arg\min_{\theta} \sum_{\mathcal{D}_i \in \mathcal{D}_{\text{meta-train}}} \sum_{(x, y) \in \mathcal{D}_i^{test}} \mathcal{L}(f(\mathcal{D}_i^{train}, x; \theta), y) \]
During training, the meta-learner sees many small classification tasks (e.g., “cat vs. dog,” “bird vs. otter”) and learns a general strategy for classification. During testing, it uses that strategy to solve unseen tasks (e.g., “flower vs. bicycle”) after seeing very few examples.

Figure: Example of a meta-learning setup with distinct tasks for training and testing.
In Few-Shot Learning, we often refer to a task’s tiny training set as the support set and its test set as the query set.
The Landscape of Few-Shot Learning
Armed with an understanding of meta-learning, we can explore how it powers Few-Shot Learning. Approaches to FSL can be grouped into two main categories: Meta-Learning-based and Non-Meta-Learning-based methods.

Figure: Hierarchical taxonomy of Few-Shot Learning methods.
Most research in FSL uses the M-way K-shot classification setup:
- M: Number of classes per task (e.g., 5-way classification).
- K: Number of examples per class in the support set (e.g., 1-shot or 5-shot learning).
Each task provides only \( M \times K \) labeled examples; the model must classify new samples in the query set using this limited information.
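To make the M-way K-shot setup concrete, here is a minimal sketch of how one episode might be sampled. The `data_by_class` layout, the `n_query` parameter, and the placeholder example names are assumptions made for illustration; they do not come from the survey.

```python
import random

def sample_episode(data_by_class, m_way, k_shot, n_query, rng):
    """Sample one M-way K-shot episode.

    data_by_class maps a class name to a list of examples (hypothetical layout).
    Returns (support, query), each a list of (example, episode_label) pairs.
    """
    classes = rng.sample(sorted(data_by_class), m_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        # Draw K support and n_query query examples with no overlap.
        picked = rng.sample(data_by_class[cls], k_shot + n_query)
        support += [(x, label) for x in picked[:k_shot]]
        query += [(x, label) for x in picked[k_shot:]]
    return support, query

# Toy dataset: 10 classes with 20 placeholder examples each.
toy_data = {f"class_{c}": [f"img_{c}_{i}" for i in range(20)] for c in range(10)}
support_set, query_set = sample_episode(
    toy_data, m_way=5, k_shot=1, n_query=3, rng=random.Random(0)
)
print(len(support_set), len(query_set))  # 5 15
```

The support set holds exactly M × K = 5 labeled examples, and class labels are renumbered 0 to M-1 inside each episode, so the model cannot rely on a fixed global label for any class.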
Meta-Learning-Based FSL: The Three Pillars
The survey organizes meta-learning-based Few-Shot methods into three main families, each distinguished by how it models the probability \( P_{\theta}(y|x) \):
| Approach | Key Idea | Advantages | Disadvantages |
|---|---|---|---|
| Metric-Based | Learn an embedding space and a similarity metric. | Simple, fast inference, widely applicable. | Less adaptive, inference cost grows with support size. |
| Optimization-Based | Learn optimal model initialization or update rules. | Flexible to new tasks, strong generalization. | Requires gradient updates at inference; risk of overfitting. |
| Model-Based | Design architectures with rapid-learning capabilities (e.g., memory). | Fast inference without optimization. | Can be complex and memory-heavy. |
Let’s explore each pillar.
1. Metric-Based Meta-Learning — Learning to Compare
The intuition is direct: if images from the same class are close together in feature space and different classes are far apart, classifying new examples is straightforward. Metric-based methods aim to learn:
- An embedding function \( g(x; \theta_1) \) that maps raw inputs into a feature space.
- A distance function \( d_{\theta_2}(x_1, x_2) \) that measures similarity.
Training proceeds episodically:
- Sample an M-way K-shot episode.
- Embed all images using \( g(x) \).
- Compute distances between query and support embeddings.
- Evaluate classification loss.
- Update parameters and repeat.

Figure: Pipeline for metric-based meta-learning.
Siamese Networks
One of the earliest metric-learning architectures, Siamese Networks, uses two identical convolutional networks with shared weights to compare pairs of images. The network outputs the probability that the two inputs belong to the same class.

Figure: Convolutional Siamese Network used for one-shot classification.
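The pairwise comparison can be sketched in a few lines. This is a toy illustration, not the published architecture: a single tanh layer stands in for the convolutional network, the probability comes from a sigmoid over a weighted L1 distance between the two embeddings, and the parameters `W`, `ALPHA`, and `BIAS` are hand-picked, hypothetical values rather than trained ones.

```python
import math

def embed(x, weights):
    """Shared embedding: a single tanh layer standing in for the CNN."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]

def same_class_probability(x1, x2, weights, alpha, bias):
    """Embed both inputs with the SAME weights, then squash a weighted
    L1 distance between the embeddings into a probability."""
    h1, h2 = embed(x1, weights), embed(x2, weights)
    score = bias + sum(a * abs(u - v) for a, u, v in zip(alpha, h1, h2))
    return 1.0 / (1.0 + math.exp(-score))

def one_shot_classify(query, support, weights, alpha, bias):
    """Return the label of the support example most likely to match the query."""
    best = max(
        support,
        key=lambda pair: same_class_probability(query, pair[0], weights, alpha, bias),
    )
    return best[1]

# Hypothetical, hand-picked parameters (a trained network would learn these).
W = [[1.0, 0.0], [0.0, 1.0]]
ALPHA = [-4.0, -4.0]  # negative weights: larger distance -> lower probability
BIAS = 2.0

support = [([1.0, 0.0], "zebra"), ([-1.0, 0.8], "horse")]
query = [0.9, 0.1]
p_zebra = same_class_probability(query, support[0][0], W, ALPHA, BIAS)
p_horse = same_class_probability(query, support[1][0], W, ALPHA, BIAS)
prediction = one_shot_classify(query, support, W, ALPHA, BIAS)
print(prediction, round(p_zebra, 3), round(p_horse, 3))
```

Because both inputs go through the same `embed` function, the score is symmetric in its two arguments, which is the point of sharing weights between the twin networks.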
Matching Networks
Matching Networks introduce an attention-based comparison. Each query is compared to all support images using cosine similarity; the query’s label is predicted as a weighted sum of the support labels. This approach dynamically adapts embeddings based on context.

Figure: Matching Networks with attention-based similarity.
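As a rough sketch of the prediction rule described above, the snippet below turns cosine similarities into attention weights with a softmax and sums the support labels under those weights. It assumes the embeddings are already computed and leaves out the context-dependent embedding step; the function names and toy vectors are illustrative only.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def matching_predict(query_emb, support_embs, support_labels, n_classes):
    """Class probabilities as an attention-weighted sum of support labels."""
    sims = [cosine(query_emb, s) for s in support_embs]
    # Softmax over similarities gives one attention weight per support example.
    top = max(sims)
    exps = [math.exp(s - top) for s in sims]
    total = sum(exps)
    attention = [e / total for e in exps]
    # Adding each weight to its label's slot equals summing one-hot labels.
    probs = [0.0] * n_classes
    for weight, label in zip(attention, support_labels):
        probs[label] += weight
    return probs

support_embs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # toy, precomputed embeddings
support_labels = [0, 0, 1]
probs = matching_predict([0.8, 0.2], support_embs, support_labels, n_classes=2)
predicted = max(range(len(probs)), key=probs.__getitem__)
print(predicted, [round(p, 3) for p in probs])
```

Every support example contributes to the prediction, so the cost of classifying one query grows with the size of the support set.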
Prototypical Networks
Instead of comparing a query to every support sample, Prototypical Networks compute one “prototype” per class—the average of embedded support examples:
\[ \mathbf{v}_c = \frac{1}{|S^c|} \sum_{(x_k, y_k) \in S^c} g_{\theta_1}(x_k) \]
A query is classified based on its distance to these prototypes.

Figure: Prototypical Networks forming compact class prototypes.
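The prototype formula translates almost directly into code. The sketch below assumes the embeddings \( g_{\theta_1}(x_k) \) are already computed, and it uses a softmax over negative squared Euclidean distances, which is one common choice of distance rather than the only possible one. The toy vectors are made up for illustration.

```python
import math

def prototypes(support_embs, support_labels):
    """Average the embedded support examples of each class (v_c in the text)."""
    grouped = {}
    for emb, label in zip(support_embs, support_labels):
        grouped.setdefault(label, []).append(emb)
    return {
        label: [sum(dim) / len(embs) for dim in zip(*embs)]
        for label, embs in grouped.items()
    }

def classify(query_emb, protos):
    """Softmax over negative squared Euclidean distances to each prototype."""
    logits = {
        c: -sum((q - p) ** 2 for q, p in zip(query_emb, v))
        for c, v in protos.items()
    }
    top = max(logits.values())
    exps = {c: math.exp(value - top) for c, value in logits.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}

# Toy 2-way 2-shot episode with precomputed 2-D embeddings.
support_embs = [[0.0, 0.0], [0.0, 2.0], [4.0, 4.0], [6.0, 4.0]]
support_labels = [0, 0, 1, 1]
protos = prototypes(support_embs, support_labels)
probs = classify([1.0, 1.0], protos)
print(protos)  # {0: [0.0, 1.0], 1: [5.0, 4.0]}
print(max(probs, key=probs.get))  # 0
```

Compared with matching every query against every support example, the query is only compared with one vector per class, no matter how many shots each class has.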
Relation Networks
Relation Networks replace manually defined metrics with a trainable “relation” module—a CNN that learns to output similarity scores given pairs of embeddings.

Figure: Relation Networks learn the similarity function directly.
Advanced extensions like TADAM, TapNet, and CTM adapt embeddings and metrics to specific tasks, improving robustness and accuracy.
2. Optimization-Based Meta-Learning — Learning to Optimize
This family focuses on the optimization process rather than the metric. Training from scratch with limited data often leads to overfitting. What if we could learn model parameters that start in a “good place,” requiring only a small adjustment for a new task?
These methods meta-learn either:
- The optimizer itself, or
- The initial model parameters to enable fast adaptation.
LSTM Meta-Learner
Instead of manually designing the optimizer (like SGD), an LSTM learns to perform updates:
\[ \theta_{i+1} = g_i(\nabla f(\theta_i), \theta_i; \phi) \]
The LSTM meta-learner receives gradients from a base learner and outputs parameter updates tailored to few-shot scenarios.

Figure: How an LSTM Meta-Learner updates the base learner parameters.
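To give a feel for what a learned update rule looks like, here is a heavily simplified sketch. Instead of a full LSTM with hidden state, it uses two per-coordinate sigmoid gates whose hypothetical meta-parameters live in a dictionary `phi`: a forget gate that decides how much of the old parameter to keep, and an input gate that acts as a learned step size. The gate parameterization is an assumption made for illustration, not the published model.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def learned_update(theta, grad, phi):
    """One step theta_{i+1} = g(grad, theta; phi), applied per coordinate.

    phi holds hypothetical meta-parameters for two gates:
      forget gate f: how much of the old parameter to keep
      input gate  i: how far to move against the gradient
    """
    new_theta = []
    for t, g in zip(theta, grad):
        f = sigmoid(phi["f_grad"] * g + phi["f_theta"] * t + phi["f_bias"])
        i = sigmoid(phi["i_grad"] * g + phi["i_theta"] * t + phi["i_bias"])
        new_theta.append(f * t - i * g)
    return new_theta

# Gates frozen so that f is almost 1 and i equals 0.1: the rule then
# behaves like plain gradient descent with learning rate 0.1.
phi = {
    "f_grad": 0.0, "f_theta": 0.0, "f_bias": 20.0,
    "i_grad": 0.0, "i_theta": 0.0, "i_bias": math.log(0.1 / 0.9),
}

theta = [1.0, -2.0]
sgd = list(theta)
for _ in range(20):
    grad = [2.0 * t for t in theta]         # gradient of sum(t^2)
    theta = learned_update(theta, grad, phi)
    sgd = [s - 0.1 * 2.0 * s for s in sgd]  # reference: ordinary SGD
print(theta, sgd)
```

With the gates frozen this way the learned rule matches ordinary SGD almost exactly. Meta-training `phi` is what would let the gates depart from this baseline, for example by shrinking the step size when gradients are large.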
Model-Agnostic Meta-Learning (MAML)
MAML seeks a set of initial weights \( \theta \) such that, for a new task, taking just one or a few gradient steps leads to high performance. It operates in nested loops:
- Inner loop: Fine-tune \( \theta \) on each task’s support set to obtain \( \theta^* \).
- Outer loop: Update \( \theta \) by minimizing error on query sets across tasks.

Figure: MAML finds an initialization enabling rapid adaptation across diverse tasks.
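The two loops are easiest to see on a toy problem. The sketch below assumes a one-parameter model \( f(x; \theta) = \theta x \), a mean-squared-error loss, and a hand-made family of tasks whose targets are \( y = a x \) for different slopes \( a \). With a single scalar parameter the meta-gradient through one inner step can be written out exactly by the chain rule. All names and task values are invented for illustration.

```python
def loss(theta, data):
    """Mean squared error of the model f(x; theta) = theta * x."""
    return sum((theta * x - y) ** 2 for x, y in data) / len(data)

def grad(theta, data):
    """Derivative of the loss with respect to theta."""
    return sum(2 * x * (theta * x - y) for x, y in data) / len(data)

def curvature(data):
    """Second derivative of the loss (constant for this model)."""
    return sum(2 * x * x for x, _ in data) / len(data)

def adapt(theta, support, inner_lr):
    """Inner loop: one gradient step on the support set."""
    return theta - inner_lr * grad(theta, support)

def maml_step(theta, tasks, inner_lr, outer_lr):
    """Outer loop: update the initialization using post-adaptation query loss."""
    meta_grad = 0.0
    for support, query in tasks:
        adapted = adapt(theta, support, inner_lr)
        # Chain rule through the inner step:
        # d(adapted)/d(theta) = 1 - inner_lr * curvature(support)
        meta_grad += grad(adapted, query) * (1.0 - inner_lr * curvature(support))
    return theta - outer_lr * meta_grad / len(tasks)

def make_task(slope):
    """Hypothetical task: fit y = slope * x from two support points."""
    support = [(x, slope * x) for x in (-1.0, 1.0)]
    query = [(x, slope * x) for x in (-2.0, 0.5, 2.0)]
    return support, query

tasks = [make_task(s) for s in (1.0, 2.0, 3.0)]

def post_adaptation_loss(theta):
    return sum(loss(adapt(theta, s, 0.1), q) for s, q in tasks) / len(tasks)

theta = 0.0
before = post_adaptation_loss(theta)
for _ in range(200):
    theta = maml_step(theta, tasks, inner_lr=0.1, outer_lr=0.05)
after = post_adaptation_loss(theta)
print(round(theta, 4), round(before, 4), round(after, 4))
```

In this toy family the learned initialization settles at the average slope, the point from which one gradient step does the most good across all three tasks. Real models need automatic differentiation to compute the same second-order term, which is a large part of MAML's computational cost.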
Variants of MAML expand its capabilities:
- HSML (Hierarchically Structured Meta-Learning): Learns different initializations for task clusters, rather than one global parameter set.
- MTL (Meta-Transfer Learning): Uses a pretrained feature extractor and meta-learns lightweight adaptation parameters.
- LEO (Latent Embedding Optimization): Performs optimization in a low-dimensional latent space, improving speed and stability.

Figure: Hierarchical task-specific initialization with HSML.
3. Model-Based Meta-Learning — Learning with Memory
Model-based methods introduce architectures designed for instant adaptation, often by including external memory modules to store and retrieve task information.
Memory-Augmented Neural Networks (MANN)
MANNs use a Neural Turing Machine, coupling a neural controller with an external memory matrix. The controller learns to read and write memory entries via attention, enabling quick updates without retraining.

Figure: Memory interaction in MANNs enhances rapid few-shot learning.
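Reading from memory via attention can be sketched as content-based addressing: compare a key from the controller with every memory row, turn the similarities into weights, and return the weighted sum of rows. The `sharpness` parameter and the toy memory contents below are assumptions for illustration, and the write mechanism is left out entirely.

```python
import math

def cosine(u, v, eps=1e-8):
    """Cosine similarity, with a small eps to avoid division by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)

def read_memory(memory, key, sharpness=10.0):
    """Return (read_vector, weights) for a content-based read.

    memory is a list of rows; key is the controller's query vector.
    sharpness is a hypothetical scale: larger values focus on fewer rows.
    """
    sims = [cosine(key, row) for row in memory]
    top = max(sims)
    exps = [math.exp(sharpness * (s - top)) for s in sims]
    total = sum(exps)
    weights = [e / total for e in exps]
    width = len(memory[0])
    read = [sum(w * row[j] for w, row in zip(weights, memory)) for j in range(width)]
    return read, weights

memory = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
read_vector, weights = read_memory(memory, key=[0.9, 0.1, 0.0])
print([round(w, 4) for w in weights])
```

Because the read is a soft, differentiable lookup, new information written into memory becomes usable on the very next step, without any gradient update to the controller's weights.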
Meta Networks
MetaNet introduces fast and slow weights. Fast weights are generated by a meta-network in response to a task and combined with slow, gradient-trained weights for prediction—allowing instantaneous adaptation.

Figure: MetaNet merges task-specific fast weights with globally learned slow weights.
SNAIL
SNAIL (Simple Neural Attentive Meta-Learner) treats meta-learning as a sequence problem. It processes example-label pairs using temporal convolutions (to aggregate past information) and causal attention (to retrieve relevant data), predicting the label of a new query example.

Figure: SNAIL integrates attention and temporal layers for sequence-based meta-learning.
Non-Meta-Learning Approaches: The Rise of Simplicity
Despite the sophistication of meta-learning, recent findings show that simple transfer learning approaches can perform surprisingly well.
The formula is straightforward:
- Pretrain a deep network such as ResNet on a large dataset of base classes.
- Extract embeddings for new classes using the pretrained network as a fixed feature extractor.
- Classify new samples with a simple nearest-neighbor classifier using Euclidean or cosine distance.
Methods like SimpleShot and Meta-Baseline demonstrate that strong, generic embeddings alone can rival specialized meta-learning techniques. Sometimes, a good representation is all you need.
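The last step of this recipe fits in a few lines. The sketch below assumes the feature vectors have already been extracted by a frozen, pretrained network (the toy vectors stand in for those features) and classifies a query by its nearest support example under cosine similarity. It illustrates the general recipe and is not the exact procedure of SimpleShot or Meta-Baseline.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(a * a for a in v)) or 1.0
    return [a / norm for a in v]

def nearest_neighbor_label(query_feat, support_feats, support_labels):
    """Label of the support example with the highest cosine similarity."""
    q = l2_normalize(query_feat)
    sims = [
        sum(a * b for a, b in zip(q, l2_normalize(s)))
        for s in support_feats
    ]
    best = max(range(len(sims)), key=sims.__getitem__)
    return support_labels[best]

# Toy features standing in for the output of a frozen, pretrained network.
support_feats = [[2.0, 0.1], [0.2, 3.0]]
support_labels = ["zebra", "horse"]
label = nearest_neighbor_label([5.0, 1.0], support_feats, support_labels)
print(label)  # zebra
```

No parameters are updated when a new task arrives; all the work is done by the quality of the pretrained features.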
Progress in Few-Shot Learning
Few-Shot Learning has made remarkable progress. Accuracy on the miniImageNet benchmark has climbed from around 43% for early models in 2017 to more than 78% for recent systems.

Figure: Rapid progress in Few-Shot Learning between 2017 and 2020.
Performance varies across methods, but intriguingly, no single category dominates. Metric-based, optimization-based, hybrid, and even non-meta-learning methods all achieve competitive results.

Figure: Accuracy comparison of various Few-Shot Learning methods.
Challenges and The Road Ahead
Despite impressive progress, Few-Shot Learning still faces significant hurdles:
- Rigid M-way K-shot Setup: Most models train and test with fixed numbers of classes (M) and samples (K). Real-world scenarios are far less predictable.
- Cross-Domain Generalization: FSL models often fail when applied to tasks from different data domains, such as transferring from natural to medical images.
- Joint Classification of Seen and Unseen Classes: Real-world applications need models that can recognize both previously seen and new categories.
- Beyond Vision: Applying FSL to text, audio, graphs, and other data types requires tackling fundamentally different challenges in representation and learning.
Few-Shot Learning symbolizes a paradigm shift—from models that thrive on massive labeled datasets to systems that learn efficiently from scarce information. By teaching machines how to learn, we move closer to truly adaptive and intelligent AI. The journey continues, but the destination promises an era of data-efficient learning.