How can a child see a single illustration of a zebra and recognize the same animal effortlessly later on? This ability to generalize from one or a handful of examples—known as few-shot learning—comes naturally to humans. For deep learning models, however, it remains an enormous challenge.
Traditional deep neural networks thrive on data abundance. To reach high accuracy, they typically demand thousands of labeled examples per class. That’s fine for popular objects like cats, cars, and trees, but not for rare species, new products, or specialized medical cases, where collecting large datasets is often impossible.
The goal of few-shot learning is to build models that can learn new concepts from a few examples, mimicking human flexibility. Over the years, researchers have proposed approaches based on meta-learning, memory systems, and metric learning. Among these, one particularly elegant idea stands out: learning the comparison itself.
That’s the premise of the paper Learning to Compare: Relation Network for Few-Shot Learning. The authors introduce a simple yet powerful framework called the Relation Network (RN). Rather than only learning good representations, RN also learns the function that compares pairs of examples. This shift yields a model that reported state-of-the-art results on several few-shot and zero-shot benchmarks at the time of publication, while remaining conceptually straightforward.
The Landscape of Few-Shot Learning
To understand the Relation Network’s contribution, let’s first clarify the few-shot setup. A typical task is defined as a C-way K-shot problem:
- A support set contains \( K \) labeled examples for each of \( C \) classes—like 5 classes with 1 image each in a 5-way 1-shot setting.
- A query set contains unlabeled images that must be classified using only the limited information in the support set.
- The goal is to correctly assign each query image to one of these \( C \) classes.
Training a conventional deep network directly on such small data would cause severe overfitting. To address this, modern few-shot methods use meta-learning—learning how to learn. During training, the model is exposed to many simulated few-shot tasks called episodes. Each episode randomly samples several classes and a tiny support/query split from a large dataset. As it solves thousands of these mini-problems, the network gradually learns transferable knowledge about how to handle new, unseen tasks.
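The episode sampling described above can be sketched in plain Python. This is a minimal illustration, not the authors' data pipeline: the function name `sample_episode`, the `n_query` parameter, and the toy dataset of placeholder strings are all assumptions made for the example.

```python
import random

def sample_episode(dataset, n_way, k_shot, n_query, rng=random):
    """Draw one C-way K-shot episode from a dict {class_name: [examples]}."""
    # Pick C distinct classes for this episode.
    classes = rng.sample(sorted(dataset), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        # Draw K support and n_query query examples without overlap.
        picked = rng.sample(dataset[cls], k_shot + n_query)
        support += [(x, label) for x in picked[:k_shot]]
        query += [(x, label) for x in picked[k_shot:]]
    return support, query

# Toy dataset (hypothetical): 20 classes with 10 placeholder "images" each.
toy = {f"class_{c}": [f"img_{c}_{i}" for i in range(10)] for c in range(20)}
support, query = sample_episode(toy, n_way=5, k_shot=1, n_query=3)
print(len(support), len(query))
```

Labels are re-indexed from 0 to C-1 inside each episode, so the model cannot memorize global class identities and has to rely on the support set.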
Previous strategies have approached meta-learning in different ways:
- Learning good initializations: Methods like MAML learn a set of weights that can quickly adapt to new classes after a few gradient steps. However, this still requires fine-tuning at test time.
- Learning embeddings: Approaches such as Prototypical Networks and Siamese Networks learn to map samples into a feature space where items of the same class cluster tightly together. Classification is then done using a fixed metric—usually Euclidean distance.
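To make the fixed-metric idea concrete, here is a small sketch of nearest-class classification with Euclidean distance. The two-dimensional embeddings are made-up values and `nearest_class` is a hypothetical helper, so read it as an illustration of the principle rather than any specific published method.

```python
import math

def nearest_class(query_emb, class_embs):
    """Return the index of the class embedding closest to the query."""
    # The metric is fixed: plain Euclidean distance, nothing is learned here.
    distances = [math.dist(query_emb, c) for c in class_embs]
    return min(range(len(distances)), key=distances.__getitem__)

# Made-up 2-D class embeddings and one query embedding.
class_embs = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
predicted = nearest_class([0.9, 1.2], class_embs)
print(predicted)
```

All of the learning effort goes into producing the embeddings; the comparison step itself has no parameters.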
The Relation Network starts from the embedding approach but asks a simple question: what if the metric itself could be learned?
The Core Idea: Learning a Deep Metric
The key insight is that comparing two images is a complex operation—too complex to be captured by fixed metrics like Euclidean or cosine distance. Instead of relying solely on the embedding to make comparisons trivial, the authors propose to learn the comparison function end-to-end.
The Relation Network is composed of two main modules:
- Embedding Module (\( f_\varphi \)) — a convolutional neural network that extracts feature maps from input images.
- Relation Module (\( g_\phi \)) — a smaller neural network that takes pairs of feature maps and produces a relation score between 0 and 1, indicating the similarity of images.

Figure 1: Relation Network architecture for a 5-way 1-shot task. Each query image is embedded and compared with embeddings of five support images to produce relation scores.
Let’s walk through a 5-way 1-shot example:
- The query image \( x_j \) and five support images \( x_i \) (one for each class) are passed through the shared embedding module \( f_\varphi \).
- Their feature maps \( f_\varphi(x_i) \) and \( f_\varphi(x_j) \) are concatenated depth-wise.
- This combination is fed into the relation module \( g_\phi \), producing a scalar relation score: \[ r_{i,j} = g_{\phi}\big(C(f_{\varphi}(x_i), f_{\varphi}(x_j))\big) \]
- The query image is classified as belonging to the class whose support image yields the highest relation score.
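The walkthrough above can be traced with a toy example. In this sketch the feature maps are flattened to short lists and `toy_relation` is a hand-written placeholder standing in for the trained relation module \( g_\phi \); both are assumptions made purely so the control flow can run without a deep learning library.

```python
import math

def concat(support_feat, query_feat):
    # C(., .): depth-wise concatenation, here applied to flat feature lists.
    return support_feat + query_feat

def toy_relation(pair):
    # Placeholder for g_phi (hypothetical); the real module is a trained CNN.
    half = len(pair) // 2
    gap = sum(abs(a - b) for a, b in zip(pair[:half], pair[half:]))
    return 1.0 / (1.0 + math.exp(gap - 1.0))  # sigmoid keeps scores in (0, 1)

def classify(query_feat, support_feats, relation=toy_relation):
    """Score the query against every support feature and pick the best."""
    scores = [relation(concat(s, query_feat)) for s in support_feats]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

# Five made-up support features (one per class) and one query feature.
support_feats = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0]]
pred, scores = classify([0.9, 0.1], support_feats)
print(pred, [round(s, 3) for s in scores])
```

Swapping `toy_relation` for a trained network changes nothing in `classify`: prediction is always the argmax over the C relation scores.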
From One-Shot to K-Shot
If more examples per class exist (\( K > 1 \)), the model pools them by elementwise summing their embeddings to form a combined class representation. This aggregated feature map represents the “prototype” for that class. The rest of the procedure remains identical: concatenate this representation with the query’s embedding and compute relation scores for classification.
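As a small illustration of the pooling step, the sketch below sums K support embeddings element-wise. The helper name `pool_class_features` and the three-element vectors are assumptions for the example; real feature maps would be multi-dimensional tensors.

```python
def pool_class_features(shot_feats):
    """Element-wise sum of the K support embeddings of one class."""
    return [sum(values) for values in zip(*shot_feats)]

# Three made-up embeddings for one class (K = 3).
shots = [[0.2, 1.0, 0.0], [0.4, 0.8, 0.1], [0.0, 1.2, 0.2]]
pooled = pool_class_features(shots)
print(pooled)
```

Because the pooled result has the same shape as a single embedding, the number of relation scores per query stays at C regardless of K.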
How Relation Networks Are Trained
Like other meta-learning models, RNs use episodic training. The network is optimized so that pairs of inputs from the same class produce high relation scores (close to 1), while mismatched pairs produce low scores (close to 0).
To accomplish this, the authors employ a Mean Squared Error (MSE) loss to regress each score \( r_{i,j} \) toward its ground-truth similarity:

Figure 2: Training objective for the Relation Network. Matching pairs are labeled 1, non-matching pairs 0, and the MSE loss encourages accurate relation scores.
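Written out, the objective can be expressed roughly as follows. This is a reconstruction from the description in the text: the episode sizes \( m \) (support classes) and \( n \) (query images) and the labels \( y_i, y_j \) are notation assumed here.

```latex
\varphi, \phi \leftarrow \operatorname*{argmin}_{\varphi, \phi}
  \sum_{i=1}^{m} \sum_{j=1}^{n}
  \big( r_{i,j} - \mathbf{1}(y_i = y_j) \big)^2
```

The indicator \( \mathbf{1}(y_i = y_j) \) equals 1 for a matching pair and 0 otherwise, so each relation score is pulled toward its binary target.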
This formulation treats relation scoring as a regression problem even though the targets are binary: the network predicts a continuous similarity score for each pair rather than a class label directly.
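A worked example of the loss on a tiny episode may help. The function `relation_mse` and the score values are hypothetical, and the sketch averages over pairs; summing instead would only rescale the loss by a constant.

```python
def relation_mse(scores, support_labels, query_labels):
    """Mean squared error between relation scores and 0/1 match targets.

    scores[i][j] is the relation score of support class i and query j.
    """
    total, count = 0.0, 0
    for i, y_i in enumerate(support_labels):
        for j, y_j in enumerate(query_labels):
            # Target is 1 for a matching pair, 0 for a mismatched pair.
            target = 1.0 if y_i == y_j else 0.0
            total += (scores[i][j] - target) ** 2
            count += 1
    return total / count

# Made-up scores for a 2-way episode with two query images.
scores = [[0.9, 0.2], [0.1, 0.7]]
loss = relation_mse(scores, support_labels=[0, 1], query_labels=[0, 1])
print(loss)
```

Here the squared errors are 0.01, 0.04, 0.01, and 0.09, which average to 0.0375.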
The Architecture: Simple Yet Effective
Under the hood, the Relation Network is refreshingly simple—built entirely from standard convolutional blocks.

Figure 3: Architecture overview. The embedding module has four conv blocks; the relation module adds two conv blocks and two fully connected layers.
- The embedding module (\( f_\varphi \)) uses four convolutional blocks, each with a 64-filter 3×3 convolution, batch normalization, and ReLU activation. The first two blocks include 2×2 max-pooling.
- The relation module (\( g_\phi \)) has two 3×3 conv blocks (also with max-pooling), followed by two fully connected layers. The last layer uses a Sigmoid activation to output a similarity score between 0 and 1.
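To see how tensor shapes flow through this design, the sketch below traces spatial sizes with simple arithmetic. Several details are assumptions not stated in the text: a 28×28 input, padding of 1 on every convolution, stride-2 pooling, and 64 filters in the relation module's conv blocks. Actual implementations may choose differently.

```python
def conv3x3(size, padding=1):
    # Output side length of a stride-1 3x3 convolution (padding is assumed).
    return size + 2 * padding - 3 + 1

def maxpool2x2(size):
    # Output side length of a 2x2 max-pool with stride 2.
    return size // 2

def embedding_shape(size, channels=64):
    """Four conv blocks; only the first two are followed by pooling."""
    for pooled in (True, True, False, False):
        size = conv3x3(size)
        if pooled:
            size = maxpool2x2(size)
    return channels, size, size

def relation_input_features(emb_shape, channels=64):
    """Concatenate two feature maps, apply two pooled conv blocks, flatten."""
    in_channels, size, _ = emb_shape
    concat_channels = 2 * in_channels  # depth-wise concatenation
    for _ in range(2):
        size = maxpool2x2(conv3x3(size))
    # Flattened size fed to the first fully connected layer.
    return concat_channels, channels * size * size

emb = embedding_shape(28)
concat_channels, fc_in = relation_input_features(emb)
print(emb, concat_channels, fc_in)
```

Under these assumptions each image becomes a 64×7×7 feature map, a concatenated pair has 128 channels, and the fully connected layers receive 64 features.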
Despite its simplicity, this modular design learns remarkably rich relationships when trained properly.
From Few-Shot to Zero-Shot Learning
One of the Relation Network’s most striking features is its natural extension to zero-shot learning (ZSL). In ZSL, there are no images available for unseen classes—instead, each class is defined by a semantic vector describing attributes (e.g., “has stripes,” “is mammal,” etc.).
To adapt, RN replaces the image-based support branch with a separate semantic embedding module (a simple MLP). Meanwhile, the query branch still uses the CNN for visual input.

Figure 4: RN for zero-shot learning. The network learns to compare a semantic description embedding with an image embedding.
The relation score for zero-shot tasks becomes:

Equation defining the relation score for zero-shot learning using heterogeneous embeddings.
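By analogy with the few-shot score, the zero-shot relation score can be written roughly as below. The notation is assumed here: \( v_c \) is the semantic vector of class \( c \), \( f_{\varphi_1} \) the semantic embedding module (the MLP), and \( f_{\varphi_2} \) the image embedding CNN.

```latex
r_{c,j} = g_{\phi}\big(C(f_{\varphi_1}(v_c), f_{\varphi_2}(x_j))\big)
```

The only structural change from the few-shot case is that the two inputs pass through different embedding modules before concatenation.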
The same principle of “learning to compare” unifies few-shot and zero-shot learning under one conceptual framework.
Putting It to the Test: Experimental Results
The authors evaluated RN on multiple benchmark datasets, comparing against leading few-shot and zero-shot methods.
Few-Shot Classification
Omniglot and miniImageNet serve as the standard benchmarks.

Table 1: Omniglot results. RN achieves state-of-the-art accuracy on nearly all tests without fine-tuning.
RN outperforms sophisticated baselines like MAML and memory-augmented networks, and does so using only feed-forward inference—no fine-tuning required.

Table 2: miniImageNet results. RN reaches top performance on 5-way 1-shot and remains competitive for 5-shot.
The results suggest that RN generalizes across benchmarks of very different difficulty, from handwritten characters in Omniglot to natural images in miniImageNet.
Zero-Shot Classification
The paper also benchmarks RN on Animals with Attributes (AwA) and Caltech-UCSD Birds (CUB). These datasets include both traditional and more realistic Generalized Zero-Shot Learning (GZSL) settings, where seen and unseen classes coexist at test time.

Table 3: Conventional ZSL results. RN reaches state-of-the-art accuracy on the fine-grained CUB dataset.
RN delivers competitive results on AwA and superior performance on CUB, supporting its robustness in cross-modal comparison tasks. Under the stricter modern GBU benchmark, it remains among the top-performing methods, especially under realistic GZSL evaluation.
Why Relation Networks Work
The success of RN lies in its ability to jointly learn embeddings and comparisons. Prior metric-learning methods rely on fixed similarity measures (like Euclidean or cosine). These assume that the embedding space alone makes all samples linearly separable—a heavy burden on the feature extractor. RN, by contrast, introduces a learnable, deep, non-linear comparator, allowing richer modeling of relationships between features.

Figure 5: Synthetic visualization. RN learns a complex spiral similarity structure that static metrics fail to represent.
Here, RN captures a non-linear spiral relationship between sample pairs that is beyond the reach of simpler metric-based methods.
Real datasets show similar benefits.

Figure 6: Omniglot visualization. Left: matched (cyan) and mismatched (magenta) samples are tangled in the raw embedding space. Right: after the relation module, they become linearly separable.
By explicitly learning the similarity function, RN transforms a messy embedding space into one where matches and mismatches are easily distinguished.
Conclusion and Key Takeaways
The Relation Network reframes few-shot and zero-shot learning around the idea of learning to compare. By integrating a deep embedding module with a learnable relation module, it achieves remarkable results across benchmarks—without fine-tuning, memory mechanisms, or complex optimization.
Key insights:
- Simplicity pays off: RN uses standard CNN components yet rivals or surpasses complex alternatives.
- Learn the metric, not just the features: Its deep relation module captures complex similarity structures.
- A unified framework: A single model elegantly handles both few-shot and zero-shot learning.
- State-of-the-art effectiveness: RN achieves leading performance while remaining intuitive and efficient.
In an era of increasingly complicated architectures, the Relation Network offers a reminder that sometimes the most powerful solutions stem from a simple idea—learning not just representations, but relationships.