Humans have a remarkable ability to learn new concepts from just one or two examples. See a picture of a toucan once, and you can likely recognize it for life. Deep learning models, on the other hand, are notoriously data-hungry. They often require thousands of examples to achieve similar performance, making them struggle in situations where data is scarce or expensive to collect — such as medical imaging or specialized robotics.
This is the challenge of few-shot learning: how can we enable models to generalize from just a handful of samples, like humans do? A promising field called meta-learning—or “learning to learn”—tackles this problem by training models to leverage prior knowledge gathered from a wide range of related tasks. The idea is that by learning the common structure across many tasks, a model can quickly adapt to a new, unseen task with minimal data.
The quality of this “prior knowledge” is critical. Most meta-learning methods rely on simple, pre-selected priors such as the Gaussian distribution. While effective to some extent, a fixed-shape prior is like a one-size-fits-all tool — not expressive enough to capture the complex patterns needed for challenging, data-starved scenarios.
A recent paper titled “Meta-Learning Universal Priors Using Non-Injective Change of Variables” proposes a groundbreaking solution. Instead of relying on a fixed, off-the-shelf prior, the authors introduce a method to learn a flexible, data-driven prior that adapts its shape to optimally fit the tasks at hand. Their key innovation is a new generative model called the Non-injective Change of Variables (NCoV) model, which is theoretically proven to be a universal approximator of probability distributions. Let’s explore how this works.
Background: The “Learning to Learn” Framework
At its core, meta-learning operates over a collection of different tasks. For each task \( t \), we have a small training dataset \( \mathcal{D}_t^{\mathrm{trn}} \) and a validation dataset \( \mathcal{D}_t^{\mathrm{val}} \). The goal is to learn a shared, task-invariant representation (the prior) that helps a model quickly learn the details of any given task.
This process is naturally framed as a bilevel optimization problem.

Figure 1: Meta-learning formulated as a bilevel optimization — learning shared priors across tasks.
Let’s break it down:
- Inner Level (Task-Level): For each task \( t \), find the best task-specific parameters \( \phi_t^* \). This optimization uses the small training set \( \mathcal{D}_t^{\mathrm{trn}} \) and is guided by the shared prior, represented by the regularization term \( \mathcal{R}(\phi_t; \theta) \).
- Outer Level (Meta-Level): Evaluate the optimized task-specific parameters \( \{\phi_t^*\}_{t=1}^T \) on their respective validation sets \( \{\mathcal{D}_t^{\mathrm{val}}\}_{t=1}^T \), and update the shared prior parameters \( \theta \) so that the learned prior supports fast, accurate adaptation across all tasks.
From a Bayesian perspective, \( \mathcal{L} \) represents the negative log-likelihood (data-fit), whereas \( \mathcal{R} \) captures the negative log-prior \( -\log p(\phi_t; \theta) \). The inner loop thus performs maximum a posteriori (MAP) estimation, where the prior acts as a smart regularizer or initialization — helping prevent overfitting when data is scarce.
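Written out with the notation above, the bilevel problem takes a standard form (the paper's exact equation may differ in minor details):

\[
\min_{\theta} \; \sum_{t=1}^{T} \mathcal{L}\big(\phi_t^{*}(\theta); \mathcal{D}_t^{\mathrm{val}}\big)
\quad \text{s.t.} \quad
\phi_t^{*}(\theta) = \arg\min_{\phi_t} \; \mathcal{L}\big(\phi_t; \mathcal{D}_t^{\mathrm{trn}}\big) + \mathcal{R}(\phi_t; \theta).
\]

The outer minimization tunes the prior \( \theta \); the inner minimization is exactly the MAP estimation described above.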
MAML and the Implicit Gaussian Prior
One of the most influential meta-learning algorithms is Model-Agnostic Meta-Learning (MAML). Instead of using an explicit regularization term \( \mathcal{R} \), MAML learns a single shared initialization \( \phi^0 = \theta \) for all tasks, and performs a few gradient descent steps starting from this common point.

Figure 2: Inner-loop optimization in MAML using K-step gradient descent.
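Concretely, the inner loop runs \( K \) steps of gradient descent from the shared initialization. With a step size \( \alpha \) (the symbol is chosen here for readability), the standard MAML inner update reads:

\[
\phi_t^{(0)} = \theta, \qquad
\phi_t^{(k+1)} = \phi_t^{(k)} - \alpha \, \nabla_{\phi} \, \mathcal{L}\big(\phi_t^{(k)}; \mathcal{D}_t^{\mathrm{trn}}\big), \quad k = 0, \dots, K-1.
\]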
While MAML doesn’t explicitly define a prior, researchers later showed that its update process is approximately equivalent to MAP estimation under a Gaussian prior.

Figure 3: MAML implicitly assumes a Gaussian prior, limiting expressiveness for complex distributions.
Here, the learned initialization \( \phi^0 \) behaves like the mean of an implicit Gaussian prior. This insight exposes a key limitation: Gaussian priors are unimodal and symmetric by nature, which restricts their expressiveness. Real-world parameter distributions could be multi-modal or skewed — and MAML’s implicit Gaussian can fail to capture that complexity.
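In regularizer form, this implicit prior amounts to a quadratic penalty that pulls task parameters toward the shared initialization. Schematically (the precision \( \lambda \) is a placeholder; in the exact analysis it depends on the step size and the number of inner steps):

\[
\mathcal{R}(\phi_t; \theta) \approx \frac{\lambda}{2} \, \lVert \phi_t - \theta \rVert_2^2
\quad \Longleftrightarrow \quad
p(\phi_t; \theta) \approx \mathcal{N}\big(\phi_t; \, \theta, \, \lambda^{-1} \mathbf{I}\big).
\]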
The Core Method: Learning a Universal Prior with NCoV
Instead of guessing the shape of the prior, what if we could learn it from data? The authors propose doing exactly that using a powerful concept from probability theory: the change of variables principle.
A Quick Primer on Normalizing Flows
This principle is central to models known as Normalizing Flows (NFs). These models learn an invertible transformation \( f \) that maps a simple random variable \( \mathbf{Z} \) to a more complex one \( \mathbf{Z}' = f(\mathbf{Z}) \). Because \( f \) is invertible, the probability density of \( \mathbf{Z}' \) can be computed exactly using the change-of-variable rule:

Figure 4: The standard change-of-variable formula used for Normalizing Flows.
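For reference, the standard rule for an invertible, differentiable \( f \) is:

\[
p_{\mathbf{Z}'}(\mathbf{z}') = p_{\mathbf{Z}}\big(f^{-1}(\mathbf{z}')\big) \, \left| \det \frac{\partial f^{-1}(\mathbf{z}')}{\partial \mathbf{z}'} \right|.
\]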
This allows NFs to both evaluate densities and generate samples efficiently. However, NFs are limited by a crucial assumption — invertibility. The transformation must be bijective, meaning every output corresponds to exactly one input. This constraint makes it hard for NFs to model multi-modal distributions or those lying on low-dimensional manifolds (like natural images).
The NCoV Breakthrough: Dropping the Invertibility Constraint
The core breakthrough of the paper is simple yet profound: what if we drop the requirement that \( f \) be invertible?
According to Theorem 3.1 (Multivariate Probability Integral Transform), for any target cumulative distribution function (CDF) \( Q \), there exists a (possibly non-injective) function \( f^* \) that transforms a simple source distribution into a random variable \( \mathbf{Z}' \) with the target distribution \( Q \).

Figure 5: By relaxing invertibility, NCoV can model any target distribution.
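In one dimension this is the classical probability integral transform: if \( F \) is the CDF of the source variable and \( Q^{-1} \) is the (generalized) quantile function of the target, then

\[
f^{*} = Q^{-1} \circ F, \qquad \mathbf{Z}' = f^{*}(\mathbf{Z}) \sim Q.
\]

When \( Q \) has atoms (jumps), \( Q^{-1} \) is flat over the corresponding intervals, making \( f^{*} \) non-injective; invertible flows cannot represent such targets, while a non-injective map handles them naturally. Theorem 3.1 extends this construction to the multivariate case.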
This is transformative. It means NCoV models can represent any distribution, whether multimodal, discrete, or skewed, without structural constraints. The trade-off is that the transformed density no longer has a simple closed form; it instead involves an integral over all pre-images of \( f \):

Figure 6: The density under a non-injective transformation involves an integral over pre-images of \( f \).
Fortunately, in meta-learning, we don’t need this density in closed form — we only need to sample from it and optimize the transformation. This makes non-injectivity a powerful feature rather than a drawback.
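To make the "sampling is enough" point concrete, here is a minimal sketch (function names are illustrative, not the paper's code): drawing samples from the transformed distribution requires only forward passes through \( f \), with no inverse and no Jacobian determinant.

```python
import torch

def sample_transformed(f, num_samples, dim):
    """Sample from the pushforward of a standard Gaussian through f.

    No invertibility is required: we never evaluate f^{-1} or a Jacobian
    determinant, we only push base samples forward through the (possibly
    non-injective) transformation.
    """
    z = torch.randn(num_samples, dim)  # z ~ N(0, I), the simple base prior
    return f(z)                        # z' = f(z) follows the transformed distribution

# A deliberately non-injective toy map: |z| folds the base distribution,
# something a bijective normalizing flow cannot express.
fold = lambda z: torch.abs(z) + 1.0
samples = sample_transformed(fold, num_samples=10_000, dim=2)
print(samples.mean(dim=0))  # roughly sqrt(2/pi) + 1, about 1.80 per coordinate
```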
Universal Approximation with Sylvester NCoVs
To approximate the ideal transformation \( f^* \), the authors use a parametric transformation modeled after the Sylvester flow, which they call a Sylvester NCoV:

Figure 7: The functional form of a single-layer Sylvester NCoV transformation.
Here, \( \mathbf{A} \), \( \mathbf{B} \), and \( \mathbf{c} \) are learnable parameters, and \( \sigma \) is a nonlinear activation (typically sigmoid). Theorem 3.5 in the paper proves that a sufficiently wide Sylvester NCoV can approximate any “well-behaved” target distribution — establishing a universal approximation theorem for distributions.
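As a concrete sketch, a single layer with the widely used Sylvester parameterization \( f(\mathbf{z}) = \mathbf{z} + \mathbf{A}\,\sigma(\mathbf{B}\mathbf{z} + \mathbf{c}) \) could be implemented as follows. This is an illustration under that assumption; the paper's exact layer (for instance, whether it keeps the residual term) may differ, and the class name is ours.

```python
import torch
import torch.nn as nn

class SylvesterNCoVLayer(nn.Module):
    """One Sylvester-style transformation: z' = z + A @ sigmoid(B @ z + c).

    Unlike a Sylvester *flow*, no constraints are imposed on A and B to
    guarantee invertibility, so the map is in general non-injective,
    which is exactly the freedom the NCoV framework exploits.
    """

    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.A = nn.Parameter(0.1 * torch.randn(dim, hidden))  # dim x hidden
        self.B = nn.Parameter(0.1 * torch.randn(hidden, dim))  # hidden x dim
        self.c = nn.Parameter(torch.zeros(hidden))

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z has shape (batch, dim); sigma is the sigmoid activation highlighted by the theory.
        return z + torch.sigmoid(z @ self.B.t() + self.c) @ self.A.t()
```

Increasing the hidden width is the knob that the universal approximation result refers to; stacking several such layers is a natural practical extension.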

Figure 8: Sylvester NCoVs can flexibly transform a simple Gaussian into complex, multi-modal target distributions.
These results highlight the expressive power of non-injective transformations — enabling accurate modeling of distributions that simple priors cannot represent.
The MetaNCoV Algorithm: Bringing NCoV to Meta-Learning
Now, let’s connect NCoV to the meta-learning framework. The resulting method, MetaNCoV, learns a universal, data-driven prior that adapts across tasks.
Instead of optimizing task parameters \( \phi_t \) directly, we introduce latent variables \( \mathbf{z}_t \) drawn from a simple prior \( p_{\mathbf{Z}} = \mathcal{N}(\mathbf{0}, \mathbf{I}) \). The transformation \( f(\mathbf{z}_t; \theta_f) \) generates the model parameters \( \phi_t \). Meta-learning then jointly optimizes the latent variables (inner loop) and transformation parameters \( \theta_f \) (outer loop).

Figure 9: The two-level optimization structure in MetaNCoV — learning latent priors across tasks.
The initialization becomes elegant and automatic: for a Gaussian base distribution, the point of maximum prior density is \( \mathbf{z}_t^0 = \mathbf{0} \), so every task starts its inner-loop adaptation from the origin.

Figure 10: Initialization of latent variables from the mode of the base distribution.
This eliminates the need to explicitly learn a task-invariant initialization, which is the key parameter in MAML. The shared transformation \( f \) now encapsulates all of the transferable prior information.
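Putting the pieces together, here is a minimal sketch of the resulting bilevel loop. It is a first-order approximation with illustrative names; `f.latent_dim`, `loss_fn`, the step sizes, and the explicit Gaussian log-prior term are all assumptions rather than the paper's exact algorithm.

```python
import torch

def meta_ncov_step(f, tasks, loss_fn, inner_steps=5, inner_lr=0.1, outer_lr=1e-3):
    """One meta-update of the shared NCoV transformation f (first-order sketch).

    Inner loop : adapt the task-specific latent z_t, starting from the mode
                 of the base prior, z_t = 0.
    Outer loop : update the transformation parameters theta_f using the
                 validation losses of the adapted tasks.
    """
    outer_loss = 0.0
    for train_set, val_set in tasks:
        # Inner loop: MAP-style adaptation of the task-specific latent variable z_t.
        z = torch.zeros(f.latent_dim, requires_grad=True)
        for _ in range(inner_steps):
            phi = f(z)                                   # decode latent into model parameters
            inner_loss = loss_fn(phi, train_set) \
                + 0.5 * (z ** 2).sum()                   # data fit + Gaussian negative log-prior on z (assumed)
            (grad,) = torch.autograd.grad(inner_loss, z)
            z = (z - inner_lr * grad).detach().requires_grad_(True)

        # Outer objective: evaluate the adapted parameters on validation data.
        outer_loss = outer_loss + loss_fn(f(z), val_set)

    # Outer update of theta_f (first-order: gradients flow through the final f(z) only).
    grads = torch.autograd.grad(outer_loss, list(f.parameters()))
    with torch.no_grad():
        for p, g in zip(f.parameters(), grads):
            p -= outer_lr * g
    return outer_loss.item()
```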
Experiments: Does a Better Prior Boost Performance?
The authors conduct an extensive empirical evaluation across standard few-shot learning benchmarks.
Few-Shot Classification on miniImageNet
MetaNCoV is integrated as a plug-in prior into existing methods like MAML and MetaSGD, and tested on miniImageNet.

Figure 11: MetaNCoV achieves superior performance, especially in 1-shot learning where priors are most critical.
MetaNCoV yields substantial improvements, particularly in 1-shot settings — confirming that expressive priors greatly enhance performance when data is scarce.
Scaling Up: WRN-28-10 and tieredImageNet
Next, the authors evaluate MetaNCoV with a larger Wide ResNet (WRN-28-10) backbone on both the miniImageNet and tieredImageNet datasets.

Figure 12: Consistent improvements with stronger architectures validate the robustness of MetaNCoV.
Even with a high-capacity feature extractor, MetaNCoV continues to provide consistent accuracy gains, demonstrating compatibility across architectures.
Fine-Grained Classification on CUB-200-2011
The model’s ability to capture subtle distinctions was further validated on CUB-200-2011, a fine-grained dataset of bird species.

Figure 13: MetaNCoV’s expressive prior improves learning of subtle features in the fine-grained bird classification task.
MetaNCoV again outperforms competitors, confirming that expressive priors shine in fine-grained, low-data environments.
Ablation Studies and Generalization Across Domains
To verify the theoretical foundations, several ablation studies were conducted.

Figure 14: Design choices validated — non-injective structure and sigmoidal activations are key to expressiveness.
Results show that non-injective NCoVs significantly outperform injective flows, and sigmoidal activations outperform ReLU — aligning perfectly with theoretical predictions.
Cross-Domain Generalization
Finally, MetaNCoV was tested on cross-domain few-shot learning setups. Trained on miniImageNet, the model was evaluated on tieredImageNet, CUB, and Cars datasets.

Figure 15: MetaNCoV demonstrates strong cross-domain transferability, learning priors that generalize beyond training data.
MetaNCoV maintains strong performance across domains — suggesting that the learned, data-driven prior captures fundamental task structures rather than memorizing specifics.
Conclusion: Toward Expressive, Human-Like Learning
The paper “Meta-Learning Universal Priors Using Non-Injective Change of Variables” makes a compelling case that the future of few-shot learning lies in expressive data-driven priors. By discarding the constraint of invertibility imposed on traditional generative models, the proposed NCoV framework learns rich, adaptable priors with universal approximation properties.
The resulting MetaNCoV algorithm achieves state-of-the-art performance across few-shot benchmarks, especially in ultra-low-data environments. More broadly, this work suggests that improving priors — the very foundations of what models “believe” before seeing data — can unlock new levels of sample efficiency and adaptability.
As we move closer to models that learn to learn like humans, flexible universal priors such as MetaNCoV could become key building blocks for the next generation of adaptive AI systems.
