Humans have a remarkable ability to learn new concepts from just one or two examples. See a picture of a toucan once, and you can likely recognize it for life. Deep learning models, on the other hand, are notoriously data-hungry. They often require thousands of examples to achieve similar performance, making them struggle in situations where data is scarce or expensive to collect — such as medical imaging or specialized robotics.
This is the challenge of few-shot learning: how can we enable models to generalize from just a handful of samples, like humans do? A promising field called meta-learning—or “learning to learn”—tackles this problem by training models to leverage prior knowledge gathered from a wide range of related tasks. The idea is that by learning the common structure across many tasks, a model can quickly adapt to a new, unseen task with minimal data.
The quality of this “prior knowledge” is critical. Most meta-learning methods rely on simple, pre-selected priors such as the Gaussian distribution. While effective to some extent, a fixed-shape prior is like a one-size-fits-all tool — not expressive enough to capture the complex patterns needed for challenging, data-starved scenarios.
A recent paper titled “Meta-Learning Universal Priors Using Non-Injective Change of Variables” proposes a groundbreaking solution. Instead of relying on a fixed, off-the-shelf prior, the authors introduce a method to learn a flexible, data-driven prior that adapts its shape to optimally fit the tasks at hand. Their key innovation is a new generative model called the Non-injective Change of Variables (NCoV) model, which is theoretically proven to be a universal approximator of probability distributions. Let’s explore how this works.
Background: The “Learning to Learn” Framework
At its core, meta-learning operates over a collection of different tasks. For each task \( t \), we have a small training dataset \( \mathcal{D}_t^{\mathrm{trn}} \) and a validation dataset \( \mathcal{D}_t^{\mathrm{val}} \). The goal is to learn a shared, task-invariant representation (the prior) that helps a model quickly learn the details of any given task.
This process is naturally framed as a bilevel optimization problem.

Figure 1: Meta-learning formulated as a bilevel optimization — learning shared priors across tasks.
Let’s break it down:
- Inner Level (Task-Level): For each task \( t \), find the best task-specific parameters \( \phi_t^* \). This optimization uses the small training set \( \mathcal{D}_t^{\mathrm{trn}} \) and is guided by the shared prior, represented by the regularization term \( \mathcal{R}(\phi_t; \theta) \).
- Outer Level (Meta-Level): Evaluate the optimized task-specific parameters \( \{\phi_t^*\}_{t=1}^T \) on their respective validation sets \( \{\mathcal{D}_t^{\mathrm{val}}\}_{t=1}^T \), and update the shared prior parameters \( \theta \) so that the learned prior supports fast, accurate adaptation across all tasks.
From a Bayesian perspective, \( \mathcal{L} \) represents the negative log-likelihood (data-fit), whereas \( \mathcal{R} \) captures the negative log-prior \( -\log p(\phi_t; \theta) \). The inner loop thus performs maximum a posteriori (MAP) estimation, where the prior acts as a smart regularizer or initialization — helping prevent overfitting when data is scarce.
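Written out with the notation above, the bilevel problem takes a standard form (the paper's exact equation may differ in minor details):

\[
\min_{\theta} \; \sum_{t=1}^{T} \mathcal{L}\big(\phi_t^{*}(\theta); \mathcal{D}_t^{\mathrm{val}}\big)
\quad \text{s.t.} \quad
\phi_t^{*}(\theta) = \arg\min_{\phi_t} \; \mathcal{L}\big(\phi_t; \mathcal{D}_t^{\mathrm{trn}}\big) + \mathcal{R}(\phi_t; \theta).
\]

The outer minimization tunes the prior \( \theta \); the inner minimization is exactly the MAP estimation described above.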
MAML and the Implicit Gaussian Prior
One of the most influential meta-learning algorithms is Model-Agnostic Meta-Learning (MAML). Instead of using an explicit regularization term \( \mathcal{R} \), MAML learns a single shared initialization \( \phi^0 = \theta \) for all tasks, and performs a few gradient descent steps starting from this common point.

Figure 2: Inner-loop optimization in MAML using K-step gradient descent.
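Concretely, the inner loop runs \( K \) steps of gradient descent from the shared initialization. With a step size \( \alpha \) (the symbol is chosen here for readability), the standard MAML inner update reads:

\[
\phi_t^{(0)} = \theta, \qquad
\phi_t^{(k+1)} = \phi_t^{(k)} - \alpha \, \nabla_{\phi} \, \mathcal{L}\big(\phi_t^{(k)}; \mathcal{D}_t^{\mathrm{trn}}\big), \quad k = 0, \dots, K-1.
\]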
While MAML doesn’t explicitly define a prior, researchers later showed that its update process is approximately equivalent to MAP estimation under a Gaussian prior.

Figure 3: MAML implicitly assumes a Gaussian prior, limiting expressiveness for complex distributions.
Here, the learned initialization \( \phi^0 \) behaves like the mean of an implicit Gaussian prior. This insight exposes a key limitation: Gaussian priors are unimodal and symmetric by nature, which restricts their expressiveness. Real-world parameter distributions could be multi-modal or skewed — and MAML’s implicit Gaussian can fail to capture that complexity.
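In regularizer form, this implicit prior amounts to a quadratic penalty that pulls task parameters toward the shared initialization. Schematically (the precision \( \lambda \) is a placeholder; in the exact analysis it depends on the step size and the number of inner steps):

\[
\mathcal{R}(\phi_t; \theta) \approx \frac{\lambda}{2} \, \lVert \phi_t - \theta \rVert_2^2
\quad \Longleftrightarrow \quad
p(\phi_t; \theta) \approx \mathcal{N}\big(\phi_t; \, \theta, \, \lambda^{-1} \mathbf{I}\big).
\]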
The Core Method: Learning a Universal Prior with NCoV
Instead of guessing the shape of the prior, what if we could learn it from data? The authors propose doing exactly that using a powerful concept from probability theory: the change of variables principle.
A Quick Primer on Normalizing Flows
This principle is central to models known as Normalizing Flows (NFs). These models learn an invertible transformation \( f \) that maps a simple random variable \( \mathbf{Z} \) to a more complex one \( \mathbf{Z}' = f(\mathbf{Z}) \). Because \( f \) is invertible, the probability density of \( \mathbf{Z}' \) can be computed exactly using the change-of-variable rule:

Figure 4: The standard change-of-variable formula used for Normalizing Flows.
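For reference, the standard rule for an invertible, differentiable \( f \) is:

\[
p_{\mathbf{Z}'}(\mathbf{z}') = p_{\mathbf{Z}}\big(f^{-1}(\mathbf{z}')\big) \, \left| \det \frac{\partial f^{-1}(\mathbf{z}')}{\partial \mathbf{z}'} \right|.
\]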
This allows NFs to both evaluate densities and generate samples efficiently. However, NFs are limited by a crucial assumption — invertibility. The transformation must be bijective, meaning every output corresponds to exactly one input. This constraint makes it hard for NFs to model multi-modal distributions or those lying on low-dimensional manifolds (like natural images).
The NCoV Breakthrough: Dropping the Invertibility Constraint
The core breakthrough of the paper is simple yet profound: what if we drop the requirement that \( f \) be invertible?
According to Theorem 3.1 (Multivariate Probability Integral Transform), for any target cumulative distribution function (CDF) \( Q \), there exists a (possibly non-injective) function \( f^* \) that transforms a simple source distribution into a random variable \( \mathbf{Z}' \) with the target distribution \( Q \).

Figure 5: By relaxing invertibility, NCoV can model any target distribution.
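In one dimension this is the classical probability integral transform: if \( F \) is the CDF of the source variable and \( Q^{-1} \) is the (generalized) quantile function of the target, then

\[
f^{*} = Q^{-1} \circ F, \qquad \mathbf{Z}' = f^{*}(\mathbf{Z}) \sim Q.
\]

When \( Q \) has atoms (jumps), \( Q^{-1} \) is flat over the corresponding intervals, making \( f^{*} \) non-injective; invertible flows cannot represent such targets, while a non-injective map handles them naturally. Theorem 3.1 extends this construction to the multivariate case.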
This is transformative. It means NCoV models can represent any distribution, whether multimodal, discrete, or skewed, without structural constraints. The trade-off is that the transformed density no longer has a simple closed form; it instead involves an integral over all pre-images of \( f \):

Figure 6: The density under a non-injective transformation involves an integral over pre-images of \( f \).
Fortunately, in meta-learning, we don’t need this density in closed form — we only need to sample from it and optimize the transformation. This makes non-injectivity a powerful feature rather than a drawback.
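To make the "sampling is enough" point concrete, here is a minimal sketch (function names are illustrative, not the paper's code): drawing samples from the transformed distribution requires only forward passes through \( f \), with no inverse and no Jacobian determinant.

```python
import torch

def sample_transformed(f, num_samples, dim):
    """Sample from the pushforward of a standard Gaussian through f.

    No invertibility is required: we never evaluate f^{-1} or a Jacobian
    determinant, we only push base samples forward through the (possibly
    non-injective) transformation.
    """
    z = torch.randn(num_samples, dim)  # z ~ N(0, I), the simple base prior
    return f(z)                        # z' = f(z) follows the transformed distribution

# A deliberately non-injective toy map: |z| folds the base distribution,
# something a bijective normalizing flow cannot express.
fold = lambda z: torch.abs(z) + 1.0
samples = sample_transformed(fold, num_samples=10_000, dim=2)
print(samples.mean(dim=0))  # roughly sqrt(2/pi) + 1, about 1.80 per coordinate
```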
Universal Approximation with Sylvester NCoVs
To approximate the ideal transformation \( f^* \), the authors use a parametric transformation modeled after the Sylvester flow, which they call a Sylvester NCoV:

Figure 7: The functional form of a single-layer Sylvester NCoV transformation.
Here, \( \mathbf{A} \), \( \mathbf{B} \), and \( \mathbf{c} \) are learnable parameters, and \( \sigma \) is a nonlinear activation (typically sigmoid). Theorem 3.5 in the paper proves that a sufficiently wide Sylvester NCoV can approximate any “well-behaved” target distribution — establishing a universal approximation theorem for distributions.
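As a concrete sketch, a single layer with the widely used Sylvester parameterization \( f(\mathbf{z}) = \mathbf{z} + \mathbf{A}\,\sigma(\mathbf{B}\mathbf{z} + \mathbf{c}) \) could be implemented as follows. This is an illustration under that assumption; the paper's exact layer (for instance, whether it keeps the residual term) may differ, and the class name is ours.

```python
import torch
import torch.nn as nn

class SylvesterNCoVLayer(nn.Module):
    """One Sylvester-style transformation: z' = z + A @ sigmoid(B @ z + c).

    Unlike a Sylvester *flow*, no constraints are imposed on A and B to
    guarantee invertibility, so the map is in general non-injective,
    which is exactly the freedom the NCoV framework exploits.
    """

    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.A = nn.Parameter(0.1 * torch.randn(dim, hidden))  # dim x hidden
        self.B = nn.Parameter(0.1 * torch.randn(hidden, dim))  # hidden x dim
        self.c = nn.Parameter(torch.zeros(hidden))

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z has shape (batch, dim); sigma is the sigmoid activation highlighted by the theory.
        return z + torch.sigmoid(z @ self.B.t() + self.c) @ self.A.t()
```

Increasing the hidden width is the knob that the universal approximation result refers to; stacking several such layers is a natural practical extension.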

Figure 8: Sylvester NCoVs can flexibly transform a simple Gaussian into complex, multi-modal target distributions.
These results highlight the expressive power of non-injective transformations — enabling accurate modeling of distributions that simple priors cannot represent.
The MetaNCoV Algorithm: Bringing NCoV to Meta-Learning
Now, let’s connect NCoV to the meta-learning framework. The resulting method, MetaNCoV, learns a universal, data-driven prior that adapts across tasks.
Instead of optimizing task parameters \( \phi_t \) directly, we introduce latent variables \( \mathbf{z}_t \) drawn from a simple prior \( p_{\mathbf{Z}} = \mathcal{N}(\mathbf{0}, \mathbf{I}) \). The transformation \( f(\mathbf{z}_t; \theta_f) \) generates the model parameters \( \phi_t \). Meta-learning then jointly optimizes the latent variables (inner loop) and transformation parameters \( \theta_f \) (outer loop).

Figure 9: The two-level optimization structure in MetaNCoV — learning latent priors across tasks.
The initialization becomes elegant and automatic: for a Gaussian base distribution, the point of maximum prior density is \( \mathbf{z}_t^0 = \mathbf{0} \), so every task starts its inner-loop adaptation from the origin.

Figure 10: Initialization of latent variables from the mode of the base distribution.
This eliminates the need to explicitly learn a task-invariant initialization, which is the key parameter in MAML. The shared transformation \( f \) now encapsulates all of the transferable prior information.
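Putting the pieces together, here is a minimal sketch of the resulting bilevel loop. It is a first-order approximation with illustrative names; `f.latent_dim`, `loss_fn`, the step sizes, and the explicit Gaussian log-prior term are all assumptions rather than the paper's exact algorithm.

```python
import torch

def meta_ncov_step(f, tasks, loss_fn, inner_steps=5, inner_lr=0.1, outer_lr=1e-3):
    """One meta-update of the shared NCoV transformation f (first-order sketch).

    Inner loop : adapt the task-specific latent z_t, starting from the mode
                 of the base prior, z_t = 0.
    Outer loop : update the transformation parameters theta_f using the
                 validation losses of the adapted tasks.
    """
    outer_loss = 0.0
    for train_set, val_set in tasks:
        # Inner loop: MAP-style adaptation of the task-specific latent variable z_t.
        z = torch.zeros(f.latent_dim, requires_grad=True)
        for _ in range(inner_steps):
            phi = f(z)                                   # decode latent into model parameters
            inner_loss = loss_fn(phi, train_set) \
                + 0.5 * (z ** 2).sum()                   # data fit + Gaussian negative log-prior on z (assumed)
            (grad,) = torch.autograd.grad(inner_loss, z)
            z = (z - inner_lr * grad).detach().requires_grad_(True)

        # Outer objective: evaluate the adapted parameters on validation data.
        outer_loss = outer_loss + loss_fn(f(z), val_set)

    # Outer update of theta_f (first-order: gradients flow through the final f(z) only).
    grads = torch.autograd.grad(outer_loss, list(f.parameters()))
    with torch.no_grad():
        for p, g in zip(f.parameters(), grads):
            p -= outer_lr * g
    return outer_loss.item()
```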
Experiments: Does a Better Prior Boost Performance?
The authors conduct an extensive empirical evaluation across standard few-shot learning benchmarks.
Few-Shot Classification on miniImageNet
MetaNCoV is integrated as a plug-in prior into existing methods like MAML and MetaSGD, and tested on miniImageNet.

Figure 11: MetaNCoV achieves superior performance, especially in 1-shot learning where priors are most critical.
MetaNCoV yields substantial improvements, particularly in 1-shot settings — confirming that expressive priors greatly enhance performance when data is scarce.
Scaling Up: WRN-28-10 and tieredImageNet
Next, the authors evaluate MetaNCoV with a larger Wide ResNet (WRN-28-10) backbone on both the miniImageNet and tieredImageNet datasets.

Figure 12: Consistent improvements with stronger architectures validate the robustness of MetaNCoV.
Even with a high-capacity feature extractor, MetaNCoV continues to provide consistent accuracy gains, demonstrating compatibility across architectures.
Fine-Grained Classification on CUB-200-2011
The model’s ability to capture subtle distinctions was further validated on CUB-200-2011, a fine-grained dataset of bird species.

Figure 13: MetaNCoV’s expressive prior improves learning of subtle features in the fine-grained bird classification task.
MetaNCoV again outperforms competitors, confirming that expressive priors shine in fine-grained, low-data environments.
Ablation Studies and Generalization Across Domains
To verify the theoretical foundations, several ablation studies were conducted.

Figure 14: Design choices validated — non-injective structure and sigmoidal activations are key to expressiveness.
Results show that non-injective NCoVs significantly outperform injective flows, and sigmoidal activations outperform ReLU — aligning perfectly with theoretical predictions.
Cross-Domain Generalization
Finally, MetaNCoV was tested on cross-domain few-shot learning setups. Trained on miniImageNet, the model was evaluated on tieredImageNet, CUB, and Cars datasets.

Figure 15: MetaNCoV demonstrates strong cross-domain transferability, learning priors that generalize beyond training data.
MetaNCoV maintains strong performance across domains — suggesting that the learned, data-driven prior captures fundamental task structures rather than memorizing specifics.
Conclusion: Toward Expressive, Human-Like Learning
The paper “Meta-Learning Universal Priors Using Non-Injective Change of Variables” makes a compelling case that the future of few-shot learning lies in expressive data-driven priors. By discarding the constraint of invertibility imposed on traditional generative models, the proposed NCoV framework learns rich, adaptable priors with universal approximation properties.
The resulting MetaNCoV algorithm achieves state-of-the-art performance across few-shot benchmarks, especially in ultra-low-data environments. More broadly, this work suggests that improving priors — the very foundations of what models “believe” before seeing data — can unlock new levels of sample efficiency and adaptability.
As we move closer to models that learn to learn like humans, flexible universal priors such as MetaNCoV could become key building blocks for the next generation of adaptive AI systems.
