Introduction: The Unlabeled Data Dilemma
In machine learning, data is king — but not all data is created equal. Most state-of-the-art models thrive on massive, meticulously labeled datasets, which take enormous time and resources to collect. Unfortunately, most real-world data isn’t neatly labeled; it’s messy, unlabeled, and abundant. This mismatch between what AI needs and what the world provides is a major bottleneck for deploying machine learning systems in everyday applications.
What if a model could learn to learn, adapting to new tasks with only a handful of examples, without ever being told what things are? This is the promise of meta-learning: training a model across many small tasks so that it can tackle new, unseen tasks with very few labeled examples.
However, most meta-learning methods rely heavily on labeled data to generate those training tasks. In domains where labeling is impractical — such as medical imaging, wildlife monitoring, or satellite analysis — even creating these meta-learning datasets becomes impossible.
Enter Unsupervised Meta-Learning (UML), which aims to learn transferable representations directly from unlabeled data. But this introduces a new challenge: without labels, how can a model create meaningful training tasks that mimic real-world learning?
In a groundbreaking study, researchers introduce CAMeLU — Context-Aware Meta-Learning in Unsupervised Scenarios — a new way to teach a Transformer to learn from unlabeled data. Inspired by the in-context learning capabilities of large language models (LLMs), CAMeLU reframes visual meta-learning as a sequence modeling problem. By treating image classification tasks like text prompts, it enables the Transformer to infer task structure directly from context.
At the heart of CAMeLU lies a clever task-creation process that generates diverse and challenging classification problems from unlabeled images. This mechanism allows the model to achieve state-of-the-art results, outperforming all existing UML methods and even rivaling some supervised approaches — all without seeing a single label during training.
Background: Setting the Stage
To grasp how CAMeLU achieves this, we need to understand three key ideas: meta-learning, unsupervised task creation, and in-context learning — the Transformer’s secret weapon.
Meta-Learning in a Nutshell
Meta-learning operates one level above regular machine learning: instead of learning to classify specific objects, it learns a strategy for classification itself. The standard setup uses N-way K-shot tasks, where the model must identify N categories using K examples per category.
- Support Set: The small collection of labeled examples per class — the “study material.”
- Query Set: New examples from the same classes — the “test.”
Training involves thousands of such small tasks, enabling the model to extract general principles for quick adaptation later. At meta-test time, given a few examples from a new task, a meta-learned model should classify new instances almost instantly.
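To make the episodic setup concrete, here is a schematic sketch of one 5-way 1-shot task; the file names and dictionary layout are illustrative placeholders, not drawn from any particular implementation.

```python
# Schematic layout of a single 5-way 1-shot episode (N=5 classes, K=1 shot).
episode = {
    # Support set: K labeled examples per class (the "study material").
    "support": [
        ("dog_001.jpg", 0), ("cat_001.jpg", 1), ("car_001.jpg", 2),
        ("cup_001.jpg", 3), ("owl_001.jpg", 4),
    ],
    # Query set: new instances of the same classes (the "test").
    "query": [
        ("dog_002.jpg", 0), ("cat_002.jpg", 1), ("owl_002.jpg", 4),
    ],
}
```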
The Unsupervised Twist
Traditional meta-learning assumes access to labeled data — but in the unsupervised setting, labels don’t exist. UML must somehow manufacture pseudo-tasks from unlabeled images.
Earlier methods have tried approaches such as:
- Clustering: Group visually similar images using unsupervised embeddings, and assign cluster IDs as pseudo-labels.
- Data Augmentation: Treat augmented variants of the same image (rotated, cropped, blurred) as belonging to the same pseudo-class.
These strategies work to an extent but tend to produce easy or overly similar tasks that fail to capture the diversity of real-world problems. Consequently, models trained this way often overfit and struggle to generalize, especially to cross-domain scenarios — testing on datasets unlike what they were trained on.
In-Context Learning: The Transformer’s Superpower
Recent advances in Transformer models have revealed something astonishing: in-context learning. Large language models like GPT-3 can infer a new task from a textual prompt — no weight updates, no retraining, just reasoning from context.
For example, you can give a model a few translation pairs in a prompt, such as "sea otter => loutre de mer" and "peppermint => menthe poivrée", and then ask it to complete "cheese => ?".
The model correctly outputs fromage — not because it was fine-tuned for French, but because it inferred the translation task from the pattern.
This paper takes inspiration from that idea: can a Transformer visually infer a classification rule from a handful of examples, just like an LLM does with text?
The Core Method: Inside CAMeLU
CAMeLU combines two complementary components:
- A task-creation mechanism that fabricates diverse few-shot learning tasks from unlabeled data.
- A Transformer-based in-context learner that treats each task as a sequence and performs classification by reasoning from context.

Figure 1: Overview of the CAMeLU pipeline. Each unlabeled image generates augmented support samples and mixup-based query samples. These representations are encoded and fed as a sequence into a Transformer, which learns task context to classify the query image.
Part 1: Crafting Tasks from Scratch
The key to UML is generating meaningful tasks without labels. CAMeLU’s approach ensures that each constructed task mirrors the structure of real few-shot learning problems.
Creating the Support Set
- Sample N images randomly from the unlabeled training dataset.
- Assume the N images belong to distinct classes. In large datasets such as ImageNet (1,000 classes), the chance that any two of the sampled images share a class is negligible: for a 5-way task, \(1 - \prod_{i=0}^{4}\tfrac{1000-i}{1000} \approx 1\%\).
- Apply K augmentations per image using transformations like random cropping, rotation, and color jittering.
- Assign pseudo-labels 1…N. All augmented versions of one image share the same pseudo-label.
This produces an unlabeled but semantically varied support set for each task.
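A minimal PyTorch-style sketch of this procedure might look as follows. The augmentation pipeline's exact parameters, and the assumption that the dataset yields raw PIL images, are illustrative choices rather than the paper's exact implementation.

```python
import random
import torch
from torchvision import transforms

# Illustrative augmentation pipeline of the kind the paper describes;
# the specific parameters here are assumptions.
augment = transforms.Compose([
    transforms.RandomResizedCrop(84),       # random cropping
    transforms.RandomRotation(15),          # rotation
    transforms.ColorJitter(0.4, 0.4, 0.4),  # color jittering
    transforms.ToTensor(),
])

def create_support_set(dataset, N=5, K=5):
    """Build one pseudo-task's support set from unlabeled images.

    Samples N images, treats each as its own class, and produces K
    augmented views per image, all sharing that image's pseudo-label.
    """
    indices = random.sample(range(len(dataset)), N)
    images, pseudo_labels = [], []
    for label, idx in enumerate(indices):    # pseudo-labels 0..N-1
        img = dataset[idx]                   # assumed: a raw PIL image, no label
        for _ in range(K):
            images.append(augment(img))      # each view keeps the same pseudo-label
            pseudo_labels.append(label)
    return torch.stack(images), torch.tensor(pseudo_labels)
```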
Creating the Query Set: A Mixup-Inspired Strategy
A naive approach would generate query examples by augmenting the same support images. However, this doesn’t replicate the true test-time challenge — where query samples are different instances of the same classes.
CAMeLU introduces a mixup-inspired method to make tasks more realistic:
- Select one support image \(x_n\).
- Apply a new augmentation, obtaining \(\tilde{x}_{n,j}\).
- Randomly sample another image \(z_j\) from the dataset.
- Combine them using: \[ x_j^{(qr)} = \lambda z_j + (1 - \lambda)\tilde{x}_{n,j} \] where \(\lambda \in (0, 0.5]\) ensures that the resulting query still mostly reflects \(x_n\)’s class characteristics.
The mixed image \(x_j^{(qr)}\) inherits the same pseudo-label as its originating support image, encouraging the model to focus on robust, generalizable features instead of superficial image details. This is a crucial step toward closing the gap between training and test conditions.
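In code, the query construction could look like the sketch below, reusing the `augment` pipeline from above and again assuming raw PIL images; the \(\lambda\) sampling range follows the paper's description, while the function name and signature are illustrative.

```python
def create_query_sample(dataset, support_img, pseudo_label, lam_max=0.5):
    """Build one query sample for the class represented by `support_img`."""
    x_tilde = augment(support_img)                          # fresh augmentation of x_n
    z = augment(dataset[random.randrange(len(dataset))])    # random mixing image z_j
    lam = random.uniform(0.0, lam_max)                      # lambda in (0, 0.5]
    x_query = lam * z + (1 - lam) * x_tilde                 # mostly keeps x_n's content
    return x_query, pseudo_label                            # inherits x_n's pseudo-label
```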
Part 2: The In-Context Learning Engine
Once these pseudo-tasks are built, CAMeLU transforms them into sequences suitable for a Transformer encoder.
From Task to Sequence
Each task comprises its support set and a query image organized as:
\[ S_{i,j} = \left( (x_1^{(sp)}, y_1^{(sp)}), \dots, (x_{NK}^{(sp)}, y_{NK}^{(sp)}), x_j^{(qr)} \right) \]

The Transformer receives this sequence, treating the support samples (and their labels) as context, and predicts the label for the final query element.
Model Architecture
CAMeLU’s architecture includes:
- Fixed Feature Extractor (\(f_\psi\)) — a pre-trained ResNet-50 converts each image into a feature embedding. This extractor remains frozen during training.
- Learned Class Encoder (\(g_\phi\)) — transforms pseudo-labels into dense learned embeddings. For the query image, a special learnable “unknown” token replaces the missing label.
- Transformer Encoder (\(M_\theta\)) — combines image and label embeddings into a sequence. Through self-attention, it infers relationships between supports and queries.
The Transformer's output corresponding to the query position goes through a linear classifier head to predict its label. The training objective minimizes the cross-entropy loss over query predictions:
\[ \min_{\theta,\phi} \mathbb{E}_{S_i} \left[\frac{1}{Q} \sum_{j=1}^{Q} \ell(M_{\theta}(S_{i,j}), y_j^{(qr)}) \right] \]

At inference, there is no fine-tuning; the model predicts in a single forward pass by leveraging the context supplied by the support set. This is true in-context learning for vision.
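The following sketch shows how the three components might fit together in PyTorch. The embedding width, layer count, and the concatenation-based fusion of image and label embeddings are assumptions for illustration; the paper's exact architecture may differ.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class CAMeLUSketch(nn.Module):
    def __init__(self, n_classes=5, d=256, n_layers=4):
        super().__init__()
        # f_psi: frozen, pre-trained ResNet-50 feature extractor
        backbone = resnet50(weights="IMAGENET1K_V2")
        self.f_psi = nn.Sequential(*list(backbone.children())[:-1])
        for p in self.f_psi.parameters():
            p.requires_grad = False
        self.proj = nn.Linear(2048, d)            # project features to width d
        # g_phi: learned class encoder; index n_classes is the "unknown" query token
        self.g_phi = nn.Embedding(n_classes + 1, d)
        self.unknown = n_classes
        # M_theta: Transformer encoder over the (image, label) token sequence
        layer = nn.TransformerEncoderLayer(d_model=2 * d, nhead=8, batch_first=True)
        self.M_theta = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(2 * d, n_classes)   # linear classifier head

    def forward(self, support_x, support_y, query_x):
        # support_x: (NK, C, H, W); support_y: (NK,); query_x: (1, C, H, W)
        xs = torch.cat([support_x, query_x], dim=0)
        feats = self.proj(self.f_psi(xs).flatten(1))              # one embedding per image
        ys = torch.cat([support_y, torch.tensor([self.unknown])]) # query gets "unknown"
        tokens = torch.cat([feats, self.g_phi(ys)], dim=-1).unsqueeze(0)
        out = self.M_theta(tokens)                # self-attention across the whole task
        return self.head(out[0, -1])              # logits at the query position
```

Training then reduces to ordinary cross-entropy on the query logits, e.g. `F.cross_entropy(model(sx, sy, qx).unsqueeze(0), qy)`, averaged over queries and tasks exactly as in the objective above.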
Visualizing In-Context Learning

Figure 2: The Transformer learns tighter, well-separated clusters based on task context, transforming ambiguous embeddings into clear class boundaries.
The difference is striking: the pre-trained feature extractor provides weakly separated clusters, but after the Transformer’s contextual refinement, those clusters become compact and distinct. The query image moves closer to its correct class centroid, illustrating how in-context reasoning boosts discrimination.
Experiments and Results: Putting CAMeLU to the Test
The researchers conducted extensive experiments across five benchmark datasets — miniImageNet, CIFAR-FS, CUB, Aircraft, and Meta-iNat — comparing CAMeLU with leading UML baselines, a supervised counterpart (CAML), and state-of-the-art self-supervised methods.
Outperforming the Unsupervised Baselines

Figure 3: Performance comparison across datasets. CAMeLU consistently achieves top results across both 1-shot and 5-shot tasks.
Key takeaways:
- New state-of-the-art: CAMeLU surpasses all previous UML models (CACTUs, UMTRA, Meta-GMVAE, PsCo) across every dataset.
- Cross-domain resilience: While prior models perform well only when trained and tested on the same domain, CAMeLU thrives even when test datasets differ markedly from training data.
- No fine-tuning required: PsCo needs an additional adaptation step; CAMeLU predicts instantly — ideal for real-time applications.
- Competitive with supervision: Astonishingly, CAMeLU matches or even outperforms CAML, its supervised counterpart, especially on datasets far removed from training data like CUB and Meta-iNat.
From Memorization to Generalization: A “Grokking” Shift
During training, CAMeLU exhibits a fascinating pattern reminiscent of grokking: a sudden transition from memorization to generalization.

Figure 4: CAMeLU’s training curve shows three distinct phases: memorization, learning, and generalization.
The learning trajectory unfolds in three stages:
- Memorization (epochs 0–20): The model memorizes its training tasks with weak cross-domain performance.
- Learning (epochs 20–80): It shifts toward reasoning by analogy — “learning to learn.”
- Generalization (epochs 80–100): The model stabilizes, demonstrating strong task-independent adaptation.
This transition mirrors the learning behavior observed in large language models, providing compelling evidence of in-context learning emerging in vision tasks.
Generalization on Small-Scale Datasets
Can CAMeLU still excel with limited training data? Absolutely.

Figure 5: CAMeLU maintains strong generalization even when trained on small datasets.
When trained on the much smaller miniImageNet dataset, CAMeLU continues to outperform other UML methods — both in-domain and across new datasets — showing remarkable scalability. Even with fewer examples, the model exhibits the same three learning phases seen during large-scale training.

Figure 6: CAMeLU (orange) shows clear progress from learning to generalization; CAML (blue) remains flat. The task-creation mechanism drives generalization.
Comparing with Self-Supervised Learning (SSL) Methods
How does CAMeLU fare against powerful SSL models like SimCLR and SwAV?

Figure 7: Comparison with SSL methods. CAMeLU demonstrates greater robustness across domains.
SSL methods excel on datasets similar to their pre-training domain but tend to falter when faced with new, fine-grained categories. Moreover, they require domain-specific fine-tuning before evaluation. CAMeLU, on the other hand, learns to generalize directly — outperforming SSL models on diverse datasets without any fine-tuning.
Conclusion and Implications
CAMeLU marks a significant leap for unsupervised meta-learning. By pairing in-context learning with a mixup-inspired task creation mechanism, it establishes a new frontier for learning directly from unlabeled data.
Key Takeaways
- A New Paradigm: Reframes meta-learning as sequence modeling, harnessing the Transformer’s in-context capability for vision tasks.
- State-of-the-Art Performance: Achieves top results in unsupervised learning — even rivaling supervised and self-supervised methods.
- True Generalization: Its unique task generation mechanism compels the model to learn robust, transferable representations.
- Efficiency: CAMeLU performs inference with a single forward pass and trains comfortably on consumer GPUs, making it accessible beyond high-end research labs.
CAMeLU opens new possibilities for AI that can learn in the wild — adapting to unseen tasks and domains with minimal human supervision. Future work may explore integrating self-supervised feature extractors or refining task generation strategies, moving toward a fully unsupervised learning ecosystem.
Machine learning models may soon no longer depend on labeled datasets — they could simply learn from context, much like humans do.