Introduction: The Unlabeled Data Dilemma
In machine learning, data is king — but not all data is created equal. Most state-of-the-art models thrive on massive, meticulously labeled datasets, which take enormous time and resources to collect. Unfortunately, most real-world data isn’t neatly labeled; it’s messy, unlabeled, and abundant. This mismatch between what AI needs and what the world provides is a major bottleneck for deploying machine learning systems in everyday applications.
What if a model could learn to learn, adapting to new tasks with only a handful of examples, without ever being told what things are? This is the promise of meta-learning: training a model across many small tasks so that it can tackle new, unseen tasks with very few labeled examples.
However, most meta-learning methods rely heavily on labeled data to generate those training tasks. In domains where labeling is impractical — such as medical imaging, wildlife monitoring, or satellite analysis — even creating these meta-learning datasets becomes impossible.
Enter Unsupervised Meta-Learning (UML), which aims to learn transferable representations directly from unlabeled data. But this introduces a new challenge: without labels, how can a model create meaningful training tasks that mimic real-world learning?
In a groundbreaking study, researchers introduce CAMeLU — Context-Aware Meta-Learning in Unsupervised Scenarios — a new way to teach a Transformer to learn from unlabeled data. Inspired by the in-context learning capabilities of large language models (LLMs), CAMeLU reframes visual meta-learning as a sequence modeling problem. By treating image classification tasks like text prompts, it enables the Transformer to infer task structure directly from context.
At the heart of CAMeLU lies a clever task-creation process that generates diverse and challenging classification problems from unlabeled images. This mechanism allows the model to achieve state-of-the-art results, outperforming all existing UML methods and even rivaling some supervised approaches — all without seeing a single label during training.
Background: Setting the Stage
To grasp how CAMeLU achieves this, we need to understand three key ideas: meta-learning, unsupervised task creation, and in-context learning — the Transformer’s secret weapon.
Meta-Learning in a Nutshell
Meta-learning operates one level above regular machine learning: instead of learning to classify specific objects, it learns a strategy for classification itself. The standard setup uses N-way K-shot tasks, where the model must identify N categories using K examples per category.
- Support Set: The small collection of labeled examples per class — the “study material.”
- Query Set: New examples from the same classes — the “test.”
Training involves thousands of such small tasks, enabling the model to extract general principles for quick adaptation later. At meta-test time, given a few examples from a new task, a meta-learned model should classify new instances almost instantly.
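To make the episodic setup concrete, here is a schematic sketch of one 5-way 1-shot task; the file names and dictionary layout are illustrative placeholders, not drawn from any particular implementation.

```python
# Schematic layout of a single 5-way 1-shot episode (N=5 classes, K=1 shot).
episode = {
    # Support set: K labeled examples per class (the "study material").
    "support": [
        ("dog_001.jpg", 0), ("cat_001.jpg", 1), ("car_001.jpg", 2),
        ("cup_001.jpg", 3), ("owl_001.jpg", 4),
    ],
    # Query set: new instances of the same classes (the "test").
    "query": [
        ("dog_002.jpg", 0), ("cat_002.jpg", 1), ("owl_002.jpg", 4),
    ],
}
```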
The Unsupervised Twist
Traditional meta-learning assumes access to labeled data — but in the unsupervised setting, labels don’t exist. UML must somehow manufacture pseudo-tasks from unlabeled images.
Earlier methods have tried approaches such as:
- Clustering: Group visually similar images using unsupervised embeddings, and assign cluster IDs as pseudo-labels.
- Data Augmentation: Treat augmented variants of the same image (rotated, cropped, blurred) as belonging to the same pseudo-class.
These strategies work to an extent but tend to produce easy or overly similar tasks that fail to capture the diversity of real-world problems. Consequently, models trained this way often overfit and struggle to generalize, especially to cross-domain scenarios — testing on datasets unlike what they were trained on.
In-Context Learning: The Transformer’s Superpower
Recent advances in Transformer models have revealed something astonishing: in-context learning. Large language models like GPT-3 can infer a new task from a textual prompt — no weight updates, no retraining, just reasoning from context.
For example, you can give a model a few translation pairs in a prompt, such as "sea otter => loutre de mer" and "peppermint => menthe poivrée", and then ask it to complete "cheese => ?".
The model correctly outputs fromage — not because it was fine-tuned for French, but because it inferred the translation task from the pattern.
This paper takes inspiration from that idea: can a Transformer visually infer a classification rule from a handful of examples, just like an LLM does with text?
The Core Method: Inside CAMeLU
CAMeLU combines two complementary components:
- A task-creation mechanism that fabricates diverse few-shot learning tasks from unlabeled data.
- A Transformer-based in-context learner that treats each task as a sequence and performs classification by reasoning from context.

Figure 1: Overview of the CAMeLU pipeline. Each unlabeled image generates augmented support samples and mixup-based query samples. These representations are encoded and fed as a sequence into a Transformer, which learns task context to classify the query image.
Part 1: Crafting Tasks from Scratch
The key to UML is generating meaningful tasks without labels. CAMeLU’s approach ensures that each constructed task mirrors the structure of real few-shot learning problems.
Creating the Support Set
- Sample N images randomly from the unlabeled training dataset.
- Assume the N images belong to distinct classes. In large datasets such as ImageNet (1,000 classes), the chance that any two of the sampled images share a class is negligible: for a 5-way task, \(1 - \prod_{i=0}^{4}\tfrac{1000-i}{1000} \approx 1\%\).
- Apply K augmentations per image using transformations like random cropping, rotation, and color jittering.
- Assign pseudo-labels 1…N. All augmented versions of one image share the same pseudo-label.
This produces an unlabeled but semantically varied support set for each task.
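A minimal PyTorch-style sketch of this procedure might look as follows. The augmentation pipeline's exact parameters, and the assumption that the dataset yields raw PIL images, are illustrative choices rather than the paper's exact implementation.

```python
import random
import torch
from torchvision import transforms

# Illustrative augmentation pipeline of the kind the paper describes;
# the specific parameters here are assumptions.
augment = transforms.Compose([
    transforms.RandomResizedCrop(84),       # random cropping
    transforms.RandomRotation(15),          # rotation
    transforms.ColorJitter(0.4, 0.4, 0.4),  # color jittering
    transforms.ToTensor(),
])

def create_support_set(dataset, N=5, K=5):
    """Build one pseudo-task's support set from unlabeled images.

    Samples N images, treats each as its own class, and produces K
    augmented views per image, all sharing that image's pseudo-label.
    """
    indices = random.sample(range(len(dataset)), N)
    images, pseudo_labels = [], []
    for label, idx in enumerate(indices):    # pseudo-labels 0..N-1
        img = dataset[idx]                   # assumed: a raw PIL image, no label
        for _ in range(K):
            images.append(augment(img))      # each view keeps the same pseudo-label
            pseudo_labels.append(label)
    return torch.stack(images), torch.tensor(pseudo_labels)
```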
Creating the Query Set: A Mixup-Inspired Strategy
A naive approach would generate query examples by augmenting the same support images. However, this doesn’t replicate the true test-time challenge — where query samples are different instances of the same classes.
CAMeLU introduces a mixup-inspired method to make tasks more realistic:
- Select one support image \(x_n\).
- Apply a new augmentation, obtaining \(\tilde{x}_{n,j}\).
- Randomly sample another image \(z_j\) from the dataset.
- Combine them using: \[ x_j^{(qr)} = \lambda z_j + (1 - \lambda)\tilde{x}_{n,j} \] where \(\lambda \in (0, 0.5]\) ensures that the resulting query still mostly reflects \(x_n\)’s class characteristics.
The mixed image \(x_j^{(qr)}\) inherits the same pseudo-label as its originating support image, encouraging the model to focus on robust, generalizable features instead of superficial image details. This is a crucial step toward closing the gap between training and test conditions.
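In code, the query construction could look like the sketch below, reusing the `augment` pipeline from above and again assuming raw PIL images; the \(\lambda\) sampling range follows the paper's description, while the function name and signature are illustrative.

```python
def create_query_sample(dataset, support_img, pseudo_label, lam_max=0.5):
    """Build one query sample for the class represented by `support_img`."""
    x_tilde = augment(support_img)                          # fresh augmentation of x_n
    z = augment(dataset[random.randrange(len(dataset))])    # random mixing image z_j
    lam = random.uniform(0.0, lam_max)                      # lambda in (0, 0.5]
    x_query = lam * z + (1 - lam) * x_tilde                 # mostly keeps x_n's content
    return x_query, pseudo_label                            # inherits x_n's pseudo-label
```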
Part 2: The In-Context Learning Engine
Once these pseudo-tasks are built, CAMeLU transforms them into sequences suitable for a Transformer encoder.
From Task to Sequence
Each task comprises its support set and a query image organized as:
\[ S_{i,j} = \left( (x_1^{(sp)}, y_1^{(sp)}), \dots, (x_{NK}^{(sp)}, y_{NK}^{(sp)}), x_j^{(qr)} \right) \]

The Transformer receives this sequence, treating the support samples (and their labels) as context, and predicts the label for the final query element.
Model Architecture
CAMeLU’s architecture includes:
- Fixed Feature Extractor (\(f_\psi\)) — a pre-trained ResNet-50 converts each image into a feature embedding. This extractor remains frozen during training.
- Learned Class Encoder (\(g_\phi\)) — transforms pseudo-labels into dense learned embeddings. For the query image, a special learnable “unknown” token replaces the missing label.
- Transformer Encoder (\(M_\theta\)) — combines image and label embeddings into a sequence. Through self-attention, it infers relationships between supports and queries.
The Transformer's output corresponding to the query position goes through a linear classifier head to predict its label. The training objective minimizes the cross-entropy loss over query predictions:
\[ \min_{\theta,\phi} \mathbb{E}_{S_i} \left[\frac{1}{Q} \sum_{j=1}^{Q} \ell(M_{\theta}(S_{i,j}), y_j^{(qr)}) \right] \]

At inference, there is no fine-tuning; the model predicts in a single forward pass by leveraging the context supplied by the support set. This is true in-context learning for vision.
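The following sketch shows how the three components might fit together in PyTorch. The embedding width, layer count, and the concatenation-based fusion of image and label embeddings are assumptions for illustration; the paper's exact architecture may differ.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class CAMeLUSketch(nn.Module):
    def __init__(self, n_classes=5, d=256, n_layers=4):
        super().__init__()
        # f_psi: frozen, pre-trained ResNet-50 feature extractor
        backbone = resnet50(weights="IMAGENET1K_V2")
        self.f_psi = nn.Sequential(*list(backbone.children())[:-1])
        for p in self.f_psi.parameters():
            p.requires_grad = False
        self.proj = nn.Linear(2048, d)            # project features to width d
        # g_phi: learned class encoder; index n_classes is the "unknown" query token
        self.g_phi = nn.Embedding(n_classes + 1, d)
        self.unknown = n_classes
        # M_theta: Transformer encoder over the (image, label) token sequence
        layer = nn.TransformerEncoderLayer(d_model=2 * d, nhead=8, batch_first=True)
        self.M_theta = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(2 * d, n_classes)   # linear classifier head

    def forward(self, support_x, support_y, query_x):
        # support_x: (NK, C, H, W); support_y: (NK,); query_x: (1, C, H, W)
        xs = torch.cat([support_x, query_x], dim=0)
        feats = self.proj(self.f_psi(xs).flatten(1))              # one embedding per image
        ys = torch.cat([support_y, torch.tensor([self.unknown])]) # query gets "unknown"
        tokens = torch.cat([feats, self.g_phi(ys)], dim=-1).unsqueeze(0)
        out = self.M_theta(tokens)                # self-attention across the whole task
        return self.head(out[0, -1])              # logits at the query position
```

Training then reduces to ordinary cross-entropy on the query logits, e.g. `F.cross_entropy(model(sx, sy, qx).unsqueeze(0), qy)`, averaged over queries and tasks exactly as in the objective above.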
Visualizing In-Context Learning

Figure 2: The Transformer learns tighter, well-separated clusters based on task context, transforming ambiguous embeddings into clear class boundaries.
The difference is striking: the pre-trained feature extractor provides weakly separated clusters, but after the Transformer’s contextual refinement, those clusters become compact and distinct. The query image moves closer to its correct class centroid, illustrating how in-context reasoning boosts discrimination.
Experiments and Results: Putting CAMeLU to the Test
The researchers conducted extensive experiments across five benchmark datasets — miniImageNet, CIFAR-FS, CUB, Aircraft, and Meta-iNat — comparing CAMeLU with leading UML baselines, a supervised counterpart (CAML), and state-of-the-art self-supervised methods.
Outperforming the Unsupervised Baselines

Figure 3: Performance comparison across datasets. CAMeLU consistently achieves top results across both 1-shot and 5-shot tasks.
Key takeaways:
- New state-of-the-art: CAMeLU surpasses all previous UML models (CACTUs, UMTRA, Meta-GMVAE, PsCo) across every dataset.
- Cross-domain resilience: While prior models perform well only when trained and tested on the same domain, CAMeLU thrives even when test datasets differ markedly from training data.
- No fine-tuning required: PsCo needs an additional adaptation step; CAMeLU predicts instantly — ideal for real-time applications.
- Competitive with supervision: Astonishingly, CAMeLU matches or even outperforms CAML, its supervised counterpart, especially on datasets far removed from training data like CUB and Meta-iNat.
From Memorization to Generalization: A “Grokking” Shift
During training, CAMeLU exhibits a fascinating pattern reminiscent of grokking: a sudden transition from memorization to generalization.

Figure 4: CAMeLU’s training curve shows three distinct phases: memorization, learning, and generalization.
The learning trajectory unfolds in three stages:
- Memorization (epochs 0–20): The model memorizes its training tasks with weak cross-domain performance.
- Learning (epochs 20–80): It shifts toward reasoning by analogy — “learning to learn.”
- Generalization (epochs 80–100): The model stabilizes, demonstrating strong task-independent adaptation.
This transition mirrors the learning behavior observed in large language models, providing compelling evidence of in-context learning emerging in vision tasks.
Generalization on Small-Scale Datasets
Can CAMeLU still excel with limited training data? Absolutely.

Figure 5: CAMeLU maintains strong generalization even when trained on small datasets.
When trained on the much smaller miniImageNet dataset, CAMeLU continues to outperform other UML methods — both in-domain and across new datasets — showing remarkable scalability. Even with fewer examples, the model exhibits the same three learning phases seen during large-scale training.

Figure 6: CAMeLU (orange) shows clear progress from learning to generalization; CAML (blue) remains flat. The task-creation mechanism drives generalization.
Comparing with Self-Supervised Learning (SSL) Methods
How does CAMeLU fare against powerful SSL models like SimCLR and SwAV?

Figure 7: Comparison with SSL methods. CAMeLU demonstrates greater robustness across domains.
SSL methods excel on datasets similar to their pre-training domain but tend to falter when faced with new, fine-grained categories. Moreover, they require domain-specific fine-tuning before evaluation. CAMeLU, on the other hand, learns to generalize directly — outperforming SSL models on diverse datasets without any fine-tuning.
Conclusion and Implications
CAMeLU marks a significant leap for unsupervised meta-learning. By pairing in-context learning with a mixup-inspired task creation mechanism, it establishes a new frontier for learning directly from unlabeled data.
Key Takeaways
- A New Paradigm: Reframes meta-learning as sequence modeling, harnessing the Transformer’s in-context capability for vision tasks.
- State-of-the-Art Performance: Achieves top results in unsupervised learning — even rivaling supervised and self-supervised methods.
- True Generalization: Its unique task generation mechanism compels the model to learn robust, transferable representations.
- Efficiency: CAMeLU performs inference with a single forward pass and trains comfortably on consumer GPUs, making it accessible beyond high-end research labs.
CAMeLU opens new possibilities for AI that can learn in the wild — adapting to unseen tasks and domains with minimal human supervision. Future work may explore integrating self-supervised feature extractors or refining task generation strategies, moving toward a fully unsupervised learning ecosystem.
Machine learning models may soon no longer depend on labeled datasets — they could simply learn from context, much like humans do.