Reinforcement Learning (RL) has given us agents that can master complex video games, control simulated robots, and even grasp real-world objects. However, there’s a catch that has long plagued the field: RL is notoriously data-hungry.
An agent often needs millions of interactions with its environment to learn a task. In a fast simulation, that’s fine—but in the real world, where a robot arm might take seconds to perform a single action, this can translate to months or years of training.
When the agent is learning directly from raw pixels—like the feed from a camera—the problem gets worse. While humans can effortlessly make sense of visual input, for an RL agent, an image is just a high-dimensional array of numbers. It’s far easier for an agent to learn if it’s given clean, structured “state” information, like joint angles or positions. But in many real-world scenarios, such perfect state information does not exist. We only have pixels.
This sample inefficiency has been a major bottleneck, holding back RL in domains like robotics. But what if we could teach an agent to understand those pixels—to extract the meaningful, high-level features—before it even tries to learn control?
That’s the core idea behind CURL (Contrastive Unsupervised Representations for Reinforcement Learning), developed by researchers at UC Berkeley. CURL marries standard RL algorithms with a powerful contrastive learning technique from computer vision. The result: an agent that learns from pixels with unprecedented sample efficiency, nearly matching the performance of agents with privileged access to the true state of the environment.
Background: RL, Self-Supervision, and Contrastive Learning
Before diving into CURL, let’s review some fundamentals.
Reinforcement Learning from Pixels
Most image-based RL algorithms follow an actor-critic framework:
- Actor (Policy): Chooses actions to maximize expected future rewards.
- Critic (Value Function): Evaluates how good each action is, predicting future rewards.
The actor and critic train each other—the critic guides the actor, and the actor’s exploration generates experiences for the critic.
For continuous control tasks in the DeepMind Control Suite, CURL uses Soft Actor-Critic (SAC), a powerful off-policy algorithm. SAC optimizes not only for high reward but also for high policy entropy, encouraging exploration.
The SAC critic \( Q_{\phi_i} \) minimizes the Bellman error:
\[ \mathcal{L}(\phi_i, \mathcal{B}) = \mathbb{E}_{t \sim \mathcal{B}} \left[ \left( Q_{\phi_i}(o, a) - (r + \gamma(1 - d)\mathcal{T}) \right)^2 \right] \]
with the target
\[ \mathcal{T} = \min_{i=1,2} Q^*_{\phi_i}(o',a') - \alpha \log \pi_{\psi}(a'|o') \]
Here, the expectation is over transitions \( t = (o, a, o', r, d) \) sampled from the replay buffer \( \mathcal{B} \), \( \alpha \) is the entropy coefficient, and \( Q^*_{\phi_i} \) denotes a target critic whose parameters are an exponential moving average of \( \phi_i \).
The actor \( \pi_{\psi} \) maximizes:
\[ \mathcal{L}(\psi) = \mathbb{E}_{a \sim \pi} \left[ Q^{\pi}(o, a) - \alpha \log \pi_{\psi}(a|o) \right] \]
For discrete control in Atari games, CURL uses a data-efficient Rainbow DQN, combining multiple DQN improvements.
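For the continuous-control case, the two SAC objectives above reduce to a few lines of code. Below is a minimal PyTorch-style sketch; the critic and actor networks, the replay-batch layout, and `alpha` are assumed placeholders rather than the authors’ implementation.

```python
import torch
import torch.nn.functional as F

def critic_loss(q1, q2, q1_targ, q2_targ, actor, batch, alpha, gamma=0.99):
    """Bellman error for the two SAC critics, using EMA target critics."""
    o, a, r, o2, d = batch          # obs, action, reward, next obs, done flag
    with torch.no_grad():
        a2, logp_a2 = actor(o2)     # sample a' ~ pi(.|o') and its log-prob
        q_min = torch.min(q1_targ(o2, a2), q2_targ(o2, a2))
        target = r + gamma * (1 - d) * (q_min - alpha * logp_a2)
    return F.mse_loss(q1(o, a), target) + F.mse_loss(q2(o, a), target)

def actor_loss(q1, q2, actor, o, alpha):
    """Maximize min_i Q_i(o, a) - alpha * log pi(a|o), written as a loss."""
    a, logp_a = actor(o)
    q_min = torch.min(q1(o, a), q2(o, a))
    return (alpha * logp_a - q_min).mean()
```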
Contrastive Learning in a Nutshell
We want an RL agent to turn raw pixels into a robust feature space before it learns control. Self-supervised learning, and contrastive learning in particular, is well suited to this.
Contrastive learning trains an encoder to embed similar inputs close together and dissimilar inputs far apart. It is commonly framed as a dictionary lookup:
- Anchor (Query): An original image.
- Positive: A transformed version of the same image (e.g., cropped).
- Negatives: Other images in the batch.
Figure 1. Conceptual diagram of contrastive learning: augmented views of the same observation are drawn closer in the learned feature space than other observations.
The InfoNCE loss formalizes this:
\[ \mathcal{L}_{q} = \log \frac{\exp(q^{T}Wk_{+})}{\exp(q^{T}Wk_{+}) + \sum_{i=0}^{K-1}\exp(q^{T}Wk_{i})} \]
where \( k_+ \) is the positive key, the \( k_i \) are the \( K \) negative keys, and \( W \) is a learned bilinear similarity matrix. Maximizing this objective pushes the encoder to capture the semantic content shared by the two views while ignoring irrelevant variations such as crop position.
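In code, this objective becomes a cross-entropy over similarity logits, where minimizing the cross-entropy is equivalent to maximizing \( \mathcal{L}_q \). Here is a sketch (PyTorch-style, with shapes and names assumed):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q, k, W):
    """InfoNCE with a bilinear similarity q^T W k.

    q: (B, Z) query features, k: (B, Z) key features,
    W: (Z, Z) learnable matrix. The key at the same batch index as a query
    is its positive; every other key in the batch is a negative.
    """
    k = k.detach()                                              # no gradient into the key encoder
    logits = q @ W @ k.t()                                      # (B, B) similarity matrix
    logits = logits - logits.max(dim=1, keepdim=True).values    # numerical stability
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)
```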
How CURL Works
CURL weaves contrastive learning directly into the RL loop—no separate pre-training phase. Representation learning and policy learning happen together.
Figure 2. CURL couples a query encoder (updated via RL and contrastive losses) with a key encoder (updated via momentum averaging) to jointly train control and representation learning.
Step 1: Sample & Augment
From the replay buffer, CURL samples a batch of past observations. Each observation is augmented twice with a random crop to produce:
- Query \( o_q \)
- Key \( o_k \)
Figure 3. Random crops maintain temporal consistency across stacked frames.
In RL, observations are often stacks of frames. CURL applies identical random crop coordinates across all frames in the stack, preserving the temporal structure.
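A minimal sketch of this augmentation (PyTorch-style; the 84×84 output size follows the paper’s DMControl setup, other shapes are assumed):

```python
import torch

def random_crop(obs_batch, out_size=84):
    """Crop each observation once, applying the same window to every frame
    in its stack so that temporal structure is preserved.

    obs_batch: (B, C, H, W) tensor, where C stacks several consecutive frames.
    """
    b, c, h, w = obs_batch.shape
    out = obs_batch.new_empty((b, c, out_size, out_size))
    for i in range(b):
        top = torch.randint(0, h - out_size + 1, ()).item()
        left = torch.randint(0, w - out_size + 1, ()).item()
        out[i] = obs_batch[i, :, top:top + out_size, left:left + out_size]
    return out

# two independent crops of the same observations give the query and key views
# o_q, o_k = random_crop(obs), random_crop(obs)
```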
Step 2: Encode
- Query Encoder \( f_{\theta} \): Encodes \( o_q \) into feature vector \( q \), used by the SAC/Rainbow actor and critic.
- Key Encoder \( f_{\theta_k} \): Encodes \( o_k \) into feature vector \( k \).
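As a rough sketch, the two encoders can be set up as a query network plus a detached copy; the convolutional architecture, input size, and feature dimension here are placeholders, not the paper’s exact encoder:

```python
import copy
import torch.nn as nn

class PixelEncoder(nn.Module):
    """Small convolutional encoder mapping a frame stack to a feature vector
    (placeholder architecture, sized for 84x84 inputs with 9 stacked channels)."""
    def __init__(self, in_channels=9, feature_dim=50):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 32, 3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.fc = nn.Linear(32 * 39 * 39, feature_dim)  # 84x84 input -> 39x39 feature maps

    def forward(self, x):
        return self.fc(self.conv(x / 255.0))

query_encoder = PixelEncoder()                 # f_theta: trained by RL + contrastive losses
key_encoder = copy.deepcopy(query_encoder)     # f_theta_k: starts as an exact copy
for p in key_encoder.parameters():
    p.requires_grad = False                    # updated only via the EMA in the next step
```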
Step 3: Momentum Encoder Trick
In a trick borrowed from MoCo, the key encoder’s weights are updated as an exponential moving average (EMA) of the query encoder’s weights:
\[ \theta_k \leftarrow m \theta_k + (1 - m) \theta \]
The slowly moving key encoder produces stable key representations, which improves contrastive learning.
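A minimal sketch of this update, assuming the two encoders from the previous step (the default value of `m` here is illustrative, not the paper’s setting):

```python
import torch

@torch.no_grad()
def momentum_update(key_encoder, query_encoder, m=0.95):
    """theta_k <- m * theta_k + (1 - m) * theta, applied parameter-wise."""
    for p_k, p_q in zip(key_encoder.parameters(), query_encoder.parameters()):
        p_k.mul_(m).add_(p_q, alpha=1 - m)
```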
Step 4: Joint Optimization
The query encoder updates from:
- RL Loss: SAC or Rainbow losses using \( q \).
- Contrastive Loss: InfoNCE computed on \( q \) vs. \( k \) (positive and negatives).
This dual pressure creates features that are semantically rich and control-relevant.
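Tying the pieces together, one combined update might look like the following sketch, which reuses the hypothetical helpers from the earlier snippets. Here `rl_agent.update` stands in for the SAC or Rainbow gradient step, and `cpc_optimizer` is assumed to hold both the query encoder’s parameters and `W`.

```python
def curl_update(obs_batch, rl_agent, query_encoder, key_encoder, W, cpc_optimizer):
    """One combined training step (a sketch, not the authors' implementation)."""
    # two independent crops of the same stacked observations
    o_q = random_crop(obs_batch)
    o_k = random_crop(obs_batch)

    # 1) RL step: actor/critic losses flow gradients into the query encoder
    rl_agent.update(o_q)   # actions, rewards, etc. omitted for brevity

    # 2) contrastive step: InfoNCE between query and key features
    q = query_encoder(o_q)
    with torch.no_grad():
        k = key_encoder(o_k)
    loss = contrastive_loss(q, k, W)
    cpc_optimizer.zero_grad()
    loss.backward()
    cpc_optimizer.step()

    # 3) the key encoder slowly tracks the query encoder via EMA
    momentum_update(key_encoder, query_encoder)
```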
Results
The authors benchmarked CURL on the DeepMind Control Suite and the Atari100k benchmark.
DeepMind Control Suite
Table 1. CURL achieves state-of-the-art results on most benchmark environments at 500k steps.
Figure 4. CURL outperforms prior pixel-based RL algorithms at 100k and matches State SAC median score at 500k.
Figure 6. Dreamer requires roughly 4.5× more environment steps to match CURL’s performance at 100k steps.
Figure 7. CURL nearly matches State SAC performance across many DMControl environments—closing the state–pixels gap.
Atari100k
Table 2. CURL improves Efficient Rainbow on 19/26 games, surpassing human-level scores on JamesBond and Krull.
Ablation Insights
CURL’s design choices were key to its success:
Momentum Encoder & Bilinear Similarity: Swapping the momentum key encoder for the plain query encoder, or the bilinear similarity for cosine similarity, hurt performance.
Figure 5. EMA and bilinear similarity yield superior and more stable learning.
Temporal Dynamics: Contrasting frame stacks (red) beats single frames (green) on tasks where motion matters.
Figure 8. Temporal information is vital for coordination-intensive tasks.
Reward-Agnostic Representations: Detaching the encoder from the RL loss and training it with the contrastive objective alone still produces near-optimal policies.
Figure 9. Contrastive-only training of the encoder yields general-purpose features useful for control.
State Prediction Gap: Tasks with high error in predicting true state from pixels (e.g., humanoid-run) are exactly where CURL lags most compared to state-based SAC.
Figure 11. Predictability of state from pixels correlates with CURL’s relative performance.
Conclusion
CURL delivers a simple yet powerful recipe for overcoming sample inefficiency in pixel-based RL:
- Minimal Complexity, Maximum Gain: Just add a contrastive objective with data augmentation to a strong RL baseline.
- Robust Representation Learning: Instance discrimination shapes an encoder that understands visual semantics.
- Closing the Gap: Achieves near-oracle performance from pixels—closing in on state-based baselines.
This work sets a new standard for sample-efficient RL from pixels and has inspired a wave of research into data augmentation and self-supervision in RL. By shrinking learning times from weeks to hours, CURL is a leap forward in making high-performance RL agents practical in the real world.