Reinforcement Learning (RL) has given us agents that can master complex video games, control simulated robots, and even grasp real-world objects. However, there’s a catch that has long plagued the field: RL is notoriously data-hungry.

An agent often needs millions of interactions with its environment to learn a task. In a fast simulation, that’s fine—but in the real world, where a robot arm might take seconds to perform a single action, this can translate to months or years of training.

When the agent is learning directly from raw pixels—like the feed from a camera—the problem gets worse. While humans can effortlessly make sense of visual input, for an RL agent, an image is just a high-dimensional array of numbers. It’s far easier for an agent to learn if it’s given clean, structured “state” information, like joint angles or positions. But in many real-world scenarios, such perfect state information does not exist. We only have pixels.

This sample inefficiency has been a major bottleneck, holding back RL in domains like robotics. But what if we could teach an agent to understand those pixels—to extract the meaningful, high-level features—before it even tries to learn control?

That’s the core idea behind CURL (Contrastive Unsupervised Representations for Reinforcement Learning), developed by researchers at UC Berkeley. CURL marries standard RL algorithms with a powerful contrastive learning technique from computer vision. The result: an agent that learns from pixels with unprecedented sample efficiency, nearly matching the performance of agents with privileged access to the true state of the environment.


Background: RL, Self-Supervision, and Contrastive Learning

Before diving into CURL, let’s review some fundamentals.

Reinforcement Learning from Pixels

Many image-based RL algorithms follow an actor-critic framework:

  • Actor (Policy): Chooses actions to maximize expected future rewards.
  • Critic (Value Function): Evaluates how good each action is, predicting future rewards.

The actor and critic train each other—the critic guides the actor, and the actor’s exploration generates experiences for the critic.

For continuous control tasks in the DeepMind Control Suite, CURL uses Soft Actor-Critic (SAC), a powerful off-policy algorithm. SAC optimizes not only for high reward but also for high policy entropy, encouraging exploration.

The SAC critic \( Q_{\phi_i} \) minimizes the Bellman error:

\[ \mathcal{L}(\phi_i, \mathcal{B}) = \mathbb{E}_{t \sim \mathcal{B}} \left[ \left( Q_{\phi_i}(o, a) - (r + \gamma(1 - d)\mathcal{T}) \right)^2 \right] \]

with target:

\[ \mathcal{T} = \left(\min_{i=1,2} Q^*_{\phi_i}(o',a') - \alpha \log \pi_{\psi}(a'|o')\right) \]

Here, \( \alpha \) is the entropy coefficient, \( a' \) is sampled from the current policy \( \pi_{\psi}(\cdot|o') \), and \( Q^*_{\phi_i} \) denotes the target critics, whose parameters are an exponential moving average (EMA) of the critic parameters.

The actor \( \pi_{\psi} \) maximizes:

\[ \mathcal{L}(\psi) = \mathbb{E}_{a \sim \pi} \left[ Q^{\pi}(o, a) - \alpha \log \pi_{\psi}(a|o) \right] \]
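
To make these updates concrete, here is a minimal PyTorch-style sketch of the critic target and the two losses above. The names (`actor`, `critic`, `critic_target`, `alpha`, and the batch fields) are illustrative placeholders rather than CURL's actual implementation, and their interfaces are assumed for the sake of the example.

```python
import torch
import torch.nn.functional as F

def sac_losses(batch, actor, critic, critic_target, alpha, gamma=0.99):
    """Sketch of the SAC critic and actor losses described above.

    `critic(o, a)` is assumed to return the pair (Q1, Q2), and `actor(o)`
    to return (sampled_action, log_prob). All names are illustrative.
    """
    o, a, r, o_next, done = batch  # transition tuple from the replay buffer

    # Target: r + gamma * (1 - d) * (min_i Q*_i(o', a') - alpha * log pi(a'|o'))
    with torch.no_grad():
        a_next, log_pi_next = actor(o_next)
        q1_t, q2_t = critic_target(o_next, a_next)
        target = r + gamma * (1.0 - done) * (torch.min(q1_t, q2_t) - alpha * log_pi_next)

    # Bellman error for both critics
    q1, q2 = critic(o, a)
    critic_loss = F.mse_loss(q1, target) + F.mse_loss(q2, target)

    # Actor objective: maximize Q - alpha * log pi (minimize the negative)
    a_pi, log_pi = actor(o)
    q1_pi, q2_pi = critic(o, a_pi)
    actor_loss = (alpha * log_pi - torch.min(q1_pi, q2_pi)).mean()

    return critic_loss, actor_loss
```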

For discrete control in Atari games, CURL uses a data-efficient variant of Rainbow DQN (Efficient Rainbow), which combines several improvements to the original DQN.


Contrastive Learning in a Nutshell

We want the RL agent to map raw pixels into a compact, robust feature space before it learns control. Self-supervised learning, and contrastive learning in particular, is well suited to this.

Contrastive learning teaches an encoder to embed similar inputs close together and dissimilar inputs far apart. It is commonly framed as a dictionary lookup:

  1. Anchor (Query): An original image.
  2. Positive: A transformed version of the same image (e.g., cropped).
  3. Negatives: Other images in the batch.

Figure 1. Conceptual overview of contrastive learning: an observation o is augmented into a query o_q and a key o_k, which two encoders map to feature vectors q and k; the contrastive loss pulls augmented views of the same observation closer together in feature space than other observations.

The InfoNCE loss formalizes this:

\[ \mathcal{L}_{q} = \log \frac{\exp(q^{T}Wk_{+})}{\exp(q^{T}Wk_{+}) + \sum_{i=0}^{K-1}\exp(q^{T}Wk_{i})} \]

where \( k_+ \) is the positive key, the \( k_i \) are the negative keys, and \( W \) is a learned bilinear similarity matrix. Optimizing this objective drives the encoder to capture semantic content while ignoring irrelevant variations.
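
As a concrete illustration, the bilinear InfoNCE objective can be computed with one matrix product followed by a cross-entropy loss, treating each query's own key as the positive and every other key in the batch as a negative. This is a generic sketch under those assumptions, not a reference implementation; `q`, `k`, and `W` correspond to the symbols in the equation above.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(q, k, W):
    """Bilinear InfoNCE: q and k are (batch, dim) feature matrices and W is a
    learned (dim, dim) parameter. For query i, key i is the positive and
    every other key in the batch serves as a negative."""
    logits = q @ W @ k.t()                                      # (batch, batch) similarities
    logits = logits - logits.max(dim=1, keepdim=True).values    # numerical stability
    labels = torch.arange(q.size(0), device=q.device)           # positives on the diagonal
    return F.cross_entropy(logits, labels)
```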


How CURL Works

CURL weaves contrastive learning directly into the RL loop—no separate pre-training phase. Representation learning and policy learning happen together.

Figure 2. The CURL architecture: a batch sampled from the replay buffer is augmented into query and key observations and passed through the query and key encoders. The query features feed the RL objective, while query and key pairs feed the contrastive loss; the query encoder is updated by both losses and the key encoder by momentum averaging.

Step 1: Sample & Augment

From the replay buffer, CURL samples a batch of past observations. Each observation is augmented twice with a random crop to produce:

  • Query \( o_q \)
  • Key \( o_k \)

Figure 3. The random crop augmentation: an input image is cropped twice to create an anchor (query) and a positive (key), with identical crop coordinates applied to every frame in a stack to maintain temporal consistency.

In RL, observations are often stacks of frames. CURL applies identical random crop coordinates across all frames in the stack, preserving the temporal structure.
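
A minimal sketch of this augmentation is below. It assumes observations are stored as arrays of shape (batch, channels, height, width), with stacked frames concatenated along the channel axis, so cropping the full channel axis at once applies identical crop coordinates to every frame in the stack; the 84×84 output size is a common choice, not a requirement.

```python
import numpy as np

def random_crop(obs_batch, out_size=84):
    """Randomly crop a batch of frame-stacked observations.

    obs_batch: np.ndarray of shape (B, C, H, W), where C already contains the
    stacked frames (e.g. 3 frames x 3 RGB channels = 9). One crop position is
    drawn per observation and applied to all of its channels/frames at once.
    """
    b, c, h, w = obs_batch.shape
    tops = np.random.randint(0, h - out_size + 1, size=b)
    lefts = np.random.randint(0, w - out_size + 1, size=b)
    cropped = np.empty((b, c, out_size, out_size), dtype=obs_batch.dtype)
    for i, (t, l) in enumerate(zip(tops, lefts)):
        cropped[i] = obs_batch[i, :, t:t + out_size, l:l + out_size]
    return cropped
```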

Step 2: Encode

  • Query Encoder \( f_{\theta} \): Encodes \( o_q \) into feature vector \( q \), used by the SAC/Rainbow actor and critic.
  • Key Encoder \( f_{\theta_k} \): Encodes \( o_k \) into feature vector \( k \).
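
For reference, a pixel encoder in this setting is typically a small convolutional network followed by a linear projection and layer normalization. The sketch below uses plausible but illustrative layer sizes and assumes 84×84 cropped inputs; it is not the exact published architecture.

```python
import torch
import torch.nn as nn

class PixelEncoder(nn.Module):
    """Small convolutional encoder mapping stacked frames to a feature vector.
    Layer sizes are illustrative and assume 84x84 inputs after cropping."""

    def __init__(self, in_channels=9, feature_dim=50):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 35 * 35, feature_dim),  # 35x35 spatial map for 84x84 inputs
            nn.LayerNorm(feature_dim),
        )

    def forward(self, obs):
        return self.net(obs / 255.0)  # scale raw uint8-style pixel values to [0, 1]
```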

Step 3: Momentum Encoder Trick

Following MoCo, the key encoder’s weights are updated as an exponential moving average (EMA) of the query encoder’s weights:

\[ \theta_k \leftarrow m \theta_k + (1 - m) \theta \]

This creates stable key representations, improving contrastive learning.
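
In code, the momentum update is a parameter-wise interpolation applied after each training step. A minimal sketch, with an illustrative momentum value:

```python
import torch

def momentum_update(query_encoder, key_encoder, m=0.95):
    """Update key-encoder weights as an exponential moving average of the
    query encoder: theta_k <- m * theta_k + (1 - m) * theta.
    The value of m here is illustrative."""
    with torch.no_grad():
        for theta_k, theta in zip(key_encoder.parameters(), query_encoder.parameters()):
            theta_k.mul_(m).add_((1.0 - m) * theta)
```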

Step 4: Joint Optimization

The query encoder updates from:

  1. RL Loss: SAC or Rainbow losses using \( q \).
  2. Contrastive Loss: InfoNCE computed on \( q \) vs. \( k \) (positive and negatives).

This dual pressure creates features that are semantically rich and control-relevant.
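
Putting the pieces together, one CURL-style update might look like the following sketch. It reuses the hypothetical helpers from the earlier sketches (`random_crop`, `info_nce_loss`, `momentum_update`) and stands in for, rather than reproduces, the authors' training loop.

```python
import torch

def curl_update(obs_batch, rl_batch, query_encoder, key_encoder, W,
                rl_loss_fn, optimizer, m=0.95):
    """One joint update: contrastive (InfoNCE) loss plus the RL loss, both
    flowing into the query encoder. The optimizer is assumed to hold the
    query-encoder parameters, W, and the RL networks; all names are illustrative."""
    # Two independent random crops of the same observations
    o_q = torch.as_tensor(random_crop(obs_batch), dtype=torch.float32)
    o_k = torch.as_tensor(random_crop(obs_batch), dtype=torch.float32)

    q = query_encoder(o_q)          # gradients flow into the query encoder
    with torch.no_grad():
        k = key_encoder(o_k)        # keys come from the momentum encoder

    loss = info_nce_loss(q, k, W) + rl_loss_fn(rl_batch)  # joint objective

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    momentum_update(query_encoder, key_encoder, m)  # EMA key-encoder update
```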


Results

The authors benchmarked CURL on the DeepMind Control Suite (DMControl) for continuous control and on the Atari100k benchmark for discrete control.

DeepMind Control Suite

Table 1. DMControl results at 100k and 500k environment steps: CURL achieves state-of-the-art scores on most benchmark environments.

Figure 4. Median scores on DMControl: CURL outperforms prior pixel-based RL algorithms at 100k steps and matches the median score of State SAC at 500k steps.

Figure 6. Data efficiency relative to Dreamer: Dreamer requires 4.5× more environment steps to match CURL’s performance at 100k steps.

Figure 7. Learning curves for CURL vs. State SAC: CURL nearly matches State SAC across many DMControl environments, closing much of the gap between pixel-based and state-based performance.


Atari100k

Table 2. Atari100k results: CURL improves on Efficient Rainbow in 19 of 26 games and surpasses human-level scores on JamesBond and Krull.


Ablation Insights

CURL’s design choices were key to its success:

  • Momentum Encoder & Bilinear Similarity: Replacing the momentum (EMA) key encoder with an ordinary shared encoder, or the bilinear similarity with cosine similarity, hurt performance.

    Figure 5. The EMA key encoder and bilinear similarity yield stronger and more stable learning.

  • Temporal Dynamics: Contrasting stacks of frames beats contrasting single frames on tasks where motion matters.

    Figure 8. Temporal information is vital for coordination-intensive tasks.

  • Reward-Agnostic Representations: Detaching the encoder from the RL loss, so it is trained by the contrastive objective alone, still produces near-optimal policies.

    Figure 9. Contrastive-only training of the encoder yields general-purpose features that remain useful for control.

  • State Prediction Gap: Tasks where the true state is hardest to predict from pixels (e.g., humanoid-run) are exactly where CURL lags furthest behind state-based SAC.

    Figure 11. The predictability of state from pixels correlates with CURL’s performance relative to State SAC.


Conclusion

CURL delivers a simple yet powerful recipe for overcoming sample inefficiency in pixel-based RL:

  1. Minimal Complexity, Maximum Gain: Just add a contrastive objective with data augmentation to a strong RL baseline.
  2. Robust Representation Learning: Instance discrimination shapes an encoder that understands visual semantics.
  3. Closing the Gap: Achieves near-oracle performance from pixels—closing in on state-based baselines.

This work sets a new standard for sample-efficient RL from pixels and has inspired a wave of research into data augmentation and self-supervision in RL. By sharply reducing the amount of environment interaction needed to learn from pixels, CURL is a leap forward in making high-performance RL agents practical in the real world.