The past decade has shown us the incredible power of large datasets. From ImageNet fueling the computer vision revolution to massive text corpora enabling models like GPT, it’s clear: data is the lifeblood of modern machine learning. Yet one of the most exciting fields—Reinforcement Learning (RL)—has largely been excluded from this data-driven paradigm.
Traditionally, RL agents learn through active, online interaction with an environment—playing games, controlling robots, simulating trades—building policies through trial and error. This approach is powerful but often impractical, expensive, or dangerous in real-world contexts. We can’t let a self-driving car “explore” by crashing thousands of times or experiment recklessly in healthcare.
What if we could leverage the vast amounts of existing data to train RL agents? Think logs from human-driven cars, medical treatment records, or user interaction data from a website. This is the promise of Offline Reinforcement Learning (also called batch RL): learning effective policies from a fixed dataset, without any further environment interaction. Offline RL bridges the data-rich world of supervised learning with the sequential decision-making power of RL.
The catch? Until recently, there was no proper proving ground. Offline RL algorithms were tested on datasets collected in controlled settings that didn’t reflect the complexity, messiness, and bias of real-world data. This created a misleading sense of progress. The 2020 paper “D4RL: Datasets for Deep Data-Driven Reinforcement Learning”, from UC Berkeley and Google Brain researchers, tackled this head-on. They created a benchmark designed to push offline RL to its limits—and expose its weak points.
In this post, we’ll unpack why D4RL was necessary, explore the design principles behind it, and see what its challenging tasks reveal about the current state of offline RL.
Figure 1: A selection of proposed benchmark tasks includes navigation mazes, urban driving, traffic flow control, and robotic manipulation.
The Offline RL Challenge: Learning in Handcuffs
To appreciate D4RL’s impact, we need to grasp why offline RL is hard.
In standard online RL, the agent continually interacts with the environment: observe state → act → receive reward → repeat. If it encounters an unfamiliar situation, it can explore to gather new experience. The training distribution evolves along with the agent’s policy.
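In gym-style code, that loop looks roughly like the sketch below (written against the classic gym interface that D4RL-era code targets; newer gym/gymnasium versions changed the reset/step signatures). Everything the agent learns from is generated on the fly:

```python
import gym

env = gym.make("CartPole-v1")
obs = env.reset()
for _ in range(1000):
    action = env.action_space.sample()       # stand-in for the agent's current policy
    obs, reward, done, _ = env.step(action)  # observe state -> act -> receive reward -> repeat
    # an online agent would update its policy here, then keep interacting and exploring
    if done:
        obs = env.reset()
```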
In offline RL, the agent is handed a fixed dataset \(\mathcal{D}\) generated by some unknown “behavior” policy \(\pi_B\). Its goal is to learn a new policy \(\pi\) that maximizes rewards—but it can only use the data in \(\mathcal{D}\). It cannot try new actions and see what happens.
This creates the distribution shift problem: the learned policy \(\pi\) will propose actions different from those in the dataset. Value estimates for these out-of-distribution actions are unreliable, and in RL those errors propagate through bootstrapped value targets, compounding until the learned policy collapses.
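To make the issue concrete, consider a value-based method. The agent fits a Q-function to the fixed dataset using bootstrapped targets of the form

\[
Q(s, a) \;\leftarrow\; r + \gamma \, \mathbb{E}_{a' \sim \pi(\cdot \mid s')} \big[ Q(s', a') \big],
\]

where the transition \((s, a, r, s')\) comes from \(\mathcal{D}\), but the next action \(a'\) comes from the learned policy \(\pi\). Whenever \(\pi\) picks actions the behavior policy \(\pi_B\) never took, \(Q(s', a')\) is queried outside the support of the data, and its (often optimistic) errors get copied into the targets and amplified over repeated updates.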
Early offline RL benchmarks avoided the worst of these issues, using clean datasets from online-trained agents. D4RL’s authors argue that real data is much messier, and a benchmark must reflect that.
Designing a Better Benchmark: D4RL’s Core Principles
D4RL was built to surface challenges common in real-world applications but underrepresented in prior offline RL benchmarks. The authors focused on five testbed properties:
1. Narrow and Biased Data
Real datasets often come from deterministic policies or experts following one consistent routine. Such data covers only a small fraction of the state-action space, making generalization difficult and overfitting more likely. D4RL includes human demonstrations and hand-designed-controller datasets to stress-test algorithms under exactly this condition.
2. Undirected and Multitask Data
Passively logged data rarely follows the single trajectory needed to solve a target task; the dataset might contain snippets of useful behaviors without any complete solution path. An offline agent must therefore stitch together sub-trajectories into a successful solution.
Figure 2: Illustration of stitching—combining segments of different trajectories to create a new path.
3. Sparse Rewards
In sparse reward problems, the only feedback arrives upon task success, making credit assignment difficult. In the offline setting, exploration is removed from the equation, which isolates an algorithm’s ability to trace rewards back through long chains of actions.
4. Suboptimal and Mixed-Quality Data
Many datasets blend expert behavior with middling or poor decisions. A strong offline RL algorithm should learn a policy better than the average behavior, not simply imitate it. D4RL explicitly tests this with mixed-policy datasets.
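A quick way to see how mixed a given dataset is: split the logged transitions into episodes and compare the average episode return with the best one. The sketch below assumes the D4RL-style layout of flat arrays keyed by 'rewards', 'terminals', and 'timeouts'. An algorithm that merely imitates the data should land near the mean; a good offline RL method should approach or exceed the best logged return.

```python
import numpy as np

def per_episode_returns(dataset):
    """Split a flat offline dataset into episodes and sum rewards per episode.

    Assumes the D4RL-style dict layout: parallel 1-D arrays under the keys
    'rewards', 'terminals', and (optionally) 'timeouts'; an episode ends when
    either flag is set. A trailing incomplete episode is ignored.
    """
    rewards = np.asarray(dataset["rewards"], dtype=np.float64)
    ends = np.asarray(dataset["terminals"], dtype=bool)
    if "timeouts" in dataset:
        ends = ends | np.asarray(dataset["timeouts"], dtype=bool)

    returns, episode_return = [], 0.0
    for r, done in zip(rewards, ends):
        episode_return += r
        if done:
            returns.append(episode_return)
            episode_return = 0.0
    return np.array(returns)

# Example (with a loaded D4RL environment):
# returns = per_episode_returns(env.get_dataset())
# print(f"mean {returns.mean():.1f}, best {returns.max():.1f}, worst {returns.min():.1f}")
```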
5. Realistic and Complex Data Sources
Offline RL for realistic domains means handling:
- Human demonstrations with rich variability and possible non-Markovian traits.
- Hand-crafted controllers that may behave deterministically.
- Partial observability, as in visual autonomous driving, where state information is incomplete.
A Tour of D4RL’s Environments
D4RL integrates diverse domains, each designed to probe one or more of the above principles.
Maze2D & AntMaze: Stitching Under Pressure
Maze2D tasks involve navigating a point-mass through a maze to a goal. AntMaze adds complexity: controlling an 8-DoF quadruped robot with sparse rewards. In both, a planner generates trajectories toward random goals, unrelated to evaluation goals. Success requires stitching disparate path fragments.
Figure 3: Maze2D layouts demonstrate simple and moderate navigation challenges.
Figure 4: AntMaze adds dynamic complexity and sparse reward conditions.
Gym-MuJoCo: Classics, Reimagined
Hopper, HalfCheetah, and Walker2d locomotion tasks are staples in RL. D4RL reframes them for offline learning with:
- random: Data from untrained policies.
- medium: Partially-trained policy data.
- medium-replay: Data accumulated during medium policy training.
- medium-expert: 50% medium + 50% expert data, testing mixed-policy handling.
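If you want to poke at these datasets yourself, the released d4rl package exposes them through Gym. Here is a minimal loading sketch; the dataset IDs follow the naming above, and the exact version suffix depends on the d4rl release you have installed:

```python
import gym
import d4rl  # importing d4rl registers the offline datasets with gym

# Dataset IDs combine an environment with a data-quality flavor, e.g.
# 'halfcheetah-medium-v0', 'hopper-medium-expert-v0', 'walker2d-random-v0'.
env = gym.make("halfcheetah-medium-v0")

# qlearning_dataset() returns transitions ready for off-policy training:
# observations, actions, next_observations, rewards, terminals.
dataset = d4rl.qlearning_dataset(env)
print({k: v.shape for k, v in dataset.items()})
```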
Figure 5: Locomotion benchmarks adapted for offline RL contexts.
Adroit: Dexterous Manipulation from Humans
Adroit tasks involve controlling a 24-DoF simulated hand for precision manipulations—hammering, opening doors, and more. Sparse rewards make online RL struggle here. D4RL includes human demonstration datasets and expert policies fine-tuned on such demos, measuring the ability to learn from narrow, high-quality human data.
Figure 6: Dexterous manipulation tasks with human-provided data.
Franka Kitchen & CARLA: Realism and Generalization
For realistic multitask and sensory-challenging domains:
Franka Kitchen:
A robotic arm must complete combinations of kitchen tasks. The hardest dataset (mixed) contains only partial-task trajectories, requiring the composition of sub-skills never demonstrated together.
CARLA:
A high-fidelity driving simulator offering first-person RGB visual inputs. Agents must follow lanes or navigate small towns using controller-generated data, facing visual complexity and partial observability.
Figure 7: Franka Kitchen multitask setup.
Figure 8: CARLA tests perception and control under partial observability.
How Did Existing Algorithms Perform?
D4RL’s authors didn’t just design the benchmark; they also evaluated leading offline RL algorithms under a strict protocol: tune hyperparameters on designated training tasks, then measure performance on held-out evaluation tasks. This prevents overfitting to the evaluation settings and yields more realistic performance estimates.
Overall results (Figure 9) revealed stark gaps between top-performing methods and expert-level scores.
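The paper reports normalized scores so that results are comparable across environments: 0 corresponds to a random policy and 100 to a domain-specific expert,

\[
\text{normalized score} \;=\; 100 \times \frac{\text{score} - \text{random score}}{\text{expert score} - \text{random score}}.
\]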
Figure 9: Average normalized performance across all domains. None of the methods consistently match expert-level performance.
Key insights:
- Familiar Territory Wins: The highest scores came from the Gym-MuJoCo and Adroit expert datasets, which most closely resemble the clean, single-policy data that prior offline RL work was developed on.
- Stitching is Tough: Tasks needing trajectory stitching (Maze2D, AntMaze, Franka Kitchen mixed) stumped most methods, underscoring the importance of compositionality and generalization.
- Mixed-Data Pitfalls: Medium-expert datasets showed little gain over medium-only ones, suggesting that current methods struggle to exploit the high-quality portion of a mixed dataset.
- Sparse Reward Promise: Offline methods often beat online SAC on Adroit and AntMaze, highlighting offline RL’s potential in settings where exploration is the bottleneck.
Figure 10: Domain-wise performance differences highlight where algorithms excel and fail.
Conclusion: Setting a New Standard
The D4RL paper is a landmark in offline reinforcement learning. It convincingly argues that progress demands benchmarks mirroring the messy complexity of real-world data—not just clean, synthetic datasets.
By targeting stitching, mixed-quality handling, narrow expert demos, and realistic sensory inputs, D4RL paints a fuller picture of algorithm strengths and weaknesses. It shows that while distribution shift is better understood, generalization, compositionality, and credit assignment remain open problems.
D4RL is both a diagnostic tool and a development platform. With a widely accessible, challenging suite of tasks, it provides a common ground to build and test new ideas—accelerating the journey toward RL agents that can learn safely and effectively from the wealth of data already at our fingertips.