The past decade has shown us the incredible power of large datasets. From ImageNet fueling the computer vision revolution to massive text corpora enabling models like GPT, it’s clear: data is the lifeblood of modern machine learning. Yet one of the most exciting fields—Reinforcement Learning (RL)—has largely been excluded from this data-driven paradigm.
Traditionally, RL agents learn through active, online interaction with an environment—playing games, controlling robots, simulating trades—building policies through trial and error. This approach is powerful but often impractical, expensive, or dangerous in real-world contexts. We can’t let a self-driving car “explore” by crashing thousands of times or experiment recklessly in healthcare.
What if we could leverage the vast amounts of existing data to train RL agents? Think logs from human-driven cars, medical treatment records, or user interaction data from a website. This is the promise of Offline Reinforcement Learning (also called batch RL): learning effective policies from a fixed dataset, without any further environment interaction. Offline RL bridges the data-rich world of supervised learning with the sequential decision-making power of RL.
The catch? Until recently, there was no proper proving ground. Offline RL algorithms were tested on datasets collected in controlled settings that didn’t reflect the complexity, messiness, and bias of real-world data. This created a misleading sense of progress. The 2020 paper “D4RL: Datasets for Deep Data-Driven Reinforcement Learning”, from UC Berkeley and Google Brain researchers, tackled this head-on. They created a benchmark designed to push offline RL to its limits—and expose its weak points.
In this post, we’ll unpack why D4RL was necessary, explore the design principles behind it, and see what its challenging tasks reveal about the current state of offline RL.
Figure 1: A selection of proposed benchmark tasks includes navigation mazes, urban driving, traffic flow control, and robotic manipulation.
The Offline RL Challenge: Learning in Handcuffs
To appreciate D4RL’s impact, we need to grasp why offline RL is hard.
In standard online RL, the agent continually interacts with the environment: observe state → act → receive reward → repeat. If it encounters an unfamiliar situation, it can explore to gather new experience. The training distribution evolves along with the agent’s policy.
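In gym-style code, that loop looks roughly like the sketch below (written against the classic gym interface that D4RL-era code targets; newer gym/gymnasium versions changed the reset/step signatures). Everything the agent learns from is generated on the fly:

```python
import gym

env = gym.make("CartPole-v1")
obs = env.reset()
for _ in range(1000):
    action = env.action_space.sample()       # stand-in for the agent's current policy
    obs, reward, done, _ = env.step(action)  # observe state -> act -> receive reward -> repeat
    # an online agent would update its policy here, then keep interacting and exploring
    if done:
        obs = env.reset()
```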
In offline RL, the agent is handed a fixed dataset \(\mathcal{D}\) generated by some unknown “behavior” policy \(\pi_B\). Its goal is to learn a new policy \(\pi\) that maximizes rewards—but it can only use the data in \(\mathcal{D}\). It cannot try new actions and see what happens.
This creates the distribution shift problem: the learned policy \(\pi\) will propose actions different from those in the dataset. Value estimates for these out-of-distribution actions are unreliable, and in RL those errors propagate through bootstrapped value targets, compounding until the learned policy collapses.
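To make the issue concrete, consider a value-based method. The agent fits a Q-function to the fixed dataset using bootstrapped targets of the form

\[
Q(s, a) \;\leftarrow\; r + \gamma \, \mathbb{E}_{a' \sim \pi(\cdot \mid s')} \big[ Q(s', a') \big],
\]

where the transition \((s, a, r, s')\) comes from \(\mathcal{D}\), but the next action \(a'\) comes from the learned policy \(\pi\). Whenever \(\pi\) picks actions the behavior policy \(\pi_B\) never took, \(Q(s', a')\) is queried outside the support of the data, and its (often optimistic) errors get copied into the targets and amplified over repeated updates.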
Early offline RL benchmarks avoided the worst of these issues, using clean datasets from online-trained agents. D4RL’s authors argue that real data is much messier, and a benchmark must reflect that.
Designing a Better Benchmark: D4RL’s Core Principles
D4RL was built to surface challenges common in real-world applications but underrepresented in prior offline RL benchmarks. The authors focused on five testbed properties:
1. Narrow and Biased Data
Real datasets often come from deterministic policies or experts following one consistent routine. Such data covers only a small fraction of the state-action space, making generalization difficult and overfitting more likely. D4RL includes human demonstrations and hand-designed-controller datasets to stress-test algorithms under exactly this condition.
2. Undirected and Multitask Data
Passively logged data rarely follows the single trajectory needed to solve a target task; the dataset might contain snippets of useful behaviors without any complete solution path. An offline agent must therefore stitch together sub-trajectories into a successful solution.
Figure 2: Illustration of stitching—combining segments of different trajectories to create a new path.
3. Sparse Rewards
In sparse reward problems, the only feedback arrives upon task success, making credit assignment difficult. In the offline setting, exploration is removed from the equation, which isolates an algorithm’s ability to trace rewards back through long chains of actions.
4. Suboptimal and Mixed-Quality Data
Many datasets blend expert behavior with middling or poor decisions. A strong offline RL algorithm should learn a policy better than the average behavior, not simply imitate it. D4RL explicitly tests this with mixed-policy datasets.
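A quick way to see how mixed a given dataset is: split the logged transitions into episodes and compare the average episode return with the best one. The sketch below assumes the D4RL-style layout of flat arrays keyed by 'rewards', 'terminals', and 'timeouts'. An algorithm that merely imitates the data should land near the mean; a good offline RL method should approach or exceed the best logged return.

```python
import numpy as np

def per_episode_returns(dataset):
    """Split a flat offline dataset into episodes and sum rewards per episode.

    Assumes the D4RL-style dict layout: parallel 1-D arrays under the keys
    'rewards', 'terminals', and (optionally) 'timeouts'; an episode ends when
    either flag is set. A trailing incomplete episode is ignored.
    """
    rewards = np.asarray(dataset["rewards"], dtype=np.float64)
    ends = np.asarray(dataset["terminals"], dtype=bool)
    if "timeouts" in dataset:
        ends = ends | np.asarray(dataset["timeouts"], dtype=bool)

    returns, episode_return = [], 0.0
    for r, done in zip(rewards, ends):
        episode_return += r
        if done:
            returns.append(episode_return)
            episode_return = 0.0
    return np.array(returns)

# Example (with a loaded D4RL environment):
# returns = per_episode_returns(env.get_dataset())
# print(f"mean {returns.mean():.1f}, best {returns.max():.1f}, worst {returns.min():.1f}")
```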
5. Realistic and Complex Data Sources
Offline RL for realistic domains means handling:
- Human demonstrations with rich variability and possible non-Markovian traits.
- Hand-crafted controllers that may behave deterministically.
- Partial observability, as in visual autonomous driving, where state information is incomplete.
A Tour of D4RL’s Environments
D4RL integrates diverse domains, each designed to probe one or more of the above principles.
Maze2D & AntMaze: Stitching Under Pressure
Maze2D tasks involve navigating a point-mass through a maze to a goal. AntMaze adds complexity: controlling an 8-DoF quadruped robot with sparse rewards. In both, a planner generates trajectories toward random goals, unrelated to evaluation goals. Success requires stitching disparate path fragments.
Figure 3: Maze2D layouts demonstrate simple and moderate navigation challenges.
Figure 4: AntMaze adds dynamic complexity and sparse reward conditions.
Gym-MuJoCo: Classics, Reimagined
Hopper, HalfCheetah, and Walker2d locomotion tasks are staples in RL. D4RL reframes them for offline learning with:
- random: Data from untrained policies.
- medium: Partially-trained policy data.
- medium-replay: Data accumulated during medium policy training.
- medium-expert: 50% medium + 50% expert data, testing mixed-policy handling.
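If you want to poke at these datasets yourself, the released d4rl package exposes them through Gym. Here is a minimal loading sketch; the dataset IDs follow the naming above, and the exact version suffix depends on the d4rl release you have installed:

```python
import gym
import d4rl  # importing d4rl registers the offline datasets with gym

# Dataset IDs combine an environment with a data-quality flavor, e.g.
# 'halfcheetah-medium-v0', 'hopper-medium-expert-v0', 'walker2d-random-v0'.
env = gym.make("halfcheetah-medium-v0")

# qlearning_dataset() returns transitions ready for off-policy training:
# observations, actions, next_observations, rewards, terminals.
dataset = d4rl.qlearning_dataset(env)
print({k: v.shape for k, v in dataset.items()})
```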
Figure 5: Locomotion benchmarks adapted for offline RL contexts.
Adroit: Dexterous Manipulation from Humans
Adroit tasks involve controlling a 24-DoF simulated hand for precision manipulations—hammering, opening doors, and more. Sparse rewards make online RL struggle here. D4RL includes human demonstration datasets and expert policies fine-tuned on such demos, measuring the ability to learn from narrow, high-quality human data.
Figure 6: Dexterous manipulation tasks with human-provided data.
Franka Kitchen & CARLA: Realism and Generalization
For realistic multitask and sensory-challenging domains:
Franka Kitchen:
A robotic arm must complete combinations of kitchen tasks. The hardest dataset (mixed) contains only partial-task trajectories, requiring the composition of sub-skills never demonstrated together.
CARLA:
A high-fidelity driving simulator offering first-person RGB visual inputs. Agents must follow lanes or navigate small towns using controller-generated data, facing visual complexity and partial observability.
Figure 7: Franka Kitchen multitask setup.
Figure 8: CARLA tests perception and control under partial observability.
How Did Existing Algorithms Perform?
D4RL’s authors didn’t just design the benchmark; they also evaluated leading offline RL algorithms under a strict protocol: tune hyperparameters on designated training tasks, then measure performance on held-out evaluation tasks. This prevents overfitting to the evaluation settings and yields more realistic performance estimates.
Overall results (Figure 9) revealed stark gaps between top-performing methods and expert-level scores.
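The paper reports normalized scores so that results are comparable across environments: 0 corresponds to a random policy and 100 to a domain-specific expert,

\[
\text{normalized score} \;=\; 100 \times \frac{\text{score} - \text{random score}}{\text{expert score} - \text{random score}}.
\]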
Figure 9: Average normalized performance across all domains. None of the methods consistently match expert-level performance.
Key insights:
- Familiar Territory Wins: The highest scores came from the Gym-MuJoCo and Adroit expert datasets, which most closely resemble the clean, single-policy data that prior offline RL work was developed on.
- Stitching is Tough: Tasks needing trajectory stitching (Maze2D, AntMaze, Franka Kitchen mixed) stumped most methods, underscoring the importance of compositionality and generalization.
- Mixed-Data Pitfalls: Medium-expert datasets showed little gain over medium-only ones, suggesting that current methods struggle to exploit the high-quality portion of a mixed dataset.
- Sparse Reward Promise: Offline methods often beat online SAC on Adroit and AntMaze, highlighting offline RL’s potential in settings where exploration is the bottleneck.
Figure 10: Domain-wise performance differences highlight where algorithms excel and fail.
Conclusion: Setting a New Standard
The D4RL paper is a landmark in offline reinforcement learning. It convincingly argues that progress demands benchmarks mirroring the messy complexity of real-world data—not just clean, synthetic datasets.
By targeting stitching, mixed-quality handling, narrow expert demos, and realistic sensory inputs, D4RL paints a fuller picture of algorithm strengths and weaknesses. It shows that while distribution shift is better understood, generalization, compositionality, and credit assignment remain open problems.
D4RL is both a diagnostic tool and a development platform. With a widely accessible, challenging suite of tasks, it provides a common ground to build and test new ideas—accelerating the journey toward RL agents that can learn safely and effectively from the wealth of data already at our fingertips.