Introduction

Imagine trying to learn how to ride a bicycle just by watching videos of professional cyclists. You’ve never touched the pedals yourself. If you suddenly hopped on a bike, you might assume you can perform a wheelie because you saw it in a video, but in reality, you’d likely fall over.

This is the central challenge of Offline Reinforcement Learning (Offline RL). We want to train agents to make optimal decisions using only a static dataset of previously collected experiences, without letting them interact with the dangerous real world during training.

The problem is that neural networks are optimists. When an offline RL agent considers an action it hasn’t seen in the dataset (an Out-of-Distribution or OOD action), it often hallucinates that this action will yield a massive reward. This phenomenon is known as extrapolation error. The agent’s “Q-function” (which predicts future rewards) overestimates the value of unknown actions, leading the agent to attempt dangerous or nonsensical maneuvers when deployed.

In a new research paper titled “Penalizing Infeasible Actions and Reward Scaling in Reinforcement Learning with Offline Data”, researchers propose a remarkably effective solution called PARS. By combining two intuitive yet powerful mechanisms—Reward Scaling with Layer Normalization (RS-LN) and Penalizing Infeasible Actions (PA)—they achieve state-of-the-art results without needing complex new architectures.

In this post, we will deconstruct the extrapolation error, explain why standard neural networks fail at boundaries, and dive deep into how PARS fixes this using a clever mix of geometry and signal processing.

The Problem: The Danger of Linear Extrapolation

To understand why Offline RL is hard, we have to look at how Deep Learning models essentially “guess” what happens in unknown territory.

In standard Reinforcement Learning, we use a Q-function, \(Q(s, a)\), to estimate the expected reward of taking action \(a\) in state \(s\). We want to find the action that maximizes this value.

The issue arises from the architecture of the neural networks themselves. Most modern networks use ReLU (Rectified Linear Unit) activation functions. While effective, ReLU networks have a specific quirk: outside the range of their training data, they tend to extrapolate linearly. If the Q-values were trending upward at the edge of the dataset, the network assumes they will keep going up forever.
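To see this concretely, here is a tiny self-contained check (not from the paper; the network and inputs are arbitrary). A one-hidden-layer ReLU network is piecewise linear, and once the input moves past its last “kink,” the output keeps changing at a constant rate forever; nothing makes it curve back down on its own:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# A one-hidden-layer ReLU network: a piecewise-linear function of its input.
hidden = nn.Linear(1, 64)
head = nn.Linear(64, 1)
net = nn.Sequential(hidden, nn.ReLU(), head).double()

with torch.no_grad():
    w = hidden.weight.squeeze(-1)        # shape (64,)
    b = hidden.bias
    kinks = -b / w                       # each unit bends the function once, at x = -b/w
    x_far = kinks.max().item() + 1.0     # safely past the last kink

    # Beyond the last kink every ReLU's on/off state is frozen, so the network
    # is exactly affine: equal input steps produce equal output steps.
    xs = torch.tensor([[x_far], [x_far + 1.0], [x_far + 2.0]], dtype=torch.float64)
    ys = net(xs).squeeze(-1)
    print(ys[1] - ys[0], ys[2] - ys[1])  # identical slopes: no downward "hill"
```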

Composition of offline data and comparison between ground truth Q-values and learned Q-function

As shown in Figure 2 above, we can categorize actions into three buckets:

  1. In-Distribution (ID): Actions present in our dataset (green).
  2. OOD-in: Actions the agent hasn’t tried but that lie between data points, inside the “convex hull” of the data (yellow).
  3. OOD-out: Actions completely outside the range of valid behavior (pink).

The red dotted line represents the learned Q-function. Notice how in the “OOD-out” region, the learned curve shoots upward. This is linear extrapolation. The agent sees this rising curve and thinks, “If I take this extreme action, I’ll get infinite reward!” In reality, the Ground Truth (blue dashed line) drops off because those actions are likely failures. This gap is the extrapolation error.

To fix this, we need the network to learn a “hill” shape: high values where the data is good, and a slope downwards everywhere else.

The PARS Solution

The researchers introduce PARS (Penalizing Infeasible Actions and Reward Scaling). It tackles the extrapolation problem from two angles:

  1. Soft Guidance (RS-LN): Changing how the network perceives “similarity” so it stops generalizing high rewards to unknown areas.
  2. Hard Constraints (PA): Explicitly punishing the network for predicting high values in impossible regions.

Let’s break these down.

Part 1: Reward Scaling and Layer Normalization (RS-LN)

This is the most counter-intuitive and fascinating part of the paper. Usually, in Deep Learning, we scale inputs and outputs down to keep training stable. PARS does the opposite for rewards: it scales them up, drastically.

The Resolution Analogy

Why would making rewards larger help? The authors offer a brilliant analogy involving function approximation.

Imagine you are trying to approximate a function \(y=x\) using a step function (a histogram). If you make the function steeper, say \(y=5x\), but keep the input range the same, your approximation error increases—unless you increase the resolution (the number of bins).

Error in approximating \(y = x\) and \(y = 5x\)

As Figure 4 illustrates, a steeper slope (high reward scale) demands a finer resolution to be approximated accurately.

In the context of a neural network, “higher resolution” means the network must learn more high-frequency features. It effectively has to use more of its neurons to fit the data precisely. If the network is forced to learn very specific, sharp peaks for the In-Distribution (ID) data, it becomes less likely to lazily smear those high values out into the OOD regions.
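A quick numerical sanity check of the analogy (illustrative, not from the paper): approximate \(y = cx\) on \([0, 1]\) with a piecewise-constant “histogram” function and measure the worst-case error.

```python
import numpy as np

def max_step_error(c, n_bins, n_test=10_000):
    """Worst-case error of approximating y = c*x on [0, 1] with a
    piecewise-constant (histogram) function using n_bins equal bins."""
    x = np.linspace(0.0, 1.0, n_test)
    bin_idx = np.minimum((x * n_bins).astype(int), n_bins - 1)
    midpoints = (bin_idx + 0.5) / n_bins          # constant value per bin
    return np.max(np.abs(c * x - c * midpoints))

print(max_step_error(c=1, n_bins=10))   # ~0.05
print(max_step_error(c=5, n_bins=10))   # ~0.25 -> steeper slope, 5x the error
print(max_step_error(c=5, n_bins=50))   # ~0.05 -> more bins restore the accuracy
```

The worst-case error is roughly \(c / (2N)\) for \(N\) bins, so a 5x steeper slope needs 5x the resolution to stay equally accurate.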

The Role of Layer Normalization (LN)

However, scaling up rewards on its own causes instability. This is where Layer Normalization (LN) comes in. LN normalizes the network’s internal feature vectors, effectively constraining them to a bounded hypersphere so their magnitude cannot blow up.

The combination of High Reward Scaling (\(c_{reward}\)) + Layer Normalization creates a unique effect:

  1. LN keeps the “input volume” bounded.
  2. High reward scaling demands high expressivity within that bound.
  3. The network is forced to treat the OOD regions as “dissimilar” to the ID regions to maintain the sharp slope required to fit the high rewards.

This reduces the Neural Tangent Kernel (NTK) similarity between valid data points and outliers. In simple terms: updating the network to predict a high reward for a valid action no longer accidentally pulls up the Q-value for a distant, invalid action.
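On the implementation side, RS-LN is refreshingly simple: a standard MLP critic with LayerNorm after each hidden layer, trained on rewards multiplied by a constant. A minimal PyTorch sketch (the class and hyperparameter names are illustrative, not the authors’ code):

```python
import torch
import torch.nn as nn

class LNQNetwork(nn.Module):
    """MLP critic with LayerNorm after every hidden layer (the "LN" in RS-LN)."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.LayerNorm(hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.LayerNorm(hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

# The "RS" half is just a constant multiplier applied to rewards before
# they enter the Bellman target (e.g. c_reward = 100).
c_reward = 100.0
# scaled_reward = c_reward * reward   # used when building the TD target
```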

Results of training on a toy dataset using ReLU MLPs with vanilla regression

The visual proof is in Figure 5. Look at the column “LN (Wider Range)”.

  • \(c_{reward}=1\): The network learns a gentle bowl shape. The OOD regions (pink) are still somewhat high relative to the center.
  • \(c_{reward}=100\): The network learns a sharp, distinct shape. The OOD regions are flattened out near zero, significantly lower than the ID peak.

By increasing the reward scale, the researchers effectively forced the network to stop generalizing blindly. This “soft guidance” naturally depresses the Q-values outside the data distribution.

The Dormant Neuron Phenomenon

An interesting side effect noted in the paper is the reduction of “dormant neurons.” In standard deep RL training, many neurons in a ReLU network eventually stop firing entirely (they go dormant). With RS-LN, the percentage of active neurons increases significantly: the network puts more of its capacity to work fitting the high-amplitude reward landscape, which leads to better feature discrimination.
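For the curious, here is roughly how dormant neurons are usually measured in the literature this observation builds on (a unit counts as dormant when its normalized average activation falls below a small threshold); the paper’s exact metric may differ:

```python
import torch

def dormant_fraction(activations, tau=0.025):
    """Fraction of units whose normalized mean |activation| over a batch
    falls below tau. `activations` has shape (batch, num_units)."""
    score = activations.abs().mean(dim=0)      # per-unit activity
    score = score / (score.mean() + 1e-8)      # normalize by the layer average
    return (score <= tau).float().mean().item()

# Example: post-ReLU features collected from one hidden layer during training.
feats = torch.relu(torch.randn(512, 256))
print(dormant_fraction(feats))
```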

Part 2: Penalizing Infeasible Actions (PA)

RS-LN prevents the Q-values from shooting up to infinity, but it typically leaves them flat (near zero) outside the data. To ensure safety, we want the Q-values to actively decline as we move away from safe actions: the agent should learn that venturing into the unknown is strictly worse than staying on known ground.

This is where the second component, Penalizing Infeasible Actions (PA), comes in.

The idea is simple: pick points that are definitely impossible (infeasible) and train the network to output a low value (\(Q_{min}\)) for them.

Defining the Infeasible Region

We cannot simply penalize everything outside the dataset, because we might accidentally punish good actions that are slightly adjacent to our data (OOD-in). Instead, PARS defines an Infeasible Action Region (\(\mathcal{A}_I\)) that is far away from the feasible boundary.

\(\mathcal{A}_F\) and \(\mathcal{A}_I\) for \(n = 1\)

As shown in Figure 8, there is a Guard Interval between the Feasible Action Region (\(\mathcal{A}_F\)) and the Infeasible Region (\(\mathcal{A}_I\)).

  • \(\mathcal{A}_F\): The valid range of actions (e.g., motor torque between -1 and 1).
  • Guard Interval: A buffer zone to ensure we don’t interfere with the gradients at the edge of the valid space.
  • \(\mathcal{A}_I\): The region far out (e.g., motor torque > 100), where we sample points for the penalty (one simple way to sample this region is sketched below).
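A minimal sampler, assuming actions live in \([-1, 1]\) per dimension; the guard and range values below are illustrative, not the paper’s exact settings:

```python
import torch

def sample_infeasible_actions(batch_size, action_dim, a_max=1.0, guard=2.0, far=10.0):
    """Draw actions well outside the feasible box [-a_max, a_max]^d,
    skipping over the guard interval next to the boundary."""
    # Magnitudes uniform in [a_max + guard, a_max + guard + far], random signs.
    magnitude = a_max + guard + far * torch.rand(batch_size, action_dim)
    sign = torch.randint(0, 2, (batch_size, action_dim)).float() * 2.0 - 1.0
    return sign * magnitude

print(sample_infeasible_actions(4, 2))
```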

The PA Loss Function

The algorithm adds a specific loss term to the training objective:

\[
\mathcal{L}_{PA} = \mathbb{E}_{s \sim \mathcal{D},\; a \sim \mathcal{A}_I}\Big[\big(Q(s, a) - Q_{min}\big)^{2}\Big]
\]

This equation says: “For actions \(a\) sampled from the infeasible region, minimize the difference between the predicted Q-value and \(Q_{min}\).”

By pinning the values far away to a minimum while the data keeps the values in the center high, the network’s smooth interpolation creates a downward slope connecting the two. This “ski-slope” shape ensures that if the agent tries to maximize the Q-value, gradients will naturally push it back toward the safe, data-supported region.
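In code, the penalty is just a regression of the critic’s output toward \(Q_{min}\) on sampled infeasible actions. A minimal sketch (the function signature is mine; `q_net` can be any critic, such as the LayerNorm network sketched earlier):

```python
import torch
import torch.nn.functional as F

def pa_loss(q_net, states, infeasible_actions, q_min):
    """Penalize infeasible actions: regress Q(s, a_infeasible) toward q_min."""
    q_pred = q_net(states, infeasible_actions)            # shape (batch, 1)
    return F.mse_loss(q_pred, torch.full_like(q_pred, q_min))
```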

The Complete Algorithm: PARS

Combining these two concepts—the high-resolution feature learning of RS-LN and the boundary enforcement of PA—gives us the PARS algorithm.

The total loss function looks like this:

\[
\mathcal{L}_{total} = \mathcal{L}_{TD} + \mathcal{L}_{PA}
\]

where \(\mathcal{L}_{TD}\) is the standard Bellman error computed with the scaled rewards \(c_{reward} \cdot r\), and \(\mathcal{L}_{PA}\) is the infeasible-action penalty above.

It is built on top of the minimalist TD3+BC framework. TD3 is a standard actor-critic algorithm, and “+BC” adds a behavior cloning term to keep the policy close to the data. PARS enhances the Critic (the Q-function) to ensure the value landscape is well-shaped, which allows the Actor (the policy) to find better actions without falling off the “cliff” of extrapolation error.
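For reference, the TD3+BC actor objective that PARS inherits balances Q-maximization against staying close to the dataset actions:

\[
\pi^{*} = \arg\max_{\pi}\; \mathbb{E}_{(s,a)\sim\mathcal{D}}\Big[\lambda\, Q\big(s, \pi(s)\big) - \big(\pi(s) - a\big)^{2}\Big],
\]

where \(\lambda\) is a normalization term that scales the Q-term relative to the average Q-magnitude in the batch. PARS leaves this actor objective alone and focuses on shaping the critic.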

Implementation is straightforward (a minimal training-step sketch follows the list):

  1. Scale Rewards: Multiply incoming rewards by a factor (e.g., 10 or 100).
  2. Add Layer Norm: Apply LN to the Q-network layers.
  3. Sample Infeasible Actions: Randomly pick actions far outside the valid bounds.
  4. Apply Penalty: Add the PA loss to the standard Bellman update.
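Pulling the four steps together, here is a hedged sketch of one PARS-style critic update (single critic, illustrative hyperparameters such as `q_min` and the sampling ranges; the actual implementation sits on top of TD3+BC’s twin critics and target networks):

```python
import torch
import torch.nn.functional as F

def critic_update(q_net, q_target, policy, batch, optimizer,
                  c_reward=100.0, q_min=0.0, gamma=0.99,
                  a_max=1.0, guard=2.0, far=10.0):
    """One critic step: scaled-reward TD loss + infeasible-action penalty."""
    s, a, r, s_next, done = batch  # tensors; r and done have shape (batch, 1)

    # 1. Scale rewards before they enter the Bellman target (the "RS" in RS-LN).
    r_scaled = c_reward * r

    # 2. Standard TD target; the critic itself contains LayerNorm (the "LN").
    with torch.no_grad():
        a_next = policy(s_next)
        target = r_scaled + gamma * (1.0 - done) * q_target(s_next, a_next)
    td_loss = F.mse_loss(q_net(s, a), target)

    # 3. Sample actions far beyond the valid bounds, past the guard interval.
    magnitude = a_max + guard + far * torch.rand_like(a)
    sign = torch.randint(0, 2, a.shape, device=a.device).float() * 2.0 - 1.0
    a_infeasible = sign * magnitude

    # 4. Penalize them: pin their Q-values to q_min.
    pa = F.mse_loss(q_net(s, a_infeasible), torch.full_like(target, q_min))

    loss = td_loss + pa
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```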

Experiments and Results

Does this geometric manipulation actually work on complex tasks? The researchers evaluated PARS on the D4RL benchmark, which includes tasks controlling varied robots (Ant, HalfCheetah, Walker) and manipulating objects (Pen, Hammer, Door).

Offline Performance

The results were impressive. PARS consistently matched or outperformed existing state-of-the-art (SOTA) algorithms.

Comparison of PARS and prior SOTA

Figure 1 shows a radar chart comparing PARS (orange) to previous SOTA methods (gray). Notice the coverage:

  • MuJoCo Locomotion: PARS is highly competitive.
  • Adroit (Dexterous Manipulation): PARS achieves very high scores, particularly on the “Cloned” datasets.
  • AntMaze Ultra: This is the standout result. AntMaze is a notorious “sparse reward” task where a robot ant must navigate a large maze. The “Ultra” version is huge. Most algorithms score near 0. PARS achieves scores of 66.4 and 51.4 on the play and diverse datasets, respectively—a massive leap over prior methods.

Why AntMaze Ultra?

The success on AntMaze Ultra validates the core hypothesis of PARS. In sparse reward settings with long horizons, the value signal is weak. Standard networks easily “leak” value into invalid actions, confusing the agent. PARS’s high reward scaling forces the network to hold onto those sparse signals tightly, while the infeasible penalty creates a clear corridor for the agent to follow.

Component Analysis: Do we need both?

The authors performed ablation studies to verify if RS-LN and PA are both necessary.

PARS offline performance with varying \(c_{reward}\) and application of LN and PA

Figure 9 reveals the synergy:

  • None (Yellow): Performance collapses as the reward scale (\(c_{reward}\)) increases. The network becomes unstable.
  • LN only (Blue): Performance improves significantly as reward scale increases (validating the RS-LN theory).
  • LN & PA (Orange): Adding the penalty yields the best performance, starting high even at lower reward scales and maintaining robustness.

Offline-to-Online Fine-Tuning

One of the holy grails of RL is to train an agent offline and then deploy it online to fine-tune and get even better. Many offline algorithms are too conservative; they clamp the agent so tightly to the data that it can’t learn anything new online.

PARS excels here. Because it shapes the value landscape (downward slope) rather than just masking actions, it provides a good starting point for exploration.

Performance graph of online fine-tuning

In Figure 13 (look for the red lines representing PARS), we see the online fine-tuning performance. In difficult tasks like AntMaze-Ultra (bottom row) or Walker2d, PARS adapts quickly, often rising faster and higher than competitors like CQL or IQL.

Conclusion

The paper “Penalizing Infeasible Actions and Reward Scaling in Reinforcement Learning with Offline Data” provides a refreshing perspective on Offline RL. Rather than inventing complicated new loss functions or generative models, the authors looked at the fundamental properties of ReLU networks.

They identified that linear extrapolation is the enemy. Their solution, PARS, fights this enemy with two distinct weapons:

  1. Reward Scaling + Layer Norm: Forces the network to increase its resolution, preventing valid high values from bleeding into invalid regions.
  2. Infeasible Action Penalty: Anchors distant regions to low values, creating a safe “basin of attraction” around the valid data.

For students and practitioners, PARS demonstrates that understanding the inductive biases of your neural network (like how it extrapolates) is just as important as the RL algorithm itself. By simply scaling rewards and adding a boundary penalty, we can turn a hallucinating agent into a robust, state-of-the-art decision maker.