Introduction
Large Language Models (LLMs) like GPT-4, PaLM, and Llama-2 have revolutionized how we interact with information. They can translate languages, write code, and answer complex questions with eerie fluency. However, this versatility comes with a significant caveat: safety. Because these models are trained on vast swathes of the internet, they inadvertently learn to generate harmful content, including hate speech, misinformation, and instructions for illegal acts.
To combat this, developers use alignment techniques like Reinforcement Learning from Human Feedback (RLHF) to teach models to refuse harmful queries. But these safety mechanisms are fragile. A model that is perfectly safe today can be “jailbroken” with a clever prompt tomorrow. Furthermore, as users fine-tune these models for specific tasks (like coding or math) or edit them to update their knowledge, the original safety alignment often deteriorates.
This creates a complex problem: How do we ensure a model stays safe across different lifecycles—whether it is a base model, a fine-tuned specialist, or a knowledge-edited version—without having to retrain it from scratch every time?
Researchers from the Singapore University of Technology and Design and IIT Kharagpur have proposed a novel solution called Safety Arithmetic. Their framework treats safety not as a rule book, but as a mathematical operation. By identifying “harm vectors” in the model’s parameters and “safety vectors” in its activations, they can mathematically subtract harmful behaviors and add safe ones.

As illustrated above, Safety Arithmetic is designed to wrap around the model regardless of how it is being used—whether it’s a base model, a supervised fine-tuned (SFT) model, or an edited model. In this post, we will explore the mechanics of this framework, breaking down the linear algebra that allows us to “steer” models toward safety without expensive retraining.
Background: Why Traditional Alignment Isn’t Enough
Before diving into the solution, we need to understand the fragility of current LLMs. When an organization releases a “safe” model, it has usually undergone extensive Supervised Fine-Tuning (SFT) and RLHF to align it with human values.
However, the “alignment tax” is real. Heavily censored models often refuse to answer benign questions (a phenomenon known as over-safety). Worse, when a user downloads an open-weights model like Llama-2 and fine-tunes it on a dataset of math problems or medical records, the model often “forgets” its safety training. This is known as catastrophic forgetting of safety alignment.
Furthermore, techniques like Model Editing (e.g., ROME), which allow us to surgically inject new facts into a model without retraining, can unintentionally disrupt the delicate balance of weights that keeps the model safe.
The researchers behind Safety Arithmetic lean on two emerging concepts to solve this:
- Task Arithmetic: The idea that specific capabilities (or tasks) of a neural network can be isolated as vectors in the parameter space. If you can isolate the “harmful” task vector, you can theoretically subtract it.
- In-Context Learning (ICL) Steering: The concept that a model’s internal state (activations) during inference can be nudged in a specific direction to alter its behavior.
The Core Method: Safety Arithmetic
The Safety Arithmetic framework is a “training-free” approach. It doesn’t require updating the model through gradient descent in the traditional sense. Instead, it operates in two distinct stages:
- Harm Direction Removal (HDR): Cleaning the model’s parameters (weights) to remove inherent biases.
- Safety Alignment (Safe-Align): Guiding the model’s activations (hidden states) during inference to ensure safe generation.

Let’s break down these two stages mathematically.
Stage 1: Harm Direction Removal (HDR)
The goal of HDR is to identify the specific weights in the neural network responsible for generating harmful content and neutralize them.
Step 1: Identifying the Harm Vector
To find “harm” in a neural network, the researchers use a technique involving task analogies. They first take a safe base model (\(\theta_b\)) and fine-tune it on a small dataset of harmful question-answer pairs. This creates an intentionally “bad” or unsafe model, denoted as \(\theta_H\).
The difference between this unsafe model and the original base model represents the “direction” of harm in the parameter space. We calculate the harm vector (\(\tau_H\)) using simple subtraction:
\[
\tau_H = \theta_H - \theta_b
\]
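As a rough sketch of what this step looks like in code, assuming both checkpoints are PyTorch state dicts with identical architectures (the function and variable names here are illustrative, not the authors' implementation):

```python
def compute_harm_vector(theta_b: dict, theta_H: dict) -> dict:
    """Parameter-wise difference between the harm-finetuned model and the safe base model."""
    # tau_H = theta_H - theta_b, computed tensor by tensor.
    return {name: theta_H[name] - theta_b[name] for name in theta_b}

# Usage sketch (model names are placeholders):
# base_sd    = AutoModelForCausalLM.from_pretrained("safe-base-model").state_dict()
# harmful_sd = AutoModelForCausalLM.from_pretrained("harm-finetuned-model").state_dict()
# tau_H = compute_harm_vector(base_sd, harmful_sd)
```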
Step 2: Pruning Redundant Parameters
Neural networks are massive, and modifying every single parameter based on this vector could damage the model’s general utility (e.g., its ability to speak English or do math). The researchers found that “harm” is often concentrated in parameters with high magnitude changes.
To preserve the model’s utility, they select only the top \(k\)% of parameters with the highest absolute values in the harm vector. This is defined by the set \(S_k\):
\[
S_k = \big\{\, i \;:\; |\tau_H^{(i)}| \text{ is among the top } k\% \text{ of entries of } |\tau_H| \,\big\}
\]
They then create a pruned harm vector (\(\tau'_H\)), where all parameters not in the top \(k\) are set to zero. This ensures that we only target the most significant weights responsible for the harmful behavior.
\[
\big(\tau'_H\big)^{(i)} =
\begin{cases}
\tau_H^{(i)} & \text{if } i \in S_k \\
0 & \text{otherwise}
\end{cases}
\]
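A minimal sketch of this pruning, assuming a single global magnitude threshold across all parameters (the exact selection granularity, global vs. per-tensor, is an assumption here):

```python
import torch

def prune_harm_vector(tau_H: dict, k: float = 10.0) -> dict:
    """Keep only the top-k% largest-magnitude entries of the harm vector; zero out the rest."""
    # Find one global threshold over all parameter magnitudes.
    # (For a 7B-parameter model you would do this in chunks to save memory.)
    all_mags = torch.cat([t.abs().flatten() for t in tau_H.values()])
    n_keep = max(1, int(all_mags.numel() * k / 100.0))
    threshold = torch.topk(all_mags, n_keep, largest=True).values.min()

    # Entries below the threshold fall outside S_k and are set to zero.
    return {name: torch.where(t.abs() >= threshold, t, torch.zeros_like(t))
            for name, t in tau_H.items()}
```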
Step 3: Subtracting the Harm
Finally, to create a safer version of the target model (\(\theta_t\)), we subtract this pruned harm vector from it. A scaling factor, \(\lambda\), controls how aggressively we remove the harm.
\[
\hat{\theta}_t = \theta_t - \lambda \cdot \tau'_H
\]
The result is a model (\(\hat{\theta}_t\)) that has mathematically “unlearned” the directions in parameter space associated with generating toxic or harmful content.
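The subtraction itself is a one-liner per tensor. A sketch combining it with the pruning above (lambda_ and the helper names are illustrative):

```python
def apply_hdr(theta_t: dict, tau_H_pruned: dict, lambda_: float = 1.0) -> dict:
    """Harm Direction Removal: theta_hat_t = theta_t - lambda * tau'_H."""
    return {name: (param - lambda_ * tau_H_pruned[name]) if name in tau_H_pruned else param
            for name, param in theta_t.items()}

# Usage sketch:
# theta_hat = apply_hdr(target_model.state_dict(), prune_harm_vector(tau_H, k=10.0), lambda_=1.0)
# target_model.load_state_dict(theta_hat)
```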
Special Case: Edited Models
If the target model has been edited (e.g., to update a fact about a president), applying the harm vector globally might undo the edit. In these cases, Safety Arithmetic uses a mask to apply the harm vector only to the layers that were edited and their immediate neighbors.
First, they identify the layers whose weights were changed by the edit, by comparing the edited model \(\theta_t\) with its pre-edit weights \(\theta\):

\[
\mathcal{L}_e = \big\{\, l \;:\; \theta_t^{(l)} \neq \theta^{(l)} \,\big\}
\]
Then, they create a mask (\(\mathcal{E}\)) to isolate those layers:
\[
\mathcal{E}^{(l)} =
\begin{cases}
1 & \text{if } l \in \mathcal{L}_e \text{ or } l \text{ is adjacent to a layer in } \mathcal{L}_e \\
0 & \text{otherwise}
\end{cases}
\]
Finally, the harm vector is applied only where the mask allows:
\[
\hat{\theta}_t = \theta_t - \lambda \cdot \big(\mathcal{E} \odot \tau'_H\big)
\]
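In code, the mask simply gates which parameters receive the subtraction. A sketch, assuming Llama-style parameter names of the form model.layers.&lt;idx&gt;.* and a known set of edited layer indices (both assumptions for illustration):

```python
def apply_hdr_masked(theta_t: dict, tau_H_pruned: dict,
                     edited_layers: set, lambda_: float = 1.0, window: int = 1) -> dict:
    """Subtract the pruned harm vector only on edited layers and their immediate neighbours."""
    # Expand the edited set to include neighbouring layers (this plays the role of the mask E).
    allowed = {l + d for l in edited_layers for d in range(-window, window + 1)}

    new_state = {}
    for name, param in theta_t.items():
        # Hypothetical "model.layers.<idx>.*" naming convention (Llama-style checkpoints).
        parts = name.split(".")
        layer_idx = int(parts[2]) if len(parts) > 2 and parts[2].isdigit() else None
        if layer_idx in allowed and name in tau_H_pruned:   # mask = 1
            new_state[name] = param - lambda_ * tau_H_pruned[name]
        else:                                               # mask = 0
            new_state[name] = param
    return new_state

# Example: an edit applied to layer 5 of a Llama-2 model
# theta_hat = apply_hdr_masked(edited_model.state_dict(), tau_H_pruned, edited_layers={5})
```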
Stage 2: Safety Alignment (Safe-Align)
The second stage happens during inference (when the model is actually generating text). Even after HDR has cleaned the parameters, some harmful behaviors might persist in the model’s “thought process” (its latent space).
To fix this, the researchers use In-Context Learning (ICL) to calculate a “Safety Vector” that steers the model’s hidden states toward safety in real-time.
Step 1: Collecting Exemplars
They prepare a dataset of prompts containing pairs of unsafe and safe responses.
- Unsafe Prompt (\(p_{usf}\)): A harmful question paired with a harmful answer.
- Safe Prompt (\(p_{sf}\)): The same question paired with a safe, refusal-style answer.
They run these prompts through the model and capture the hidden state representations (\(h\)) at the last token position for every layer.

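Collecting these representations is straightforward with the Hugging Face transformers library. A sketch, assuming a causal LM checkpoint and a handful of exemplar pairs (the model name and prompt formatting are placeholders):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

@torch.no_grad()
def last_token_states(prompt: str) -> torch.Tensor:
    """Return a (num_layers, hidden_dim) tensor: the last-token hidden state of every layer."""
    ids = tok(prompt, return_tensors="pt")
    out = model(**ids)
    # out.hidden_states = (embeddings, layer_1, ..., layer_L); skip the embedding layer.
    return torch.stack([h[0, -1, :] for h in out.hidden_states[1:]])

# For each exemplar pair:
# h_sf  = last_token_states(question + safe_answer)    # representation of p_sf
# h_usf = last_token_states(question + unsafe_answer)  # representation of p_usf
```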
Step 2: Computing the In-Context Safety Vector (ICV)
The goal is to find a vector (\(h_{ICV}\)) that, when added to the model’s state, pushes the representation closer to the safe examples and further from the unsafe examples.
This is formulated as an optimization problem:
\[
h_{ICV} = \arg\max_{h} \sum_{i} \Big( d\big(h,\, h(p^{i}_{usf})\big) - d\big(h,\, h(p^{i}_{sf})\big) \Big)
\]

where \(d(\cdot,\cdot)\) is a distance metric over the hidden representations.
Using the L2 norm as the distance metric, this simplifies to finding the direction that maximizes the difference between the safe and unsafe representations.
\[
h_{ICV} = \arg\max_{\|h\| = 1} \sum_{i} \Big( h^{\top}\big(h(p^{i}_{sf}) - h(p^{i}_{usf})\big) \Big)^{2}
\]
In practice, the optimal solution is the first principal direction (via PCA) of the differences between the safe and unsafe hidden states. This vector captures the essential “essence” of safety for the model.
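In code, this boils down to one PCA over the difference vectors. The sketch below computes a separate direction per layer, which is a simplifying assumption (the paper may instead run a single PCA over concatenated layer representations); all names are illustrative:

```python
import torch

def compute_icv(h_sf: torch.Tensor, h_usf: torch.Tensor) -> torch.Tensor:
    """First principal direction of (safe - unsafe) hidden-state differences, per layer.

    h_sf, h_usf: (num_pairs, num_layers, hidden_dim) stacks from last_token_states().
    Returns:     (num_layers, hidden_dim) safety directions.
    """
    diffs = h_sf - h_usf  # (num_pairs, num_layers, hidden_dim)
    directions = []
    for l in range(diffs.shape[1]):
        layer_diffs = diffs[:, l, :]                    # (num_pairs, hidden_dim)
        _, _, v = torch.pca_lowrank(layer_diffs, q=1)   # first right singular vector
        direction = v[:, 0]
        # Orient the vector so it points from unsafe toward safe on average.
        if (layer_diffs @ direction).mean() < 0:
            direction = -direction
        directions.append(direction)
    return torch.stack(directions)
```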
Step 3: Steering the Activations
During inference, as the model processes a user’s query, this Safety Vector (\(h_{ICV}\)) is added to the hidden states (\(h_l^t\)) at every layer \(l\) and token step \(t\). A hyperparameter \(\alpha\) controls the strength of this steering.
\[
\tilde{h}_l^t = h_l^t + \alpha \cdot h_{ICV}
\]
To prevent this addition from distorting the “energy” or magnitude of the signal (which could result in gibberish output), the new hidden state is normalized to match the length of the original hidden state.
\[
\hat{h}_l^t = \tilde{h}_l^t \cdot \frac{\|h_l^t\|_2}{\|\tilde{h}_l^t\|_2}
\]
The result is \(\theta_{sf}\)—a fully aligned model that has been cleaned at the parameter level and steered at the activation level.
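At inference time, this steering can be implemented with forward hooks that shift each layer's output and rescale it to its original norm. A sketch, assuming a Llama-style model.model.layers stack and the per-layer icv from the previous sketch; alpha is a tunable hyperparameter and 0.12 is only an illustrative value:

```python
import torch

def add_steering_hooks(model, icv: torch.Tensor, alpha: float = 0.12):
    """Add alpha * h_ICV to every layer's hidden states and renormalize to the original L2 norm."""
    handles = []
    for layer_idx, layer in enumerate(model.model.layers):  # Llama-style decoder layers
        direction = icv[layer_idx]

        def hook(module, inputs, output, direction=direction):
            hidden = output[0] if isinstance(output, tuple) else output   # (batch, seq, dim)
            d = direction.to(device=hidden.device, dtype=hidden.dtype)
            steered = hidden + alpha * d
            # Rescale so the steered state keeps the original state's norm (avoids gibberish).
            scale = hidden.norm(dim=-1, keepdim=True) / steered.norm(dim=-1, keepdim=True)
            steered = steered * scale
            return ((steered,) + output[1:]) if isinstance(output, tuple) else steered

        handles.append(layer.register_forward_hook(hook))
    return handles  # call handle.remove() on each to switch the steering off
```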
Experiments and Results
The researchers tested Safety Arithmetic on several models (Llama-2, Mistral, WizardMath) using rigorous benchmarks like AdvBench, DangerousQA, and HarmfulQA. They compared their method against the original base models and other alignment baselines.
Does it actually reduce harm?
The reductions in harm were substantial. The metric used was Attack Success Rate (ASR)—the percentage of times the model failed to refuse a harmful prompt. Lower is better.
For Supervised Fine-Tuned (SFT) models, which are notoriously prone to losing their safety guardrails, Safety Arithmetic restored safety dramatically.

Looking at WizardMath (WM) in the table above:
- The Original model had an ASR of 79.62% on AdvBench, meaning it complied with harmful prompts nearly 80% of the time.
- Safety Arithmetic slashed this to 37.69%.
- On DangerousQA, it dropped from 76.50% to 50.00%.
Similar trends were observed for LlamaMath (LM), where the ASR on DangerousQA dropped from 27.00% to a remarkable 6.00%. This shows that the framework effectively restores safety in specialized models without needing to retrain them on safety data.
What about Unintentional Edits?
The researchers also introduced a new dataset called NOINTENTEDIT. This dataset highlights a subtle risk: sometimes, editing a model with benign facts (like cultural norms or business strategies) can inadvertently make it more susceptible to generating biased or harmful content.

When applied to these unintentionally edited models, Safety Arithmetic again proved superior.

In the table above, the Edited Model saw its ASR spike to 25.19% on AdvBench. Safety Arithmetic brought this down to 5.96%, outperforming other baselines significantly.
The Trade-off: Safety vs. Utility
A common criticism of safety techniques is that they make models “dumber.” If you prune too many parameters to remove harm, you might also lobotomize the model’s ability to reason.
The researchers analyzed this trade-off by varying the \(k\) percentage (the amount of parameters modified in the HDR stage).

As shown in Figure 3, increasing the number of modified parameters (moving right on the X-axis) lowers the Attack Success Rate (ASR - red bars). However, eventually, the model’s utility (MMLU score - blue line) starts to drop. The researchers found that modifying the top 10% of parameters was the “sweet spot”—it drastically reduced harm while keeping the MMLU score (general knowledge) almost identical to the original model.
This is further confirmed in their utility tables:

Table 4 shows that on benchmarks like MMLU, Hellaswag, and GSM8K (math), the performance of models treated with Safety Arithmetic is nearly indistinguishable from that of the base models.
Reducing Over-Safety
Finally, a major annoyance with aligned models is “over-safety”—refusing to answer harmless questions because they sound dangerous (e.g., “How do I kill… a Python process?”).

Using the XSTest dataset, the researchers showed that Safety Arithmetic actually lowered the refusal rate on benign prompts compared to the base Llama-2 model (from 17.8% down to 8.6%). By targeting specific harm vectors rather than applying a blanket refusal filter, the model becomes more nuanced in what it blocks.
Conclusion and Implications
Safety Arithmetic represents a significant step forward in the field of AI alignment. It moves away from the “black box” approach of Reinforcement Learning and toward a more interpretable, linear-algebraic understanding of how LLMs store and process concepts like “harm.”
The key takeaways are:
- Versatility: It works for base models, fine-tuned specialists, and knowledge-edited models alike.
- Efficiency: It is training-free. You don’t need a massive GPU cluster to realign your model; you just need to perform some vector subtraction and addition.
- Precision: It surgically removes harmful tendencies without degrading the model’s intelligence or causing it to panic at benign queries.
As open-source models continue to proliferate, frameworks like Safety Arithmetic will be crucial. They provide a “safety wrapper” that developers can apply at test time, ensuring that even as we specialize and edit our AI models, they remain aligned with human values.