Large Language Models (LLMs) like Llama and Mistral are incredible feats of engineering, capable of fluent reasoning and creativity. However, they are also prone to hallucinations, biases, and toxic outputs. When we want to fix these behaviors, our traditional toolkit—like fine-tuning—can be computationally expensive and sometimes compromises the model’s general capabilities.

Recently, a technique called Activation Editing (or Representation Engineering) has emerged as a surgical alternative. Instead of retraining the model weights, we intervene during inference, tweaking the internal “thoughts” (activations) of the model to guide it toward honesty or safety.

Most current methods treat these activations as points on a map and try to “shift” them by adding a steering vector. In this post, we will dive into a new research paper that argues this approach is geometrically flawed. The authors propose a new method, Householder Pseudo-Rotation (HPR), which treats activations not as points to be moved, but as vectors to be rotated.

We will explore why preserving the “magnitude” of these vectors is crucial for model stability and how a clever linear algebra trick—the Householder transformation—allows us to edit model behavior more effectively than ever before.

The Problem with “Steering” Vectors

To understand the innovation of HPR, we first need to understand how standard Activation Editing works.

When an LLM processes a prompt, it passes data through layers of neurons. The output of a specific layer is an activation vector. Researchers have found that specific directions in this high-dimensional space correspond to concepts like “truthfulness” or “toxicity.”
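To make “a direction that corresponds to a concept” concrete, here is a minimal sketch of one common recipe: take the difference of class means over cached activations. The synthetic arrays and the mean-difference choice are illustrative assumptions, not necessarily how this paper finds its directions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins for activations cached at one layer (rows: examples, cols: hidden dims).
truthful_acts = rng.normal(loc=0.5, scale=1.0, size=(128, 4096))
hallucinated_acts = rng.normal(loc=-0.5, scale=1.0, size=(128, 4096))

# One common recipe: the normalized difference of class means points from
# the "hallucinated" region toward the "truthful" region.
direction = truthful_acts.mean(axis=0) - hallucinated_acts.mean(axis=0)
direction /= np.linalg.norm(direction)
print(direction.shape)  # (4096,)
```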

The dominant approach, such as Inference-Time Intervention (ITI), identifies a “steering vector” (a direction representing the desired behavior) and simply adds it to the model’s activation.

Think of this as the Points-in-Space View. You treat the activation as a dot on a graph, and you push it to the right or left.
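In code, the additive intervention amounts to a single line at the chosen layer. The sketch below assumes a precomputed unit steering vector (for example, the mean-difference direction above, or a probe direction as in ITI); `alpha` is an illustrative strength, not a value from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
activation = rng.normal(size=4096)             # hidden state at the edited layer (synthetic stand-in)
steering_vector = rng.normal(size=4096)
steering_vector /= np.linalg.norm(steering_vector)

alpha = 15.0                                   # intervention strength (illustrative)
edited = activation + alpha * steering_vector  # the "push" applied during inference
```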

Figure 1: Comparison of the points-in-space view (a) vs. the direction-magnitude view (b).

As shown in Figure 1(a) above, traditional methods shift points from a “negative” region (red) to a “positive” region (green) by adding a vector.

However, the authors of this paper argue for a Direction-Magnitude View (Figure 1(b)). They posit that the semantic information (the meaning) is held in the direction of the vector, while the magnitude (length) represents intensity.

The Magnitude Consistency Property

Why does this distinction matter? It turns out that LLMs maintain a very strict internal geometry. The researchers discovered that within any given layer, activation vectors tend to have roughly the same length (norm), regardless of the content. They call this Magnitude Consistency.

Let’s look at the data. The chart below shows the distribution of activation norms across different layers for three popular models.

Activation norms across layers showing consistency.

Notice the tight box plots. Whether the model is processing positive or negative concepts, the “length” of the activation vector is remarkably stable within a layer.
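If you want to check this property yourself, a minimal sketch with forward hooks looks roughly like the following. The model name, the decision to look only at the last token, and the single prompt are all assumptions for illustration; any decoder-only Hugging Face model with a `.model.layers` list behaves similarly.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-7b-hf"   # assumed model name
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16)
model.eval()

norms = {}

def make_hook(layer_idx):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output  # (batch, seq, hidden_dim)
        # Record the L2 norm of the last-token activation for this layer.
        norms[layer_idx] = hidden[0, -1].float().norm().item()
    return hook

handles = [layer.register_forward_hook(make_hook(i)) for i, layer in enumerate(model.model.layers)]

with torch.no_grad():
    ids = tok("The capital of France is", return_tensors="pt")
    model(**ids)

for h in handles:
    h.remove()

print(norms)  # per-layer norms; Magnitude Consistency predicts similar values across prompts
```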

The problem with the “Steering Vector” approach (adding a vector) is that it disrupts this stability. By adding a steering vector, you inevitably change the length of the activation.

  • If you steer too little, you don’t change the behavior.
  • If you steer enough to change behavior, you often stretch the vector far beyond its natural length.
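To see why addition and magnitude preservation are at odds, expand the squared norm of a steered activation \(a + \alpha v\), where \(v\) is the steering vector and \(\alpha\) the strength (notation assumed here for illustration):

\[
\|a + \alpha v\|^{2} \;=\; \|a\|^{2} \;+\; 2\alpha\, a^{\top} v \;+\; \alpha^{2}\,\|v\|^{2}.
\]

Unless \(2\alpha\, a^{\top} v + \alpha^{2}\|v\|^{2}\) happens to be exactly zero, the edited activation has a different length than the original, and the discrepancy grows with the steering strength \(\alpha\).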

This disruption can break the model. As shown below in Figure 4, the ITI method (blue line) can cause activation norms to spike unnaturally (see the spike at 100 in the middle graph), which essentially pushes the model into undefined territory.

Figure 4: Norm distributions showing how ITI disrupts magnitude consistency while HPR preserves it.

When this norm consistency is broken (Figure 4b), the model often starts outputting complete gibberish. The goal, therefore, is to change the direction (the behavior) without changing the magnitude (the stability). We need rotation, not addition.

The Solution: Householder Pseudo-Rotation (HPR)

Rotating a vector in high-dimensional space (e.g., 4096 dimensions) toward a specific target is computationally expensive: constructing a full rotation matrix costs \(\mathcal{O}(d^3)\), which is too slow for real-time inference.

The authors propose a clever workaround called Householder Pseudo-Rotation (HPR). The idea is to approximate a rotation using two steps:

  1. Reflection: Flip the vector across a hyperplane (like a mirror) into the “positive” region.
  2. Adjustment: Fine-tune the angle to land exactly where we want.

This method is computationally efficient and, crucially, preserves the vector norm perfectly.
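For intuition about why the reflection step is cheap: reflecting an activation \(a\) across a hyperplane with normal vector \(\theta\) (in Step 1 below, the probe direction plays this role) never requires building a \(d \times d\) matrix, thanks to the standard identity

\[
\text{reflect}(a) \;=\; a \;-\; 2\,\frac{\theta^{\top} a}{\theta^{\top}\theta}\,\theta,
\]

which costs \(\mathcal{O}(d)\) per activation, versus \(\mathcal{O}(d^{2})\) for a dense matrix-vector product or the \(\mathcal{O}(d^{3})\) quoted above for a general rotation.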

Step 1: The Linear Probe (Finding the Mirror)

First, we need to know where the “positive” (e.g., truthful) and “negative” (e.g., hallucinated) regions are. The researchers train a simple Linear Probe—a classifier—on the activations of a specific layer.

Figure 2: Probe accuracy across layers.

As seen in Figure 2, a linear probe can distinguish positive and negative activations with high accuracy (around 80% in middle layers). The decision boundary of this classifier acts as our Separating Hyperplane.
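Here is a minimal sketch of fitting such a probe with scikit-learn, assuming activations have already been cached for labeled positive and negative examples. The synthetic arrays stand in for real cached activations, and the paper's probe may differ in training details.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Hypothetical cached activations for one layer, with labels 1 = positive/truthful, 0 = negative.
X = np.vstack([rng.normal(0.3, 1.0, size=(200, 4096)),
               rng.normal(-0.3, 1.0, size=(200, 4096))])
y = np.concatenate([np.ones(200), np.zeros(200)])

probe = LogisticRegression(max_iter=1000).fit(X, y)
theta_probe = probe.coef_[0]          # normal vector of the separating hyperplane
print(probe.score(X, y), theta_probe.shape)
```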

The probe gives us a normal vector \(\theta_{probe}\). We can use this to construct a Householder Matrix \(H\). In linear algebra, a Householder matrix performs a reflection across a hyperplane.

The reflection is defined as:

\[
\dot{a} \;=\; H\,a, \qquad H \;=\; I \;-\; 2\,\frac{\theta_{probe}\,\theta_{probe}^{\top}}{\|\theta_{probe}\|^{2}}
\]

Here, \(a\) is the original (negative) activation, and \(\dot{a}\) is the reflected version. Because this is a reflection, \(|\dot{a}| = |a|\). We have successfully moved the vector to the “positive” side without changing its length.
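Here is a small numerical check of that claim, assuming for simplicity that the hyperplane passes through the origin (a probe with a bias term would require handling the offset as well):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4096
a = rng.normal(size=d)          # a "negative" activation (synthetic stand-in)
theta = rng.normal(size=d)      # probe direction, i.e. the normal of the separating hyperplane

def reflect(x: np.ndarray, normal: np.ndarray) -> np.ndarray:
    """Apply the Householder reflection H x = x - 2 (normal.x / normal.normal) normal in O(d)."""
    return x - 2.0 * (normal @ x) / (normal @ normal) * normal

a_dot = reflect(a, theta)
print(np.isclose(np.linalg.norm(a), np.linalg.norm(a_dot)))  # True: reflections preserve length
print(np.allclose(reflect(a_dot, theta), a))                 # True: reflecting twice returns the original
```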

Step 2: Angle Prediction (Fine-tuning the Rotation)

Simply reflecting the vector might be too crude—it might overshoot the optimal direction. We need to rotate the original vector \(a\) towards the reflected vector \(\dot{a}\), but stop at exactly the right angle.

To do this, the authors introduce an Angle Prediction Module (a small neural network) that predicts the optimal rotation angle \(\gamma_1\).

Angle prediction equation.

This module takes the activation and learns to predict how much rotation is needed to align it with the ground-truth positive activation seen during training.
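The post above does not spell out the module's architecture, so the sketch below is only one plausible shape for such an angle predictor: a small MLP whose output is squashed to a fraction of the full reflection angle, so the predicted \(\gamma_1\) always lies between "no rotation" and "full reflection." The architecture, layer sizes, and parameterization are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class AnglePredictor(nn.Module):
    """Illustrative angle-prediction module: activation -> rotation angle gamma_1."""

    def __init__(self, hidden_dim: int = 4096):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, activation: torch.Tensor, gamma_2: torch.Tensor) -> torch.Tensor:
        # Squash the raw output to (0, 1) and scale by the full reflection angle gamma_2,
        # so the predicted gamma_1 stays within [0, gamma_2].
        fraction = torch.sigmoid(self.net(activation)).squeeze(-1)
        return fraction * gamma_2


predictor = AnglePredictor()
a = torch.randn(4, 4096)          # batch of activations
gamma_2 = torch.full((4,), 0.8)   # angle between each activation and its reflection (radians)
gamma_1 = predictor(a, gamma_2)
print(gamma_1.shape)              # torch.Size([4])
```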

Step 3: The Geometric Calculation

Now we have:

  1. The original vector \(a\).
  2. The reflected vector \(\dot{a}\) (which serves as a guide for the direction).
  3. The desired rotation angle \(\gamma_1\).

Since both vectors have the same length, we can perform a rotation on the 2D plane formed by \(a\) and \(\dot{a}\). The authors derive a formula using the law of sines to calculate the final target vector \(\hat{a}\).

Let’s visualize the geometry:

Figure 5: Geometric illustration of the rotation adjustment.

In Figure 5, the red vector is the original input (\(a\)). The orange vector is the reflection (\(\dot{a}\)). The green vector is our desired target (\(\hat{a}\)).

The final formula to compute this target vector is an elegant application of trigonometry:

Final calculation of the target activation.

Here, \(\gamma_2\) is the total angle between the original vector and its reflection. This formula allows the model to compute the new activation \(\hat{a}\) efficiently. Importantly, this mathematical operation guarantees that the length of \(\hat{a}\) is identical to the length of \(a\).
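For readers who want to see the shape of the result: rotating \(a\) toward \(\dot{a}\) by an angle \(\gamma_1\) inside the plane they span is the classic spherical-interpolation identity, which follows from the law of sines (the paper's exact notation may differ):

\[
\hat{a} \;=\; \frac{\sin(\gamma_2 - \gamma_1)}{\sin \gamma_2}\, a \;+\; \frac{\sin \gamma_1}{\sin \gamma_2}\, \dot{a}.
\]

Because \(\|a\| = \|\dot{a}\|\) and \(\gamma_2\) is the angle between them, the right-hand side has exactly the same length as \(a\), which is the norm-preservation guarantee mentioned above.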

Experimental Results

The researchers tested HPR on several benchmarks, primarily TruthfulQA, which measures a model’s tendency to mimic human falsehoods. They compared HPR against the standard base models (Llama2, Llama3, Mistral) and the leading steering method (ITI).

Accuracy on TruthfulQA

The results show a massive improvement.

Table 1: Performance on TruthfulQA.

Looking at Table 1:

  • Base Llama2 scores 29.58% on MC1 (single choice accuracy).
  • ITI improves this to 33.74%.
  • HPR jumps to 51.83%.

This is a substantial margin. The pattern holds across Llama3 and Mistral as well. The method isn’t just slightly better; it’s unlocking capabilities that the steering vector approach couldn’t reach.

Safety and Bias

The method generalizes beyond just truthfulness. The authors applied HPR to datasets regarding bias (BBQ), ethics (SEQ), and toxicity (Toxigen).

Table 2: Performance on Bias, Ethics, and Toxicity.

As shown in Table 2, HPR consistently improves scores across these safety benchmarks. For example, on the SEQ (Simple Ethical Questions) dataset with Mistral-7B, accuracy improved from 69.57% to 86.96%.

Quality of Generation (Avoiding Gibberish)

One of the strongest arguments for HPR is its stability. Because standard steering methods (ITI) alter the vector magnitude, pushing them too hard destroys the model’s fluency.

The table below measures Perplexity (lower is better) on the WikiText-2 dataset. Perplexity is a proxy for how natural and fluent the text sounds.

Table 5: Perplexity scores.

In Table 5, look at the row for ITI50 (ITI with a strong steering strength). The perplexity explodes to 133.7 (or 4303 for Llama3!), meaning the model is outputting nonsense.

In contrast, HPR maintains a perplexity almost identical to the base model (around 11.95 for Llama3). This confirms that by preserving the activation norms (Magnitude Consistency), HPR allows for strong interventions without breaking the model’s ability to speak English.
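For reference, perplexity numbers like these are typically computed as the exponential of the average next-token cross-entropy. A minimal sketch with Hugging Face Transformers is below; the model name and the single-sentence text are placeholders, and real WikiText-2 evaluation additionally involves chunking and striding choices.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-7b-hf"   # assumed model name; any causal LM works
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)
model.eval()

text = "The quick brown fox jumps over the lazy dog."   # stand-in for a WikiText-2 chunk
ids = tok(text, return_tensors="pt").input_ids

with torch.no_grad():
    # Passing labels=input_ids makes the model return the mean next-token cross-entropy.
    loss = model(ids, labels=ids).loss

print(math.exp(loss.item()))   # perplexity = exp(average negative log-likelihood)
```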

Conclusion

The shift from “Steering” to “Rotating” represents a maturation in how we think about the internal geometry of Large Language Models.

The Householder Pseudo-Rotation (HPR) method offers a compelling theoretical and practical advancement:

  1. Theoretical Alignment: It respects the “Magnitude Consistency” property of LLMs, acknowledging that information is encoded in direction, not length.
  2. Computational Efficiency: By using reflections (Householder matrices) instead of full rotation matrices, it remains fast enough for inference.
  3. Superior Performance: It significantly outperforms additive steering methods on truthfulness and safety benchmarks while maintaining generation quality.

As we continue to rely on LLMs for critical tasks, efficient alignment techniques like HPR will be essential. They allow us to “fix” models at runtime without the massive cost of retraining, ensuring AI that is not only smart but also truthful and safe.