Large Language Models (LLMs) are powerful, but controlling them—making sure they follow instructions, avoid toxicity, or adhere to specific themes—remains one of the biggest challenges in AI safety. Currently, the industry relies heavily on prompting (asking the model nicely) and finetuning (retraining the model on new data). While effective, these methods have significant drawbacks: prompting can be circumvented by “jailbreaks,” and finetuning is computationally expensive and opaque.
Enter Representation Engineering. This emerging field hopes to open the “black box” of the neural network, identify the specific internal activations (or “neurons”) responsible for a concept, and manually tweak them to steer the model’s behavior. The most celebrated tool in this field of late has been Sparse Autoencoders (SAEs): unsupervised models that decompose model activations into interpretable features.
But does this actually work better than the old methods?
A new paper titled AXBENCH: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders drops a bombshell on the interpretability community. The researchers introduce a massive benchmark, AXBENCH, to rigorously test these methods. Their findings? The hyped unsupervised methods (like SAEs) are currently being outperformed by simple linear baselines and a new method they introduce called ReFT-r1.
In this post, we will tear down the paper, explain the AXBENCH framework, dissect the math behind the new ReFT-r1 method, and look at the surprising results that might reshape how we think about controlling AI.
The Problem: The Illusion of Control
To understand why this paper matters, we first need to understand the premise of steering.
When an LLM processes the word “Paris,” a specific pattern of numbers fires inside its hidden layers. If we could identify the specific “direction” in that mathematical space that represents “France,” we could theoretically add that vector to the model’s brain while it talks about “London,” forcing it to hallucinate a French version of the city.
This is Model Steering.
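To make the idea concrete, here is a minimal sketch (in PyTorch) of what “adding a vector to the residual stream” means. The `concept_vector`, the layer index, and the `model` in the commented usage are all hypothetical placeholders, not anything from the paper:

```python
import torch

def make_steering_hook(concept_vector: torch.Tensor, factor: float):
    """Return a forward hook that adds a scaled concept direction
    to a layer's hidden states (the residual stream)."""
    direction = concept_vector / concept_vector.norm()  # unit-normalize

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + factor * direction.to(hidden)  # match dtype/device
        return (steered, *output[1:]) if isinstance(output, tuple) else steered

    return hook

# Hypothetical usage: attach the hook to one transformer layer of a loaded model.
# handle = model.model.layers[12].register_forward_hook(
#     make_steering_hook(concept_vector, factor=8.0))
# ... generate text while the hook is active ...
# handle.remove()
```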
Researchers have proposed various ways to find these steering vectors:
- Supervised methods: Like taking the average difference between “happy” and “sad” sentences (Difference-in-Means).
- Unsupervised methods (SAEs): Training a separate, smaller network to scan the LLM’s brain and catalog millions of “features” (like “references to the 1990s” or “HTML code”) without human labels.
The problem is that, until now, we haven’t had a standardized way to measure if these fancy vectors are actually better than just prompting the model. Are we finding real concepts, or just noise?
Introducing AXBENCH
To solve the measurement crisis, the authors introduce AXBENCH. It is a large-scale benchmark designed to evaluate model control methods along two critical axes:
- Concept Detection (C): Can the method accurately find where a concept (e.g., “The Golden Gate Bridge”) exists in the model’s internal activations?
- Model Steering (S): Can the method use that information to force the model to talk about that concept?

As shown in Figure 2 above, the benchmark relies on Synthetic Data. The researchers used GPT-4 to generate thousands of examples for 500 specific concepts (derived from GemmaScope).
For a concept like “The Golden Gate Bridge,” they generate:
- Positive examples: Sentences about the bridge.
- Negative examples: Sentences about totally different things.
- Hard Negatives: Sentences about the “Bay Bridge” (to ensure the method isn’t just detecting “bridges” in general).
This creates a ground truth. If a method claims to have found the “Golden Gate Bridge” neuron, AXBENCH checks if that neuron actually fires on the positive examples and stays silent on the negatives.
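As a rough sketch of how such a check can be scored (assuming we already have a per-example activation score from some method; the numbers below are made up), detection quality reduces to a standard classification metric:

```python
from sklearn.metrics import roc_auc_score

# Hypothetical scores: max activation of the candidate "Golden Gate Bridge"
# direction over each example's tokens.
scores = [0.91, 0.77, 0.08, 0.12, 0.35]   # method's concept activations
labels = [1,    1,    0,    0,    0]      # 1 = positive, 0 = (hard) negative

# AUROC: probability that a random positive example scores higher than a
# random negative one. 1.0 is perfect, 0.5 is chance.
print(roc_auc_score(labels, scores))
```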
The Contenders: From Simple to Complex
The paper pits several methods against each other. It is crucial to understand the hierarchy here:
- The Baselines:
- Prompting: Just telling the model “Talk about the Golden Gate Bridge.”
- SFT / LoRA: Finetuning the model weights on the data.
- The Simple Steering Methods:
- DiffMean (Difference-in-Means): Take the average representation of the positive examples and subtract the average of the negative examples (a code sketch follows this list).
- Probe: Training a simple linear classifier (logistic regression) to separate positive and negative examples.
- The Complex/Unsupervised Method:
- SAE (Sparse Autoencoders): Using massive, pretrained dictionaries of features to find the concept without looking at the labels.
- The New Challenger:
- ReFT-r1 (Rank-1 Representation Finetuning): A novel method introduced in this paper.
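Here is a minimal sketch of DiffMean, assuming `pos_acts` and `neg_acts` are matrices of hidden states collected from positive and negative examples (this is an illustration, not the paper’s code):

```python
import torch

def diffmean_direction(pos_acts: torch.Tensor, neg_acts: torch.Tensor) -> torch.Tensor:
    """Difference-in-Means: the steering direction is the mean activation
    of positive examples minus the mean activation of negative examples.

    pos_acts, neg_acts: (num_examples, hidden_dim) tensors of hidden states.
    """
    direction = pos_acts.mean(dim=0) - neg_acts.mean(dim=0)
    return direction / direction.norm()  # unit-normalize for steering

# Detection score for a new hidden state h is then just a dot product:
# score = h @ diffmean_direction(pos_acts, neg_acts)
```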
Deep Dive: What is ReFT-r1?
The authors propose ReFT-r1 as a “Supervised Dictionary Learning” (SDL) method. It is designed to bridge the gap between the accuracy of supervised methods and the interpretability of representation engineering.
ReFT-r1 is unique because it learns a specific direction (a rank-1 subspace) for a concept by optimizing two things simultaneously: detection and steering.
First, it defines a detection score. For a hidden state \(h_i\), the presence of the concept is calculated using a learned vector \(\mathbf{w}_{\text{ReFT-r1}}\):

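In rough form (the paper’s exact parameterization may include additional normalization or bias terms), the detection score for token \(i\) is:

\[
z_i = \mathrm{ReLU}\!\left(\mathbf{h}_i^{\top}\,\mathbf{w}_{\text{ReFT-r1}}\right)
\]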
The ReLU ensures that negative matches result in zero activation—we only care if the concept is present, not if it’s “anti-present.”
Next, it defines how to steer. If the concept is detected, ReFT-r1 modifies the hidden state by adding the vector back into the stream. The magnitude of this addition depends on how strong the detection was (specifically, the L1 norm of the top-k activations):

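Roughly, with \(\mathbf{z}\) collecting the latent activations over the sequence and \(k\) the number of top activations kept (the exact scaling in the paper may differ), each hidden state is updated as:

\[
\mathbf{h}_i \;\leftarrow\; \mathbf{h}_i \;+\; \left\lVert \mathrm{TopK}_k(\mathbf{z}) \right\rVert_1 \cdot \frac{\mathbf{w}_{\text{ReFT-r1}}}{\lVert \mathbf{w}_{\text{ReFT-r1}} \rVert_2}
\]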
Finally, how do we train this? The objective function tries to minimize the standard Language Modeling loss (making the model predict the next token correctly) while the intervention is active. It also includes a sparsity penalty to ensure the concept vector is clean and doesn’t activate on irrelevant things.

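Schematically, as a sketch of the objective just described (with \(\lambda\) as the sparsity coefficient; this is not the paper’s exact notation):

\[
\min_{\mathbf{w}_{\text{ReFT-r1}}} \;\; \mathbb{E}\!\left[\mathcal{L}_{\text{LM}}\big(\text{model with the intervention active}\big)\right] \;+\; \lambda \sum_i z_i
\]

Because the \(z_i\) are non-negative (thanks to the ReLU), the penalty \(\sum_i z_i\) is simply their L1 norm.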
This elegant formulation allows ReFT-r1 to learn a vector that is good at identifying the concept and good at influencing the model’s output, using only a small amount of labeled data.
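Putting the pieces together, a ReFT-r1-style intervention might look roughly like the module below. This is a sketch based on the equations above; the paper’s actual implementation details (normalization, top-k handling, where the hook is placed) may differ:

```python
import torch
import torch.nn.functional as F

class Rank1ConceptIntervention(torch.nn.Module):
    """Sketch of a rank-1 detect-and-steer intervention."""

    def __init__(self, hidden_dim: int, k: int = 8):
        super().__init__()
        self.w = torch.nn.Parameter(torch.randn(hidden_dim) * 0.01)  # concept direction
        self.k = k

    def detect(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (seq_len, hidden_dim) -> latent activations z: (seq_len,)
        return F.relu(hidden @ self.w)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        z = self.detect(hidden)
        k = min(self.k, z.shape[0])
        magnitude = z.topk(k).values.sum()        # L1 norm of the top-k activations
        direction = self.w / self.w.norm()
        return hidden + magnitude * direction     # steer every position

# Training (schematically): run the LM with this module applied to one layer's
# hidden states, and minimize the usual next-token loss plus a sparsity
# penalty on z.
```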
Axis 1: Concept Detection Results
The first test is Concept Detection. Given a stream of text, can the method identify which tokens relate to the target concept?
The results, shown in the table below, were surprising. The “dumb” method, DiffMean, performed incredibly well, achieving an AUROC (Area Under the Receiver Operating Characteristic curve) of 0.942. The new method, ReFT-r1, matched this performance.

However, look at the SAE score: 0.695.
This is a significant gap. It suggests that unsupervised autoencoders, despite their computational cost and complexity, struggle to isolate specific concepts as cleanly as supervised linear methods. Even when the researchers tried to “cheat” by picking the absolute best SAE feature using the labels (SAE-A), it still underperformed the simple linear probe and ReFT-r1.
We can visualize this performance gap using Receiver Operating Characteristic (ROC) curves. A perfect method would hug the top-left corner.

The ROC curves confirm that supervised methods (Green/Red lines) offer a much better trade-off between true positives and false positives than SAEs (Purple/Blue lines).
Axis 2: Model Steering Results
Detection is one thing, but Steering is the ultimate goal. Can we take control of the model?
To evaluate this, the authors used an LLM judge to grade the steered outputs on three criteria:
- Fluency: Is the text coherent?
- Instruct Score: Did it answer the user’s question?
- Concept Score: Did it successfully incorporate the target concept?
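One natural way to fold these three ratings into a single number is a harmonic mean, which punishes a response that fails on any one criterion. The snippet below is an illustration of that idea with made-up ratings, not the paper’s exact aggregation:

```python
from statistics import harmonic_mean

# Hypothetical judge ratings for one steered response, each in [0, 1]:
ratings = {"concept": 0.9, "instruct": 0.7, "fluency": 0.95}

# The harmonic mean rewards a response only when *all* criteria are met;
# a single near-zero rating drags the overall score toward zero.
overall = harmonic_mean(ratings.values())
print(round(overall, 3))
```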
The overall results are summarized in this scatter plot:

Key Takeaways from the Steering Results:
- Prompting is King: The simple act of prompting (the dot labeled “Prompt” in the top right) outperforms essentially every representation engineering method. It achieves high steering scores without sacrificing the model’s ability to follow instructions.
- ReFT-r1 leads the interventionists: Among methods that actually intervene on activations, ReFT-r1 (the red dots) is the clear winner, significantly outperforming SAEs and DiffMean.
- The SAE Failure: SAEs (blue/purple dots) fall into the bottom left. This means they often fail to inject the concept, or when they do, they damage the model’s fluency or instruction-following ability.
The table below breaks down the overall steering scores. Note how Prompt achieves 0.894, while ReFT-r1 manages 0.543, and SAE lags at 0.165.

The Trade-off: Coherence vs. Control
Why do steering methods struggle? There is an inherent trade-off. As you push the model harder (increase the “steering factor”) to mention a concept, you risk breaking its brain. It starts repeating words, hallucinating, or ignoring the user’s actual question.
The chart below visualizes this trade-off. The X-axis is the Instruct Score (did it answer the question?), and the Y-axis is the Concept Score (did it mention the concept?).

Ideally, we want to be in the top right. You can see that as most methods push for a higher Concept Score (going up), they drift to the left (losing instruction following). ReFT-r1 (Red Line) maintains the best balance, tracing a “Pareto-optimal” path that stays higher and further right than the competitors.
Why Did SAEs Fail?
The poor performance of Sparse Autoencoders in this benchmark is a sobering moment for the interpretability field. The authors suggest a few reasons:
- Label Mismatch: SAE features are discovered unsupervised. The “concept” the SAE finds might be a specific token pattern or a grammatical feature, not the high-level semantic concept (like “Golden Gate Bridge”) humans care about.
- Feature Splitting: A single human concept might be split across 10 different SAE features. Activating just one doesn’t trigger the full concept.
- Noise: Unsupervised features often contain “polysemantic” noise—they fire for the concept we want, but also for three other random things, making steering unpredictable.
The Power of Weak Supervision
On the flip side, why did ReFT-r1 succeed? The answer lies in Supervised Dictionary Learning (SDL).
By using even a small amount of labeled data (which can be synthetically generated by GPT-4 for pennies), we can force the model to learn a vector that is explicitly aligned with the concept we want.
The authors analyzed the scaling laws of ReFT-r1 and found that it is incredibly efficient. It doesn’t need thousands of examples.

As shown in Figure 10, the performance (Overall Score) ramps up quickly and plateaus. With roughly 50 to 100 examples, ReFT-r1 already achieves near-optimal performance. This makes it a highly practical alternative to training massive SAEs, which require processing billions of tokens.
Furthermore, ReFT-r1 learns a semantically rich space. When the authors visualized the subspaces learned by ReFT-r1 using UMAP, they found clear clustering by genre.

The blue dots (Code), green dots (Math), and red dots (Text) form distinct islands. This suggests that ReFT-r1 isn’t just memorizing data; it is tapping into the model’s internal organization of knowledge.
Conclusion
The AXBENCH paper serves as a reality check for the AI control and interpretability fields. It demonstrates that while the idea of unsupervised feature discovery (SAEs) is elegant, the reality is that simple, supervised baselines are currently far more effective for practical control.
The introduction of ReFT-r1 offers a compelling path forward. It combines the precision of supervised learning with the mechanism of representation steering. While standard prompting still reigns supreme for general tasks, representation-based methods like ReFT-r1 are essential for safety cases where we cannot trust the model to “listen” to a prompt (such as preventing jailbreaks or removing biases).
For students and researchers, the takeaway is clear: Evaluation is everything. Without rigorous benchmarks like AXBENCH, we risk chasing complex methods that look impressive on paper but fail to outperform a simple average.
All figures and tables referenced are from Wu et al., 2025.