Introduction
Large Language Models (LLMs) have become ubiquitous, demonstrating incredible prowess in reasoning, coding, and creative writing. However, this power comes with a significant “dual-use” risk. The same model that can write a helpful medical summary can, if prompted maliciously, generate hate speech, instructions for illegal acts, or biological weapon recipes.
To combat this, the AI community has largely relied on methods like Reinforcement Learning from Human Feedback (RLHF). While effective to a degree, RLHF has fundamental limitations. Models can learn to “game” the reward function—optimizing for the metric rather than genuine safety—and the process is expensive, requiring massive amounts of annotated data and retraining. Furthermore, adversarial attacks (or “jailbreaks”) have proven that even aligned models can be tricked into bypassing their safety filters using cleverly crafted prompts.
What if, instead of just training the model to “prefer” safe answers, we could mathematically define a “safe region” inside the model’s own brain? What if we could construct a geometric cage that prevents the model’s internal thoughts from wandering into dangerous territory?
This is the premise of SaP (Safety Polytope), a novel research framework proposed by Chen et al. SaP treats safety not as a preference, but as a constraint. By identifying a safe geometric region within the model’s representation space and “steering” the model back whenever it tries to leave, SaP offers a robust, interpretable, and post-hoc defense mechanism.

As illustrated above, SaP operates dynamically. When a user asks for something harmful (like a persuasive article on a conspiracy theory), the model’s internal activation vectors move toward a “Misinformation” region. SaP detects this boundary crossing and corrects the vector, forcing the model to generate a refusal or a safe response.
In this deep dive, we will explore how SaP transforms the abstract concept of “safety” into concrete geometry, how it defends against state-of-the-art attacks, and what its internal structure tells us about how LLMs understand harm.
Background: Language as a Trajectory
To understand how SaP works, we first need to look at language generation through the lens of sequential decision-making.
The MDP Perspective
We usually think of LLMs as predicting the next word. However, we can also view this process as a Markov Decision Process (MDP).
- State: The sequence of words generated so far.
- Action: Choosing the next token from the vocabulary.
- Policy: The LLM itself, which decides which token to pick based on the current context.
In standard training, we want the model to maximize a reward (e.g., helpfulness). But safety is different. Safety is not about maximization; it is about constraints. You don’t want to be “maximally safe” (which might mean saying nothing at all); you want to be helpful subject to the constraint that you do not violate ethical guidelines.
This framing leads us to Constrained MDPs (CMDPs). In a CMDP, we want to maximize rewards while ensuring that the cost of violating safety rules remains below a certain threshold.
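In symbols, a CMDP pairs the usual reward objective with a cost budget. A minimal textbook-style formulation (notation chosen here for illustration, not taken from the paper) looks like this:

\[
\max_{\pi} \; \mathbb{E}_{\pi}\Big[\sum_{t} \gamma^{t} \, r(s_t, a_t)\Big]
\quad \text{subject to} \quad
\mathbb{E}_{\pi}\Big[\sum_{t} \gamma^{t} \, c(s_t, a_t)\Big] \le d
\]

Here \(r\) is the helpfulness reward, \(c\) is a safety cost (say, 1 whenever a harmful token is emitted), and \(d\) is the budget the policy must not exceed.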
From Constraints to Geometry
Recent theoretical work has shown that the set of all “safe” policies in a CMDP forms a convex shape—specifically, a polyhedron—in the space of feature expectations.

The equation above essentially states that any safe behavior can be represented as a weighted combination of other safe behaviors, bounded by linear constraints. This is the “Eureka” moment for the SaP researchers: If safety constraints are geometric, we can learn the shape of safety inside the LLM’s activation space.
Instead of retraining the whole model to learn safety (which changes the weights), we can simply map out the “safe polytope” in the activation layer. If the model’s thought vector stays inside this shape, it’s safe. If it crosses a face of the polytope (a facet), it has become unsafe.
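Concretely, a polytope of this kind is just an intersection of half-spaces. Using the notation introduced in the next section (\(\tilde{f}\) for encoded features, \(\phi_i\) for facet directions, \(\xi_i\) for thresholds), the safe set can be sketched as follows; this is a schematic form consistent with the paper's description rather than its exact formula:

\[
\mathcal{S} \;=\; \big\{ \tilde{f} \;:\; \phi_i^{\top} \tilde{f} \le \xi_i, \quad i = 1, \dots, K \big\}
\]

A representation counts as safe only if it satisfies every facet constraint; crossing any single hyperplane \(\phi_i^{\top}\tilde{f} = \xi_i\) marks it as unsafe.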
The SaP Framework: Constructing the Shield
The SaP method consists of three primary stages: extraction, construction, and steering. Let’s break down the architecture.
1. Feature Extraction & The Concept Encoder
LLM representations are high-dimensional and “polysemantic”—meaning a single neuron might activate for multiple unrelated concepts (e.g., a neuron might fire for both “cats” and “cars”). If we try to build a safety wall directly on raw activations, the wall might accidentally block safe concepts too.
To solve this, the authors introduce a Concept Encoder. This is a trainable linear layer followed by a non-linearity (ReLU) that projects the raw model activations (\(h\)) into a sparse feature space (\(\tilde{f}\)).
The goal of the Concept Encoder is to disentangle the messy raw activations into cleaner, distinct features. As we will see later in the experiments, this component is crucial for making the safety boundaries precise and interpretable.
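For intuition, the Concept Encoder amounts to a single learned linear map followed by a ReLU. Here is a minimal PyTorch-style sketch; the class name, layer sizes, and hook point are illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn

class ConceptEncoder(nn.Module):
    """Projects raw LLM activations h into a sparser, more disentangled feature space."""
    def __init__(self, hidden_dim: int = 4096, feature_dim: int = 8192):
        super().__init__()
        # Trainable linear projection followed by a ReLU non-linearity.
        self.proj = nn.Linear(hidden_dim, feature_dim)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # ReLU zeroes out negative components, encouraging sparse features.
        return torch.relu(self.proj(h))

# Usage: encode the hidden state of the current token at some intermediate layer.
encoder = ConceptEncoder()
h = torch.randn(1, 4096)      # stand-in for a raw activation vector
f_tilde = encoder(h)          # sparse concept features
```

The ReLU is what makes the features sparse: any concept whose projection is negative is simply switched off.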
2. Learning the Polytope
Once we have our features, we need to learn the polytope. A polytope is defined by a set of flat boundaries (hyperplanes). Mathematically, we are looking for a set of directions (\(\phi\)) and thresholds (\(\xi\)) such that for any safe input, the dot product of the features and the direction is less than the threshold.
The researchers use an algorithm called the Convex Polytope Machine (CPM), which treats learning the geometry as a classification problem: the system is fed labeled data (sentences marked “safe” or “unsafe”) and must find boundaries that separate the two.
The training objective is complex but intuitive. It tries to do three things simultaneously:
- Maximize the Margin: Push the safety boundaries away from the safe data points (like a Support Vector Machine).
- Minimize Violations: Ensure unsafe data points fall outside the boundaries.
- Induce Sparsity: Use regularization to keep the solution simple.

In the equation above:
- The first sum ensures safe examples stay inside the polytope (below the threshold \(\tilde{\xi}\)).
- The second sum ensures unsafe examples are pushed outside by at least one facet.
- The \(\lambda\) terms are regularizers that encourage the model to use fewer features and keep the weights manageable.
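To make those three terms concrete, below is a hedged PyTorch-style sketch of a CPM-like objective that mirrors the description above: a hinge penalty keeping safe points inside every facet with a margin, a hinge penalty pushing each unsafe point past its most-violated facet, and regularizers on the facet weights. The margin value, the exact hinge form, and the regularization weights are assumptions, not the paper's equation.

```python
import torch

def cpm_loss(f_safe, f_unsafe, phi, xi, margin=1.0, lam_sparse=1e-3, lam_l2=1e-4):
    """Sketch of a Convex-Polytope-Machine-style objective.

    f_safe:   (Ns, D) encoded features of safe examples
    f_unsafe: (Nu, D) encoded features of unsafe examples
    phi:      (K, D)  facet directions
    xi:       (K,)    facet thresholds
    """
    # Signed distance to every facet: positive means "outside" that facet.
    scores_safe = f_safe @ phi.T - xi        # (Ns, K)
    scores_unsafe = f_unsafe @ phi.T - xi    # (Nu, K)

    # (1) Safe examples should sit inside every facet, with a margin to spare.
    loss_safe = torch.relu(scores_safe + margin).sum(dim=1).mean()

    # (2) Each unsafe example must be outside *at least one* facet,
    #     so we only penalize its most-violating (maximum-score) facet.
    loss_unsafe = torch.relu(margin - scores_unsafe.max(dim=1).values).mean()

    # (3) Regularizers: L1 for sparsity (fewer active features per facet),
    #     L2 to keep the facet weights manageable.
    reg = lam_sparse * phi.abs().sum() + lam_l2 * (phi ** 2).sum()

    return loss_safe + loss_unsafe + reg
```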
3. SafeFlow: Steering Representations
The final piece of the puzzle is Steering. We have trained our polytope; we know what the safe region looks like. Now, we deploy the model.
During inference, for every token the model generates, we intercept the internal activation vector. We check: Is this vector inside the Safety Polytope?
- Yes: Proceed as normal.
- No: We solve an optimization problem to find the closest point inside the polytope and replace the current vector with that safe point.

This process, detailed in Algorithm 1 of the paper, is essentially a real-time correction mechanism. It’s like a lane-keeping assist for LLMs—if the model starts drifting into the oncoming traffic of “hate speech,” SafeFlow gently nudges the steering wheel back into the safe lane before the accident (the harmful token) happens.
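Below is a minimal NumPy sketch of this check-and-correct step. The paper solves an explicit optimization problem to find the closest safe point; here it is approximated with cyclic projections onto the violated half-spaces (a standard alternating-projections scheme), and the function and variable names are assumptions:

```python
import numpy as np

def steer_to_polytope(f, phi, xi, max_iters=100, tol=1e-6):
    """Return a feature vector inside the safety polytope {f : phi @ f <= xi}.

    f:   (D,)   current (possibly unsafe) representation vector
    phi: (K, D) facet directions
    xi:  (K,)   facet thresholds
    """
    x = f.copy()
    for _ in range(max_iters):
        violations = phi @ x - xi            # positive entries = crossed facets
        if np.all(violations <= tol):
            return x                         # already safe: no steering needed
        # Project onto each violated half-space in turn.
        for i in np.where(violations > tol)[0]:
            g = phi[i]
            x = x - ((g @ x - xi[i]) / (g @ g)) * g
    return x

# Surrounding decoding loop (pseudocode, assumed names):
# f_tilde = encoder(hidden_state)                 # encode the current activation
# f_safe  = steer_to_polytope(f_tilde, phi, xi)   # correct it if it left the polytope
# ...continue generation from the corrected representation...
```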
Experiments & Results
The researchers evaluated SaP on three major models: Llama2-7B, Ministral-8B, and Qwen2-1.5B. They tested the system against a battery of adversarial attacks using the HarmBench framework.
Defense Against Adversarial Attacks
The primary metric here is Attack Success Rate (ASR)—lower is better. However, a defense is useless if it destroys the model’s intelligence. So, they also tracked MMLU Accuracy (general knowledge)—higher is better.

The results in Figure 2 are striking:
- Llama2-7B: The original model had an ASR of ~13%. SaP reduced this to 0.26% while maintaining identical MMLU scores.
- Ministral-8B: ASR dropped from ~56% to 3.25%.
- Comparison: SaP outperforms other methods like “Response Check” or “SmoothLLM.” While some baselines (like Rejection Sampling) offer good safety, they often hurt the model’s general utility (as seen in the Ministral plot where the purple cross drops significantly in accuracy).
SaP achieves the “holy grail” of safety: it stops attacks without making the model stupid.
The Importance of the Concept Encoder
Recall the Concept Encoder—the layer added to untangle features. Is it actually necessary? The ablation study below gives a definitive “Yes.”

For Llama2 and Ministral, removing the Concept Encoder (the red bars) causes the defense to collapse. Ministral’s ASR jumps to over 50% without it. This confirms that raw activation spaces are too messy to draw simple linear safety boundaries around; we need that disentanglement step.
Interpretability: Peering Inside the Box
One of the strongest arguments for SaP is interpretability. Because the “Safe Set” is a polytope, each face (facet) of the shape represents a specific constraint. We can analyze what triggers these constraints.
The researchers calculated the Mutual Information between specific facets and various categories of harm (like Violence, Fraud, or Child Abuse).

The difference is night and day. Without the encoder (left), facets are “polysemantic”—Facet 36 might trigger for eight different types of harm. With the encoder (right), we see a cleaner diagonalization. This implies specialization.
- Facet 7 might be the “Kidnapping detector.”
- Facet 26 might be the “Bullying detector.”
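As a rough illustration of how such a facet-versus-category analysis can be run, one can binarize facet scores (“did this facet fire?”) and compute mutual information against each harm category. The sketch below uses scikit-learn; the threshold and variable names are assumptions:

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def facet_category_mi(facet_scores, categories, threshold=0.0):
    """facet_scores: (N, K) array of signed facet scores for N unsafe prompts
       categories:   (N,)   array of harm-category labels (e.g. "Fraud", "Violence")
    Returns a (K, C) matrix of mutual information between "facet k fired"
    and "prompt belongs to category c", plus the category order."""
    fired = (facet_scores > threshold).astype(int)   # 1 if facet k was crossed
    cats = np.unique(categories)
    mi = np.zeros((facet_scores.shape[1], len(cats)))
    for k in range(facet_scores.shape[1]):
        for c, cat in enumerate(cats):
            is_cat = (categories == cat).astype(int)
            mi[k, c] = mutual_info_score(is_cat, fired[:, k])
    return mi, cats
```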
This is further validated by looking at KL Divergence when masking specific words.

As shown in Figure 5, Facet 7 stays quiet for terms related to abuse, sex, or killing, but screams when it sees “kidnap.” This granularity allows human auditors to understand exactly why a model blocked a response. It wasn’t just a generic “unsafe” flag; it was a specific geometric boundary related to kidnapping that was crossed.
How Many Facets Do We Need?
A polytope can have arbitrarily many facets. How many are needed to define “safety”?

The data suggests that LLM safety is not infinitely complex. With just 20 to 30 facets (linear constraints), the model achieves near-perfect defense performance. Adding more constraints beyond this point yields diminishing returns. This is promising for efficiency—checking 30 linear constraints during inference is computationally cheap.
Conclusion & Implications
The Safety Polytope (SaP) represents a paradigm shift in AI safety. Rather than treating safety as a fuzzy reward to be maximized, it treats it as a hard geometric boundary to be respected.
Key Takeaways:
- Post-Hoc Control: SaP works on pre-trained models. You don’t need to re-run expensive training processes to enforce new safety rules.
- Geometric Defense: By modeling safety as a polytope in the latent space, we can mathematically guarantee that the model’s representation lies within the safe region before a token is generated.
- Untangling Meaning: The Concept Encoder is vital. It proves that to control a model safely, we must first disentangle its internal representations.
- Interpretability: SaP transforms the “black box” of safety filtering into a transparent set of rules. We can identify exactly which facet is responsible for blocking specific types of harm.
The Road Ahead
While SaP shows immense promise, it is not without limitations. The authors note that for some models, aggressive steering can sometimes lead to incoherent outputs (though general benchmarks remain high). Furthermore, the current approach relies on supervised labels to learn the polytope.
Future work lies in unsupervised constraint learning—can the model figure out the shape of safety just by observing safe human conversations? Additionally, leveraging more advanced geometric shapes beyond linear polytopes could provide even tighter, more nuanced defenses.
As we move toward autonomous agents and more powerful LLMs, methods like SaP offer a blueprint for guaranteed safe AI. By embedding safety into the very geometry of the model’s thinking process, we move one step closer to systems that are not just smart, but reliably and transparently safe.