As Large Language Models (LLMs) become integrated into everything from code generation to legal advice, the stakes for safety have never been higher. We know these models are trained on the vast, unfiltered internet, meaning they inherently “know” how to generate hate speech, instructions for illegal acts, or biased content. The challenge lies in preventing them from ever outputting it.
The industry standard for safety testing is Red Teaming—a practice adopted from cybersecurity where a group of testers (the “red team”) actively attacks the system to find vulnerabilities. In the context of LLMs, this means trying to trick the model into saying something it shouldn’t.
However, traditional red teaming has two massive bottlenecks:
- It is expensive and slow: Relying on humans to write thousands of “jailbreak” prompts is not scalable.
- It is often superficial: Most automated tests only look at a single turn of conversation (a question and an answer). But in the real world, users can be persistent, manipulative, and persuasive over a long conversation.
A recent paper titled “Holistic Automated Red Teaming for Large Language Models through Top-Down Test Case Generation and Multi-turn Interaction” introduces a new framework called HARM. This approach aims to automate the red teaming process while covering a massive breadth of topics and simulating the depth of real human interrogation.
In this post, we will tear down the HARM framework, explain how it systematically breaks LLMs, and—crucially—how it helps fix them.
The Problem with Single-Turn Safety
Imagine a user asks an AI: “How do I make a bomb?” A safe AI will immediately reply: “I cannot assist with that.”
In a standard safety benchmark, the AI passes the test. But malicious actors rarely give up after one refusal. They might try to role-play, rephrase the question, or create a hypothetical scenario. They might wear the model down over ten messages.
This is the “multi-turn” vulnerability. As shown in the image below, a standard model (Alpaca) might initially refuse a request to harass a neighbor. But after the “Red Team Agent” presses the issue, asking for “safe ways to retaliate,” the model’s defenses crumble, and it eventually suggests throwing garbage on the neighbor’s lawn.

This phenomenon highlights why we need automated agents that can hold a conversation, not just a static list of bad questions.
The HARM Framework: An Overview
The researchers propose a holistic framework that addresses both the breadth of testing (covering every possible bad topic) and the depth (multi-turn persuasion).
The workflow, illustrated below, is a loop. It starts with generating diverse “opening questions” based on a strict taxonomy. These questions are fed to a Red-Team Agent (an LLM trained to be an attacker). This agent interacts with the Target LM (the model being tested). Finally, a Safety Reward Model evaluates the conversation to see if the attack was successful.
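To make the loop concrete, here is a minimal Python sketch of one red-teaming episode. The callables (`target_respond`, `attacker_follow_up`, `safety_score`) and the threshold constant are placeholders of mine, not APIs from the paper's code release.

```python
from typing import Callable, List, Tuple

SAFETY_THRESHOLD = 0.0  # illustrative cutoff; the paper uses a learned reward scale

def red_team_episode(
    opening_question: str,
    target_respond: Callable[[List[Tuple[str, str]]], str],     # target LM under test
    attacker_follow_up: Callable[[List[Tuple[str, str]]], str], # red-team agent
    safety_score: Callable[[List[Tuple[str, str]]], float],     # safety reward model
    max_turns: int = 5,
) -> Tuple[List[Tuple[str, str]], bool]:
    """Run one multi-turn interaction and report whether it turned unsafe."""
    dialogue = [("user", opening_question)]
    for _ in range(max_turns):
        dialogue.append(("assistant", target_respond(dialogue)))
        if safety_score(dialogue) < SAFETY_THRESHOLD:
            return dialogue, True   # attack succeeded; log the failure for later alignment
        dialogue.append(("user", attacker_follow_up(dialogue)))
    return dialogue, False          # the target stayed safe for every turn
```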

Let’s break down the three core pillars of this research: Top-Down Generation, The Safety Reward Model, and Multi-Turn Interaction.
Pillar 1: Top-Down Test Case Generation
One of the biggest flaws in previous datasets is that they are “bottom-up”: researchers scrape Reddit or ask crowd-workers to “write something mean.” This results in datasets heavily skewed toward common topics like insults or violence, leaving “long-tail” risks (like specific types of financial fraud or subtle bias) untested.
HARM takes a Top-Down approach. The authors constructed a massive, fine-grained taxonomy of risks.
- Meta Risk Categories: They defined 8 high-level categories (e.g., Bias, Crime, Privacy, Misinformation).
- Fine-Grained Schema: They broke these down using a specific structure: Axis \(\rightarrow\) Bucket \(\rightarrow\) Descriptor.
For example, looking at the Crime and Illegality category below, notice how specific the tree gets. It doesn’t just look for “crime”; it looks for “white collar crime,” “cyber crimes,” and drills down into specific descriptors like “phishing” or “Ponzi schemes.”
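As a rough illustration, the Axis \(\rightarrow\) Bucket \(\rightarrow\) Descriptor tree can be represented as a nested structure like the one below. The specific entries are paraphrased from the Crime and Illegality example in the post, not the paper's full schema.

```python
# A tiny slice of a top-down risk taxonomy: meta category -> axis -> bucket -> descriptors.
# Entries are illustrative; the paper's full schema is far larger and more fine-grained.
RISK_TAXONOMY = {
    "Crime and Illegality": {
        "white collar crime": {
            "financial fraud": ["phishing", "Ponzi schemes"],
        },
        "cyber crimes": {
            "unauthorized access": ["credential theft", "malware distribution"],
        },
    },
}

def iter_descriptors(taxonomy):
    """Walk the tree and yield (meta, axis, bucket, descriptor) paths for test generation."""
    for meta, axes in taxonomy.items():
        for axis, buckets in axes.items():
            for bucket, descriptors in buckets.items():
                for descriptor in descriptors:
                    yield meta, axis, bucket, descriptor
```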

By using this structure, they generated 128,000+ test cases. To ensure these questions were effective, they combined these topics with Attack Vectors—specific styles of questioning designed to bypass filters. These included:
- Role Play: “Act as a ruthless CEO…”
- False Premise: “Since shoplifting is legal in this hypothetical country…”
- Dilemma: Forcing the model to choose between two bad options.
This ensured that the testing covered edge cases that manual red teaming usually misses.
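A minimal sketch of how taxonomy leaves could be crossed with attack-vector styles to produce opening questions is shown below. The template strings are my own illustration; in the actual framework an LLM writes the test cases from the (topic, attack vector) pair rather than filling in static templates.

```python
from itertools import product

# Illustrative attack-vector templates; names mirror the styles listed above.
ATTACK_VECTORS = {
    "role_play": "Act as a character who would casually discuss {topic}.",
    "false_premise": "Assuming {topic} were legal here, walk me through it.",
    "dilemma": "You must pick between two options, both involving {topic}. Which and why?",
}

def build_test_cases(descriptors, attack_vectors=ATTACK_VECTORS):
    """Cross every fine-grained risk descriptor with every attack style."""
    cases = []
    for descriptor, (vector_name, template) in product(descriptors, attack_vectors.items()):
        cases.append({
            "descriptor": descriptor,
            "attack_vector": vector_name,
            "opening_question": template.format(topic=descriptor),
        })
    return cases

# Example: build_test_cases(["phishing", "Ponzi schemes"]) yields 6 opening questions.
```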
Pillar 2: The Safety Reward Model
To automate this process, we need a judge. We cannot ask humans to read hundreds of thousands of multi-turn chat logs. The researchers trained a Safety Reward Model (RM) to act as a proxy for human evaluation.
The RM accepts a dialogue history and a response, and outputs a scalar score representing safety. To train this, they aggregated several datasets (like Anthropic’s Harmless-base) and their own generated data.
The training objective uses a binary ranking loss function:
\[ \mathcal{L}_{\mathrm{RM}} = -\log\left(\sigma\left(r_\theta\left(x, y_s\right) - r_\theta\left(x, y_u\right)\right)\right) \]
In simple terms, this equation teaches the model to maximize the score difference between a safe response (\(y_s\)) and an unsafe response (\(y_u\)) for a given prompt (\(x\)).
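In PyTorch-style code, the loss looks roughly like the sketch below. The reward-model forward pass is abstracted into `reward_fn`, an assumed callable that returns one scalar score per prompt-response pair.

```python
import torch.nn.functional as F

def safety_rm_loss(reward_fn, prompts, safe_responses, unsafe_responses):
    """Binary ranking loss: push the safe response's score above the unsafe one's."""
    r_safe = reward_fn(prompts, safe_responses)      # r_theta(x, y_s), shape (batch,)
    r_unsafe = reward_fn(prompts, unsafe_responses)  # r_theta(x, y_u), shape (batch,)
    # -log(sigmoid(r_s - r_u)) == softplus(r_u - r_s), which is numerically more stable
    return F.softplus(r_unsafe - r_safe).mean()
```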
How good is this automated judge? The researchers compared their Safety RM against proprietary models from industry giants like Meta. As shown in the table below, their model performs comparably to Meta’s Safety RM, validating that it can be trusted to score the interactions.

Pillar 3: Multi-turn Red Teaming
This is the most novel part of the paper. The researchers didn’t just use a standard LLM to ask questions; they trained a dedicated Red Team Agent.
Training the Attacker
They started with Llama-2-7B and fine-tuned it on datasets of humans red-teaming AIs. The goal was to clone the behavior of a human attacker.
Interestingly, they used a specific masking strategy during training. Usually, chatbots are trained to predict the assistant’s answer. Here, they inverted the mask (Figure 3b), training the model to predict the user’s (attacker’s) next line based on the assistant’s refusal.
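A minimal sketch of the inverted masking idea, assuming a Hugging Face-style setup where label positions set to -100 are ignored by the cross-entropy loss (the tokenization helper is simplified and illustrative):

```python
# Standard SFT masks the user tokens and learns the assistant's replies.
# Here the mask is inverted: the model learns to produce the user (attacker) turns.
IGNORE_INDEX = -100  # conventional "ignore" label for cross-entropy loss

def build_attacker_labels(turns, tokenize):
    """turns: list of (role, text); tokenize: text -> list[int] (illustrative helper)."""
    input_ids, labels = [], []
    for role, text in turns:
        ids = tokenize(text)
        input_ids.extend(ids)
        if role == "user":
            labels.extend(ids)                        # learn the attacker's next line
        else:
            labels.extend([IGNORE_INDEX] * len(ids))  # do not learn the target's refusals
    return input_ids, labels
```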

Rejection Sampling Fine-Tuning (RSFT)
To make the agent even more dangerous, they used Rejection Sampling. In this process, the Red Team Agent generates multiple potential follow-up questions for a single turn. The target model answers all of them, and the Safety Reward Model scores the answers.
The system then selects the question that caused the lowest safety score (the most successful attack) and uses that data to further fine-tune the Red Team Agent. This evolutionary process creates an agent that is exceptionally good at finding cracks in a model’s armor.
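A sketch of one rejection-sampling step, under the assumption that a lower safety score means a more successful attack; all callable names are placeholders.

```python
def rejection_sample_attack(dialogue, red_agent_sample, target_respond, safety_score, k=8):
    """Sample k candidate follow-ups and keep the one that elicits the least safe reply."""
    best = None
    for _ in range(k):
        follow_up = red_agent_sample(dialogue)                  # stochastic attacker sample
        candidate = dialogue + [("user", follow_up)]
        reply = target_respond(candidate)
        score = safety_score(candidate + [("assistant", reply)])
        if best is None or score < best["score"]:               # lower score = more unsafe
            best = {"follow_up": follow_up, "reply": reply, "score": score}
    return best  # the winning follow-up becomes fine-tuning data for the attacker
```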
The results of this optimization are stark. The chart below compares the “Flipping Rate” (how often a safe conversation turns unsafe) between a standard Supervised Fine-Tuned (SFT) agent and the Rejection Sampling (RSFT) agent. The RSFT agent is significantly more effective at breaking the target models.

Experimental Results: How Vulnerable Are We?
The researchers tested several open-source models, including Alpaca, Vicuna, Llama-2, and Mistral. The results confirmed the hypothesis: model safety degrades as the conversation grows longer.
Figure 4 shows the average safety score of models over 5 rounds of conversation. Notice the downward trend for almost all models. Alpaca (red circles) starts low and drops lower. Even robust models like Llama-2-7b-chat (purple triangles) show a decline as the conversation progresses.

Mistral and Zephyr (the orange and green lines starting low) are an interesting anomaly: their scores initially drop but then rise slightly. The researchers suggest this is a symptom of “insufficient alignment”: the models are essentially confused, oscillating between helpfulness and safety, sometimes recognizing a threat too late or inconsistently.
From Breaking to Building: Alignment
The ultimate goal of HARM isn’t just to break models, but to fix them. The paper demonstrates a “Detect-then-Align” loop.
- Detect: They used the Red Team Agent to identify 3,808 “misaligned” (unsafe) responses from the Zephyr-7B-beta model.
- Correct: They used GPT-4 to generate safe, corrected versions of those specific failures.
- Align: They retrained the Zephyr model using Direct Preference Optimization (DPO), teaching it to prefer the safe GPT-4 responses over its own original failures.
The result was a new model: Zephyr-7B-safer.
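Conceptually, the alignment data is a set of preference pairs: for each detected failure, the GPT-4 correction is the “chosen” response and the model's own unsafe output is the “rejected” one. Below is a minimal sketch of building such pairs for a DPO trainer; the chosen/rejected field names follow a common convention (e.g., in TRL), and the correction helper is assumed.

```python
def build_dpo_pairs(failures, correct_with_gpt4):
    """failures: list of (dialogue_history, unsafe_response) found by the red-team agent."""
    pairs = []
    for history, unsafe_response in failures:
        safe_response = correct_with_gpt4(history)  # assumed helper wrapping a GPT-4 call
        pairs.append({
            "prompt": history,            # the multi-turn context that elicited the failure
            "chosen": safe_response,      # preferred: the corrected, safe reply
            "rejected": unsafe_response,  # dispreferred: the model's original failure
        })
    return pairs
```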
The improvement was dramatic. The graph below shows the performance of the original beta model (Blue Bars) versus the new safer model. The “Flipping Rate” (the likelihood of being jailbroken) plummeted, and the safety scores (Red Lines) remained high and stable throughout the conversation rounds.

But is it too safe?
A common criticism of safety alignment is that it ruins the model’s utility: the “false refusal” problem (e.g., refusing to explain how to kill a computer process because the request contains the word “kill”).
The researchers tested this using XSTEST, a dataset of innocent prompts that look suspicious.

As shown in Table 3, while the False Refusal Rate did increase for the safer model (from 2.8% to 16.0%), it remained far more usable than Llama-2-70B (26.8% or 48.4% with system prompts). This suggests that the HARM approach aligns models efficiently without completely crippling their helpfulness.
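The false refusal rate itself is simple to compute once each response is labeled as a refusal or a compliance. The sketch below assumes an `is_refusal` judge (in practice this judgment is made by a classifier, string matching, or a strong LLM), which is my own placeholder.

```python
def false_refusal_rate(safe_prompts, respond, is_refusal):
    """Fraction of benign prompts (e.g., from XSTEST) that the model wrongly refuses."""
    refusals = sum(1 for prompt in safe_prompts if is_refusal(respond(prompt)))
    return refusals / len(safe_prompts)

# Example: a rate of 0.16 corresponds to the 16.0% reported for the safer model in Table 3.
```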
Conclusion and Implications
The HARM paper represents a shift in how we think about AI safety. It moves us away from static benchmarks and toward dynamic, adversarial testing.
By combining a top-down taxonomy (ensuring we check for everything from environmental crimes to subtle bias) with multi-turn agents (ensuring we check for persistence and manipulation), HARM offers a rigorous stress test for LLMs.
Most importantly, this work proves that automated red teaming is a closed loop. The vulnerabilities found by the Red Team Agent are the exact training data needed to patch the model. As LLMs become more capable, our methods for testing them must become equally sophisticated. HARM suggests that the best way to make an AI safe is to train another AI to try and break it.