As Large Language Models (LLMs) become integrated into everything from code generation to legal advice, the stakes for safety have never been higher. We know these models are trained on the vast, unfiltered internet, meaning they inherently “know” how to generate hate speech, instructions for illegal acts, or biased content. The challenge lies in preventing them from ever outputting it.

The industry standard for safety testing is Red Teaming—a practice adopted from cybersecurity where a group of testers (the “red team”) actively attacks the system to find vulnerabilities. In the context of LLMs, this means trying to trick the model into saying something it shouldn’t.

However, traditional red teaming has two massive bottlenecks:

  1. It is expensive and slow: Relying on humans to write thousands of “jailbreak” prompts is not scalable.
  2. It is often superficial: Most automated tests only look at a single turn of conversation (a question and an answer). But in the real world, users can be persistent, manipulative, and persuasive over a long conversation.

A recent paper titled “Holistic Automated Red Teaming for Large Language Models through Top-Down Test Case Generation and Multi-turn Interaction” introduces a new framework called HARM. This approach aims to automate the red teaming process while covering a massive breadth of topics and simulating the depth of real human interrogation.

In this post, we will tear down the HARM framework, explain how it systematically breaks LLMs, and—crucially—how it helps fix them.

The Problem with Single-Turn Safety

Imagine a user asks an AI: “How do I make a bomb?” A safe AI will immediately reply: “I cannot assist with that.”

In a standard safety benchmark, the AI passes the test. But malicious actors rarely give up after one refusal. They might try to role-play, rephrase the question, or create a hypothetical scenario. They might wear the model down over ten messages.

This is the “multi-turn” vulnerability. As shown in the image below, a standard model (Alpaca) might initially refuse a request to harass a neighbor. But after the “Red Team Agent” presses the issue, asking for “safe ways to retaliate,” the model’s defenses crumble, and it eventually suggests throwing garbage on the neighbor’s lawn.

Excerpt from the dialogue between our red team agent and Alpaca (Taori et al., 2O23), demonstrating a continuous increase in the harmfulness of Alpaca’s responses over multiple rounds.

This phenomenon highlights why we need automated agents that can hold a conversation, not just a static list of bad questions.

The HARM Framework: An Overview

The researchers propose a holistic framework that addresses both the breadth of testing (covering every possible bad topic) and the depth (multi-turn persuasion).

The workflow, illustrated below, is a loop. It starts with generating diverse “opening questions” based on a strict taxonomy. These questions are fed to a Red-Team Agent (an LLM trained to be an attacker). This agent interacts with the Target LM (the model being tested). Finally, a Safety Reward Model evaluates the conversation to see if the attack was successful.

The overview of our HARM framework. The red-team agent utilizes top-down generated test cases as opening questions and engages in multiple rounds of dialogue with the target language model,aiming to minimize the safety score of each round of the target LM’s responses.

Let’s break down the three core pillars of this research: Top-Down Generation, The Safety Reward Model, and Multi-Turn Interaction.

Pillar 1: Top-Down Test Case Generation

One of the biggest flaws in previous datasets is that they are “bottom-up.” Researchers scrape Reddit or asking crowd-workers to “write something mean.” This results in datasets heavily skewed toward common topics like insults or violence, leaving “long-tail” risks (like specific types of financial fraud or subtle bias) untested.

HARM takes a Top-Down approach. The authors constructed a massive, fine-grained taxonomy of risks.

  1. Meta Risk Categories: They defined 8 high-level categories (e.g., Bias, Crime, Privacy, Misinformation).
  2. Fine-Grained Schema: They broke these down using a specific structure: Axis \(\rightarrow\) Bucket \(\rightarrow\) Descriptor.

For example, looking at the Crime and Illegality category below, notice how specific the tree gets. It doesn’t just look for “crime”; it looks for “white collar crime,” “cyber crimes,” and drills down into specific descriptors like “phishing” or “Ponzi schemes.”

Fine-grained schema for the risk category Crime and Illegality, employing a three-tier structure of Axis-Bucket-Descriptor. Due to space constraints, only a portion of the content is displayed here.

By using this structure, they generated 128,000+ test cases. To ensure these questions were effective, they combined these topics with Attack Vectors—specific styles of questioning designed to bypass filters. These included:

  • Role Play: “Act as a ruthless CEO…”
  • False Premise: “Since shoplifting is legal in this hypothetical country…”
  • Dilemma: Forcing the model to choose between two bad options.

This ensured that the testing covered edge cases that manual red teaming usually misses.

Pillar 2: The Safety Reward Model

To automate this process, we need a judge. We cannot ask humans to read hundreds of thousands of multi-turn chat logs. The researchers trained a Safety Reward Model (RM) to act as a proxy for human evaluation.

The RM accepts a dialogue history and a response, and outputs a scalar score representing safety. To train this, they aggregated several datasets (like Anthropic’s Harmless-base) and their own generated data.

The training objective uses a binary ranking loss function:

\[ \mathcal { L } _ { \mathrm { R M } } = - \log \left( \sigma \left( r _ { \theta } \left( x , y _ { s } \right) - r _ { \theta } \left( x , y _ { u } \right) \right) \right) \]

Equation 1: Binary ranking loss function

In simple terms, this equation teaches the model to maximize the score difference between a safe response (\(y_s\)) and an unsafe response (\(y_u\)) for a given prompt (\(x\)).

How good is this automated judge? The researchers compared their Safety RM against proprietary models from industry giants like Meta. As shown in the table below, their model performs comparably to Meta’s Safety RM, validating that it can be trusted to score the interactions.

Table 2: Our safety reward model (RM) performance compared to those from the Llama-2 technical report (Touvron et al., 2023).

Pillar 3: Multi-turn Red Teaming

This is the most novel part of the paper. The researchers didn’t just use a standard LLM to ask questions; they trained a dedicated Red Team Agent.

Training the Attacker

They started with Llama-2-7B and fine-tuned it on datasets of humans red-teaming AIs. The goal was to clone the behavior of a human attacker.

Interestingly, they used a specific masking strategy during training. Usually, chatbots are trained to predict the assistant’s answer. Here, they inverted the mask (Figure 3b), training the model to predict the user’s (attacker’s) next line based on the assistant’s refusal.

Figure 3: (a) Masking strategy for supervised finetuning of a general assistant. (b) Masking strategy for supervised fine-tuning of our red-team agent.

Rejection Sampling Fine-Tuning (RSFT)

To make the agent even more dangerous, they used Rejection Sampling. In this process, the Red Team Agent generates multiple potential follow-up questions for a single turn. The target model answers all of them, and the Safety Reward Model scores the answers.

The system then selects the question that caused the lowest safety score (the most successful attack) and uses that data to further fine-tune the Red Team Agent. This evolutionary process creates an agent that is exceptionally good at finding cracks in a model’s armor.

The results of this optimization are stark. The chart below compares the “Flipping Rate” (how often a safe conversation turns unsafe) between a standard Supervised Fine-Tuned (SFT) agent and the Rejection Sampling (RSFT) agent. The RSFT agent is significantly more effective at breaking the target models.

Flipping Rates with SFT and RSFTRed-Team Agents Figure 6: Comparison of flipping rates between two redteam agent versions (SFT vs. RSFT) in multi-turn red teaming across three models, with lighter bars indicating the magnitude of improvement in flipping rates.

Experimental Results: How Vulnerable Are We?

The researchers tested several open-source models, including Alpaca, Vicuna, Llama-2, and Mistral. The results confirmed the hypothesis: models degrade over time.

Figure 4 shows the average safety score of models over 5 rounds of conversation. Notice the downward trend for almost all models. Alpaca (red circles) starts low and drops lower. Even robust models like Llama-2-7b-chat (purple triangles) show a decline as the conversation progresses.

Figure 4: Average safety scores for different models across five dialogue rounds.

An interesting anomaly is Mistral and Zephyr (the orange and green lines starting low). Their scores initially drop but then slightly rise. The researchers suggest this is a symptom of “insufficient alignment”—the models are essentially confused, oscillating between helpfulness and safety, sometimes recognizing a threat too late or inconsistently.

From Breaking to Building: Alignment

The ultimate goal of HARM isn’t just to break models, but to fix them. The paper demonstrates a “Detect-then-Align” loop.

  1. Detect: They used the Red Team Agent to identify 3,808 “misaligned” (unsafe) responses from the Zephyr-7B-beta model.
  2. Correct: They used GPT-4 to generate safe, corrected versions of those specific failures.
  3. Align: They retrained the Zephyr model using Direct Preference Optimization (DPO), teaching it to prefer the safe GPT-4 responses over its own original failures.

The result was a new model: Zephyr-7B-safer.

The improvement was dramatic. The graph below shows the performance of the original beta model (Blue Bars) versus the new safer model. The “Flipping Rate” (the likelihood of being jailbroken) plummeted, and the safety scores (Red Lines) remained high and stable throughout the conversation rounds.

Figure 8: Blue bar graphs showing flipping rate changes with varying threshold values (T6 denotes a threshold of 6),and red line graphs illustrating the evolution of safety scores across different rounds.

But is it too safe?

A common criticism of safety alignment is that it ruins the model’s utility—the “false refusal” problem (e.g., refusing to explain how to kill a computer process because it contains the word “kill”).

The researchers tested this using XSTEST, a dataset of innocent prompts that look suspicious.

Table 3: False Refusal Rates of different models. Lower rates indicate better performance.Models with a‘-sys’ suffix denote the use of a safety-emphasising system prompt during inference.

As shown in Table 3, while the False Refusal Rate did increase for the safer model (from 2.8% to 16.0%), it remained far more usable than Llama-2-70B (26.8% or 48.4% with system prompts). This suggests that the HARM approach aligns models efficiently without completely crippling their helpfulness.

Conclusion and Implications

The HARM paper represents a shift in how we think about AI safety. It moves us away from static benchmarks and toward dynamic, adversarial testing.

By combining a top-down taxonomy (ensuring we check for everything from environmental crimes to subtle bias) with multi-turn agents (ensuring we check for persistence and manipulation), HARM offers a rigorous stress test for LLMs.

Most importantly, this work proves that automated red teaming is a closed loop. The vulnerabilities found by the Red Team Agent are the exact training data needed to patch the model. As LLMs become more capable, our methods for testing them must become equally sophisticated. HARM suggests that the best way to make an AI safe is to train another AI to try and break it.