Introduction

We are living in the golden age of Large Language Models (LLMs). From ChatGPT to Llama, these models draft our emails, debug our code, and even write poetry. Yet, for all their brilliance, they suffer from a persistent and often dangerous flaw: hallucination. They can confidently state historical inaccuracies, invent legal precedents, or generate code that looks perfect but crashes immediately.

To solve this, the industry has largely turned to a “fight fire with fire” approach. We use the most powerful models available (like GPT-4) to police the output of other models. While effective, this creates a bottleneck. These “giant” models are closed-source, expensive to run via API, and computationally heavy.

But what if we didn’t need a giant? What if a smaller, open-source model—specifically trained to act as an autonomous agent—could do the job just as well, if not better?

This is the premise of the research paper “Small Agent Can Also Rock! Empowering Small Language Models as Hallucination Detector.” The researchers propose HaluAgent, a framework that empowers a relatively small 7-billion parameter model (Baichuan2-Chat) to actively detect hallucinations across text, code, and math. By treating the model not as a passive reader but as an active agent equipped with a toolbox, they demonstrate that size isn’t everything when you have the right strategy.

In this deep dive, we will unpack how HaluAgent works, the “trajectory tuning” that makes it smart, and the results that suggest small agents are ready for the big leagues.

The Hallucination Landscape

Before understanding the solution, we must define the problem. Hallucinations in LLMs aren’t just one thing. They generally fall into several categories:

  1. Factual Errors: Getting a date or name wrong in a QA session.
  2. Instruction Drift: Failing to follow constraints (e.g., “write a response in exactly 50 words”).
  3. Logical/Math Errors: Performing incorrect arithmetic in a word problem.
  4. Code Errors: Writing code that is syntactically wrong or uses non-existent libraries.
  5. Semantic Inconsistency: Contradicting itself within the same paragraph.

The Limitation of Current Detectors

Most existing methods for detecting these errors are limited. Some rely purely on the model’s “internal knowledge”—asking the model, “Are you sure?” This often fails because if the model didn’t know the fact in the first place, it likely won’t know it’s wrong.

Other methods use external tools (like Google Search) but bolt them onto massive models like GPT-4. This works, but it precludes developers with limited resources from deploying effective hallucination checkers.

Table 1: Comparison of different methods. Task Agnostic indicates whether the method works across tasks rather than being designed for a specific one; Fine Grained indicates whether it pinpoints the specific hallucinated sentences; Extensibility indicates whether the method can be extended to more tasks and tools.

As shown in Table 1 above, HaluAgent attempts to check all the boxes: it uses an open-source base model (Baichuan2-Chat), it is agnostic to the specific task (text vs. code), it uses tools, and it is extensible. Unlike methods such as SelfCheckGPT, which rely solely on the model’s own sampled outputs, or FacTool, which depends on GPT-4, HaluAgent aims for a sweet spot: high capability at low resource cost.

The HaluAgent Framework

The core innovation of this paper is shifting the perspective of hallucination detection from a classification task (True/False) to an agentic task (Plan, Act, Observe, Reflect).

Instead of simply feeding a sentence to an LLM and asking “Is this true?”, HaluAgent breaks the process down into a sophisticated pipeline. The architecture is designed to mimic how a human fact-checker works: we don’t just stare at a sentence until we know the truth; we break it down, search for evidence, do the math, and then form a conclusion.

Figure 1: Overview of HaluAgent. The left part shows the process of fine-tuning open-source models and detecting hallucinations; the right part illustrates the hallucination detection pipeline of HaluAgent.

Figure 1 provides the high-level roadmap. On the right side, you can see the detection pipeline. It operates in three distinct stages: Sentence Segmentation, Tool Selection & Verification, and Reflection.

Let’s break down the machinery under the hood.

1. The Multi-Functional Toolbox

A distinct feature of HaluAgent is that it acknowledges LLMs are bad at certain things (like multiplying large numbers) and shouldn’t be trusted to do them “in their head.” Instead, the agent is given a toolbox.

Table 5: Instructions of the toolbox in HaluAgent.

Table 5 details the specific tools provided to the agent:

  • Web Search: For verifying factual claims (Knowledge-based QA).
  • Calculator: For checking arithmetic in math problems.
  • Code Interpreter: For executing code snippets to see if they crash.
  • Word Count: For verifying length constraints in text generation.
  • Match (Semantic Checker): A specialized tool to check if a sentence contradicts the context.
  • System Tools: Utilities like split_text to break paragraphs down.

The agent doesn’t just randomly use these. It must learn which tool applies to which type of hallucination.
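
To make this concrete, a toolbox like this can be thought of as a mapping from tool names to callables, each paired with the natural-language instruction the agent sees in its prompt. The sketch below is purely illustrative (the class, field names, and registration pattern are assumptions, not the authors’ implementation):

```python
# Illustrative sketch of a tool registry; names and structure are assumed,
# not taken from the HaluAgent codebase.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    name: str
    instruction: str           # natural-language description shown in the agent's prompt
    run: Callable[[str], str]  # executes the tool on a string argument

def word_count(text: str) -> str:
    return str(len(text.split()))

def calculator(expression: str) -> str:
    # Restricted eval: arithmetic only, no builtins or names available.
    return str(eval(expression, {"__builtins__": {}}, {}))

TOOLBOX = {
    "word_count": Tool("word_count", "Count the number of words in a text.", word_count),
    "calculator": Tool("calculator", "Evaluate an arithmetic expression.", calculator),
    # web_search, code_interpreter, match, and split_text would be registered the same way.
}

print(TOOLBOX["calculator"].run("25 * 4"))  # -> 100
```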

2. The Three-Stage Workflow

The “brain” of HaluAgent follows a specific loop to ensure fine-grained detection.

Stage 1: Sentence Segmentation (Divide and Conquer)

LLM responses can be long and rambling, mixing facts, opinions, and code. Checking the whole block at once is prone to error. HaluAgent first uses split_text to chop the response into individual, self-contained sentences (or logical units). This isolates the claims, making them easier to verify one by one.
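
The paper does not spell out how split_text works internally, so the version below is only a rough stand-in based on sentence punctuation:

```python
import re

def split_text(response: str) -> list[str]:
    # Naive sentence splitter: break on ., !, or ? followed by whitespace.
    # The real split_text in HaluAgent may use more sophisticated segmentation.
    sentences = re.split(r"(?<=[.!?])\s+", response.strip())
    return [s for s in sentences if s]

print(split_text("Paris is the capital of France. 25 * 4 = 100. Trust me!"))
# ['Paris is the capital of France.', '25 * 4 = 100.', 'Trust me!']
```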

Stage 2: Tool Selection and Verification

This is where the “Agent” aspect shines. For every segmented sentence, the model decides: Does this need external verification?

  • If the sentence says “Paris is the capital of France,” it invokes web_search.
  • If the sentence says “25 * 4 = 100,” it invokes calculator.
  • If the sentence provides a Python function, it invokes code_interpreter.

The agent generates a “Thought” (reasoning what to do), performs an “Action” (calling the tool), and receives an “Observation” (the tool’s output).
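
Here is a toy version of that loop for a single sentence. The rule-based decide function stands in for the fine-tuned LLM, which in HaluAgent emits these Thought/Action decisions as text; everything below is a sketch, not the authors’ code:

```python
# Toy Thought -> Action -> Observation loop for one sentence.
# `decide` is a rule-based stand-in for the model's reasoning step.
import re

def calculator(expr: str) -> str:
    return str(eval(expr, {"__builtins__": {}}, {}))

def decide(sentence: str):
    """Stand-in for the model's Thought/Action step."""
    m = re.match(r"\s*([\d\s+\-*/.()]+)=\s*([\d.]+)", sentence)
    if m:
        return ("This sentence contains arithmetic; I should verify it.",
                "calculator", m.group(1), m.group(2))
    return ("No tool is needed for this sentence.", None, None, None)

def verify(sentence: str) -> dict:
    thought, action, arg, claimed = decide(sentence)           # Thought
    if action == "calculator":
        observation = calculator(arg)                          # Action -> Observation
        hallucinated = float(observation) != float(claimed)
    else:
        observation, hallucinated = "", False
    return {"sentence": sentence, "thought": thought,
            "evidence": observation, "hallucinated": hallucinated}

print(verify("25 * 4 = 100"))        # evidence '100', hallucinated False
print(verify("3 + 0.5 + 1.5 = 4"))   # evidence '5.0', hallucinated True
```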

Stage 3: Reflection and Memory

This is the most critical step for accuracy. Even with tools, an agent can get confused. HaluAgent employs a Memory Mechanism. It stores the result of every sentence check as a triple: (sentence, hallucination_label, supporting_evidence).

Once all sentences are checked, the agent enters the Reflection phase. It looks at the “local” evidence (does this specific fact match the search result?) and the “global” context (does this correct calculation make sense in the broader logic of the answer?). Only after this reflection does it output the final verdict.
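
In data-flow terms, the memory is just a list of those (sentence, label, evidence) triples that the reflection step re-reads before committing to a verdict. The sketch below compresses the reflection into a simple rule; the real agent performs it in natural language:

```python
# Sketch of the memory + reflection stage. Only the data flow
# (sentence, label, evidence) -> final verdict follows the paper;
# the rule-based reflection here is a simplification.
from typing import NamedTuple

class MemoryEntry(NamedTuple):
    sentence: str
    hallucinated: bool
    evidence: str

def reflect(memory: list[MemoryEntry]) -> dict:
    # Local check: which individual sentences failed verification?
    flagged = [m for m in memory if m.hallucinated]
    # Global check (simplified): the response is hallucinated if any sentence is.
    return {
        "response_hallucinated": bool(flagged),
        "hallucinated_sentences": [m.sentence for m in flagged],
        "evidence": [m.evidence for m in flagged],
    }

memory = [
    MemoryEntry("Paris is the capital of France.", False, "web_search: confirmed"),
    MemoryEntry("3 + 0.5 + 1.5 = 4", True, "calculator: 5.0"),
]
print(reflect(memory))
```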

How to Train a Small Giant: Trajectory Tuning

Here is the challenge: small open-source models (on the order of 7 billion parameters) are generally not capable enough out of the box to handle complex tool use, multi-step reasoning, and memory management. If you ask a standard small model to “use a calculator tool,” it might just hallucinate the calculator’s output!

To bridge this gap, the researchers used a technique called Trajectory Fine-Tuning.

Synthesizing “Expert” Behavior

Since they couldn’t rely on the small model to figure this out, they first used a “teacher” model—GPT-4. They fed GPT-4 the HaluAgent instructions and the diverse datasets (Knowledge QA, Math, Code).

GPT-4 acted as the perfect agent, generating step-by-step logs called trajectories. A trajectory looks like this:

  1. Observation: Input text received.
  2. Thought: “I need to split this text.”
  3. Action: split_text(...)
  4. Observation: [Sentence 1, Sentence 2…]
  5. Thought: “Sentence 1 is a historical fact. I will search Google.”
  6. Action: web_search(...) … and so on.

The researchers filtered these logs to ensure they only kept the high-quality ones where GPT-4 correctly identified the hallucinations.
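
Concretely, each kept trajectory is a sequence of (thought, action, observation) steps tied to a query and a response. The record below is hypothetical; the schema and field names are illustrative, and only the kind of content follows the paper:

```python
# Hypothetical structure of one filtered GPT-4 trajectory used for fine-tuning.
trajectory = {
    "query": "Who wrote 'Pride and Prejudice', and when was it published?",
    "response": "Pride and Prejudice was written by Jane Austen. It was published in 1820.",
    "steps": [
        {"thought": "I should split the response into sentences.",
         "action": "split_text(response)",
         "observation": ["Pride and Prejudice was written by Jane Austen.",
                         "It was published in 1820."]},
        {"thought": "Sentence 2 states a publication date; I should verify it.",
         "action": "web_search('Pride and Prejudice publication year')",
         "observation": "Pride and Prejudice was first published in 1813."},
        {"thought": "The claimed year (1820) contradicts the evidence (1813).",
         "action": "finish(hallucination=True, sentence=2)",
         "observation": None},
    ],
}
```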

Table 2: Statistics of synthetic detection trajectories.

As shown in Table 2, they didn’t need millions of examples. They synthesized roughly 2,000 trajectories across varied domains (WebQA, Math, Code, etc.). This is a surprisingly small dataset, which highlights the efficiency of this method.

Fine-Tuning the Student

Once the data was ready, they fine-tuned the Baichuan2-Chat model on these trajectories. The goal was to teach the small model to mimic the process of the teacher.

\[
\mathcal{L} = -\sum_{i=1}^{n} \log \Pr\left(t_i, a_i \mid c_i\right)
\]

The training objective, shown in the equation above, is a standard language modeling loss, but applied specifically to the agent’s Thoughts (\(t_i\)) and Actions (\(a_i\)), conditioned on the context (\(c_i\)). Essentially, the model is penalized if it fails to “think” or “act” exactly like the expert trajectory at each step.
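
In practice this amounts to ordinary supervised fine-tuning with loss masking: only the tokens belonging to the expert’s thoughts and actions contribute to the loss, while context and tool observations are masked out. A minimal PyTorch-style sketch (assumed, not the authors’ training code):

```python
import torch
import torch.nn.functional as F

def trajectory_loss(logits, labels, loss_mask):
    """Cross-entropy over thought/action tokens only.
    logits:    (seq_len, vocab) model predictions at each position
    labels:    (seq_len,)       target token ids
    loss_mask: (seq_len,)       1.0 for thought/action tokens, 0.0 for context/observations
    """
    per_token = F.cross_entropy(logits, labels, reduction="none")
    return (per_token * loss_mask).sum() / loss_mask.sum().clamp(min=1)

# Toy shapes only; in reality the logits come from the Baichuan2-Chat forward pass.
logits = torch.randn(6, 32000)
labels = torch.randint(0, 32000, (6,))
mask = torch.tensor([0.0, 0.0, 1.0, 1.0, 1.0, 0.0])  # supervise only positions 2-4
print(trajectory_loss(logits, labels, mask))
```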

Experimental Results: Does it Work?

The researchers tested HaluAgent against several baselines. The most important comparisons were:

  1. Baichuan2-Chat (Base): The small model without agent tuning (using a simple prompt).
  2. GPT-4 (Prompt): The giant model using a standard “Is this a hallucination?” prompt.
  3. GPT-4 (Pipeline): The giant model using the full HaluAgent tool pipeline.

They tested on both In-domain datasets (similar to training data) and Out-of-domain datasets (new types of questions unseen during training).

Response-Level Detection

The first test was binary: Can the model correctly flag a response as “Hallucinated” or “Not Hallucinated”?

Table 3: Evaluation results (Accuracy and F1 score) on in-domain and out-of-domain datasets. Bold denotes the best method among open-source models; underline denotes the best method among closed-source models.

Table 3 reveals several striking insights:

  1. Small Models Struggle Alone: Look at the “Baichuan2-Chat 7B” column. Its performance is poor (e.g., 49% accuracy on Ape210K math). Without tools or training, the small model is guessing.
  2. HaluAgent Rivals GPT-4: Now look at the “HaluAgent 7B” column. On the WebQA dataset, it jumps to 80.00% accuracy, almost matching GPT-4’s 82.00%. On the Math dataset (Ape210K), it achieves 72.00%, effectively matching GPT-4’s 72.33%.
  3. Beating the Giant? On the “WordCnt” dataset (checking length constraints), HaluAgent achieves 100% accuracy, vastly outperforming the standard GPT-4 Prompt (56%). This is because HaluAgent learned to use the word_count tool, whereas GPT-4 (in prompt mode) tried to count words “mentally”—a known weakness of LLMs.
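
The WordCnt gap is easy to appreciate once you see how trivial exact counting is for a tool compared with an LLM counting “mentally.” A minimal sketch (the function name is illustrative, not the paper’s implementation):

```python
# Why a word_count tool wins: exact counting is trivial in code,
# but notoriously unreliable when an LLM attempts it "in its head".
def satisfies_length_constraint(response: str, target: int) -> bool:
    return len(response.split()) == target

response = "word " * 47  # a response that misses a 50-word constraint
print(len(response.split()), satisfies_length_constraint(response, 50))  # 47 False
```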

Sentence-Level Granularity

It’s not enough to say “this paragraph is wrong.” A good detector must point to the exact sentence.

Table 4: Evaluation results of sentence-level detection on the four subsets of FacTool.

Table 4 shows the results on the FacTool benchmark. HaluAgent (especially the 13B version) shows massive improvements over the base model.

  • Math: The base 7B model had an F1 score of 19.51%. HaluAgent 7B skyrocketed to 68.80%.
  • Science: The improvement is even more dramatic, from 17.54% to 94.16%.

This confirms that the “Divide and Conquer” strategy (segmentation + verification) is highly effective for precise error localization.

Case Study: Seeing HaluAgent in Action

To truly appreciate the difference, let’s look at a qualitative comparison.

Figure 3: Case study comparing GPT-4 with a simple prompt, a single-tool approach, and the full HaluAgent framework.

Figure 3 illustrates a tricky math word problem about average speed.

  • The Hallucination: The answer claims “3 hours + 0.5 hours + 1.5 hours = 4 hours.” (The correct sum is 5 hours.)
  • GPT-4 (Simple Prompt): It reads the text and says “No” (no hallucination found). It missed the arithmetic error.
  • Single Tool Approach: Even using just a search engine didn’t help, because the error was calculation-based, not fact-based.
  • HaluAgent:
  1. It split the text.
  2. It recognized a calculation was needed.
  3. It invoked the calculator tool on “3 + 0.5 + 1.5”.
  4. It compared the tool output (5.0) with the text (4).
  5. It flagged the error accurately.

This demonstrates the power of autonomous tool selection. HaluAgent didn’t just search; it recognized which tool was needed for the specific sentence.

Extensibility: Learning New Tricks

A major concern with specialized models is rigidity. Can HaluAgent only do what it was trained to do? To test this, the researchers added two new tools after training: a Calendar tool and a Translator tool. They simply provided instructions in the prompt (zero-shot) without re-training the model.
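
Before looking at the numbers in Figure 2, here is roughly what that zero-shot tool injection could look like: describe the new tool in the instruction block the agent already reads, and register a callable for it. The prompt template and tool description below are hypothetical; only the idea follows the paper:

```python
# Sketch: adding a new tool at inference time, without retraining.
import datetime

def calendar(date_string: str) -> str:
    """Return the weekday for an ISO date, e.g. '2024-06-01' -> 'Saturday'."""
    return datetime.date.fromisoformat(date_string).strftime("%A")

NEW_TOOL_INSTRUCTION = (
    "calendar(date): returns the day of the week for a given ISO date. "
    "Use it to verify claims about dates and weekdays."
)

def build_agent_prompt(existing_instructions: str) -> str:
    # The new tool's description is simply appended to the instructions the agent already reads.
    return existing_instructions + "\n" + NEW_TOOL_INSTRUCTION

print(calendar("2024-06-01"))  # Saturday
```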

Figure 2: The usage rate of new tools and the proportion of successful detection.

Figure 2 shows the results. HaluAgent successfully adopted the new tools.

  • Calendar: It achieved 100% tool usage and detection success.
  • Translator: It used the tool 98% of the time with ~90% success.

This proves that the “Agent” training didn’t just memorize the specific tools in the training set (like calculator or search). It learned the concept of using tools based on instructions, allowing it to generalize to new capabilities.

Conclusion

The paper “Small Agent Can Also Rock!” presents a compelling argument for the democratization of AI safety. It challenges the assumption that we need the largest, most expensive proprietary models to keep AI in check.

By combining a fine-grained detection framework (Split, Verify, Reflect) with trajectory fine-tuning, the researchers created HaluAgent. This system allows a modest 7B parameter model to outperform its base version significantly and often rival GPT-4.

The implications are significant for students and developers:

  1. Cost-Efficiency: Hallucination detection can be run locally and cheaply with open-source models on modest hardware.
  2. Specialization: Small models, when fine-tuned as agents, can punch well above their weight class.
  3. Trust: By using explicit tools (calculators, search), the model’s verification process is transparent and explainable, unlike the “black box” confidence scores of larger models.

As we move forward, the “Agent” paradigm—where models actively use tools rather than just predicting tokens—seems to be the key to unlocking reliability in AI, regardless of the model size.