Introduction
We are living in the golden age of Large Language Models (LLMs). From ChatGPT to Llama, these models draft our emails, debug our code, and even write poetry. Yet, for all their brilliance, they suffer from a persistent and often dangerous flaw: hallucination. They can confidently state historical inaccuracies, invent legal precedents, or generate code that looks perfect but crashes immediately.
To solve this, the industry has largely turned to a “fight fire with fire” approach. We use the most powerful models available (like GPT-4) to police the output of other models. While effective, this creates a bottleneck. These “giant” models are closed-source, expensive to run via API, and computationally heavy.
But what if we didn’t need a giant? What if a smaller, open-source model—specifically trained to act as an autonomous agent—could do the job just as well, if not better?
This is the premise of the research paper “Small Agent Can Also Rock! Empowering Small Language Models as Hallucination Detector.” The researchers propose HaluAgent, a framework that empowers a relatively small 7-billion-parameter model (Baichuan2-Chat) to actively detect hallucinations across text, code, and math. By treating the model not as a passive reader but as an active agent equipped with a toolbox, they demonstrate that size isn’t everything when you have the right strategy.
In this deep dive, we will unpack how HaluAgent works, the “trajectory tuning” that makes it smart, and the results that suggest small agents are ready for the big leagues.
The Hallucination Landscape
Before understanding the solution, we must define the problem. Hallucinations in LLMs aren’t just one thing. They generally fall into several categories:
- Factual Errors: Getting a date or name wrong in a QA session.
- Instruction Drift: Failing to follow constraints (e.g., “write a response in exactly 50 words”).
- Logical/Math Errors: Performing incorrect arithmetic in a word problem.
- Code Errors: Writing code that is syntactically wrong or uses non-existent libraries.
- Semantic Inconsistency: Contradicting itself within the same paragraph.
The Limitation of Current Detectors
Most existing methods for detecting these errors are limited. Some rely purely on the model’s “internal knowledge”—asking the model, “Are you sure?” This often fails because if the model didn’t know the fact in the first place, it likely won’t know it’s wrong.
Other methods use external tools (like Google Search) but bolt them onto massive models like GPT-4. This works, but it precludes developers with limited resources from deploying effective hallucination checkers.

As shown in Table 1 above, HaluAgent attempts to check all the boxes: it uses a base model that is open-source (Baichuan2-Chat), it is agnostic to the specific task (text vs. code), it uses tools, and it is extensible. Unlike methods like SelfCheckGPT which are restricted to the model’s own probabilities, or FacTool which relies on GPT-4, HaluAgent aims for a sweet spot: high capability with low resource cost.
The HaluAgent Framework
The core innovation of this paper is shifting the perspective of hallucination detection from a classification task (True/False) to an agentic task (Plan, Act, Observe, Reflect).
Instead of simply feeding a sentence to an LLM and asking “Is this true?”, HaluAgent breaks the process down into a sophisticated pipeline. The architecture is designed to mimic how a human fact-checker works: we don’t just stare at a sentence until we know the truth; we break it down, search for evidence, do the math, and then form a conclusion.

Figure 1 provides the high-level roadmap. On the right side, you can see the detection pipeline. It operates in three distinct stages: Sentence Segmentation, Tool Selection & Verification, and Reflection.
Let’s break down the machinery under the hood.
1. The Multi-Functional Toolbox
A distinct feature of HaluAgent is that it acknowledges LLMs are bad at certain things (like multiplying large numbers) and shouldn’t be trusted to do them “in their head.” Instead, the agent is given a toolbox.

Table 5 details the specific tools provided to the agent:
- Web Search: For verifying factual claims (Knowledge-based QA).
- Calculator: For checking arithmetic in math problems.
- Code Interpreter: For executing code snippets to see if they crash.
- Word Count: For verifying length constraints in text generation.
- Match (Semantic Checker): A specialized tool to check if a sentence contradicts the context.
- System Tools: Utilities like `split_text` to break paragraphs down.
The agent doesn’t just randomly use these. It must learn which tool applies to which type of hallucination.
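To make the toolbox concrete, here is a minimal Python sketch of how it might be wired up. The function names mirror the tools listed above, but the bodies are placeholder assumptions rather than the paper’s implementation:

```python
import re

# Illustrative stand-ins for the tools. Real versions would wrap a search
# API, a sandboxed interpreter, etc.; these stubs only show the interface
# the agent interacts with.
def web_search(query: str) -> str:
    return f"[top search snippets for: {query}]"

def calculator(expression: str) -> str:
    # eval() is acceptable for a sketch; a real tool would use a safe parser.
    return str(eval(expression))

def code_interpreter(source: str) -> str:
    try:
        exec(source, {})
        return "executed without errors"
    except Exception as exc:
        return f"execution failed: {exc}"

def word_count(text: str) -> str:
    return str(len(text.split()))

def split_text(text: str) -> list[str]:
    # Naive sentence segmentation, for illustration only.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def match(sentence: str, context: str) -> str:
    # Placeholder semantic check; a real tool would use an NLI-style model.
    return "consistent" if sentence.lower() in context.lower() else "needs review"

TOOLBOX = {
    "web_search": web_search,
    "calculator": calculator,
    "code_interpreter": code_interpreter,
    "word_count": word_count,
    "split_text": split_text,
    "match": match,
}
```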
2. The Three-Stage Workflow
The “brain” of HaluAgent follows a specific loop to ensure fine-grained detection.
Stage 1: Sentence Segmentation (Divide and Conquer)
LLM responses can be long and rambling, mixing facts, opinions, and code. Checking the whole block at once is prone to error. HaluAgent first uses `split_text` to chop the response into individual, self-contained sentences (or logical units). This isolates the claims, making them easier to verify one by one.
Stage 2: Tool Selection and Verification
This is where the “Agent” aspect shines. For every segmented sentence, the model decides: Does this need external verification?
- If the sentence says “Paris is the capital of France,” it invokes `web_search`.
- If the sentence says “25 * 4 = 100,” it invokes `calculator`.
- If the sentence provides a Python function, it invokes `code_interpreter`.
The agent generates a “Thought” (reasoning what to do), performs an “Action” (calling the tool), and receives an “Observation” (the tool’s output).
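A single verification step can therefore be pictured as the agent emitting a Thought and an Action as text, the framework executing the chosen tool, and the tool’s output coming back as the Observation. The sketch below assumes a generic text-in/text-out `llm` callable and a `parse` helper for extracting the fields; both are placeholders, not the paper’s actual interface:

```python
from dataclasses import dataclass

@dataclass
class Step:
    thought: str       # the agent's reasoning about what to check
    action: str        # name of the tool the agent chose
    action_input: str  # the argument passed to that tool
    observation: str   # the tool's output, fed back to the agent

def run_step(llm, parse, sentence: str, toolbox: dict) -> Step:
    # Ask the agent how it wants to verify this sentence, then parse its
    # free-text answer into (thought, action, action_input).
    raw = llm(f"Decide how to verify this sentence: {sentence}")
    thought, action, action_input = parse(raw)
    # Execute the chosen tool and record its output as the observation.
    observation = str(toolbox[action](action_input))
    return Step(thought, action, action_input, observation)
```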
Stage 3: Reflection and Memory
This is the most critical step for accuracy. Even with tools, an agent can get confused. HaluAgent employs a Memory Mechanism. It stores the result of every sentence check as a triple: (sentence, hallucination_label, supporting_evidence).
Once all sentences are checked, the agent enters the Reflection phase. It looks at the “local” evidence (does this specific fact match the search result?) and the “global” context (does this correct calculation make sense in the broader logic of the answer?). Only after this reflection does it output the final verdict.
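Putting the three stages together, the outer loop might look like the following sketch, building on the `run_step` helper and the toolbox above. The memory triples and the reflection prompt wording are simplified assumptions:

```python
def detect_hallucination(llm, parse, response: str, toolbox: dict) -> dict:
    # Stage 1: segment the response into self-contained sentences.
    sentences = toolbox["split_text"](response)

    # Stage 2: verify each sentence with the tool the agent selects and
    # store the result as a (sentence, label, evidence) triple.
    memory = []
    for sentence in sentences:
        step = run_step(llm, parse, sentence, toolbox)
        label = llm(
            f"Sentence: {sentence}\nEvidence: {step.observation}\n"
            "Does the evidence support the sentence? Answer ok or hallucinated."
        )
        memory.append((sentence, label, step.observation))

    # Stage 3: reflect over the local evidence and the global logic of the
    # whole answer before committing to a final verdict.
    verdict = llm(
        f"Per-sentence checks: {memory}\n"
        "Considering both the individual evidence and the overall consistency "
        "of the answer, give the final hallucination verdict."
    )
    return {"verdict": verdict, "memory": memory}
```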
How to Train a Small Giant: Trajectory Tuning
Here is the challenge: Small open-source models (like 7B parameters) are generally not smart enough out-of-the-box to handle complex tool use, multi-step reasoning, and memory management. If you ask a standard small model to “use a calculator tool,” it might just hallucinate the calculator’s output!
To bridge this gap, the researchers used a technique called Trajectory Fine-Tuning.
Synthesizing “Expert” Behavior
Since they couldn’t rely on the small model to figure this out, they first used a “teacher” model—GPT-4. They fed GPT-4 the HaluAgent instructions and the diverse datasets (Knowledge QA, Math, Code).
GPT-4 acted as the perfect agent, generating step-by-step logs called trajectories. A trajectory looks like this:
- Observation: Input text received.
- Thought: “I need to split this text.”
- Action: `split_text(...)`
- Observation: [Sentence 1, Sentence 2…]
- Thought: “Sentence 1 is a historical fact. I will search Google.”
- Action: `web_search(...)`
… and so on.
The researchers filtered these logs to ensure they only kept the high-quality ones where GPT-4 correctly identified the hallucinations.
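Each retained trajectory can then be serialized as an ordered list of thought/action/observation records plus the final verdict, ready for fine-tuning. The structure below is an illustrative guess at such a record, not the paper’s exact schema:

```python
# One illustrative training trajectory (field names are assumptions).
trajectory = {
    "instruction": "Detect hallucinations in the following answer.",
    "input": "The Eiffel Tower was completed in 1899 as the tallest structure in Paris.",
    "steps": [
        {"thought": "I need to split this text into sentences.",
         "action": "split_text",
         "action_input": "The Eiffel Tower was completed in 1899 ...",
         "observation": ["The Eiffel Tower was completed in 1899 as the tallest structure in Paris."]},
        {"thought": "This is a historical fact; I should search the web.",
         "action": "web_search",
         "action_input": "When was the Eiffel Tower completed?",
         "observation": "The Eiffel Tower was completed in 1889."},
    ],
    "reflection": "The stated year (1899) contradicts the retrieved evidence (1889).",
    "label": "hallucinated",
}
```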

As shown in Table 2, they didn’t need millions of examples. They synthesized roughly 2,000 trajectories across varied domains (WebQA, Math, Code, etc.). This is a surprisingly small dataset, which highlights the efficiency of this method.
Fine-Tuning the Student
Once the data was ready, they fine-tuned the Baichuan2-Chat model on these trajectories. The goal was to teach the small model to mimic the process of the teacher.

The training objective is a standard language modeling loss, applied specifically to the agent’s Thoughts (\(t_i\)) and Actions (\(a_i\)) and conditioned on the context (\(c_i\)). Essentially, the model is penalized whenever it fails to “think” or “act” the way the expert trajectory does at each step.
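Written out for a trajectory with \(n\) steps, this amounts to minimizing a negative log-likelihood of roughly the following form (a reconstruction from the description above, not the paper’s exact typesetting):

\[
\mathcal{L} = -\sum_{i=1}^{n} \Big[ \log P_{\theta}\big(t_i \mid c_i\big) + \log P_{\theta}\big(a_i \mid c_i, t_i\big) \Big]
\]

Tool observations and the original input appear only inside the context \(c_i\); they are not loss targets, so the model learns to imitate the expert’s reasoning and tool calls rather than to reproduce tool outputs.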
Experimental Results: Does it Work?
The researchers tested HaluAgent against several baselines. The most important comparisons were:
- Baichuan2-Chat (Base): The small model without agent tuning (using a simple prompt).
- GPT-4 (Prompt): The giant model using a standard “Is this a hallucination?” prompt.
- GPT-4 (Pipeline): The giant model using the full HaluAgent tool pipeline.
They tested on both In-domain datasets (similar to training data) and Out-of-domain datasets (new types of questions unseen during training).
Response-Level Detection
The first test was binary: Can the model correctly flag a response as “Hallucinated” or “Not Hallucinated”?

Table 3 reveals several striking insights:
- Small Models Struggle Alone: Look at the “Baichuan2-Chat 7B” column. Its performance is poor (e.g., 49% accuracy on Ape210K math). Without tools or training, the small model is guessing.
- HaluAgent Rivals GPT-4: Now look at the “HaluAgent 7B” column. On the WebQA dataset, it jumps to 80.00% accuracy, almost matching GPT-4’s 82.00%. On the Math dataset (Ape210K), it achieves 72.00%, effectively matching GPT-4’s 72.33%.
- Beating the Giant? On the “WordCnt” dataset (checking length constraints), HaluAgent achieves 100% accuracy, vastly outperforming the standard GPT-4 Prompt (56%). This is because HaluAgent learned to use the `word_count` tool, whereas GPT-4 (in prompt mode) tried to count words “mentally”, a known weakness of LLMs (a minimal version of this check is sketched below).
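For intuition, the length check simply delegates counting to code instead of to the model’s intuition. A minimal sketch, with the constraint assumed to be already parsed out of the instruction:

```python
def check_length_constraint(response: str, required_words: int) -> bool:
    # Counting words is trivial for code but unreliable when an LLM
    # tries to do it "mentally" while reading the text.
    return len(response.split()) == required_words

# A response that claims to satisfy "exactly 50 words" but only has 47.
print(check_length_constraint("word " * 47, 50))  # -> False
```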
Sentence-Level Granularity
It’s not enough to say “this paragraph is wrong.” A good detector must point to the exact sentence.

Table 4 shows the results on the FacTool benchmark. HaluAgent (especially the 13B version) shows massive improvements over the base model.
- Math: The base 7B model had an F1 score of 19.51%. HaluAgent 7B skyrocketed to 68.80%.
- Science: The improvement is even more dramatic, from 17.54% to 94.16%.
This confirms that the “Divide and Conquer” strategy (segmentation + verification) is highly effective for precise error localization.
Case Study: Seeing HaluAgent in Action
To truly appreciate the difference, let’s look at a qualitative comparison.

Figure 3 illustrates a tricky math word problem about average speed.
- The Hallucination: The answer claims “3 hours + 0.5 hours + 1.5 hours = 4 hours.” (The correct sum is 5).
- GPT-4 (Simple Prompt): It reads the text and says “No” (no hallucination found). It missed the arithmetic error.
- Single Tool Approach: Even using just a search engine didn’t help, because the error was calculation-based, not fact-based.
- HaluAgent:
  - It split the text.
  - It recognized a calculation was needed.
  - It invoked the `calculator` tool on “3 + 0.5 + 1.5”.
  - It compared the tool output (5.0) with the text (4).
  - It flagged the error accurately.
This demonstrates the power of autonomous tool selection. HaluAgent didn’t just search; it recognized which tool was needed for the specific sentence.
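Reduced to its decisive step, the agent’s calculator check boils down to evaluating the extracted expression and comparing it with the number stated in the answer. A small illustrative sketch (the expression extraction is assumed to have happened already):

```python
expression = "3 + 0.5 + 1.5"   # arithmetic extracted from the answer
stated_total = 4.0             # the total the answer claims

tool_output = eval(expression)  # calculator tool returns 5.0
is_hallucinated = abs(tool_output - stated_total) > 1e-9

print(tool_output, is_hallucinated)  # 5.0 True -> flag this sentence
```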
Extensibility: Learning New Tricks
A major concern with specialized models is rigidity. Can HaluAgent only do what it was trained to do? To test this, the researchers added two new tools after training: a Calendar tool and a Translator tool. They simply provided instructions in the prompt (zero-shot) without re-training the model.

Figure 2 shows the results. HaluAgent successfully adopted the new tools.
- Calendar: It achieved 100% tool usage and detection success.
- Translator: It used the tool 98% of the time with ~90% success.
This proves that the “Agent” training didn’t just memorize the specific tools in the training set (like calculator or search). It learned the concept of using tools based on instructions, allowing it to generalize to new capabilities.
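In practice, such an extension looks less like retraining and more like registering one more entry in the toolbox and describing it in the prompt. A hedged sketch, reusing the illustrative `TOOLBOX` from earlier (the tool body and description format are assumptions):

```python
import datetime

def calendar(date_query: str) -> str:
    # Illustrative stand-in: return the weekday for an ISO date string,
    # which is enough to verify simple date claims.
    return datetime.date.fromisoformat(date_query).strftime("%A")

# Register the new tool and describe it in the agent's prompt; no re-training.
TOOLBOX["calendar"] = calendar
NEW_TOOL_DESCRIPTION = (
    "calendar(date): given an ISO date such as 1969-07-20, "
    "returns the day of the week, useful for checking date claims."
)
```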
Conclusion
The paper “Small Agent Can Also Rock!” presents a compelling argument for the democratization of AI safety. It challenges the assumption that we need the largest, most expensive proprietary models to keep AI in check.
By combining a fine-grained detection framework (Split, Verify, Reflect) with trajectory fine-tuning, the researchers created HaluAgent. This system allows a modest 7B parameter model to outperform its base version significantly and often rival GPT-4.
The implications are significant for students and developers:
- Cost-Efficiency: Hallucination detection can be run locally or cheaply on open-source hardware.
- Specialization: Small models, when fine-tuned as agents, can punch well above their weight class.
- Trust: By using explicit tools (calculators, search), the model’s verification process is transparent and explainable, unlike the “black box” confidence scores of larger models.
As we move forward, the “Agent” paradigm—where models actively use tools rather than just predicting tokens—seems to be the key to unlocking reliability in AI, regardless of the model size.