Introduction
In recent years, Large Language Models (LLMs) like GPT-4 and LLaMA have graduated from being simple chatbots to becoming “agents.” They don’t just talk; they act. Through Tool Learning, these models can interact with the physical and digital world by calling external APIs (Application Programming Interfaces) to check the weather, book flights, or query databases.
Most research in this field assumes a perfect world. In these idealized benchmarks, API tools are named logically (e.g., Get_Weather), parameters are clearly defined, and documentation is pristine. But anyone who has worked in software engineering knows that the real world is rarely so clean. Real-world systems are messy. They have legacy code, typos, cryptic naming conventions (e.g., func_01_a), and redundant parameters.
What happens when a state-of-the-art LLM encounters this “noise”?
This is the core question behind the paper “RoTBench: A Multi-Level Benchmark for Evaluating the Robustness of Large Language Models in Tool Learning.” The authors explore a critical vulnerability: while LLMs are great at using tools in structured environments, their performance collapses when faced with the inevitable noise of reality.

As illustrated in Figure 1, an LLM might perfectly utilize a tool named Get_Weather. But if that same tool is named ABC (a scenario common in obfuscated or poorly named codebases), the model fails to recognize it, even if the functionality is identical.
In this post, we will dissect the RoTBench framework, analyze why even the most powerful models like GPT-4 struggle with noise, and introduce RoTTuning, a novel training strategy designed to build robust, noise-resistant agents.
Background: The Fragility of Tool Learning
Tool learning empowers LLMs to extend their capabilities. Instead of relying solely on internal knowledge, the model is given a list of “tools” (function descriptions). When a user asks a question, the model must:
- Select the correct tool.
- Identify the necessary arguments (parameters) for that tool.
- Fill the content for those parameters based on the user’s intent.
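To make these three steps concrete, here is a minimal, hypothetical example of a tool schema and the call a correct agent should produce. The tool name, fields, and call format are illustrative assumptions, not the paper’s actual data format.

```python
# Hypothetical tool schema and the call a correct agent should emit.
# None of these names come from RoTBench; they only illustrate the three steps.

tools = [
    {
        "name": "Get_Weather",  # Step 1: the model must select this tool
        "description": "Returns the current weather for a given city.",
        "parameters": {
            "city": {"type": "string", "required": True},   # Step 2: identify required arguments
            "unit": {"type": "string", "required": False},
        },
    }
]

user_query = "What's the weather in Paris, in celsius?"

# Step 3 (content filling): extract values from the query and emit the call.
model_call = {
    "tool": "Get_Weather",
    "arguments": {"city": "Paris", "unit": "celsius"},
}
```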
Existing benchmarks like ToolBench or API-Bank test these steps, but they use “clean” environments. The authors of this paper argue that robustness—the ability to maintain performance despite perturbations—is just as important as accuracy. If an AI agent can’t handle a typo in an API definition or a scrambled parameter list, it isn’t ready for real-world deployment.
RoTBench: Stress-Testing LLMs
To measure this robustness gap, the researchers created RoTBench. This is a multi-level benchmark that systematically injects different types of “noise” into tool environments to see how models cope.
The benchmark defines five distinct environments, ranging from perfect to chaotic.

As shown in Figure 2, the environments are categorized as follows:
- Clean Level: The baseline. Tool names match their functions (e.g., cat_breed), and parameters are clear. This mimics standard benchmarks.
- Slight Level: Simulates typographical errors common in human input or hasty coding.
  - Insertion: cat_breed becomes cat_t_breeds.
  - Omission: cat_breed becomes c_at_breed.
  - Substitution: cat_breed becomes bat_breo_d.
- Medium Level: Introduces more significant structural noise.
  - Reversal: dog_breed becomes deerb_god.
  - Nonsense: Names are replaced with random strings like abcDF. This tests if the model reads the description rather than relying solely on the name.
- Heavy Level: Complex structural changes.
  - Exchange: Parameter roles might be swapped or shuffled.
  - Addendum: Random, mandatory parameters are added that require the model to infer values or use placeholders.
- Union Level: The ultimate stress test. It combines all the above noise types simultaneously.
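To make the noise types more tangible, here is a rough sketch of how the name-level perturbations could be implemented. The exact transformation rules used in RoTBench may differ; this is only an approximation of the Slight and Medium categories.

```python
import random
import string

def perturb_name(name: str, level: str, rng: random.Random) -> str:
    """Approximate name perturbations in the spirit of RoTBench's noise levels.
    The paper's exact rules may differ."""
    chars = list(name)
    if level == "insertion":        # Slight: insert a random character
        chars.insert(rng.randrange(len(chars) + 1), rng.choice(string.ascii_lowercase + "_"))
    elif level == "omission":       # Slight: drop a character
        chars.pop(rng.randrange(len(chars)))
    elif level == "substitution":   # Slight: replace a character
        chars[rng.randrange(len(chars))] = rng.choice(string.ascii_lowercase)
    elif level == "reversal":       # Medium: reverse the whole name
        chars.reverse()
    elif level == "nonsense":       # Medium: replace with a random string
        chars = rng.choices(string.ascii_letters, k=5)
    return "".join(chars)

rng = random.Random(0)
print(perturb_name("cat_breed", "substitution", rng))  # a randomly corrupted variant of cat_breed
```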
The Three Stages of Failure
The authors didn’t just look at the final output; they analyzed where the models failed. They broke the evaluation down into three stages:
- Tool Selection (\(s_{TS}\)): Did the model pick the right function?
- Parameter Identification (\(s_{PI}\)): Did it recognize which arguments were required?
- Content Filling (\(s_{CF}\)): Did it extract the correct values from the user prompt to fill those arguments?
By isolating these stages, the researchers could pinpoint whether a model failed because it couldn’t find the tool or because it couldn’t format the parameters correctly.
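One way to picture this decomposition: credit each stage only if the previous stage is already correct. The snippet below is a sketch under that assumption; RoTBench’s exact scoring rules may differ.

```python
def stage_scores(pred: dict, gold: dict) -> dict:
    """Sketch of stage-wise evaluation; RoTBench's exact metric may differ.
    Each stage is credited only if the earlier stages are already correct."""
    ts = pred["tool"] == gold["tool"]                                  # tool selection
    pi = ts and set(pred["arguments"]) == set(gold["arguments"])       # parameter identification
    cf = pi and all(pred["arguments"][k] == gold["arguments"][k]       # content filling
                    for k in gold["arguments"])
    return {"s_TS": int(ts), "s_PI": int(pi), "s_CF": int(cf)}

gold = {"tool": "Get_Weather", "arguments": {"city": "Paris"}}
pred = {"tool": "Get_Weather", "arguments": {"city": "paris"}}
print(stage_scores(pred, gold))  # {'s_TS': 1, 's_PI': 1, 's_CF': 0}
```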
Experiments and Analysis
The researchers tested six models: four open-source (versions of ToolLLaMA and NexusRaven) and two closed-source (GPT-3.5-turbo and GPT-4). The results were stark.
1. Robustness is a Major Weakness
While human evaluators maintained high performance regardless of noise (showing that the tasks were solvable), LLM performance degraded rapidly.

Figure 3 illustrates the performance drop (the “delta” between Clean and Noisy). You can see that Tool Selection (the left group) suffers massively when tool names are noisy. If an API is named Get_Weather, the model finds it. If it’s named x_99_z, the model often ignores it, even if the description explicitly says “Returns weather data.”
2. The GPT Paradox: Too Smart for its Own Good
One of the most fascinating findings in the paper involves the Slight noise level. Logic suggests that “Slight” noise (typos) should be easier to handle than “Medium” noise (nonsense strings). However, the GPT family of models actually performed worse on Slight noise than on Medium noise in some cases.
Why? Because GPT models are trained to be helpful assistants that correct user mistakes.
When GPT-4 sees a tool named predOict_aTge (a noisy version of predict_age), it assumes the user made a typo. It “corrects” the noise and attempts to call the function predict_age. However, in a programmatic environment, exactness matters. The API is actually registered as predOict_aTge. By “fixing” the name, GPT-4 calls a function that doesn’t exist, causing the execution to fail.

Table 13 shows this phenomenon in action. When the tool is defined as predOict_aTge, GPT-3.5 outputs predict_age, resulting in a system error. This “noise correction” capability, usually a strength in natural conversation, becomes a liability in rigid tool utilization.
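The failure mode is easy to reproduce in any strict dispatch setting. The registry below is purely hypothetical, but it shows why an “autocorrected” name breaks execution: tool dispatch is an exact string match.

```python
# Hypothetical tool registry: the noisy name is what is actually registered.
registry = {
    "predOict_aTge": lambda name: {"name": name, "age": 42},
}

requested = "predict_age"  # the model "helpfully" corrects the typo

try:
    result = registry[requested](name="Alice")
except KeyError:
    result = {"error": f"Tool '{requested}' does not exist"}

print(result)  # {'error': "Tool 'predict_age' does not exist"}
```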
3. Training Examples Don’t Fix Robustness
The researchers attempted to fix this by giving GPT-4 examples of previous interactions (few-shot learning).

As Figure 4 shows, while providing examples (Third Turn) improved the average performance (the shape gets larger), it also increased the variance (the standard deviation). The model became better overall but remained unstable across different noise levels. Few-shot prompting wasn’t enough to solve the robustness problem.
The Solution: RoTTuning
Recognizing that inherent model capabilities and standard prompting weren’t sufficient, the authors proposed a new training strategy called RoTTuning. The goal was to simulate the messiness of the real world during the training phase, essentially inoculating the model against noise.
The RoTTuning process consists of four phases:

Phase 1: Query Expansion
Using a small set of seed data, they used GPT-4 to generate thousands of new, diverse user queries. This ensures the model sees a wide variety of requests, not just standard “What is the weather?” questions.
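A minimal sketch of what such a query-expansion loop might look like is shown below. The prompt wording and the generate_with_gpt4 callable are placeholders, not the paper’s actual prompts or client code.

```python
# Hypothetical query-expansion loop; `generate_with_gpt4` is a placeholder
# for whatever LLM client you use, and the prompt is not from the paper.
def expand_queries(seed_queries: list[str], n_rounds: int, generate_with_gpt4) -> list[str]:
    expanded = list(seed_queries)
    for _ in range(n_rounds):
        prompt = (
            "Here are example user requests for a tool-using assistant:\n"
            + "\n".join(f"- {q}" for q in expanded[-5:])
            + "\nWrite 5 new, diverse requests in the same style, one per line."
        )
        expanded.extend(q.strip() for q in generate_with_gpt4(prompt).splitlines() if q.strip())
    return expanded
```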
Phase 2: Trajectory Generation
They fed these queries into GPT-4 (using a clean environment) to generate correct “thought-action” trajectories. This creates a “Golden Dataset” of how a model should think and act to solve problems using tools.
Phase 3: Environment Augmentation
This is the critical step. Instead of training on just the clean data, they took the “Golden” trajectories and artificially injected noise. They rewrote the tool definitions in the training data to mimic the Slight, Medium, and Heavy environments.
- Example: If the original training data had a tool search_news, the augmented data might rename it s_ear_ch_new_s while keeping the usage logic unchanged.
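A rough sketch of this augmentation step is given below: perturb the tool names in the environment and rename the gold call consistently, so the training target still matches the noisy registry. The field names are assumptions, not the paper’s data format.

```python
def augment_trajectory(example: dict, rename: dict) -> dict:
    """Illustrative environment augmentation: rename tools in the definitions
    and in the gold call, so the target output matches the noisy environment."""
    noisy_tools = [
        {**tool, "name": rename.get(tool["name"], tool["name"])}
        for tool in example["tools"]
    ]
    gold_call = {**example["gold_call"],
                 "tool": rename.get(example["gold_call"]["tool"], example["gold_call"]["tool"])}
    return {**example, "tools": noisy_tools, "gold_call": gold_call}

example = {
    "query": "Find today's headlines about space.",
    "tools": [{"name": "search_news", "description": "Search news articles."}],
    "gold_call": {"tool": "search_news", "arguments": {"keyword": "space"}},
}
noisy = augment_trajectory(example, rename={"search_news": "s_ear_ch_new_s"})
print(noisy["gold_call"]["tool"])  # s_ear_ch_new_s
```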
Phase 4: Generalizability Training
Finally, they fine-tuned the LLaMA-2-7B model using LoRA (Low-Rank Adaptation). This allows the model to adapt to these noisy patterns without overwriting all its original knowledge.
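For readers unfamiliar with LoRA, the sketch below shows a typical setup with the Hugging Face peft library. The rank, scaling, and target modules are common defaults, not the hyperparameters reported in the paper.

```python
# Minimal LoRA setup with Hugging Face peft; values are illustrative defaults,
# not the paper's actual training configuration.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_cfg = LoraConfig(
    r=16,                                 # rank of the low-rank update matrices
    lora_alpha=32,                        # scaling factor for the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the small adapter matrices are trainable
# The adapted model is then fine-tuned on the augmented, noisy trajectories.
```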
Results: A More Robust Model
The resulting model, named RoTLLaMA, demonstrated significantly better stability than even GPT-4.

Figure 6 compares RoTLLaMA (blue bars) against versions without augmentation or LoRA.
- Performance: RoTLLaMA achieves the highest scores across Tool Selection, Parameter Identification, and Content Filling.
- Stability: The standard deviation (error bars) is smaller, meaning the model performs consistently regardless of whether the environment is clean or chaotic.

Table 5 details the specific scores. While absolute performance drops slightly as noise increases (which is expected), the drop-off is much smoother compared to standard models. Most importantly, RoTLLaMA learned to trust the function descriptions over the function names, effectively overcoming the “GPT Paradox” where models hallucinate corrected names.
Conclusion
The transition from “Chatbot” to “Agent” relies on the ability to use tools reliably. The RoTBench paper highlights a significant blind spot in current LLM development: models are brittle. A single typo in an API definition or a legacy naming convention can break a multi-million dollar model.
The key takeaways from this research are:
- Robustness is not inherent: High reasoning capability (like in GPT-4) does not guarantee robustness against structural noise.
- Helpfulness can be harmful: The tendency of LLMs to “autocorrect” input can cause failures in strict programmatic environments.
- Diversity is the cure: By using RoTTuning—training on environments that are intentionally noisy and diverse—we can build models that look past the messy syntax of real-world tools and focus on their semantic utility.
As we move toward autonomous agents that interact with the messy, unstructured web, benchmarks like RoTBench will be essential for ensuring that our AI assistants don’t just work in the lab, but also in the real world.