Introduction: The Dream of the “Meta-Planner”
Imagine asking your digital assistant to plan a weekend getaway. You say: “Find me a train from Portland to Vancouver departing next Saturday, and then book a hotel in Vancouver for two people with a rating of at least 4.2.”
To a human, this is a straightforward sequence of tasks. To a Large Language Model (LLM), however, this is a nightmare of dependencies, context switching, and permission management. The model cannot simply “know” the answers. It must interface with the real world using tools—specifically, Application Programming Interfaces (APIs).
While we have seen LLMs use calculators or search engines, the scenario above represents a massive leap in complexity. The model must realize that the “check-in date” for the hotel depends entirely on the “arrival date” of the train. It must switch from a transportation app to a lodging app, carrying information across that boundary accurately.
This capability is known as Meta-Planning. It is the frontier where LLMs transition from chatbots to true autonomous agents. But how good are today’s state-of-the-art models at this task?
In this post, we dive deep into AppBench, a research paper that exposes the limitations of current LLMs when faced with complex, multi-app ecosystems. We will explore how the benchmark was built, the graph-like nature of API dependencies, and why even GPT-4o struggles to achieve a success rate higher than 2% on the hardest tasks.

As shown in Figure 1, the user’s request requires the model to identify “Visible APPs,” select the correct APIs (like findtrains and searchhouse), and plan a path where data flows correctly between them.
The Problem: Why Current Benchmarks Fall Short
Before understanding the solution, we must understand the gap in existing research. Previous benchmarks like APIBench or ToolBench have been instrumental in teaching LLMs to use tools. However, they generally focus on two simpler scenarios:
- Single API calls: The user asks a question, and the model calls one specific function.
- Limited Arguments: The APIs are simple, often requiring only one or two inputs.
Real-world software development and user interaction are rarely this isolated. The researchers identified two major challenges that existing benchmarks overlook:
1. Graph Structures (Complex Dependencies)
In the real world, APIs are interdependent. You cannot book_hotel until you search_hotel to get an ID. You cannot pay_bill until you generate_invoice. This creates a graph structure where some tasks run in parallel while others must be sequential. The output of one API becomes the input argument for the next.
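To make the graph idea concrete, here is a minimal sketch (not the paper's format) of those dependencies as an adjacency map, where each API lists the upstream calls whose outputs it needs:

```python
# Illustrative dependency graph for the examples above. The two chains can run
# in parallel with each other, but within each chain the order is fixed.
dependencies = {
    "search_hotel":     [],                    # independent entry point
    "book_hotel":       ["search_hotel"],      # needs the hotel ID from the search
    "generate_invoice": [],                    # independent entry point
    "pay_bill":         ["generate_invoice"],  # needs the generated invoice
}
```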
2. Permission Isolation
Your phone contains dozens of apps (Uber, Airbnb, WhatsApp). These are distinct ecosystems (“Sources”). An LLM acting as a meta-planner must respect these boundaries. It cannot ask Uber to book a room, nor can it ask Airbnb for a ride. It must obtain “permission” by selecting the correct trusted agent (App) for each specific sub-task.
To visualize how AppBench compares to previous work, consider the table below. Note the “MM” (Multiple Apps, Multiple APIs) and “DP” (Dependency) columns—this is where AppBench stands out.

Constructing AppBench: A Benchmark for Complexity
Creating a dataset of complex, interdependent API calls is difficult to do manually. The authors of AppBench devised a clever automated pipeline leveraging existing task-oriented dialogue datasets, specifically the SGD (Schema-Guided Dialogue) dataset.
The construction process, illustrated below, mimics the cognitive load of a human assistant.

The Pipeline
- Source Material: They take multi-turn dialogues between humans and systems (e.g., a long conversation about booking a flight and a hotel).
- Summarization: An LLM compresses this multi-turn conversation into a single, complex User Instruction.
- Dependency Building: Python scripts analyze the original dialogue’s logic to construct the “Ground Truth” API calls. Crucially, they identify where arguments overlap—for instance, if the destination in the flight search matches the city in the hotel search, a dependency is recorded.
- Quality Control: The generated instructions are filtered for fluency and logic to ensure the benchmark is fair and solvable.
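As a rough illustration of the dependency-building step, the overlap check might look something like the sketch below; the function and field names are hypothetical, not the authors’ actual scripts.

```python
# Hypothetical sketch of the overlap check: if a later call's input value
# equals an earlier call's output value, replace the literal with a reference.
def link_dependencies(calls):
    """calls: list of dicts, each with 'inputs' and 'outputs' slot-value maps."""
    for i, call in enumerate(calls):
        for slot, value in call["inputs"].items():
            for j in range(i):
                for prev_slot, prev_value in calls[j]["outputs"].items():
                    if value == prev_value:
                        # record the dependency instead of keeping the raw literal
                        call["inputs"][slot] = f"#{prev_slot}"
    return calls
```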
The Four Levels of Difficulty
The resulting dataset is categorized into four levels of increasing complexity. This categorization allows us to pinpoint exactly where an LLM’s reasoning breaks down.
- SS (Single App, Single API): The simplest case. “Find a restaurant.”
- SM (Single App, Multiple APIs): “Find a movie and then buy tickets for it.” Dependencies exist, but within one domain.
- MS (Multiple Apps, Single API): “Check the weather (App A) and play music (App B).” Two distinct domains, potentially parallel execution.
- MM (Multiple Apps, Multiple APIs): The “Boss Level.” “Find a flight (App A), use the arrival time to book a taxi (App B), and reserve a restaurant (App C).” This involves complex graph structures and cross-app data flow.
Figure 3 provides concrete examples of these categories. Notice the bolded text in the output paths—these represent arguments that are dependent on previous outputs (e.g., #restaurant_name).
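Assuming the level is determined purely by how many distinct Apps and API calls a ground-truth path contains, the bucketing could be sketched like this:

```python
# Sketch of the SS/SM/MS/MM bucketing, assuming MS means each App contributes
# exactly one API call (the paper's exact categorization rules may differ).
def difficulty_level(path):
    apps = {step["app"] for step in path}
    multi_app = len(apps) > 1
    multi_api = len(path) > 1
    if not multi_app and not multi_api:
        return "SS"   # Single App, Single API
    if not multi_app and multi_api:
        return "SM"   # Single App, Multiple APIs
    if multi_app and len(path) == len(apps):
        return "MS"   # Multiple Apps, one API each
    return "MM"       # Multiple Apps, Multiple APIs

# Example: two apps, one call each -> "MS".
print(difficulty_level([{"app": "Weather"}, {"app": "Media"}]))
```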

The Scale of Complexity
What makes the MM category so difficult? It’s not just about length; it’s about the geometry of the task. As shown in the statistics table below, MM tasks have higher “Sequential” and “Parallel” sizes. This means the model has to hold more information in working memory and reason through longer chains of causality.
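The paper’s exact definitions aren’t reproduced here, but a plausible reading is that the sequential size is the length of the longest dependency chain and the parallel size is the number of independent entry points; the sketch below computes both under that assumption.

```python
# Compute sequential/parallel scale for a plan, under the assumptions above.
def plan_scales(depends_on):
    """depends_on: list where depends_on[i] lists the steps that step i needs."""
    depth_cache = {}

    def depth(i):
        if i not in depth_cache:
            prereqs = depends_on[i]
            depth_cache[i] = 1 + (max(depth(j) for j in prereqs) if prereqs else 0)
        return depth_cache[i]

    sequential = max(depth(i) for i in range(len(depends_on)))
    parallel = sum(1 for prereqs in depends_on if not prereqs)
    return sequential, parallel

# Example: step 1 depends on step 0; step 2 is independent.
print(plan_scales([[], [0], []]))   # -> (2, 2)
```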

The Core Method: How to Evaluate a Meta-Planner
To evaluate how well an LLM performs as a meta-planner, the researchers define the task as follows: Given a user instruction and a “virtual mobile environment” containing various Apps (each with its own APIs), the model must generate an Executable Path.
This path is a list where each item specifies:
- The App to be used.
- The API function within that App.
- The Input Arguments, which can be literal values (e.g., “Portland”) or variables from previous steps (e.g., output_of_step_1).
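A hedged sketch of what such a path might look like for the Portland-to-Vancouver request follows; the benchmark’s actual output syntax differs, and the app, API, and argument names here are illustrative.

```python
# Illustrative executable path: a list of (App, API, arguments) steps, where
# dependent arguments use the "#slot_name" convention shown in Figure 3.
executable_path = [
    {
        "app": "Trains",
        "api": "findtrains",
        "arguments": {"from": "Portland", "to": "Vancouver", "date": "next Saturday"},
    },
    {
        "app": "Hotels",
        "api": "searchhouse",
        "arguments": {
            "city": "Vancouver",
            "check_in": "#arrival_date",   # dependent value: filled from the train search's output
            "number_of_adults": 2,
            "min_rating": 4.2,
        },
    },
]
```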
Metrics
The paper uses rigorous metrics to grade the models. It’s not enough to just “chat” about the solution; the model must write valid code.
1. F1 Score for App Selection (\(F1_{app}\)): Did the model pick the right Apps? To calculate this, we look at Precision (\(P_{app}\)) and Recall (\(R_{app}\)).
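Assuming the standard precision/recall definitions over the model’s predicted App set \(\hat{A}\) and the ground-truth set \(A\) (the notation here is approximate, not the paper’s exact formulation):

\[
P_{app} = \frac{|\hat{A} \cap A|}{|\hat{A}|}, \qquad
R_{app} = \frac{|\hat{A} \cap A|}{|A|}, \qquad
F1_{app} = \frac{2 \, P_{app} \, R_{app}}{P_{app} + R_{app}}
\]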


2. F1 Score for API Selection (\(F1_{api}\)):
Once the App is chosen, did the model pick the correct function? (e.g., calling find_train instead of buy_ticket when the user just wants to search).
3. Success Rate (Succ): This is the “Hard Mode” metric. A task is considered a success only if:
- All Apps are identified correctly.
- All APIs are identified correctly.
- All Arguments (inputs and outputs) match the ground truth, including correct dependency linking.
If the model gets everything else right but misses one start time or passes the wrong variable name, the score is 0.
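A simplified sketch of this all-or-nothing check, assuming predicted steps are compared to the ground truth in order (the paper’s matcher may be more permissive about how parallel calls are ordered):

```python
# All-or-nothing success check: apps, APIs, and every argument (including
# dependency references) must match the ground truth exactly.
def is_success(predicted, gold):
    if len(predicted) != len(gold):
        return False
    for p, g in zip(predicted, gold):
        if p["app"] != g["app"] or p["api"] != g["api"]:
            return False
        if p["arguments"] != g["arguments"]:  # one wrong value or variable -> failure
            return False
    return True
```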
Experiments & Results: The Reality Check
The researchers tested 9 distinct LLMs, ranging from open-source models like Mistral-7B and LLaMA-3 to proprietary giants like GPT-3.5 and GPT-4o.
The main results, presented in the table below, reveal a stark reality about the current state of LLMs.
(See Table 3 in the paper for the full results.)
Key Takeaways from the Results:
- Complexity Kills Performance: Look at the drop-off for almost every model.
- SS (Simple): GPT-4o achieves 70.92% Success.
- MM (Complex): GPT-4o achieves only 2.00% Success.
- This is a catastrophic drop. It indicates that while models are great at simple tool use, they lose coherence when juggling multiple apps and dependencies.
- GPT-4o is King (but still fails): GPT-4o significantly outperforms other models, likely due to its superior reasoning and instruction-following capabilities. However, a 2% success rate on complex tasks suggests it is not yet reliable for autonomous agent workflows.
- Open Source Lags Behind: Models like Mistral and Qwen (7B/14B) struggle to even format the output correctly, often scoring 0% on complex tasks. However, larger open-source models like LLaMA-3-70B perform competitively, sometimes beating GPT-3.5.
Deep Dive Analysis
Why are these tasks so hard? The researchers conducted several analyses to pinpoint the bottlenecks.
1. The Enemy is Depth, Not Just Width
The researchers analyzed how “Sequential Scale” (how many steps must happen one after another) versus “Parallel Scale” (how many independent tasks exist) affects performance.
As shown in Figure 4, performance degrades in both directions, but the Sequential Scale is particularly punishing. The deeper the dependency chain, the more likely the model is to “lose the thread” of the logic or hallucinate a variable that doesn’t exist yet.

2. Prompting Strategy: Hierarchical vs. Flat
There are two ways to present the available tools to an LLM:
- Hierarchical: First, ask the LLM to pick an App. Then, show it only the APIs for that App. (Good for saving context window).
- Flat: Dump every API from every App into the prompt at once.
Surprisingly, GPT-4o performed better with Flat prompting (green lines below), likely because its massive context window allows it to see the “whole picture” and plan dependencies more effectively. GPT-3.5, however, suffered from “information overload” in the Flat setting, performing better when the task was broken down hierarchically (red lines).
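To make the two setups concrete, here is a rough sketch of how the prompts could be assembled; the wording and the apps_to_apis structure are illustrative, not the paper’s actual templates.

```python
# Two ways to expose the same tool inventory to the model.
def flat_prompt(instruction, apps_to_apis):
    # One big prompt: every API from every App is visible at once.
    tools = "\n".join(f"{app}.{api}" for app, apis in apps_to_apis.items() for api in apis)
    return f"Available APIs:\n{tools}\n\nInstruction: {instruction}\nPlan the full path."

def hierarchical_prompts(instruction, apps_to_apis):
    # Stage 1: choose Apps from their names only; Stage 2: plan with only those APIs.
    stage1 = f"Apps: {', '.join(apps_to_apis)}\nInstruction: {instruction}\nWhich Apps are needed?"
    def stage2(chosen_apps):
        tools = "\n".join(f"{app}.{api}" for app in chosen_apps for api in apps_to_apis[app])
        return f"Available APIs:\n{tools}\n\nInstruction: {instruction}\nPlan the full path."
    return stage1, stage2
```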

3. Where Exactly Do They Fail?
The error analysis in Table 4 is revealing. The models aren’t usually failing to pick the right App (App selection is relatively easy).
The failure happens at the Argument Level, specifically with Dependent Values (D-Values). The models struggle to correctly link the output of API_A to the input of API_B. They also struggle with “Time/Space” reasoning—for example, calculating that if a train arrives at 2:00 PM, the car pickup should be at 2:15 PM.
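A contrived illustration (not from the paper) of the D-Value failure mode, using the “#slot” notation from Figure 3 for dependent values:

```python
# The model should pass a reference to the train's arrival time into the taxi
# request, but often guesses a literal or invents a variable that doesn't exist.
correct = {"pickup_time": "#arrival_time"}          # linked to the train search's output
wrong_literal = {"pickup_time": "2:00 PM"}          # guessed as a literal instead of a reference
wrong_variable = {"pickup_time": "#train_arrival"}  # references a slot no earlier API outputs
```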

Can We Fix It? Fine-Tuning and In-Context Learning
If the base models are failing, can we teach them to be better?
Fine-Tuning
The researchers fine-tuned LLaMA-3-8B on a training set of AppBench data. As expected, fine-tuning improved the “formatting” metrics (F1 App and API). The model learned the syntax of the task perfectly.
However, looking at the Success Rate (Succ) in the bottom graphs of Figure 6, the improvement is marginal for complex tasks. Fine-tuning helps the model “speak the language” of the benchmark, but it doesn’t necessarily grant it the reasoning power to solve complex dependency logic.

In-Context Learning (Few-Shot Prompting)
What if we just give the model examples in the prompt (3-shot, 5-shot)?
Table 5 shows that for GPT-4o, providing examples helps significantly in the simple (SS) category (jumping from 70% to 81%). However, for the complex MM category, the success rate remains stagnant around 2-3%. The complexity of the planning graph seems to exceed what can be taught through a few static examples in the prompt.

Conclusion and Implications
AppBench serves as a reality check for the AI industry. It demonstrates that while we have mastered the art of the “Chatbot” and the “Single-Tool Agent,” we are still in the early stages of building Meta-Planners capable of navigating the messy, interconnected ecosystem of real-world applications.
The paper highlights that the primary bottlenecks are not just identifying user intent, but managing dependencies (Graph Structures) and handling diverse sources (Permission Isolation).
For students and researchers, this opens up exciting avenues for future work:
- Agent Frameworks: Can we build “System 2” thinking layers that explicitly map out dependency graphs before generating code?
- Self-Correction: Can agents learn to “compile” their plans, catch dependency errors, and retry?
- Hybrid Architectures: Combining LLMs with symbolic planners to handle the rigid logic of API data flows.
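As a purely illustrative example of the self-correction idea above, a plan “compiler” could be as simple as checking that every dependent value refers to an output some earlier step actually produces, and triggering a retry when it does not:

```python
# Illustrative plan "compiler": validate dependency references before execution.
def compile_plan(path):
    """path: list of steps with 'arguments' (inputs) and 'outputs' (slot names)."""
    available, errors = set(), []
    for i, step in enumerate(path):
        for slot, value in step["arguments"].items():
            # Dependent values are written "#slot_name", as in Figure 3.
            if str(value).startswith("#") and str(value)[1:] not in available:
                errors.append(f"step {i}: '{slot}' depends on undefined output {value}")
        available.update(step.get("outputs", []))
    return errors  # an empty list means the plan "compiles"; otherwise, regenerate
```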
Until we solve these challenges, our AI assistants will likely remain helpful for checking the weather, but untrustworthy for planning our vacations.
The full list of apps and APIs used in this benchmark can be seen below, illustrating the diverse domains the models were tested against.
