Introduction: The Dream of the “Meta-Planner”

Imagine asking your digital assistant to plan a weekend getaway. You say: “Find me a train from Portland to Vancouver departing next Saturday, and then book a hotel in Vancouver for two people with a rating of at least 4.2.”

To a human, this is a straightforward sequence of tasks. To a Large Language Model (LLM), however, this is a nightmare of dependencies, context switching, and permission management. The model cannot simply “know” the answers. It must interface with the real world using tools—specifically, Application Programming Interfaces (APIs).

While we have seen LLMs use calculators or search engines, the scenario above represents a massive leap in complexity. The model must realize that the “check-in date” for the hotel depends entirely on the “arrival date” of the train. It must switch from a transportation app to a lodging app, carrying information across that boundary accurately.

This capability is known as Meta-Planning. It is the frontier where LLMs transition from chatbots to true autonomous agents. But how good are today’s state-of-the-art models at this task?

In this post, we dive deep into AppBench, a research paper that exposes the limitations of current LLMs when faced with complex, multi-app ecosystems. We will explore how the benchmark was built, the graph-like nature of API dependencies, and why even GPT-4o struggles to achieve a success rate higher than 2% on the hardest tasks.

An example of a user instruction requiring APIs from two different Apps.

As shown in Figure 1, the user’s request requires the model to identify “Visible APPs,” select the correct APIs (like findtrains and searchhouse), and plan a path where data flows correctly between them.

The Problem: Why Current Benchmarks Fall Short

Before understanding the solution, we must understand the gap in existing research. Previous benchmarks like APIBench or ToolBench have been instrumental in teaching LLMs to use tools. However, they generally focus on two simpler scenarios:

  1. Single API calls: The user asks a question, and the model calls one specific function.
  2. Limited Arguments: The APIs are simple, often requiring only one or two inputs.

Real-world software development and user interaction are rarely this isolated. The researchers identified two major challenges that existing benchmarks overlook:

1. Graph Structures (Complex Dependencies)

In the real world, APIs are interdependent. You cannot book_hotel until you search_hotel to get an ID. You cannot pay_bill until you generate_invoice. This creates a graph structure where some tasks run in parallel while others must be sequential. The output of one API becomes the input argument for the next.
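To make this concrete, here is a minimal sketch (not from the paper; the API names and the "#step.field" reference syntax are illustrative) of representing a plan as a dependency graph and executing it in topological order, so each step runs only after the steps whose output it consumes:

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical plan: each step lists the earlier steps whose output it needs.
# API names and the "#step.field" reference syntax are illustrative, not from the paper.
plan = {
    "search_hotel": {"needs": [], "args": {"city": "Vancouver"}},
    "book_hotel":   {"needs": ["search_hotel"], "args": {"hotel_id": "#search_hotel.id"}},
}

# Topological order guarantees every step runs after the steps it depends on.
order = TopologicalSorter({name: step["needs"] for name, step in plan.items()})
for name in order.static_order():
    print(name, plan[name]["args"])
# search_hotel {'city': 'Vancouver'}
# book_hotel {'hotel_id': '#search_hotel.id'}
```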

2. Permission Isolation

Your phone contains dozens of apps (Uber, Airbnb, WhatsApp). These are distinct ecosystems (“Sources”). An LLM acting as a meta-planner must respect these boundaries. It cannot ask Uber to book a room, nor can it ask Airbnb for a ride. It must obtain “permission” by selecting the correct trusted agent (App) for each specific sub-task.
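One way to picture this constraint is a registry that scopes every API to exactly one App, so a call is only legal inside the App that owns it. A hypothetical sketch (the App and API names are illustrative):

```python
# Hypothetical registry: each App ("Source") exposes only its own APIs.
APP_REGISTRY = {
    "Trains": {"find_trains", "buy_ticket"},
    "Hotels": {"search_house", "book_house"},
}

def check_permission(app: str, api: str) -> bool:
    """A call is only valid if the API actually belongs to the selected App."""
    return api in APP_REGISTRY.get(app, set())

assert check_permission("Hotels", "book_house")
assert not check_permission("Trains", "book_house")  # a train App cannot book a room
```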

To visualize how AppBench compares to previous work, consider the table below. Note the “MM” (Multiple Apps, Multiple APIs) and “DP” (Dependency) columns—this is where AppBench stands out.

Comparison with existing evaluation benchmarks.

Constructing AppBench: A Benchmark for Complexity

Creating a dataset of complex, interdependent API calls is difficult to do manually. The authors of AppBench devised a clever automated pipeline leveraging existing task-oriented dialogue datasets, specifically the SGD (Schema-Guided Dialogue) dataset.

The construction process, illustrated below, mimics the cognitive load of a human assistant.

A high-level processing flow to collect the AppBench dataset.

The Pipeline

  1. Source Material: They take multi-turn dialogues between humans and systems (e.g., a long conversation about booking a flight and a hotel).
  2. Summarization: An LLM compresses this multi-turn conversation into a single, complex User Instruction.
  3. Dependency Building: Python scripts analyze the original dialogue’s logic to construct the “Ground Truth” API calls. Crucially, they identify where arguments overlap—for instance, if the destination in the flight search matches the city in the hotel search, a dependency is recorded (a simplified sketch of this overlap check follows the list).
  4. Quality Control: The generated instructions are filtered for fluency and logic to ensure the benchmark is fair and solvable.
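Step 3 is essentially an overlap check between the arguments of later calls and the outputs of earlier ones. A simplified sketch of that idea, assuming a list of annotated calls (the field names and API names are illustrative, not the authors' actual scripts):

```python
def build_dependencies(calls):
    """Mark an argument as dependent when its value matches an earlier call's output."""
    deps = []
    for i, call in enumerate(calls):
        for arg, value in call["args"].items():
            for j in range(i):
                if value in calls[j]["outputs"].values():
                    deps.append((calls[j]["api"], calls[i]["api"], arg))
    return deps

calls = [
    {"api": "find_trains",  "args": {"to": "Vancouver"},        "outputs": {"arrival": "14:00"}},
    {"api": "search_house", "args": {"check_in_time": "14:00"}, "outputs": {}},
]
print(build_dependencies(calls))
# [('find_trains', 'search_house', 'check_in_time')]
```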

The Four Levels of Difficulty

The resulting dataset is categorized into four levels of increasing complexity. This categorization allows us to pinpoint exactly where an LLM’s reasoning breaks down.

  1. SS (Single App, Single API): The simplest case. “Find a restaurant.”
  2. SM (Single App, Multiple APIs): “Find a movie and then buy tickets for it.” Dependencies exist, but within one domain.
  3. MS (Multiple Apps, Single API): “Check the weather (App A) and play music (App B).” Two distinct domains, potentially parallel execution.
  4. MM (Multiple Apps, Multiple APIs): The “Boss Level.” “Find a flight (App A), use the arrival time to book a taxi (App B), and reserve a restaurant (App C).” This involves complex graph structures and cross-app data flow.

Figure 3 provides concrete examples of these categories. Notice the bolded text in the output paths—these represent arguments that are dependent on previous outputs (e.g., #restaurant_name).

Examples of different sample types in AppBench: SS, SM, MS, and MM.

The Scale of Complexity

What makes the MM category so difficult? It’s not just about length; it’s about the geometry of the task. As shown in the statistics table below, MM tasks have higher “Sequential” and “Parallel” sizes. This means the model has to hold more information in working memory and reason through longer chains of causality.

Data statistics of the proposed AppBench.

The Core Method: How to Evaluate a Meta-Planner

To evaluate how well an LLM performs as a meta-planner, the researchers define the task as follows: Given a user instruction and a “virtual mobile environment” containing various Apps (each with its own APIs), the model must generate an Executable Path.

This path is a list where each item specifies the following (a concrete sketch follows the list):

  1. The App to be used.
  2. The API function within that App.
  3. The Input Arguments, which can be literal values (e.g., “Portland”) or variables from previous steps (e.g., output_of_step_1).
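Concretely, a generated path for the train-plus-hotel request at the top of this post might look something like the sketch below (the dictionary structure and the "#step.field" reference syntax are my illustration, not the paper's exact output format):

```python
# Illustrative executable path: each entry names an App, an API, and its arguments.
# A value starting with "#" refers to the output of an earlier step.
executable_path = [
    {"app": "Trains", "api": "find_trains",
     "args": {"from": "Portland", "to": "Vancouver", "date": "next Saturday"}},
    {"app": "Hotels", "api": "search_house",
     "args": {"city": "#find_trains.destination", "rating": ">=4.2", "guests": 2}},
]
```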

Metrics

The paper uses rigorous metrics to grade the models. It’s not enough to just “chat” about the solution; the model must write valid code.

1. F1 Score for App Selection (\(F1_{app}\)): Did the model pick the right Apps? To calculate this, we look at Precision (\(P_{app}\)) and Recall (\(R_{app}\)), computed over the set of predicted Apps \(\hat{\mathcal{A}}\) versus the ground-truth set \(\mathcal{A}\):

\[
P_{app} = \frac{|\hat{\mathcal{A}} \cap \mathcal{A}|}{|\hat{\mathcal{A}}|}, \qquad
R_{app} = \frac{|\hat{\mathcal{A}} \cap \mathcal{A}|}{|\mathcal{A}|}, \qquad
F1_{app} = \frac{2 \, P_{app} \, R_{app}}{P_{app} + R_{app}}
\]

2. F1 Score for API Selection (\(F1_{api}\)): Once the App is chosen, did the model pick the correct function? (e.g., calling find_train instead of buy_ticket when the user just wants to search).

3. Success Rate (Succ): This is the “Hard Mode” metric. A task is considered a success only if:

  • All Apps are identified correctly.
  • All APIs are identified correctly.
  • All Arguments (inputs and outputs) match the ground truth, including correct dependency linking.

If the model identifies every App and API correctly but misses a single start time or passes the wrong variable name, the task scores 0.
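A rough sketch of how such metrics can be computed, using set-based F1 and an all-or-nothing comparison (this is a reconstruction for illustration, not the authors' evaluation script):

```python
def f1(predicted: set, gold: set) -> float:
    """Set-based F1 over selected Apps (or APIs): harmonic mean of precision and recall."""
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(predicted), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def strict_success(pred_path, gold_path) -> bool:
    """All-or-nothing: every App, API, and argument (including dependency links) must match."""
    return pred_path == gold_path

print(f1({"Trains", "Hotels"}, {"Trains", "Hotels", "Restaurants"}))  # 0.8
```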

Experiments & Results: The Reality Check

The researchers tested 9 distinct LLMs, ranging from open-source models like Mistral-7B and LLaMA-3 to proprietary giants like GPT-3.5 and GPT-4o.

The main results, presented in the table below, reveal a stark reality about the current state of LLMs.

Main results of different LLMs on AppBench (Table 3 in the paper).

Key Takeaways from the Results:

  1. Complexity Kills Performance: Look at the drop-off for almost every model.
  • SS (Simple): GPT-4o achieves 70.92% Success.
  • MM (Complex): GPT-4o achieves only 2.00% Success.
  • This is a catastrophic drop. It indicates that while models are great at simple tool use, they lose coherence when juggling multiple apps and dependencies.
  2. GPT-4o is King (but still fails): GPT-4o significantly outperforms other models, likely due to its superior reasoning and instruction-following capabilities. However, a 2% success rate on complex tasks suggests it is not yet reliable for autonomous agent workflows.

  3. Open Source Lags Behind: Models like Mistral and Qwen (7B/14B) struggle to even format the output correctly, often scoring 0% on complex tasks. However, larger open-source models like LLaMA-3-70B perform competitively, sometimes beating GPT-3.5.

Deep Dive Analysis

Why are these tasks so hard? The researchers conducted several analyses to pinpoint the bottlenecks.

1. The Enemy is Depth, Not Just Width

The researchers analyzed how “Sequential Scale” (how many steps must happen one after another) versus “Parallel Scale” (how many independent tasks exist) affects performance.

As shown in Figure 4, performance degrades in both directions, but the Sequential Scale is particularly punishing. The deeper the dependency chain, the more likely the model is to “lose the thread” of the logic or hallucinate a variable that doesn’t exist yet.

The relationship between GPT-4o’s performance with parallel and sequential scaling.

2. Prompting Strategy: Hierarchical vs. Flat

There are two ways to present the available tools to an LLM:

  • Hierarchical: First, ask the LLM to pick an App. Then, show it only the APIs for that App. (Good for saving context window).
  • Flat: Dump every API from every App into the prompt at once.

Surprisingly, GPT-4o performed better with Flat prompting (green lines below), likely because its massive context window allows it to see the “whole picture” and plan dependencies more effectively. GPT-3.5, however, suffered from “information overload” in the Flat setting, performing better when the task was broken down hierarchically (red lines).
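To make the two setups concrete, here is a rough sketch of how the prompts might be assembled (the wording and registry are illustrative, not the paper's exact templates):

```python
APPS = {  # illustrative registry
    "Trains": ["find_trains", "buy_ticket"],
    "Hotels": ["search_house", "book_house"],
}

def flat_prompt(instruction: str) -> str:
    """Single prompt exposing every API of every App at once."""
    tools = "\n".join(f"{app}.{api}" for app, apis in APPS.items() for api in apis)
    return f"Available tools:\n{tools}\n\nInstruction: {instruction}\nPlan:"

def hierarchical_prompts(instruction: str, chosen_apps: list[str]) -> list[str]:
    """First prompt selects Apps; follow-up prompts show only the chosen Apps' APIs."""
    pick = f"Apps: {', '.join(APPS)}\nInstruction: {instruction}\nWhich Apps are needed?"
    plans = [f"APIs of {app}: {', '.join(APPS[app])}\nInstruction: {instruction}\nPlan:"
             for app in chosen_apps]
    return [pick, *plans]
```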

Performance gap between hierarchical and flat prompting.

3. Where Exactly Do They Fail?

The error analysis in Table 4 is revealing. The models aren’t usually failing to pick the right App (App selection is relatively easy).

The failure happens at the Argument Level, specifically with Dependent Values (D-Values). The models struggle to correctly link the output of API_A to the input of API_B. They also struggle with “Time/Space” reasoning—for example, calculating that if a train arrives at 2:00 PM, the car pickup should be at 2:15 PM.
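The temporal step in that example is trivial to do programmatically, which highlights how brittle the models' implicit arithmetic is; the 15-minute buffer below is just the illustrative value from the example above:

```python
from datetime import datetime, timedelta

arrival = datetime.strptime("14:00", "%H:%M")   # train arrives at 2:00 PM
pickup = arrival + timedelta(minutes=15)        # illustrative 15-minute transfer buffer
print(pickup.strftime("%I:%M %p"))              # 02:15 PM
```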

Error analysis of GPT-4o on AppBench.

Can We Fix It? Fine-Tuning and In-Context Learning

If the base models are failing, can we teach them to be better?

Fine-Tuning

The researchers fine-tuned LLaMA-3-8B on a training set of AppBench data. As expected, fine-tuning improved the “formatting” metrics (F1 App and API). The model learned the syntax of the task perfectly.

However, looking at the Success Rate (Succ) in the bottom graphs of Figure 6, the improvement is marginal for complex tasks. Fine-tuning helps the model “speak the language” of the benchmark, but it doesn’t necessarily grant it the reasoning power to solve complex dependency logic.

Performance gap between original LLaMA3-8B and fine-tuned version.

In-Context Learning (Few-Shot Prompting)

What if we just give the model examples in the prompt (3-shot, 5-shot)?

Table 5 shows that for GPT-4o, providing examples helps significantly in the simple (SS) category (jumping from 70% to 81%). However, for the complex MM category, the success rate remains stagnant around 2-3%. The complexity of the planning graph seems to exceed what can be taught through a few static examples in the prompt.

In-context learning results of GPT-4o on AppBench.

Conclusion and Implications

AppBench serves as a reality check for the AI industry. It demonstrates that while we have mastered the art of the “Chatbot” and the “Single-Tool Agent,” we are still in the early stages of building Meta-Planners capable of navigating the messy, interconnected ecosystem of real-world applications.

The paper highlights that the primary bottlenecks are not just identifying user intent, but managing dependencies (Graph Structures) and handling diverse sources (Permission Isolation).

For students and researchers, this opens up exciting avenues for future work:

  1. Agent Frameworks: Can we build “System 2” thinking layers that explicitly map out dependency graphs before generating code?
  2. Self-Correction: Can agents learn to “compile” their plans, catch dependency errors, and retry? (A toy sketch of such a check appears after this list.)
  3. Hybrid Architectures: Combining LLMs with symbolic planners to handle the rigid logic of API data flows.
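As a toy illustration of the second direction, a planner could "compile" its own output before execution and reject any step that references an output no earlier step produces (using the same illustrative path format sketched earlier):

```python
def validate_plan(path):
    """Return dependency errors: '#' references to outputs no earlier step produces."""
    errors, produced = [], set()
    for i, step in enumerate(path):
        for arg, value in step["args"].items():
            if isinstance(value, str) and value.startswith("#"):
                ref = value[1:].split(".")[0]   # "#find_trains.arrival" -> "find_trains"
                if ref not in produced:
                    errors.append(f"step {i}: argument '{arg}' depends on unknown output '{ref}'")
        produced.add(step["api"])
    return errors

# With the illustrative path format used earlier, a valid plan returns [].
```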

Until we solve these challenges, our AI assistants will likely remain helpful for checking the weather, but untrustworthy for planning our vacations.


The full list of apps and APIs used in this benchmark can be seen below, illustrating the diverse domains the models were tested against.

List of All Apps and their corresponding APIs in AppBench.