Large Language Models (LLMs) are rapidly evolving from simple chatbots into complex reasoning engines. One of the most exciting frontiers is the development of multi-agent systems—architectures in which a primary LLM orchestrates a team of specialized sub-agents. One agent might specialize in code analysis, another in database queries, and a third in web search. Each agent can be equipped with dozens or even thousands of tools—functions or APIs they can call to complete tasks.
This scalability is powerful but introduces a critical bottleneck: the routing problem. When a user asks, “What were our top-selling products in the Northeast region last quarter, and how does that compare to our main competitor’s press release?”, how does the system decide which agent or tool should respond?
Should the orchestrator look for a “database agent”? That seems plausible, but what if the most relevant function—say, a quarterly_sales_report(region) tool—is hidden inside an agent with a generic “Business Analytics” description? The system might miss it entirely. Conversely, sending every tool description from every agent to the LLM would be wildly inefficient, consuming thousands of tokens.
This is the problem tackled by Tool-to-Agent Retrieval, a method introduced by researchers at PwC in their paper “Tool-to-Agent Retrieval: Bridging Tools and Agents for Scalable LLM Multi-Agent Systems.” Their approach avoids this all-or-nothing trade-off by creating a unified system where agents and tools coexist in a shared searchable space, enabling both granular and high-level retrieval within a single operation.
The Old Way: Agent-First vs. Tool-Only Retrieval
Before exploring the new approach, it helps to understand why previous methods failed.
Agent-First Retrieval – The system indexes only agent-level descriptions. When a query arrives, it searches this list and selects the most relevant agent. It’s like asking a front desk clerk, “Who can help fix my computer?” and being sent to “IT.” But the person who actually knows your system is buried inside that department. This approach suffers from context dilution: when an agent bundles hundreds of tools, its summary becomes so generic that the unique capabilities of those tools are lost.
Tool-Only Retrieval – The opposite approach flattens everything into a massive list of tools, ignoring agent groupings. It captures detail but loses context. Many tasks require sequences of operations that are conveniently grouped within one agent, such as a database agent that can connect_to_database, run_sql_query, and format_results. A tool-only search might find one piece of that sequence but miss the rest of the steps that make the workflow coherent.

Figure 1. Traditional agent-only retrieval (left) vs. Tool-to-Agent Retrieval (right). The new method embeds tools and agents in a shared vector space for joint retrieval and metadata traversal.
The Core Method: Tool-to-Agent Retrieval
The researchers’ central insight is to treat tools and agents as connected entities within one unified search space. Instead of running separate retrievals for each, the system embeds both levels—tools and agents—in a shared vector space linked by metadata.
1. Building a Unified Catalog
First, the team constructs a combined tool–agent catalog containing both granular and broad descriptions:
- Agent Corpus (\(C_A\)) – Names and high-level descriptions of agents (e.g., “A code analysis agent capable of linting, debugging, and executing Python scripts.”).
- Tool Corpus (\(C_T\)) – Detailed descriptions of individual tools (e.g., “Function execute_python_script that takes Python code as input and returns the standard output.”).
Every tool entry in \(C_T\) includes a metadata link to its parent agent—think of it as an employee directory where each profile lists the department:
\[ \text{owner}(\text{tool}) = \text{agent} \]
This simple mapping is the bridge between granular and contextual search.
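As a rough illustration, a minimal catalog might look like the Python sketch below; the field names (kind, owner, text) are hypothetical choices for this example, not the paper's exact schema.

```python
# A minimal sketch of a combined tool-agent catalog.
# Field names ("kind", "owner", "text") are illustrative, not the paper's schema.

agent_corpus = [
    {
        "kind": "agent",
        "name": "code_analysis_agent",
        "text": "A code analysis agent capable of linting, debugging, "
                "and executing Python scripts.",
    },
]

tool_corpus = [
    {
        "kind": "tool",
        "name": "execute_python_script",
        "text": "Function execute_python_script that takes Python code "
                "as input and returns the standard output.",
        "owner": "code_analysis_agent",  # metadata link: owner(tool) = agent
    },
]

# The unified catalog searched at query time contains both levels.
catalog = agent_corpus + tool_corpus
```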
2. Embedding in a Shared Vector Space
Each catalog entry—tools and agents alike—is converted into a numeric vector via an embedding model. These embeddings place entities in a high-dimensional semantic space: similar meanings cluster close together.
In that space, a user query like “run this Python code” would align closely with the vector for the execute_python_script tool. A broader query like “help me with my program” would fall nearer to the “Code Analysis Agent.” This unified space allows flexible matching across specific and general intents simultaneously.
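To make this concrete, here is a minimal sketch of building the shared embedding space, assuming the catalog from the previous snippet and using the open-source MiniLM model (one of the embedding families the paper evaluates) via sentence-transformers:

```python
from sentence_transformers import SentenceTransformer
import numpy as np

# Any embedding model can fill this role; MiniLM is one of the open-source
# models evaluated in the paper. `catalog` comes from the previous sketch.
model = SentenceTransformer("all-MiniLM-L6-v2")

# Embed tool and agent descriptions into the same vector space.
entry_vecs = model.encode(
    [entry["text"] for entry in catalog], normalize_embeddings=True
)

def score(query: str) -> np.ndarray:
    """Cosine similarity between the query and every catalog entry."""
    q = model.encode([query], normalize_embeddings=True)[0]
    return entry_vecs @ q  # normalized vectors, so dot product = cosine

sims = score("run this Python code")
print(catalog[int(np.argmax(sims))]["name"])  # likely execute_python_script
```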
3. The Retrieval Algorithm
The process, summarized in Algorithm 1 of the paper, unfolds in two stages:
Initial Retrieval: Convert the user query into a vector and search the unified catalog. Retrieve the top \(N\) most similar entities (a mix of tools and agents).
Aggregation and Ranking: From these \(N\) items, derive the top \(K\) unique agents:
- If the retrieved entity is an agent, add it directly to the candidate list.
- If it’s a tool, follow the metadata link to its parent agent.
- Remove duplicates, rank by similarity, and return the final top \(K\) agents.
This approach neatly merges precision and efficiency. It recognizes when a specific tool matches a user need but automatically elevates its containing agent, enabling routing at the right level of abstraction. For multi-step queries, the same process runs step-by-step (Step-wise Querying) to dynamically switch agents as tasks evolve.
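Putting the two stages together, a compact sketch (reusing the catalog and score helpers from the earlier snippets; variable names and ranking details are mine, not Algorithm 1 verbatim) might look like this:

```python
def retrieve_agents(query: str, n: int = 10, k: int = 3) -> list[str]:
    """Stage 1: fetch the top-N entities; Stage 2: resolve to the top-K unique agents."""
    sims = score(query)
    top_n = np.argsort(-sims)[:n]  # indices of the N most similar catalog entries

    ranked_agents: list[str] = []
    for idx in top_n:
        entry = catalog[idx]
        # Agents are kept directly; tools are followed to their parent agent.
        agent = entry["name"] if entry["kind"] == "agent" else entry["owner"]
        if agent not in ranked_agents:  # deduplicate, preserving similarity order
            ranked_agents.append(agent)
    return ranked_agents[:k]

print(retrieve_agents("run this Python code and format the results"))
```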
Experiments and Results: Does It Actually Work?
To test their hypothesis, the authors benchmarked Tool-to-Agent Retrieval against leading agent retrievers on realistic datasets.
The Evaluation Benchmark: LiveMCPBench
They used LiveMCPBench, a dataset built for evaluating LLM agent retrieval. It includes:
- 70 MCP servers (agents)
- 527 tools
- 95 real-world, multi-step questions annotated with step-level breakdowns and mappings between tools and agents
On average, each question spans 2.68 steps involving 2.82 tools and 1.4 agents—exactly the type of environment where fine-grained context matters.
Metrics of Success
The evaluation used standard information retrieval metrics:
- Recall@K – How often the correct agent appears in the top K results.
- mAP@K – Rewards ranking correct agents near the top.
- nDCG@K – Measures ranking quality, giving more credit when correct items appear higher in the list.
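As a concrete example, Recall@K for a single query can be computed along these lines; this is a generic illustration with made-up agent names, not the benchmark's actual evaluation code:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant agents that appear in the top-k retrieved list."""
    hits = sum(1 for agent in retrieved[:k] if agent in relevant)
    return hits / len(relevant) if relevant else 0.0

# Two relevant agents, one appearing in the top 5 -> Recall@5 = 0.5
print(recall_at_k(["analytics_agent", "search_agent"], {"analytics_agent", "db_agent"}, k=5))
```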
Performance Gains

Table 1. Benchmark results on LiveMCPBench. Tool-to-Agent Retrieval achieves leading scores across Recall, mAP, and nDCG compared to baselines.
As Table 1 shows, Tool-to-Agent Retrieval achieved notable improvements: a Recall@5 of 0.83, representing a 19.4% relative gain over MCPZero, and an nDCG@5 of 0.46, a 17.7% boost. It not only found correct agents more reliably but also ranked them more appropriately.
Confirming Generality Across Embedding Models
A natural question: are these improvements due merely to better embeddings? To test robustness, the team repeated experiments across eight models from Google (Vertex AI, Gemini), Amazon (Titan), OpenAI, and the open-source MiniLM family.

Table 2. Results across eight embedding models. Tool-to-Agent Retrieval outperforms MCPZero across Recall, nDCG, and mAP—independent of embedding architecture.
No matter the model, the improvements held steady. For example, with Amazon Titan v2, Recall@5 jumped from 0.66 to 0.85, a 28% relative increase. Even lightweight open-source embeddings saw meaningful gains. This consistency indicates that the advantage stems from the retrieval architecture itself, not from model-specific quirks.
Interestingly, 39% of top results still originated from direct agent matches, showing the system balances granularity and abstraction: leveraging broad agent descriptions when relevant while retaining tool-level precision when needed.
Conclusion: Smarter, Unified Routing for LLM Systems
Tool-to-Agent Retrieval offers a streamlined and scalable solution to one of the biggest challenges in LLM multi-agent orchestration. By embedding tools and agents in a shared vector space and linking them via metadata, it enables flexible, one-pass retrieval that preserves both detail and context.
Key takeaways:
- Unified Search Wins: A combined catalog of tools and agents consistently outperforms searching either layer alone.
- Metadata Links Make It Work: Explicit tool-to-agent connections let specific matches inform broader routing decisions.
- No More Context Dilution: Fine-grained tool semantics are preserved within coherent agent bundles.
As LLM systems grow to coordinate thousands of APIs and MCP servers, approaches like Tool-to-Agent Retrieval will be crucial. They allow models to think and act across both macro-level agents and micro-level tools—transforming tool selection from a bottleneck into a bridge for scalable AI reasoning.