If you have played with Large Language Models (LLMs) recently, you’ve likely encountered the concept of Agents. We’ve moved past simple chatbots; we now have systems where LLMs use tools, browse the web, write code, and even talk to other LLMs to solve problems.
However, building these multi-agent systems is incredibly hard. Early frameworks like AutoGen or MetaGPT relied on humans to design the workflow by hand. Newer methods try to automate this, searching for the “perfect” agent architecture. But they all share a fatal flaw: they look for a static, one-size-fits-all solution.
Think about it: does answering “What is 12 \(\times\) 21?” require the same cognitive heavy lifting as solving a PhD-level physics problem? Of course not. Yet, most automated agent systems today would throw the same complex, token-expensive workflow at both.
In this post, we are diving deep into a new paper, “Multi-agent Architecture Search via Agentic Supernet”, which proposes MaAS. This framework doesn’t just build a single system; it learns a probability distribution of architectures—an Agentic Supernet—that dynamically assembles the perfect team of agents for every specific query.
The Dilemma of Current Multi-Agent Systems
Before understanding the solution, we need to understand the problem. The evolution of Multi-Agent Systems (MAS) has gone through three phases:
- Manual Design: Engineers hand-craft prompts and workflows (e.g., “Agent A talks to Agent B, then uses Tool C”). This is rigid and labor-intensive.
- Automated Search (The Status Quo): Systems like AFlow or ADAS use algorithms to search for an optimal workflow. They might discover that a “Debate” structure works best for a math dataset.
- The MaAS Approach: Dynamic, query-dependent instantiation.
The problem with Phase 2 is resource allocation. If you optimize a system for hard math problems, it becomes too expensive and slow for simple arithmetic. Conversely, if you optimize for speed, you fail at complex reasoning.
This creates two dilemmas:
- Dilemma 1 (Efficiency): A static system cannot balance high performance with low token costs across varied difficulties.
- Dilemma 2 (Generalization): A system optimized for coding often fails at creative writing or web browsing.
The researchers behind MaAS argue that we shouldn’t be looking for one optimal system. We should be looking for a way to generate the right system on the fly.
The Core Concept: The Agentic Supernet
At the heart of MaAS is the Agentic Supernet.
In traditional Neural Architecture Search (NAS), a “supernet” is a massive graph containing all possible neural network connections. MaAS adapts this for agents. Instead of a fixed flowchart of agents, imagine a probabilistic cloud of potential agent interactions.

As shown in Figure 1, the system consists of various building blocks (operators) like:
- CoT: Chain-of-Thought reasoning.
- Reflexion: Self-correction mechanisms.
- Debate: Multiple agents arguing to find the truth.
- Tool Use: Web search, code execution, etc.
On the right side of Figure 1, you can see the magic happen. For a simple arithmetic query (“12 \(\times\) 21”), the supernet activates a simple path (green). For a complex physics derivation, it activates a complex web of reflection and debate (red).
Defining the Search Space
To make this mathematically sound, the paper defines an Agentic Operator \(\mathcal{O}\). This isn’t just a function; it’s a composite process involving LLMs, prompts, and tools.

Here, \(\mathcal{M}\) represents the LLM (like GPT-4), \(\mathcal{P}\) is the prompt, and \(\mathcal{T}\) represents tools or temperature settings.
A Multi-Agent System (MAS) \(\mathcal{G}\) is then defined as a directed acyclic graph (DAG) of these operators:

The Agentic Supernet is not a single graph, but a sequence of probability distributions over these operators across layers.

This means that at every “layer” of the reasoning process, the system doesn’t commit to a fixed action; it maintains a probability distribution over operators, conditioned on what happened in earlier layers.
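To make the search space concrete, here is a minimal Python sketch of how these pieces might be represented. The class and field names are illustrative assumptions, not the paper’s implementation:

```python
from dataclasses import dataclass, field

@dataclass
class Operator:
    """An agentic operator O, bundling an LLM backbone M, a prompt P,
    and tool/temperature settings T (names are illustrative)."""
    name: str                 # e.g. "CoT", "Reflexion", "Debate", "EarlyExit"
    model: str                # the LLM backbone, e.g. "gpt-4o-mini"
    prompt: str               # the operator's prompt template
    tools: list = field(default_factory=list)  # e.g. ["web_search", "python"]

@dataclass
class AgenticSupernet:
    """Not one fixed DAG, but a sequence of probability distributions
    over the operator pool, one distribution per layer."""
    operators: list           # the shared pool of operators
    layer_probs: list         # layer_probs[l][i] = P(operator i at layer l)
```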
How MaAS Works: The Architecture
The MaAS framework operates in a loop of sampling, executing, and evolving. Let’s break down the workflow illustrated in Figure 2.

1. The Controller and Adaptive Sampling
When a query \(q\) arrives (left side of Figure 2), it is fed into a Controller Network. This controller determines which operators to activate.
The probability of generating a specific multi-agent system architecture \(\mathcal{G}\) is the product of the probabilities of selected operators at each layer:

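Spelled out, it has roughly this shape (the notation here is my reconstruction from the surrounding text, not copied from the paper):

\[
p_{\pi}(\mathcal{G} \mid q) \;=\; \prod_{\ell=1}^{L} p\big(\mathcal{O}^{(\ell)} \mid \mathcal{O}^{(1)}, \ldots, \mathcal{O}^{(\ell-1)},\, q\big)
\]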
However, MaAS introduces a clever twist: Mixture-of-Experts (MoE) Routing. Rather than picking a single operator per layer, it computes an activation score for every available operator and selects the highest-scoring ones until their cumulative score crosses a threshold. For harder tasks, this means the system can activate multiple parallel strategies at once.
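Here is a minimal sketch of that threshold-based selection, assuming the controller has already produced a score per operator (the threshold value and the normalization step are my assumptions):

```python
def select_operators(scores, threshold=0.7):
    """Pick the highest-scoring operators for one layer until their
    cumulative (normalized) score crosses the threshold. Easy queries
    concentrate mass on one operator; hard queries spread it, so
    several operators get activated in parallel."""
    total = sum(scores.values())
    probs = {op: s / total for op, s in scores.items()}
    selected, cumulative = [], 0.0
    for op, p in sorted(probs.items(), key=lambda kv: -kv[1]):
        selected.append(op)
        cumulative += p
        if cumulative >= threshold:
            break
    return selected

# A harder query spreads its scores, so three operators fire at once:
layer_ops = select_operators({"CoT": 0.35, "Debate": 0.30,
                              "Reflexion": 0.20, "EarlyExit": 0.15})
# -> ["CoT", "Debate", "Reflexion"]  (0.35 + 0.30 + 0.20 >= 0.7)
```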
2. The Early Exit Mechanism
One of the biggest wastes of resources in AI agents is “over-thinking” simple problems. MaAS implements an Early Exit Operator (\(\mathcal{O}_{\text{exit}}\)).
If the controller decides the problem is solved, it triggers this operator, terminating the workflow immediately. This makes the depth of the network dynamic.

This equation essentially says: The probability calculation stops if the “Exit” operator is selected. This feature alone is responsible for massive token savings.
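In sampling terms, the early exit simply truncates the layer loop. A hedged sketch of that control flow (`run_ops` is a placeholder for whatever executes one layer’s operators; none of this is the paper’s code):

```python
def sample_and_run(query, layers, run_ops):
    """Walk the sampled layers in order and stop as soon as the Exit
    operator is chosen, making the network depth query-dependent.
    `layers` is a list of per-layer operator selections, e.g. the
    output of a routing step like the one sketched above."""
    state = query
    for ops in layers:
        if "EarlyExit" in ops:
            return state             # simple query: stop here, save tokens
        state = run_ops(ops, state)  # otherwise keep reasoning
    return state

# Toy usage: the exit fires at the second layer, so "Debate" never runs.
result = sample_and_run(
    "What is 12 x 21?",
    layers=[["CoT"], ["EarlyExit"], ["Debate"]],
    run_ops=lambda ops, s: f"{s} -> {'+'.join(ops)}",
)
```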
3. Execution
Once the architecture is sampled, it is executed. The agents talk to each other, use tools, and produce an answer.

Optimizing the Supernet: Learning to be Efficient
How does MaAS learn to pick the right agents? It uses a joint optimization strategy that targets two things simultaneously:
- The Distribution (\(\pi\)): Tuning the controller to pick the right operators.
- The Operators (\(\mathbb{O}\)): Improving the prompts and settings of the agents themselves.
The objective function seeks to maximize the probability of the correct answer (\(a\)) while minimizing the Cost (\(C\)), weighted by a parameter \(\lambda\):

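Written out, the objective has roughly this shape (notation reconstructed from the description above; treat it as a paraphrase of the paper’s formula, not the formula itself):

\[
\max_{\pi,\, \mathbb{O}} \;\; \mathbb{E}_{(q,\, a)}\; \mathbb{E}_{\mathcal{G} \sim p_{\pi}(\cdot \mid q)} \Big[\, P\big(a \mid \mathcal{G}(q)\big) \;-\; \lambda\, C(\mathcal{G}, q) \,\Big]
\]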
Gradient Estimation for Distributions
Since tool usage and LLM calls are non-differentiable (you can’t backpropagate through a Google Search), MaAS uses Empirical Bayes Monte Carlo sampling.

This looks complex, but the intuition is simple: The system samples \(K\) different architectures. It sees which ones got the right answer cheaply (\(m_k\)). It then nudges the probability distribution \(\pi\) to favor those architectures in the future.
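That intuition maps onto a REINFORCE-style estimator. A minimal NumPy sketch under assumed notation (\(m_k\) is the cost-penalized score of sample \(k\); the softmax parameterization and baseline are my assumptions, not details from the paper):

```python
import numpy as np

def update_distribution(logits, sampled_ids, scores, lr=0.1):
    """Nudge the operator distribution pi toward architectures that
    scored well (right answer, low cost). `logits` parameterize pi via
    a softmax; sampled_ids[k] is the operator chosen in sample k and
    scores[k] = m_k is its cost-penalized utility."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    baseline = np.mean(scores)                 # variance reduction
    grad = np.zeros_like(logits)
    for op_id, m_k in zip(sampled_ids, scores):
        advantage = m_k - baseline
        one_hot = np.zeros_like(logits)
        one_hot[op_id] = 1.0
        grad += advantage * (one_hot - probs)  # gradient of log-softmax
    grad /= len(scores)
    return logits + lr * grad                  # ascend the estimate

# Example: K = 3 samples; sample 1 (operator 1) succeeded cheaply,
# so operator 1's probability gets nudged upward.
new_logits = update_distribution(np.zeros(4), [2, 1, 2], [0.2, 0.9, 0.1])
```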
Textual Gradients for Operators
This is the most innovative part. How do you “optimize” a prompt using gradients? You can’t use standard calculus. Instead, MaAS uses Textual Gradients.

As shown in Figure 3, the system uses a meta-agent (an “Optimizer LLM”). This agent analyzes the execution logs. If a “Debate” operator failed because the agents agreed too quickly, the Optimizer LLM generates a “gradient” in text form—essentially feedback saying, “Modify the prompt to encourage more aggressive disagreement.”

The optimization update \(\nabla_{\mathbb{O}}\) consists of updates to the Prompt (\(\mathbf{T}_{\mathcal{P}}\)), Temperature (\(\mathbf{T}_{\mathcal{T}}\)), and Node Structure (\(\mathbf{T}_{N}\)).
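Conceptually, the prompt-update loop looks like the sketch below. The `llm` callable and the critique wording are hypothetical stand-ins; the paper’s actual meta-prompts are not reproduced here:

```python
def textual_gradient_step(operator, execution_log, llm):
    """Ask a meta-agent to critique an operator and rewrite its prompt.
    `llm` is a placeholder for any text-in/text-out completion call;
    `operator` has .name and .prompt fields, as in the sketch above."""
    # Step 1: the optimizer LLM produces the "textual gradient",
    # i.e. feedback in plain language about what went wrong.
    critique = llm(
        f"Here is the prompt of the '{operator.name}' operator:\n"
        f"{operator.prompt}\n\n"
        f"Here is an execution log where it underperformed:\n"
        f"{execution_log}\n\n"
        "Explain the failure and state, in one sentence, how the "
        "prompt should change."
    )
    # Step 2: apply the gradient by rewriting the prompt.
    operator.prompt = llm(
        f"Rewrite this prompt, applying the feedback.\n"
        f"Prompt: {operator.prompt}\nFeedback: {critique}"
    )
    return operator
```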
Experimental Results: The “Token Economy”
The researchers evaluated MaAS on major benchmarks like GSM8K and MATH (mathematics), HumanEval (coding), and GAIA (general assistant tasks).
Performance vs. Baselines
The results are striking. MaAS consistently outperformed handcrafted systems (like CoT or LLM-Debate) and other automated systems (like GPTSwarm).

In Table 1, notice the MATH benchmark. MaAS achieves 51.82% accuracy, beating the next best automated system (AFlow) and significantly outperforming standard Chain-of-Thought (46.40%).
It also shines in complex tool-use scenarios. In the GAIA benchmark (Table 2 below), which requires web browsing and file processing, MaAS dominates.

On GAIA Level 1, MaAS scores 25.91%, compared to just 16.13% for AutoAgents.
The Cost Analysis
High accuracy usually comes with a high price tag. However, because MaAS uses the “Early Exit” and adaptive sampling, it is incredibly efficient.

Figure 4 is perhaps the most important chart in the paper.
- Look at the API Cost ($) plot (bottom). MaAS (the large red circle) is in the “sweet spot”: high Accuracy (y-axis) and very low Cost (x-axis).
- Compare this to LLM-Debate (purple circle), which has decent accuracy but is massively expensive.
- Compare it to ADAS (green circle), which uses huge amounts of training tokens but achieves lower accuracy.
Table 3 further quantifies this efficiency:

MaAS required only **$3.38** to train on the MATH benchmark, whereas AFlow required **$22.50**. That is nearly a 7x reduction in training cost for superior performance.
Visualizing the Adaptive Behavior
To prove that the “Supernet” actually adapts to difficulty, the researchers visualized the sampling probability for different queries.

In Figure 5, look at the difference between (a) Easy and (d) Hard:
- Easy: The probability mass spikes on “I/O” (Input/Output) and “Early Exit” almost immediately. The system looks at the query, solves it, and quits.
- Hard: The system engages “Ensemble” and “ReAct” methods, maintaining execution over multiple steps.
This dynamic behavior is further illustrated in the specific workflows generated:

Figure 6 shows the actual graphs created.
- Top-Left: A simple coding task gets a linear “CoT” flow.
- Top-Right: A complex GAIA task (searching for Asian monarchies) triggers a complex graph involving Search tools, Summarization, and Debate.
Transferability and Robustness
A common issue in AI is that a system optimized for GPT-4 might break when using Llama-3. MaAS, however, shows strong Cross-Model Transferability.

As shown in Table 7, an agentic supernet optimized using gpt-4o-mini still provides massive gains when transferred to open-source models like Qwen-2.5-72b or llama-3.1-70b.
Furthermore, MaAS exhibits Inductive Capability. The researchers ran an experiment where they hid the “Debate” operator during training but allowed it during inference.

Remarkably, as seen in Figure 9 (specifically the pie chart on the right), the system figured out how to incorporate the previously unseen “Debate” operator (the gray slice) into Layer 4 logic, demonstrating that the learned distribution is generalized enough to handle new tools.
Conclusion
MaAS represents a significant shift in how we think about Artificial Intelligence agents. We are moving away from the era of “Prompt Engineering” a single, perfect agent, and entering the era of Agentic Architecture Search.
By treating the multi-agent system as a probabilistic distribution rather than a static graph, MaAS achieves what was previously a trade-off: State-of-the-art performance with significantly reduced inference costs.
Key Takeaways:
- Dynamic is better than Static: Adjusting complexity based on the query saves money and improves accuracy.
- Supernets for Agents: The concept of a continuous distribution of architectures (borrowed from NAS) applies powerfully to agent workflows.
- Textual Gradients: We can “optimize” prompts using feedback loops, effectively allowing agents to code their own upgrades.
For students and researchers entering this field, MaAS highlights that the future isn’t just about making smarter LLMs—it’s about making smarter systems that organize those LLMs effectively.