If you have played with Large Language Models (LLMs) recently, you’ve likely encountered the concept of Agents. We’ve moved past simple chatbots; we now have systems where LLMs use tools, browse the web, write code, and even talk to other LLMs to solve problems.

However, building these multi-agent systems is incredibly hard. Early frameworks like AutoGen or MetaGPT rely on humans manually designing the workflow. Newer methods try to automate this, searching for the “perfect” agent architecture. But they all suffer from a fatal flaw: they look for a static, one-size-fits-all solution.

Think about it: does answering “What is 12 \(\times\) 21?” require the same cognitive heavy lifting as solving a PhD-level physics problem? Of course not. Yet, most automated agent systems today would throw the same complex, token-expensive workflow at both.

In this post, we are diving deep into a new paper, “Multi-agent Architecture Search via Agentic Supernet”, which proposes MaAS. This framework doesn’t just build a single system; it learns a probability distribution of architectures—an Agentic Supernet—that dynamically assembles the perfect team of agents for every specific query.


The Dilemma of Current Multi-Agent Systems

Before understanding the solution, we need to understand the problem. The evolution of Multi-Agent Systems (MAS) has gone through three phases:

  1. Manual Design: Engineers hand-craft prompts and workflows (e.g., “Agent A talks to Agent B, then uses Tool C”). This is rigid and labor-intensive.
  2. Automated Search (The Status Quo): Systems like AFlow or ADAS use algorithms to search for an optimal workflow. They might discover that a “Debate” structure works best for a math dataset.
  3. The MaAS Approach: Dynamic, query-dependent instantiation.

The problem with Phase 2 is resource allocation. If you optimize a system for hard math problems, it becomes too expensive and slow for simple arithmetic. Conversely, if you optimize for speed, you fail at complex reasoning.

This creates two dilemmas:

  • Dilemma 1 (Efficiency): A static system cannot balance high performance with low token costs across varied difficulties.
  • Dilemma 2 (Generalization): A system optimized for coding often fails at creative writing or web browsing.

The researchers behind MaAS argue that we shouldn’t be looking for one optimal system. We should be looking for a way to generate the right system on the fly.


The Core Concept: The Agentic Supernet

At the heart of MaAS is the Agentic Supernet.

In traditional Neural Architecture Search (NAS), a “supernet” is a massive graph containing all possible neural network connections. MaAS adapts this for agents. Instead of a fixed flowchart of agents, imagine a probabilistic cloud of potential agent interactions.

Figure 1. (Left) The building blocks of MaAS; (Right) when confronting different queries, the agentic supernet adaptively samples a tailored multi-agent architecture in a query-dependent manner.

As shown in Figure 1, the system consists of various building blocks (operators) like:

  • CoT: Chain-of-Thought reasoning.
  • Reflexion: Self-correction mechanisms.
  • Debate: Multiple agents arguing to find the truth.
  • Tool Use: Web search, code execution, etc.

On the right side of Figure 1, you can see the magic happen. For a simple arithmetic query (“12 \(\times\) 21”), the supernet activates a simple path (green). For a complex physics derivation, it activates a complex web of reflection and debate (red).

Defining the Search Space

To make this mathematically sound, the paper defines an Agentic Operator \(\mathcal{O}\). This isn’t just a function; it’s a composite process involving LLMs, prompts, and tools.

Equation defining an Agentic Operator

Here, \(\mathcal{M}\) represents the LLM (like GPT-4), \(\mathcal{P}\) is the prompt, and \(\mathcal{T}\) represents tools or temperature settings.
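Concretely, a minimal way to write this (a sketch based on the description above, not necessarily the paper's exact notation) is:

\[ \mathcal{O} \;=\; \langle \mathcal{M},\, \mathcal{P},\, \mathcal{T} \rangle, \qquad \mathcal{O}(x) \;=\; \mathcal{M}\big(\mathcal{P} \oplus x;\; \mathcal{T}\big), \]

where \(\oplus\) denotes concatenating the operator's prompt with the incoming context \(x\), so the operator invokes the LLM on that combined input under its tool and temperature configuration.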

A Multi-Agent System (MAS) \(\mathcal{G}\) is then defined as a directed acyclic graph (DAG) of these operators:

Equation defining a Multi-Agent System

The Agentic Supernet is not a single graph, but a sequence of probability distributions over these operators across layers.

Equation defining the Agentic Supernet distribution

This implies that at every “layer” of the reasoning process, the system doesn’t just have a fixed action; it has a probability of choosing specific actions based on what happened previously.
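One consistent way to write this down (again a sketch, not the paper's exact symbols) is as a sequence of layer-wise conditional distributions over the operator set \(\mathbb{O}\):

\[ \mathcal{A} \;=\; \big(\pi^{(1)},\, \pi^{(2)},\, \ldots,\, \pi^{(L)}\big), \qquad \pi^{(\ell)} \;=\; P\big(\mathcal{O}^{(\ell)} \,\big|\, q,\; \mathcal{O}^{(1)}, \ldots, \mathcal{O}^{(\ell-1)}\big), \quad \mathcal{O}^{(\ell)} \in \mathbb{O}. \]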


How MaAS Works: The Architecture

The MaAS framework operates in a loop of sampling, executing, and evolving. Let’s break down the workflow illustrated in Figure 2.

Figure 2. The overall framework of the proposed MaAS.

1. The Controller and Adaptive Sampling

When a query \(q\) arrives (left side of Figure 2), it is fed into a Controller Network. This controller determines which operators to activate.

The probability of generating a specific multi-agent system architecture \(\mathcal{G}\) is the product of the probabilities of selected operators at each layer:

Equation for probability of a system architecture
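In sketch form, consistent with the layer-wise distributions above:

\[ P\big(\mathcal{G} \mid q\big) \;=\; \prod_{\ell=1}^{L} P\big(\mathcal{O}^{(\ell)} \,\big|\, q,\; \mathcal{O}^{(1)}, \ldots, \mathcal{O}^{(\ell-1)}\big). \]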

However, MaAS introduces a clever twist: Mixture-of-Experts (MoE) Routing. It doesn’t just pick one operator per layer. It calculates an activation score for all available operators. It selects the top-performing operators until their cumulative score hits a threshold. This means for harder tasks, the system might activate multiple parallel strategies at once.
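To see how this differs from plain top-1 selection, here is a minimal Python sketch of threshold-based routing. The function name, the softmax normalization, and the 0.7 threshold are illustrative assumptions, not details taken from the paper:

```python
import numpy as np

def select_operators(scores, threshold=0.7):
    """Keep the highest-scoring operators until their cumulative
    (normalized) activation score exceeds `threshold`."""
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                     # softmax over operator scores
    order = np.argsort(probs)[::-1]          # operators sorted by score, descending
    selected, cumulative = [], 0.0
    for idx in order:
        selected.append(int(idx))
        cumulative += probs[idx]
        if cumulative >= threshold:          # stop once the mass threshold is hit
            break
    return selected

# Example: an "easy" query concentrates mass on one operator,
# a "hard" query spreads it, so more operators activate in parallel.
easy_scores = np.array([4.0, 0.5, 0.2, 0.1])
hard_scores = np.array([1.2, 1.1, 1.0, 0.2])
print(select_operators(easy_scores))  # [0]
print(select_operators(hard_scores))  # [0, 1, 2]
```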

2. The Early Exit Mechanism

One of the biggest wastes of resources in AI agents is “over-thinking” simple problems. MaAS implements an Early Exit Operator (\(\mathcal{O}_{\text{exit}}\)).

If the controller decides the problem is solved, it triggers this operator, terminating the workflow immediately. This makes the depth of the network dynamic.

Equation showing the Early Exit logic

This equation essentially says: The probability calculation stops if the “Exit” operator is selected. This feature alone is responsible for massive token savings.
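In the notation used above, the effect is that the layer product simply truncates at the exit layer (a hedged sketch of the idea, not the paper's exact formula):

\[ P\big(\mathcal{G} \mid q\big) \;=\; \prod_{\ell=1}^{\ell_{\text{exit}}} P\big(\mathcal{O}^{(\ell)} \,\big|\, q,\; \mathcal{O}^{(<\ell)}\big), \qquad \ell_{\text{exit}} \;=\; \min\big\{\ell : \mathcal{O}_{\text{exit}} \text{ is selected at layer } \ell\big\}. \]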

3. Execution

Once the architecture is sampled, it is executed. The agents talk to each other, use tools, and produce an answer.

Equation for execution
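In sketch notation, executing the sampled system simply means running the DAG on the query and recording both the answer and the resources spent producing it:

\[ a \;=\; \mathcal{G}(q), \qquad C\big(\mathcal{G}, q\big) \;=\; \text{tokens and API calls consumed along the way}. \]

The cost term \(C\) is what the optimization below tries to keep small.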


Optimizing the Supernet: Learning to be Efficient

How does MaAS learn to pick the right agents? It uses a joint optimization strategy that targets two things simultaneously:

  1. The Distribution (\(\pi\)): Tuning the controller to pick the right operators.
  2. The Operators (\(\mathbb{O}\)): Improving the prompts and settings of the agents themselves.

The objective function seeks to maximize the probability of the correct answer (\(a\)) while minimizing the Cost (\(C\)), weighted by a parameter \(\lambda\):

Optimization Objective Function
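A plausible way to write this objective, based on the description above (the paper's exact formulation may differ in how the expectation and cost are combined), is:

\[ \max_{\pi,\; \mathbb{O}} \;\; \mathbb{E}_{(q,\,a)\sim \mathcal{D}} \; \mathbb{E}_{\mathcal{G} \sim P(\cdot \mid q)} \Big[ P\big(a \mid \mathcal{G}(q)\big) \;-\; \lambda\, C\big(\mathcal{G}, q\big) \Big], \]

where \(\mathcal{D}\) denotes the training queries and answers, the inner expectation is taken under the controller distribution, and \(\lambda\) trades accuracy against token/API cost.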

Gradient Estimation for Distributions

Since tool usage and LLM calls are non-differentiable (you can’t backpropagate through a Google Search), MaAS uses Empirical Bayes Monte Carlo sampling.

Equation for gradient estimation

This looks complex, but the intuition is simple: The system samples \(K\) different architectures. It sees which ones got the right answer cheaply (\(m_k\)). It then nudges the probability distribution \(\pi\) to favor those architectures in the future.
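The Python sketch below shows the score-function (REINFORCE-style) flavor of this idea for a single layer's distribution. Everything here is illustrative: `sample_and_score` stands in for "instantiate an architecture, execute it, and observe correctness and cost", and the softmax parameterization is an assumption rather than the paper's exact estimator:

```python
import numpy as np

def estimate_pi_gradient(logits, sample_and_score, K=4, lam=0.1):
    """Monte Carlo estimate of the gradient for one layer's operator
    distribution. `sample_and_score(k)` is a hypothetical black box:
    it runs an architecture built around operator k and returns
    (is_correct, token_cost) -- the non-differentiable part."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                                  # softmax over operator logits
    grad = np.zeros_like(logits)
    for _ in range(K):
        k = np.random.choice(len(probs), p=probs)         # sample an operator
        is_correct, cost = sample_and_score(k)            # execute (black box)
        reward = float(is_correct) - lam * cost           # accuracy minus weighted cost
        one_hot = np.zeros_like(logits)
        one_hot[k] = 1.0
        grad += reward * (one_hot - probs)                # d log p_k / d logits for a softmax
    return grad / K                                       # average over K rollouts

# Toy usage: operator 2 is cheap and correct, so gradient ascent raises its logit.
np.random.seed(0)
toy = lambda k: (k == 2, 0.5 if k == 2 else 2.0)
print(estimate_pi_gradient(np.zeros(4), toy, K=32))
```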

Textual Gradients for Operators

This is the most innovative part. How do you “optimize” a prompt using gradients? You can’t use standard calculus. Instead, MaAS uses Textual Gradients.

Figure 3. The demonstration of textual gradient.

As shown in Figure 3, the system uses a meta-agent (an “Optimizer LLM”). This agent analyzes the execution logs. If a “Debate” operator failed because the agents agreed too quickly, the Optimizer LLM generates a “gradient” in text form—essentially feedback saying, “Modify the prompt to encourage more aggressive disagreement.”

Equation for Textual Gradients

The optimization update \(\nabla_{\mathbb{O}}\) consists of updates to the Prompt (\(\mathbf{T}_{\mathcal{P}}\)), Temperature (\(\mathbf{T}_{\mathcal{T}}\)), and Node Structure (\(\mathbf{T}_{N}\)).
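Here is a minimal sketch of what the prompt component of such an update could look like in practice. `call_llm` is a hypothetical placeholder for any chat-completion client, and the two meta-prompts are illustrative, not the paper's actual ones:

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for an LLM chat-completion call."""
    raise NotImplementedError("plug in your LLM client here")

def textual_gradient_step(operator_prompt: str, execution_log: str) -> str:
    # 1. Ask the optimizer LLM to critique the operator given what actually happened.
    critique = call_llm(
        "You are optimizing a multi-agent operator.\n"
        f"Current prompt:\n{operator_prompt}\n\n"
        f"Execution log:\n{execution_log}\n\n"
        "Explain, concretely, how the prompt should change to fix the failure."
    )
    # 2. Apply the textual "gradient": rewrite the prompt according to the critique.
    return call_llm(
        f"Rewrite this prompt:\n{operator_prompt}\n\n"
        f"Applying this feedback:\n{critique}\n\n"
        "Return only the revised prompt."
    )
```

Separating the critique step from the rewrite step mirrors how a numerical gradient is separate from the parameter update: the critique itself can be logged and inspected as the "gradient".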


Experimental Results: The “Token Economy”

The researchers evaluated MaAS on major benchmarks like GSM8K and MATH (mathematics), HumanEval (coding), and GAIA (general assistant tasks).

Performance vs. Baselines

The results were overwhelming. MaAS consistently outperformed handcrafted systems (like CoT or LLM-Debate) and other automated systems (like GPTSwarm).

Table 1. Performance comparison with single-agent, hand-crafted multi-agent systems, and automated agentic workflows. The base LLM is consistently set as gpt-4o-mini for all baselines. We bold the best results and underline the runner-ups.

In Table 1, notice the MATH benchmark. MaAS achieves 51.82% accuracy, beating the next best automated system (AFlow) and significantly outperforming standard Chain-of-Thought (46.40%).

It also shines in complex tool-use scenarios. In the GAIA benchmark (Table 2 below), which requires web browsing and file processing, MaAS dominates.

Table 2. Performance on the GAIA benchmark. The best and runner-up results are bolded and underlined, respectively.

On GAIA Level 1, MaAS scores 25.91%, compared to just 16.13% for AutoAgents.

The Cost Analysis

High accuracy usually comes with a high price tag. However, because MaAS uses the “Early Exit” and adaptive sampling, it is incredibly efficient.

Figure 4. The cost analysis of MaAS on MATH benchmark.

Figure 4 is perhaps the most important chart in the paper.

  • Look at the API Cost ($) plot (bottom). MaAS (the large red circle) is in the “sweet spot”: high Accuracy (y-axis) and very low Cost (x-axis).
  • Compare this to LLM-Debate (purple circle), which has decent accuracy but is massively expensive.
  • Compare it to ADAS (green circle), which uses huge amounts of training tokens but achieves lower accuracy.

Table 3 further quantifies this efficiency:

Table 3. Efficiency comparison between MaAS and state-of-the-art baselines on the MATH benchmark. We shade the values of the lowest token/cost/wall-clock time and the highest performance.

MaAS required only $3.38 to train on the MATH benchmark, whereas AFlow required $22.50. That is nearly a 7x reduction in training cost for superior performance.


Visualizing the Adaptive Behavior

To prove that the “Supernet” actually adapts to difficulty, the researchers visualized the sampling probability for different queries.

Figure 5. The visualization of MaAS’s operator sampling process.

In Figure 5, look at the difference between (a) Easy and (d) Hard:

  • Easy: The probability mass spikes on “I/O” (Input/Output) and “Early Exit” almost immediately. The system looks at the query, solves it, and quits.
  • Hard: The system engages “Ensemble” and “ReAct” methods, maintaining execution over multiple steps.

This dynamic behavior is further illustrated in the specific workflows generated:

Figure 6. Case study and visualization for MaAS. Queries are from the HumanEval, MATH, and GAIA benchmarks.

Figure 6 shows the actual graphs created.

  • Top-Left: A simple coding task gets a linear “CoT” flow.
  • Top-Right: A complex GAIA task (searching for Asian monarchies) triggers a complex graph involving Search tools, Summarization, and Debate.

Transferability and Robustness

A common issue in AI is that a system optimized for GPT-4 might break when using Llama-3. MaAS, however, shows strong Cross-Model Transferability.

Table 7. Cross-model transferability of MaAS. We optimize the agentic supernet with gpt-4o-mini, and report the performance before and after equipping the LLM backbones with the optimized agentic supernet.

As shown in Table 7, an agentic supernet optimized using gpt-4o-mini still provides massive gains when transferred to open-source models like Qwen-2.5-72b or llama-3.1-70b.

Furthermore, MaAS exhibits Inductive Capability. The researchers ran an experiment where they hid the “Debate” operator during training but allowed it during inference.

Figure 9. The layer-wise distribution of MaAS on the HumanEval benchmark with the Debate operator; Layer 4 is where the Debate operator is added.

Remarkably, as seen in Figure 9 (specifically the pie chart on the right), the system figured out how to incorporate the previously unseen “Debate” operator (the gray slice) into Layer 4 logic, demonstrating that the learned distribution is generalized enough to handle new tools.


Conclusion

MaAS represents a significant shift in how we think about Artificial Intelligence agents. We are moving away from the era of “Prompt Engineering” a single, perfect agent, and entering the era of Agentic Architecture Search.

By treating the multi-agent system as a probabilistic distribution rather than a static graph, MaAS achieves what was previously a trade-off: State-of-the-art performance with significantly reduced inference costs.

Key Takeaways:

  1. Dynamic is better than Static: Adjusting complexity based on the query saves money and improves accuracy.
  2. Supernets for Agents: The concept of a continuous distribution of architectures (borrowed from NAS) applies powerfully to agent workflows.
  3. Textual Gradients: We can “optimize” prompts using feedback loops, effectively allowing agents to code their own upgrades.

For students and researchers entering this field, MaAS highlights that the future isn’t just about making smarter LLMs—it’s about making smarter systems that organize those LLMs effectively.