MAS-Zero:
Designing Multi-Agent Systems with Zero Supervision

Salesforce AI Research

Abstract

Multi-agent systems (MAS) leveraging the impressive capabilities of Large Language Models (LLMs) hold significant potential for tackling complex tasks. However, most current MAS depend on manually designed agent roles and communication protocols. These manual designs often fail to align with the underlying LLMs' strengths and struggle to adapt to novel tasks. Recent automatic MAS approaches attempt to mitigate these limitations but typically necessitate a validation set for tuning and yield static MAS designs that lack adaptability during inference. We introduce MAS-Zero, the first self-evolved, inference-time-only framework for automatic MAS design. MAS-Zero employs meta-level design to iteratively generate, evaluate, and refine MAS configurations tailored to each problem instance, without requiring a validation set. Critically, it enables dynamic agent composition and problem decomposition through meta-feedback on solvability and completeness. Experiments across math, graduate-level QA, and software engineering benchmarks, using both closed-source and open-source LLM backbones of varying sizes, demonstrate that MAS-Zero outperforms both manual and automatic MAS baselines, achieving a 7.44% average accuracy improvement over the next strongest baseline while maintaining cost-efficiency. These findings underscore the promise of meta-level self-evolved design for creating effective and adaptive MAS.

Paradigm

In supervised learning (or reinforcement learning from verified rewards), humans give the expected output to the model, and the model learns to generate that output. In manual MAS or workflow design, humans give not only the expected output but also the fixed structure of the MAS or workflow. This can be sub-optimal when agent preferences differ from human intuition. Moreover, manually designed MAS are hard to adapt to new tasks. In this work, we enable the machine to propose MAS or workflow designs without human intervention, not even final-outcome supervision.
Figure 1: Illustrations of our paradigm (adapted from AZR).

Contrast with Existing Work

Figure 2: Manual MAS design vs. Existing automatic MAS design vs. MAS-Zero.

Approach

We propose MAS-Zero, a meta-agent that plays several roles (design, evaluation, and verification) and operates in two steps:

  1. Meta-Iterations:
    1. MAS-Design: decompose the task and propose a sub-MAS for each sub-task. We frame MAS design as code generation.
    2. MAS-Feedback: evaluate the generated MAS design on solvability and completeness. We evaluate these metrics on the intermediate outputs obtained by executing the MAS code.
  2. Self-Verification: select the most suitable outcome from the set of all candidate solutions generated throughout the meta-iterations.

Throughout the whole process: no validation set is needed; supervision comes only from meta-level self-feedback on the MAS design; everything runs at inference time.
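The two steps above can be sketched as a single control loop. This is a minimal, hypothetical sketch, not the released implementation: the hooks `design_mas`, `execute_mas`, `meta_feedback`, and `self_verify` stand in for meta-agent calls and are passed in as callables so the control flow is self-contained.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class MASOutputs:
    final_answer: str                                      # candidate answer from this design
    intermediate: List[str] = field(default_factory=list)  # per-sub-task outputs

def mas_zero(task: str,
             design_mas: Callable,     # (task, feedback) -> MAS code string
             execute_mas: Callable,    # (code, task) -> MASOutputs
             meta_feedback: Callable,  # (task, code, intermediate) -> feedback
             self_verify: Callable,    # (task, candidates) -> chosen answer
             num_iterations: int = 3) -> str:
    candidates: List[str] = []
    feedback: Optional[str] = None
    for _ in range(num_iterations):
        # MAS-Design: decompose the task and emit a sub-MAS per sub-task,
        # conditioned on the previous iteration's meta-feedback.
        code = design_mas(task, feedback)
        # Execute the generated MAS code to collect intermediate and final outputs.
        outputs = execute_mas(code, task)
        candidates.append(outputs.final_answer)
        # MAS-Feedback: judge solvability and completeness from the
        # intermediate outputs; this steers the next design.
        feedback = meta_feedback(task, code, outputs.intermediate)
    # Self-Verification: pick the most suitable candidate answer.
    return self_verify(task, candidates)
```

Note that no ground-truth answer or validation set appears anywhere in the loop: the only signal is the meta-agent's own feedback on its designs.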

Figure 3: Illustrations of MAS-Zero.
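Because the design step frames MAS construction as code generation, the emitted artifact is itself executable. The following is a hypothetical illustration of what such generated code might look like (decompose the task, run a small proposer-critic sub-MAS per sub-task, then aggregate); the prompts and the `llm` callable are assumptions for illustration, not an actual generated program.

```python
from typing import Callable

def solve_subtask(llm: Callable[[str], str], subtask: str) -> str:
    # A minimal sub-MAS: a proposer drafts a solution, a critic checks it,
    # and the proposer revises given the critique.
    draft = llm(f"Solve this sub-task step by step: {subtask}")
    critique = llm(f"Check this solution for errors: {draft}")
    return llm(f"Revise the solution given the critique.\n"
               f"Solution: {draft}\nCritique: {critique}")

def generated_mas(llm: Callable[[str], str], task: str) -> str:
    # Task decomposition: one sub-task per line.
    subtasks = llm(f"Decompose into sub-tasks, one per line: {task}").splitlines()
    # Run one sub-MAS per sub-task.
    partials = [solve_subtask(llm, s) for s in subtasks]
    # Aggregate the partial results into a final answer.
    return llm("Combine these partial results into a final answer:\n"
               + "\n".join(partials))
```

Since the design is just code, the meta-agent can inspect, execute, and rewrite it between iterations, which is what makes the MAS-Feedback step possible.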

Key Takeaways

1. “MAS” moments
You may have encountered this: you give a complex task to a strong LLM, and it produces an incorrect answer. You then try to "help" by, for example, manually writing the plan, decomposing the task, listing the required knowledge, or even acting as a manual verifier, explicitly telling the model that its answer is incorrect and needs to be reconsidered, yet none of these work.

With MAS-Zero, we observe several “MAS” moments, where the system autonomously decides to act in ways we did not anticipate—and succeeds as a result.
Figure 4: An example of a "MAS" moment, comparing CoT, MAS-Zero (iteration 1), and MAS-Zero (iteration 5) (more in [MAS Collection]).
2. “Self-evolve” without outcome supervision works
MAS-Zero relies purely on its own generated data and sets a new frontier in the performance-cost trade-off across diverse domains and LLMs.
Figure 5: Scatter plots comparing the Pareto fronts of various GPT-4o-based systems on three benchmarks. Manually designed MAS baselines are marked in purple and automatic MAS methods in blue. MAS-Zero is highlighted as an orange star. MAS-Zero delivers high performance at lower cost than comparable automatic MAS methods, establishing a new frontier for accuracy vs. cost trade-off.
3. Inference-scaling helps MAS designs
Off-the-shelf LLMs—often not trained on MAS design—struggle to design effective MAS when used directly. But with inference-time "scaling" (our meta-level strategy), their performance improves significantly.
Figure 6: Performance vs. meta-iterations with GPT-4o.
4. Structure matters. Verification amplifies it.
MAS design can boost performance by over 10% compared to a single LLM/agent. With a stronger verifier, the gains become even larger.
Figure 7: Performance gains (purple) over base performance (light blue) given an oracle verifier. Automatic MAS baselines (ADAS, AFlow, and MaAS) cannot integrate external verifiers, yielding zero improvement.
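The reason MAS-Zero can benefit from an external verifier while the baselines cannot is structural: MAS-Zero keeps every candidate answer from its meta-iterations, so a verifier has a set to choose from, whereas a system that commits to a single final answer leaves the verifier nothing to do. A minimal sketch of this selection step, where the `verifier` interface (answer -> accept/reject) is an assumption for illustration:

```python
from typing import Callable, List

def select_with_verifier(candidates: List[str],
                         verifier: Callable[[str], bool]) -> str:
    """Return the first candidate the verifier accepts; fall back to the
    last candidate (the final iteration's answer) if none is accepted."""
    for answer in candidates:
        if verifier(answer):
            return answer
    return candidates[-1]
```

With an oracle verifier, this selection recovers the best answer whenever any iteration produced it, which is why the gains in Figure 7 grow with verifier strength.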
5. More analysis and extensions are coming
Stay tuned for deeper evaluations, and stronger automatic MAS design systems!

BibTeX

@misc{ke2025maszero,
      title={MAS-Zero: Designing Multi-Agent Systems with Zero Supervision}, 
      author={Zixuan Ke and Austin Xu and Yifei Ming and Xuan-Phi Nguyen and Caiming Xiong and Shafiq Joty},
      year={2025},
      eprint={2505.14996},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.14996}, 
}