Introduction

We often treat Large Language Models (LLMs) like omniscient oracles. We type a question into ChatGPT or Claude, and we expect a single, authoritative, and correct answer. But under the hood, these models are probabilistic engines. When you ask an open-ended question—like “Is it ethical to eat meat?” or “How should we solve climate change?”—the model often defaults to the most likely continuation based on its training data. This can lead to generic, one-sided, or even biased answers.

Worse, when models don’t know the answer, they often hallucinate. They present falsehoods with the same confidence as facts.

Prompt engineers have tried to fix this with “persona prompting”—telling the model, “You are an expert ethicist” or “You are a senior Python developer.” While this helps, it creates an echo chamber. An “expert ethicist” persona might be biased toward a specific philosophical framework, ignoring nutritional or environmental perspectives.

So, how do we get an LLM to think broadly, fact-check itself, and provide a comprehensive answer without retraining the model?

The answer might lie in a technique borrowed from management science: Multi-expert Prompting. In this post, we will deep-dive into a fascinating research paper that proposes simulating a “committee of experts” inside a single LLM to significantly boost reliability, safety, and usefulness.

Figure 1: An overview of Multi-expert Prompting comparing single-expert vs multi-expert responses on ethical questions.

As shown above, where a standard expert prompt might give a rigid “No” to the ethics of meat-eating, Multi-expert Prompting synthesizes views from a nutritionist, an ethicist, and an environmentalist to provide a nuanced, human-level response.

Background: The Limits of the Lone Genius

To understand why Multi-expert Prompting is a breakthrough, we first need to understand the limitations of current prompting strategies.

The “Expert” Problem

The authors of the paper identify a critical weakness in previous methods like ExpertPrompting. In ExpertPrompting, the LLM is first asked to write a description of the expert identity best suited to answer a question, and then to answer as that expert.

While this outperforms standard zero-shot prompting, it suffers from perspective bias. If you ask about a medical symptom, a “Surgeon” persona might suggest cutting, while a “Homeopath” persona might suggest herbs. Neither gives you the full picture. A single-expert framework falls short on open-ended instructions where multiple valid perspectives exist.

The Management Science Connection: NGT

The researchers didn’t just invent a new prompt structure out of thin air; they looked at how humans solve complex problems. They adapted the Nominal Group Technique (NGT).

Developed in the 1970s (Van de Ven and Delbecq, 1974), NGT is a structured form of small-group discussion designed to reach consensus. Unlike a chaotic brainstorming session where the loudest person wins, NGT follows a strict process:

  1. Silent Generation: Everyone writes down ideas independently.
  2. Recording: Ideas are listed without debate.
  3. Clarification: The group discusses to clarify meanings.
  4. Voting: The group votes to rank ideas.

Multi-expert Prompting translates this human workflow into an algorithmic chain-of-thought for LLMs.

The Core Method: Inside Multi-expert Prompting

The Multi-expert Prompting framework operates in two distinct phases: Generation and Aggregation.

Figure 2: Overview of the two-step process: Generating Experts and Aggregating Responses.

Let’s break down the architecture step-by-step.

Step 1: Experts & Responses Generation

When the user provides an instruction (e.g., “What are the impacts of AI on education?”), the model acts as a moderator.

First, the system prompts the LLM to identify \(n\) diverse expert identities. Crucially, the authors found that simple, one-sentence descriptions of these experts work just as well as elaborate paragraphs. This makes the method efficient.

For example, if the question is about medical advice, the model might auto-generate:

  1. A Medical Doctor (Focus: Diagnosis and standard treatment)
  2. A Surgeon (Focus: Operative risks)
  3. A Physiotherapist (Focus: Rehabilitation and non-invasive care)

The LLM is then queried \(n\) times, once for each persona, to generate independent, long-form answers. This corresponds to the “Silent Generation” phase of NGT. By forcing the model to adopt distinct personas before answering, the system extracts a wider distribution of knowledge from the model’s latent space.
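
To make Step 1 concrete, here is a minimal Python sketch of the generation phase. It assumes a `call_llm(prompt: str) -> str` helper that wraps whatever chat-completion API you use (a concrete version appears near the end of this post), and the prompt wording is illustrative rather than the paper’s exact template.

```python
# Sketch of Step 1: expert & response generation.
# `call_llm(prompt)` is a stand-in for your chat-completion API of choice.

N_EXPERTS = 3  # the paper finds three experts to be a good default


def generate_experts(instruction: str, n: int = N_EXPERTS) -> list[str]:
    """Ask the model, acting as moderator, for n diverse one-sentence expert identities."""
    prompt = (
        f"You are a moderator. Given the instruction below, list {n} diverse experts "
        "best suited to answer it. Describe each expert in one sentence, one per line.\n\n"
        f"Instruction: {instruction}"
    )
    reply = call_llm(prompt)
    # One expert description per non-empty line.
    return [line.strip() for line in reply.splitlines() if line.strip()][:n]


def generate_expert_responses(instruction: str, experts: list[str]) -> list[str]:
    """Query the model once per persona -- NGT's 'silent generation' phase."""
    responses = []
    for expert in experts:
        prompt = (
            f"{expert}\n\n"
            f"Answer the following instruction as this expert:\n{instruction}"
        )
        responses.append(call_llm(prompt))
    return responses
```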

Step 2: Expert Responses Aggregation (The 7 Subtasks)

This is the most innovative part of the paper. Merging three long essays into one coherent answer is difficult. If you just ask ChatGPT to “summarize these three answers,” it often loses nuance or hallucinates new details.

To solve this, the authors designed a single Chain-of-Thought (CoT) prompt containing seven specific subtasks. This forces the model to process the information logically rather than intuitively.

The 7-Step Workflow

  1. Generating Agreed Viewpoints: The model identifies facts that appear in more than 50% of the expert answers. These form the “consensus foundation.”
  2. Generating Conflicted Viewpoints: The model explicitly lists where the experts disagree. (e.g., “Expert A says X is safe, Expert B says X is risky”).
  3. Resolving Conflicts: This leverages the LLM’s reasoning capabilities. The model reviews the agreed viewpoints (Step 1) to act as an arbiter for the conflicts in Step 2. It essentially performs a “weighted vote” based on logic and evidence.
  4. Generating Isolated Viewpoints: The model looks for unique insights that only one expert mentioned but are not contradictory. This ensures valuable niche information isn’t lost (a common problem in standard summarization).
  5. Collecting Viewpoints: This is a filtering step. The model gathers the outputs from Steps 1, 3, and 4.
  6. Generating Aggregated Response: The model drafts the final long-form response using the collected points.
  7. Final Selection (The Quality Control): The model compares its newly generated Aggregated Response against the original individual expert responses. It selects the best one based on factuality and usefulness.

Note: In 90% of cases, the model selects the Aggregated Response, but Step 7 acts as a safety valve. If the aggregation failed or became incoherent, the system can revert to the best single-expert answer.
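
Here is a companion sketch of the aggregation phase, using the same assumed `call_llm` helper. The subtask wording paraphrases the seven steps above; it is not the authors’ verbatim prompt.

```python
# Sketch of Step 2: aggregating the expert answers with one chain-of-thought prompt.
# The seven subtasks below paraphrase the paper's workflow.

AGGREGATION_SUBTASKS = """\
1. List the viewpoints shared by more than half of the experts (agreed viewpoints).
2. List the viewpoints on which the experts conflict.
3. Resolve each conflict, using the agreed viewpoints as evidence.
4. List unique, non-conflicting viewpoints mentioned by only one expert.
5. Collect the viewpoints from steps 1, 3, and 4.
6. Draft a single long-form answer from the collected viewpoints.
7. Compare your draft with each individual expert answer and return whichever is
   the most factual and useful."""


def aggregate_responses(instruction: str, experts: list[str], responses: list[str]) -> str:
    """Run the seven aggregation subtasks in a single LLM call."""
    expert_block = "\n\n".join(
        f"Expert {i + 1} ({expert}):\n{response}"
        for i, (expert, response) in enumerate(zip(experts, responses))
    )
    prompt = (
        f"Instruction: {instruction}\n\n"
        f"Here are {len(experts)} independent expert answers:\n\n{expert_block}\n\n"
        "Work through the following steps in order, then output only the final "
        f"answer chosen in step 7:\n{AGGREGATION_SUBTASKS}"
    )
    return call_llm(prompt)
```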

A Concrete Example: The Earthworm

To visualize how this works, look at the example below regarding the regeneration of earthworms.

Figure 10: A generated example by Multi-expert Prompting showing the breakdown of the worm regeneration question.

Notice the granularity. A Biologist, Zoologist, and Ecologist provide slightly different angles.

  • Step 1 finds the consensus: The anterior (head) section survives.
  • Step 2 finds the conflict: Can the posterior (tail) section regenerate a head?
  • Step 3 resolves it: Most evidence suggests the posterior end cannot regenerate a head.
  • The Final Output is a highly accurate, nuanced biological fact, avoiding the common myth that cutting a worm creates two worms.

Experiments & Results

The researchers tested this method against a battery of strong baselines, including standard Zero-shot, Chain-of-Thought (CoT), Self-Refine, and Multi-agent Debate. They used two models: Mistral-7B and ChatGPT (GPT-3.5).

The metrics focused on Reliability (Truthfulness, Factuality) and Safety (Toxicity, Hurtfulness).

Truthfulness and Factuality

The results were statistically significant. On the TruthfulQA benchmark—a dataset specifically designed to trick models into mimicking human misconceptions—Multi-expert Prompting achieved state-of-the-art results.

Table 1: Main experimental results comparing Multi-expert Prompting against baselines.

Key Takeaways from the Data:

  • Huge Gains in Truthfulness: With ChatGPT, Multi-expert Prompting scored 89.35% on TruthfulQA, beating the best baseline (ExpertPrompting) by nearly 9%. This is a massive leap in the NLP world.
  • Reduced Hallucinations: On FactualityPrompt, the method achieved the lowest hallucination rate (lower score is better).
  • Zero Toxicity: On the BOLD dataset, the method reduced toxicity to 0.000. The process of cross-examining expert views naturally filters out extreme or toxic takes.

Usefulness and Informativeness

One might worry that aggregating answers leads to a “Frankenstein” response that is factual but unreadable. The authors tested this using ExpertQA, a dataset of open-ended questions.

Figure 3: Informativeness and Usefulness comparisons on ExpertQA.

As shown in Figure 3, Multi-expert Prompting (dark blue bars) consistently wins against baselines in head-to-head comparisons judged by GPT-4. It produces answers that are not just “safe,” but genuinely more informative because they cover more angles (the “Isolated Viewpoints” from Step 4).

Analysis: Why Does It Work?

The paper includes several ablation studies—experiments where they remove parts of the system to see what breaks. These provide deep insights into why the method works.

1. The Magic Number is 3

How many experts do you need?

  • 1 expert (ExpertPrompting) is prone to bias.
  • 10 experts create too much noise and “too many cooks” confusion.

The data shows that 3 experts is the optimal number for current LLMs. This provides enough diversity to triangulate the truth without overwhelming the model’s context window.

Table 4: Multi-expert Prompting with varying numbers of experts.

2. Every Step Matters

The researchers tried skipping steps in the 7-step aggregation process.

  • Skipping Step 1 (Agreed Viewpoints) hurt performance the most. Establishing common ground is essential for coherence.
  • Skipping Step 4 (Isolated Viewpoints) reduced informativeness. The answer became generic.
  • Skipping Steps 2 & 3 (Conflict Resolution) reduced truthfulness. Without explicitly handling disagreements, the model just hallucinates a blend of contradictory facts.

Table 3: Ablation study showing performance decline when omitting subtasks.

3. It’s Not Just About Length

Critics might argue, “You’re just generating more text, so of course it covers more facts.” However, the authors compared their method against baselines forced to generate long answers. Multi-expert Prompting still won. The improvement comes from the structure of the reasoning, not the word count.

Discussion and Implications

The “Multi-expert Prompting” paper offers a compelling glimpse into the future of prompt engineering and AI alignment.

The “Democratic” AI

The authors draw a parallel to democratic theory. Just as democratic processes (ideally) produce better outcomes by moderating extreme views and aggregating collective wisdom, Multi-expert Prompting moderates the stochastic nature of LLMs. It forces the model to check its own work from multiple angles.

No Fine-Tuning Required

Perhaps the most practical advantage is that this is a zero-shot technique. It requires no training data and no fine-tuning. It can be applied immediately to any existing LLM (like Llama 3, GPT-4, or Claude 3) via API or prompt engineering.
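
Because the whole method lives at the prompt level, a working pipeline is just a thin wrapper around a chat API. Below is one possible wiring of the sketches from earlier (`generate_experts`, `generate_expert_responses`, `aggregate_responses`), shown with the OpenAI Python SDK as an example backend; any chat-capable model works the same way, and the model name is only an illustration.

```python
# One possible end-to-end wiring, reusing the sketches above.
# The OpenAI Python SDK (v1.x) is used here only as an example backend;
# the model name is likewise just an illustration.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def call_llm(prompt: str, model: str = "gpt-4o-mini") -> str:
    """Single-turn helper used by the generation and aggregation sketches."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


def multi_expert_answer(instruction: str, n_experts: int = 3) -> str:
    """Full pipeline: generate experts, collect their answers, aggregate."""
    experts = generate_experts(instruction, n_experts)
    responses = generate_expert_responses(instruction, experts)
    return aggregate_responses(instruction, experts, responses)


if __name__ == "__main__":
    print(multi_expert_answer("What are the impacts of AI on education?"))
```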

Safety by Design

The complete elimination of toxicity in their experiments is remarkable. By asking the model to role-play professional experts (who are generally polite and objective) and then aggregating their views, the system naturally suppresses the toxic or hurtful outputs that might emerge from a raw model.

Conclusion

Multi-expert Prompting is more than just a clever prompt trick; it is a robust framework for decision-making within Large Language Models. By simulating a committee of diverse experts and guiding them through a rigorous, NGT-inspired aggregation process, we can significantly reduce hallucinations and bias.

For students and researchers entering the field, this paper teaches a valuable lesson: We don’t always need bigger models to get better results. Sometimes, we just need to structure the model’s “thinking” process to mirror the best practices of human collaboration.

If you are building applications where truthfulness and nuance are non-negotiable—such as in education, medical advice, or complex analysis—implementing a multi-expert workflow could be a game-changer.