Introduction

We live in the golden age of answers. If you want to know the population of Brazil or the boiling point of tungsten, a quick Google search or a prompt to ChatGPT gives you the answer instantly. These systems excel at addressing known unknowns—information gaps you are aware of and can articulate into a specific question.

But what about the unknown unknowns? These are the concepts, connections, and perspectives you don’t even know exist. How do you ask a question about a topic when you don’t know the vocabulary? How do you explore the implications of a new technology if you don’t know the economic or ethical frameworks surrounding it?

In complex information seeking—like academic research, market analysis, or learning a new field—traditional tools often fail. Search engines leave it to you to formulate every next query yourself. Chatbots tend to be passive, answering only what is asked and often trapping users in an “echo chamber” of their own limited prior knowledge.

A recent paper from researchers at Stanford and Yale proposes a fascinating solution: Co-STORM. Instead of a lonely interrogation of a search bar, Co-STORM invites users to a dinner party of AI experts. By observing and participating in a collaborative discourse between simulated agents, users can discover serendipitous information and learn more deeply with less mental effort.

Figure 1: Comparison of different paradigms for learning and information seeking. Co-STORM enables humans to observe and participate in a collaborative discourse among LM agents with different roles. Users can request the system to generate a full-length cited report based on the discourse history and the information collected.

As shown in Figure 1, this shift from “Using Search Engines” (High Effort) to “Interacting with Co-STORM” (Low Effort, High Exploration) represents a new paradigm in human-AI interaction.

The Problem: The Cognitive Load of “Search”

To understand why Co-STORM is necessary, we must look at where current systems fall short in complex information seeking.

Complex information seeking isn’t about finding a single fact. It involves collecting, sifting, understanding, and organizing information from multiple sources to build a knowledge product, like a report or a mental model.

Table 1 illustrates the gaps in current technology:

Table 1: Comparison of different information-seeking assistance systems.

  • Information Retrieval (Search Engines): You get multiple sources, but you have to do all the synthesis yourself.
  • Single-Turn QA: You get an answer, but no depth or ongoing exploration.
  • Conversational QA (Chatbots): You can interact, but the bot rarely takes the initiative to show you what you should be asking.
  • Report Generation (like the original STORM system): It writes a great report, but it’s a static process. You can’t interrupt, steer, or learn during the generation.

The researchers identified that to truly support learning, a system needs to support Collaborative Discourse. Just as children learn by listening to parents discuss a topic, or students learn by observing a debate, humans learn effectively when they observe and occasionally participate in a conversation between knowledgeable entities.

The Co-STORM Method

Co-STORM (Collaborative STORM) is an information-seeking assistant that emulates a “roundtable” discussion. It doesn’t just answer you; it creates a conversation around you, which you can steer.

The Architecture of Discourse

At the heart of Co-STORM is a multi-agent system grounded in real-time information retrieval (Search).

Figure 2: Overview of Co-STORM. Co-STORM emulates a collaborative discourse among the user, simulated perspective-guided experts, and a simulated moderator. It maintains a dynamically updated mind map (§3.2) to help the user track and engage in the discourse (§3.3). Each simulated expert is prompted to determine its utterance intent based on the discourse history and to generate a question or an answer grounded in the Internet (§3.4). The simulated moderator is prompted with unused information and the mind map to generate a new question that automatically steers the discourse (§3.5). The mind map can be used to generate a full-length cited report as a takeaway. A complete discourse transcript and the associated report are detailed in Appendices G and H of the paper.

As illustrated in Figure 2, the system consists of three main components working in harmony (a rough sketch of how they fit together follows this list):

  1. The Agents (Experts & Moderator): These Large Language Models (LLMs) simulate a discussion.
  2. The User: You can observe the agents talking or jump in to ask a question or steer the topic.
  3. The Mind Map: A dynamic data structure that organizes the conversation visually, reducing the cognitive load of reading a wall of text.
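
To make the interaction concrete, here is a minimal sketch of how such a turn loop could be wired together. Everything in it (the expert and moderator objects, the mind map’s insert method, and the get_user_input callable) is a hypothetical stand-in for the components described above, not the paper’s actual code.

```python
# A minimal sketch of a Co-STORM-style turn loop; all objects passed in are
# hypothetical stand-ins, not the paper's actual API.
from dataclasses import dataclass, field


@dataclass
class Utterance:
    speaker: str
    text: str
    citations: list = field(default_factory=list)


def discourse_loop(topic, experts, moderator, mind_map, get_user_input, max_turns=20):
    """Alternate expert turns, periodic moderator questions, and optional user input."""
    history = []
    for turn in range(max_turns):
        user_text = get_user_input()                 # empty string if the user only observes
        if user_text:
            utterance = Utterance("User", user_text)
        elif turn > 0 and turn % (len(experts) + 1) == 0:
            # Every few turns the moderator injects a question to broaden the discussion.
            utterance = moderator.raise_question(topic, history, mind_map)
        else:
            expert = experts[turn % len(experts)]    # round-robin over the expert roster
            utterance = expert.speak(topic, history)
        history.append(utterance)
        mind_map.insert(utterance)                   # keep the shared outline current
    return history
```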

1. The Cast of Characters

If you ask a standard chatbot about “AlphaFold 3,” it gives you a summary. Co-STORM acts differently. It first determines who should be at the table. For a biotech topic, it might instantiate a “Geneticist,” an “AI Expert,” and a “Molecular Biologist.”

Perspective-Guided Experts: These agents don’t just generate text; they simulate a perspective. When it is an expert’s turn (see the sketch after this list):

  • They analyze the conversation history.
  • They decide on an intent (e.g., Ask a question, Provide an answer, Request detail).
  • If answering, they generate search queries, retrieve real data from the internet, and cite their sources.
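
A single expert turn might look roughly like the sketch below, which reuses the Utterance dataclass from the earlier sketch. The llm and search callables are generic placeholders (any LLM client and web-search wrapper), and the prompts are illustrative, not the paper’s actual prompts.

```python
# Sketch of one expert turn: decide an intent, then either ask a question or
# produce a grounded, cited answer. `llm` and `search` are placeholders.
def expert_turn(llm, search, perspective, topic, history):
    transcript = "\n".join(f"{u.speaker}: {u.text}" for u in history)

    # 1. Decide the utterance intent from the discourse so far.
    intent = llm(
        f"You are a {perspective} discussing {topic}. Conversation so far:\n{transcript}\n"
        "Reply with one intent: ask_question, answer, or request_detail."
    ).strip()

    if intent != "answer":
        question = llm(f"As a {perspective}, ask one follow-up question about {topic}.\n{transcript}")
        return Utterance(perspective, question)

    # 2. If answering, ground the response in retrieved evidence and keep the URLs as citations.
    queries = llm(f"Write two search queries that would answer the last question.\n{transcript}").splitlines()
    snippets = [s for q in queries for s in search(q)]   # each snippet: {"url": ..., "text": ...}
    sources = "\n".join(f"[{i + 1}] {s['url']}: {s['text']}" for i, s in enumerate(snippets))
    answer = llm(
        f"As a {perspective}, answer the last question using only these sources, citing [k]:\n"
        f"{sources}\n{transcript}"
    )
    return Utterance(perspective, answer, citations=[s["url"] for s in snippets])
```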

The Moderator: If you leave a group of experts alone, they might obsess over niche details. The Moderator is a special agent designed to ensure breadth. It monitors the conversation and injects new questions to steer the discourse toward unexplored areas.

Crucially, the Moderator looks for unused information. It performs a semantic search to find information relevant to the general topic but dissimilar to what has already been discussed. The researchers mathematically defined this “reranking score” to prioritize novelty:

\[
\cos(\mathbf{i}, \mathbf{t})^{\alpha} \cdot \left(1 - \cos(\mathbf{i}, \mathbf{q})\right)^{1 - \alpha}
\]

Here, \(\mathbf{i}\) is the embedding of a candidate piece of unused information. The score balances its relevance to the overall topic (\(\mathbf{t}\)) against its similarity to the specific question currently being discussed (\(\mathbf{q}\)), with \(\alpha\) controlling the trade-off. This mathematical nudge forces the AI to drag the conversation out of echo chambers and into the “unknown unknowns.”
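
Computed on embedding vectors, the score is nearly a one-liner. The sketch below assumes precomputed embeddings; alpha = 0.5 is an illustrative default rather than the paper’s setting, and clipping cosines to be non-negative is a practical safeguard so fractional powers stay real, not part of the published formula.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def rerank_score(info_emb, topic_emb, question_emb, alpha=0.5):
    """Score unused information: relevant to the topic, dissimilar to the current question."""
    relevance = max(cosine(info_emb, topic_emb), 0.0)       # clip negatives (practical safeguard)
    redundancy = max(cosine(info_emb, question_emb), 0.0)
    return (relevance ** alpha) * ((1.0 - redundancy) ** (1.0 - alpha))


# Candidate snippets would then be sorted by this score, and the top ones
# handed to the moderator to turn into a fresh question.
```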

2. The Dynamic Mind Map

Listening to a complex multi-party debate can be confusing. To help the user keep track, Co-STORM maintains a hierarchical Mind Map (visible in the top left of Figure 2).

As the conversation progresses, the system uses an “Insert Operation.” It analyzes every new piece of information and decides where it belongs in the tree structure. If a node gets too big, it triggers a “Reorganize Operation,” splitting the node into sub-topics. This allows the user to glance at the map and instantly understand the structure of the knowledge being uncovered.
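
A minimal way to picture these two operations is a small tree where each node holds a concept, its collected snippets, and child nodes. The similarity and propose_subtopics callables below stand in for the embedding and LM steps the paper describes, and the size threshold is illustrative.

```python
# Sketch of the dynamic mind map: "insert" routes new information to the best
# matching concept; "reorganize" splits an oversized node into sub-topics.
from dataclasses import dataclass, field


@dataclass
class MindMapNode:
    concept: str
    snippets: list = field(default_factory=list)
    children: list = field(default_factory=list)


def insert(node, snippet, similarity, propose_subtopics, max_snippets=10):
    """Place new information under the concept it matches best (the Insert Operation)."""
    best = max(node.children, key=lambda c: similarity(snippet, c.concept), default=None)
    if best is not None and similarity(snippet, best.concept) >= similarity(snippet, node.concept):
        insert(best, snippet, similarity, propose_subtopics, max_snippets)
    else:
        node.snippets.append(snippet)
        if len(node.snippets) > max_snippets:
            reorganize(node, similarity, propose_subtopics)


def reorganize(node, similarity, propose_subtopics):
    """Split an oversized node into LM-proposed sub-topics and redistribute its snippets."""
    node.children.extend(MindMapNode(c) for c in propose_subtopics(node.concept, node.snippets))
    pending, node.snippets = node.snippets, []
    for s in pending:
        target = max(node.children, key=lambda c: similarity(s, c.concept))
        target.snippets.append(s)
```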

3. The Final Artifact

At any point, the user can request a Cited Report. The system uses the Mind Map as an outline and the collected search results to write a comprehensive, Wikipedia-style article. This turns the casual exploration into a concrete knowledge product.
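
Conceptually, report generation is a depth-first walk over that same tree: each node becomes a section heading, and its collected snippets become the cited body text. The write_section callable below is a placeholder for the LM writing pass; this is a sketch of the idea, not the paper’s implementation.

```python
def generate_report(node, write_section, depth=1):
    """Depth-first walk over the mind map: each node becomes a section of the cited report."""
    lines = [f"{'#' * depth} {node.concept}"]
    if node.snippets:
        lines.append(write_section(node.concept, node.snippets))  # cited prose grounded in the snippets
    for child in node.children:
        lines.append(generate_report(child, write_section, depth + 1))
    return "\n\n".join(lines)
```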

Evaluation: Measuring Discovery

How do you measure if a system helps someone find “unknown unknowns”? The researchers attacked this problem from three angles: a new dataset, automatic metrics, and human trials.

The WildSeek Dataset

Existing datasets for information seeking were too simple. They focused on fact retrieval. To evaluate Co-STORM, the researchers created WildSeek, a dataset derived from real-world usage of the STORM engine.

Table 2: A sample data point in the WildSeek dataset for studying complex information-seeking tasks; the topic and goal are provided by users on the publicly available STORM website, and the domain is assigned manually.

As shown in Table 2, these aren’t simple queries. They are open-ended goals, such as “Investigate how a new shared currency could eliminate transaction costs.” The taxonomy of this dataset covers diverse fields from Economics to Healthcare (Figure 5).
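
For illustration, a WildSeek entry can be pictured as a small record with exactly the fields the caption describes; the field names below are paraphrased from that description, not taken from the released file format.

```python
from dataclasses import dataclass


@dataclass
class WildSeekEntry:
    topic: str    # user-provided subject from the public STORM website
    goal: str     # the open-ended information-seeking goal the user stated
    domain: str   # manually assigned category, e.g., "Economics" or "Healthcare"
```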

Figure 5: WildSeek taxonomy. The number in parentheses denotes the number of data points classified under the corresponding category or its descendants.

Automatic Evaluation Results

The researchers simulated users interacting with Co-STORM, a standard RAG Chatbot, and the original STORM system. They measured the quality of the final reports and the discourse itself.

Table 3: Automatic evaluation results for report quality and the quality of question-answering turns in the discourse with simulated users. Ablations are included as follows: “w/o Multi-Expert” denotes 1 expert and 1 moderator, and “w/o Moderator” denotes \(N\) experts and 0 moderators. † denotes significant differences (\(p < 0.05\)) from a paired \(t\)-test between Co-STORM and both baselines. The rubric grading uses a 1-5 scale. All scores reported are mean values.

Table 3 reveals critical insights:

  • Depth & Novelty: Co-STORM significantly outperforms RAG Chatbots and STORM+QA in the Depth and Novelty of the generated reports.
  • Engagement: The conversation turns were rated as significantly more engaging.
  • Diversity: Co-STORM cited nearly double the number of unique URLs compared to the baselines, indicating a much broader exploration of the internet.

Ablation studies (removing specific components) showed that the Moderator is essential. Without the moderator steering the conversation toward new areas, the “Novelty” scores drop significantly (Figure 3).

Figure 3: Rubric grading results for question-asking turn quality in automatic evaluation with simulated users.

Human Evaluation: Do Users Like It?

Ultimately, the goal is to help humans. The researchers recruited 20 participants for a study comparing Co-STORM against Google Search and a RAG Chatbot.

The results were overwhelmingly positive.

Figure 4: Survey results of the pairwise comparison (i.e., agreement on whether Co-STORM is better than the Search Engine/RAG Chatbot) in the human evaluation.

As displayed in Figure 4:

  • 70% of participants preferred Co-STORM over a Search Engine.
  • 78% preferred it over a RAG Chatbot.
  • Users specifically noted that Co-STORM required “Less Effort” while providing higher “User Engagement.”

Participants highlighted the “serendipity” of the system. One user noted, “Co-STORM allows for almost full automation and much better understanding as it brings up topics that the user may not even think of.”

Conclusion

The Co-STORM paper presents a convincing argument that the future of search isn’t just about better answers—it’s about better questions.

By moving from a “tool” metaphor (where the AI waits for input) to a “partner” metaphor (where AI agents actively discuss and explore), we can lower the barrier to learning complex topics. Co-STORM demonstrates that when we allow AI agents to converse with each other under the supervision of a moderator, they can surface the “unknown unknowns” that a human user might never have found on their own.

For students and researchers, this suggests a future where our AI assistants don’t just fetch data; they brainstorm with us, challenge our assumptions, and help us map out the frontiers of our own ignorance.