Introduction
In the world of Artificial Intelligence, Large Language Models (LLMs) are often described as “compressed knowledge.” They have devoured varied texts from millions of human authors, encompassing a vast spectrum of beliefs, cultures, and values. Yet, when we chat with a model like GPT-4, we often receive a single, polished, “majority-vote” answer.
This raises a fascinating research question: If these models were trained on diverse perspectives, can we reverse-engineer them to extract that diversity? Can an LLM step out of its default persona and simulate a crowd of people with disagreeing opinions?
Understanding this is crucial. Relying on a single viewpoint in Natural Language Processing (NLP) creates bias. Traditionally, fixing this required hiring diverse groups of human annotators—a process that is costly and slow. If LLMs can reliably generate diverse, grounded perspectives, they could revolutionize how we build datasets for subjective tasks like argumentation or hate speech detection.
In this post, we will dive into a recent paper that explores the limits of LLM diversity. We will look at a novel prompting method designed to “squeeze” different viewpoints out of a model and compare the machine’s “diversity coverage” against actual human groups.

Background: The Complexity of Subjective Tasks
In objective tasks, there is usually one right answer (e.g., “Paris is the capital of France”). However, in subjective tasks involving social norms or argumentation, the “truth” depends on who you ask.
Consider a statement like: “You are expected to do what you are told.”
To some, this is a positive statement about teamwork and safety. To others, it is a negative statement stifling creativity and innovation. These underlying values—teamwork, safety, creativity—are what the researchers call criteria.

As shown in Figure 2 above, a stance (“Agree” or “Disagree”) is rarely arbitrary; it is grounded in specific criteria. To extract true diversity from an LLM, we cannot simply ask it to “write different opinions.” We need to model this underlying reasoning process.
The Core Method: Extracting Maximum Diversity
The researchers propose a two-step approach to solve the problem of Maximum Diversity Extraction. Their goal is to push the model to generate as many unique, valid perspectives as possible until it hits a “saturation point.”
1. Criteria-Based Diversity Prompting
Standard prompting often results in generic responses. To counter this, the authors introduce Criteria-Based Prompting.
Instead of just asking for an opinion, the prompt forces the model to articulate the specific values driving that opinion. The structure looks like this:
- Stance: Agree or Disagree.
- Criteria: A list of keywords (values) that guide the perspective.
- Reason: A free-form explanation grounded in those criteria.
By explicitly asking for the “criteria” first, the model is guided to adopt a specific “persona” or worldview before generating the text. This mimics human reasoning, where our values often dictate our opinions.
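To make this concrete, here is a minimal sketch of what a criteria-based prompt might look like, assuming the `openai` Python client and the example statement from above; the exact wording, output format, and model settings used in the paper may differ.

```python
# Minimal sketch of a criteria-based diversity prompt (illustrative, not the
# paper's exact template). Assumes the `openai` Python client is installed.
from openai import OpenAI

client = OpenAI()

STATEMENT = "You are expected to do what you are told."

criteria_prompt = f"""Consider the statement: "{STATEMENT}"

Give one perspective on this statement in the following format:
Stance: Agree or Disagree
Criteria: a short list of values (keywords) that drive this perspective
Reason: a brief explanation grounded in those criteria"""

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": criteria_prompt}],
    temperature=1.0,
)
print(response.choices[0].message.content)
```

Asking for the criteria before the free-form reason is the key design choice: it commits the model to a value system up front, so the explanation has to stay consistent with it.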

2. Step-by-Step Recall Prompting
How do we know when the model has run out of ideas? The researchers devised a method called Step-by-Step Recall.
They do not ask for 20 opinions at once. Instead, they ask for one opinion, feed it back into the prompt, and ask for another opinion that differs from all of the opinions generated so far. They repeat this loop until N opinions have been generated.

As illustrated in Figure 3, this iterative loop allows the researchers to measure the “diversity coverage.” Eventually, the model starts repeating criteria or fails to produce new unique clusters of ideas. That limit is the model’s diversity saturation point.
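The loop itself is simple to express in code. Below is a rough sketch, assuming an `ask_llm` callable that wraps whatever chat model is being queried; the stopping rule (a fixed N versus detecting saturation) is an implementation choice, not something dictated by the paper.

```python
# Sketch of step-by-step recall prompting: generate one opinion at a time and
# feed all previous opinions back so the next one must differ. Illustrative only.
def recall_loop(ask_llm, statement, n_opinions=10):
    """`ask_llm` is any callable mapping a prompt string to a model reply string."""
    opinions = []
    for _ in range(n_opinions):
        previous = "\n".join(f"- {o}" for o in opinions) or "(none yet)"
        prompt = (
            f'Statement: "{statement}"\n'
            f"Opinions generated so far:\n{previous}\n\n"
            "Give one new opinion (Stance, Criteria, Reason) that expresses a "
            "perspective different from all of the opinions above."
        )
        opinions.append(ask_llm(prompt))
    return opinions
```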
Experimental Setup
To test these methods, the researchers used four distinct datasets representing different types of subjectivity:
- SOCIAL-CHEM-101: Social norms and moral judgments (highly subjective/cultural).
- CMV (Change My View): Argumentative debates from Reddit.
- HATE SPEECH: Categorizing texts as hate speech or not (subjective labeling).
- MORAL STORIES: Open-ended story continuation.
They tested various Large Language Models, including GPT-4, GPT-3.5, Llama-2, and Mixtral.
The primary metric for evaluation was Perspective Diversity. They clustered the generated criteria words (e.g., grouping “joy” and “happiness” together) and counted the number of unique criteria clusters. A higher number means the model covered more distinct angles on the topic.
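One way to implement such a metric is to embed the criteria words and cluster them by semantic similarity. The sketch below assumes `sentence-transformers` and scikit-learn; the embedding model and distance threshold are illustrative choices, not the paper's exact setup.

```python
# Rough sketch of the perspective-diversity metric: embed criteria keywords,
# cluster them, and count the resulting clusters.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering

def count_criteria_clusters(criteria_words, distance_threshold=0.4):
    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
    embeddings = model.encode(criteria_words, normalize_embeddings=True)
    clustering = AgglomerativeClustering(
        n_clusters=None,
        metric="cosine",
        linkage="average",
        distance_threshold=distance_threshold,
    ).fit(embeddings)
    return clustering.n_clusters_

# "joy" and "happiness" should fall into one cluster; "safety" and "freedom" into others.
print(count_criteria_clusters(["joy", "happiness", "safety", "freedom"]))
```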
Results and Analysis
1. Criteria-Based Prompting vs. Free-Form
Does asking for “criteria” actually help? The results show a resounding yes.
The researchers compared their method against a baseline “free-form” prompt (just asking for reasons without explicit criteria). They measured semantic diversity (how different the reasons were from each other).

As shown in the radar charts in Figure 4, Criteria-Based Prompting (the green line) consistently outperforms free-form prompting across almost all models and datasets. This confirms that forcing the model to identify values (criteria) first allows it to access a wider region of its latent space, producing more varied opinions.
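For intuition, semantic diversity can be approximated as the average pairwise cosine distance between sentence embeddings of the generated reasons. The snippet below is a simplified stand-in for such a score, again assuming `sentence-transformers`; the paper's exact formulation may differ.

```python
# Simple semantic-diversity score: mean pairwise cosine distance between reasons.
import numpy as np
from sentence_transformers import SentenceTransformer

def semantic_diversity(reasons):
    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
    emb = model.encode(reasons, normalize_embeddings=True)
    sims = emb @ emb.T                        # cosine similarities (unit vectors)
    n = len(reasons)
    off_diag = sims[~np.eye(n, dtype=bool)]   # drop self-similarities
    return float(1.0 - off_diag.mean())       # higher = more diverse
```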
2. The Saturation Point
The study sought to find “how far” we can push these models. Is the diversity infinite?
The answer is no. There is a saturation point where the model stops producing unique ideas.

Figure 5 shows the trajectory of diversity.
- Social Norms (Social-Chem) & Argumentation (CMV): The models can generate about 7-8 unique perspectives (clusters) per stance.
- Hate Speech: This is less subjective (more binary), yielding fewer unique clusters (around 4-5).
- Moral Stories: Being an open-ended creative task, the diversity keeps climbing much higher (around 20+ clusters).
This tells us that LLM diversity is task-dependent. The more subjective and open-ended the task, the more diversity the model can extract.
3. Human vs. Machine: Who is more diverse?
This is perhaps the most critical part of the study. How does an LLM compare to actual humans? The researchers hired crowd-workers to write diverse opinions and compared them to GPT-4’s outputs.
The Semantic Map
The researchers projected the opinions into a shared semantic space (a t-SNE plot) to see whether the AI opinions overlapped with the human ones.

Figure 6 shows that LLMs (squares) and Humans (circles) generally occupy the same semantic space. The models are quite good at “mimicking” the types of arguments humans make. They aren’t hallucinating alien concepts; they are retrieving human-like perspectives.
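A semantic map of this kind can be reproduced roughly as follows, assuming `sentence-transformers`, scikit-learn's t-SNE, and matplotlib; the embedding model and t-SNE settings are illustrative.

```python
# Sketch of the semantic-map idea: embed human and LLM opinions, project with
# t-SNE, and plot both groups in the same 2D space.
import matplotlib.pyplot as plt
from sentence_transformers import SentenceTransformer
from sklearn.manifold import TSNE

def plot_semantic_map(human_opinions, llm_opinions):
    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
    texts = human_opinions + llm_opinions
    emb = model.encode(texts)
    coords = TSNE(n_components=2, perplexity=min(30, len(texts) - 1)).fit_transform(emb)
    n_human = len(human_opinions)
    plt.scatter(coords[:n_human, 0], coords[:n_human, 1], marker="o", label="Human")
    plt.scatter(coords[n_human:, 0], coords[n_human:, 1], marker="s", label="LLM")
    plt.legend()
    plt.show()
```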
The Values Gap
However, when looking closely at the specific criteria words used, subtle differences emerge.

Figure 7 reveals a fascinating alignment—and misalignment.
- Agreement: Both humans and models value “responsibility” and “safety.” However, humans prioritized “trust,” which models missed.
- Disagreement: Models tended to be more extreme, heavily weighing “freedom” and “autonomy.” Humans focused more on “personal growth” and “cultural norms.”
This suggests that while LLMs cover the broad strokes of human diversity, they might over-index on certain Western-centric or generalized values (like abstract freedom) while missing distinctively human nuances (like trust).
The “Two-Human” Rule
Finally, the authors quantified the diversity gap.

The analysis in Figure 8 leads to a compelling conclusion: An LLM generally produces more diverse perspectives than a single human.
However, once you pair two humans together, their combined diversity meets or exceeds that of the LLM. This highlights that while AI is a powerful tool for brainstorming diverse views, it is not yet a replacement for the collective intelligence of a group of people.
Conclusion
This research paper offers a significant step forward in our understanding of LLMs. It moves beyond treating models as static knowledge bases and views them as engines for perspective generation.
The key takeaways are:
- Prompting Matters: We cannot just ask for diversity; we must ground it. Criteria-based prompting is a powerful technique to unlock the latent perspectives within a model.
- Saturation Exists: LLMs are not infinite fountains of unique ideas. They saturate at varying levels depending on the subjectivity of the task.
- The “Pair” Threshold: An LLM is more diverse than one person, but a team of humans is still the gold standard for diversity.
For students and researchers, this implies that LLMs can be excellent tools for data augmentation—generating diverse synthetic data to train robust models—but we must remain critical of the values they prioritize. As we strive to build AI that serves everyone, understanding whose perspectives are being extracted (and whose are being left out) remains a vital frontier.