Imagine you are sitting in a room with a window open. A friend walks in, shivers slightly, and says, “It’s chilly in here.”
If you are a literal thinker, you might simply agree: “Yes, the temperature is low.” But if you have social-pragmatic awareness, you understand the implicature: your friend wants you to close the window.
This gap between literal meaning and intended meaning is the domain of Pragmatics. For humans, navigating these social nuances—implicatures, irony, humor, and metaphors—is intuitive. For Large Language Models (LLMs), it is notoriously difficult. While LLMs have mastered syntax and semantics, they often struggle to grasp the “unsaid” rules of human interaction.
In this deep dive, we explore a fascinating research paper, “Rethinking Pragmatics in Large Language Models,” which argues that we have been teaching and testing AI wrong when it comes to social skills. The researchers propose a paradigm shift: moving away from rigid multiple-choice tests toward open-ended evaluation, and swapping standard training methods for Preference Optimization.
The Problem: When “Correct” Isn’t Enough
Current methods for assessing an AI’s social intelligence rely heavily on Multiple Choice Question Answering (MCQA). The model is given a scenario and asked to pick the correct explanation from four options.
The researchers identified a critical flaw in this approach: selecting the right option label (e.g., “C”) does not mean the model actually understands the social context.

As shown in Figure 1, a large model (LLAMA2-13B) correctly identifies the “Gold” option in a multiple-choice setting. However, when asked to explain itself, it fails completely, missing the pragmatic cue that the speaker is trying to “change the subject.” Conversely, a smaller model (LLAMA2-7B) that was tuned using the methods proposed in this paper generates a perfect open-ended response, grasping the social subtlety.
This illustrates the “Clever Hans” effect: models might learn to game the multiple-choice format without developing true pragmatic reasoning. Furthermore, real-world social interactions rarely have a single “gold” answer. A response can be polite, rude, slightly awkward, or charming—it is a spectrum, not a binary choice.
Paradigm Shift 1: Open-Ended Evaluation
To fix the measurement problem, the researchers argue we must stop looking at classification accuracy (Did the model pick ‘C’?) and start looking at generation quality (What did the model actually say?).
They introduce a new metric called the Length-Normalized Relative Score (LNRS).
The Role of the Judge
Since there is no mathematical formula for “socially awkward,” the researchers employ GPT-4 as a judge. For every test scenario, the model generates a free-form answer. GPT-4 then compares this answer to the human-annotated “gold” answer and assigns a score.
The base Relative Score (RS) is calculated as follows:
$$RS = \frac{JS(a_{model})}{JS(a_{gold})}$$
Here, \(JS\) represents the “Judge Score.” If the model’s answer is as good as the human reference, the ratio is 1.0. If it’s better, it exceeds 1.0.
The Verbosity Trap
There is a catch. LLMs (and GPT-4 as a judge) have a known bias: they tend to prefer longer answers, regardless of quality. A model could “game” the system by writing verbose, fluffy paragraphs. To counter this, the researchers apply a penalty for excessive length, resulting in the Length-Normalized Relative Score (LNRS):

This formula adjusts the score using a sigmoid function (\(\sigma\)) that penalizes the model if its answer (\(a_{model}\)) is significantly longer than the concise gold reference (\(a_{gold}\)). This forces the model to be both socially accurate and concise.
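To make the metric concrete, here is a minimal sketch in Python. It assumes the judge scores (\(JS\)) for the model answer and the gold answer have already been collected from GPT-4, and the exact quantity fed into the sigmoid penalty is an illustrative guess rather than the paper's definition:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def relative_score(js_model: float, js_gold: float) -> float:
    # RS: judge score of the model answer divided by judge score of the
    # human-written gold answer; 1.0 means "as good as the reference".
    return js_model / js_gold

def lnrs(js_model: float, js_gold: float, len_model: int, len_gold: int) -> float:
    # LNRS: RS scaled by a sigmoid penalty that shrinks the score when the
    # model's answer is longer than the gold reference. Feeding the raw word
    # count gap into the sigmoid is an assumption for illustration only.
    return relative_score(js_model, js_gold) * sigmoid(len_gold - len_model)

# Same judge score as the reference, but 5 words longer: penalized below 1.0.
print(lnrs(js_model=8.0, js_gold=8.0, len_model=25, len_gold=20))
```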
Paradigm Shift 2: From Supervised Learning to Preference Optimization
Once we have a better way to measure pragmatic ability, how do we improve it?
The standard industry approach is Supervised Finetuning (SFT). In SFT, the model is fed a dataset of questions and “gold” answers and is trained to minimize the error in reproducing that exact answer.
$$\mathcal{L}_{SFT} = -\sum_{t} \log \pi_\theta\big(y^{gold}_{t} \mid x,\, y^{gold}_{<t}\big)$$
While effective for facts (e.g., “What is the capital of France?”), SFT is suboptimal for pragmatics. By forcing the model to treat one specific social response as the only correct sequence of words, SFT suppresses the model’s ability to navigate the gray areas of social interaction.
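In code, SFT reduces to token-level cross-entropy against the one gold answer. A minimal sketch, with illustrative shapes and variable names:

```python
import torch
import torch.nn.functional as F

def sft_loss(answer_logits: torch.Tensor, gold_token_ids: torch.Tensor) -> torch.Tensor:
    """Supervised finetuning objective: negative log-likelihood of the single
    gold answer. `answer_logits` has shape (seq_len, vocab_size) and
    `gold_token_ids` has shape (seq_len,)."""
    return F.cross_entropy(answer_logits, gold_token_ids)
```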
Enter Preference Optimization (PO)
The researchers propose using Direct Preference Optimization (DPO). Instead of telling the model “This is the only right answer,” DPO provides pairs of answers: a preferred (gold) response and a dispreferred (distractor) response.
The model learns to increase the likelihood of the preferred response relative to the dispreferred one.
$$\mathcal{L}_{DPO} = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{ref}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{ref}(y_l \mid x)}\right)\right]$$
This objective function (\(\mathcal{L}_{DPO}\)) is crucial. It teaches the model to distinguish between a socially appropriate response and a clumsy one, rather than just memorizing text. It aligns the model’s internal representation with human social preferences.
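In the loss above, \(y_w\) is the preferred (gold) response, \(y_l\) the dispreferred one, \(\pi_{ref}\) a frozen reference copy of the model, and \(\beta\) a temperature. A minimal sketch of the computation, assuming the per-sequence log-probabilities have already been summed over answer tokens:

```python
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log pi_theta(y_w | x), shape (batch,)
    policy_rejected_logps: torch.Tensor,  # log pi_theta(y_l | x), shape (batch,)
    ref_chosen_logps: torch.Tensor,       # log pi_ref(y_w | x),   shape (batch,)
    ref_rejected_logps: torch.Tensor,     # log pi_ref(y_l | x),   shape (batch,)
    beta: float = 0.1,
) -> torch.Tensor:
    """Direct Preference Optimization: increase the likelihood of the preferred
    answer relative to the dispreferred one, measured against a frozen reference."""
    chosen_margin = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_margin = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()
```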
Experimental Results: Text-Based Pragmatics
The researchers tested these methods across several datasets covering sarcasm, implicature, and social norms (such as Social-IQA and PragMega). They compared three versions of models:
- Base: The raw, pre-trained model.
- SFT: The model finetuned with standard supervision.
- DPO: The model finetuned with Preference Optimization.
The results were striking.

As seen in Figure 2, the DPO-tuned models (green bars) consistently achieve higher LNRS scores than the SFT models (blue bars). In many cases, SFT actually degraded performance compared to the base model. This confirms the hypothesis that force-feeding “gold” answers can confuse the model’s pragmatic reasoning.
Furthermore, human evaluation backed this up. When human judges rated the responses, they found that the DPO models often produced answers that were better than the annotated gold references, offering more clarity and social nuance.
A “Free Lunch”?
One concern with finetuning is “catastrophic forgetting”: does the model become socially smart but stupid at math? The study found that DPO comes close to a “free lunch.” The models gained significant pragmatic skills without losing their general capabilities in reasoning, math, or reading comprehension. SFT, conversely, tended to hurt these general abilities.
Multimodal Pragmatics: The Image Referential Game
Social reasoning isn’t just about text; it requires understanding the physical world. To test this, the researchers utilized the Image Referential Game, a task that explicitly requires a “Theory of Mind” (ToM).
The Setup:
- Speaker (The AI): Sees a target image (e.g., a snowman) and must write a caption to help a listener identify it.
- Listener: Must pick the target image from a lineup of distractors based only on the caption.
- The Challenge: The Speaker must anticipate what the Listener needs to know to distinguish the target from the distractors.

As illustrated in Figure 3, the researchers applied the same DPO methodology to a Vision-Language Model (LLaVA). They used pairs of captions where the “preferred” caption was highly specific to the image, and the “dispreferred” was generic.
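A sketch of what such a preference pair might look like as a data structure (the field names and example captions are illustrative, not the paper's dataset schema):

```python
from dataclasses import dataclass

@dataclass
class CaptionPreferencePair:
    """One speaker-side DPO example for the image referential game."""
    image_path: str
    chosen_caption: str    # specific: singles out the target among the distractors
    rejected_caption: str  # generic: could describe several images in the lineup

pair = CaptionPreferencePair(
    image_path="snowman.jpg",
    chosen_caption="A snowman with a carrot nose wearing a red scarf",
    rejected_caption="A winter scene with snow",
)
```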
The Results: The multimodal experiments mirrored the text-based findings.

Table 2 shows that the Preference Optimized (PO) model consistently outperformed the SFT model. The R@1 score (Recall at Rank 1), which measures how often the listener correctly identifies the target image on the first try, rose from 30.5 (SFT) to 31.9 (PO). The gap looks modest, but the improvement held consistently across the retrieval metrics, and the SFT model actually performed worse than the base model on several of them, reinforcing that standard supervision can damage the delicate capabilities required for Theory of Mind.
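For reference, R@1 simply counts the fraction of rounds in which the target image is the listener's top-ranked pick. A minimal sketch, assuming a ranking format and the convention that the target always has index 0:

```python
def recall_at_1(listener_rankings: list[list[int]], target_index: int = 0) -> float:
    """Percentage of rounds in which the listener's top-ranked candidate
    is the target image."""
    hits = sum(1 for ranking in listener_rankings if ranking[0] == target_index)
    return 100.0 * hits / len(listener_rankings)

# Two rounds: the listener ranks the target first in round 1 but not in round 2.
print(recall_at_1([[0, 2, 1], [1, 0, 2]]))  # 50.0
```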
The Anatomy of Social Reasoning: Deep vs. Shallow Layers
Finally, the researchers asked a profound question: Where does social reasoning happen inside the “brain” of the LLM?
They conducted an ablation study, applying Preference Optimization to specific layers of the Transformer network while freezing the others. Here, “deep” layers are those closer to the input (building the core representation of meaning), while “shallow” layers sit closer to the output (refining the surface text).
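Such a layer-restricted run can be approximated by freezing every parameter except those of the targeted transformer blocks before applying the preference update. A rough sketch; the `model.layers` attribute path is an assumption that varies by implementation:

```python
import torch.nn as nn

def tune_only_layer_range(model: nn.Module, first: int, last: int) -> None:
    """Freeze all parameters, then unfreeze only transformer blocks
    `first`..`last` (0-based, inclusive). Assumes the decoder blocks are
    exposed as `model.layers`; adjust the attribute path for your model."""
    for param in model.parameters():
        param.requires_grad = False
    for idx, block in enumerate(model.layers):
        if first <= idx <= last:
            for param in block.parameters():
                param.requires_grad = True

# e.g., restrict DPO updates to the last four blocks of a 32-layer model:
# tune_only_layer_range(model, 28, 31)
```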

Figure 4 reveals a fascinating trend. The X-axis represents the range of layers tuned (e.g., “29-32” means only the last few layers were trained).
The graphs show that pragmatic understanding is tied to the deeper layers of the model. When training is restricted to the later layers (the right side of the x-axis), performance collapses. This suggests that pragmatics is not just a surface-level “style” that can be applied at the end of generation. It is a fundamental, high-level cognitive process that must be integrated into the core representation of the model—analogous to how social reasoning in humans is linked to high-level cognitive functions.
Conclusion
This research provides a roadmap for building more socially aware AI. It teaches us three key lessons:
- Stop counting letters: Accuracy on multiple-choice questions is a poor proxy for social intelligence. We must evaluate open-ended generation.
- Prefer, don’t force: Social interactions are nuanced. Preference Optimization (DPO) works significantly better than Supervised Finetuning (SFT) because it teaches the model relative quality rather than absolute correctness.
- Go deep: Pragmatics is a deep cognitive skill. It resides in the fundamental layers of the model, not the surface.
By adopting these methods, we move closer to AI that doesn’t just understand the words “It’s chilly in here,” but understands the human standing in front of it.