Scaling Social Intelligence: How Weak Models Can Teach Strong Giants Theory of Mind
Imagine you are watching a silent video of a person walking into a kitchen. They open the fridge, look inside, close it, sigh, and then walk over to a cabinet. Without hearing a word, you instantly infer a chain of mental states: they are hungry, they wanted something specific (maybe an apple) that wasn’t in the fridge, and now they believe it might be in the cabinet.
This ability is called Theory of Mind (ToM). It is the cognitive bedrock of human social interaction—the ability to attribute beliefs, goals, and intentions to others. For Artificial Intelligence, however, this is a monumental challenge. While Large Language Models (LLMs) can write poetry or code, they often struggle to consistently infer human mental states, especially when observing complex, multimodal environments (like video combined with text).
In this post, we are diving deep into a fascinating research paper, “Overcoming Multi-step Complexity in Multimodal Theory-of-Mind Reasoning: A Scalable Bayesian Planner.” This work proposes a novel solution that combines Bayesian probability with a “weak-to-strong” control mechanism, allowing massive AI models to “think” more like humans without needing expensive retraining.
The Complexity Trap
Current AI approaches to Theory of Mind generally fall into two buckets:
- Structured Workflows: Specific algorithms designed to calculate beliefs (symbolic approaches).
- End-to-End Learning: Training a neural network to simply guess the answer based on data patterns.
The problem is scalability. As a task becomes more complex—requiring more steps of planning or reasoning—standard models fall apart. They hit a “reasoning boundary.”

Figure 1 shows a sharp decline in accuracy for smaller models (like Llama-3-8B) as the number of planning steps increases. Even advanced techniques like Chain-of-Thought (CoT) struggle to keep up. Only the massive models (like Llama-3-405B) maintain stability, but they are expensive and difficult to fine-tune for specific tasks.
The researchers identified two root causes for this failure:
- The Reasoning Boundary: Standard reasoning methods plateau as task complexity grows.
- The Knowledge Gap: ToM requires vast “world knowledge” (e.g., knowing that milk goes in the fridge, not the oven). Small models simply don’t have enough of this pre-training data.
The Solution: A Scalable Bayesian Planner
The researchers propose a solution that essentially gives us the best of both worlds: the structured, logical reasoning of Bayesian Inverse Planning (BIP) combined with the massive world knowledge of huge LLMs.
The core innovation is a Weak-to-Strong Control mechanism. Instead of retraining a massive 405B parameter model (which is computationally prohibitive), they train a small model to understand the specific rules of Theory of Mind. This small model then acts as a “guide” or “controller” for the massive model during inference.
1. The Foundation: Bayesian Inverse Planning (BIP)
To understand the method, we first need to look at the math of how we infer intent. The researchers formulate human behavior as a Partially Observable Markov Decision Process (POMDP).
In simple terms: An agent has a goal (\(g\)) and a belief about the world (\(b\)). They take actions (\(a\)) based on these. We, the observers, see the actions and the environment (\(s\)), and we want to work backward to figure out \(g\) and \(b\).
This is called Inverse Planning. We are inverting the logic: instead of asking “If I want an apple, what do I do?”, we ask “I saw them walk to the fruit bowl; what did they want?”
The mathematical formulation for the posterior probability of a goal and belief looks like this:

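In schematic form, under the standard POMDP inverse-planning factorization (the paper’s exact notation may differ):

\[
P\big(g, b^{1:T} \mid s^{1:T}, a^{1:T}\big) \;\propto\; P(g)\, \prod_{\tau=1}^{T} P\big(b^{\tau} \mid b^{\tau-1}, s^{\tau}\big)\, \pi\big(a^{\tau} \mid g, b^{\tau}\big)
\]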
Here, \(\pi(a^{\tau} | g, b^{\tau})\) represents the agent’s policy—the probability of taking a specific action given a goal and belief. To determine which hypothesis (e.g., “They want an apple” vs. “They want a pear”) is correct, the system compares their relative log-likelihoods:

The equation continues to account for the current step comparison:

Essentially, the system calculates: How likely is the action I just saw, assuming Goal A is true versus Goal B?
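Written out schematically for two goal hypotheses \(g_1\) and \(g_2\), and assuming the belief-update terms are shared across hypotheses so they cancel (again, the paper’s notation may differ):

\[
\log \frac{P\big(g_1, b^{1:T} \mid s^{1:T}, a^{1:T}\big)}{P\big(g_2, b^{1:T} \mid s^{1:T}, a^{1:T}\big)}
\;=\;
\log \frac{P(g_1)}{P(g_2)}
\;+\;
\sum_{\tau=1}^{T} \log \frac{\pi\big(a^{\tau} \mid g_1, b^{\tau}\big)}{\pi\big(a^{\tau} \mid g_2, b^{\tau}\big)}
\]

The \(\tau = T\) term of the sum is the current-step comparison mentioned above.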
2. The Innovation: Weak-to-Strong Control
The standard Bayesian approach is great, but calculating that policy \(\pi\) (the probability of an action) is hard. You need a model that understands how humans act in the real world.
Large Language Models (LLMs) are great at this because they have read the entire internet. But they aren’t naturally tuned for these specific Bayesian calculations. Small models can be easily tuned, but they lack world knowledge.
The Fix: Use a small, post-trained model to “steer” the large model.
Phase A: Post-Training the Small Model
First, the researchers take a small model (like Llama-7B) and fine-tune it specifically for ToM tasks using Instruction Tuning and Preference Optimization.
They maximize the likelihood of correct actions:

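A schematic instruction-tuning objective consistent with this description (a sketch; the paper’s exact loss may differ):

\[
\mathcal{L}_{\mathrm{SFT}}(\theta) \;=\; -\,\mathbb{E}_{(s,\,g,\,b,\,a)\sim\mathcal{D}}\big[\log \pi_{\theta}(a \mid s, g, b)\big]
\]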
And they use a preference loss (similar to DPO) to teach the model to distinguish between efficient human actions and irrational ones:

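A DPO-style preference objective of that kind, where \(a_w\) denotes the efficient (preferred) action, \(a_l\) the irrational alternative, and \(x\) the observed context; this is the standard DPO form and may differ in detail from the paper’s:

\[
\mathcal{L}_{\mathrm{pref}}(\theta) \;=\; -\,\mathbb{E}\left[\log \sigma\!\left(\beta \log \frac{\pi_{\theta}(a_w \mid x)}{\pi_{\mathrm{ref}}(a_w \mid x)} \;-\; \beta \log \frac{\pi_{\theta}(a_l \mid x)}{\pi_{\mathrm{ref}}(a_l \mid x)}\right)\right]
\]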
This creates a “ToM Expert” small model (\(\pi^{\mathcal{E}}\)). It knows how to reason about ToM, even if it lacks deep world knowledge.
Phase B: Guiding the Giant
Now comes the magic. During inference, they use a massive model (like Llama-405B) as the main policy engine. However, they adjust its predictions using the “behavioral shift” learned by the small model.
The modified probability distribution \(\bar{\pi}\) is calculated as:

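A schematic form, based on the description below (notation may differ slightly from the paper’s):

\[
\bar{\pi}\big(a^{\tau} \mid g, b^{\tau}\big) \;\propto\; \pi^{\mathcal{L}}\big(a^{\tau} \mid g, b^{\tau}\big) \cdot \frac{\pi^{\mathcal{E}}\big(a^{\tau} \mid g, b^{\tau}\big)}{\pi^{\mathcal{N}}\big(a^{\tau} \mid g, b^{\tau}\big)}
\]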
Here is how to read this equation:
- Take the raw prediction from the Large Model (\(\pi^{\mathcal{L}}\)). This gives you good world knowledge.
- Multiply it by the ratio of the Tuned Small Model (\(\pi^{\mathcal{E}}\)) to the Naive Small Model (\(\pi^{\mathcal{N}}\)).
This ratio represents the “ToM knowledge” extracted from fine-tuning. If the Tuned Small Model thinks an action is much more likely than the Naive Small Model does, it boosts that probability in the Large Model.
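In logit space, that ratio becomes a simple addition and subtraction, which is how proxy-tuning-style steering is typically implemented at decoding time. Below is a minimal sketch of the idea; the function and variable names are illustrative, not taken from the paper’s code:

```python
import torch
import torch.nn.functional as F

def weak_to_strong_logits(logits_large, logits_expert, logits_naive):
    """Combine per-token logits: large model + (tuned small - naive small).

    Equivalent (up to normalization) to pi_L * (pi_E / pi_N), since adding
    log-probabilities multiplies the corresponding probabilities.
    """
    return logits_large + (logits_expert - logits_naive)

# Toy example over a 5-token vocabulary.
logits_large  = torch.tensor([2.0, 1.5, 0.5, 0.1, -1.0])  # broad world knowledge
logits_expert = torch.tensor([0.5, 3.0, 0.2, 0.0, -0.5])  # ToM-tuned small model
logits_naive  = torch.tensor([0.5, 1.0, 0.3, 0.1, -0.5])  # untuned small model

combined = weak_to_strong_logits(logits_large, logits_expert, logits_naive)
probs = F.softmax(combined, dim=-1)  # the steered distribution, pi-bar
print(probs)
```

Because the arithmetic happens purely on output logits, the 405B model’s weights never change; only its next-token distribution is nudged.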
Visualizing the Architecture
The architecture is visually summarized below. Notice on the right side how the “Latent Behavior Change” (\(\Delta\)) from the small model is applied to the large model’s likelihoods.

The entire data flow, from video input to symbolic representation to Bayesian inference, creates a pipeline where the Large LM acts as a powerful, but guided, engine.
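To make that data flow concrete, here is a minimal sketch of the Bayesian inference loop, assuming the observed actions have already been extracted from the video into symbolic form. The `policy_likelihood` callable stands in for the guided Large LM and is a hypothetical name, not the paper’s API:

```python
import math

def infer_goal(goal_hypotheses, observations, policy_likelihood, prior=None):
    """Accumulate log-likelihoods of each goal hypothesis over observed steps.

    observations: list of (state, belief, action) tuples in symbolic form.
    policy_likelihood(action, goal, state, belief) -> probability of the action
    under the steered policy pi-bar.
    """
    log_post = {g: (math.log(prior[g]) if prior else 0.0) for g in goal_hypotheses}
    for state, belief, action in observations:
        for g in goal_hypotheses:
            log_post[g] += math.log(policy_likelihood(action, g, state, belief))
    # Normalize into a posterior over goals (log-sum-exp for stability).
    m = max(log_post.values())
    z = sum(math.exp(v - m) for v in log_post.values())
    return {g: math.exp(v - m) / z for g, v in log_post.items()}
```

The hypothesis with the highest posterior (“wants wine” vs. “wants an apple”) is then reported as the inferred goal.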

Why does this work? (Theoretical Backing)
You might wonder whether shifting a large model’s logits by the difference between two small models is mathematically sound. The authors provide Theorem 1 to justify it: they prove that this “proxy-tuning” approximates the result you would get by fine-tuning the massive model directly.
The approximation error (a KL divergence) is bounded:

This proves that \(\pi^{\mathcal{E}}\) doesn’t need to be perfect; it just needs to provide a gradient-like adjustment to the strong model.
Experimental Results
The researchers tested this approach using MMToM-QA, a benchmark involving videos of household activities where agents search for objects. The tasks involve inferring beliefs (e.g., “Does he think the apple is in the fridge?”) and goals.
1. Beating the State-of-the-Art
The results were impressive. As shown in Table 1, the proposed method (Ours w/ Llama3.1-405B) outperformed existing baselines, including GPT-4V, Video-Llama, and BIPALM.

Key takeaways from the table:
- Human Performance: 93.0% (The gold standard).
- Previous Best (BIPALM): 76.7%.
- Ours: 81.3% (a 4.6 percentage-point improvement over the previous best).
- Belief vs. Goal: Large models (GPT-4) are naturally good at Belief Inference (Type 1.1) because it relies on world knowledge. However, the proposed method significantly boosts Goal Inference, where understanding specific agent dynamics is crucial.
2. Scaling Up and Down
Does size matter? Yes. Table 2 shows that as the “Strong” component (the Large LM) gets bigger (from 70B to 405B), accuracy consistently improves.

Conversely, Table 3 shows that the “Weak” component (the controller) can be quite small. Even a 4B parameter model works effectively as a controller, provided it is fine-tuned correctly.

3. “Steering” the Reasoning
One of the most insightful experiments involved visualizing how the small model changes the large model’s mind.
In Figure 3, the researchers plotted the “Likelihood Change” over time. Initially, the change is small. But as the Bayesian inference progresses and the model narrows down the hypothesis, the “Weak-to-Strong” correction becomes more aggressive, redirecting the Large LM toward the correct ToM conclusion.

4. Precision Matters: The “Wine Glass” Effect
Why exactly does the large model need help? Large models tend to “smear” probability across broadly relevant concepts.
In a specific test case (Agent James looking for wine), the base Large LM assigned probability to general kitchen items (cabinet, table). The Post-Trained Small LM, however, homed in specifically on “wine” and “wine glass.” By combining them, the system achieved the precision of the small model with the robustness of the large one.

5. Transferability to New Worlds
Finally, a major test of intelligence is generalization. The models were trained on “Apartment” data. The researchers tested them on completely unseen scenarios: Andersen Fairy Tales, Ancient Egypt, Outer Space, Wild West, and Medieval Castle.
The results in Table 4 show that the Scalable Bayesian Planner adapts remarkably well to these new environments. The Large LM provides the context (understanding what a “throne” or “spaceship” is), while the Small LM provides the ToM logic.

Conclusion and Implications
This research highlights a pivotal shift in how we might build future AI systems. We have reached a point where simply making models bigger yields diminishing returns for complex reasoning tasks like Theory of Mind.
The Scalable Bayesian Planner demonstrates that we don’t always need to retrain the giants. Instead, we can use:
- Modular Design: Breaking complex reasoning into Bayesian steps.
- Specialized Guidance: Using small, agile experts to steer massive, knowledgeable generalists.
By decoupling “reasoning patterns” (learned by the small model) from “world knowledge” (held by the large model), this approach offers a sustainable path toward AI that fundamentally understands human intent—whether in a modern kitchen or an ancient Egyptian palace.
Comparison Summary
To wrap up, let’s look at how this method stacks up against traditional approaches. It is the only one that checks all the boxes: Scalability, Structured Reasoning, World Knowledge, and Multimodality.

This synergy of Bayesian structure and Weak-to-Strong Large Language Models sets a new standard for modeling human mental states in complex environments.