Humans are fundamentally social creatures. Our history, culture, and survival have depended on our ability to interpret a raised eyebrow, understand a pause in conversation, or sense the mood in a room. We call this Social Intelligence.

As Artificial Intelligence becomes more integrated into our daily lives—from healthcare robots to education assistants and customer service chatbots—the demand for these systems to understand us socially is growing. We don’t just want AI that can calculate or retrieve information; we want agents that can empathize, collaborate, and adhere to social norms.

But how do we bridge the gap between a calculator and a companion?

In the paper Advancing Social Intelligence in AI Agents: Technical Challenges and Open Questions, researchers Leena Mathur, Paul Pu Liang, and Louis-Philippe Morency from Carnegie Mellon University map out the landscape of Social-AI. They identify why this field is so difficult and propose four core technical challenges that the computer science community must solve to build agents that can truly live and work alongside us.

What is Social-AI?

Social-AI is the multidisciplinary effort to create agents that can sense, perceive, reason about, learn from, and respond to the affect, behavior, and cognition of other agents (whether human or artificial).

This isn’t just about Natural Language Processing (NLP). While language is a primary tool for social construction, Social-AI spans robotics, computer vision, machine learning, and speech processing. The goal is to endow machines with specific competencies:

  1. Social Perception: Reading socially relevant information (e.g., body language, tone).
  2. Social Knowledge: Knowing facts and norms (e.g., you stay quiet in a library).
  3. Social Memory: Remembering past interactions to build relationships.
  4. Social Reasoning: Making inferences about what others are thinking or feeling (Theory of Mind).
  5. Social Creativity: Imagining counterfactual social situations and novel ways to interact.
  6. Social Interaction: Engaging in mutual, co-regulated patterns of behavior.

The Problem of “Social Constructs”

To understand why Social-AI is harder than, say, identifying a cat in a photo, we have to look at the philosophy of reality.

The paper draws a distinction between Natural Kinds and Social Constructs.

  • Natural Kinds exist independently of human thought. Mountains and human bodies are natural kinds: they have physical properties we can measure objectively.
  • Social Constructs exist only because humans agree they exist. A “friend,” a “president,” and a “dollar bill” are all social constructs; their existence depends on the observer’s perception.

This “perceiver-dependency” creates a massive headache for AI. If “rapport” is a social construct, there is no physical “rapport particle” an AI can detect. It is subjective, ambiguous, and constantly changing.

The Explosion of Social-AI Research

Before diving into the challenges, it is worth noting how rapidly this field is growing. The researchers analyzed over 3,000 papers spanning several decades.

Figure 2: Cumulative number of Social-AI papers over time, based on the 3,257 papers from the authors’ Semantic Scholar Social-AI queries. Interest in Social-AI research has been accelerating across computing communities.

As shown in Figure 2, interest has skyrocketed in the last decade, particularly in Machine Learning (ML), Robotics, and NLP.

  • Early Era (1980s-1990s): Research was dominated by rule-based approaches. If you wanted a robot to be polite, you hard-coded the rules of politeness.
  • The ML Shift (2000s-2010s): The field moved toward statistical learning. Models were trained to predict social signals (like laughter or sentiment) from static datasets.
  • The Current Era: We are now seeing the rise of Large Language Models (LLMs) and generative agents. However, a key critique from the paper is that much of our current progress is based on static, ungrounded data. We train models on text or video clips that are stripped of their real-world context.

To move forward, we need to address the complexity of real-world interactions.

The Four Core Technical Challenges

The authors identify four specific hurdles that prevent current AI from achieving true social intelligence. These challenges arise because social interaction is not a static task—it is a dynamic, messy loop involving multiple perspectives and subtle signals.

We can visualize these challenges and the context in which they occur in Figure 1 below.

Figure 1: (A) Four core technical challenges in Social-AI research, illustrated in an example context of a Social-AI agent observing and learning from a human-human interaction. (B) Social contexts in which Social-AI agents can be situated, with interactions spanning social units, interaction structures, and timescales. Interactions can span social settings, degrees of agent embodiment, and social attributes of humans, with agents in several roles.

Challenge 1: Ambiguity in Constructs (C1)

As mentioned earlier, social constructs are subjective. In Figure 1A, look at the two humans interacting. Is there “tension” between them? Is there “rapport”?

In traditional AI tasks, we rely on “Gold Standard” labels—a single ground truth. An image contains a cat, or it doesn’t. But in social interactions, the “truth” is ambiguous. Even human annotators often disagree on whether a conversation is “hostile” or “friendly.”

The Technical Gap: Current models usually try to force these ambiguous concepts into discrete labels (e.g., Rapport = 7/10) or aggregate annotator opinions into a single average. This flattens the reality. If three people think a joke was funny and three think it was offensive, the “average” isn’t “mildly funny”—it’s polarizing.

The Opportunity: The researchers suggest moving away from static numerical labels. Instead, we should explore natural language supervision. Language is expressive enough to capture ambiguity (e.g., “They seem friendly, but there is an underlying tension”). AI needs flexible label spaces that can change dynamically rather than forcing complex social vibes into a pre-defined box.
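
To make the flattening concrete, here is a minimal sketch with invented annotator ratings. It contrasts a single averaged label with the full label distribution; the entropy line is just one simple way to quantify disagreement, not something prescribed by the paper.

```python
import math
from collections import Counter
from statistics import mean

# Hypothetical annotator ratings for "was this joke funny?" (1 = offensive, 5 = hilarious)
ratings = [5, 5, 5, 1, 1, 1]

# Collapsing to a single "gold" label flattens the disagreement:
print(mean(ratings))            # 3 -> looks like "mildly funny"

# Keeping the full label distribution preserves the polarization:
distribution = Counter(ratings)
print(distribution)             # Counter({5: 3, 1: 3}) -> two opposing camps

# One simple way to make the ambiguity explicit: entropy of the label distribution
n = len(ratings)
entropy = -sum((c / n) * math.log2(c / n) for c in distribution.values())
print(f"disagreement = {entropy:.2f} bits")  # 1.00 bits -> maximal disagreement for two camps
```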

Challenge 2: Nuanced Signals (C2)

Social meaning is often conveyed in the blink of an eye. A 100-millisecond pause can turn a compliment into an insult. A slight shift in posture can signal disengagement.

The Technical Gap: Many current multimodal models process data in chunks that might miss these micro-signals. Furthermore, AI models are typically trained on the presence of cues (what is there). However, social interaction often relies on what is absent—the unsaid word, the lack of eye contact, or the failure to laugh at a joke.

The Opportunity: We need better Social Signal Processing (SSP). This involves aligning different modalities (speech, vision, gestures) with extreme precision. The paper poses an open question: Can language serve as an intermediate layer to represent these nuances? Or are there social signals (like a “gut feeling” about someone’s gait) that simply cannot be described in words? Additionally, researchers need to develop methods for agents to learn from the absence of stimuli.
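
As a toy illustration of learning from what is absent, the sketch below turns the silence between utterances into an explicit “hesitation” feature. The time-stamped transcript format and the 0.5-second threshold are assumptions made for this example, not the paper’s method.

```python
# Hypothetical data format: each utterance is (speaker, start_seconds, end_seconds).
# The idea is to represent the *absence* of speech -- the pause -- as an explicit feature,
# rather than only modeling the cues that are present.

utterances = [
    ("A", 0.00, 1.80),   # "That presentation went well, right?"
    ("B", 2.45, 3.10),   # "...yeah."
]

PAUSE_THRESHOLD_S = 0.5  # assumed threshold; a real system would tune or learn this

def inter_turn_gaps(utts):
    """Yield (prev_speaker, next_speaker, gap_seconds) between consecutive utterances."""
    for (spk_a, _, end_a), (spk_b, start_b, _) in zip(utts, utts[1:]):
        yield spk_a, spk_b, start_b - end_a

for prev_spk, next_spk, gap in inter_turn_gaps(utterances):
    if gap > PAUSE_THRESHOLD_S:
        # The silence itself becomes a socially meaningful feature.
        print(f"Hesitation: {next_spk} paused {gap:.2f}s before replying to {prev_spk}")
```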

Challenge 3: Multiple Perspectives (C3)

In Figure 1A, the person on the left thinks, “I think we have great rapport!” while the person on the right thinks, “I think we have poor rapport!”

Social reality is rarely shared perfectly. Each actor in an interaction brings their own history, role, and bias. This is the Rashomon Effect applied to AI. Furthermore, these perspectives are interdependent; if I think you are angry, I might act defensive, which actually makes you angry.

The Technical Gap: Most AI models take a “God’s eye view,” trying to analyze the interaction objectively from the outside. They fail to model the distinct, potentially conflicting mental states of the participants.

The Opportunity: This relates to Theory of Mind. We need models that can maintain separate representations for every actor in a scene. An effective Social-AI agent needs to track:

  1. What I think is happening.
  2. What you think is happening.
  3. What you think I think is happening.

This requires moving from single-model architectures to multi-perspective modeling frameworks that can handle dynamic updates as the interaction evolves.
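
One way to picture such a framework is a per-participant belief store. The sketch below is purely illustrative (the class, its fields, and the update rule are assumptions, not the authors’ architecture); it keeps separate, nested representations for each agent so that their views of “rapport” can diverge, as in Figure 1A.

```python
from dataclasses import dataclass, field

@dataclass
class PerspectiveState:
    """Illustrative per-agent state: my view, my model of your view, and my model of your model of me."""
    own_belief: dict = field(default_factory=dict)                    # what I think is happening
    belief_about_other: dict = field(default_factory=dict)            # what I think *you* think is happening
    belief_about_other_about_me: dict = field(default_factory=dict)   # what I think you think *I* think

# Separate, possibly conflicting representations for each participant:
left_person  = PerspectiveState(own_belief={"rapport": "high"},
                                belief_about_other={"rapport": "high"})
right_person = PerspectiveState(own_belief={"rapport": "low"},
                                belief_about_other={"rapport": "high"})

def update_after_observation(state: PerspectiveState, cue: str) -> None:
    """Toy update rule: a visible disengagement cue lowers my estimate of *your* view of rapport."""
    if cue == "partner_looks_away":
        state.belief_about_other["rapport"] = "low"

update_after_observation(left_person, "partner_looks_away")
print(left_person.own_belief, left_person.belief_about_other)
# {'rapport': 'high'} {'rapport': 'low'} -> the two perspectives now diverge
```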

Challenge 4: Agency and Adaptation (C4)

Finally, social agents are not just passive observers; they are actors with goals.

In reinforcement learning, we usually have a clear reward function (e.g., +1 point for winning the game). In social settings, feedback is implicit, sparse, and fleeting. If a robot makes a faux pas, no one holds up a sign saying “Bad Robot.” They might just glance away or change the subject.

The Technical Gap: How do we motivate an AI to learn social norms without explicit instruction? How does an agent balance a functional goal (e.g., “deliver the package”) with a social goal (e.g., “don’t be rude”)?

The Opportunity: We need to develop socially intrinsic motivation. Agents need to value “shared reality” and “common ground” as part of their objective functions. This involves learning from implicit signals—using a human’s hesitation or tone as a reward/punishment signal to update their behavior. The authors suggest looking into value internalization, where the agent adopts social norms not just as constraints, but as internal drives.
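
As a rough sketch of how this might look (not the authors’ formulation), the example below folds implicit cues such as hesitation and gaze aversion into an RL-style objective as a weighted social term alongside the task reward. All cue mappings and weights are invented for illustration.

```python
def social_reward(cues: dict) -> float:
    """Map implicit human reactions to a scalar bonus/penalty (values are illustrative)."""
    reward = 0.0
    if cues.get("hesitation_s", 0.0) > 0.5:   # long pause before responding
        reward -= 0.5
    if cues.get("gaze_aversion", False):      # partner glances away or changes the subject
        reward -= 0.3
    if cues.get("smile", False):
        reward += 0.4
    return reward

def combined_reward(task_reward: float, cues: dict, social_weight: float = 0.5) -> float:
    """Balance the functional goal ("deliver the package") against the social goal ("don't be rude")."""
    return task_reward + social_weight * social_reward(cues)

# The package was delivered (+1), but the recipient hesitated and looked away:
print(combined_reward(1.0, {"hesitation_s": 0.8, "gaze_aversion": True}))  # 1.0 + 0.5 * (-0.8) = 0.6
```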

Dimensions of Social Context

While solving these challenges, we must remember that Social-AI does not exist in a vacuum. As visualized in Figure 1B, these agents must operate across vast dimensions:

  • Interaction Structure: From Dyads (two people) to Communities. Adding a third person to a conversation fundamentally changes the dynamics (triadic closure, exclusion, etc.).
  • Timescale: From split-second glances to life-long relationships. Most current research looks at short clips, but a true companion robot needs to remember a joke you told three months ago.
  • Embodiment: Is the agent a chatbot, a cartoon avatar, or a physical robot? The degree of embodiment changes the communication channels available (e.g., touch, proximity).

Conclusion and Ethical Implications

The vision presented by Mathur, Liang, and Morency is ambitious. It moves us away from AI that simply processes data toward AI that navigates the social world with the nuance of a human.

However, this capability comes with risks.

  • Bias: If we train agents on internet data, they will learn internet prejudices.
  • Privacy: To understand nuanced signals, an agent might need invasive monitoring of our faces and voices.
  • Manipulation: An AI that perfectly understands rapport is an AI that can manipulate humans effectively.

The authors advocate for Participatory AI, where stakeholders (the people who will actually use these systems) are involved in the design process.

Advancing Social-AI is not just about better algorithms; it is about better understanding the fundamental nature of human connection. By addressing the challenges of ambiguity, nuance, perspective, and agency, we step closer to technology that doesn’t just compute, but truly connects.