Introduction

“What’s in a name?” is a question that has echoed through literature for centuries. In the context of human interaction, names often carry signals about gender, race, and ethnicity—signals that humans use, sometimes subconsciously, to make assumptions about the people behind the names. As Large Language Models (LLMs) become increasingly integrated into social computing tasks, a critical question arises: do these models mirror our societal biases when interpreting these signals?

In a fascinating study titled “On the Influence of Gender and Race in Romantic Relationship Prediction from Large Language Models,” researchers from the University of Maryland explore this very question. They investigate whether popular LLMs (like Llama2 and Mistral) exhibit heteronormative biases or prejudices against interracial relationships when analyzing dialogue.

The core problem is straightforward yet profound: If you feed an LLM a movie script where two characters are flirting, will the model recognize it as “romantic”? Or does that prediction depend entirely on whether the characters are named “Romeo and Juliet” versus “Romeo and Julio”?

Figure 1: Sample conversation from the DDRel (Jia et al., 2021) dataset and the relationships predicted by Llama2-7B when character names are replaced with different-gender and same-gender name pairs. The LLM tends to predict differently despite the conversation being identical.

As illustrated in Figure 1, the researchers found that simply swapping names in a dialogue—while keeping every other word identical—can drastically change the model’s perception of the relationship. A conversation interpreted as “Lovers” for a male-female pair might be reclassified as “Siblings” for a female-female pair. This blog post breaks down their methodology, their surprising findings regarding Asian names, and the mechanics behind these biases.

Background: The Task of Relationship Prediction

To understand the bias, we first need to understand the task. Relationship prediction is a sub-field of Natural Language Understanding (NLU) where a model analyzes a dialogue to determine how the speakers are related. Are they friends? Colleagues? Spouses?

The researchers utilized the DDRel dataset, which consists of movie scripts annotated with 13 relationship types. For this study, the focus was binary: Romantic (Lovers, Spouse, Courtship) vs. Non-Romantic (Siblings, Friends, Colleagues, etc.).

The hypothesis was that LLMs suffer from two specific types of social bias:

  1. Heteronormativity: The assumption that romantic relationships are, by default or exclusively, heterosexual pairings.
  2. Interracial Prejudice: The bias against romantic relationships between individuals of different racial or ethnic backgrounds.

The Core Method: Controlled Name Replacement

The beauty of this research lies in its experimental design. To isolate the influence of names, the authors employed a controlled name-replacement strategy.

Here is how it works:

  1. Selection: They selected 327 test instances from the dataset where the ground truth relationship was known to be “Romantic” and the original characters were of different genders (e.g., a man and a woman).
  2. Sanitization: They manually filtered these dialogues to ensure no explicit gender cues (like “Sir,” “Ma’am,” or “Father”) remained in the text. This ensures the model relies on the names and the content of the speech, not explicit labels.
  3. Substitution: They systematically replaced the original names with new names associated with specific demographic groups (a minimal sketch of this step follows the list).
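To make the substitution step concrete, here is a minimal sketch in Python. This is an illustration of the idea, not the authors’ code: the dialogue, the original character names, and the replacement names are all made up.

```python
import re

def replace_names(dialogue: str, mapping: dict) -> str:
    """Swap each original character name for its replacement, whole words only,
    leaving every other word of the dialogue untouched."""
    for old, new in mapping.items():
        dialogue = re.sub(rf"\b{re.escape(old)}\b", new, dialogue)
    return dialogue

# Illustrative dialogue and name mapping (not from the DDRel dataset).
original = "MARY: Did you miss me?\nJOHN: Every single day, Mary."
swapped = replace_names(original, {"MARY": "WEI", "Mary": "Wei",
                                   "JOHN": "MIN-JUN", "John": "Min-jun"})
print(swapped)
```

The key property is that only the names change; the rest of the conversation is byte-for-byte identical, so any change in the model’s prediction must come from the names themselves.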

The Variables

The researchers curated lists of names strongly associated with four racial/ethnic groups: Asian, Black, Hispanic, and White.

Crucially, they also categorized these names by gender association using US Social Security data. They didn’t just pick “Male” and “Female” names; they binned names based on the percentage of the population with that name assigned female at birth (a minimal binning sketch follows the list below). This allowed them to test:

  • Strongly Male Names (0-2% female)
  • Gender Neutral Names (approx. 50% female)
  • Strongly Female Names (98-100% female)
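Here is a minimal sketch of that binning, assuming a table of per-name birth counts in the style of the US Social Security data. The counts and the exact width of the “neutral” bin (45–55%) are illustrative assumptions.

```python
# Made-up per-name birth counts, in the style of the SSA baby-name data.
ssa_counts = {
    "Mary":   {"female": 990_000, "male": 1_500},
    "James":  {"female": 1_200,   "male": 980_000},
    "Taylor": {"female": 52_000,  "male": 48_000},
}

def gender_bin(name: str) -> str:
    counts = ssa_counts[name]
    pct_female = 100 * counts["female"] / (counts["female"] + counts["male"])
    if pct_female <= 2:
        return "strongly male (0-2% female)"
    if pct_female >= 98:
        return "strongly female (98-100% female)"
    if 45 <= pct_female <= 55:          # "approx. 50%" bin; width is an assumption
        return "gender neutral (~50% female)"
    return "outside the study's bins"

for name in ssa_counts:
    print(name, "->", gender_bin(name))
```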

By feeding the exact same romantic dialogue into the LLM but swapping the names (e.g., changing “John and Mary” to “John and David” or “Min-jun and Wei”), they could measure how often the model “recalled” that the relationship was romantic. If the model were fair, the recall rate would be roughly the same regardless of the names used.
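A minimal sketch of that fairness check: for each name-pair group, recall is simply the fraction of ground-truth-romantic dialogues the model labels as romantic. The predictions below are made-up placeholders for illustration.

```python
from collections import defaultdict

# (name-pair group, model prediction) for dialogues whose ground truth is "romantic".
predictions = [
    ("male-female",   "romantic"),     ("male-female",   "romantic"),
    ("male-male",     "non-romantic"), ("male-male",     "romantic"),
    ("female-female", "non-romantic"), ("female-female", "romantic"),
]

hits, totals = defaultdict(int), defaultdict(int)
for group, pred in predictions:
    totals[group] += 1
    hits[group] += pred == "romantic"

for group in totals:
    print(f"{group}: recall = {hits[group] / totals[group]:.2f}")
# A fair model would show roughly equal recall across all groups.
```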

Experiments & Results

The results provided strong evidence that LLMs are not impartial observers of human relationships.

1. The Influence of Gender Pairings

The first major finding concerns heteronormativity. The researchers tested how Llama2-7B performed when predicting romance in same-gender versus different-gender pairings.

Figure 2: Recall of predicting romantic relationships from Llama2-7B for the subset of the dataset where characters originally have different genders. The horizontal and vertical axes denote the % female of the name replacing the originally female and originally male character names in the dialogue. The upper triangle (lower triangle) shows the scores when names are replaced preserving (swapping) the genders of the characters as in the original conversation.

How to read Figure 2: This heatmap visualizes the “Recall”—essentially, how often the model correctly identified the dialogue as romantic.

  • The Axes: Both axes represent the gender association of the names used, from 0% female (strongly male) to 100% female (strongly female).
  • The Quadrants:
      • Top-Right & Bottom-Left (Different-Gender): These areas represent different-gender pairs (e.g., Male-Female). The colors are lighter (yellow/light green), indicating higher recall.
      • Top-Left & Bottom-Right (Same-Gender): These areas represent Male-Male and Female-Female pairs. The colors are darker green, indicating lower recall.

The Takeaway: The model is significantly less likely to predict a romantic relationship if the characters have names associated with the same gender. For example, looking at the “White” heatmap, notice the bright yellow spot in the top right (Male-Female pairing) versus the dark green in the top left (Male-Male pairing). This confirms the hypothesis that the model mirrors heteronormative biases found in society.

Interestingly, the bias appeared stronger against Male-Male couples than Female-Female couples. The researchers suggest this might be because female names in fiction are more frequently associated with romantic storylines in general, or potentially due to a stronger societal bias against male-male intimacy.

2. The Influence of Race and the “Asian Name” Anomaly

The second set of experiments looked at racial pairings. The researchers wanted to see if the model discriminated against interracial couples.

Figure 3: Recall of predicting romantic relationships from Llama2-7B for the subset of the dataset where characters have different genders and are replaced with names associated with different races/ethnicities.

Figure 3 shows the recall for different racial pairings. The rows represent the female character’s race, and the columns represent the male character’s race.

The Findings:

  1. Interracial vs. Intraracial: Surprisingly, the model did not show a massive performance drop specifically for interracial couples compared to same-race couples, at least among non-Asian names (e.g., White-Black pairings performed similarly to White-White pairings).
  2. The Asian Outlier: The most striking pattern in Figure 3 is the top row and left column. Whenever an Asian name is introduced into the pair—regardless of whether the partner is Asian, Black, Hispanic, or White—the recall drops significantly (indicated by the darker green cells).

The recall is lowest (0.68) when both characters have Asian names. Why does the model struggle specifically with Asian names in the context of romantic prediction?

Why Does This Happen? The Embedding Analysis

To understand the root cause of the poor performance on Asian names, the authors dug deeper into the model’s internal representations, or embeddings.

Embeddings are how LLMs represent words as vectors of numbers. If an LLM understands the concept of “gender” associated with a name, that information should be encoded in the name’s vector.

The researchers trained a simple classifier (Logistic Regression) on these embeddings to see if the gender of a name could be predicted solely from its vector representation inside Llama2-7B.
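As a rough sketch of such a probe (not the authors’ exact setup): extract a contextualized embedding for each name from the model and fit a logistic regression on those vectors. The name list, the context template, and the pooling choice below are illustrative assumptions, and loading Llama2-7B requires Hugging Face access approval and substantial memory.

```python
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

model_id = "meta-llama/Llama-2-7b-hf"          # gated model; needs HF access
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)
model.eval()

# Hypothetical probe data: (name, gender label) pairs for one racial/ethnic group.
names = [("Mary", 1), ("Linda", 1), ("Susan", 1),
         ("James", 0), ("Robert", 0), ("David", 0)]

def name_embedding(name: str) -> torch.Tensor:
    """Mean-pool the last hidden state over the name's tokens in a short context."""
    inputs = tokenizer(f"{name} said:", return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]        # (seq_len, dim)
    # Assumes the name's tokens appear right after the BOS token.
    n_name_tokens = len(tokenizer(name, add_special_tokens=False)["input_ids"])
    return hidden[1:1 + n_name_tokens].mean(dim=0)

X = torch.stack([name_embedding(n) for n, _ in names]).numpy()
y = [label for _, label in names]

# If gender is linearly decodable from the embeddings, accuracy should sit
# well above the 50% chance level; for Asian names the paper finds it does not.
probe = LogisticRegression(max_iter=1000)
print(cross_val_score(probe, X, y, cv=3).mean())
```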

Table 1: Logistic regression classification accuracy (%) of predicting the demographic attributes associated with a name from Llama2-7B contextualized embeddings.

Table 1 provides the “aha!” moment of the paper:

  • Non-Asian Names: The model extracts gender from Black, Hispanic, and White names with high accuracy (80% - 99%).
  • Asian Names: The accuracy for predicting gender from Asian name embeddings is only 53.3%. Since random chance for binary classification is 50%, this means the model effectively cannot discern the gender of Asian names from their embeddings.

The Conclusion: The low recall for Asian romantic pairs is likely not due to a specific bias against Asian romance per se, but rather a technical failure. Because the model struggles to identify the gender of Asian names, it cannot apply its “heteronormative script.” It doesn’t know if the pair is Male-Female, Male-Male, or Female-Female, and thus its ability to confidently predict “Romance” (which it strongly associates with Male-Female pairs) collapses.

Sanity Check: Do Names Really Matter?

A skeptic might ask: “Maybe the model just relies on the dialogue context and ignores names entirely?”

To disprove this, the researchers ran a baseline experiment where they replaced names with anonymous placeholders like “X” and “Y”.

Table 2: Evaluation scores for anonymous name replacements (characters replaced with “X” or “Y”) for different models under study. These results reflect the models’ performance based solely on the dialogue context.

As shown in Table 2, when names are stripped away (replaced with X/Y), Llama2-7B achieves a recall of roughly 0.6887.

Compare this to the name-replacement results:

  • When names indicate a heterosexual couple (e.g., White Male + White Female), recall jumps up (often > 0.80).
  • When names indicate a same-gender couple, recall drops.

This deviation from the “Anonymous” baseline proves that the model is indeed using the demographic information embedded in the names to make its final decision.
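To make the comparison concrete, here is a tiny sketch of how that deviation can be read off. The group-wise recalls are illustrative placeholders, not the paper’s numbers; only the anonymous baseline value is taken from the post.

```python
anonymous_baseline = 0.6887   # recall with names replaced by "X"/"Y" (from the post)

# Illustrative placeholder recalls, not the paper's actual results.
group_recall = {
    "White male + White female": 0.82,
    "White male + White male":   0.55,
}

for group, r in group_recall.items():
    print(f"{group}: recall {r:.2f} ({r - anonymous_baseline:+.2f} vs. anonymous baseline)")
# Non-zero deviations mean the model's decision shifts with the demographic
# signal carried by the names, not just with the dialogue content.
```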

Conclusion and Social Implications

This paper sheds light on a subtle but significant issue in Large Language Models. While we often worry about models generating hate speech, this study highlights representational harm—the erasure of specific identities.

If LLMs are used to analyze social data, write stories, or target advertisements, these biases could have real-world consequences:

  • Invisibility: Same-gender relationships might be miscategorized as platonic, leading to lower visibility in automated systems.
  • Resource Allocation: If an algorithm targets “couples” for housing loans or family insurance based on social media interactions, same-gender couples or couples with Asian names might be systematically excluded because the model reads their interaction as “sibling-like” or “friendly” rather than romantic.

The authors conclude by emphasizing the need for inclusive technology. As models become gatekeepers of information and opportunity, ensuring they can recognize and respect diverse relationship dynamics—regardless of whether the names are “John and Jane” or “Seung and Min-jun”—is not just a technical challenge, but an ethical imperative.