Imagine a forensic investigation where a single anonymous email is the key piece of evidence. Investigators have a pool of 100 potential suspects. They run the email through a state-of-the-art AI Authorship Attribution system. The system spits out a ranked list, and “Suspect B” is at the very top.

Suspect B becomes the primary focus of the investigation. Their life is scrutinized, their reputation damaged. But here is the twist: Suspect B didn’t write the email. The AI made a mistake.

Errors happen in machine learning; we accept that no system is 100% accurate. But consider a darker possibility: What if “Suspect B” is the AI’s favorite scapegoat? What if, whenever the AI is confused, it defaults to pointing the finger at Suspect B, simply because of the mathematical properties of their writing style?

This isn’t just a hypothetical flaw—it is a measurable phenomenon described in the recent research paper “Quantifying Misattribution Unfairness in Authorship Attribution.”

In this post, we will dive deep into this paper to understand why high-performing AI models might be fundamentally unfair, how we can measure this unfairness mathematically, and why being an “average” writer might put you at greater risk of being falsely accused.

The Problem: Accuracy vs. Fairness

To understand the core problem, we first need to look at how modern Authorship Attribution (AA) works.

The “Needle in a Haystack” Approach

In a typical forensic setting, we have a Query (a document of unknown authorship) and a Haystack (a large collection of known authors and their documents). The goal is to find the Needle—the true author—hidden within the haystack.

Modern systems use Embeddings. Instead of counting words manually, they use deep learning models (like BERT or RoBERTa) to convert a document into a dense vector (a long list of numbers) in a high-dimensional geometric space. The assumption is simple: documents written by the same person should be close together in this space, while documents by different people should be far apart.

To find the author, the system converts the Query document into a vector and computes a similarity score (usually cosine similarity) against every author in the Haystack. It then ranks the authors from “most likely” (highest similarity) to “least likely.”
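To make that concrete, here is a minimal sketch of the retrieval loop in Python, using an off-the-shelf sentence-transformers model as a stand-in for the authorship-specific encoders discussed later (the haystack texts and author names are invented for illustration):

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # any text-embedding model works here

# Hypothetical haystack: one known writing sample per candidate author.
haystack = {
    "author_a": "I reckon the weather's turned, hasn't it.",
    "author_b": "Per my last email, the deliverables remain outstanding.",
    "author_c": "lol no way that actually happened",
}
query = "As per my previous message, the report is still pending."

model = SentenceTransformer("all-mpnet-base-v2")  # stand-in for LUAR, SBERT, etc.
author_vecs = model.encode(list(haystack.values()), normalize_embeddings=True)
query_vec = model.encode([query], normalize_embeddings=True)[0]

# With unit-length vectors, cosine similarity reduces to a dot product.
scores = author_vecs @ query_vec
ranking = sorted(zip(haystack, scores), key=lambda pair: -pair[1])
print(ranking)  # candidates ordered from "most likely" to "least likely"
```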

The Blind Spot in Current Metrics

Traditionally, researchers evaluate these systems using metrics like Recall@k (is the true author in the top \(k\) guesses?) or Mean Reciprocal Rank (MRR). These metrics ask: How often do we catch the right person?
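Both metrics need only the rank of the true author for each query. A minimal sketch (the toy ranks are made up):

```python
import numpy as np

def recall_at_k(true_ranks: np.ndarray, k: int) -> float:
    """Fraction of queries whose true author appears in the top k."""
    return float((true_ranks <= k).mean())

def mean_reciprocal_rank(true_ranks: np.ndarray) -> float:
    """Average of 1 / rank of the true author (rank 1 = best)."""
    return float((1.0 / true_ranks).mean())

# Toy example: rank of the true author across five queries.
ranks = np.array([1, 3, 2, 15, 1])
print(recall_at_k(ranks, k=8))        # 0.8
print(mean_reciprocal_rank(ranks))    # ~0.58
```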

However, these metrics ignore the other side of the coin: Who are we falsely accusing when we get it wrong?

If a system has 90% accuracy, it fails 10% of the time. If that 10% of blame is spread randomly among all innocent people, the system is fair. But if that 10% of blame lands repeatedly on the same innocent person, the system is biased and unfair. This paper argues that standard evaluation measures completely miss this risk.

Introducing MAUI: The Misattribution Unfairness Index

To solve this, the researchers introduced a new metric called MAUI (Misattribution Unfairness Index). The goal of MAUI is to quantify how much the model’s false alarms deviate from a fair, random distribution.

The Intuition of Fairness

Let’s do a quick thought experiment. Suppose you have a Haystack of 100 authors. You run 1,000 queries and look only at the innocent authors in each ranking (either because the true author isn’t in the Haystack, or because you simply ignore the correct guesses).

In a perfectly fair world, if the model doesn’t know the answer, every innocent author should have an equal probability of appearing in the top 10. Being ranked highly for a document you didn’t write is a “misattribution.”

If the ranking were purely random:

  • The probability of appearing in the top \(k\) is \(k / N_h\) (where \(N_h\) is the number of haystack authors).
  • Over \(N_q\) queries, the Expected Count (\(E_k\)) of times an innocent person appears in the top \(k\) is roughly \((k / N_h) \times N_q\).

Unfairness occurs when certain authors appear in the top \(k\) significantly more often than this Expected Count (\(E_k\)).
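Plugging in the numbers from the thought experiment (\(N_h = 100\), \(N_q = 1000\), \(k = 10\)):

\[
E_{10} = \frac{10}{100} \times 1000 = 100
\]

So under a fair system, any single innocent author should land in the top 10 roughly 100 times; an author who shows up 400 times is absorbing four times their fair share of blame.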

The Equation

The researchers formalized this into the \(MAUI_k\) metric.

Written out, with \(M_k\) standing for the maximum possible value of the numerator (every misattribution concentrated on as few authors as possible), the metric takes roughly this form:

\[
MAUI_k = \frac{\sum_{j=1}^{N_h} \max\left(0,\; c_j^k - E_k\right)}{M_k}
\]

Let’s break down this equation (Eq 1):

  1. \(c_j^k\): This is the actual count of times author \(j\) appeared in the top \(k\) rankings for documents they did not write.
  2. \(E_k\): This is the expected count if the system were fair (random).
  3. \(\max(0, c_j^k - E_k)\): We only care about authors who are “over-attributed” (ranked high too often). If an author appears less than expected, we treat the difference as 0. We are summing up the “excess” blame.
  4. The Denominator (\(M_k\)): This is the maximum possible value of the numerator, which normalizes the score between 0 and 1.
  • 0 means the system is perfectly fair (misattributions are spread evenly).
  • 1 means the system is maximally unfair (the same few people are blamed for everything).
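To make the bookkeeping concrete, here is a minimal sketch of the computation, assuming you have a rank for every (query, haystack author) pair. The function name is mine, and the denominator is my reading of “maximum possible unfairness” (all top-\(k\) slots packed onto just \(k\) authors); the paper’s exact normalizer may differ:

```python
import numpy as np

def maui_k(rank_matrix: np.ndarray, true_authors: np.ndarray, k: int = 10) -> float:
    """Sketch of the MAUI_k computation described above.

    rank_matrix:  (N_q, N_h) array; rank_matrix[q, j] is the rank (1 = top of the
                  list) that the system gives haystack author j for query q.
    true_authors: (N_q,) array with the haystack index of each query's true author.
    """
    n_q, n_h = rank_matrix.shape

    # c_j^k: how often each author lands in the top k for documents they did NOT write.
    in_top_k = rank_matrix <= k
    in_top_k[np.arange(n_q), true_authors] = False  # ignore correct attributions
    c = in_top_k.sum(axis=0)

    # E_k: expected count under a uniformly random ranking.
    e_k = (k / n_h) * n_q

    # Numerator: total "excess" blame carried by over-attributed authors.
    excess = np.maximum(0, c - e_k).sum()

    # Denominator (my assumption): the worst case where the top-k slots are packed
    # onto just k authors, each appearing in every single query's top k.
    max_excess = k * (n_q - e_k)

    return float(excess / max_excess)
```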

Experiments: Do Good Models Play Fair?

The researchers tested five different embedding models:

  1. SBERT: A standard sentence transformer.
  2. LUAR: A model specifically designed for authorship attribution.
  3. MPNet_AR: A Microsoft model fine-tuned for authorship.
  4. Wegmann: A style-based embedding model.
  5. StyleDist: Another style-based model.

They tested these on three diverse datasets: Reddit comments, Blogs, and Fanfiction.

1. Effectiveness (Accuracy)

First, let’s look at how well these models actually work at finding the correct author.

Table 1 showing Recall-at-8 and MRR scores. LUAR and MPNet perform very well on Reddit and Blogs.

As shown in Table 1, LUAR is a powerhouse. It achieves a Recall@8 of 0.97 on Blogs and 0.82 on Reddit. It is highly effective at identifying the correct author. Wegmann, on the other hand, struggles significantly with accuracy (only 0.08 Recall@8 on Reddit).

2. Unfairness (MAUI Scores)

Now, let’s look at the fairness scores using the new MAUI metric. Remember, lower is better (0 is fair).

Table 2 showing MAUI scores for different k values. SBERT is highly unfair. LUAR shows high unfairness on Blogs despite high accuracy.

Table 2 reveals a startling disconnect.

  • SBERT is incredibly unfair (0.31 on Reddit for k=10). It constantly misattributes texts to the same group of people.
  • Wegmann, which was the worst at accuracy, is actually the most fair (lowest MAUI scores). It distributes its confusion evenly.
  • LUAR, the accuracy champion, shows concerning unfairness levels, particularly on the Blogs dataset (0.12).

Key Takeaway: There is no free lunch. A model can be highly accurate at finding the right person but highly biased when it guesses wrong. Just because a system has “97% accuracy” doesn’t mean it’s safe to use in a courtroom, because the error mode might disproportionately target specific individuals.

The Scale of the Risk

How bad is this “excess” blame? Is it just a few extra times?

Table 3 showing the count of authors ranked in the top 10 more than expected. Thousands of authors face 2x or 4x the expected risk.

Table 3 (top section) shows the raw counts for the Reddit dataset.

  • With SBERT, over 2,500 authors are ranked in the top 10 more than four times as often as random chance would predict (\(> 4 \times E_{10}\)).
  • Even with LUAR, hundreds of authors face a risk that is 4x or 5x higher than their peers.

Table 4 (bottom section of the image) highlights the extreme cases. In the Reddit dataset using SBERT, there is an unlucky individual who is 39 times more likely to be misattributed than the average person. Imagine being 39 times more likely to be a suspect in a crime you didn’t commit, purely because of your writing style.
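These multiples are easy to tabulate once you have the per-author counts \(c_j^k\) and the expected count \(E_k\) from the MAUI sketch above. A small helper (the function name and thresholds are mine, chosen to mirror the tables):

```python
import numpy as np

def risk_multiples(c: np.ndarray, e_k: float, multiples=(2, 3, 4, 5)) -> dict:
    """Count how many authors carry m times the blame a fair system would assign.

    c:   per-author counts of top-k appearances for documents they did not write
         (the c_j^k values from the MAUI sketch above).
    e_k: the expected count under a random ranking.
    """
    ratio = c / e_k
    summary = {m: int((ratio > m).sum()) for m in multiples}
    summary["worst"] = float(ratio.max())  # the unluckiest author's multiple
    return summary
```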

Why is this happening? The “Centroid” Hypothesis

Why does the AI hate these specific people? Is it their vocabulary? Their grammar? The researchers found a geometric explanation: Distance to Centroid.

In the vector space (where every author is a dot), there is a “center” or “centroid”—the average of all authors.

  • Outliers: Authors with very unique styles (e.g., using rare words, strange punctuation) live on the edges of the cloud.
  • Centroid Authors: Authors with very “generic” or “average” styles live near the center.

The Geometric Trap

The researchers measured the distance of every author from this center and compared it to their average rank (how often they appear at the top of the list).

Figure 1 scatter plots showing the relationship between mean rank and distance from centroid. Authors closer to the centroid (0.0) have much better (lower) ranks.

Figure 1 shows a clear trend across all datasets (Reddit, Blogs, Fanfiction). The x-axis is the distance from the centroid (0 is the center). The y-axis is the Mean Rank (a lower number means the author appears nearer the top of the ranked list).

The trend is undeniable: As you get closer to the center (moving left on the x-axis), your Mean Rank drops (you appear higher on lists).

This means that if your writing style is “average” or “generic,” you are geometrically closer to everyone. When the model tries to match a Query document, and it’s not sure who wrote it, the vector often lands somewhere in the middle of the space. Who lives in the middle? The generic authors.
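You can check this geometry on your own embeddings by measuring each author’s distance from the centroid and correlating it with their mean rank, which is essentially the analysis behind Figure 1. A minimal sketch, assuming an author_vecs matrix (one embedding per haystack author) and the rank_matrix from the MAUI sketch:

```python
import numpy as np
from scipy.stats import spearmanr

def centroid_distance_vs_rank(author_vecs: np.ndarray, rank_matrix: np.ndarray):
    """Relate each author's distance from the centroid to their mean rank.

    author_vecs: (N_h, d) matrix of author embeddings.
    rank_matrix: (N_q, N_h) matrix of ranks (1 = top of the list).
    """
    centroid = author_vecs.mean(axis=0)

    # Cosine distance from the centroid for each author.
    sims = (author_vecs @ centroid) / (
        np.linalg.norm(author_vecs, axis=1) * np.linalg.norm(centroid)
    )
    dist_from_centroid = 1.0 - sims

    mean_rank = rank_matrix.mean(axis=0)

    # A positive correlation means: further from the centroid -> worse (higher) mean
    # rank, i.e. central authors float toward the top of everyone's ranked list.
    rho, p_value = spearmanr(dist_from_centroid, mean_rank)
    return dist_from_centroid, mean_rank, rho, p_value
```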

Consequently, average writers are the universal “Plan B” for the model. They are the default suspects.

The Distribution of Authors

This problem is exacerbated by how authors are distributed in the space.

Figure 2 histograms showing the distribution of author distances. Most authors are clustered near specific distances, but the shapes vary by model.

Figure 2 shows the density of authors at different distances. You can see that models distribute authors differently. SBERT (green line) tends to clump users tightly, which contributes to its high unfairness—everyone is too close to the “generic” center.

The Irony of the Average

So, being an average writer makes you a frequent false suspect. But does being “average” at least help the model find you when you actually wrote the text?

Surprisingly, no. The researchers analyzed the “Needle” authors—those who were successfully identified (Highest MRR) versus those who were hard to find (Lowest MRR).

Figure 3 scatter plots for Reddit comparing distance from centroid for easy-to-find (red) vs hard-to-find (teal) authors.

Figure 4 scatter plots for Blogs comparing distance from centroid. High MRR authors (red) are distinctly separate from Low MRR authors (teal).

Figures 3 and 4 tell a fascinating story (look at the red vs. teal dots):

  • Red Dots (Highest MRR): These are authors the model identifies easily. Notice how they are often shifted to the left (closer to the centroid) in some models, or clustered differently.
  • Teal Dots (Lowest MRR): These are authors the model fails to identify.

The statistical analysis in the paper (Table 7) confirms a cruel irony for Reddit users: Authors closer to the centroid are more likely to be misattributed (false positives), but they are NOT necessarily easier to identify correctly (true positives).

Specifically, for the Reddit dataset, the authors who are easiest to find (Highest MRR) tend to be further away from the centroid than the random population. This makes sense: unique styles are easy to spot. But the “average” styles? They are hard to distinguish from each other, yet they constantly get flagged as false matches for everyone else.

What Does This Mean for the Future?

This research highlights a critical flaw in how we build and evaluate AI forensics.

  1. Risk Communication: If law enforcement uses these tools, they need to know that the “Top 5” list isn’t just a list of likely suspects. It might be populated by “statistical sponges”—innocent people whose writing style just happens to be geometrically central.
  2. Calibration: We need systems that recognize “generic” queries. If a query lands in the middle of the vector space, the system should perhaps return “Inconclusive” rather than outputting the names of the poor souls who live at the centroid (a toy version of this check is sketched after this list).
  3. New Metrics: We cannot rely on MRR and Recall alone. Metrics like \(MAUI_k\) must be part of the standard scorecard for any AI model that affects human lives.
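The calibration idea in point 2 could be as simple as a distance gate on the query embedding before any names are returned. This is purely a sketch of the idea; the function, threshold, and abstention message are hypothetical, not something the paper specifies:

```python
import numpy as np

def attribute_or_abstain(query_vec, author_vecs, author_names, k=5, centroid_radius=0.15):
    """Return a top-k list only when the query is NOT in the 'generic' core of the space.

    centroid_radius is a hypothetical, dataset-specific threshold on cosine distance
    from the centroid below which the system abstains.
    """
    centroid = author_vecs.mean(axis=0)
    sim = (query_vec @ centroid) / (np.linalg.norm(query_vec) * np.linalg.norm(centroid))
    if 1.0 - sim < centroid_radius:
        return "Inconclusive: query is too close to the generic centre of the author space."
    scores = author_vecs @ query_vec
    top = np.argsort(-scores)[:k]
    return [author_names[i] for i in top]
```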

Summary

The paper “Quantifying Misattribution Unfairness in Authorship Attribution” teaches us that in the world of vector embeddings, being “normal” is a liability.

  • The Unfairness is Real: AI models disproportionately target specific individuals when they make mistakes.
  • The Cause is Geometry: Authors with “centroid” (average) styles are mathematically closer to all other documents.
  • The Consequence: These authors are at high risk of misattribution (false accusation) while simultaneously being harder to uniquely identify.

As we continue to deploy AI in sensitive fields like forensics, understanding these hidden geometric biases is not just an academic exercise—it is a requirement for justice.


This post breaks down the research by Alipoormolabashi et al. (2025). The provided images were extracted directly from their paper to illustrate the technical concepts.