The intersection of Artificial Intelligence and mental health is one of the most promising yet precarious frontiers in computer science. We often hear about the potential of Large Language Models (LLMs) like GPT-4 to act as accessible therapists or diagnostic tools. However, a new research paper titled “Still Not Quite There! Evaluating Large Language Models for Comorbid Mental Health Diagnosis” pumps the brakes on this enthusiasm.
The researchers introduce a novel dataset called ANGST and benchmark top-tier AI models against it. Their findings reveal a critical gap: while AI can process language fluently, it struggles significantly with the messy, overlapping reality of human mental health—specifically, the phenomenon of comorbidity, where disorders like anxiety and depression exist simultaneously.
In this post, we will dissect the paper, exploring why previous datasets were insufficient, how the ANGST dataset was rigorously constructed, and why even the most advanced models still fail to achieve reliable diagnostic precision.
The Problem with “Digital Psychiatry”
Before diving into the solution, we must understand the flaws in the current landscape of “digital psychiatry.” Researchers have long mined social media platforms like Reddit and Twitter for mental health discourse, since these platforms offer an anonymous space where users discuss their struggles openly.
However, the authors identify three major “bottlenecks” that hold back progress in this field:
- The Data Source Bottleneck: Most datasets are built using “proxy signals.” Researchers scrape posts containing specific hashtags (e.g., #depression) or from specific subreddits (e.g., r/Depression). This creates a biased dataset that only captures people who are explicitly labeling themselves, ignoring the subtle, semantic expressions of mental illness found in general discourse.
- The Annotation Bottleneck: Relying on self-reported diagnoses or community affiliation lacks clinical validity. A user posting in a mental health forum might be sharing a trauma episode, offering advice, or regulating emotion—not necessarily exhibiting symptoms of a disorder. Lumping all these posts under a “Depressed” label creates noisy data.
- The Task Bottleneck: This is perhaps the most critical flaw. Most studies treat mental health as a binary classification task: Is the user Depressed or Healthy? This ignores comorbidity. In clinical psychology, it is well-documented that depression and anxiety often manifest concurrently. By forcing a model to choose one or the other, we oversimplify the human condition.
Introducing ANGST: A Realistic Benchmark
To address these bottlenecks, the researchers curated ANGST (ANxiety-Depression Comorbidity DiaGnosis in Reddit PoST). This is a multi-label classification benchmark: unlike previous binary datasets, ANGST allows a single post to be labeled as indicative of any of the following (a small encoding sketch follows the list):
- Depression
- Anxiety
- Both (Comorbid)
- Neither (Control)
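To make the multi-label setup concrete, here is a minimal sketch (my illustration, not the authors’ code) of how a post’s ground truth can be encoded as two independent binary labels, from which the four categories follow:

```python
# Minimal sketch (not the authors' code): encoding ANGST-style
# multi-label ground truth as two independent binary indicators.

from dataclasses import dataclass

@dataclass
class Post:
    text: str
    depression: bool  # expert-annotated depression cue present
    anxiety: bool     # expert-annotated anxiety cue present

def category(post: Post) -> str:
    """Map the two binary labels onto the four diagnostic categories."""
    if post.depression and post.anxiety:
        return "Comorbid"
    if post.depression:
        return "Depression"
    if post.anxiety:
        return "Anxiety"
    return "Control"

# A comorbid post carries both labels at once, something a
# single-label (binary) classifier cannot express.
example = Post("I can't sleep and I dread every morning.",
               depression=True, anxiety=True)
print(category(example))  # -> "Comorbid"
```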
1. Meticulous Data Collection
The team didn’t just scrape the top posts from r/anxiety. They employed a “neutral seeding” strategy: they identified authors with self-disclosed diagnoses, then analyzed their posting histories across various subreddits over five years. This ensured the dataset captured the natural linguistic patterns of these individuals, not just their posts in support groups.
From a pool of over 70,000 filtered posts, they selected samples with high linguistic alignment to depression and anxiety cues, scored with lexicon-based emotion analysis tools (the NRC Emotion Lexicon and Empath).

As seen in the table above, the researchers used these scores to filter for posts that were emotionally charged (high sadness or fear scores) to ensure the dataset was rich in relevant signals, rather than mundane chatter.
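To illustrate the kind of lexicon-based filtering described above, here is a rough sketch using the open-source Empath library; the 0.05 threshold and the choice of categories are illustrative assumptions, not the paper’s actual parameters:

```python
# Illustrative sketch of lexicon-based emotion filtering with Empath
# (pip install empath). The 0.05 threshold is an assumption for
# demonstration, not the paper's actual cutoff.

from empath import Empath

lexicon = Empath()

def is_emotionally_charged(text: str, threshold: float = 0.05) -> bool:
    """Keep posts whose normalized sadness or fear score is high."""
    scores = lexicon.analyze(text, categories=["sadness", "fear"],
                             normalize=True)
    return scores["sadness"] >= threshold or scores["fear"] >= threshold

posts = [
    "Grabbed coffee and went for a run this morning.",
    "I feel hopeless and scared about everything lately.",
]
filtered = [p for p in posts if is_emotionally_charged(p)]
```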
2. Gold-Standard Annotation
The true value of ANGST lies in its ground truth. Instead of crowd-workers, the team hired expert psychologists to annotate 2,876 posts. These experts worked in isolation to avoid bias, marking posts for depression and anxiety independently.
This approach yielded a dataset that reflects real-world complexity. The following table showcases examples from the dataset, highlighting how depression and anxiety cues can appear separately or together.

3. Statistical Hardness
Is ANGST actually harder to classify than existing datasets? To prove this, the authors measured the Inter-class Similarity using Jensen–Shannon Divergence (JSD).
In machine learning, you generally want your classes (e.g., “Depressed” vs. “Control”) to be distinct. If they are too similar, the model struggles to draw a decision boundary.

As shown in Table 2, ANGST has significantly lower JSD values (0.027–0.036) than other popular datasets such as DATD or Dreaddit. This implies that the “Control” posts in ANGST are semantically very similar to the “Depression” posts, likely because the control posts come from the same distribution of Reddit users rather than from completely random text. This makes ANGST a much tougher, more realistic test of a model’s semantic understanding.
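For intuition, here is a small sketch (again my illustration, not the authors’ measurement pipeline) that computes the JSD between two classes’ unigram distributions with SciPy; values near zero mean the classes are nearly indistinguishable:

```python
# Sketch: Jensen-Shannon Divergence between two classes' word
# distributions (illustration only; not the paper's exact pipeline).

from collections import Counter
import numpy as np
from scipy.spatial.distance import jensenshannon

def unigram_dist(texts, vocab):
    """Unigram probability distribution over a shared vocabulary."""
    counts = Counter(w for t in texts for w in t.lower().split())
    freqs = np.array([counts[w] + 1 for w in vocab], dtype=float)  # add-one smoothing
    return freqs / freqs.sum()

depression_posts = ["i feel empty and hopeless", "nothing matters anymore"]
control_posts = ["i feel tired after work", "nothing on tv matters tonight"]

vocab = sorted({w for t in depression_posts + control_posts for w in t.split()})
p = unigram_dist(depression_posts, vocab)
q = unigram_dist(control_posts, vocab)

# scipy returns the JS *distance*; squaring it gives the divergence.
jsd = jensenshannon(p, q, base=2) ** 2
print(f"JSD = {jsd:.3f}")  # near 0 means near-identical distributions
```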
The Experiments: Man vs. Machine
The researchers benchmarked ANGST against a variety of models:
- Discriminative PLMs (Pre-trained Language Models): Mental-BERT, Mental-RoBERTa, and Mental-XLNet, BERT-style encoders further pre-trained on mental health corpora and then fine-tuned for classification.
- Generative LLMs (Large Language Models): Llama-2 (7B and 13B), GPT-3.5-turbo, and GPT-4.
They tested these models on two tasks: Binary Classification (simple detection) and Multi-label Classification (detecting comorbidity).
Result 1: Binary Classification
In the binary task, the models simply had to detect the presence of Depression or Anxiety against a control group.

Key Takeaways from Table 3:
- GPT-4 Dominance: GPT-4 (Zero-shot) generally achieves the highest scores, particularly in precision.
- The Power of Specialization: Notice that Mental-XLNet and Mental-BERT (fine-tuned models) are extremely competitive, often beating GPT-3.5 and Llama-2. This suggests that for specific domains, a smaller, specialized model can rival a massive generalist LLM.
- Llama-2 Struggles: The open-source Llama-2 models performed poorly, with significantly lower F1 scores compared to the proprietary GPT models and the specialized BERT models.
The Precision-Recall Trade-off
A closer look at the breakdown for depression detection reveals a worrying trend across all models.

In Table 4, look at the Recall vs. Precision columns for Depression.
- Recall is High (90%+): The models are excellent at finding depressed posts. They rarely miss a true case (low false negatives).
- Precision is Low (~60–65%): The models generate a lot of false positives. They are “trigger-happy,” flagging many healthy posts as depressed.
This is a critical insight for real-world application. A diagnostic tool with low precision would overwhelm clinicians with healthy patients misdiagnosed as ill.
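A back-of-the-envelope calculation makes the cost concrete. The counts below are hypothetical, chosen to match the rough figures above:

```python
# Back-of-the-envelope illustration of the precision-recall trade-off,
# using hypothetical counts consistent with the rough figures above.

true_cases = 100          # actually depressed users in a screening pool
recall = 0.90             # models rarely miss a true case
precision = 0.62          # ...but many flags are false alarms

true_positives = recall * true_cases              # 90 correctly flagged
total_flagged = true_positives / precision        # ~145 users flagged
false_positives = total_flagged - true_positives  # ~55 healthy users misflagged

print(f"Flagged: {total_flagged:.0f}, "
      f"of which {false_positives:.0f} are false positives")
```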
Result 2: The Multi-Label Challenge (Comorbidity)
The true test of ANGST is the multi-label task, where models must identify if a user has Depression, Anxiety, Both, or Neither.

Key Takeaways from Table 6:
- Performance Drop: The F1 scores drop significantly compared to the binary task. No model achieved an F1 score above 72%.
- The Comorbidity Gap: The models struggle to distinguish between the two disorders when they appear together. While GPT-3.5 (few-shot) achieved the best balance, the Hamming Loss (the fraction of individual labels predicted incorrectly, illustrated below this list) remains high for many models.
- Anxiety is Harder: Across the board, models were much better at detecting Depression (F1 ~53%) than Anxiety (F1 ~17%). This suggests that anxiety cues in text might be more subtle or context-dependent than depression cues.
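For reference, Hamming Loss counts mistakes at the level of individual labels rather than whole posts. A minimal sketch with scikit-learn, using made-up predictions:

```python
# Minimal sketch of Hamming loss on multi-label predictions
# (made-up labels; columns are [depression, anxiety]).

import numpy as np
from sklearn.metrics import hamming_loss

y_true = np.array([[1, 1],   # comorbid
                   [1, 0],   # depression only
                   [0, 1],   # anxiety only
                   [0, 0]])  # control
y_pred = np.array([[1, 0],   # missed the anxiety label
                   [1, 0],
                   [0, 0],   # missed anxiety again
                   [0, 0]])

# Fraction of individual labels predicted wrong: 2 wrong out of 8 = 0.25
print(hamming_loss(y_true, y_pred))
```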
Why Do Models Fail? Error Analysis
The researchers conducted a qualitative analysis to understand why these advanced models are “still not quite there.”
1. Zero-Shot vs. Few-Shot Confusion
Surprisingly, giving the models examples (few-shot learning) often made performance worse than giving them no examples (zero-shot).

As seen in Table 12, the few-shot approach caused models to over-generalize. In the highlighted examples, the users explicitly mention “I was diagnosed with depression.” The Zero-shot model correctly identified this as a depression-related post. However, the Few-shot model, perhaps confused by the specific context of the examples it was given, misclassified these clear-cut cases.
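For readers unfamiliar with the distinction, the sketch below contrasts the two prompting regimes. These templates are illustrative assumptions; the paper’s exact prompts may differ:

```python
# Schematic zero-shot vs. few-shot prompts for post classification.
# These templates are illustrative assumptions, not the paper's prompts.

ZERO_SHOT = (
    "Does the following Reddit post indicate depression? "
    "Answer 'yes' or 'no'.\n\nPost: {post}\nAnswer:"
)

FEW_SHOT = (
    "Does each Reddit post indicate depression? Answer 'yes' or 'no'.\n\n"
    "Post: I can't get out of bed and nothing feels worth doing.\nAnswer: yes\n\n"
    "Post: Finally finished my thesis, celebrating tonight!\nAnswer: no\n\n"
    "Post: {post}\nAnswer:"
)

post = "I was diagnosed with depression last year and it still shapes my days."
print(ZERO_SHOT.format(post=post))
# The few-shot variant prepends worked examples; as the error analysis
# shows, those examples can anchor the model and cause misclassification.
```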
2. Temporal Blindness
One of the most profound limitations identified is the inability of LLMs to understand time. A user might discuss a past episode of depression from which they have recovered. A human reader understands this context. An LLM, however, sees the keywords and flags the user as currently depressed.

In the second example of Table 13, the user says, “Don’t get wrong, this is the best I’ve felt in a long time… Now that I feel like I’ve leveled out more…” The user is discussing their recovery and the side effects of medication, not an active depressive episode. Yet, both GPT-3.5 and GPT-4 classified this as an active disorder.
3. The Medication Trap
Similarly, models struggle when users discuss medication. If a user says, “Zoloft made me feel better,” the drug name and the history of illness often trigger a positive classification, even if the user is currently asymptomatic.

In Table 15 (bottom row), the user discusses smoking weed while on Zoloft but explicitly states, “Life is going good and I feel comfortable and somewhat happy.” The discriminative models (Mental-XLNet) correctly identified this as “Not Depressed,” but GPT-4 was misled by the context of medication and therapy, flagging it as a positive case.
Conclusion: The Road Ahead
The ANGST paper serves as a vital reality check for the AI healthcare community. While Large Language Models like GPT-4 demonstrate impressive general capabilities, they lack the clinical nuance required for reliable mental health diagnosis.
The creation of ANGST provides the community with a much-needed benchmark that reflects the messy, comorbid reality of mental health. The results show that:
- Comorbidity is the new frontier: We must move beyond binary classification.
- Precision matters: High recall is useless if we cannot trust the positive flags.
- Context is King: Future models need to better understand the temporal nature of symptoms and the difference between having a disorder and discussing the history of one.
As the authors conclude, we are “Still Not Quite There.” These models can serve as preliminary screening tools or assistants, but they are far from ready to act as autonomous diagnosticians. The path forward lies in hybrid models that combine the reasoning of LLMs with the domain expertise of specialized architectures, all validated on rigorous, expert-annotated datasets like ANGST.