How can we understand the psychology of people who lived thousands of years ago? We cannot interview them, survey them, or put them in an fMRI machine. Their minds are, quite literally, dead. However, they left behind “textual remains”—philosophical treatises, political records, poetry, and personal letters.
The emerging field of Historical Psychology attempts to reconstruct the thoughts, feelings, and values of past populations by analyzing these texts. But there is a massive bottleneck: the sheer volume of history. Reading every text from the Song Dynasty to map the evolution of “collectivism” is impossible for a single human.
This is where Natural Language Processing (NLP) steps in. While most NLP research focuses on modern English, a recent paper titled “Surveying the Dead Minds” tackles a much harder challenge: Classical Chinese. This language shaped East Asian thought for millennia, serving as the vehicle for Confucianism, Daoism, and Buddhism.
In this post, we will break down how researchers developed a novel pipeline called Contextualized Construct Representation (CCR) to extract psychological traits from ancient Chinese texts, outperforming even GPT-4 in specific tasks.
The Problem: Bag-of-Words vs. Context
Before the advent of deep learning, “reading” historical texts computationally was a blunt instrument. Researchers often used “Bag-of-Words” approaches. If you wanted to measure “anxiety” in a text, you would simply count how many times the word “fear” or “worry” appeared.
The most advanced version of this is the Distributed Dictionary Representation (DDR). DDR uses word embeddings (converting words into numbers based on their meaning) to create a “centroid” or average meaning for a dictionary of concepts. It then compares the text to that dictionary.
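To make the contrast concrete, here is a minimal sketch of a DDR-style score in Python. Everything here (the helper names, the plain averaged word vectors, the cosine similarity) is illustrative rather than the paper's exact implementation:

```python
import numpy as np

def ddr_score(text_tokens, dictionary_tokens, word_vectors):
    """DDR-style scoring: compare the text's average word embedding
    to the centroid of a concept dictionary's word embeddings."""
    # Average the embeddings of the dictionary words -> concept "centroid"
    centroid = np.mean([word_vectors[w] for w in dictionary_tokens if w in word_vectors], axis=0)
    # Average the embeddings of the text's words
    text_vec = np.mean([word_vectors[w] for w in text_tokens if w in word_vectors], axis=0)
    # Cosine similarity between the dictionary centroid and the text average
    return float(np.dot(centroid, text_vec) /
                 (np.linalg.norm(centroid) * np.linalg.norm(text_vec)))
```

Notice that word order and surrounding context never enter the calculation, which is exactly the weakness described next.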
However, these methods have a fatal flaw: they ignore context.
In Classical Chinese, context is everything. A single character can change meaning entirely depending on the characters around it. Furthermore, psychological concepts like “moral integrity” or “filial piety” are complex constructs expressed through sentences and stories, not just isolated keywords.
The Solution: Contextualized Construct Representation (CCR)
The researchers propose a new pipeline specifically adapted for historical analysis: CCR. Unlike dictionary methods that look at words in isolation, CCR uses Transformer-based language models (like BERT) to generate embeddings for entire sentences or paragraphs.
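In code, the shift from DDR to CCR is essentially a shift from averaged word vectors to contextual sentence embeddings: the text and the measurement items are each encoded as whole units and then compared. A minimal sketch of that idea, assuming a sentence-transformers style encoder (the model name is a placeholder, not the checkpoint used in the paper):

```python
from sentence_transformers import SentenceTransformer, util

# Placeholder checkpoint; any suitable Classical Chinese encoder could be used.
encoder = SentenceTransformer("bert-base-chinese")

def ccr_score(text, questionnaire_items):
    """Embed the whole text and each measurement item in context,
    then average the cosine similarities as the construct score."""
    text_emb = encoder.encode(text, convert_to_tensor=True)
    item_embs = encoder.encode(questionnaire_items, convert_to_tensor=True)
    return util.cos_sim(text_emb, item_embs).mean().item()
```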
Building this pipeline meant solving two massive challenges:
- The Questionnaire Problem: Valid psychological surveys are in modern English. How do we apply them to Classical Chinese?
- The Data Problem: There are no labeled datasets for “Classical Chinese Psychology” to train AI models.
Let’s look at how the researchers solved these, as illustrated in their pipeline overview.

Challenge 1: Cross-Lingual Questionnaire Conversion
Psychologists have spent decades validating questionnaires (e.g., to measure Individualism vs. Collectivism). These are usually simple English statements like “I rely on myself most of the time.”
Translating these directly into Classical Chinese often results in awkward phrasing that doesn’t match the historical style. The researchers devised a clever workaround shown on the right side of Figure 2 above:
- Input: An English questionnaire item.
- Quote Recommendation: Instead of translating, they used a model called “QuoteR” to find existing quotations from historical texts that carry the same semantic meaning.
- Manual Filtering: Experts review the quotes to ensure they fit the psychological construct.
For example, an English item about “following rules” might be matched with a genuine quote from a legalist text in the Warring States period. This ensures the “ruler” we are using to measure the text is linguistically authentic.
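A rough sketch of the quote-matching step is below. A generic multilingual sentence encoder stands in for QuoteR here; the actual QuoteR model and its interface are different, and the model name is only an example:

```python
from sentence_transformers import SentenceTransformer, util

# Example multilingual encoder standing in for the quote-recommendation model.
encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def recommend_quotes(english_item, classical_quotes, top_k=5):
    """Rank candidate Classical Chinese quotes by semantic similarity
    to an English questionnaire item, then hand the top ones to experts."""
    item_emb = encoder.encode(english_item, convert_to_tensor=True)
    quote_embs = encoder.encode(classical_quotes, convert_to_tensor=True)
    scores = util.cos_sim(item_emb, quote_embs)[0]
    ranked = scores.argsort(descending=True)[:top_k].tolist()
    return [(classical_quotes[i], float(scores[i])) for i in ranked]
```

The manual-filtering step then happens on the returned shortlist, not inside the code.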
Challenge 2: Indirect Supervised Contrastive Learning
The second challenge is the model itself. Off-the-shelf models like bert-ancient-chinese are good at general language but bad at understanding psychological nuance. To fix this, the model needs to be “fine-tuned” (trained further) on relevant data.
But where do you find a labeled dataset of ancient psychology? You build it.
The researchers compiled the Chinese Historical Psychology Corpus (C-HI-PSY). They extracted over 21,000 paragraphs from historical works. But they still lacked labels: no one had tagged these paragraphs as “High Morality” or “Low Neuroticism.”
The “Pseudo Ground Truth” Trick
The authors utilized the titles of the chapters or articles. In Classical Chinese writing, titles often act as summaries of the moral value discussed (e.g., a chapter titled “Filial Piety”).
If two paragraphs come from chapters with semantically similar titles, the paragraphs themselves should be semantically similar. This assumption allows the researchers to create “Pseudo Ground Truth” labels without manual annotation.

As shown in Figure 3, the process works like this:
- Title Embedding: Use a word vector model to measure how similar Title A is to Title B.
- Triplet Sampling: Select three texts:
  - Anchor (\(s_A\)): The target paragraph.
  - Positive (\(s^+\)): A different paragraph with a very similar title.
  - Negative (\(s^-\)): A paragraph with a very different title.
- Contrastive Learning: Train the model to pull the Anchor and Positive closer together in vector space while pushing the Negative away.
The Sampling Mathematics
To be precise, positive pairs are defined by a high similarity threshold (\(\delta^+\)), and negative pairs by a low threshold (\(\delta^-\)).
The set of positive pairs is defined as:

\[ \mathcal{P}^{+} = \{ (s_i, s_j) \mid \operatorname{sim}(t_i, t_j) \geq \delta^{+} \} \]

The set of negative pairs is defined as:

\[ \mathcal{P}^{-} = \{ (s_i, s_j) \mid \operatorname{sim}(t_i, t_j) \leq \delta^{-} \} \]

where \(t_i\) is the title of paragraph \(s_i\) and \(\operatorname{sim}\) is the similarity between title embeddings.
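A minimal sketch of how these pair sets might be built, assuming a hypothetical title_embed function that returns a vector for a title; the threshold values are illustrative, not the paper's:

```python
import itertools
import numpy as np

def build_pairs(paragraphs, titles, title_embed, delta_pos=0.8, delta_neg=0.2):
    """Form pseudo-labeled pairs from title similarity: very similar titles
    yield positive pairs, very dissimilar titles yield negative pairs."""
    def sim(a, b):
        va, vb = title_embed(a), title_embed(b)
        return float(np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb)))

    positives, negatives = [], []
    for i, j in itertools.combinations(range(len(paragraphs)), 2):
        s = sim(titles[i], titles[j])
        if s >= delta_pos:        # titles almost interchangeable -> positive pair
            positives.append((i, j))
        elif s <= delta_neg:      # titles clearly unrelated -> negative pair
            negatives.append((i, j))
    return positives, negatives
```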
Hard vs. Random Sampling
When selecting the triplet samples, the researchers faced a choice. Should they pick the “hardest” examples (e.g., a positive pair that the model currently thinks is very different) to force the model to learn faster? Or just pick randomly?
Usually, “hard sampling” is better in machine learning. Surprisingly, in this context, random sampling worked better.

As Figure 4 shows, Random Sampling (orange line) consistently achieved higher correlation scores than Hard Sampling (blue line).
Why? Because of the “Pseudo Ground Truth.” Since the labels are based on titles, they aren’t perfect. A paragraph in a chapter about “Loyalty” might actually be about “Geography.” Hard sampling tends to zero in on these noisy, incorrectly labeled examples, confusing the model. Random sampling is more robust against this noise.
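The difference between the two strategies can be made explicit in a few lines. This is a hedged sketch: model_distance stands for whatever distance the current model assigns between two paragraphs and is not a real API:

```python
import random

def pick_positive(anchor, positive_pairs, model_distance, strategy="random"):
    """Choose a positive partner for an anchor paragraph. Random sampling
    draws uniformly; hard sampling picks the partner the current model
    considers farthest away, which is also where noisy title-based labels
    are most likely to mislead it."""
    candidates = [j for (i, j) in positive_pairs if i == anchor]
    if not candidates:
        return None
    if strategy == "random":
        return random.choice(candidates)
    # "hard": the positive currently farthest from the anchor in embedding space
    return max(candidates, key=lambda j: model_distance(anchor, j))
```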
The Loss Function
Finally, the model is trained using Triplet Loss. This mathematical function punishes the model if the distance between the Anchor and the Positive is not significantly smaller than the distance between the Anchor and the Negative.

\[ \mathcal{L} = \max\!\big( \mathcal{D}(s_A, s^{+}) - \mathcal{D}(s_A, s^{-}) + \epsilon,\ 0 \big) \]

Here, \(\mathcal{D}\) represents the distance between embedding vectors and \(\epsilon\) is a margin. The goal is to minimize this loss, effectively organizing the “mind” of the AI to understand psychological concepts in Classical Chinese.
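A minimal PyTorch-style sketch of that loss (the margin value is illustrative, not the paper's setting):

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Zero once the anchor is closer to the positive than to the negative
    by at least `margin`; otherwise the remaining gap is penalized."""
    d_pos = F.pairwise_distance(anchor, positive)  # D(s_A, s+)
    d_neg = F.pairwise_distance(anchor, negative)  # D(s_A, s-)
    return torch.clamp(d_pos - d_neg + margin, min=0).mean()
```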
Experiments and Results
Does this complex pipeline actually work? The researchers tested their fine-tuned CCR models against:
- DDR: The traditional dictionary-based word embedding method.
- Prompting: Using GPT-3.5 and GPT-4 (few-shot prompting) to analyze the texts directly.
They evaluated performance across three tasks:
- Semantic Textual Similarity (STS): Can the model tell if two texts discuss the same value?
- Questionnaire Item Classification (QIC): Can the model categorize a sentence into the correct psychological domain (e.g., Collectivism)?
- Psychological Measure (PM): Can the model accurately score a text on a psychological scale?
The Radar Chart of Victory
The results were decisive.

In Figure 1, the red line (CCR) encompasses the others. It significantly outperforms the traditional DDR method (green) and, impressively, beats GPT-4 (blue) on almost every metric. This highlights that while Large Language Models are powerful, a smaller, domain-specific model fine-tuned on historical data can still be superior for specialized tasks.
Detailed Performance
Let’s look closer at the improvement fine-tuning provided.

Figure 5 shows the difference between using a raw, pre-trained model (orange) and the fine-tuned CCR model (blue). In almost every case—especially for Semantic Textual Similarity—the fine-tuning process yielded massive gains. The model moved from simply understanding “Chinese characters” to understanding “Historical Concepts.”
The numerical data further supports this. Looking at Table 2 below, we see that the CCR method (using BERT-based models) achieves Pearson correlations in the 0.30–0.53 range for similarity tasks, whereas standard word embeddings (DDR) struggle near zero.

Benchmarking Against History: The Song Dynasty Reform
Validating these models is tricky. We don’t have ground truth for what an ancient author really felt. To solve this, the authors looked for a historical event where attitudes were publicly known: The New Policies of Wang Anshi (11th Century).
This was a major political reform in the Song Dynasty. Historians have manually documented which officials supported the reform and which opposed it.
The Hypothesis: Political psychology theory suggests that people high in Traditionalism and respect for Authority generally resist change. Therefore, officials who opposed the reform should score higher on these traits in their writings.
The Test: The researchers ran their CCR pipeline on the writings of 137 Song Dynasty officials to measure their levels of “Traditionalism” and “Authority.”
The Result:

The scatter plots in Figure 6 confirm the theory. There is a statistically significant negative correlation.
- The X-axis represents support for reform (0 to 1).
- The Y-axis represents the CCR-generated score for Traditionalism/Authority.
The downward-sloping regression lines indicate that officials who wrote with higher Traditionalism/Authority scores were indeed less likely to support the reform.
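The validation itself boils down to a correlation test. A minimal sketch, with hypothetical inputs (one value per official):

```python
from scipy.stats import pearsonr

def check_reform_hypothesis(reform_support, traditionalism_scores):
    """Correlate documented support for the New Policies (0 to 1) with
    CCR-derived Traditionalism scores. A negative r with a small p-value
    corresponds to the pattern reported in the paper."""
    r, p = pearsonr(reform_support, traditionalism_scores)
    return r, p
```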

Table 3 confirms this with high statistical significance (\(p < .001\)). This external validation shows that the CCR pipeline isn’t just finding linguistic patterns; it is picking up psychological signals that correspond to real-world historical behavior.
Conclusion
The “Surveying the Dead Minds” paper represents a significant leap forward for Digital Humanities. By combining expert knowledge from psychometrics with state-of-the-art NLP, the researchers have created a time machine of sorts.
Key takeaways for students and researchers:
- Context is King: Moving from word-counting (DDR) to sentence-embedding (CCR) is essential for capturing the nuance of complex languages like Classical Chinese.
- Creativity with Data: When labeled data doesn’t exist, you can create it. The use of titles as “Pseudo Ground Truth” was a clever heuristic that enabled powerful contrastive learning.
- Specialization beats Generalization: A fine-tuned BERT model outperformed the mighty GPT-4 in this specific domain, showing that bigger isn’t always better; well-targeted training data often matters more.
This work opens the door for a quantitative “Historical Psychology,” allowing us to test theories about human nature across thousands of years of history, effectively letting the “dead minds” speak to us once again.