Imagine standing in the dense tropical forests of Costa Rica. The air is thick with humidity, and the soundscape is a chaotic symphony of insect buzzes, bird calls, wind rustling through leaves, and distant rumbles. In the middle of this acoustic storm (a natural version of the “cocktail party problem”), a white-faced capuchin monkey calls out.
For a human researcher, identifying which specific monkey just made that sound is an arduous task requiring years of training and intense focus. For a computer, it’s even harder. The lack of large, labeled datasets for wild animals has long been a bottleneck in bioacoustics. We have massive datasets for human speech and decent ones for bird calls, but for a specific species of Neotropical primate? The data is scarce.
This brings us to a fascinating question posed by researchers at the University of Michigan: Can we identify individual monkeys by “borrowing” the hearing capabilities of Artificial Intelligence models designed for birds and humans?
In this deep dive, we will explore a recent paper that proposes a novel method for acoustic individual identification. We will look at how combining “embeddings” from models trained on completely different species (humans and birds) creates a super-classifier capable of distinguishing individual capuchin monkeys with impressive accuracy.
The Challenge of Animal Linguistics
For decades, animal vocalizations were viewed largely as reflex reactions to emotional states—a simple “ouch” or “hey.” However, the field of Animal Linguistics has shifted this perspective, revealing that many species possess complex communication systems with features analogous to syntax and semantics.
To truly understand these systems, researchers need to know who is speaking to whom. Individual identification is the cornerstone of studying social networks, behavioral context, and population dynamics.
The white-faced capuchin (Cebus capucinus) is an ideal candidate for this research. They are highly intelligent, use tools, and live in complex social groups. However, collecting data on them is difficult. The researchers for this study spent two years in the Taboga Reserve in Costa Rica, collecting recordings of two distinct call types known as “Peeps” and “Twitters.”

Figure 1 above provides a window into this study system. Panels A-D display spectrograms of “Twitter” calls, which are complex and structurally varied. Panel G shows the subjects themselves in their natural habitat. The maps in Panel H illustrate the territories of different monkey groups, highlighting the spatial complexity of the fieldwork.
The problem remains: How do we automate the identification of these individuals when we only have a few thousand recordings—a tiny fraction of what is usually required to train deep learning models?
The Core Method: Cross-Species Transfer Learning
The researchers turned to Transfer Learning. In machine learning, this is the equivalent of teaching a musician how to paint; their understanding of rhythm and composition (patterns) might help them understand brushstrokes faster than a complete novice.
In this study, the “musicians” are pre-trained AI models. The researchers hypothesized that models trained on vast amounts of audio data from other species could extract useful features from monkey calls, even though they had never “heard” a monkey before.
They focused on two primary source models:
- Google Perch: A bioacoustics model trained on thousands of hours of bird vocalizations.
- OpenAI Whisper: A massive speech recognition model trained on 680,000 hours of human speech.
The Concept of Embeddings
These models don’t just output text or bird names; they process sound through layers of neurons. At the end of this process, the sound is converted into a vector—a long list of numbers—called an embedding. An embedding represents the mathematical “essence” of the sound.
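To make this concrete, here is a minimal sketch of how a fixed-length embedding might be pulled from a pre-trained audio model. It uses the Hugging Face transformers port of Whisper and a hypothetical file name purely for illustration; the paper’s exact extraction pipeline (and its handling of Perch) may differ.

```python
import torch
import librosa
from transformers import WhisperFeatureExtractor, WhisperModel

# Illustrative only: the model size and the mean-pooling step are assumptions,
# not necessarily the paper's exact pipeline.
feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-base")
model = WhisperModel.from_pretrained("openai/whisper-base").eval()

# Load a (hypothetical) capuchin call at Whisper's expected 16 kHz sample rate.
waveform, sr = librosa.load("capuchin_call.wav", sr=16000)
inputs = feature_extractor(waveform, sampling_rate=sr, return_tensors="pt")

with torch.no_grad():
    # Run only the audio encoder; it outputs a sequence of frame-level vectors.
    encoder_out = model.encoder(inputs.input_features).last_hidden_state  # (1, frames, dim)

# Mean-pool over time to get one fixed-length vector: the "embedding" of the call.
embedding = encoder_out.mean(dim=1).squeeze(0)  # shape: (hidden_dim,), 512 for whisper-base
print(embedding.shape)
```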
The hypothesis was that an embedding generated by Whisper might capture speech-like intricacies in the monkey calls (phylogenetic proximity), while an embedding from Perch might capture the acoustic texture of wildlife recordings (environmental proximity).
Joint Multi-Species Embeddings
The major innovation of this paper is not just using one model, but combining them. The researchers tested whether “two heads are better than one” by fusing the embeddings from bird and human models.
They employed three methods to combine these representations:
- Concatenation: simply sticking the two number lists together.
- Summation: adding the two lists element by element (after adjusting them to the same length).
- Minimum Redundancy Maximum Relevance (MRMR): This is the most sophisticated approach.
Understanding MRMR
When you combine two massive models, you get a lot of data, but also a lot of noise and repetition. MRMR is a feature selection technique. It looks at the thousands of numbers produced by Perch and Whisper and asks two questions:
- Relevance: Does this specific number help me distinguish between Monkey A and Monkey B? (Maximize this).
- Redundancy: Is this number telling me the exact same thing as the number I just picked? (Minimize this).
By applying MRMR, the researchers created a “super-embedding” that kept only the most distinct and informative features from both the bird and human perspectives.
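The sketch below shows roughly what the three fusion strategies look like in code. The greedy MRMR loop is a simplified stand-in (mutual information for relevance, absolute correlation for redundancy), not the authors’ exact implementation, and the dimensions are placeholders.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def concatenate(perch_emb, whisper_emb):
    """Fusion #1: stack the two vectors side by side."""
    return np.concatenate([perch_emb, whisper_emb], axis=-1)

def summation(perch_emb, whisper_emb, target_dim=512):
    """Fusion #2: bring both to a common length, then add element-wise.
    (Truncation is used here for simplicity; the paper's adjustment may differ.)"""
    return perch_emb[..., :target_dim] + whisper_emb[..., :target_dim]

def mrmr_select(X, y, k=256):
    """Fusion #3 (simplified): greedily keep k features that are relevant to the
    monkey-ID labels but not redundant with features already chosen."""
    relevance = mutual_info_classif(X, y)          # how well each feature separates individuals
    selected = [int(np.argmax(relevance))]
    remaining = [j for j in range(X.shape[1]) if j != selected[0]]
    for _ in range(k - 1):
        best_score, best_j = -np.inf, None
        for j in remaining:
            # Redundancy: mean absolute correlation with already-selected features.
            redundancy = np.mean([abs(np.corrcoef(X[:, j], X[:, s])[0, 1]) for s in selected])
            score = relevance[j] - redundancy      # maximize relevance, minimize redundancy
            if score > best_score:
                best_score, best_j = score, j
        selected.append(best_j)
        remaining.remove(best_j)
    return X[:, selected], selected
```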
Visualizing the Acoustic Space
To understand what these models “see,” we can use t-SNE plots, which compress high-dimensional data into 2D scatter plots.

In Figure 8 (above), we see the visualization of embeddings from bird-trained models (BirdNET and Perch).
- Column 1: Shows the clear separation between “Peeps” (yellow) and “Twitters” (blue). The models easily distinguish between the two call types.
- Columns 2 & 3: These show the challenge of individual identification. The points (representing specific monkeys) are somewhat clustered, but there is significant overlap. This visual messiness illustrates why this task is so hard: the differences between individuals are subtle and often drowned out by the environment.
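For readers who want to reproduce this kind of plot, here is a hedged sketch using scikit-learn’s t-SNE; the embeddings and labels below are random placeholders, and the paper’s perplexity and preprocessing choices are not specified here.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Placeholders: substitute real (n_calls, dim) embeddings and per-call labels
# (call type or individual ID).
embeddings = np.random.rand(500, 1280)
labels = np.random.randint(0, 2, size=500)

# Compress the high-dimensional embeddings into 2D coordinates for plotting.
coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)

plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="viridis", s=10)
plt.title("t-SNE of call embeddings (placeholder data)")
plt.show()
```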
Experiments and Results
The researchers ran extensive experiments, training classifiers on the embeddings extracted from Perch, Whisper, and their combinations. They evaluated performance using the F1 Score (a metric that balances precision and recall).
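In code, this evaluation loop might look roughly like the following; a logistic-regression classifier and a simple train/test split are used here as stand-ins, and the data are random placeholders rather than the paper’s actual setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# Placeholder data: in practice X would be the fused embeddings and y the individual IDs.
X = np.random.rand(300, 256)
y = np.random.randint(0, 8, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=0
)

clf = LogisticRegression(max_iter=2000).fit(X_train, y_train)
pred = clf.predict(X_test)

# Macro-averaged F1 weights every individual equally, balancing precision and recall.
print("F1:", f1_score(y_test, pred, average="macro"))
```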
The results led to several surprising conclusions.
1. The Ensemble Effect
The primary hypothesis was confirmed: combining models works better than using them individually.

Table 1 outlines the key findings.
- Look at the “Twitters” section. The Perch (Simple) model achieves an F1 score of 0.61.
- The Whisper (Simple) model only achieves 0.55.
- However, the Perch + Whisper (MRMR) combination jumps to 0.66.
This improvement indicates that the bird model and the human model are picking up on different aspects of the monkey calls. When combined, they provide a more complete picture of each individual’s vocal identity.
2. Environment Beats Genetics
One of the most profound takeaways from this paper is the comparison between Perch and Whisper individually.
Humans are primates. We are genetically much closer to capuchin monkeys than birds are. One might therefore expect the model trained on human speech (Whisper) to be better at decoding primate vocalizations. It wasn’t.
Perch (the bird model) consistently outperformed Whisper (the human model).
Why? The authors argue that domain relevance matters more than phylogenetic proximity.
- Whisper was trained on clean, studio-quality speech or internet audio. It “expects” clear signals and linguistic structure.
- Perch was trained on field recordings of birds. It “knows” what wind, rain, and distance sound like. It has learned to filter out the background noise of the forest—the very same forest where the monkeys live.
This suggests that for bioacoustics in the wild, the acoustic environment is a stronger shared feature than the biological source of the sound.
3. Peeking Inside the “Brain” of the Model
The researchers didn’t just stop at the final output; they probed the internal layers of the Whisper model to see where the useful information was hiding.

Figure 3 shows the performance of the model based on which “layer” of the neural network was used. Deep learning models process data hierarchically: early layers detect simple edges or tones, while deeper layers detect complex concepts or words.
The graphs show that the intermediate layers (around layer 15) yielded the best performance. This makes sense: the early layers are too basic, but the final layers are too specialized for human language. The middle layers capture general acoustic patterns—pitch, timbre, cadence—that are applicable to monkeys as well as humans.
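Probing layers in this way is straightforward with the Hugging Face transformers API; the sketch below is an assumption-laden illustration (whisper-base, with only 6 encoder layers, is used for brevity, whereas the layer-15 result implies a larger Whisper variant).

```python
import torch
import librosa
from transformers import WhisperFeatureExtractor, WhisperModel

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-base")
model = WhisperModel.from_pretrained("openai/whisper-base").eval()

waveform, sr = librosa.load("capuchin_call.wav", sr=16000)  # hypothetical recording
inputs = feature_extractor(waveform, sampling_rate=sr, return_tensors="pt")

with torch.no_grad():
    # Ask the encoder to return the output of every layer, not just the last one.
    out = model.encoder(inputs.input_features, output_hidden_states=True)

# out.hidden_states is a tuple: (input embedding, layer 1 output, ..., final layer output).
layer_embeddings = [h.mean(dim=1).squeeze(0) for h in out.hidden_states]

# One could now train a separate classifier on each entry of layer_embeddings and plot
# F1 against layer index, looking for the intermediate-layer peak described above.
print(len(layer_embeddings), layer_embeddings[0].shape)
```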
4. Interpretable Features vs. Deep Learning
Finally, the researchers compared the “black box” Deep Learning embeddings against traditional acoustic measurements (like Peak Frequency, which can be measured manually on a spectrogram).

Figure 2 shows the distribution of Peak Frequency across different individuals. You can see distinct “bumps” (bimodal distributions) for some monkeys. While these traditional features are useful and interpretable to biologists, the study found that they captured less information than the AI embeddings. However, they remain crucial for validating that the AI is detecting real biological differences and not just background noise.
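Peak frequency is also easy to estimate programmatically. The snippet below is a minimal sketch using Welch’s method from SciPy on a hypothetical recording; the paper’s measurement protocol (window sizes, manual spectrogram inspection) may differ.

```python
import numpy as np
import librosa
from scipy.signal import welch

# Estimate the frequency with the most energy in a call.
waveform, sr = librosa.load("capuchin_call.wav", sr=None)  # hypothetical recording
freqs, psd = welch(waveform, fs=sr, nperseg=2048)          # power spectral density

peak_frequency = freqs[np.argmax(psd)]
print(f"Peak frequency: {peak_frequency:.1f} Hz")
```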
Conclusion and Implications
This research marks a significant step forward in computational bioacoustics. It demonstrates that we don’t always need massive, species-specific datasets to build powerful tools. By creatively combining models from different domains—leveraging the “wild” robustness of bird models and the structural sensitivity of human speech models—we can achieve high-accuracy identification of animals with limited data.
The success of the MRMR method in fusing these embeddings suggests a path forward for conservationists. We can envision a future where “Frankenstein” models, stitched together from various pre-trained AIs, monitor biodiversity in real-time, identifying individual animals in the rainforest canopy to track their health, social structures, and survival.
The white-faced capuchins of Taboga Reserve have shown us that in the world of AI, diversity—of models, species, and data—is the key to understanding.