Introduction: The “Speech Divide”
If you are reading this, chances are you have used a voice assistant like Siri, Alexa, or Google Assistant. You might have even marveled at how accurate automated subtitles on YouTube have become. For speakers of English, French, or Spanish, we are living in the golden age of Automatic Speech Recognition (ASR). Large-scale models and self-supervised learning (SSL) have solved most transcription problems for these “resource-rich” languages.
But what happens if you speak a language that varies wildly depending on which city you are in? What if the “official” version of your language—the one used in books and news—is almost never spoken in daily life?
This is the reality for over 450 million Arabic speakers. While ASR models have improved for Modern Standard Arabic (MSA), they struggle significantly with the rich tapestry of daily dialects. A model trained on news broadcasts in Cairo will likely fail to understand a casual conversation in Casablanca.
In this post, we are diving deep into a new research paper titled “Casablanca: Data and Models for Multidialectal Arabic Speech Recognition.” The researchers behind this work have undertaken a massive community-driven effort to bridge this technological divide. They created a new dataset—dubbed Casablanca—that captures the nuance, code-switching, and diversity of eight distinct Arabic dialects.
We will explore why Arabic presents such a unique challenge for AI, how this dataset was painstakingly constructed, and what the results tell us about the current limitations of state-of-the-art models like Whisper.
Background: The Linguistic Labyrinth of Arabic
To understand the magnitude of the contribution made by the Casablanca project, we first need to understand the linguistic landscape of the Arab world.
The Challenge of Diglossia
Arabic is characterized by diglossia. This means there are two distinct forms of the language used in different contexts:
- Modern Standard Arabic (MSA): Used in formal settings, media, education, and government. It is the written standard, but no one speaks it as a mother tongue.
- Dialects (Colloquial Arabic): These are the languages of daily life, street markets, family dinners, and TV dramas. They vary by country, region, and even city.
For an ASR model, this is a nightmare. A system trained on MSA (which makes up the bulk of available training data) will encounter vocabulary, grammar, and pronunciation in dialects that it has never seen before. To make matters more complex, speakers often engage in code-switching, seamlessly blending Arabic with English, French, or Berber depending on the region’s colonial history and the speaker’s education.
The Data Gap
Prior to Casablanca, datasets for Arabic ASR fell into two categories:
- MSA-heavy: Datasets like MGB-2 contain thousands of hours of speech, but roughly 78% of it is MSA (news broadcasts).
- Lightly Supervised: Many large datasets use “light supervision,” meaning the subtitles or transcripts were generated by other algorithms or loose alignment techniques. This introduces noise and errors.
Casablanca aims to fix this by providing fully supervised (human-transcribed), fine-grained dialectal data.
[Table 1: Comparison of Casablanca with prior Arabic ASR datasets by total hours, supervision type, and dialect coverage]
As shown in the table above (Table 1), while Casablanca is smaller in total hours (48 hours) compared to giants like MGB-2, it is fully supervised. It also covers a wider range of dialects, including zero-resource dialects like Mauritanian and Yemeni, which have rarely, if ever, been studied in NLP.
The Core Method: Building Casablanca
The researchers didn’t just scrape the web; they built a community. The creation of Casablanca involved a year-long effort with a team of native speakers, researchers, and annotators.
1. Data Selection: The “Realness” of TV Series
To capture authentic dialectal speech, the team turned to YouTube. Specifically, they curated episodes from TV series produced in eight different countries: Algeria, Egypt, UAE, Jordan, Mauritania, Morocco, Palestine, and Yemen.
Why TV series? Unlike news broadcasts (which use MSA), TV dramas reflect how people actually talk. They contain slang, emotions, interruptions, and cultural nuances. The researchers selected episodes that featured diverse geographic settings within each country to capture sub-dialects (or “micro-dialects”).
2. Geographic Distribution and Gender
The scope of this project spans the breadth of the Arab world, from the Atlantic Ocean (Mauritania/Morocco) to the Arabian Peninsula (Yemen/UAE).
[Figure 1: Geographic distribution of the eight covered dialects, with hours and male/female speech percentages per country]
Figure 1 above visualizes this distribution. You can see the specific breakdown of hours per country. However, the figure also highlights a challenge in the dataset: Gender Bias. If you look at the “Male” vs. “Female” percentages, there is a clear male dominance in the data. For example, the Palestinian subset is over 92% male speech, while Morocco is more balanced (57% male). This is a limitation inherent in the source material (the TV shows selected) and is something the authors transparently acknowledge as a potential source of bias in downstream tasks.
3. The Annotation Process
This is where Casablanca shines. The team employed 27 native-speaker annotators. The audio was split into “snippets” using Voice Activity Detection (VAD) to remove silence and music.
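The exact VAD tooling isn’t specified here, so below is a minimal sketch of the snippeting idea using the open-source py-webrtcvad package (my choice of library, not necessarily the authors’): classify short fixed-size frames as speech or non-speech, then keep contiguous speech runs as snippets.

```python
# A minimal VAD sketch, assuming 16-bit mono PCM audio at a rate
# webrtcvad supports (8/16/32/48 kHz). Not the paper's exact pipeline.
import wave
import webrtcvad  # pip install webrtcvad

vad = webrtcvad.Vad(2)   # aggressiveness: 0 (least strict) to 3 (most)
FRAME_MS = 30            # webrtcvad accepts 10/20/30 ms frames

with wave.open("episode.wav", "rb") as wf:  # placeholder file name
    sample_rate = wf.getframerate()
    frame_bytes = int(sample_rate * FRAME_MS / 1000) * 2  # 2 bytes/sample
    pcm = wf.readframes(wf.getnframes())

# Label each frame; contiguous True runs become candidate "snippets",
# while silence and music frames are dropped.
speech_flags = []
for start in range(0, len(pcm) - frame_bytes + 1, frame_bytes):
    frame = pcm[start:start + frame_bytes]
    speech_flags.append(vad.is_speech(frame, sample_rate))
```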
The annotation wasn’t just typing what was heard. It was a multi-layered classification task:
- Transcription: Writing down the spoken words. Since dialects don’t have a standardized spelling system (unlike MSA), annotators were instructed to write as they would in daily digital communication.
- Segmentation: Identifying if a segment was Dialect, MSA, or background noise.
- Gender Labeling: Tagging the speaker as Male or Female.
- Code-Switching: This is critical. If a speaker switched to English or French, the annotator provided the word in Latin script (e.g., “professional”) and a transliterated Arabic script version.
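To make those four layers concrete, here is a purely illustrative annotation record. The schema, field names, ID format, and the transliteration are hypothetical; the real Casablanca release may structure this differently.

```python
# Hypothetical record combining the four annotation layers described
# above; field names and values are illustrative assumptions only.
annotation = {
    "snippet_id": "jor_ep03_00142",        # made-up ID format
    "transcript": "بدي أكون professional بشغلي",
    "segment_type": "dialect",             # dialect | msa | noise
    "gender": "male",
    "code_switches": [
        # Latin-script original plus an Arabic-script transliteration
        {"latin": "professional", "arabic_transliteration": "بروفيشنال"}
    ],
}
```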
[Figure 3: The Label Studio annotation interface, aligning transcripts with audio waveforms]
As seen in Figure 3, the annotators worked with a sophisticated interface (Label Studio) to align text precisely with audio waveforms. This “fully supervised” human touch ensures that the ground truth is accurate, unlike the “lightly supervised” datasets that rely on imperfect alignment algorithms.
Dataset Statistics and Linguistic Diversity
The final dataset comprises roughly 48 hours of high-quality, annotated speech. But the “quality” isn’t just about audio clarity; it’s about the linguistic density.
[Table 2: Per-dialect statistics, including speaking rate (WPS) and code-switching (CS) frequency]
Table 2 (above) reveals some fascinating linguistic characteristics of the different dialects:
- Speed: The Moroccan dialect is the “fastest” spoken dialect in the dataset, clocking in at 3.2 words per second (WPS). Contrast this with the Jordanian dialect, which is the slowest at 1.2 WPS.
- Code-Switching (CS): Look at the “CS” column. The North African dialects (Algeria, Morocco) have high instances of code-switching (mostly French), while Yemen has almost none. This reflects the historical and colonial context of these regions.
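The words-per-second statistic itself is easy to reproduce from a transcript and its audio duration. Here is a toy sketch; the whitespace tokenization and the example numbers are assumptions for illustration.

```python
# Words per second = token count / duration in seconds.
def words_per_second(transcript: str, duration_s: float) -> float:
    return len(transcript.split()) / duration_s

# e.g. 16 words spoken over a 5-second snippet -> 3.2 WPS (Moroccan pace)
print(words_per_second(" ".join(["word"] * 16), 5.0))  # 3.2
```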
The Complexity of Dialect Variation
One of the biggest hurdles in Arabic NLP is that the same concept can be said—and written—in many different ways depending on where you are.
[Table 8: The same everyday words rendered across MSA and the eight dialects]
Table 8 provides a glimpse into this variety. Look at the word for “What.” In MSA, it is “ماذا” (Madha). In Algerian, it can be “واش” (Wash), “شوالا” (Shawala), or “واشنو” (Washno). A model trained only on MSA would likely treat “Washno” as a nonsense word or a proper noun, completely missing the interrogative nature of the sentence.
The Challenge of Code-Switching
The dataset also captures how foreign words are integrated.
[Table 10: Example transcripts with code-switched words highlighted in teal]
In Table 10 (above), you can see the teal-colored words representing code-switching.
- Jordanian Example: “professional” and “international” are used in the middle of an Arabic sentence.
- Moroccan/Algerian Examples: French words like “l’affaire” (the affair/business) or “préparation” are woven in.
The annotators provided both the Latin script and the Arabic transliteration, making this dataset uniquely suited for training models to handle mixed-language speech.
Experiments & Results: How Do Modern Models Fare?
The researchers used Casablanca to benchmark the performance of current State-of-the-Art (SoTA) speech models. They set up two main scenarios:
- General Models (Zero-Shot): Testing massive multilingual models like Whisper (v2/v3), SeamlessM4T, and MMS without any specific training on this dataset.
- Dedicated Models: Testing models that had been previously fine-tuned on Arabic data (MSA, Egyptian, Moroccan).
They measured performance using Word Error Rate (WER). In ASR, lower WER is better: 0 means a perfect transcript, and scores can exceed 100 when the model inserts or hallucinates extra words.
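To make the setup concrete, here is a minimal sketch of a zero-shot evaluation loop, assuming the open-source openai-whisper and jiwer packages; the paper’s exact pipeline, preprocessing, and text normalization are not reproduced here.

```python
import jiwer    # pip install jiwer
import whisper  # pip install openai-whisper

model = whisper.load_model("large-v3")

# Zero-shot: no fine-tuning, just point the model at a dialectal snippet.
# The file name and reference string are stand-ins for real data.
result = model.transcribe("snippet.wav", language="ar")
hypothesis = result["text"]
reference = "the human-verified transcript of the same snippet"

# WER = (substitutions + deletions + insertions) / reference word count
print(f"WER: {jiwer.wer(reference, hypothesis) * 100:.2f}")
```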
Scenario 1: The General Model Failure
[Table 3: Zero-shot WER of general multilingual models (Whisper, SeamlessM4T, MMS) on each dialect]
The results in Table 3 are stark.
- High Error Rates: Even the mighty Whisper-large-v3 struggles. Without preprocessing, the average WER is 69.49. For context, a usable commercial ASR system usually targets a WER below 10-15.
- Dialect Disparity: Look at the difference between Egyptian (WER ~48 with Whisper v3) and Algerian (WER ~84). Egyptian is the most widely understood dialect in the Arab world (largely due to Egyptian cinema), so general models have likely seen more of it during pre-training. Algerian, with its heavy French influence and unique vocabulary, breaks the models.
- MMS Performance: The MMS model, which was trained largely on religious texts (MSA), performed the worst, highlighting that domain matters as much as language.
Scenario 2: Vocabulary Overlap and Dedicated Models
The researchers then tested models fine-tuned on specific dialects. Interestingly, they found that a model fine-tuned on Egyptian data performed surprisingly well on other dialects like Yemeni and Jordanian. Why?
[Figure 2: Heatmap of pairwise vocabulary intersection across the eight dialects]
Figure 2 explains this phenomenon using a heatmap of Vocabulary Intersection.
- The Egyptian Hub: Egyptian (EGY) shares a significant amount of vocabulary with Levantine dialects (Jordan/Palestine) and Gulf dialects (UAE/Yemen). This linguistic proximity allows an Egyptian model to “transfer” its knowledge.
- The Maghreb Island: Look at the bottom right of the heatmap (Morocco/Algeria). They share vocabulary with each other but have very low intersection with the eastern dialects. This explains why the Egyptian model failed on Moroccan speech, and why a dedicated Moroccan model was necessary.
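The computation behind such a heatmap is straightforward. Below is a sketch with toy stand-in data; the authors’ exact tokenization, normalization, and overlap metric may differ (the overlap coefficient here is my assumption).

```python
from itertools import combinations

def vocab(transcripts):
    """Collect the set of unique whitespace tokens across transcripts."""
    return {tok for line in transcripts for tok in line.split()}

# Toy stand-in data; in practice these would be Casablanca transcripts.
dialect_transcripts = {
    "EGY": ["انت فين", "ايه الاخبار"],
    "JOR": ["وين انت", "شو الاخبار"],
}

vocabs = {d: vocab(t) for d, t in dialect_transcripts.items()}
for a, b in combinations(vocabs, 2):
    # Overlap coefficient: |A ∩ B| / min(|A|, |B|)
    score = len(vocabs[a] & vocabs[b]) / min(len(vocabs[a]), len(vocabs[b]))
    print(f"{a}-{b}: {score:.2f}")
```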
The Code-Switching Struggle
Perhaps the most concerning (and interesting) result came from testing how Whisper handles code-switching.
[Table 12: Whisper outputs on code-switched sentences under different language settings]
Table 12 shows what happens when Whisper tries to transcribe sentences with English words.
- Hallucination & Translation: In the row labeled “CS-EN” (where the model is told the language is English), Whisper fails to transcribe the Arabic portions entirely, often translating the sentence into English instead of transcribing it.
- Script Confusion: Even in “Auto” mode, the model struggles to decide whether to write the English word in Latin letters or Arabic letters.
This proves that current large-scale models are not robust enough to handle the fluid, mixed-language reality of modern Arabic speakers.
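For readers who want to run this kind of probe themselves, here is a minimal sketch using openai-whisper: transcribe the same code-switched clip with the language forced to English, forced to Arabic, and left on auto-detect, then compare the outputs. The clip name is a placeholder, and this mirrors the spirit of the Table 12 conditions rather than the paper’s exact protocol.

```python
import whisper  # pip install openai-whisper

model = whisper.load_model("large-v3")
clip = "code_switched_snippet.wav"  # placeholder for a real CS snippet

# language="en" mirrors the "CS-EN" row; language=None is "Auto" mode.
for lang in ("en", "ar", None):
    out = model.transcribe(clip, language=lang)
    print(f"[{lang or 'auto'}] {out['text']}")
```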
Conclusion and Implications
The Casablanca paper is more than just a new dataset; it is a wake-up call for the speech processing community. It highlights three critical takeaways:
- Data Scarcity is a Blocker: You cannot solve Arabic ASR by just throwing more MSA data at the problem. We need high-quality, human-labeled dialectal data.
- One Model Does Not Fit All: The linguistic distance between Moroccan and Yemeni is too vast for a single “Arabic” model to handle perfectly without specific representation in the training data.
- The “YouTube” Domain: Models trained on formal news (MGB-2) fail on casual TV series. To build AI that works for regular people, we need data from where regular people speak.
By releasing Casablanca, the authors have provided the roadmap and the fuel for the next generation of Arabic ASR systems. Future models trained on this data will likely be more inclusive, finally allowing a grandmother in Mauritania or a teenager in Yemen to interact with technology in their own voice.
The project page for Casablanca is accessible for those interested in the raw data and further technical details.