Can AI Learn Language from Sound Alone? The Economics of Scaling Speech Models
Preschool children possess a remarkable ability: they learn to speak, understand syntax, and grasp semantic meaning entirely from raw sensory input—sound waves—without ever seeing a written word. This observation inspired the field of “Textless NLP,” or Generative Spoken Language Modeling (GSLM). The goal is ambitious: to train AI models to learn language directly from audio, bypassing text transcriptions entirely.
While the concept is elegant, the reality has been challenging. Despite significant advances, current Speech Language Models (SLMs) struggle to match the syntactic and semantic proficiency of their text-based cousins, Large Language Models (LLMs). An SLM might generate sounds that resemble speech, but it often lacks the coherent reasoning and grammatical structure of a model like GPT-4.
This raises a critical question for the future of AI: Is this a fundamental limitation of learning from audio, or do we simply need more computing power?
In the research paper “Scaling Properties of Speech Language Models,” Santiago Cuervo and Ricard Marxer investigate whether the famous “scaling laws” that govern text models also apply to speech. By training over 50 different models, they provide a roadmap for the future of speech AI, estimating exactly how much compute is required to bridge the gap between text and audio.
Background: From Waveforms to Tokens
To understand how we scale these models, we must first understand how a machine “reads” audio. Unlike text, which is naturally discrete (letters and words), audio is continuous.
The GSLM Pipeline
The standard approach to Generative Spoken Language Modeling involves three distinct stages (a toy end-to-end sketch in code follows the list):
- Tokenizer: A model (typically an acoustic model like HuBERT) processes raw audio waveforms and converts them into a sequence of discrete units, or “tokens.” This effectively turns continuous sound into a vocabulary of integers (e.g., 500 unique sound units).
- Language Model (LM): A Transformer model is trained on these discrete tokens, just as a text LLM is trained on words. Its job is to predict the next sound unit in the sequence.
- Vocoder: A final module takes the generated tokens and converts them back into audible waveforms.
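To make the division of labor concrete, here is a toy, self-contained sketch of the three-stage pipeline. Every function here is a deliberately simplified stand-in, not the authors' implementation: the real stages use HuBERT plus k-means clustering, a Transformer language model, and a neural vocoder.

```python
import numpy as np

N_UNITS = 500  # size of the discrete "sound vocabulary"

def tokenize(waveform: np.ndarray, n_units: int = N_UNITS) -> np.ndarray:
    """Stage 1 (toy): quantize each frame's average energy into one of n_units bins.
    A real tokenizer clusters HuBERT hidden states instead."""
    frames = waveform[: len(waveform) // 160 * 160].reshape(-1, 160)
    energy = np.abs(frames).mean(axis=1)
    bins = np.linspace(energy.min(), energy.max() + 1e-9, n_units + 1)
    return np.digitize(energy, bins) - 1

def lm_generate(units: np.ndarray, max_new: int = 50) -> np.ndarray:
    """Stage 2 (toy): sample new units from the prompt's unigram statistics.
    A real SLM is an autoregressive Transformer trained on billions of units."""
    rng = np.random.default_rng(0)
    counts = np.bincount(units, minlength=N_UNITS) + 1e-6
    return rng.choice(N_UNITS, size=max_new, p=counts / counts.sum())

def vocode(units: np.ndarray) -> np.ndarray:
    """Stage 3 (toy): map each unit back to a short constant-valued frame.
    A real vocoder is a neural network conditioned on the unit sequence."""
    return np.concatenate([np.full(160, u / N_UNITS) for u in units])

waveform = np.random.randn(16000)   # one second of fake 16 kHz audio
units = tokenize(waveform)          # audio  -> discrete tokens
continuation = lm_generate(units)   # tokens -> more tokens
audio_out = vocode(continuation)    # tokens -> audio
print(units[:10], audio_out.shape)
```

The important point is the interface: once stage 1 has turned sound into integers, stage 2 is, structurally, an ordinary language model.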
For this study, the researchers focused specifically on the Language Model component. They wanted to know if the logic governing text LMs—specifically scaling laws—could be transferred to this speech domain.
The Concept of Scaling Laws
In 2020, Kaplan et al. demonstrated that the performance of neural language models isn’t random. It follows a power law. As you increase the amount of compute (\(C\)), the number of parameters (\(N\)), or the dataset size (\(D\)), the model’s loss (error rate) decreases in a mathematically predictable way.
The general relationship is expressed as:
\[ L(C) \propto C^{-\gamma}, \qquad L(N) \propto N^{-\alpha}, \qquad L(D) \propto D^{-\beta} \]
Here, \(\gamma\), \(\alpha\), and \(\beta\) are exponents that determine how quickly the model improves as we add more resources. If these laws hold for speech, researchers can predict exactly how large a model needs to be, and how much data it needs, to achieve a specific level of intelligence.
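As a concrete illustration (with a made-up exponent, not one of the paper's fitted values): if loss follows \(L \propto C^{-\gamma}\) with \(\gamma = 0.05\), then doubling compute multiplies the loss by \(2^{-0.05} \approx 0.966\), a reduction of about 3.4%. The snippet below spells out that arithmetic.

```python
# Illustrative only: gamma = 0.05 is a made-up exponent, not a fitted value.
gamma = 0.05

def loss_ratio(compute_multiplier: float, gamma: float) -> float:
    """Factor by which loss shrinks when compute is scaled by `compute_multiplier`,
    assuming a pure power law L ∝ C^(-gamma)."""
    return compute_multiplier ** (-gamma)

for k in (2, 10, 100):
    print(f"{k:>4}x compute -> loss multiplied by {loss_ratio(k, gamma):.3f}")
# 2x -> ~0.966, 10x -> ~0.891, 100x -> ~0.794
```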
Core Method: Modeling the Scale of Speech
To test this, the authors undertook a massive training campaign. They utilized the Llama architecture, a standard in modern LLMs, but adapted it for speech tokens derived from a HuBERT tokenizer.
They trained models ranging from 20 million to 823 million parameters.

For every model size, they varied the amount of training data, creating a grid of experiments. This allowed them to decouple the effects of model size from the effects of data volume.
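A common heuristic in the scaling-law literature relates such a grid to compute: training a dense Transformer with \(N\) parameters on \(D\) tokens costs roughly \(C \approx 6ND\) FLOPs. The sketch below applies that rule of thumb to an illustrative grid; apart from the smallest and largest model sizes mentioned above, the specific values are examples rather than the paper's exact configurations.

```python
# Illustrative (N, D) grid. Intermediate sizes and token counts are examples only.
model_sizes = [20e6, 85e6, 309e6, 823e6]   # parameters N
token_counts = [1e9, 3e9, 11e9]            # training tokens D

def approx_train_flops(n_params: float, n_tokens: float) -> float:
    """Rule of thumb: ~6 FLOPs per parameter per training token."""
    return 6 * n_params * n_tokens

for n in model_sizes:
    for d in token_counts:
        print(f"N={n:8.2e}  D={d:8.1e}  C ~ {approx_train_flops(n, d):.2e} FLOPs")
```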
The Mathematical Framework
To analyze their results, the authors adopted the “Chinchilla” scaling framework proposed by Hoffmann et al. (2022) and refined by Muennighoff et al. (2023). This framework posits that the final loss of a model (\(\hat{L}\)) can be modeled as a sum of three terms:
- Irreducible Loss (\(E\)): The inherent entropy of the data (the limit of perfect modeling).
- Model Approximation Error: A term that decreases as the model size (\(N\)) increases.
- Data Estimation Error: A term that decreases as the number of training tokens (\(D\)) increases.
The core equation for a single epoch of training looks like this:
\[ \hat{L}(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}} \]
However, in the real world, we often train on data for more than one epoch (reusing data). To account for this, the researchers used a generalized version of the equation that considers “effective” parameters (\(N'\)) and “effective” data (\(D'\)), acknowledging that seeing the same data twice offers diminishing returns:
\[ \hat{L}(N, D) = E + \frac{A}{N'^{\alpha}} + \frac{B}{D'^{\beta}} \]
The goal of the research was to empirically find the constants (\(A, B, \alpha, \beta\)) for speech. By fitting this curve to their experimental data, they could determine the optimal allocation of compute.
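Finding those constants is, in essence, a curve-fitting problem. Here is a minimal sketch of fitting the single-epoch form with SciPy; the data points are synthetic placeholders, and the paper's actual analysis fits the multi-epoch (effective \(N'\), \(D'\)) variant to its real training runs.

```python
import numpy as np
from scipy.optimize import curve_fit

def chinchilla_loss(X, E, A, B, alpha, beta):
    """Single-epoch parametric loss: E + A / N^alpha + B / D^beta."""
    N, D = X
    return E + A / N**alpha + B / D**beta

# Synthetic (N, D, loss) observations standing in for the real training runs.
rng = np.random.default_rng(0)
N = rng.uniform(2e7, 8e8, size=40)            # model sizes
D = rng.uniform(1e8, 1e10, size=40)           # token counts
loss = chinchilla_loss((N, D), 1.9, 8.0, 50.0, 0.30, 0.30)
loss = loss + rng.normal(0, 0.01, size=40)    # measurement noise

params, _ = curve_fit(
    chinchilla_loss, (N, D), loss,
    p0=[1.5, 5.0, 20.0, 0.3, 0.3], maxfev=20000,
)
E, A, B, alpha, beta = params
print(f"E={E:.2f}  A={A:.1f}  B={B:.1f}  alpha={alpha:.2f}  beta={beta:.2f}")
```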
Specifically, given a fixed budget of compute (\(C_{avail}\)), how should you split it between making the model bigger (\(N\)) versus training it longer (\(D\))?
\[ N_{opt},\, D_{opt} = \underset{N,\, D \;\text{s.t.}\; \mathrm{FLOPs}(N, D) = C_{avail}}{\arg\min}\; \hat{L}(N, D) \]
The solution to this minimization problem gives us the optimal model size (\(N_{opt}\)) and dataset size (\(D_{opt}\)):
\[ N_{opt}(C) = G\left(\frac{C}{6}\right)^{a}, \qquad D_{opt}(C) = \frac{1}{G}\left(\frac{C}{6}\right)^{b}, \quad \text{where } G = \left(\frac{\alpha A}{\beta B}\right)^{\frac{1}{\alpha+\beta}},\; a = \frac{\beta}{\alpha+\beta},\; b = \frac{\alpha}{\alpha+\beta} \]
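As a worked example, the closed form turns a FLOP budget directly into a recommended split. The constants below are the same made-up placeholders used in the fitting sketch above, not the paper's fitted values.

```python
# Hypothetical fitted constants (placeholders, not values from the paper).
A, B, alpha, beta = 8.0, 50.0, 0.30, 0.30

# Closed-form compute-optimal allocation, using the C ~ 6*N*D approximation.
G = (alpha * A / (beta * B)) ** (1 / (alpha + beta))
a = beta / (alpha + beta)
b = alpha / (alpha + beta)

def optimal_allocation(compute_flops: float) -> tuple[float, float]:
    n_opt = G * (compute_flops / 6) ** a
    d_opt = (1 / G) * (compute_flops / 6) ** b
    return n_opt, d_opt

for C in (1e19, 1e20, 1e21):
    n, d = optimal_allocation(C)
    print(f"C={C:.0e} FLOPs -> N_opt ~ {n:.2e} params, D_opt ~ {d:.2e} tokens")
```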
Experiments & Results
The researchers trained their suite of models using a massive compilation of English speech datasets, including LibriSpeech, VoxPopuli, and a novel synthetic dataset (which we will discuss later). In total, they utilized nearly 11 billion speech tokens.
1. Speech Follows Power Laws
The first major finding is that speech models do behave like text models. When plotting the test loss against the compute budget (in FLOPS), the models form a clear “envelope” of performance.

As shown in Figure 1, the dashed line represents the optimal frontier. Just like in text, adding more compute reliably reduces the test loss according to a power law. This confirms that we can predict the performance of future, larger speech models before we even build them.
2. Loss Predicts Intelligence
A lower “test loss” simply means the model is better at guessing the next sound token. But does that mean it understands language?
To verify this, the researchers compared their models’ test loss against downstream linguistic tasks:
- sBLIMP: A test of syntactic consistency (grammar).
- Topic Cloze & Story Cloze: Tests of semantic understanding (staying on topic and predicting logical story endings).

Figure 4 reveals a very strong linear correlation. As the upstream test loss drops (x-axis), the performance on grammar and storytelling (y-axis) improves. This validates that optimizing for the mathematical “next token prediction” objective genuinely teaches the model language skills.
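The analysis behind such a plot is straightforward to reproduce in principle: gather (upstream loss, downstream accuracy) pairs across the trained models and check how well a line explains them. The numbers below are fabricated placeholders purely to show the computation, not data from the paper.

```python
import numpy as np

# Fabricated (upstream test loss, downstream accuracy) pairs, for illustration only.
test_loss = np.array([2.30, 2.18, 2.07, 1.98, 1.91, 1.86])
story_cloze_acc = np.array([0.52, 0.54, 0.57, 0.59, 0.61, 0.62])

# Pearson correlation (negative: lower loss goes with higher accuracy)
# and a least-squares line accuracy ~ m * loss + c.
r = np.corrcoef(test_loss, story_cloze_acc)[0, 1]
m, c = np.polyfit(test_loss, story_cloze_acc, deg=1)
print(f"correlation r = {r:.3f}, slope = {m:.3f} accuracy per unit of loss")
```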
3. The Efficiency Gap: Speech vs. Text
Here lies the most critical contribution of the paper. While speech models do scale, the researchers compared their scaling trajectory against text-based LLMs (specifically the Pythia suite).
The results, shown in Figure 2, highlight a stark difference in efficiency.

Both text (black squares) and speech (green circles) improve with more compute. However, look at the slopes. The text models improve much faster.
The authors quantified this by calculating the scaling exponents (\(\gamma_q\)) for both modalities.

As Table 4 shows, for syntactic evaluation (BLiMP for text and its spoken counterpart sBLIMP), the exponent for text is 0.066, while for speech it is 0.021.
What does this mean practically? It means speech models are significantly less efficient learners. To achieve the same gain in syntactic performance, a speech model requires vastly more compute than a text model. The authors estimate that speech linguistic performance scales up to three orders of magnitude (\(10^3\)) more slowly than text.
If you want a Speech LLM to match the grammar of GPT-3, you might need 1,000 times the computing power used to train GPT-3.
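To see where this kind of estimate comes from, suppose downstream performance in each modality improves as \(C^{\gamma_q}\) from a common baseline. For the same proportional gain, speech then needs the text model's compute multiplier raised to the power \(\gamma_{text} / \gamma_{speech}\). The snippet below runs that back-of-the-envelope arithmetic with the syntactic exponents quoted above; it is a simplified reading of the exponents, not the paper's exact estimate.

```python
# Syntactic scaling exponents quoted above (from Table 4 of the paper).
gamma_text, gamma_speech = 0.066, 0.021

# If text needs k times more compute for some proportional gain, speech needs
# k ** (gamma_text / gamma_speech) times more compute for the same gain,
# assuming both modalities start from the same baseline.
exponent = gamma_text / gamma_speech   # ~3.14
for k in (2, 10, 100):
    print(f"text needs {k:>3}x compute -> speech needs ~{k ** exponent:,.0f}x")
# e.g., a gain that costs text 10x more compute costs speech ~1,400x under this toy assumption.
```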
4. Improving Semantics with Synthetic Data
Why are speech models so inefficient? One hypothesis is the quality of the data. Standard speech datasets (like audiobooks) often contain long gaps or complex dependencies that are hard to capture in the limited context window of a model (2050 tokens).
To address this, the authors created sTinyStories. They took the “Tiny Stories” text dataset—simple narratives designed to teach reasoning to small models—and synthesized it into speech.

The results in Figure 3 are encouraging. Models trained on the synthetic sTinyStories (orange lines) consistently outperformed models trained on standard audiobooks (blue lines) in semantic tasks (Topic Cloze), even when evaluated on real human speech. This suggests that the content of the training data matters just as much as the quantity.
5. The Tokenization Bottleneck
Finally, the authors tried to pack more information into the model’s context by applying “Unigram” tokenization on top of the speech units. This compresses the sequence length, allowing the model to see more “time” within its context window.
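For intuition about what this compression does, here is a toy, self-contained sketch of unigram-style segmentation: given a small vocabulary of multi-unit “pieces” with probabilities, Viterbi decoding picks the segmentation that maximizes the product of piece probabilities, shortening the token sequence the LM actually sees. The piece inventory and probabilities are invented for illustration and are unrelated to the paper's actual tokenizer.

```python
import math

# Toy unigram vocabulary over HuBERT-style unit IDs. Pieces and probabilities
# are invented; a real unigram tokenizer learns them from data.
pieces = {
    (17,): 0.10, (42,): 0.10, (99,): 0.05,
    (17, 42): 0.30,        # a frequent two-unit chunk gets its own piece
    (42, 99, 17): 0.20,    # ...and so does a three-unit chunk
}

def unigram_segment(units: tuple[int, ...]) -> list[tuple[int, ...]]:
    """Viterbi segmentation maximizing the sum of log piece probabilities."""
    n = len(units)
    best = [(-math.inf, -1)] * (n + 1)   # (score, backpointer) per prefix length
    best[0] = (0.0, -1)
    for end in range(1, n + 1):
        for start in range(max(0, end - 3), end):   # pieces up to length 3
            piece = units[start:end]
            if piece in pieces and best[start][0] > -math.inf:
                score = best[start][0] + math.log(pieces[piece])
                if score > best[end][0]:
                    best[end] = (score, start)
    assert best[n][0] > -math.inf, "sequence cannot be segmented with this vocabulary"
    out, i = [], n
    while i > 0:                          # backtrack to recover the chosen pieces
        start = best[i][1]
        out.append(units[start:i])
        i = start
    return out[::-1]

raw_units = (17, 42, 99, 17, 42, 42, 99, 17)
segmented = unigram_segment(raw_units)
print(len(raw_units), "units ->", len(segmented), "tokens:", segmented)
```

Fewer, denser tokens mean more audio fits in the context window, but, as the results below show, what the tokens preserve matters as much as how many of them there are.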

Surprisingly, this backfired. As shown in Figure 5, while the test loss (upstream metric) looked good, the downstream performance (actual intelligence) degraded. The “Story Cloze” metric (bottom right of Figure 5) completely flattened out, implying the model stopped learning semantics altogether with the compressed tokens. This indicates that current compression methods might be discarding essential linguistic information.
Conclusion & Implications
This research provides the first comprehensive “economic” analysis of Textless NLP. The authors have successfully established that Speech Language Models obey the same fundamental laws of physics—or rather, mathematics—as text models.
The implications are two-sided:
- The Optimistic View: It is theoretically possible to build a “Speech GPT” that learns purely from audio. We don’t need new architectures; we just need to scale up along the predictable power-law curve.
- The Realistic View: The cost is prohibitive. Because speech is so much less information-dense than text, reaching human-level proficiency purely from audio requires astronomical amounts of compute.
The study suggests that while scaling is the answer, “brute force” scaling might not be the solution. Future breakthroughs will likely need to come from better ways to tokenize audio (increasing information density) or hybrid approaches that leverage the efficiency of text models while retaining the richness of speech. Until then, we know exactly what the price of pure audio learning is—and it is expensive.