Introduction
We have all been there. You ask a voice assistant a simple question like, “How do I choose a new phone?” and it responds by reading a five-paragraph essay out loud. It drones on, listing “bullet point one,” reading out URLs, or using complex sentence structures that force you to concentrate intensely just to follow the logic. By the time it finishes, you have forgotten how the answer began.
This creates a fundamental friction in modern AI. While Large Language Models (LLMs) like GPT-4 or Llama have revolutionized information retrieval, they are primarily trained to generate text. We read text with our eyes, which allows us to scan, skip ahead, and re-read complex clauses. However, when we interact with AI via voice—through smart speakers, car assistants, or accessibility tools—we are using our ears.
Speech is serial and transient. Once a word is spoken, it is gone. You cannot “scan” an audio wave with your ears the way you scan a paragraph with your eyes.
In a fascinating research paper titled “Speechworthy Instruction-tuned Language Models,” researchers from USC and Amazon investigate this modality mismatch. They argue that models aligned purely on text preferences fail to meet the unique cognitive requirements of spoken conversation. Their work proposes a new framework for creating “Speechworthy” LLMs using insights from the radio industry, novel prompting strategies, and a unique speech-based preference dataset.

The Problem: The Text-Speech Mismatch
To understand why current AI struggles with voice, we must look at how it is trained. Most instruction-tuned language models (ITLMs) are fine-tuned using Reinforcement Learning from Human Feedback (RLHF). Human annotators read two text responses and pick the best one.
The issue is that annotators who read responses tend to favor detail, comprehensive lists, and structured formatting (such as bold text or bullet points). As shown in Figure 1 above, a standard model (OLMo 7B Instruct) generates a response full of lists and emojis. While this looks great on a screen, it sounds unnatural when synthesized by a Text-to-Speech (TTS) engine.
The researchers hypothesize that listening imposes a higher cognitive load than reading. Therefore, the optimal response for a voice interface is not just a transcript of a good text response. It needs to be:
- Concise: Avoiding unnecessary fluff.
- Linear: Simple sentence structures that don’t require holding multiple clauses in memory.
- Vocalizable: Free of emojis, URLs, and formatting syntax.
To validate this, the researchers surveyed users, asking them to evaluate responses in both text and audio formats. The results were stark: users penalized spoken responses heavily for being “too long” or containing “too much information,” even if those same responses were rated highly in text format.
Methodology: Teaching AI to Speak, Not Just Write
The core contribution of this paper is a systematic approach to adapting LLMs for speech. The authors explore two primary methodologies: Prompt Engineering grounded in radio journalism, and Preference Learning using a novel dataset.

As illustrated in Figure 2, the process begins by defining what “good” speech looks like, generating diverse samples, and then training the model to prefer those samples.
1. Prompt Engineering: The “NPR” Approach
Before training a model, you can guide it. The researchers turned to an unexpected source for guidelines: the radio industry. Decades of experience in radio journalism have established “rules of thumb” for writing for the ear:
- Use simple words and sentence structures.
- Avoid “tongue twisters” and excessive alliteration.
- Avoid hyphenated adjectives (e.g., “mineral-rich”).
- Keep it conversational.
The researchers developed specific System Prompts—instructions given to the AI before it generates a response—that enforce these rules. They also utilized In-Context Learning (ICL), where the model is shown examples of “bad” text responses converted into “good” speech responses.
For example, a standard prompt might simply be: “You are a helpful assistant.” The speech-optimized prompt would be: “You are a helpful voice assistant. Respond colloquially using simple vocabulary… Keep your response compact… Do not use bullet lists.”
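As a rough illustration, here is how such a speech-oriented system prompt and a single in-context example might be assembled into a chat-style message list. The prompt wording and the example pair below are paraphrased assumptions, not the paper's exact prompts:

```python
# A minimal sketch (not the paper's exact prompts) of assembling a
# speech-oriented system prompt plus an in-context example for a chat model.
SPEECH_SYSTEM_PROMPT = (
    "You are a helpful voice assistant. Respond colloquially using simple "
    "vocabulary and simple sentence structures. Keep your response compact. "
    "Do not use bullet lists, emojis, URLs, or other formatting that cannot "
    "be spoken aloud."
)

# One hypothetical ICL pair: a query answered the way it should be heard.
ICL_EXAMPLES = [
    {
        "query": "How do I choose a new phone?",
        "speech_response": (
            "Start by thinking about what you use your phone for most, like "
            "photos, battery life, or price. Then compare two or three models "
            "that fit that. Want me to walk you through a couple of options?"
        ),
    },
]

def build_messages(user_query: str) -> list[dict]:
    """Assemble a chat-style message list: system prompt, ICL demos, then the query."""
    messages = [{"role": "system", "content": SPEECH_SYSTEM_PROMPT}]
    for ex in ICL_EXAMPLES:
        messages.append({"role": "user", "content": ex["query"]})
        messages.append({"role": "assistant", "content": ex["speech_response"]})
    messages.append({"role": "user", "content": user_query})
    return messages

print(build_messages("What's a good way to start running?"))
```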
2. Preference Learning: Training on What We Hear
While prompting is effective, it is brittle. To bake these behaviors into the model weights, the researchers needed data. They created SpeechPref, a dataset of over 20,000 response pairs.
Crucially, the annotators for this dataset did not read the responses. They listened to them.

As shown in Figure 9, annotators were presented with an interface where the text was hidden. They listened to the user query and two generated responses (converted to audio via Amazon Polly), then selected the winner. This ensured that the preference signal captured the nuances of pacing, length, and listenability that are lost in text-based annotation.
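As a sketch of what such a listening-based annotation pipeline could look like, the snippet below converts two candidate responses to audio with Amazon Polly (which the paper reports using) via boto3. The voice, output format, and file handling are assumptions rather than the authors' exact setup:

```python
# Rough sketch: turn a pair of candidate responses into audio clips for
# listening-based annotation using Amazon Polly via boto3. Requires AWS
# credentials; voice and format choices here are illustrative assumptions.
import boto3

polly = boto3.client("polly")

def synthesize(text: str, path: str, voice_id: str = "Joanna") -> None:
    """Convert one response to an MP3 file that an annotator can listen to."""
    response = polly.synthesize_speech(
        Text=text,
        OutputFormat="mp3",
        VoiceId=voice_id,
    )
    with open(path, "wb") as f:
        f.write(response["AudioStream"].read())

# Each annotation item: a query plus two candidate responses, heard rather than read.
response_a = "Think about what matters most to you, like camera or battery, then compare a couple of models in your price range."
response_b = "Here are 10 factors to consider: 1. Operating system ..."

synthesize(response_a, "response_a.mp3")
synthesize(response_b, "response_b.mp3")
# Annotators listen to both clips and pick the one they would rather hear.
```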
The Training Algorithms: PPO vs. DPO
With this new dataset, the researchers fine-tuned models using two popular alignment algorithms:
- Proximal Policy Optimization (PPO): This is the classic RLHF method used to train ChatGPT. It involves training a separate “Reward Model” to predict how much a human would like a response, and then optimizing the language model to maximize that reward.
- Direct Preference Optimization (DPO): A newer, more stable approach that optimizes the model directly on the preference data without needing a separate reward model loop (a minimal sketch of its objective follows this list).
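To make the contrast concrete, here is a minimal PyTorch sketch of the standard DPO objective; the variable names, batching, and beta value are illustrative rather than the authors' implementation:

```python
# A minimal PyTorch sketch of the standard DPO objective (not the authors' code).
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_logps_chosen: torch.Tensor,    # log P_policy(y_c | x), summed over tokens
    policy_logps_rejected: torch.Tensor,  # log P_policy(y_r | x)
    ref_logps_chosen: torch.Tensor,       # log P_reference(y_c | x)
    ref_logps_rejected: torch.Tensor,     # log P_reference(y_r | x)
    beta: float = 0.1,
) -> torch.Tensor:
    """Push the policy to prefer the chosen response directly, with no reward model."""
    chosen_logratio = policy_logps_chosen - ref_logps_chosen
    rejected_logratio = policy_logps_rejected - ref_logps_rejected
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```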
For the PPO implementation, the researchers trained a reward model using the following pairwise binary ranking loss:
\[ \mathcal{L}_{\mathrm{ranking}} = -\log\big(\sigma\big(r_{\theta}(x, y_c) - r_{\theta}(x, y_r)\big)\big) \]
In this equation, \(r_{\theta}(x, y)\) is the reward model's score for response \(y\) given the prompt \(x\). The goal is to maximize the gap between the score of the chosen response (\(y_c\)) and that of the rejected response (\(y_r\)). By training on heard preferences, the reward model learns to penalize verbosity and complexity in ways a text-based reward model never would.
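In code, this reward-model objective is essentially a one-liner. Below is a hedged PyTorch sketch, assuming the reward scores come from a scalar-head reward model:

```python
# PyTorch sketch of the pairwise ranking loss above; shapes are illustrative.
import torch
import torch.nn.functional as F

def reward_ranking_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """-log(sigmoid(r(x, y_c) - r(x, y_r))), averaged over the batch."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy batch of 4 preference pairs: chosen responses should end up scoring higher.
print(reward_ranking_loss(torch.randn(4), torch.randn(4)))
```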
Experiments and Results
The researchers applied these methods to two open-source models: Falcon 7B and OLMo 7B. They compared the “Base” models against versions adapted via Prompting, In-Context Learning (ICL), and DPO.
Human Evaluation
The results were overwhelmingly positive. As seen in the charts below, every adaptation technique significantly outperformed the base model in head-to-head comparisons.

The green bars represent “Wins,” where the adapted model was preferred over the base model.
- Prompting works: Just changing the system prompt (OLMo-Prompt) resulted in a 57% win rate against the base model.
- Fine-tuning works better: The DPO models showed strong performance.
- The Combination is King: The most interesting finding is that these methods are additive. The best performance came from DPO-ICL—a model fine-tuned with DPO that also used the optimized system prompt and examples. For Falcon, the DPO-ICL model achieved a staggering 75% win rate against the original model.
This suggests that prompting helps guide the model during the generation process, while preference learning fundamentally alters the model’s probability distribution to favor speech-worthy tokens.
Why does the combination work?
The researchers analyzed the training trajectory to understand why combining Prompts (ICL) with DPO worked so well.

Figure 4 shows the validation margins (the confidence with which the model distinguishes good from bad responses). DPO-ICL (maroon line) consistently achieves higher margins faster than base DPO. Effectively, the prompts act as “training wheels,” helping the model identify the desired speech characteristics much earlier in the fine-tuning process.
Even GPT-4 Needs Help
The researchers didn’t just stop at smaller open-source models. They also tested their prompting strategies on GPT-4.

Even though GPT-4 is a massive, state-of-the-art model, the base version struggles with speech suitability. By simply applying the “Speechworthy” prompts (GPT-4-Prompt and GPT-4-ICL), the researchers saw substantial improvements in win rates. This confirms that speech unsuitability is not a lack of intelligence; it is a lack of alignment with the audio modality.
What Makes a Response “Speechworthy”?
So, what exactly changed in the text? Did the models just stop talking as much? The researchers conducted an automatic evaluation to measure specific linguistic features.

Table 5 provides a quantitative look at the changes (a rough sketch of how these metrics can be computed follows the list):
- Word Count (\(\downarrow\)): The adapted models (DPO-ICL) drastically reduced response length. For OLMo, the word count dropped from ~211 to ~95.
- Flesch Reading Ease (FRE \(\uparrow\)): This metric estimates how easy a text is to read (and hear). Higher scores are better. The Falcon DPO-ICL model improved from 49.74 to 69.52.
- Non-vocal Characters (NV \(\downarrow\)): The models learned to stop generating markup, emojis, and list formatting that clutters TTS output.
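Here is a rough sketch of how these automatic metrics might be computed, assuming the textstat package for Flesch Reading Ease; the regex defining “non-vocal” characters is an illustrative approximation, not the paper's exact definition:

```python
# Approximate versions of the automatic metrics: word count, Flesch Reading
# Ease (via textstat), and a crude count of non-vocal content (markup, URLs, emojis).
import re
import textstat

NON_VOCAL_PATTERN = re.compile(r"[*_#`~\[\]<>|]|https?://\S+|[\U0001F300-\U0001FAFF]")

def speech_metrics(response: str) -> dict:
    """Compute word count, Flesch Reading Ease, and a count of non-vocal matches."""
    return {
        "word_count": len(response.split()),
        "flesch_reading_ease": textstat.flesch_reading_ease(response),
        "non_vocal_chars": len(NON_VOCAL_PATTERN.findall(response)),
    }

print(speech_metrics("Sure! 😊 Check https://example.com for **10 tips**."))
print(speech_metrics("One way is to ask about their day. Would you like more tips?"))
```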
However, the qualitative analysis reveals that it is not just about being short. It is about being conversational.

In the comparison above (Table 10), look at the example “How can I initiate conversation with a stranger?”
- GPT-4 (Right): Uses a numbered list (“Step one…”, “Second…”, “Third…”). While accurate, it feels mechanical.
- Falcon DPO-ICL (Left): Uses natural transition words: “One way is by… Another way is by…” It concludes with a conversational check-in: “Would you like more tips?”
This distinction—using linguistic connectors instead of structural formatting—is the hallmark of true speech alignment.
Conclusion and Implications
The paper “Speechworthy Instruction-tuned Language Models” highlights a critical gap in current AI development. As we move toward ubiquitous voice computing—from Star Trek-style computers to AI pins and smart glasses—we cannot simply rely on text-based models equipped with a TTS voice skin.
The researchers demonstrated that modality matters. What we prefer to read is distinct from what we prefer to hear. By combining radio-style prompting with preference learning on actual audio data (SpeechPref), we can build models that feel natural, concise, and helpful to listen to.
The implications extend beyond just convenience. For users with visual impairments or low literacy, voice is a primary interface. Improving the “listenability” of these models is an accessibility imperative.
Key Takeaways:
- Don’t list, narrate: Voice assistants should avoid bullet points in favor of conversational transitions.
- Less is more: Spoken responses need to be significantly shorter than written ones to reduce cognitive load.
- Listen to your data: To train a voice model, you must evaluate it with your ears, not your eyes.
As LLMs continue to evolve, this work serves as a reminder that “alignment” isn’t a single target. A model aligned for a chatbot is not aligned for a voice assistant. To build the future of spoken AI, we must start training models that know when to stop writing and start talking.