Introduction

We have all been there. You ask a voice assistant a simple question like, “How do I choose a new phone?” and it responds by reading a five-paragraph essay out loud. It drones on, listing “bullet point one,” reading out URLs, or using complex sentence structures that force you to concentrate intensely just to follow the logic. By the time it finishes, you have forgotten the beginning of the sentence.

This creates a fundamental friction in modern AI. While Large Language Models (LLMs) like GPT-4 or Llama have revolutionized information retrieval, they are primarily trained to generate text. We read text with our eyes, which allows us to scan, skip ahead, and re-read complex clauses. However, when we interact with AI via voice—through smart speakers, car assistants, or accessibility tools—we are using our ears.

Speech is serial and transient. Once a word is spoken, it is gone. You cannot “scan” an audio wave with your ears the way you scan a paragraph with your eyes.

In a fascinating research paper titled “Speechworthy Instruction-tuned Language Models,” researchers from USC and Amazon investigate this modality mismatch. They argue that models aligned purely on text preferences fail to meet the unique cognitive requirements of spoken conversation. Their work proposes a new framework for creating “Speechworthy” LLMs using insights from the radio industry, novel prompting strategies, and a unique speech-based preference dataset.

Figure 1: Current instruction-tuned language models tend to generate verbose responses with non-vocalizable content, such as bullet lists or parentheses, making them unsuitable for delivery as speech by voice assistants.

The Problem: The Text-Speech Mismatch

To understand why current AI struggles with voice, we must look at how it is trained. Most instruction-tuned language models (ITLMs) are fine-tuned using Reinforcement Learning from Human Feedback (RLHF): human annotators read two text responses and pick the one they prefer.

The issue is that an annotator reading a response favors detail, comprehensive lists, and structured formatting (like bold text or bullet points). As shown in Figure 1 above, a standard model (OLMo 7B Instruct) generates a response full of lists and emojis. While this looks great on a screen, it sounds unnatural when synthesized by a Text-to-Speech (TTS) engine.

The researchers hypothesize that processing speech imposes a higher cognitive load than reading. Therefore, the optimal response for a voice interface is not just a transcript of a good text response. It needs to be:

  1. Concise: Avoiding unnecessary fluff.
  2. Linear: Simple sentence structures that don’t require holding multiple clauses in memory.
  3. Vocalizable: Free of emojis, URLs, and formatting syntax.

To validate this, the researchers surveyed users, asking them to evaluate responses in both text and audio formats. The results were stark: users penalized spoken responses heavily for being “too long” or containing “too much information,” even if those same responses were rated highly in text format.

Methodology: Teaching AI to Speak, Not Just Write

The core contribution of this paper is a systematic approach to adapting LLMs for speech. The authors explore two primary methodologies: Prompt Engineering grounded in radio journalism, and Preference Learning using a novel dataset.

Figure 2: Preference learning method overview. Since we only have an approximate idea of what makes a good spoken response, we first compile a set of system prompts intended to vary the speech suitability of generated responses.

As illustrated in Figure 2, the process begins by defining what “good” speech looks like, generating diverse samples, and then training the model to prefer those samples.

1. Prompt Engineering: The “NPR” Approach

Before training a model, you can guide it. The researchers turned to an unexpected source for guidelines: the radio industry. Decades of experience in radio journalism have established “rules of thumb” for writing for the ear:

  • Use simple words and sentence structures.
  • Avoid “tongue twisters” and excessive alliteration.
  • Avoid hyphenated adjectives (e.g., “mineral-rich”).
  • Keep it conversational.

The researchers developed specific System Prompts—instructions given to the AI before it generates a response—that enforce these rules. They also utilized In-Context Learning (ICL), where the model is shown examples of “bad” text responses converted into “good” speech responses.

For example, a standard prompt might simply be: “You are a helpful assistant.” The speech-optimized prompt would be: “You are a helpful voice assistant. Respond colloquially using simple vocabulary… Keep your response compact… Do not use bullet lists.”
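To make this concrete, here is a minimal sketch of how a speech-oriented system prompt and an in-context example could be assembled into a chat request. The prompt wording, the example pair, and the build_messages helper are illustrative assumptions, not the paper's exact prompts or code.

```python
# Sketch: assembling a speech-oriented request with a system prompt and one ICL example.
# The prompt text and the example below are illustrative, not the paper's verbatim prompts.

SPEECH_SYSTEM_PROMPT = (
    "You are a helpful voice assistant. Respond colloquially using simple "
    "vocabulary and simple sentence structures. Keep your response compact. "
    "Do not use bullet lists, emojis, URLs, or anything else that cannot be "
    "read aloud naturally."
)

# One in-context example: a query paired with a concise, listenable answer.
ICL_EXAMPLES = [
    {"role": "user", "content": "How do I choose a new phone?"},
    {
        "role": "assistant",
        "content": (
            "Start by deciding what matters most to you, like battery life, "
            "camera quality, or price, and then compare a couple of phones in "
            "that range. Want me to walk through a specific comparison?"
        ),
    },
]

def build_messages(user_query: str) -> list[dict]:
    """Assemble chat messages: system prompt, ICL examples, then the live query."""
    return (
        [{"role": "system", "content": SPEECH_SYSTEM_PROMPT}]
        + ICL_EXAMPLES
        + [{"role": "user", "content": user_query}]
    )

# The resulting messages can be sent to any chat-style LLM API or a local instruct model.
messages = build_messages("What's a good way to start running?")
```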

2. Preference Learning: Training on What We Hear

While prompting is effective, it is brittle. To bake these behaviors into the model weights, the researchers needed data. They created SpeechPref, a dataset of over 20,000 response pairs.

Crucially, the annotators for this dataset did not read the responses. They listened to them.

Figure 9: Audio preference annotation interface. The left pane contains the user prompt and two responses. The right side contains the survey that appears after the annotator listens to both responses.

As shown in Figure 9, annotators were presented with an interface where the text was hidden. They listened to the user query and two generated responses (converted to audio via Amazon Polly), then selected the winner. This ensured that the preference signal captured the nuances of pacing, length, and listenability that are lost in text-based annotation.
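As a rough sketch of what such a pipeline involves, the snippet below synthesizes a pair of candidate responses to audio with Amazon Polly before a listener picks a winner. The record layout and the choice of voice are assumptions for illustration; the paper does not specify these details.

```python
# Sketch: turning a response pair into audio clips for listening-based annotation.
# Requires AWS credentials; the voice and output format are illustrative choices.
import boto3

polly = boto3.client("polly")

def synthesize(text: str, path: str, voice_id: str = "Joanna") -> None:
    """Convert a text response to an MP3 file with Amazon Polly."""
    result = polly.synthesize_speech(Text=text, OutputFormat="mp3", VoiceId=voice_id)
    with open(path, "wb") as f:
        f.write(result["AudioStream"].read())

# A hypothetical preference record: annotators hear both clips but never see the text.
record = {
    "prompt": "How can I initiate conversation with a stranger?",
    "response_a": "One way is by asking about something around you, like the event you're both at.",
    "response_b": "Here are 3 steps:\n1. Make eye contact.\n2. Offer a compliment.\n3. Ask an open question.",
}
synthesize(record["response_a"], "response_a.mp3")
synthesize(record["response_b"], "response_b.mp3")
# The listener's pick becomes the (chosen, rejected) pair used for preference learning.
```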

The Training Algorithms: PPO vs. DPO

With this new dataset, the researchers fine-tuned models using two popular alignment algorithms:

  1. Proximal Policy Optimization (PPO): This is the classic RLHF method used to train ChatGPT. It involves training a separate “Reward Model” to predict how much a human would like a response, and then optimizing the language model to maximize that reward.
  2. Direct Preference Optimization (DPO): A newer, more stable approach that optimizes the model directly on the preference data, without training a separate reward model (a minimal sketch of the loss appears after this list).
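For readers who prefer code, here is a minimal sketch of the standard DPO objective (the generic formulation, not necessarily the authors' exact training code): the loss compares log-probability ratios of the chosen and rejected responses under the policy and a frozen reference model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO loss computed from per-response summed log-probabilities.

    Each tensor has shape (batch,) and holds log p(response | prompt) under
    either the policy being trained or the frozen reference model.
    """
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Push the policy to prefer the chosen response relative to the reference model.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```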

For the PPO implementation, the researchers trained a reward model using the following pairwise binary ranking loss:

\[ \mathcal{L}_{\mathrm{ranking}} = -\log\left(\sigma\left(r_{\theta}(x, y_c) - r_{\theta}(x, y_r)\right)\right) \]


In this equation, \(r_{\theta}\) represents the reward score. The goal is to maximize the difference between the score of the chosen response (\(y_c\)) and the rejected response (\(y_r\)). By training on heard preferences, the reward model learns to penalize verbosity and complexity in ways a text-based reward model never would.
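In code, this ranking loss is only a few lines. The sketch below assumes the reward model has already produced a scalar score per (prompt, response) pair; it mirrors the equation above rather than the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def reward_ranking_loss(chosen_rewards: torch.Tensor,
                        rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Pairwise ranking loss: -log(sigmoid(r(x, y_c) - r(x, y_r))).

    chosen_rewards / rejected_rewards have shape (batch,) and hold the reward
    model's scalar scores for the preferred and rejected responses to the same prompt.
    """
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```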

Experiments and Results

The researchers applied these methods to two open-source models: Falcon 7B and OLMo 7B. They compared the “Base” models against versions adapted via Prompting, In-Context Learning (ICL), and DPO.

Human Evaluation

The results were overwhelmingly positive. As seen in the charts below, every adaptation technique significantly outperformed the base model in head-to-head comparisons.

Head-to-head human evaluation results for OLMo (left) and Falcon (right). If the win rate is higher than the loss rate, the model is more often preferred in a speech setting.

The green bars represent “Wins,” where the adapted model was preferred over the base model.

  • Prompting works: Just changing the system prompt (OLMo-Prompt) resulted in a 57% win rate against the base model.
  • Fine-tuning works better: The DPO models showed strong performance.
  • The Combination is King: The most interesting finding is that these methods are additive. The best performance came from DPO-ICL—a model fine-tuned with DPO that also used the optimized system prompt and examples. For Falcon, the DPO-ICL model achieved a staggering 75% win rate against the original model.

This suggests that prompting helps guide the model during the generation process, while preference learning fundamentally alters the model’s probability distribution to favor speech-worthy tokens.

Why does the combination work?

The researchers analyzed the training trajectory to understand why combining Prompts (ICL) with DPO worked so well.

Figure 4: Falcon’s DPO training trajectory suggests that prompts help the preference learning process by providing useful initial guidance.

Figure 4 shows the validation margins (how confidently the model distinguishes good from bad responses). The DPO-ICL curve (maroon line) consistently achieves higher margins faster than base DPO. Effectively, the prompts act as “training wheels,” helping the model identify the desired speech characteristics much earlier in the fine-tuning process.

Even GPT-4 Needs Help

The researchers didn’t just stop at smaller open-source models. They also tested their prompting strategies on GPT-4.

Head-to-head human evaluation results with our prompts using GPT-4.

Even though GPT-4 is a massive, state-of-the-art model, the base version struggles with speech suitability. By simply applying the “Speechworthy” prompts (GPT-4-Prompt and GPT-4-ICL), the researchers saw massive improvements in win rates. This confirms that speech unsuitability is not a lack of intelligence; it is a lack of alignment toward the audio modality.

What Makes a Response “Speechworthy”?

So, what exactly changed in the text? Did the models just stop talking as much? The researchers conducted an automatic evaluation to measure specific linguistic features.

Table 5: Automatic evaluation results comparing word count, reading ease, and non-vocal characters.

Table 5 provides a quantitative look at the changes (a rough sketch of how such metrics can be computed follows the list):

  • Word Count (\(\downarrow\)): The adapted models (DPO-ICL) drastically reduced response length. For OLMo, the word count dropped from ~211 to ~95.
  • Flesch Reading Ease (FRE \(\uparrow\)): This metric estimates how easy a text is to read (and hear). Higher scores are better. The Falcon DPO-ICL model improved from 49.74 to 69.52.
  • Non-vocal Characters (NV \(\downarrow\)): The models learned to stop generating markup, emojis, and list formatting that clutters TTS output.
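Here is a rough sketch of how such metrics can be computed. It uses the textstat package for Flesch Reading Ease and a simple regular expression for non-vocalizable characters; both are approximations and may differ from the authors' exact definitions.

```python
# Sketch: rough automatic "speechworthiness" metrics for a generated response.
# The non-vocalizable pattern is an illustrative approximation, not the paper's definition.
import re
import textstat

NON_VOCAL_PATTERN = re.compile(
    r"[*_#>`~\[\]{}()]"          # markdown, list, and parenthesis syntax
    r"|https?://\S+"             # URLs
    r"|[\U0001F300-\U0001FAFF]"  # common emoji block
)

def speech_metrics(response: str) -> dict:
    """Compute word count, Flesch Reading Ease, and a non-vocalizable character count."""
    return {
        "word_count": len(response.split()),
        "flesch_reading_ease": textstat.flesch_reading_ease(response),
        "non_vocal_chars": len(NON_VOCAL_PATTERN.findall(response)),
    }

print(speech_metrics("Sure! Here are 3 tips: 1) smile 2) ask a question 3) listen. 😊"))
```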

However, the qualitative analysis reveals that it is not just about being short. It is about being conversational.

Table 10: Comparison of Falcon DPO-ICL vs. GPT-4-ICL. While GPT-4 generates lists, Falcon uses conversational language.

In the comparison above (Table 10), look at the example “How can I initiate conversation with a stranger?”

  • GPT-4 (Right): Uses a numbered list (“Step one…”, “Second…”, “Third…”). While accurate, it feels mechanical.
  • Falcon DPO-ICL (Left): Uses natural transition words: “One way is by… Another way is by…” It concludes with a conversational check-in: “Would you like more tips?”

This distinction—using linguistic connectors instead of structural formatting—is the hallmark of true speech alignment.

Conclusion and Implications

The paper “Speechworthy Instruction-tuned Language Models” highlights a critical gap in current AI development. As we move toward ubiquitous voice computing—from Star Trek-style computers to AI pins and smart glasses—we cannot simply rely on text-based models equipped with a TTS voice skin.

The researchers demonstrated that modality matters. What we prefer to read is distinct from what we prefer to hear. By combining radio-style prompting with preference learning on actual audio data (SpeechPref), we can build models that feel natural, concise, and helpful to listen to.

The implications extend beyond just convenience. For users with visual impairments or low literacy, voice is a primary interface. Improving the “listenability” of these models is an accessibility imperative.

Key Takeaways:

  1. Don’t list, narrate: Voice assistants should avoid bullet points in favor of conversational transitions.
  2. Less is more: Spoken responses need to be significantly shorter than written ones to reduce cognitive load.
  3. Listen to your data: To train a voice model, you must evaluate it with your ears, not your eyes.

As LLMs continue to evolve, this work serves as a reminder that “alignment” isn’t a single target. A model aligned for a chatbot is not aligned for a voice assistant. To build the future of spoken AI, we must start training models that know when to stop writing and start talking.