In the fast-evolving world of Generative AI, Diffusion Models have become the gold standard. From creating surreal imagery (like Stable Diffusion) to synthesizing hyper-realistic human speech, these models work by iteratively refining random noise into structured data.

But there is a catch: Diffusion models are slow.

In the realm of audio, this slowness is exacerbated. Audio data is high-dimensional: a single second of high-quality audio contains 24,000 to 44,100 samples (at 24 kHz or 44.1 kHz sampling rates). Training a model to predict such long sequences requires immense computational power and time.

Today, we are diving into a fascinating paper titled “Speaking in Wavelet Domain” by researchers from UNSW, CSIRO, and NTU. They ask a refreshingly simple question: instead of complicating the neural network architecture to make it faster, what if we just changed the way we represent the audio signal itself?

Their solution involves the Discrete Wavelet Transform (DWT). By moving the generation process into the wavelet domain, they managed to double the training and inference speed of speech diffusion models without sacrificing quality. Let’s break down how they did it.


The Bottleneck: Why is Speech Synthesis Slow?

To understand the solution, we first need to appreciate the problem. Denoising Diffusion Probabilistic Models (DDPMs) generate data through a reverse diffusion process. They start with Gaussian noise and take hundreds (sometimes thousands) of small steps to “denoise” it into a clean waveform.
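To make that cost concrete, here is a deliberately minimal sketch (my own toy code, not the paper's) of a DDPM sampling loop over a raw waveform; `model`, `cond`, and the linear noise schedule are illustrative placeholders. The key point is that the denoising network is called once per step, over the full waveform length.

```python
import torch

# Toy setup: a 1-second clip at 24 kHz and an illustrative linear noise schedule.
T = 200                                  # number of reverse diffusion steps
L = 24_000                               # waveform length in samples
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

def sample(model, cond):
    """Reverse process: start from Gaussian noise and call the network T times."""
    x = torch.randn(1, 1, L)             # (batch, channels, length) -- pure noise
    for t in reversed(range(T)):
        eps = model(x, t, cond)          # network predicts the noise at step t
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        x = (x - coef * eps) / torch.sqrt(alphas[t])
        if t > 0:                        # add fresh noise on every step but the last
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x
```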

When dealing with images, we often work in a “latent space” (like a compressed version of the image) to save computation. However, audio poses a unique challenge. Speech signals have very high temporal resolution. If you simply shrink an image, you get a blurry picture. If you “shrink” audio by downsampling it, you lose high-frequency details that make the voice sound human (like the crispness of ’s’ or ’t’ sounds).

Standard approaches to speed this up usually involve:

  1. Architectural changes: Making the neural network smaller or sparser.
  2. Scheduling changes: Trying to denoise in fewer steps.

The authors of this paper took a third path: Signal Compression via Wavelets.


The Core Method: Speaking in Wavelets

The researchers turned to a classic signal processing technique: The Wavelet Transform.

What is a Wavelet?

Unlike the Fourier transform, which decomposes a signal into sine waves (frequencies) but discards when those frequencies occur, a wavelet transform decomposes a signal into “wavelets”: small oscillations that are localized in time. This lets the transform capture frequency information and time information simultaneously, which is crucial for non-stationary signals like speech.

A wavelet family widely used in signal compression (it underpins lossless JPEG 2000, for example) is the Cohen-Daubechies-Feauveau (CDF) 5/3 wavelet.

Figure 1: The Cohen-Daubechies-Feauveau 5-tap/3-tap wavelet. (a) Scaling and wavelet functions; (b) decomposition and reconstruction filters.

As shown in Figure 1, the wavelet functions (bottom row) oscillate and return to zero. They act as filters that can extract specific textures from the signal.

The Discrete Wavelet Transform (DWT)

The method relies on Decomposition. The Discrete Wavelet Transform (DWT) takes an input speech signal and passes it through two filters:

  1. Low-Pass Filter: Extracts the “Approximation coefficients” (cA). This captures the smooth, low-frequency structure of the voice.
  2. High-Pass Filter: Extracts the “Detail coefficients” (cD). This captures the high-frequency noise and textures.

Crucially, after filtering, the signal is downsampled by 2.

\[ \varPsi_{\mathrm{low}}(n) = \sum_{k=-\infty}^{+\infty} g(k)\,\phi(2n - k). \]

Equation showing low pass filtering

\[ \varPsi_{\mathrm{high}}(n) = \sum_{k=-\infty}^{+\infty} g(k)\,\psi(2n - k). \]

Equation showing high pass filtering

This leads to the critical compression step:

\[ cA = \varPsi_{\mathrm{low}} \downarrow 2, \qquad cD = \varPsi_{\mathrm{high}} \downarrow 2. \]

Equation showing downsampling

If your original audio had a length of \(L\), you now have two vectors (\(cA\) and \(cD\)), each with length \(L/2\).
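You can see this halving directly with the PyWavelets library (used here purely as a reference implementation of the DWT; whether the authors used it is not stated):

```python
import numpy as np
import pywt

# A toy stand-in for one second of speech at 24 kHz.
L = 24_000
t = np.linspace(0.0, 1.0, L, endpoint=False)
x = np.sin(2 * np.pi * 220 * t) + 0.1 * np.random.randn(L)

# Single-level DWT: low-pass -> approximation (cA), high-pass -> detail (cD),
# each branch downsampled by 2.
cA, cD = pywt.dwt(x, "haar")
print(cA.shape, cD.shape)            # (12000,) (12000,) -- half the original length each

# The inverse DWT recovers the original signal up to floating-point precision.
x_rec = pywt.idwt(cA, cD, "haar")
print(np.allclose(x, x_rec))         # True
```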

The Wavelet Diffusion Pipeline

Here is the genius of the approach: The Diffusion Model is trained to generate the Wavelet components (\(cA\) and \(cD\)), not the raw audio.

Figure 2: Overview of the Speech Wavelet Diffusion Model pipeline. The speech signal is first decomposed into the approximation coefficient matrix (cA) and the detail coefficient matrix (cD); the diffusion model then generates cA and cD, and the speech signal is restored from these matrices.

As illustrated in Figure 2, the pipeline works as follows:

  1. Input: The target speech is decomposed via DWT into \(cA\) and \(cD\).
  2. Concatenation: These two vectors are stacked to form a matrix with 2 channels and half the original length (\(L/2\)).
  3. Diffusion: The model learns to denoise this compressed representation.
  4. Reconstruction: Once the model generates the clean \(cA\) and \(cD\), an Inverse Discrete Wavelet Transform (IDWT) perfectly reconstructs the full-resolution speech signal. (A minimal sketch of this round trip follows below.)
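Here is a minimal sketch of that round trip, with pywt handling the transform and a placeholder standing in for the trained denoiser (the paper plugs in DiffWave and CDiffuSE; the "haar" basis, shapes, and the `diffusion_model.sample` call are illustrative assumptions):

```python
import numpy as np
import pywt
import torch

def waveform_to_wavelet_input(x: np.ndarray) -> torch.Tensor:
    """DWT a length-L waveform into a (1, 2, L/2) tensor: channel 0 = cA, channel 1 = cD."""
    cA, cD = pywt.dwt(x, "haar")
    stacked = np.stack([cA, cD], axis=0)              # (2, L/2)
    return torch.from_numpy(stacked).float().unsqueeze(0)

def wavelet_output_to_waveform(y: torch.Tensor) -> np.ndarray:
    """Split the generated channels back into cA/cD and run the inverse DWT."""
    cA, cD = y.squeeze(0).detach().cpu().numpy()      # each of length L/2
    return pywt.idwt(cA, cD, "haar")                  # back to length L

# Usage sketch (diffusion_model is a hypothetical trained denoiser):
# coeffs = diffusion_model.sample(torch.randn(1, 2, 12_000), cond)
# audio = wavelet_output_to_waveform(coeffs)
```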

Why does this speed things up?

You might wonder: “We still have the same number of data points, just rearranged. Why is it faster?”

The answer lies in how GPUs and Convolutional Neural Networks (CNNs) work. The computational cost of a convolution layer, measured in multiply-accumulate operations (MACs), depends heavily on the sequence length it has to slide over.

\[ \mathrm{MAC}_{h(n)} = K \times C_{\mathrm{out}} \times x = \frac{1}{2}\,\mathrm{MAC}_{g(n)}. \]

Equation showing MAC reduction

By halving the sequence length (\(L \to L/2\)) while doubling the channels (\(1 \to 2\)), nearly every layer in the network does half the work: only the first layer sees the extra input channel, and GPUs parallelize across channels far more cheaply than across long sequences. The researchers found that this simple change roughly doubles the speed of both training and inference.
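A back-of-the-envelope calculation (mine, not the paper's) makes this concrete. For every hidden Conv1d layer the kernel size and channel widths are unchanged, but the output length is halved, so its MAC count is halved too; only the very first layer sees the input channels go from 1 to 2.

```python
def conv1d_macs(kernel_size: int, c_in: int, c_out: int, length_out: int) -> int:
    """Multiply-accumulate operations for one 1D convolution layer."""
    return kernel_size * c_in * c_out * length_out

L = 24_000
K, C = 3, 64                                   # illustrative kernel size and hidden width

# Raw-waveform model: 1 input channel, full length L.
raw = conv1d_macs(K, 1, C, L) + conv1d_macs(K, C, C, L)

# Wavelet-domain model: 2 input channels, but only L/2 time steps.
wav = conv1d_macs(K, 2, C, L // 2) + conv1d_macs(K, C, C, L // 2)

print(wav / raw)       # ~0.51: every layer after the first costs exactly half
```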


Further Enhancements: Squeezing More Performance

The researchers didn’t stop at basic decomposition. They introduced two clever modules to further improve quality and speed.

1. The Frequency Bottleneck Block

Speech energy is mostly concentrated in low frequencies. High frequencies often contain noise. To help the model focus on what matters, the authors proposed a Frequency Bottleneck Block.

Figure 4: Overview of Frequency Bottleneck Block

This block (Figure 4) sits inside the model. It splits the representation by frequency once more, applies convolutions to each band, and emphasizes the low-frequency components (where speech energy lives) while attenuating the high-frequency components (where noise tends to live). This yielded better audio quality, particularly for Speech Enhancement (denoising) tasks.
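The paper presents this block as a diagram rather than code, so the PyTorch module below is only a loose sketch of the idea described above (split into bands, convolve each, re-weight the low band more strongly); the layer sizes and the learnable per-band gains are my assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn

class FrequencyBottleneckSketch(nn.Module):
    """Loose sketch: split a 2-channel wavelet tensor into low/high bands,
    refine each with a small convolutional stack, and learn per-band gains
    so the network can emphasise the low band (speech) over the high band (noise)."""

    def __init__(self, hidden: int = 32, kernel: int = 3):
        super().__init__()
        pad = kernel // 2
        self.low_conv = nn.Sequential(
            nn.Conv1d(1, hidden, kernel, padding=pad), nn.SiLU(),
            nn.Conv1d(hidden, 1, kernel, padding=pad),
        )
        self.high_conv = nn.Sequential(
            nn.Conv1d(1, hidden, kernel, padding=pad), nn.SiLU(),
            nn.Conv1d(hidden, 1, kernel, padding=pad),
        )
        # Learnable band weights, initialised to favour the low band.
        self.low_gain = nn.Parameter(torch.tensor(1.0))
        self.high_gain = nn.Parameter(torch.tensor(0.5))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 2, L/2) -- channel 0 = cA (low band), channel 1 = cD (high band)
        low, high = x[:, :1], x[:, 1:]
        low = low + self.low_gain * self.low_conv(low)    # residual refinement of speech band
        high = self.high_gain * self.high_conv(high)      # attenuated detail/noise path
        return torch.cat([low, high], dim=1)
```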

2. Multi-Level Wavelet Accelerator

If splitting the signal once (\(L/2\)) is fast, what if we split it again?

The Multi-Level Accelerator applies the DWT recursively to the wavelet coefficients themselves. This yields a representation with 4 channels, each one quarter of the original length.

Figure: Overview of the Multi-Level DWT and enhancement modules. (a) Block of the Multi-Level Discrete Wavelet Transform; (b) Multi-Level Low-Frequency Voice Enhancement Module; (c) Block of the Multi-Level Inverse Discrete Wavelet Transform.

As the figure above suggests, this allows for aggressive compression. The experiments showed this could speed up the model by more than 5x, though with a slight trade-off in audio fidelity.
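For reference, here is one way to produce that four-channel, quarter-length representation with pywt by splitting both bands again (a wavelet-packet-style decomposition; treat the exact wiring as an assumption rather than the paper's implementation):

```python
import numpy as np
import pywt

L = 24_000
x = np.random.randn(L)                      # stand-in waveform

# Level 1: L -> two bands of length L/2.
cA, cD = pywt.dwt(x, "haar")

# Level 2: split *both* bands again -> four bands of length L/4.
cAA, cAD = pywt.dwt(cA, "haar")
cDA, cDD = pywt.dwt(cD, "haar")
coeffs = np.stack([cAA, cAD, cDA, cDD])     # shape (4, 6000): 4 channels, quarter length
print(coeffs.shape)

# Reconstruction inverts the two levels in reverse order.
x_rec = pywt.idwt(pywt.idwt(cAA, cAD, "haar"),
                  pywt.idwt(cDA, cDD, "haar"), "haar")
print(np.allclose(x, x_rec))                # True
```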


Experiments and Results

The team tested this method on two major tasks: Speech Synthesis (waveform generation with DiffWave) and Speech Enhancement (noise removal with CDiffuSE).

Speed Comparison

The primary claim of the paper is speed. Let’s look at the data:

Table 1: Results for various wavelet bases on both the Speech Enhancement and Speech Synthesis tasks.

In Table 1, look at the Training Time and RTF (Real-Time Factor) columns.

  • Original Model: Training took ~481 seconds per epoch; RTF was 0.728.
  • Wavelet Models (Haar, DB2, etc.): Training dropped to ~248 seconds; RTF dropped to ~0.40.

This confirms the hypothesis: The Wavelet approach cuts the computational time almost exactly in half.

Quality Comparison

Speed is useless if the audio sounds robotic. The researchers used PESQ (Perceptual Evaluation of Speech Quality) and MOS (Mean Opinion Score, rated by humans) to judge quality.

  • Haar Wavelet: Very fast and sharp, but in theory a bit “blocky.” In practice, it achieved a MOS of 4.32, very close to the original model’s 4.38.
  • DB2 (Daubechies 2): A slightly more complex wavelet. It offered a great balance, maintaining timbre better than Haar.
  • Coif1: This wavelet performed exceptionally well on the DNSMOS metric (a neural network that judges quality), actually beating the original model in some cases. However, human listeners noted it sometimes slightly altered the “timbre” (the character) of the voice. (A quick way to inspect these wavelet bases in code follows below.)
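If you want to poke at these bases yourself, PyWavelets ships standard definitions under the same names; the longer filters (db2, coif1) smooth more per step at a small extra cost per transform (assuming the paper's bases match these standard definitions):

```python
import pywt

for name in ["haar", "db2", "coif1"]:
    w = pywt.Wavelet(name)
    print(name, "decomposition filter length:", w.dec_len)

# haar  -> 2   (shortest and cheapest, slightly "blocky")
# db2   -> 4   (smoother; noted above as preserving timbre better)
# coif1 -> 6   (smoother still)
```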

The Power of the Frequency Enhancer

When they added the Frequency Bottleneck Block (discussed in the previous section), the results got even better.

Table 3: Results for various wavelet bases on both tasks when the Frequency Enhancer is added.

Table 3 shows that with the enhancer, the Haar wavelet model actually achieved a Speech Naturalness (SN) score of 4.421, outperforming the original model’s 4.372, while still being twice as fast.

Extreme Speed with Multi-Level Wavelets

Finally, for the Multi-Level Accelerator (splitting the signal into 4 parts):

Table 2: Results of the Multi-Level Wavelet Accelerator.

Table 2 shows the Haar4C variant (the 4-channel, quarter-length split). The RTF drops to a staggering 0.126. This is incredibly fast; however, the MOS drops to 4.32. This creates a spectrum of choices: if you need the best audio quality, use the single-level split (2 channels); if you need lightning-fast generation (e.g., for a mobile app), use the multi-level split (4 channels).


Conclusion and Key Takeaways

The paper “Speaking in Wavelet Domain” teaches us a valuable lesson in AI research: sometimes the best optimization isn’t a new layer or a new optimizer, but a better representation of the data itself.

Key Takeaways for Students:

  1. Signal Representation Matters: Deep learning models don’t exist in a vacuum. Understanding signal processing (like Wavelets vs. FFT) allows you to feed your model “better” food, making it digest (train) faster.
  2. Parallelization is King: Reducing sequence length, even at the cost of increasing channels, is usually a winning trade-off on modern GPUs.
  3. Versatility: This method isn’t tied to one specific model. It can be plugged into almost any diffusion-based audio model (DiffWave, CDiffuSE, etc.) for a roughly 2x speedup.

By simply looking at the speech signal through the lens of wavelets, the authors unlocked a method to make high-fidelity speech synthesis practical and scalable. For anyone working on diffusion models, this invites a compelling question: What other domains could benefit from a wavelet transformation?