Introduction

In the rapidly evolving world of Artificial Intelligence, multimodal models—systems that can understand and process multiple types of data like text, images, and audio—are breaking new ground. Just as Vision-Language Models (VLMs) like CLIP revolutionized computer vision by connecting images to natural language, Audio-Language Models (ALMs) are doing the same for sound.

These models allow for Zero-Shot Audio Recognition. Imagine playing a sound clip of a dog barking to an AI model that has never been explicitly trained to classify “dog barks.” Instead, you simply provide the text “A recording of a dog,” and the model matches the audio features to the text features, correctly identifying the sound.

However, there is a catch. The performance of these models is notoriously sensitive to the specific wording of the text prompt. Changing “A recording of a dog” to “This is a sound of a dog” can drastically alter the accuracy. This dependency forces researchers to engage in “prompt engineering”—a tedious process of manually guessing the best phrasing.

In this post, we explore a research paper titled “PALM: Few-Shot Prompt Learning for Audio Language Models.” The authors propose a novel, efficient method called PALM that automates this process. Instead of struggling with hand-crafted sentences, PALM learns the optimal context directly within the feature space of the model, achieving state-of-the-art results with a fraction of the computational cost of previous methods.

The Problem with Hand-Crafted Prompts

To understand why PALM is necessary, we first need to look at the limitations of the current standard: Zero-Shot inference.

In a typical ALM setup (like the PENGI model used in this study), the model has two main branches: an Audio Encoder and a Text Encoder. The Audio Encoder turns sound waves into mathematical embeddings (vectors), and the Text Encoder does the same for text. To classify a sound, the model calculates the cosine similarity between the audio embedding and various text embeddings (e.g., “dog,” “car,” “rain”). The class with the highest similarity score wins.
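To make that matching step concrete, here is a minimal PyTorch sketch of zero-shot classification with an ALM. The `audio_encoder` and `text_encoder` callables are stand-ins for PENGI's frozen branches (the real API differs), and both are assumed to return embeddings of the same dimension.

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(audio_encoder, text_encoder, waveform, class_names,
                       template="This is a recording of a {}"):
    """Pick the class whose text embedding is closest to the audio embedding."""
    prompts = [template.format(name) for name in class_names]
    with torch.no_grad():
        audio_emb = F.normalize(audio_encoder(waveform), dim=-1)  # (1, d)
        text_embs = F.normalize(text_encoder(prompts), dim=-1)    # (C, d)
    sims = audio_emb @ text_embs.T                                # cosine similarities, (1, C)
    return class_names[sims.argmax(dim=-1).item()]
```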

The issue is that the text encoder is sensitive. The authors demonstrated this by testing the PENGI model on standard datasets using eight different prompt templates.

Figure 2: Impact of Hand-crafted Prompts on ZERO-SHOT Performance. Zero-shot accuracy across four audio recognition datasets is evaluated with eight different text prompts. The accuracy varies with changes in the handcrafted prompts.

As shown in Figure 2, the accuracy fluctuates significantly depending on the template used. For the ESC50 dataset, a simple prompt like {CLASS NAME} yields 43.3% accuracy, whereas “This is a recording of {CLASS NAME}” jumps to 53.5%. Relying on manual engineering to find that “magic sentence” is inefficient and unreliable.
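Measuring that sensitivity amounts to sweeping templates through the zero-shot routine sketched above. The templates and the `eval_set` of (waveform, label) pairs below are placeholders, not the paper's exact setup:

```python
def template_sweep(audio_encoder, text_encoder, eval_set, class_names, templates):
    """Print zero-shot accuracy for each hand-crafted template (reuses zero_shot_classify)."""
    for tpl in templates:
        hits = sum(
            zero_shot_classify(audio_encoder, text_encoder, wav, class_names, template=tpl) == label
            for wav, label in eval_set
        )
        print(f"{tpl!r}: {hits / len(eval_set):.1%}")

example_templates = ["{}", "This is a recording of {}", "This is a sound of {}"]
```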

Existing Solutions and Their Flaws

The computer vision community faced this exact problem with VLMs and developed “Prompt Learning” solutions like COOP (Context Optimization) and COCOOP (Conditional Context Optimization).

These methods replaced manual text prompts with learnable “context tokens.” Instead of feeding the words “A recording of,” they feed learnable vectors into the input of the text encoder. During training, the model adjusts these input vectors to maximize accuracy.

While effective, adapting COOP and COCOOP to Audio-Language Models comes with a heavy computational price. Because these methods optimize the input space, the error gradients must propagate all the way through the massive Text Encoder during training. This requires significant memory and processing power.
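As a rough sketch of that input-space idea, assuming a text encoder that accepts pre-embedded token sequences (names and shapes here are illustrative, not COOP's actual code):

```python
import torch
import torch.nn as nn

class CoOpTextPrompt(nn.Module):
    """Learnable context tokens prepended to fixed class-name token embeddings."""

    def __init__(self, text_encoder, class_token_embs, n_ctx=4, emb_dim=512):
        super().__init__()
        self.text_encoder = text_encoder            # pre-trained, weights kept frozen
        self.class_token_embs = class_token_embs    # (C, L, emb_dim), fixed per class
        # The "A recording of ..." words are replaced by free vectors to be optimized.
        self.ctx = nn.Parameter(torch.randn(n_ctx, emb_dim) * 0.02)

    def forward(self):
        C = self.class_token_embs.size(0)
        ctx = self.ctx.unsqueeze(0).expand(C, -1, -1)              # (C, n_ctx, d)
        tokens = torch.cat([ctx, self.class_token_embs], dim=1)    # (C, n_ctx + L, d)
        # Gradients w.r.t. self.ctx must travel back through the entire text encoder.
        return self.text_encoder(tokens)                           # (C, d) text features
```

Because `self.ctx` sits at the very start of the pipeline, every update requires a full backward pass through the encoder, and that is exactly the cost PALM avoids.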

The PALM Solution: Optimizing the Feature Space

The authors of PALM took a different approach. They asked: Why optimize the input when we can optimize the output?

PALM (Prompt Learning for Audio Language Models) shifts the focus from the token embedding space (input) to the feature space (output).

The Architecture

Let’s look at how PALM compares to the traditional Zero-Shot approach and the COOP baseline.

Figure 3: Overview of Zero-Shot, COOP, PALM. (a) Zero-Shot matches embeddings directly. (b) COOP optimizes the input space (token embeddings). (c) PALM optimizes the feature space by adding learnable context embeddings to text feature vectors.

As illustrated in Figure 3:

  1. Zero-Shot (a): Uses fixed, hand-crafted prompts.
  2. COOP (b): Learns context at the beginning of the pipeline (input to Text Encoder). This requires backpropagation through the entire encoder (the grey box).
  3. PALM (c): Passes the simple class name through the frozen Text Encoder first. Then, it modifies the resulting feature vector with learnable parameters.

Because the modification happens after the Text Encoder, gradients never need to flow back through it. The encoder remains completely frozen, which keeps training fast and light on memory.

The Mathematics of PALM

How does PALM actually modify the features? It uses a clever combination of the original text features and a learned context vector.

First, the text prompt \(t_i\) (which is just the class name) is passed through the text encoder \(f_T\) to get a feature vector. Then, PALM computes a modified feature vector \(f'_T(t_i)\) using the following equation:

\[
f'_T(t_i) = (1 - \lambda_i)\, f_T(t_i) + \lambda_i\, z_i
\]

Here:

  • \(f_T(t_i)\) is the original feature vector from the frozen encoder.
  • \(z_i\) is a learnable context vector specific to that class.
  • \(\lambda_i\) is a learnable scalar (between 0 and 1) that acts as a gate, deciding how much to rely on the pre-trained knowledge versus the learned context.

Once the modified text features are created, the model compares them to the audio features (\(f_A(\mathbf{x})\)) using cosine similarity. The final prediction logic remains similar to standard zero-shot inference, but uses the optimized text features:

\[
\hat{y} = \underset{i \in \{1, \dots, C\}}{\arg\max}\; \frac{f_A(\mathbf{x}) \cdot f'_T(t_i)}{\lVert f_A(\mathbf{x}) \rVert \, \lVert f'_T(t_i) \rVert}
\]

where \(C\) is the number of classes.
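Here is a minimal sketch of those two steps in PyTorch. The convex-combination reading of the equation above, the sigmoid used to keep \(\lambda_i\) in (0, 1), and the normalization steps are implementation choices of this sketch, not necessarily the authors' exact code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PALMTextFeatures(nn.Module):
    """Per-class learnable context mixed into frozen text features (feature space)."""

    def __init__(self, class_text_feats):
        super().__init__()
        # f_T(t_i): class-name features from the frozen text encoder, shape (C, d).
        self.register_buffer("base_feats", F.normalize(class_text_feats, dim=-1))
        num_classes, feat_dim = class_text_feats.shape
        self.z = nn.Parameter(torch.randn(num_classes, feat_dim) * 0.02)  # context z_i
        self.gate = nn.Parameter(torch.zeros(num_classes))                # lambda_i (pre-sigmoid)

    def forward(self):
        lam = torch.sigmoid(self.gate).unsqueeze(-1)          # (C, 1), in (0, 1)
        mixed = (1 - lam) * self.base_feats + lam * F.normalize(self.z, dim=-1)
        return F.normalize(mixed, dim=-1)                     # f'_T(t_i), shape (C, d)

def predict(audio_feats, text_feats):
    """Cosine-similarity classification against the modified text features."""
    sims = F.normalize(audio_feats, dim=-1) @ text_feats.T    # (B, C)
    return sims.argmax(dim=-1)
```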

Training

The model is trained in a few-shot setting. This means it only sees a small number of examples (e.g., 16 audio clips) per class. The goal is to minimize the difference between the predicted class and the actual label. The objective function minimizes the standard cross-entropy loss:

\[
\min_{\{z_i,\, \lambda_i\}_{i=1}^{C}} \; -\frac{1}{\lvert \mathcal{D} \rvert} \sum_{(\mathbf{x},\, y) \in \mathcal{D}} \log \frac{\exp\!\big(\mathrm{sim}(f_A(\mathbf{x}),\, f'_T(t_y))\big)}{\sum_{i=1}^{C} \exp\!\big(\mathrm{sim}(f_A(\mathbf{x}),\, f'_T(t_i))\big)}
\]

where \(\mathcal{D}\) is the few-shot training set and \(\mathrm{sim}(\cdot, \cdot)\) denotes cosine similarity.

Importantly, during this optimization, only the context vectors (\(z\)) and the gating parameters (\(\lambda\)) are updated. The massive Audio and Text encoders remain frozen.
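A sketch of that few-shot loop, assuming the audio features were precomputed once with the frozen audio encoder and reusing the `PALMTextFeatures` module from the previous snippet (the optimizer, learning rate, and epoch count are illustrative, not the paper's settings):

```python
import torch
import torch.nn.functional as F

def train_palm(palm_text, audio_feats, labels, epochs=100, lr=1e-3):
    """Few-shot training: only the z_i and lambda_i inside `palm_text` receive gradients."""
    optimizer = torch.optim.AdamW(palm_text.parameters(), lr=lr)
    audio_feats = F.normalize(audio_feats, dim=-1)   # (N, d), from the frozen audio encoder
    for _ in range(epochs):
        text_feats = palm_text()                     # f'_T for every class, (C, d)
        logits = audio_feats @ text_feats.T          # cosine similarities, (N, C)
        loss = F.cross_entropy(logits, labels)       # standard cross-entropy objective
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return palm_text
```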

Efficiency Analysis

One of the strongest arguments for PALM is its efficiency. By avoiding the need to pass gradients through the text encoder, PALM drastically reduces the computational burden compared to baselines like COCOOP.

Table 3: Number of Learnable Parameters in baselines and PALM.

As seen in Table 3, PALM requires roughly 87% fewer learnable parameters than COCOOP (12,393 vs 98,880). This makes the model lighter to store and faster to train, without sacrificing the complex understanding embedded in the pre-trained encoders.
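That small footprint falls directly out of the construction: PALM learns one context vector \(z_i\) and one scalar \(\lambda_i\) per class, nothing else. A back-of-the-envelope helper (the class count and feature dimension below are hypothetical, so the result will not match Table 3 exactly):

```python
def palm_learnable_params(num_classes: int, feat_dim: int) -> int:
    """One context vector z_i (feat_dim values) plus one gate lambda_i per class."""
    return num_classes * (feat_dim + 1)

print(palm_learnable_params(num_classes=10, feat_dim=1024))  # -> 10250
```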

Experiments and Results

To validate their method, the researchers tested PALM on 11 different audio datasets covering a wide range of tasks, from emotion recognition to identifying musical instruments.

Table 1: Datasets Information showing 11 multi-class classification datasets.

Performance Comparison

The results were compared against three baselines:

  1. Zero-Shot: The standard pre-trained PENGI model.
  2. COOP: Adapted from vision-language models.
  3. COCOOP: An advanced version of COOP with feedback loops.

The comparison results are summarized in the chart below:

Figure 1: Comparison of our proposed approach, PALM, with three baselines. Bar plots show classification accuracy averaged across 11 audio datasets.

PALM emerges as the clear winner:

  • Zero-Shot baseline averaged 39.7% accuracy.
  • COOP improved this to 71.1%.
  • COCOOP reached 73.5%.
  • PALM achieved the highest accuracy at 76.6%.

This demonstrates that, for Audio-Language Models, optimizing the feature space is not only more efficient but also more effective than optimizing the input space: on average, PALM outperformed COOP by 5.5 percentage points and COCOOP by 3.1.

Why Does It Work? (Ablation Studies)

The researchers performed “ablation studies”—experiments designed to remove parts of the system to see if they matter.

1. Is the learnable context actually helping? They compared PALM against a version where the learnable context embeddings (\(z_i\)) were removed, leaving only the original text features.

Figure 4: Comparison of PALM\(^{\dagger}\) and PALM. Removal of the context embeddings drastically degrades performance.

Figure 4 shows that removing the context (represented by the orange bars, PALM\(^{\dagger}\)) leads to a massive drop in accuracy across almost all datasets. This confirms that the learned vector \(z_i\) is capturing crucial information that the bare class name alone misses.

2. Does more data help? Since this is a few-shot learning method, the number of “shots” (training examples) matters.

Figure 5: A higher number of shots generally leads to increased audio classification accuracy using PALM.

Figure 5 confirms a positive correlation: as the number of shots increases (from 1 to 16), the accuracy of PALM consistently improves across various datasets.

Conclusion

The PALM paper presents a significant step forward for Audio-Language Models. It addresses the “prompt engineering bottleneck” not by asking humans to write better prompts, but by allowing the model to learn the best representation itself.

By shifting the optimization from the input space (tokens) to the feature space (embeddings), PALM achieves a “best of both worlds” scenario:

  1. High Accuracy: It outperforms state-of-the-art baselines like COCOOP.
  2. High Efficiency: It uses significantly fewer parameters and eliminates the need for expensive backpropagation through the text encoder.

For students and researchers in multimodal AI, PALM offers a valuable lesson: sometimes the most effective way to adapt a massive pre-trained model isn’t to retrain it or massage its inputs, but to fine-tune its outputs with a lightweight, learnable layer. As ALMs continue to grow in popularity, efficient prompt learning techniques like PALM will be essential for deploying these models in real-world applications.