Introduction

We are living in the era of massive artificial intelligence models. In recent years, deep learning models—particularly Transformers—have shattered records in computer vision and natural language processing. We have moved beyond simple image classifiers to complex multimodal systems capable of understanding video, audio, and text simultaneously. However, this capability comes at a steep price.

Between 2012 and 2020, the computational requirements for state-of-the-art machine learning applications increased by a staggering factor of 1,000,000. While hardware has improved, it hasn’t kept pace with this exponential growth. This creates a massive bottleneck for inference serving—the process of actually running these models in production to generate predictions for users.

When you deploy a massive multimodal model, you are constantly fighting two battles: latency (how fast can you reply?) and cost (how much GPU memory and compute power do you need?). To win these battles, engineers often turn to model compression techniques like pruning (removing connections), distillation (teaching a smaller model), or quantization (reducing numerical precision). These are effective, but they have limits. Sometimes, you cannot compress a model further without destroying its accuracy.

But what if we looked at the problem from a different angle? Instead of shrinking the model, what if we optimized the input?

This is the core premise of a fascinating research paper titled “MOSEL: Inference Serving Using Dynamic Modality Selection.” The researchers propose a novel system that dynamically decides which parts of the data (modalities) are actually necessary to process a request. By intelligently dropping expensive inputs—like video frames—while keeping cheaper ones—like audio—MOSEL achieves a \(3.6\times\) improvement in throughput and an \(11\times\) reduction in job completion times, all while maintaining strict accuracy guarantees.

In this post, we will tear down the MOSEL architecture, explain the intuition behind “modality selection,” and explore how this system manages to perform a high-wire act between speed and precision.

Background: The Cost of Multimodality

To understand why MOSEL is necessary, we first need to understand the nature of the models it serves. Multimodal Learning involves models that process different types of data inputs—such as text, image, audio, and video—to make a prediction.

Think of a model designed to analyze a video clip and determine the emotion of the speaker. This model might look at the speaker’s facial expressions (Video Modality), listen to the tone of voice (Audio Modality), and analyze the transcript of what is said (Text Modality).

The Fusion Challenge

These models use a technique called fusion to combine these inputs.

  • Early Fusion: The raw data (e.g., video patches and audio spectrograms) are combined right at the beginning and fed into the neural network together.
  • Late Fusion: Each modality is processed separately by different sub-networks, and their results are combined at the very end (see the sketch below).
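
To make the distinction concrete, here is a minimal late-fusion sketch in PyTorch. The class, the layer sizes, and the zero-embedding fallback for a missing modality are all illustrative choices of ours, not details from the paper:

```python
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    """Toy late-fusion model: each modality gets its own encoder, and the
    embeddings are combined only at the final classification head."""

    def __init__(self, audio_dim=128, video_dim=512, hidden=256, num_classes=7):
        super().__init__()
        self.hidden = hidden
        self.audio_encoder = nn.Sequential(nn.Linear(audio_dim, hidden), nn.ReLU())
        self.video_encoder = nn.Sequential(nn.Linear(video_dim, hidden), nn.ReLU())
        self.head = nn.Linear(2 * hidden, num_classes)

    def forward(self, audio=None, video=None):
        # A dropped modality is replaced with a zero embedding, so the model
        # still produces a prediction when only one input is supplied.
        batch = audio.size(0) if audio is not None else video.size(0)
        a = self.audio_encoder(audio) if audio is not None else torch.zeros(batch, self.hidden)
        v = self.video_encoder(video) if video is not None else torch.zeros(batch, self.hidden)
        return self.head(torch.cat([a, v], dim=-1))

model = LateFusionClassifier()
full = model(audio=torch.randn(4, 128), video=torch.randn(4, 512))  # both modalities
cheap = model(audio=torch.randn(4, 128))                            # audio-only request
```

Note how late fusion makes dropping a modality straightforward in this sketch: skipping video simply skips the video sub-network and its compute.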

Regardless of the fusion strategy, processing all these modalities is expensive. Video, in particular, is a resource hog. It requires processing temporal data (time) and spatial data (pixels), which consumes vast amounts of memory and compute cycles.

The Unequal Contribution of Modalities

Here is the key insight driving this research: Not all modalities are created equal.

In many scenarios, one modality might provide the bulk of the predictive power, while another adds a lot of computational cost for very little gain.

Let’s look at the data. The researchers analyzed several popular multimodal models, including TVLT (Textless Vision-Language Transformer).

Figure 1: Performance comparison of different modalities for multimodal models.

Figure 1 above illustrates this trade-off perfectly.

  • Upper Left (Latency): Look at the first group of bars for the TVLT model. The bar representing “all” modalities is the highest, while the audio-only bar is significantly lower.
  • Upper Middle (Memory): The difference is even more stark here. The memory footprint for “all” modalities is massive, while “audio” is tiny.
  • Upper Right (Accuracy): This is the most important chart. While using “all” modalities gives the highest accuracy, using only audio often gets you very close to that peak.

For the TVLT model, the video modality consumes significant resources but provides diminishing returns compared to audio. This creates an opportunity: if the server is under heavy load, could we temporarily stop processing the video stream and rely only on audio? We might lose a fraction of a percent in accuracy, but we could gain a massive speedup, ensuring the system doesn’t crash or time out.

This is the concept of Modality Selection: selectively enabling or disabling modalities per request based on the application’s requirements and the system’s current load.
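
As a deliberately naive illustration of the idea (the thresholds and modality sets below are invented; MOSEL replaces this kind of hand-tuning with profiled, optimized strategies, as we'll see):

```python
def select_modalities(queue_depth: int, accuracy_target: float) -> set:
    """Toy policy: shed the expensive video modality as load grows."""
    if queue_depth < 8:                      # light load: use everything
        return {"audio", "video", "text"}
    if accuracy_target <= 0.72:              # lenient SLO: cheapest input only
        return {"audio"}
    return {"audio", "text"}                 # heavy load: drop only video

print(select_modalities(queue_depth=3, accuracy_target=0.80))   # all modalities
print(select_modalities(queue_depth=20, accuracy_target=0.70))  # audio only
```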

The Scheduling Challenge

Implementing modality selection isn’t as simple as just “turning off video when busy.” Real-world inference systems handle Jobs, which are batches of requests. Each job comes with a Service Level Objective (SLO)—a specific deadline and a minimum accuracy requirement.

The system must decide which modalities to use for each request in a job to meet the deadline without dropping below the target accuracy. This is a complex scheduling puzzle.

Let’s visualize this problem.

Figure 2: Execution plans for scheduling jobs with different modality strategies.

Figure 2 demonstrates the complexity of scheduling three different jobs:

  • Job 1 is already running.
  • Job 2 arrives at timestamp 10.
  • Job 3 arrives shortly after at timestamp 20.

The system has a “menu” of strategies (\(S_1\) through \(S_6\)) for Job 2. Some strategies use high-accuracy/high-latency modalities (like Video+Audio), while others use low-accuracy/low-latency ones (Audio only).

  • Plan 1 (Top Right): If we choose strategy \(S_6\) for Job 2, it finishes quickly, but the accuracy (0.685) is too low to meet Job 2’s requirement (0.71). This plan fails.
  • Plan 2 (Bottom Left): If we choose strategy \(S_1\) (highest accuracy) for Job 2, it meets the accuracy requirement easily. However, it takes so long to execute that Job 3 is stuck waiting in the queue. By the time Job 2 finishes (at timestamp 140), Job 3 cannot possibly finish before its deadline (timestamp 150). This plan also fails.
  • Plan 3 (Bottom Right): This is the sweet spot. We select a strategy for Job 2 that uses a mix of modalities—perhaps using Video+Audio for some requests and Audio-only for others. This lowers Job 2’s accuracy slightly (to 0.735, which is still above the 0.71 requirement) but finishes much faster. This leaves enough time for Job 3 to execute and meet its deadline.

This example highlights that we cannot look at jobs in isolation. A “greedy” approach that gives Job 2 the maximum possible accuracy might starve Job 3. The system needs to be globally aware, sometimes selecting a slightly lower (but still acceptable) accuracy for the current job so that future jobs can still meet their deadlines.

MOSEL: System Design

To automate this decision-making process, the authors built MOSEL. The system operates in two distinct stages: an Offline Stage (where it learns about the model) and an Online Stage (where it makes real-time decisions).

Figure 3: MOSEL constructs optimized strategies offline, then applies them online.

1. The Offline Stage: Profiling and Optimization

Before the model serves a single user, MOSEL performs rigorous profiling. It needs to understand the “menu” of options available.

Profiling

MOSEL runs the model through a validation dataset using every possible combination of modalities (e.g., Audio-only, Video-only, Audio+Video) and various batch sizes. It records two key metrics for each combination: Latency (how long it takes) and Accuracy (how good the predictions are).
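
A sketch of what this profiling loop might look like (our reconstruction, not the authors’ code; `model` and `val_loader` are assumed helpers, and the modality names and batch sizes are illustrative):

```python
import itertools
import time
import torch

# Our reconstruction of MOSEL-style offline profiling. We assume
# `val_loader(batch_size=bs)` yields dicts holding one tensor per
# modality plus a "label" tensor, and that a CUDA GPU is available.
MODALITIES = ("audio", "video")
BATCH_SIZES = (1, 8, 32)

profile = {}
for r in range(1, len(MODALITIES) + 1):
    for combo in itertools.combinations(MODALITIES, r):
        for bs in BATCH_SIZES:
            latencies, correct, total = [], 0, 0
            for batch in val_loader(batch_size=bs):
                inputs = {m: batch[m] for m in combo}  # drop unused modalities
                torch.cuda.synchronize()               # accurate GPU timing
                start = time.perf_counter()
                preds = model(**inputs)
                torch.cuda.synchronize()
                latencies.append(time.perf_counter() - start)
                correct += (preds.argmax(-1) == batch["label"]).sum().item()
                total += batch["label"].numel()
            profile[(combo, bs)] = {
                "latency": sum(latencies) / len(latencies),  # mean per-batch
                "accuracy": correct / total,
            }
```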

Strategy Generation

Once the profiling is done, MOSEL needs to create a lookup table of Optimal Modality Selection Strategies.

If a job arrives with 100 requests and requires 80% accuracy, there are millions of ways to mix and match modalities for those 100 requests. We don’t want to calculate this at runtime.

MOSEL pre-computes these strategies using Non-Linear Integer Programming (NILP). The goal is to minimize total latency subject to specific constraints.

The objective is to minimize the total latency across the selected modality combinations \(i\) and batch sizes \(j\). Schematically, writing \(x_{ij}\) for the number of requests served with combination \(i\) at batch size \(j\), \(D_{ij}\) for its profiled latency cost, and \(A_i\) for its profiled accuracy:

\[
\min_{x} \; \sum_{i}\sum_{j} D_{ij}\, x_{ij}
\]

Subject to two main constraints:

\[
\sum_{i}\sum_{j} x_{ij} \;=\; |\mathcal{R}|
\]
\[
\frac{1}{|\mathcal{R}|}\sum_{i}\sum_{j} A_{i}\, x_{ij} \;\ge\; \alpha
\]

The first constraint ensures that the requests assigned across all strategies sum to the total size of the job \(|\mathcal{R}|\). The second ensures that the weighted average accuracy of the selected strategies meets or exceeds the target accuracy \(\alpha\).

By solving this offline, MOSEL creates a “Matrix” (as seen in Figure 3b). When a live job arrives, the system doesn’t need to do complex math; it simply looks up the best pre-computed plan in the matrix.
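
Because the profiled accuracies enter the program as constants, a simplified version of it (assigning a modality combination to each request directly and ignoring batch-size choices) is a plain integer linear program that fits in an off-the-shelf solver. Here is a sketch using PuLP, with invented profiled numbers:

```python
import pulp

# A sketch with invented numbers; not the authors' implementation.
OPTIONS = {
    ("audio",): (5.0, 0.72),           # (latency per request in ms, accuracy)
    ("audio", "video"): (40.0, 0.76),
}
JOB_SIZE = 100     # |R|: requests in the job
TARGET_ACC = 0.74  # alpha: required average accuracy

prob = pulp.LpProblem("modality_selection", pulp.LpMinimize)
x = {opt: pulp.LpVariable(f"x_{i}", lowBound=0, cat="Integer")
     for i, opt in enumerate(OPTIONS)}

# Objective: total latency of the chosen mix.
prob += pulp.lpSum(OPTIONS[o][0] * x[o] for o in OPTIONS)
# Every request is assigned to exactly one strategy.
prob += pulp.lpSum(x[o] for o in OPTIONS) == JOB_SIZE
# The weighted average accuracy must meet the target.
prob += pulp.lpSum(OPTIONS[o][1] * x[o] for o in OPTIONS) >= TARGET_ACC * JOB_SIZE

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print({o: int(x[o].value()) for o in OPTIONS})
```

With these particular numbers the solver splits the job evenly: 50 requests run audio-only and 50 run both modalities, which is exactly the mixed-strategy behavior shown in Plan 3 of Figure 2.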

2. The Online Stage: Dynamic Execution

The online component is where MOSEL handles the pressure of real-time traffic.

Figure 4: MOSEL Workflow diagram showing the Profiler, Strategy Generator, and runtime components.

As shown in Figure 4, the workflow involves a Monitor Process and a Worker Process.

  1. Job Arrival: Jobs arrive and are placed in a queue.
  2. Default Assignment: Initially, every job is assigned the highest-accuracy strategy available.
  3. Violation Detection: MOSEL constantly monitors the queue. It looks at the estimated finish time of the last job in the queue.
  4. Dynamic Adjustment: If MOSEL detects that a job is going to miss its deadline (a “violator”), it triggers an optimization routine.

The goal of this online optimization is to “squeeze” the jobs currently in the queue. It looks for alternative strategies for the preceding jobs that are faster but still meet their individual accuracy requirements.
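
A sketch of the monitor’s violation check (our reconstruction, not the authors’ code; the numbers loosely echo the Figure 2 scenario):

```python
from dataclasses import dataclass

# Each queued job carries the estimated latency of its currently assigned
# strategy and an absolute deadline on the same clock as `now`.

@dataclass
class QueuedJob:
    name: str
    est_latency: float
    deadline: float

def first_violator(queue, now):
    """Walk the FIFO queue accumulating estimated finish times; return the
    first job projected to finish after its deadline, else None."""
    finish = now
    for job in queue:
        finish += job.est_latency
        if finish > job.deadline:
            return job
    return None

# Job 2 fits (10 + 120 = 130 <= 140), but Job 3 does not (130 + 30 = 160 > 150),
# so the monitor flags Job 3 and triggers the adjustment described below.
queue = [QueuedJob("job2", est_latency=120, deadline=140),
         QueuedJob("job3", est_latency=30, deadline=150)]
print(first_violator(queue, now=10).name)   # -> job3
```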

This dynamic adjustment is formulated as another optimization problem. The system tries to maximize the average accuracy for all requests in the queue:

\[
\max \;\; \frac{\sum_{k}\sum_{i}\sum_{j} A_{i}\, x^{k}_{ij}}{\sum_{k} |\mathcal{R}_k|}
\]

where \(k\) indexes the jobs currently in the queue and \(x^{k}_{ij}\) assigns job \(k\)’s requests to modality combination \(i\) at batch size \(j\).

While ensuring that the total latency of all strategies fits within the time budget \(T\) (the time remaining until the deadline):

\[
\sum_{k}\sum_{i}\sum_{j} D_{ij}\, x^{k}_{ij} \;\le\; T
\]

If solving this optimization problem takes too long (the integer programming solver can take up to 70ms), MOSEL falls back to a Greedy Heuristic. It randomly picks jobs in the queue and switches them to faster strategies until the timing works out. This ensures the system itself doesn’t become the bottleneck.
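
A sketch of that fallback (our reconstruction; we assume each job’s strategy menu is pre-filtered so every entry already satisfies that job’s own accuracy floor, sorted from slowest/most accurate to fastest):

```python
import random

def greedy_squeeze(jobs, budget):
    """jobs maps a name to {"menu": [(latency, accuracy), ...], "idx": int},
    with menus sorted slowest-first and idx starting at 0 (highest accuracy).
    Randomly downgrade jobs to faster strategies until the queue fits in the
    remaining time budget; returns True on success."""
    def total():
        return sum(j["menu"][j["idx"]][0] for j in jobs.values())

    candidates = list(jobs)
    while total() > budget and candidates:
        name = random.choice(candidates)
        job = jobs[name]
        if job["idx"] + 1 < len(job["menu"]):
            job["idx"] += 1                  # switch to the next faster strategy
        else:
            candidates.remove(name)          # already at its fastest option
    return total() <= budget

jobs = {"job1": {"menu": [(100, 0.76), (40, 0.74), (10, 0.72)], "idx": 0},
        "job2": {"menu": [(80, 0.75), (15, 0.73)], "idx": 0}}
print(greedy_squeeze(jobs, budget=60), {n: j["idx"] for n, j in jobs.items()})
```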

Evaluation and Results

The researchers implemented MOSEL in Python using PyTorch and evaluated it on NVIDIA A100 GPUs. They tested it against five different multimodal models (including TVLT, ViLT, and TBN) using realistic workloads derived from Twitter traces.

Does it improve throughput?

The primary goal of MOSEL is to handle more traffic without crashing.

Figure 5: Throughput and SLO violation ratio comparisons.

Figure 5 compares different policies:

  • Blue (None): The standard approach, which always uses all modalities.
  • Red (Optimized): MOSEL’s dynamic approach.

Look at the top row (Throughput). The red line consistently handles higher request rates. For the TVLT model (column ‘a’), MOSEL achieves \(5.3\times\) higher throughput than the baseline.

More importantly, look at the bottom row (SLO Violation Ratio). This measures how often the system failed to meet a deadline. The blue bars (baseline) are high, meaning frequent failures. The red bars (MOSEL) are significantly lower. For TVLT, the violation ratio drops from nearly 70% to almost zero. This proves that dynamically dropping expensive modalities prevents the system from getting overwhelmed during traffic spikes.

Does accuracy suffer?

You might worry that dropping modalities ruins the user experience.

Figure 6: Accuracy distribution of TVLT.

Figure 6 shows the accuracy distribution. While the “None” strategy (using all modalities) has a tight distribution at the very top, the “Optimized” strategy (MOSEL) maintains an average accuracy that is almost identical (around 0.740). The distribution is wider—meaning some requests got lower accuracy predictions—but the average quality delivered to the user remained high.

Can it work with other optimizations?

MOSEL is an input-level optimization. Can it work alongside model-level optimizations like quantization (using 16-bit floating point numbers instead of 32-bit)?
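
For reference, casting a PyTorch model to FP16 for inference is nearly a one-liner (a minimal sketch; `model` stands in for any float32 `nn.Module`, and a CUDA GPU is assumed):

```python
import torch

model = model.half().cuda().eval()   # cast weights to float16 on the GPU

with torch.inference_mode():
    x = torch.randn(8, 128, device="cuda", dtype=torch.float16)
    logits = model(x)                # activations stay in FP16 end to end
```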

Figure 7: SLO violation ratio using FP32, FP16, and dynamic modality selection combined.

Figure 7 shows the breakdown.

  • Blue: FP32 (Standard)
  • Orange: FP16 (Quantization only)
  • Green: Dynamic + FP16 (MOSEL + Quantization)

In almost every graph, the Green line performs the best. It stays at 0.0 violation ratio for higher Query Per Second (QPS) loads than the other methods. This confirms that MOSEL is complementary to existing techniques; you don’t have to choose between them.

Is it robust to errors?

Profiling isn’t always perfect. Sometimes the estimated latency of a modality might differ from the actual execution time on the GPU.

Figure 8: Throughput and accuracy under latency estimation errors.

Figure 8 tests MOSEL’s resilience. The x-axis represents the discrepancy between estimated and actual latency.

  • Left Chart: Normalized throughput. Even if the estimation is off by 50% (x-axis at 1.5), the throughput (y-axis) remains stable for most models.
  • Right Chart: Accuracy. The accuracy distribution also remains stable across a wide range of estimation errors.

This robustness is crucial for production systems, where GPU clock speeds and thermal throttling can cause unpredictable variations in execution time.

Conclusion

The explosive growth of multimodal models presents a crisis of resources. We simply cannot afford to process every pixel of video and every millisecond of audio for every single query, especially when accuracy requirements can often be met with less data.

MOSEL introduces a paradigm shift in inference serving. Rather than treating the input data as a fixed requirement, it treats it as a flexible variable. By creating a “market” of modality strategies—trading small amounts of accuracy for large gains in latency—MOSEL allows inference servers to survive heavy loads that would otherwise crash them.

Key takeaways from the MOSEL paper:

  1. Modality Selection works: You can often drop expensive modalities (like video) and rely on cheaper ones (like audio) with minimal accuracy loss.
  2. Offline Profiling is key: Pre-computing the best strategies allows for instant decision-making during runtime.
  3. Global Optimization matters: Optimizing a single job in isolation is dangerous; you must optimize the entire queue to prevent starvation of future jobs.

As we move toward even larger “foundation models” that accept text, image, video, and sensory data, systems like MOSEL will likely become standard components of the AI infrastructure stack, ensuring that our desire for smarter AI doesn’t outpace our ability to serve it.