Introduction

In the world of Large Language Models (LLMs), we are constantly battling the “Scaling Laws.” The rule of thumb has generally been: if you want a smarter model, you need a bigger model. However, bigger models come with a steep price tag—they require massive computational power (FLOPs) and huge amounts of video memory (VRAM).

To solve the computational problem, researchers turned to Mixture-of-Experts (MoE) architectures (like Mixtral 8x7B or DeepSeek-MoE). MoE models are clever; they have many parameters but only use a small fraction of them for each token generated. This keeps inference fast and cheap in terms of calculation.

But there is a catch. While MoE reduces computation, it doesn’t reduce the memory requirement. You still need to store all those billions of parameters. For students and researchers with limited GPU resources, this often means relying on “offloading”—storing the model on system RAM or disk and swapping parts into the GPU as needed. As you might guess, this swapping process creates a massive bottleneck, killing inference speed.

Enter Mixture of Lookup Experts (MoLE), a novel architecture presented at ICML 2025. MoLE proposes a radical shift: what if we could design experts that require zero computation during inference? By converting complex neural networks into simple lookup tables, MoLE achieves the high performance of MoE models with the speed and low memory footprint of dense models.

In this post, we will tear down the MoLE paper, explain how it turns neural networks into static tables, and look at the data proving it works.

The Problem: The VRAM Bottleneck

To understand why MoLE is necessary, we first need to look at the limitations of standard Mixture-of-Experts (MoE) models.

In a standard MoE, a “Router” decides which “Experts” (usually Feed-Forward Networks or FFNs) handle a specific input token. For example, in Mixtral-8x7B, there are 8 experts, but only 2 are active for any given token. This sparse activation is why MoE is computationally efficient.

However, the memory cost is a different story. Even though you only use 2 experts per token, you must store all 8. If the model is larger than your GPU's VRAM (e.g., an 80GB model on a 24GB consumer card), you have to use Expert Offloading.

Expert Offloading keeps the active experts in VRAM and stores the rest on the CPU RAM or SSD. When the router picks an expert that isn’t on the GPU, the system must pause, fetch the weights from the CPU/Disk, load them, and then proceed.

This introduces two major issues:

  1. High Latency: Moving data over PCIe is slow compared to computing on the GPU.
  2. Batching Issues: If you are generating text for 32 users at once, they might all need different experts. You might end up needing to load all experts into VRAM every step, defeating the purpose of offloading.
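
To make the batching problem concrete, here is a toy PyTorch sketch (illustrative only, with made-up dimensions and an untrained router, not code from any real MoE implementation) that routes a batch through a top-2-of-8 router and counts how many distinct experts would have to be resident on the GPU:

```python
# Toy sketch: top-2 routing over 8 experts, to show how many distinct
# experts a batch touches as the batch grows.
import torch

num_experts, top_k, d_model = 8, 2, 64          # assumed toy dimensions
router = torch.nn.Linear(d_model, num_experts)  # router: hidden state -> expert scores

def experts_needed(hidden_states: torch.Tensor) -> int:
    """Count the distinct experts selected anywhere in the batch."""
    scores = router(hidden_states)                   # (batch, num_experts)
    _, topk_idx = torch.topk(scores, top_k, dim=-1)  # (batch, top_k)
    return topk_idx.unique().numel()

for batch_size in (1, 4, 32):
    h = torch.randn(batch_size, d_model)  # stand-in for per-token hidden states
    print(batch_size, experts_needed(h))
# At batch_size=32, most (often all) of the 8 experts get selected somewhere,
# so an offloading system ends up paging nearly every expert in every step.
```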

The chart below summarizes the current landscape.

Figure 1: Comparison of Dense, MoE, MoE with offloading, and MoLE.

As shown in Figure 1, notice the trade-offs:

  • Dense Model (Orange): Fast (low latency) and low VRAM, but lower accuracy.
  • MoE (Cyan): High accuracy, but massive VRAM usage.
  • MoE Offloading (Blue): Solves the VRAM issue, but latency skyrockets (10x slower).
  • MoLE (Green, the paper's method): The “Goldilocks” zone. High accuracy, low VRAM, and low latency.

How does MoLE achieve this? By fundamentally changing what an “expert” does.

The Core Method: Mixture of Lookup Experts

The researchers realized that the bottleneck in offloading isn’t just the size of the experts, but the fact that experts are functions that require computation. You have to load the weights to perform matrix multiplication.

MoLE asks: Can we pre-compute the answers?

If we could pre-calculate the output of an expert for every possible input, we wouldn’t need to load the expert’s weights at all. We would just look up the answer in a table.

Architectural Changes

To make pre-computation possible, MoLE introduces a constraint during training that pays off during inference.

In a standard Transformer, the input to an expert is the “hidden state”—intermediate features that change dynamically based on the entire context of the sentence. Because every context is different, the set of possible hidden states is effectively infinite, and you cannot pre-compute answers for an unbounded set of inputs.

MoLE changes the input. Instead of the dynamic hidden state, MoLE experts take the raw token embedding as input.

Figure 2: The MoLE architecture compared to standard MoE.

As seen in Figure 2:

  • Left (Standard MoE): The Router and FFNs take the output of the previous Attention layer.
  • Middle (MoLE): The Router still looks at the dynamic context, but the Experts (FFNs) take the “Embedding” (the raw vector representation of the word) as input.

Why does this matter? Because the number of distinct token embeddings is finite. It is exactly equal to the vocabulary size (e.g., 32,000 or 50,000 tokens).

Training Phase

During training, MoLE behaves somewhat like a standard MoE, but with two key differences:

  1. Input Source: As mentioned, experts process the embedding \(e\), not the hidden state \(h\).
  2. Full Activation: Standard MoE only runs the top-k experts for each token. MoLE activates all experts for every token during training. This avoids “expert collapse” (where the router funnels most tokens to a few experts while the rest go under-trained) and removes the need for auxiliary load-balancing losses.

The equation for the final output \(h'\) of a MoLE layer during training looks like this:

\[
h' = h + \sum_{j=1}^{N} g_j \cdot \mathrm{FFN}_j(e)
\]

Here, \(g_j\) is the gating weight produced by the router, and \(\mathrm{FFN}_j(e)\) is expert \(j\) processing the embedding \(e\). Notice that the router still uses the dynamic hidden state \(h\) to decide how much weight each expert gets, preserving context sensitivity.
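
To make this concrete, here is a minimal PyTorch sketch of how a MoLE layer could behave at training time, based on my reading of the description above; the dimensions are made up, the softmax gate is an assumption, and the shared expert mentioned later in this post is omitted for brevity:

```python
# Minimal training-time sketch of a MoLE layer (not the authors' code).
# The router scores the contextual hidden state h, while every lookup
# expert processes the context-free token embedding e.
import torch
import torch.nn as nn

class MoLELayerTrain(nn.Module):
    def __init__(self, d_model=64, d_expert=128, num_experts=4):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_expert), nn.GELU(),
                          nn.Linear(d_expert, d_model))
            for _ in range(num_experts)
        )

    def forward(self, h, e):
        # h: (batch, d_model) contextual hidden state
        # e: (batch, d_model) raw token embedding
        g = torch.softmax(self.router(h), dim=-1)                       # gates come from the context
        expert_out = torch.stack([f(e) for f in self.experts], dim=1)   # all experts active
        return h + (g.unsqueeze(-1) * expert_out).sum(dim=1)            # h' = h + sum_j g_j * FFN_j(e)
```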

Inference Phase: The Transformation to LUTs

This is where the magic happens. Once training is done, the experts (FFNs) are frozen. Because the experts only take embeddings as inputs, and the embeddings for the vocabulary never change, the output of every expert for every word is deterministic.

We can define the output vector \(v\) for expert \(j\) and word ID \(i\) as:

\[
v_{j,i} = \mathrm{FFN}_j(e_i), \qquad i \in \{1, \dots, |\mathcal{V}|\}
\]

Since the vocabulary size \(|\mathcal{V}|\) is finite, we can calculate this vector for every single word in the dictionary and store it. The neural network expert is replaced by a Lookup Table (LUT):

\[
\mathrm{LUT}_j = \left[\, v_{j,1},\ v_{j,2},\ \dots,\ v_{j,|\mathcal{V}|} \,\right]
\]

During inference, we throw away the expert weights. We don’t do matrix multiplication for the experts. Instead, we use the input ID to retrieve the pre-computed vectors from the LUT.

The inference calculation simplifies to:

\[
h' = h + \sum_{j=1}^{N} g_j \cdot \mathrm{LUT}_j[i] = h + \sum_{j=1}^{N} g_j \cdot v_{j,i}
\]

where \(i\) is the ID of the current input token.

The experts are now “computation-free.”
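
Continuing the training-time sketch from the previous section (same caveats: illustrative code and assumed shapes, not the authors' released implementation), the conversion to lookup tables and the inference-time lookup could look like this:

```python
# Sketch of the post-training conversion: each frozen expert is evaluated on
# every vocabulary embedding once, then discarded in favour of its lookup table.
import torch

@torch.no_grad()
def build_luts(layer, embedding_table):
    # embedding_table: (vocab_size, d_model) -- the frozen input embeddings.
    # Returns one (vocab_size, d_model) table per expert.
    return [expert(embedding_table) for expert in layer.experts]

@torch.no_grad()
def mole_inference(h, token_ids, router, luts):
    # h: (batch, d_model) hidden state; token_ids: (batch,) input token IDs.
    g = torch.softmax(router(h), dim=-1)                      # (batch, num_experts)
    # Gather the pre-computed expert outputs for each token ID -- no expert matmuls.
    v = torch.stack([lut[token_ids] for lut in luts], dim=1)  # (batch, num_experts, d_model)
    return h + (g.unsqueeze(-1) * v).sum(dim=1)
```

In a real deployment, the tensors in `luts` would live in CPU RAM or on an SSD, and only the handful of rows indexed by the current token IDs would be fetched per step; that tiny transfer is exactly what the next section quantifies.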

Why This is More Efficient

You might be thinking, “Wait, isn’t a table of every expert answer for every word huge?”

Yes, the Lookup Table is large. However, it can be stored in CPU RAM or on an SSD (offloading). The efficiency gain comes from communication bandwidth: what matters is how much data must move per token, not how much sits in storage.
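
A quick back-of-the-envelope estimate shows the scale involved; the vocabulary size, expert count, hidden size, layer count, and precision below are illustrative assumptions, not the paper's configuration:

```python
# Rough LUT footprint: one d_model-sized vector per (token, expert, layer).
vocab_size, num_experts, d_model, num_layers = 32_000, 4, 2048, 24  # assumed values
bytes_per_value = 2  # fp16
lut_bytes = vocab_size * num_experts * d_model * num_layers * bytes_per_value
print(f"Total LUT storage: {lut_bytes / 1e9:.1f} GB")  # ~12.6 GB with these numbers
```

A table of this size is hopeless for a consumer GPU's VRAM but entirely reasonable for system RAM or an NVMe drive.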

Let’s look at the complexity comparison:

Table 1: Complexity comparison of Dense, MoE, and MoLE.

Focus on the “# Param Loaded per Token” column:

  • MoE Offloading: In the worst case (a cache miss for every selected expert), you have to load the full weight matrices of the chosen experts—\(2 \cdot d \cdot k \cdot D_r\) parameters.
  • MoLE: You only load the pre-computed result vectors of the experts—\(d \cdot N\) parameters, i.e., one \(d\)-dimensional vector per expert.

The difference is staggering. Loading a vector is orders of magnitude faster than loading the weights required to create that vector. The paper notes that for a 1B parameter model, the data loaded per token in MoLE is roughly 1/2000th the size of the data loaded for standard MoE.
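
Plugging illustrative numbers into the two formulas makes the gap tangible; \(d\), \(D_r\), \(k\), and \(N\) below are assumptions rather than the paper's exact configuration, so the resulting ratio will not match the paper's figure exactly:

```python
# Per-token data movement using the Table 1 formulas, with assumed dimensions.
d, D_r, k, N = 2048, 5504, 2, 4   # hidden size, expert inner size, MoE top-k, MoLE experts
moe_loaded  = 2 * d * k * D_r     # weight matrices of the k selected experts
mole_loaded = d * N               # one pre-computed d-dim vector per expert
print(moe_loaded, mole_loaded, moe_loaded // mole_loaded)
# -> 45088768 8192 5504: the exact ratio depends on the configuration,
#    but it lands in the thousands either way.
```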

Because the data transfer is so small, the latency penalty for keeping the LUT on a slow storage device (like an SSD) becomes negligible.

Experiments and Results

The theory sounds solid, but does swapping dynamic hidden states for static embeddings hurt the model’s intelligence? The researchers tested MoLE against Dense models and standard MoE models across various sizes (160M, 410M, and 1B parameters) on the Pile dataset.

Accuracy vs. Complexity

Table 3: Performance comparison across benchmarks.

Table 3 highlights the key findings:

  1. MoLE vs. Dense: MoLE consistently outperforms the Dense model (Orange vs Green rows) while maintaining similar inference latency.
  2. MoLE vs. MoE: MoLE achieves performance comparable to, and often better than, standard MoE. For example, in the 410M category, MoLE-16E (16 experts) achieves an average score of 45.7, while MoE-10E achieves 43.9.
  3. Data Transfer: Look at the “Param Loaded per Token” column. For the 1B model, MoLE loads 0.26M parameters per token, whereas MoE loads 537M. This massive reduction explains why MoLE is so fast even when offloaded.
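
Putting those two numbers side by side is a quick sanity check on the earlier claim: \(537\text{M} / 0.26\text{M} \approx 2065\), which lines up with the “roughly 1/2000th” figure quoted in the previous section.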

Decoding Latency

The most critical metric for user experience is latency—how long it takes to generate the next word.

Figure 3: Decoding latency at different batch sizes.

Figure 3 shows the latency (in milliseconds) as batch size increases.

  • MoE (Blue bars): As batch size increases (from 1 to 32), the latency explodes. This is because different sequences in the batch request different experts, forcing the system to load almost all experts from memory.
  • MoLE (Green bars): The latency remains flat and low, almost identical to the Dense model (Orange). Because fetching vectors is cheap, increasing the batch size doesn’t clog the bandwidth.

Addressing the “Context” Trade-off

One valid critique of MoLE is that by feeding experts only the token embedding, the experts lose “context.” They don’t see the words that came before.

The authors address this in their ablation study (Table 7 in the paper). They found that switching to embedding inputs does cause a slight performance drop, but the ability to activate all experts (affordable precisely because they are computation-free at inference) more than compensates for it.

Furthermore, context isn’t lost entirely. The Router still sees the full context (hidden states), so it can dynamically decide which static expert result is most relevant for the current context. The Shared Expert and Attention layers also continue to process full contextual information.

Conclusion and Implications

The “Mixture of Lookup Experts” (MoLE) paper introduces a clever engineering workaround to the hardware limitations of modern AI. By realizing that we can trade storage (large Lookup Tables on cheap SSDs) for computation and bandwidth (expensive VRAM and PCIe transfer), MoLE makes powerful MoE architectures accessible on consumer-grade hardware.

Key Takeaways:

  1. Computation-Free Experts: Experts are converted into static Lookup Tables (LUTs) after training.
  2. Massive Bandwidth Savings: MoLE reduces the data transfer required for offloading by thousands of times compared to standard MoE.
  3. High Performance: It retains the accuracy of MoE models while matching the speed of dense models.
  4. Batch-Friendly: Unlike standard expert offloading, MoLE scales effortlessly with batch size.

For students and developers working with Edge AI or limited GPU clusters, MoLE represents a promising direction. It suggests that the future of efficient LLMs might not just be about making models smaller, but about rethinking when and where the computation happens.