Introduction
The current landscape of Artificial Intelligence is dominated by a “bigger is better” mentality. We train massive Large Language Models (LLMs) on trillions of tokens, hoping they learn everything from Python coding to French poetry. However, this monolithic approach has a downside: when we want the model to learn a new task, we often have to retrain or fine-tune the whole system—or at least a significant part of it. This is computationally expensive and rigid.
Enter Modular Deep Learning. Instead of one giant, unchangeable brain, imagine an LLM as a core processor that can plug in different “cartridges” or modules depending on the task at hand. Need to solve a math problem? Plug in the math module. Translating to German? Plug in the German module.
This modular dream relies on Zero-Shot Transfer—the ability to take a module trained on one set of tasks and successfully apply it to a completely new, unseen task without extra training. However, there is a hidden problem in this architecture: Entanglement.
When we train a specialized module (e.g., for summarizing news), that module doesn’t just learn “summarization.” It also inevitably relearns basic English grammar, sentence structure, and general facts—things the base LLM already knows. This “general knowledge” becomes redundant noise. It bloats the module and makes it harder for the system to decide which module is actually the best fit for a specific problem.
In this post, we will dive deep into GenKnowSub (General Knowledge Subtraction), a novel technique proposed by researchers from the University of Tehran. They hypothesize that by mathematically subtracting this redundant general knowledge from task-specific modules, we can create sharper, more effective tools that significantly boost performance.
Background: The Modular LLM Landscape
Before we unpack the new method, let’s establish the foundational concepts that GenKnowSub builds upon.
The Rise of PEFT and LoRA
Fine-tuning a 70-billion parameter model is a nightmare for GPU memory. To solve this, the industry adopted Parameter-Efficient Fine-Tuning (PEFT). The most popular form of this is LoRA (Low-Rank Adaptation).
Think of LoRA as a lightweight set of “diff” weights. Instead of changing the model’s original brain, LoRA trains a tiny adapter layer that sits on top. It says, “For this task, adjust the neuron activations slightly in this direction.” LoRAs are small, portable, and easy to swap.
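To make the "diff weights" idea concrete, here is a minimal, self-contained sketch of a LoRA-style layer in plain PyTorch. It is illustrative only; real implementations such as the peft library handle initialization, dropout, and merging more carefully:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank 'diff' (illustrative sketch,
    not the official peft implementation)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                       # the original weights stay untouched
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # base output + a small, task-specific low-rank correction
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(1024, 1024), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # ~16k trainable parameters vs. ~1M frozen ones
```

Because only the tiny A and B matrices are trained, an adapter for a new task is a few megabytes rather than a full copy of the model.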
The Routing Problem
If you have a library of 50 different LoRA modules—one for science, one for history, one for coding, etc.—how does the model know which one to use for a specific prompt?
This is where Routing comes in. A routing algorithm analyzes the input (often token by token) and dynamically selects the best module. The paper we are discussing uses a state-of-the-art routing method called Arrow.
- Arrow Routing: This algorithm looks at the input token and compares it against the “signature” of the available modules. It selects the top-\(k\) most relevant modules and combines their outputs.
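Here is a simplified sketch of what Arrow-style routing could look like at one layer. The exact way Arrow builds module signatures (from the LoRAs' singular vectors) and scores tokens is more involved; the prototype vectors, the absolute dot-product score, and the top-\(k\) softmax below are assumptions made for illustration:

```python
import torch

def arrow_route(hidden: torch.Tensor, prototypes: torch.Tensor, k: int = 2):
    """Simplified Arrow-style routing for one layer (scoring details are assumptions).

    hidden:     (seq_len, d)      per-token hidden states
    prototypes: (n_modules, d)    one "signature" vector per LoRA module
    Returns, for each token, the indices of the top-k modules and their mixing weights.
    """
    scores = (hidden @ prototypes.T).abs()        # (seq_len, n_modules) relevance of each expert
    top = scores.topk(k, dim=-1)                  # keep only the k most relevant experts per token
    coeffs = torch.softmax(top.values, dim=-1)    # normalize into mixing coefficients
    return top.indices, coeffs

tokens = torch.randn(5, 64)    # 5 tokens, hidden size 64
protos = torch.randn(10, 64)   # signatures for a library of 10 modules
idx, c = arrow_route(tokens, protos)
print(idx.shape, c.shape)      # torch.Size([5, 2]) torch.Size([5, 2])
```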
The Problem: Entangled Knowledge
Here lies the core issue the authors address. When you train a LoRA on a “History” dataset, the LoRA learns two things:
- Task-Specific Knowledge: Dates of battles, names of kings, historical causality.
- General Knowledge: How to write a coherent sentence, common words like “the” or “and,” and general reasoning.
The base model (like Phi-3 or Llama-3) already possesses the second kind of knowledge. Having the LoRA relearn it creates redundancy. When the Router tries to pick a module, this general knowledge acts like static noise. If every module contains general English knowledge, they all look somewhat similar to the Router, making it harder to pick the distinct “History” expert.
Core Method: GenKnowSub
The researchers propose a method called GenKnowSub to disentangle these two types of knowledge. The intuition is elegant: if we can capture a representation of “General Knowledge” and subtract it from our “Task Expert,” what remains should be pure, concentrated task expertise.
Let’s break down the architecture, visualized below.

As shown in Figure 1 (a), the process begins with two parallel training tracks.
Step 1: Training the Modules
- Task-Specific LoRAs (\(LoRA_{ts}\)): The model is fine-tuned on specific clusters of tasks (like the Flan dataset). These are the standard experts we want to use.
- General Knowledge LoRA (\(LoRA_g\)): This is the clever addition. The researchers train a separate LoRA on a general corpus—specifically, Wikipedia. The goal here is to create a module that represents “generic linguistic and factual capabilities” without being tied to a specific reasoning task.
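As a rough picture of how such a general-knowledge adapter could be set up, here is a minimal sketch using the Hugging Face peft library. The rank, alpha, and target module names are illustrative assumptions, not the paper's settings:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Illustrative setup only: rank, alpha, and target module names are assumptions,
# not the hyperparameters reported in the paper.
base = AutoModelForCausalLM.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
general_cfg = LoraConfig(r=16, lora_alpha=32,
                         target_modules=["qkv_proj", "o_proj"],
                         task_type="CAUSAL_LM")
lora_g = get_peft_model(base, general_cfg)
lora_g.print_trainable_parameters()
# lora_g would then be trained with the ordinary next-token-prediction objective
# on a general corpus such as a Wikipedia dump, yielding the general-knowledge adapter.
```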
Step 2: The Subtraction (Forget via Negation)
Once both types of modules are trained, the authors perform arithmetic on the weights of the neural network. They apply a principle known as “forgetting via negation.”
They define a Residual LoRA (\(LoRA_{res}\)) as the result of subtracting the General Knowledge LoRA from the Task-Specific LoRA:

\[
LoRA_{res}^{i} = LoRA_{ts}^{i} - LoRA_{g}
\]
In this equation:
- \(LoRA_{ts}^i\) is the module trained on Task \(i\).
- \(LoRA_g\) is the generic Wikipedia module.
- \(LoRA_{res}^i\) is the new, refined module.
By performing this subtraction, the researchers are effectively removing the “average” linguistic information that the task module relearned. The remaining weights in \(LoRA_{res}\) represent the delta—the unique information required for that specific task that isn’t just general language capability.
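In code, the subtraction could look roughly like the toy sketch below. We assume here that "forgetting via negation" acts on the merged low-rank updates (\(\Delta W = BA\)); the paper's exact parameterization and any rescaling may differ:

```python
import torch

def genknowsub(A_ts, B_ts, A_g, B_g, scale: float = 1.0):
    """Sketch of 'forgetting via negation': subtract the general-knowledge LoRA from a
    task LoRA. We assume the operation acts on the merged low-rank updates
    (delta_W = B @ A); the paper's exact parameterization may differ."""
    delta_task = B_ts @ A_ts               # (out, in) task-specific update
    delta_gen = B_g @ A_g                  # (out, in) general-knowledge update
    return delta_task - scale * delta_gen  # residual update: task expertise minus shared knowledge

# Toy shapes: hidden size 64, LoRA rank 8
A_ts, B_ts = torch.randn(8, 64), torch.randn(64, 8)
A_g,  B_g  = torch.randn(8, 64), torch.randn(64, 8)
delta_res = genknowsub(A_ts, B_ts, A_g, B_g)
print(delta_res.shape)  # torch.Size([64, 64]); rank <= 16, and could be re-factored (e.g. via SVD)
```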
Step 3: Dynamic Adaptation with Arrow
Now that we have a library of these “cleaned” Residual LoRAs, we need to use them. This happens during inference (Figure 1b).
The system uses the Arrow routing algorithm to decide which modules to activate. For every token in the input sequence (e.g., every word in your prompt), the model calculates a weighted sum of the available residual modules, producing a combined update \(\Delta W_t^{l}\) for token \(t\) at layer \(l\):

\[
\Delta W_t^{l} = \sum_{i} c_t^{i,l} \cdot LoRA_{res}^{i,l}
\]
Here:
- \(c_t^{i,l}\) is the coefficient (importance score) calculated by the Arrow router for token \(t\).
- \(LoRA_{res}^{i,l}\) is our cleaned residual module.
Because the residual modules are less “noisy” with general knowledge, the hypothesis is that the router can make more precise decisions, and the combination of modules will be less redundant.
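Putting the routing coefficients and residual modules together, a toy version of the per-token mixture at a single layer might look like this. The shapes and the dense \(\Delta W\) representation are simplifications for illustration:

```python
import torch

def arrow_mixture(hidden, residual_deltas, coeffs, indices):
    """Apply the Arrow-weighted mixture of residual LoRAs at one layer (toy sketch).

    hidden:          (seq_len, d_in)            token hidden states
    residual_deltas: (n_modules, d_out, d_in)   merged residual LoRA updates
    coeffs:          (seq_len, k)               per-token routing coefficients c_t
    indices:         (seq_len, k)               which k modules each token selected
    """
    outputs = []
    for t in range(hidden.shape[0]):
        # weighted sum of the selected residual modules for this token
        delta_t = (coeffs[t][:, None, None] * residual_deltas[indices[t]]).sum(dim=0)
        outputs.append(delta_t @ hidden[t])
    return torch.stack(outputs)  # (seq_len, d_out), added on top of the base layer's output

h = torch.randn(5, 64)                          # 5 tokens, hidden size 64
deltas = torch.randn(10, 64, 64)                # 10 residual modules as dense updates
c = torch.softmax(torch.randn(5, 2), dim=-1)    # coefficients from the Arrow router
idx = torch.randint(0, 10, (5, 2))              # top-2 module choices per token
print(arrow_mixture(h, deltas, c, idx).shape)   # torch.Size([5, 64])
```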
Experiments & Results
To validate this theory, the authors extensively tested GenKnowSub against several baselines. They primarily used Phi-3-mini (a 3.8-billion-parameter model known for its strong reasoning capabilities) as the base model.
The Baselines:
- Phi-3 Base: The model with no modules.
- Shared: A single LoRA trained on all tasks (no modularity).
- Arrow: The standard modular approach without subtraction.
- Mean Norm: A naive ablation where they subtract the average of all task modules instead of a dedicated General Knowledge module.
Performance on English Reasoning Benchmarks
The first test ground was a suite of 9 standard English reasoning datasets, covering physics (PIQA), logic (BoolQ), and commonsense (HellaSwag).

Key Takeaways from Table 1:
- Modularity Wins: Both Arrow and GenKnowSub beat the “Shared” baseline. Specialized modules are better than one generalist adapter.
- Subtraction Works: GenKnowSub (specifically using the Average general LoRA) achieves an average score of 67.17%, compared to 65.56% for standard Arrow. A gain of roughly 1.6 percentage points might sound small, but in the world of zero-shot generalization on fixed benchmarks, this is a consistent and meaningful improvement.
- Specific vs. Average: Notice the “Mean Norm” row (62.25%). Simply subtracting the mathematical average of the task experts actually hurt performance. This shows that you cannot subtract just any aggregate of weights; you must subtract a module that specifically represents General Knowledge (the Wikipedia LoRA).
Multilingual Generalization
The results become even more interesting when looking at cross-lingual transfer. Can a model trained primarily on English tasks use these modules to solve problems in German or French?
The authors trained General Knowledge LoRAs for English (\(En\)), German (\(De\)), and French (\(Fr\)), and experimented with subtracting different ones.

Analysis of Table 2:
- Robust Gains: In both German and French, GenKnowSub significantly outperforms the baselines. For German, the average accuracy jumps from 42.09% (Standard Arrow) to 45.15% (GenKnowSub with German subtraction).
- The Power of Language-Specific Subtraction: Interestingly, subtracting the English general knowledge LoRA did not help much on German tasks, but subtracting the German general knowledge LoRA (derived from German Wikipedia) provided a substantial boost.
- Interpretation: This suggests that when the model tackles a German task with English-trained modules, the “Englishness” of those modules interferes. By subtracting the general linguistic features of the relevant language (German here, English in other settings), we leave behind a more language-agnostic “reasoning core” that transfers better across languages.
The Limits of the Method: The Phi-2 Experiment
To understand the boundaries of their method, the researchers also tested it on Phi-2, a slightly older and weaker model than Phi-3.

Looking at Table 4, the gains here are minimal or non-existent. Why? The authors attribute this to the base capabilities of Phi-2. Phi-2 is heavily English-focused and lacks the underlying multilingual competence of Phi-3. GenKnowSub relies on the base model having the latent knowledge to handle the task once the LoRA guidance is refined. If the base model fundamentally doesn’t understand German well, no amount of module subtraction will fix that.
This highlights a crucial constraint: GenKnowSub acts as a lens to focus the model’s existing capabilities; it cannot create capabilities that don’t exist.
Conclusion and Implications
The “GenKnowSub” paper offers a compelling narrative for the future of Modular AI. It challenges the assumption that training a module is purely an additive process. Sometimes, to learn a specific skill, we must explicitly “unlearn” or subtract the general context that surrounds it.
Summary of Key Contributions:
- Disentanglement: The paper successfully demonstrates that Task-Specific Knowledge and General Knowledge are entangled in LoRA modules, and separating them is beneficial.
- GenKnowSub Algorithm: A simple, computationally cheap method (linear subtraction of weights) that can be applied post-training without complex joint optimization.
- Cross-Lingual Boost: The method shines particularly well in transferring reasoning skills to new languages by removing language-specific general noise.
Broader Implications: For students and practitioners, this research suggests a shift in how we build libraries of adapters. We might move toward a future where we download a “clean” reasoning module from a hub, and because it has been scrubbed of its original training bias (general knowledge), it plugs seamlessly into our specific workflow—whether we are working in English, French, or Python.
By treating general knowledge as a distinct, subtractable component, we move one step closer to truly LEGO-like, composable Artificial Intelligence.