Imagine training an autonomous vehicle in sunny California. The car performs flawlessly, detecting pedestrians, other vehicles, and traffic signs with high precision. Then, you ship that same car to London during a rainy, foggy night. Suddenly, the system falters. The “domain shift”—the difference between the sunny training data and the rainy real-world environment—causes the model to fail.
This is the core challenge of Domain Generalization (DG): How do we build models that learn on one specific domain (source) but perform robustly on unseen, unpredictable domains (target)?
Traditionally, we tried to solve this by heavily augmenting data or forcing the model to ignore style differences. But recently, the game changed with the arrival of Vision Foundation Models (VFMs) like DINOv2 or EVA02. These massive models have seen billions of images; they already possess “world knowledge.” The new challenge isn’t teaching them to see—it’s adapting them to specific tasks without erasing that general knowledge.
In this post, we will dive deep into SoMA (Singular Value Decomposed Minor Components Adaptation), a novel research paper that proposes a smarter way to fine-tune these giants. Instead of retraining everything, SoMA asks a fundamental question: Which parts of a neural network hold general knowledge, and which parts hold specific details?

As shown in Figure 1, SoMA achieves state-of-the-art results by selectively tuning the “minor” components of the model while keeping the “major” components frozen. Let’s unpack how this works.
Background: The Fine-Tuning Dilemma
Before we get to the solution, we need to understand the problem with adapting Foundation Models.
When you have a massive pre-trained model (like DINOv2), you typically want to adapt it to a specific task, such as detecting cars. You have a few options:
- Full Fine-Tuning (FFT): You retrain every single parameter in the model. This is computationally expensive and often leads to catastrophic forgetting. The model gets so good at your specific dataset that it forgets the broad “world knowledge” it learned during pre-training.
- Parameter-Efficient Fine-Tuning (PEFT): You freeze the giant model and only train a tiny add-on module. A popular method is LoRA (Low-Rank Adaptation).
A Quick Refresher on LoRA
LoRA hypothesizes that the changes needed to adapt a model are “low-rank.” Instead of updating a massive weight matrix \(W\), LoRA injects two small matrices, \(B\) and \(A\), and trains those instead.
\[h = W_0 x + \Delta W x = W_0 x + BAx\]
Here, \(W_0\) is the frozen pre-trained weight. \(BA\) represents the update. This saves massive amounts of memory. However, LoRA initializes these matrices randomly (usually \(A\) is random Gaussian, \(B\) is zero). It doesn’t look at the structure of the pre-trained weights. It effectively flies blind, potentially interfering with the important features the model already knows.
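To make the "flying blind" point concrete, here is a minimal numpy sketch of LoRA's setup (the dimensions and rank are illustrative, not from any specific model):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 64, 8

W0 = rng.normal(size=(d_out, d_in))          # frozen pre-trained weight
A = rng.normal(scale=0.01, size=(r, d_in))   # LoRA: A starts as random Gaussian
B = np.zeros((d_out, r))                     # LoRA: B starts at zero

x = rng.normal(size=d_in)
h = W0 @ x + B @ (A @ x)                     # adapted forward pass

# Because B is zero, the update BA is zero: at initialization the adapted
# model behaves exactly like the frozen pre-trained model.
assert np.allclose(h, W0 @ x)
```

Note that nothing in this initialization looks at \(W_0\) itself; once training starts, the update \(BA\) can point in any direction, including ones the pre-trained weights rely on.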
The Insight: Analyzing Weights with SVD
The authors of SoMA took a step back. They wanted to understand where the generalizable knowledge lives inside a pre-trained weight matrix. To do this, they used Singular Value Decomposition (SVD).
SVD is a mathematical technique that breaks a matrix (\(W\)) down into three parts:
\[W = U \Sigma V^T\]
- \(U\) and \(V^T\): Rotation matrices (singular vectors) that define directions in the feature space.
- \(\Sigma\) (Sigma): A diagonal matrix of singular values, ordered from largest to smallest. These values represent the “strength” or “importance” of each direction.
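Both properties that SoMA relies on, the descending ordering of singular values and the exact reconstruction, are easy to check in a few lines of numpy (toy matrix size):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))   # toy stand-in for a weight matrix

# SVD factors W into rotations U, V^T and singular values S.
U, S, Vt = np.linalg.svd(W, full_matrices=False)

# numpy returns singular values sorted from largest to smallest.
assert np.all(S[:-1] >= S[1:])

# The three factors reconstruct W exactly.
assert np.allclose(U @ np.diag(S) @ Vt, W)
```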
The “General vs. Specific” Hypothesis
The researchers performed an experiment: What happens if we delete specific parts of the weights based on their singular values?

Figure 2 illustrates their profound finding:
- Large Singular Values (Principal Components): correspond to General Knowledge. These capture broad shapes, object classes, and high-level concepts. If you mess with these, the model forgets what a “dog” or “car” is entirely.
- Middle Singular Values: correspond to Coarse-grained Knowledge. Removing these hurts recognition of specific shapes or sub-categories.
- Small Singular Values (Minor Components): correspond to Context-Specific Knowledge. These handle fine-grained details, textures, and noise.
The Conclusion: To adapt a model to a new task without breaking its general reasoning capabilities, you should preserve the Principal Components (large values) and only modify the Minor Components (small values).
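A toy version of this deletion experiment can be run on any matrix. The sketch below (random matrix, arbitrary sizes, purely illustrative) shows why editing principal components is so destructive: zeroing the largest singular values changes the matrix far more than zeroing the smallest ones.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(32, 32))
U, S, Vt = np.linalg.svd(W)

def drop_components(idx):
    """Rebuild W with the singular values at `idx` zeroed out."""
    S_cut = S.copy()
    S_cut[idx] = 0.0
    return U @ np.diag(S_cut) @ Vt

# Deleting the 4 principal components distorts W far more than
# deleting the 4 minor components.
err_principal = np.linalg.norm(W - drop_components(slice(0, 4)))
err_minor = np.linalg.norm(W - drop_components(slice(-4, None)))
assert err_principal > err_minor
```

The reconstruction error of dropping a set of components is exactly \(\sqrt{\sum_i \sigma_i^2}\) over the dropped singular values, which is why the large-value (principal) directions dominate the matrix's behavior.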
The Core Method: SoMA
SoMA takes this insight and builds a fine-tuning strategy around it. The name stands for Singular Value Decomposed Minor Components Adaptation.
Step 1: Decomposing the Weights
First, SoMA performs SVD on the pre-trained weight matrix \(W\).
\[W = U \Sigma V^T = \sum_{i} \sigma_i u_i v_i^T\]
Each term \(\sigma_i u_i v_i^T\) is a rank-one component, weighted by its singular value \(\sigma_i\).
Step 2: Splitting Major and Minor
The method splits the singular components into two groups based on a rank \(r\):
- Residual (Major) Components: The top components with the largest singular values. These represent the core “world knowledge.” These are frozen.
- Minor Components: The bottom \(r\) components with the smallest singular values. These represent the “noise” or specific details that we can safely overwrite.
Step 3: Initializing the Adapter
Unlike LoRA, which initializes randomly, SoMA initializes its trainable matrices (\(B\) and \(A\)) using these extracted minor components.
Writing \(U_m\), \(\Sigma_m\), \(V_m\) for the \(r\) minor components:
\[B = U_m \Sigma_m^{1/2}, \qquad A = \Sigma_m^{1/2} V_m^T, \qquad W_{res} = W - BA\]
Here, \(B\) and \(A\) are initialized to exactly reconstruct the minor components of the original weight. This means that at the start of training, the model is mathematically identical to the original pre-trained model.
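The steps above can be sketched in numpy. The layer size and rank here are hypothetical; the point is the exact-reconstruction property at initialization:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 8                      # hypothetical layer size and adapter rank
W = rng.normal(size=(d, d))       # stand-in for a pre-trained weight

U, S, Vt = np.linalg.svd(W, full_matrices=False)

# Major components (largest singular values) form the frozen residual.
W_res = U[:, :-r] @ np.diag(S[:-r]) @ Vt[:-r, :]

# Trainable B and A are built from the r minor components, splitting the
# square root of the singular values between the two matrices.
B = U[:, -r:] @ np.diag(np.sqrt(S[-r:]))
A = np.diag(np.sqrt(S[-r:])) @ Vt[-r:, :]

# At initialization, frozen residual + adapter reproduces W exactly.
assert np.allclose(W_res + B @ A, W)
```

Splitting \(\Sigma_m^{1/2}\) symmetrically between \(B\) and \(A\) keeps the two matrices at comparable scales, which tends to stabilize optimization.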
Step 4: The Forward Pass
During training, we optimize \(B\) and \(A\) (which now represent the minor components). The forward pass looks like this:
\[h = W_{res}\, x + \Delta W_{SoMA}\, x, \qquad \Delta W_{SoMA} = BA\]
The term \(\Delta W_{SoMA}\) represents the changes we make to the minor components. Because these updates start in the subspace spanned by the minor singular vectors, which is orthogonal to the major directions, interference with the generalizable knowledge is minimized.
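A minimal numpy sketch of this forward pass (toy sizes, with \(B\) and \(A\) built from the minor components as in Step 3) confirms that before any gradient step the adapted model is byte-for-byte the pre-trained one:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 4
W = rng.normal(size=(d, d))
U, S, Vt = np.linalg.svd(W)

W_res = U[:, :-r] @ np.diag(S[:-r]) @ Vt[:-r, :]   # frozen major part
B = U[:, -r:] * np.sqrt(S[-r:])                    # trainable
A = np.sqrt(S[-r:])[:, None] * Vt[-r:, :]          # trainable

x = rng.normal(size=d)
h = W_res @ x + B @ (A @ x)   # forward: frozen residual + minor-component adapter

# Before any gradient step, Delta W_SoMA equals the original minor
# components, so the output matches the pre-trained model.
assert np.allclose(h, W @ x)
```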
Bonus Strategy: Freezing Early Blocks
The researchers didn’t stop at the weight level; they also looked at the architecture level.
In Domain Generalization, differences between domains often appear in low-level statistics—lighting, fog, noise, or art style (e.g., synthetic game data vs. real photos). In Convolutional Neural Networks (CNNs) and Transformers, the early blocks are responsible for processing these low-level features.

As shown in Figure 3 (Top), the early blocks of a VFM like DINOv2 already capture robust, localized semantics. Figure 3 (Bottom) shows that freezing these early blocks (preventing them from updating) actually improves performance (the darker blue regions). By freezing the early layers, we force the model to rely on its pre-trained ability to handle low-level visual features, preventing it from overfitting to the specific “style” of the training data (like the video game graphics of GTA5).
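In practice, block freezing amounts to excluding the early blocks' parameters from the optimizer. Here is a small framework-free sketch; the `blocks.<idx>.<name>` naming convention and the cutoff of 4 blocks are illustrative assumptions, not the paper's exact configuration:

```python
def trainable_parameters(named_params, n_frozen_blocks):
    """Keep only parameters belonging to blocks at index >= n_frozen_blocks."""
    kept = {}
    for name, param in named_params.items():
        # Assumes ViT-style names like "blocks.<idx>.<...>".
        block_idx = int(name.split(".")[1])
        if block_idx >= n_frozen_blocks:
            kept[name] = param
    return kept

# Toy parameter dict standing in for a 12-block backbone with two
# adapter matrices (A and B) per block.
params = {f"blocks.{i}.adapter_A": None for i in range(12)}
params.update({f"blocks.{i}.adapter_B": None for i in range(12)})

trainable = trainable_parameters(params, n_frozen_blocks=4)
assert len(trainable) == 16                 # 8 remaining blocks x 2 matrices
assert "blocks.0.adapter_A" not in trainable
```

Only the returned dictionary would be handed to the optimizer; everything in the first blocks keeps its pre-trained values.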
Experiments & Results
The SoMA framework was tested rigorously on two main tasks: Domain Generalized Semantic Segmentation (DGSS) and Object Detection (DGOD).
Semantic Segmentation: Synthetic to Real
In this setup, the model is trained on synthetic data (GTA V, which looks like a video game) and tested on real-world driving datasets (Cityscapes, BDD, Mapillary). This is a classic “Sim-to-Real” gap.

Table 2 shows the results. Key takeaways:
- SoMA vs. FFT: SoMA outperforms Full Fine-Tuning (FFT) while training significantly fewer parameters (only ~5 million vs 300+ million).
- SoMA vs. LoRA: SoMA consistently beats standard LoRA. This proves that how you initialize the adapter matters. Initializing with minor components is superior to random initialization.
Visually, the difference is striking.

In Figure 6, look at the segmentation masks produced by SoMA compared to LoRA and other methods. SoMA produces crisp, accurate boundaries for road signs and vehicles, even though it was trained on video game data.
Object Detection in Adverse Weather
The researchers also tested Object Detection, training on “Daytime-Sunny” images and testing on “Night,” “Rain,” and “Fog.”

Figure 4 demonstrates robustness. In the “Dusk Rainy” and “Daytime Foggy” rows, SoMA detects vehicles that other methods miss (or hallucinate). This confirms that preserving the “major components” (which understand the shape of a car regardless of weather) is crucial.
Generative Modeling (Subject Personalization)
Finally, to prove the versatility of SoMA, the authors applied it to Stable Diffusion. The goal: Teach the model a new subject (a specific dog) without forgetting how to generate different artistic styles.


As seen in Figure 10 and Figure 11, SoMA excels at "Subject Personalization." It captures the identity of the specific dog (from the reference photos) while retaining the base model's ability to render it in "origami style" or "watercolor." Standard methods (like DreamBooth) often overfit, losing the ability to change styles effectively.
Conclusion
The SoMA paper teaches us a valuable lesson about the nature of deep learning models. Foundation models are not just “big bags of weights”; they have a hierarchical structure of knowledge.
- Structure matters: Knowledge is encoded in singular components.
- Respect the hierarchy: The strongest signals (major components) hold the most general truths about the world.
- Tune the noise: Adaptation is best done by manipulating the minor components—the parts of the model reserved for specific contexts and details.
By combining this SVD-based initialization with architectural insights like freezing early blocks, SoMA provides a parameter-efficient, mathematically grounded way to adapt AI models. Whether for self-driving cars navigating rain or AI artists learning a new character, SoMA ensures that while the model learns new tricks, it doesn’t forget the world it already knows.
For those interested in the implementation details, the full hyperparameter settings are given in the paper.