Imagine you are riding in an autonomous vehicle through a thick, heavy fog.
For you, the human passenger, the goal is Perceptual Image Restoration (PIR). You want the fog cleared from your view so you can see the scenery, the road texture, and the world in high fidelity. You care about aesthetics and clarity.
For the car’s computer, however, the goal is Task-Oriented Image Restoration (TIR). The AI doesn’t care if the trees look pretty; it cares about edge detection, object classification, and semantic segmentation. It needs to know exactly where the pedestrian is and where the lane marker ends.
Historically, in the field of Computer Vision, these have been two separate worlds. Methods that made images look good to humans often destroyed the subtle data patterns machines need. Conversely, methods that optimized for machine accuracy often produced images that looked noisy or “weird” to human eyes.
In this post, we are diving deep into UniRestore, a groundbreaking paper that proposes a unified model capable of satisfying both biological eyes and silicon sensors. By leveraging the power of Diffusion Models and introducing novel adaptation modules, UniRestore manages to clear the fog for everyone.

As shown in Figure 1 above, UniRestore doesn’t just compromise between the two goals; it excels at them. Notice specifically in section (b) how UniRestore (the orange square) achieves high classification accuracy while simultaneously maintaining high segmentation performance, occupying the “sweet spot” in the top right corner.
Let’s unpack how this architecture works, why it uses a diffusion prior, and how it solves the “Jack of all trades” problem.
The Context: Why Is This Hard?
To understand the innovation of UniRestore, we first need to understand the conflict between Perception and Task Performance.
Perceptual Image Restoration (PIR)
PIR algorithms aim to remove degradation (noise, blur, rain, snow) to improve visual quality. Success is judged by how the result looks to people, typically measured with fidelity scores such as PSNR (Peak Signal-to-Noise Ratio) and SSIM. However, cleaning up an image for the human eye often involves smoothing out pixels, which can inadvertently wipe out the high-frequency textures that neural networks rely on for feature extraction.
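For reference, PSNR is just a log-scaled mean squared error between the restored image \(\hat{x}\) and the ground truth \(x\), with \(MAX\) the largest possible pixel value (255 for 8-bit images):

\[
\mathrm{PSNR}(x, \hat{x}) = 10 \log_{10}\!\left(\frac{MAX^2}{\mathrm{MSE}(x, \hat{x})}\right),
\qquad
\mathrm{MSE}(x, \hat{x}) = \frac{1}{N}\sum_{i=1}^{N}\left(x_i - \hat{x}_i\right)^2
\]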
Task-Oriented Image Restoration (TIR)
TIR focuses on utility. If an image is degraded, a TIR model tries to recover the semantic information required for downstream tasks like object detection or classification. Sometimes, a TIR model might introduce artifacts that look ugly to us but make a specific object “pop” for an algorithm.
The Diffusion Dilemma
Recently, Diffusion Models (like Stable Diffusion) have become the gold standard for generating high-quality images. They work by iteratively denoising a random distribution to form a coherent image. They have incredible “priors”—knowledge about what the world looks like.
However, using a standard diffusion model for restoration has a flaw: it is optimized for generation, not necessarily fidelity or utility. A diffusion model might see a blurry blob and turn it into a high-definition cat when it was actually a dog. It prioritizes looking “real” over being “accurate.”
UniRestore addresses this by using a pre-trained Stable Diffusion model as a backbone but modifying it to listen to the specific needs of the restoration task.
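For readers new to diffusion models, here is a minimal sketch of the iterative denoising loop described above. It follows a generic deterministic (DDIM-style) update; the model signature and variable names are illustrative, not taken from the paper or from Stable Diffusion's codebase:

```python
import torch

def reverse_diffusion(model, alphas_cumprod, shape, device="cpu"):
    """Start from pure noise and iteratively denoise it into an image."""
    T = len(alphas_cumprod)
    x = torch.randn(shape, device=device)                      # pure Gaussian noise
    for t in reversed(range(T)):
        a_bar = alphas_cumprod[t]
        a_bar_prev = alphas_cumprod[t - 1] if t > 0 else torch.tensor(1.0)
        t_batch = torch.full((shape[0],), t, device=device)
        eps = model(x, t_batch)                                 # predict the noise in x
        x0 = (x - (1 - a_bar).sqrt() * eps) / a_bar.sqrt()      # implied clean image
        x = a_bar_prev.sqrt() * x0 + (1 - a_bar_prev).sqrt() * eps  # one step less noisy
    return x
```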
The Core Method: UniRestore Architecture
The researchers built UniRestore on top of the Stable Diffusion Autoencoder (VAE). The goal was to keep the generative power of diffusion but control it tightly so it serves specific tasks.

As illustrated in Figure 2, the architecture fundamentally changes how data flows through the diffusion process. The input is a degraded image (e.g., a snowy street). The output is either a visually clean image or a task-optimized feature map.
There are two critical innovations introduced here to bridge the PIR/TIR gap:
- Complementary Feature Restoration Module (CFRM): Fixes the input features in the Encoder.
- Task Feature Adapter (TFA): Adapts the output features in the Decoder for specific tasks.
Let’s break these down step-by-step.
1. Complementary Feature Restoration Module (CFRM)
Standard encoders in diffusion models are not designed to handle heavy degradation. If you feed a snowy image into a standard VAE encoder, the resulting “latent features” (the compressed representation of the image) will be corrupted.
The CFRM is injected into the encoder to clean these features on the fly.

Looking at the left side (a) of Figure 3, the CFRM operates in four distinct steps:
- Feature Enhancement: It takes the raw features and expands them using a standard convolutional block (NAFBlock). This prepares the data for deeper analysis.
- Intra-group Channel Attention: This is a clever design choice. The channels are split into groups. Why? Because different types of degradation (rain vs. blur vs. noise) affect image frequencies differently. By grouping channels, the model can learn specific weights to handle different “types” of damage within the features.
- Inter-group Channel Integration: After processing groups individually, the model needs to synthesize this information. This step combines the insights from the different groups to form a cohesive, restored feature map.
- Feature Recovery: A final skip connection blends the restored features with the original input to ensure no structural information was lost.
By the time the data leaves the Encoder, the “snow” or “fog” has been significantly suppressed in the feature space.
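The paper's exact implementation has its own details, but a minimal PyTorch-style sketch of the four-step flow could look like the following. The NAFBlock is stubbed with a plain convolutional block, and the channel and group counts are illustrative:

```python
import torch
import torch.nn as nn

class CFRMSketch(nn.Module):
    """Sketch of the CFRM flow: enhance -> intra-group attention
    -> inter-group integration -> recovery via skip connection."""
    def __init__(self, channels: int = 64, groups: int = 4):
        super().__init__()
        self.groups = groups
        # 1. Feature Enhancement (stand-in for a NAFBlock)
        self.enhance = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.GELU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        # 2. Intra-group Channel Attention: one squeeze-and-excite per channel group
        g = channels // groups
        self.intra = nn.ModuleList([
            nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Conv2d(g, g, 1), nn.Sigmoid())
            for _ in range(groups)
        ])
        # 3. Inter-group Channel Integration: mix information across groups
        self.inter = nn.Conv2d(channels, channels, 1)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        x = self.enhance(feat)
        chunks = x.chunk(self.groups, dim=1)                   # split channels into groups
        chunks = [c * att(c) for c, att in zip(chunks, self.intra)]
        x = self.inter(torch.cat(chunks, dim=1))
        # 4. Feature Recovery: skip connection keeps the original structure
        return feat + x
```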
2. Task Feature Adapter (TFA)
Now that we have clean latent features, we pass them through the Denoising U-Net (the brain of Stable Diffusion). However, the Decoder needs to know what to do with them. Should it make a pretty picture? Should it highlight cars?
This is where the Task Feature Adapter (TFA) comes in. Instead of training a completely new massive network for every single task (which is computationally expensive), UniRestore uses Prompts.
The TFA acts like a switchboard operator. It sits in the Decoder and takes a small, learnable “prompt vector” that represents the task (e.g., “Semantic Segmentation”).
Referencing the right side (b) of Figure 3 (above) and the equations below, here is how the TFA works dynamically:

The math might look intimidating, but the logic is elegant:
- Token Updating (\(f_i, i_i\)): The model calculates “forget” and “input” gates, similar to an LSTM (Long Short-Term Memory) network. It decides how much of the previous layer’s prompt information to keep and how much new information to accept.
- Prompt Propagation (\(C_{i+1}^k\)): The prompt \(C\) is updated layer by layer. It evolves as the image is upscaled in the decoder.
- Feature Fusion: The adapter mixes the restored encoder features (\(F_{enc}\)) with the diffusion latent features (\(F_{latent}\)).
- Adaptation: Finally, the prompt controls how these features are mixed. If the prompt is “Classification,” it might emphasize object shapes. If it is “Perception,” it might emphasize texture and color.
This structure allows the model to be incredibly efficient. To add a new task, you don’t need to retrain the whole model—you just train a new tiny prompt vector.
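To make the gating idea concrete, here is a heavily simplified sketch of one TFA layer. Every shape, projection, and name below is an assumption for illustration; the paper's actual adapter is more elaborate:

```python
import torch
import torch.nn as nn

class TFASketch(nn.Module):
    """Sketch of a Task Feature Adapter layer: LSTM-like gates update the task
    prompt, which then modulates the fusion of encoder and diffusion features."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.forget_gate = nn.Linear(dim, dim)
        self.input_gate = nn.Linear(dim, dim)
        self.candidate = nn.Linear(dim, dim)
        self.fuse = nn.Conv2d(2 * dim, dim, 1)   # mixes F_enc with F_latent
        self.modulate = nn.Linear(dim, dim)      # prompt -> per-channel scaling

    def forward(self, prompt, f_enc, f_latent):
        # prompt: (B, L, C); f_enc, f_latent: (B, C, H, W)
        ctx = f_latent.mean(dim=(2, 3)).unsqueeze(1)          # global feature summary
        f = torch.sigmoid(self.forget_gate(prompt + ctx))     # how much old prompt to keep
        i = torch.sigmoid(self.input_gate(prompt + ctx))      # how much new info to accept
        prompt_next = f * prompt + i * torch.tanh(self.candidate(ctx))
        # Feature Fusion: combine restored encoder features with diffusion latents
        fused = self.fuse(torch.cat([f_enc, f_latent], dim=1))
        # Adaptation: the updated prompt decides which channels get emphasised
        scale = torch.sigmoid(self.modulate(prompt_next.mean(dim=1)))   # (B, C)
        return fused * scale[:, :, None, None], prompt_next
```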
The Training Pipeline
Training UniRestore is a two-stage process, ensuring the model learns how to see before it learns what to do.
Stage 1: Learning to Restore
First, the model must learn to clean up images. The CFRM and the Controller are trained using a Perceptual Image Restoration (PIR) dataset.
The loss function for the CFRM forces the degraded features to match the “clean” features of a ground-truth image:
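The paper's precise formulation may differ, but a feature-matching loss of this kind is typically a simple distance between the restored features of the degraded image, \(F_{\mathrm{res}}\), and the encoder features of the clean ground truth, \(F_{\mathrm{clean}}\):

\[
\mathcal{L}_{\mathrm{CFRM}} = \left\lVert F_{\mathrm{res}} - F_{\mathrm{clean}} \right\rVert_{1}
\]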

Simultaneously, the Control module is trained to ensure the diffusion process stays on track:
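As a reference point rather than the paper's exact equation, diffusion backbones are usually supervised with a noise-prediction objective, here conditioned on the restored features:

\[
\mathcal{L}_{\mathrm{diff}} = \mathbb{E}_{z_0,\, \epsilon,\, t}\left[\left\lVert \epsilon - \epsilon_\theta\!\left(z_t, t, F_{\mathrm{res}}\right) \right\rVert_2^2\right]
\]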

Stage 2: Learning to Adapt
Once the model is good at general restoration, the TFA is trained. Here, the CFRM and the main diffusion model are frozen. Only the lightweight TFA parameters are updated.
The model is trained on multiple tasks simultaneously (Multi-task learning). The loss function combines objectives—for example, looking good (PIR), identifying objects (Classification), and outlining boundaries (Segmentation):

Specifically for this paper, the researchers used a weighted sum of three losses:
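Written out, with weights \(\lambda\) balancing the three objectives (the exact values are a detail of the paper), this takes the familiar form:

\[
\mathcal{L}_{\mathrm{total}} = \lambda_{\mathrm{pir}} \mathcal{L}_{\mathrm{pir}} + \lambda_{\mathrm{cls}} \mathcal{L}_{\mathrm{cls}} + \lambda_{\mathrm{seg}} \mathcal{L}_{\mathrm{seg}}
\]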

Experiments and Results
Does this complex architecture actually deliver? The researchers tested UniRestore against state-of-the-art methods, including specialized PIR models (like NAFNet) and specialized TIR models (like URIE).
Perceptual Results (Making it look good)
For visual quality, UniRestore was tested on datasets involving rain, haze, and blur.

Table 1 shows the quantitative results. UniRestore achieves the highest scores in almost every category (PSNR and SSIM). Note the “Unseen Datasets” columns—this is crucial. It means UniRestore performs well even on types of weather or degradation it wasn’t explicitly trained on. This generalization capability is largely thanks to the underlying robust priors of the Stable Diffusion model.
Qualitatively, the difference is stark:

In Figure 4, look at the middle row (the Parthenon). The URIE method (second column) leaves a lot of noise. PromptIR (third column) is better but still hazy. UniRestore (fourth column) produces a crisp, clean image that rivals the High Quality (HQ) ground truth.
Task-Oriented Results (Making it useful)
This is where UniRestore truly shines. Most restoration models fail here.
Image Classification: The researchers took degraded images, restored them, and then fed them into standard classifiers (like ResNet-50).
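In code, that evaluation pipeline is short. The sketch below uses a frozen torchvision ResNet-50; the `restorer` call stands in for UniRestore (or any restoration model) and is an assumed interface:

```python
import torch
from torchvision.models import resnet50, ResNet50_Weights

def eval_restore_then_classify(restorer, loader, device="cuda"):
    """Restore each degraded image, then measure top-1 accuracy of a frozen ResNet-50."""
    weights = ResNet50_Weights.IMAGENET1K_V2
    classifier = resnet50(weights=weights).eval().to(device)
    preprocess = weights.transforms()
    correct = total = 0
    with torch.no_grad():
        for degraded, labels in loader:
            restored = restorer(degraded.to(device))        # placeholder restoration call
            logits = classifier(preprocess(restored))
            correct += (logits.argmax(dim=1).cpu() == labels).sum().item()
            total += labels.numel()
    return correct / total
```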

Table 2 shows that UniRestore provides a massive boost in accuracy. In some cases, it improved accuracy by over 20% compared to other methods on the unseen CUB dataset.
Semantic Segmentation: This task requires understanding pixel-level boundaries.

In Table 3, we see UniRestore achieving the highest mIoU (mean Intersection over Union) scores. This means it is better at helping the AI distinguish the road from the sidewalk, even after heavy degradation.
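mIoU itself is easy to pin down: per class, the intersection of predicted and ground-truth pixels divided by their union, averaged over classes. A minimal NumPy version:

```python
import numpy as np

def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> float:
    """Mean Intersection-over-Union for two integer label maps of the same shape."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:                        # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))
```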
Why is it better? One way to check is to look at “Activation Maps”—visualizations of what the AI is focusing on.

In Figure 5 (left side), look at the bird. The “Low Quality” (LQ) map is scattered; the AI is confused by the noise. The UniRestore map is tight and focused on the bird’s body, very similar to the HQ map.
In Figure 6 (right side), look at the segmentation of the street. Other methods produce jagged, confused masks (the purple/pink blobs). UniRestore produces clean, coherent road segments.
Adaptability and Efficiency
One of the most impressive findings is the Extendability of the model. The researchers asked: “What if we want to add Object Detection?”
Usually, you’d have to retrain your whole network. With UniRestore, they simply added a new prompt vector and trained only that prompt.
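In pseudo-PyTorch terms, extending to a new task amounts to freezing everything and optimizing only a freshly initialized prompt. Every name and argument below (including the `prompt=` interface) is illustrative:

```python
import torch

def add_new_task_prompt(unirestore, train_loader, task_loss_fn,
                        prompt_len=16, dim=64, steps=1000, lr=1e-3, device="cuda"):
    """Freeze the restoration backbone; learn only a new task prompt vector."""
    for p in unirestore.parameters():
        p.requires_grad_(False)               # CFRM, diffusion U-Net, TFA weights stay fixed
    new_prompt = torch.randn(1, prompt_len, dim, device=device, requires_grad=True)
    opt = torch.optim.AdamW([new_prompt], lr=lr)
    for _, (degraded, target) in zip(range(steps), train_loader):
        out = unirestore(degraded.to(device), prompt=new_prompt)   # assumed interface
        loss = task_loss_fn(out, target.to(device))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return new_prompt
```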

As shown in Table 6, despite not retraining the core network, UniRestore outperformed dedicated methods. It also showed that using specific prompts (UniRestore) works better than trying to use a single “one-size-fits-all” prompt (UniRestore-SP), as seen in the ablation study below:

Conclusion
UniRestore represents a significant step forward in image restoration. It acknowledges a fundamental truth in computer vision: what humans need and what machines need are often different.
By building a unified framework that shares a powerful core (Stable Diffusion) but adapts the input (via CFRM) and the output (via TFA), UniRestore achieves the best of both worlds.
Key Takeaways:
- Unified Framework: No need for separate models for viewing and analysis.
- Diffusion Power: Leverages the generative capabilities of large pre-trained models without losing fidelity.
- Prompt-Based Adaptability: New tasks can be added cheaply and efficiently by training lightweight prompts rather than heavy backbones.
For students and researchers entering the field, UniRestore offers a blueprint for how to tame generative AI. It shows that with the right architectural controls, we don’t have to choose between beauty and utility—we can restore images that are both pleasing to the eye and intelligible to the machine.