Introduction
Imagine you have a photo of your specific hiking backpack—the one with the unique patches and a distinct texture. You want to generate an image of that exact backpack sitting on a bench in a futuristic city. You type the prompt into a standard text-to-image model, but the result is disappointing. It generates a backpack, sure, but it’s generic. It’s missing the patches. The texture looks like smooth plastic instead of canvas.
This is the central challenge of Subject-Driven Image Personalization. We have moved past simply generating “a dog” to needing “this specific dog.” While tools like DreamBooth and IP-Adapter have made massive strides, they often struggle with consistency. They might capture the general shape but lose the fine details, or conversely, capture the texture but paste it onto a distorted shape.
The problem lies in how these models process visual information. Most current approaches feed the reference image features into the generation process in a static way. They shout all the information—shape, color, texture, edges—at the model simultaneously, regardless of whether the model is ready to use it.
Enter TFCustom, a new framework presented at CVPR that rethinks this process. The researchers behind TFCustom argue that generation is a journey over time. Just as a painter starts with broad strokes (low frequency) and finishes with fine details (high frequency), a diffusion model needs different types of information at different stages of the denoising process.
In this deep dive, we will explore how TFCustom harmonizes time and frequency to achieve state-of-the-art results in personalized image generation.

Background: The State of Personalized Generation
To understand why TFCustom is necessary, we first need to look at the current landscape of “fine-tuning-free” personalization.
The Rise of ReferenceNet
Early personalization methods (like DreamBooth) required fine-tuning the entire model for every new subject, which is computationally expensive and slow. Newer methods (like IP-Adapter) use a separate encoder to inject visual features without retraining the model.
The current gold standard in this space involves using a ReferenceNet. This is essentially a copy of the main diffusion network (specifically the UNet) that processes the reference image. The features extracted from this ReferenceNet are injected into the main generation network via attention mechanisms.
The Limitation: Static Injection
While ReferenceNet-based methods are powerful, they have a flaw. They typically extract features from the reference image once (or in a static manner) and provide them to the main network.
However, diffusion models work iteratively. They start with pure noise and gradually remove it over a sequence of denoising steps (typically around 50).
- Early steps: The model decides the layout and broad shapes (e.g., “There is a backpack here”).
- Late steps: The model refines textures and edges (e.g., “The zipper looks like brass,” “The fabric is canvas”).
If we feed high-frequency texture details to the model during the early “layout” phase, it can confuse the structure. If we fail to provide sharp details in the late phase, the result looks blurry. TFCustom was built to solve this mismatch.
The Core Method: TFCustom
The TFCustom framework introduces three major innovations to align the reference features with the generation process: Synchronized Reference Guidance, Time-Aware Frequency Feature Refinement, and Reward Model Optimization.
Let’s break down the overall architecture before diving into the specifics.

As shown in Figure 2(a) above, the architecture runs two parallel paths. On the top is the Synchronized ReferenceNet, processing the input image (e.g., the dog). On the bottom is the DenoisingNet, generating the new image. The magic happens in the connections between them.
1. Synchronized ReferenceNet Guidance
In standard approaches, the reference image is usually clean (no noise). However, the image being generated starts as pure noise. This creates a domain gap—the features of a clean image look very different from the features of a noisy latent code.
TFCustom addresses this by “noising” the reference image. If the generation process is at timestep \(t\) (where \(t\) represents the noise level), the model adds a corresponding amount of noise to the reference image before feeding it into the ReferenceNet.
The formulation for adding noise is the standard diffusion forward process:

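Concretely, the reference latent \(x_0^{ref}\) is pushed to the same noise level as the current generation using the usual DDPM noising rule:

$$
x_t^{ref} = \sqrt{\bar{\alpha}_t}\, x_0^{ref} + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})
$$

where \(\bar{\alpha}_t\) is the cumulative noise schedule at timestep \(t\) and \(\epsilon\) is Gaussian noise.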
By doing this, the ReferenceNet extracts features from a noisy domain that matches the current state of the generated image. This “synchronization” ensures that the guidance is semantically aligned. The ReferenceNet is trained with a diffusion loss to ensure it understands how to extract meaningful features even from these noisy inputs.
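As a rough sketch of this synchronization in code (assuming a precomputed cumulative schedule `alphas_cumprod`; the function and variable names are illustrative, not the authors' implementation):

```python
import torch

def synchronize_reference(ref_latent: torch.Tensor,
                          timesteps: torch.Tensor,
                          alphas_cumprod: torch.Tensor) -> torch.Tensor:
    """Noise the clean reference latent to the same noise level t as the
    latent currently being denoised (illustrative sketch)."""
    # Look up the cumulative alpha for each sample's timestep: [B] -> [B, 1, 1, 1]
    abar = alphas_cumprod[timesteps].view(-1, 1, 1, 1)
    noise = torch.randn_like(ref_latent)  # epsilon ~ N(0, I)
    # Standard forward-diffusion noising: x_t = sqrt(abar) * x_0 + sqrt(1 - abar) * eps
    return abar.sqrt() * ref_latent + (1.0 - abar).sqrt() * noise

# Usage: feed the noised reference to the ReferenceNet at the same timestep t
# ref_noisy = synchronize_reference(ref_latent, t, scheduler.alphas_cumprod)
```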
2. Time-Aware Frequency Feature Refinement (TA-FFR)
This is the most critical contribution of the paper. Once the features are extracted from the synchronized ReferenceNet, we shouldn’t just dump them into the DenoisingNet. We need to filter them based on time.
The researchers drew inspiration from the fact that neural networks learn hierarchically. They propose a module that separates the reference features into High-Frequency and Low-Frequency components.
Splitting Frequencies
As illustrated in Figure 2(b) (the right side of the architecture diagram), the module uses two distinct convolutional operators:
- Gaussian Operator: Acts as a low-pass filter, capturing smooth gradients, shapes, and colors.
- Kirsch Operator: Acts as a high-pass filter, capturing edges, textures, and fine details.
The mathematical operation for extracting these features looks like this:

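One way to sketch this split in symbols (the paper's exact formulation may differ slightly):

$$
\mathbf{F}_{high} = \mathbf{W}_{H} \odot \left( \mathbf{H}_{conv} \ast \mathbf{F}_{ref} \right), \qquad
\mathbf{F}_{low} = \mathbf{W}_{L} \odot \left( \mathbf{L}_{conv} \ast \mathbf{F}_{ref} \right)
$$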
Here, \(\mathbf{F}_{ref}\) are the raw features, and \(\mathbf{H}_{conv}\) / \(\mathbf{L}_{conv}\) are the filters. \(\mathbf{W}\) represents learnable weights that allow the model to adjust how much filtering is applied.
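To make the filtering concrete, here is a minimal sketch of the two operators applied to a feature map with fixed 3×3 kernels and depthwise convolution. The kernel values and the absence of learnable gating are simplifications for illustration; the paper's implementation details may differ.

```python
import torch
import torch.nn.functional as F

def split_frequencies(feat: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Split a feature map [B, C, H, W] into low- and high-frequency parts
    using a Gaussian low-pass and a Kirsch high-pass kernel (illustrative)."""
    B, C, H, W = feat.shape
    # 3x3 Gaussian smoothing kernel (low-pass: shapes, colors, smooth gradients)
    gauss = torch.tensor([[1., 2., 1.],
                          [2., 4., 2.],
                          [1., 2., 1.]], device=feat.device) / 16.0
    # One of the eight Kirsch compass kernels (high-pass: edges, fine texture)
    kirsch = torch.tensor([[ 5.,  5.,  5.],
                           [-3.,  0., -3.],
                           [-3., -3., -3.]], device=feat.device)
    # Depthwise convolution: apply the same kernel independently to every channel
    gauss = gauss.expand(C, 1, 3, 3).contiguous()
    kirsch = kirsch.expand(C, 1, 3, 3).contiguous()
    f_low = F.conv2d(feat, gauss, padding=1, groups=C)
    f_high = F.conv2d(feat, kirsch, padding=1, groups=C)
    return f_low, f_high
```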
Injecting Time
Once separated, the model needs to decide how much of each frequency to use. This decision depends on the timestep \(t\).
The model uses a Time-Aware Attention (TA-Attention) mechanism. It takes the time embedding \(t_{emb}\) (a vector representing the current step in the diffusion process) and injects it via Adaptive Layer Normalization (AdaLN).

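A common way to write this kind of conditioning, consistent with the description here (the paper's exact parameterization may differ):

$$
\text{AdaLN}(\mathbf{F}, t_{emb}) = \big(1 + \gamma(t_{emb})\big) \odot \text{Norm}(\mathbf{F}) + \beta(t_{emb})
$$

where \(\gamma\) and \(\beta\) are learned projections of the time embedding.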
The result is a dynamic feature set. At early timesteps (when the image is just rough shapes), the network can emphasize the Low-Frequency path to guide the structure. At later timesteps (refining details), it can ramp up the High-Frequency path to draw in textures and edges.
Finally, these refined features are summed back together:

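Writing the time-modulated branches as \(\tilde{\mathbf{F}}_{high}\) and \(\tilde{\mathbf{F}}_{low}\), the merge is simply:

$$
\mathbf{F}_{enh} = \tilde{\mathbf{F}}_{high} + \tilde{\mathbf{F}}_{low}
$$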
This refined feature map, \(\mathbf{F}_{enh}\), is what finally gets sent to the main generation network via cross-attention.
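Putting the pieces together, here is a minimal sketch of the time-aware refinement. It uses simple AdaLN-style scale/shift gating in place of the paper's full TA-Attention, and the module structure and names are assumptions for illustration, not the authors' code.

```python
import torch
import torch.nn as nn

class TimeAwareRefinement(nn.Module):
    """Illustrative sketch: modulate low/high-frequency reference features
    with the diffusion time embedding via AdaLN-style scale and shift."""
    def __init__(self, channels: int, t_dim: int):
        super().__init__()
        self.norm = nn.GroupNorm(1, channels, affine=False)
        # Predict a (scale, shift) pair for each frequency branch from t_emb
        self.to_mod = nn.Linear(t_dim, 4 * channels)

    def forward(self, f_low, f_high, t_emb):
        # t_emb: [B, t_dim] -> four modulation vectors of shape [B, C]
        s_lo, b_lo, s_hi, b_hi = self.to_mod(t_emb).chunk(4, dim=1)
        s_lo, b_lo = s_lo[:, :, None, None], b_lo[:, :, None, None]
        s_hi, b_hi = s_hi[:, :, None, None], b_hi[:, :, None, None]
        # Early t can emphasize the low-frequency branch, late t the
        # high-frequency one, depending on what the network learns.
        f_low = (1 + s_lo) * self.norm(f_low) + b_lo
        f_high = (1 + s_hi) * self.norm(f_high) + b_hi
        return f_low + f_high  # F_enh, sent on via cross-attention
```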
3. Reward Model Optimization
Even with perfect features, the model can sometimes hallucinate or blend objects (especially when generating multiple specific subjects, like a cat and a dog). To enforce identity preservation, the authors introduce a Reward Model.

As shown in Figure 3, the system tries to predict the final “clean” image (\(x'_0\)) from the current noisy state (\(x_t\)).

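With \(\epsilon_\theta(x_t, t)\) denoting the network's noise prediction, the standard one-step estimate of the clean image is:

$$
x'_0 = \frac{x_t - \sqrt{1 - \bar{\alpha}_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}}
$$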
Once the model predicts what the final image might look like, it compares this prediction against the original reference image using DINOv2. DINOv2 is a vision transformer known for capturing high-level semantic identity effectively.
The loss function calculates the cosine similarity between the reference image and the predicted generated image:

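Based on this description, the objective can be sketched as one minus the cosine similarity of the two DINOv2 embeddings \(E(\cdot)\) (the paper may add weighting terms):

$$
\mathcal{L}_{reward} = 1 - \cos\!\big( E(x_{ref}),\; E(x'_0) \big)
$$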
This acts as a supervisor. If the generated image starts drifting away from the identity of the reference object, the Reward Model penalizes it, forcing the network to correct course. Importantly, this reward is only applied at lower-noise timesteps (\(t < T_0\)): when the latent is still close to pure noise, the \(x'_0\) prediction is too unreliable to supervise.
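A minimal sketch of that supervision signal, assuming a DINOv2 backbone loaded from torch.hub and inputs already resized and normalized for it (illustrative, not the authors' training code):

```python
import torch
import torch.nn.functional as F

# Public DINOv2 backbone (ViT-S/14); forward returns a [B, 384] global embedding
dino = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()
for p in dino.parameters():
    p.requires_grad_(False)  # frozen feature extractor

def reward_loss(pred_x0: torch.Tensor, ref_img: torch.Tensor,
                t: torch.Tensor, t0: int) -> torch.Tensor:
    """Penalize identity drift between the predicted clean image (decoded to
    pixel space if working with latents) and the reference image, applied
    only at low-noise timesteps where the x0 estimate is reliable."""
    with torch.no_grad():
        ref_emb = dino(ref_img)      # reference identity embedding
    pred_emb = dino(pred_x0)         # embedding of the predicted x'_0
    loss = 1.0 - F.cosine_similarity(pred_emb, ref_emb, dim=-1)  # [B]
    # Mask out timesteps where the latent is still close to pure noise
    return (loss * (t < t0).float()).mean()
```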
Experiments and Results
The researchers tested TFCustom on two major benchmarks: DreamBench (single subject) and MS-Bench (multi-subject). They compared it against industry heavyweights like IP-Adapter, MS-Diffusion, and DreamBooth.
Quantitative Performance
The metrics used were CLIP-I and DINO (measuring image similarity) and CLIP-T (measuring text prompt adherence).
In single-subject tasks, TFCustom dominated the leaderboard.

As seen in Table 1, TFCustom achieved a DINO score of 71.4% in the zero-shot setting (No-FT), significantly higher than MS-Diffusion (67.1%) and IP-Adapter (61.3%). This indicates a much stronger ability to preserve the identity of the subject without needing fine-tuning.
For multi-subject tasks (e.g., “A specific cat and a specific dog”), the gap was even wider.

Table 2 shows TFCustom leading in M-DINO (Multi-subject DINO), which specifically checks if both subjects are accurately represented.
Qualitative Analysis: The Eye Test
The numbers are impressive, but the visual results are where the method shines.
Single-Subject Generation
In Figure 4, we see a comparison of generating a cat and a specific pair of boots.

Notice the boots in the bottom row. The IP-Adapter result is blurry and loses the specific texture of the leather. SSR-Encoder captures the shape but changes the color tone. TFCustom (far right) captures the exact sheen and shape of the boots while placing them naturally in the snow.
Multi-Subject Generation
Multi-subject generation is notoriously difficult. Usually, attributes “leak” between objects (e.g., the dog gets the cat’s color).

In Figure 5 (Row 2), the prompt asks for a “backpack and a pair of shoes.”
- SSR-Encoder gets confused and blends the colors (purple backpack, colorful shoes).
- MS-Diffusion does a decent job but the backpack texture is flat.
- TFCustom perfectly preserves the pink backpack and the specific beige sneakers, keeping them distinct.
Ablation Study: Do we need all the parts?
The researchers performed an ablation study to prove that every component (Synchronized Noise, Frequency Refinement, Reward Model) is necessary.

Looking at Figure 8 (part of the image above):
- w/o \(\mathcal{L}_{ref}\) (Synchronized Noise): The logo on the bag becomes distorted (Row 1).
- w/o Frequency: The text on the bag (“GITHUB”) becomes gibberish (“Ga/beb”) because the high-frequency details weren’t injected at the right time.
- w/o Reward Model: The colors become muddy and the contrast drops.
- Ours (Full Model): The text “GITHUB” is crisp, the logo is sharp, and the identity is perfect.
Figure 9 (also in the image above) visualizes the attention maps. The “Ours” column shows distinct, strong attention on the subject across all timesteps, whereas other methods have scattered attention, explaining their lower consistency.
Conclusion
TFCustom represents a significant step forward in personalized image generation. By treating the generation process as a dynamic timeline rather than a static event, it allows for much finer control over how reference features are used.
The key takeaways are:
- Synchronization matters: Matching the noise level of the reference to the generation creates better feature alignment.
- Frequency is time-dependent: Structure comes first (low frequency), detail comes later (high frequency). TFCustom automates this flow.
- Supervision helps: A reward model acts as a crucial quality check during training.
For students and researchers, TFCustom highlights the importance of looking inside the “black box” of the UNet. Rather than treating diffusion as a single generic process, understanding the different behaviors at different timesteps unlocks new levels of precision and control. As we move toward generating even more complex scenes and eventually video, these time-aware techniques will likely become the standard.