Introduction

Imagine logging into a movie streaming platform. You love adventure films—the rush of adrenaline, the vast landscapes, the hero’s journey. Your friend, on the other hand, loves animation—bright colors, whimsical characters, and exaggerated expressions.

Now, imagine both of you see a recommendation for the same new movie. In a traditional system, you’d both see the exact same poster. But what if the poster could change to match your specific taste? You see a gritty, high-contrast poster emphasizing the action; your friend sees a vibrant, stylized version emphasizing the character design.

This is the promise of Personalized Generation. While Recommender Systems are great at finding which items you might like, they usually present those items in a static, “one-size-fits-all” format.

In this post, we are diving deep into a new framework called I-AM-G (Interest Augmented Multimodal Generator). This research proposes a novel way to bridge the gap between user history and generative AI, creating item representations (like posters or clothing designs) that semantically and visually align with a user’s unique interests.

Figure 1: An example illustration for I-AM-G.

As shown in the concept above, the system takes a target item (like “Pocahontas”) and morphs its visual representation based on user interest tags (like “Character” or “Adventure”), resulting in vastly different visual outputs.

The Problem: Preference Ambiguity and Semantic Gaps

Why haven’t we solved this yet? We have powerful image generators like Stable Diffusion, and we have powerful recommendation algorithms. Why can’t we just mash them together?

The researchers identify two main hurdles:

  1. Preference Ambiguity: Users are often bad at describing what they want. You might know you like the movie Fast and Furious, but can you articulate exactly which visual elements (color grading, composition, lighting) appeal to you? Probably not. This makes it hard to give a generator a simple text prompt.
  2. Semantic Correlation Ignorance: Even if we know a user likes “cute” things, the word “cute” means something different in the context of a horror movie poster versus a summer dress. A standard generator doesn’t inherently understand how to map a user’s past interaction history to the visual semantics of a new item.

The Solution: The I-AM-G Framework

To solve this, the authors propose a pipeline that doesn’t just look at the current item, but looks backward at the user’s history to “rewrite” the generation process.

The framework is built on a paradigm called Rewrite and Retrieve.

Figure 2: The whole pipeline of I-AM-G.

As illustrated in the pipeline above, the architecture is divided into three main stages:

  1. Interest Rewrite: Using Large Language Models (LLMs) to extract and summarize user interests into text tags.
  2. Interest Retrieve Attention (IRA): Searching a database for items that visually match those interests and fusing them into the generation process.
  3. Generation: Using a diffusion model to create the final image.
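
Before breaking these down, it can help to see the flow as code. The following is a minimal Python sketch of how the three stages might be wired together; every name here (`interest_rewrite`, `interest_retrieve`, `generate_image`) is a hypothetical placeholder standing in for the components described in the rest of the post, not the paper’s actual API.

```python
# A bird's-eye sketch of the three-stage I-AM-G flow. Every name here is a
# hypothetical placeholder for the components described in the rest of the post.

def personalize_item(user_history, target_item,
                     interest_rewrite, interest_retrieve, generate_image):
    """Run the Rewrite -> Retrieve -> Generate pipeline for one user/item pair."""
    # Stage 1: summarize the user's history into interest tags and a rewritten prompt.
    rewritten_prompt = interest_rewrite(user_history, target_item)

    # Stage 2: retrieve items matching those interests and fuse them into one embedding.
    fused_interest_embedding = interest_retrieve(rewritten_prompt, target_item)

    # Stage 3: condition a diffusion model on the prompt, the item, and the fused embedding.
    return generate_image(rewritten_prompt, target_item, fused_interest_embedding)
```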

Let’s break these down step-by-step.

Step 1: Interest Rewrite

Since users can’t always articulate their preferences, the model needs to infer them. The system looks at the items a user has interacted with in the past (historical interactions).

It uses a Vision Language Model (VLM) to look at the images of those past items and an LLM to read their descriptions. Both models generate “tags.”

Equation for tag extraction

Here, \(p_{I,\phi}\) and \(p_{T,\phi}\) are the prompts given to the VLM and LLM, respectively, instructing them to extract relevant keywords (like “vintage,” “neon,” or “minimalist”).

The system then combines these tags to create a comprehensive profile for each item in the history:

Equation for tag union
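
To make this concrete, here is a minimal sketch of the per-item tag extraction and union. The `query_vlm` and `query_llm` callables are generic stand-ins (assumed to return lists of keyword strings), not functions from the paper, and the prompt strings are paraphrases rather than the paper’s exact \(p_{I,\phi}\) and \(p_{T,\phi}\).

```python
# Hypothetical sketch of per-item tag extraction. `query_vlm` and `query_llm`
# are placeholder callables that return a list of keyword strings.

VISUAL_PROMPT = "List short visual-style keywords for this image (e.g. vintage, neon, minimalist)."  # stands in for p_{I,phi}
TEXT_PROMPT = "List short interest keywords describing this item."                                   # stands in for p_{T,phi}

def extract_item_tags(image, description, query_vlm, query_llm):
    visual_tags = query_vlm(image, VISUAL_PROMPT)        # tags inferred from the item image
    textual_tags = query_llm(description, TEXT_PROMPT)   # tags inferred from the item text
    # The per-item profile is simply the union of both tag sets.
    return set(visual_tags) | set(textual_tags)
```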

For the target item (the one we want to generate), the system looks at the user’s last \(k\) interactions. It counts the most frequent tags found in that history to determine the user’s dominant interests.

Equation for counting tags

Once the top tags are identified, the system rewrites the text prompt for the new item. Instead of just “A movie poster for Pocahontas,” the prompt becomes something like:

“This movie poster is about Pocahontas, maybe related to adventure, landscape, and vibrant colors.”

This “Interest Rewrite” bridges the gap between the item’s inherent properties and the user’s specific tastes.
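
A short sketch of the counting-and-rewrite step, assuming each historical item already has a tag set from the extraction above; the prompt template paraphrases the example sentence rather than copying the paper’s exact wording, and the default values of `k` and `top_n` are illustrative.

```python
from collections import Counter

def rewrite_prompt(history_tag_sets, item_name, item_type="movie poster", k=10, top_n=3):
    """Build an interest-rewritten prompt from the user's last k interactions."""
    # Count how often each tag appears across the most recent k tag sets.
    counts = Counter(tag for tags in history_tag_sets[-k:] for tag in tags)
    top_tags = [tag for tag, _ in counts.most_common(top_n)]
    # Inject the dominant interests into the generation prompt.
    return (f"This {item_type} is about {item_name}, "
            f"maybe related to {', '.join(top_tags)}.")

# Example: a user whose recent history skews toward adventure imagery.
history = [{"adventure", "landscape"}, {"adventure", "vibrant colors"}, {"landscape"}]
print(rewrite_prompt(history, "Pocahontas"))
```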

Step 2: Interest Retrieve Attention (IRA)

Text prompts are powerful, but they lose a lot of nuance. The word “happy” is vague. A picture of a “happy” scene is specific. To give the generator better guidance, I-AM-G uses a retrieval mechanism.

First, the system encodes the text and images of all items into a latent space using CLIP (a model that connects text and images):

Equation for CLIP embeddings

It then searches the item pool.

  1. Text Retrieval: It finds existing items whose descriptions are similar to the rewritten prompt we created in Step 1.
  2. Image Retrieval: It finds existing images that are visually similar to the target item.

The similarity is calculated using cosine similarity:

Equation for cosine similarity

Based on this math, the system retrieves the top-\(K\) most relevant text and image embeddings (\(\Phi_i\) and \(\Psi_i\)):

Equation for Text Retrieval
Equation for Image Retrieval
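
Here is a minimal sketch of the retrieval step over precomputed CLIP embeddings. The pool construction, the query encoders, and the value of \(K\) are assumptions for illustration, not the paper’s settings.

```python
import numpy as np

def top_k_similar(query_emb, pool_embs, k=4):
    """Return indices of the k pool embeddings most cosine-similar to the query."""
    # Normalize so that a dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb)
    p = pool_embs / np.linalg.norm(pool_embs, axis=1, keepdims=True)
    sims = p @ q                      # cosine similarity with every pooled item
    return np.argsort(-sims)[:k]      # indices of the top-K matches

# Assumed usage: `text_pool` / `image_pool` hold CLIP embeddings of the item pool,
# `query_text_emb` encodes the rewritten prompt, `query_image_emb` the target item image.
# retrieved_text_idx  = top_k_similar(query_text_emb,  text_pool)
# retrieved_image_idx = top_k_similar(query_image_emb, image_pool)
```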

The Attention Mechanism

Now the system has a set of “reference” embeddings that represent what the user likes. It uses an Attention Mechanism to fuse these references into the generation process.

For text, it refines the rewritten prompt using the retrieved text embeddings:

Equation for Text Attention

For images, it fuses the retrieved visual features in a cross-modal manner:

Equation for Image Attention

This step effectively tells the model: “Look at these other posters/outfits that match the user’s vibe. Use their style as a reference.”
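
The fusion can be read as standard scaled dot-product attention, with the rewritten prompt (or target image) embedding as the query and the retrieved embeddings as keys and values. The sketch below is a generic version of that idea; it omits the learned projection matrices the actual module would have.

```python
import numpy as np

def retrieval_attention(query, retrieved):
    """Fuse retrieved reference embeddings into the query via attention.

    query:     (d,)   embedding of the rewritten prompt or target image
    retrieved: (K, d) top-K retrieved reference embeddings (used as keys and values)
    """
    d = query.shape[-1]
    scores = retrieved @ query / np.sqrt(d)   # similarity of the query to each reference
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax over the K references
    return weights @ retrieved                # weighted blend of the reference features
```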

Step 3: The Generator

At the heart of the generation stage is a Stable Diffusion model equipped with an IP-Adapter. The IP-Adapter is crucial because it allows the diffusion model to accept image prompts (the retrieved visual features) alongside text prompts.

The generator takes three inputs:

  1. The rewritten text tags (\(T^*\)).
  2. The foreground of the original item image (to keep the main subject consistent).
  3. The fused interest embedding (\(\tilde{z}\)) from the IRA module.

The noise prediction process (the core of how diffusion models learn to draw) is guided by these inputs:

Equation for Noise Prediction

Inside the neural network (specifically the U-Net), the attention layers are modified to weigh these inputs. The model balances the importance of the text rewrite (\(\lambda_1\)) and the visual interest retrieval (\(\lambda_2\)):

Equation for U-Net Attention
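
This weighting is reminiscent of IP-Adapter’s decoupled cross-attention: one attention branch over the text tokens, a second over the image (interest) tokens, combined with scalar weights. Below is a simplified PyTorch sketch, with the projection matrices omitted and the exact placement of \(\lambda_1\) and \(\lambda_2\) assumed rather than taken from the paper.

```python
import torch
import torch.nn.functional as F

def decoupled_cross_attention(q, text_kv, image_kv, lambda_1=1.0, lambda_2=0.3):
    """Weighted sum of two cross-attention branches inside a U-Net block.

    q:        (B, L, d)  queries from the U-Net's latent features
    text_kv:  (B, Lt, d) tokens from the rewritten text prompt (keys and values)
    image_kv: (B, Li, d) tokens from the fused interest embedding (keys and values)
    NOTE: the learned projection matrices (W_q, W_k, W_v) are omitted for brevity.
    """
    text_branch = F.scaled_dot_product_attention(q, text_kv, text_kv)
    image_branch = F.scaled_dot_product_attention(q, image_kv, image_kv)
    # lambda_1 scales the text-rewrite branch, lambda_2 the visual-interest branch.
    return lambda_1 * text_branch + lambda_2 * image_branch
```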

Finally, the model is trained using a standard Mean Squared Error (MSE) loss, comparing the predicted noise against the actual noise added during the forward diffusion process, ensuring the image resolves into something high-quality and relevant.

Equation for Loss Function
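
For reference, this is presumably the standard conditional denoising objective: predict the noise added at timestep \(t\) given the conditioning, and penalize the squared error. Written in the notation used in this post (assumed, not copied from the paper), it would look like:

\[
\mathcal{L} \;=\; \mathbb{E}_{z_0,\;\epsilon \sim \mathcal{N}(0, I),\;t}\left[\, \left\lVert \epsilon \;-\; \epsilon_\theta\!\left(z_t,\, t,\, T^{*},\, \tilde{z}\right) \right\rVert_2^2 \,\right]
\]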

Experiments and Results

The researchers tested I-AM-G on three distinct datasets:

  • MovieLens: Generating movie posters.
  • MIND: Generating news thumbnails.
  • POG: Generating fashion/outfit images.

They compared their model against standard baselines like Openjourney and DreamBooth.

Visual Performance

The visual results are striking. Below is a collection of generated outputs across the different datasets.

Personalized generation examples for outfits, movie posters, and news.

Take a look at the Movie Posters (Table 4 in the image above).

  • Row 1 (Elephant): The original is a standard photo. The “Cartoon” version turns it into a vibrant animation style. The “Horror” version darkens the lighting and adds an ominous atmosphere.
  • Row 2 (Submarine): The “Adventure” version makes it look like a high-seas epic, while the “Horror” version emphasizes the claustrophobia of the deep ocean.

In Table 3 (Outfits), notice how the same base item (like a green hoodie) is transformed. The “Cool” version adds edgy graphics, while the “Simple” version cleans up the design.

Quantitative Evaluation

But do people actually prefer these? The researchers conducted human studies where participants ranked the images.

Table 1: Human-evaluated average scores for results.

In the table above, a lower score is better (ranking 1st is better than 4th). I-AM-G consistently achieved the best (lowest) scores across all three datasets, outperforming Openjourney and DreamBooth.

They also used GPT-4o to act as a judge, providing it with the user history and asking it to rank the results.

Table 2: GPT-4o-evaluated average ranks.

The AI evaluation aligned closely with human preferences, further validating the method.

Comparison with Baselines

Why does I-AM-G win? Let’s look at a direct comparison.

Table 9: Comparison of generation results by different models.

In the figure above (Row 1), looking at the elephant:

  • Openjourney creates a very stylized, almost abstract image that loses the semantic meaning of the original movie.
  • DreamBooth often makes the image too dark or introduces artifacts (weird text overlays).
  • I-AM-G maintains the subject (the elephant and the setting) but successfully shifts the style to match the “Adventure” preference without breaking the image.

Ablation and Analysis

The researchers also tested “turning off” different parts of the system to see what matters most.

Table 7: Ablation study on I-AM-G core components.

  • w/o Interest Rewrite: Removing the tag rewriting caused the biggest drop in quality. This confirms that explicitly stating the user’s interests in the text prompt is vital.
  • w/o IRA: Removing the retrieval mechanism also hurt performance, proving that simply having the text isn’t enough—the model needs visual references from the item pool to generate high-fidelity results.

Controlling the Personalization

An interesting aspect of I-AM-G is the ability to tune the “strength” of the personalization using hyperparameters \(\lambda_1\) (text rewrite strength) and \(\lambda_2\) (visual retrieval strength).

Table 8: Case study for the \(\lambda\) parameters.

In Row 1, as \(\lambda_2\) (visual retrieval) increases, the dress transforms from a simple black dress to one with high-contrast white accents, matching a user’s interest in “cool” styles. However, if you push the parameter too high (e.g., 0.5 or 1.0), the image can become distorted or drift too far from the original item.

This balance is also visible in the number of tags used (\(H\)).

Figure 3: The relationship between the maximum number of tags used (\(H\)) and SSIM.

The graph above shows that there is a “sweet spot” (around 5 to 8 tags). Too few tags, and the personalization is weak. Too many tags, and the prompt becomes noisy, confusing the generator and lowering the structural similarity (SSIM) to the original item.
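
If you want to reproduce this kind of analysis, SSIM is straightforward to compute with scikit-image. A minimal sketch, assuming the original and generated images are loaded as same-sized uint8 RGB arrays (the paper’s exact evaluation settings may differ):

```python
import numpy as np
from skimage.metrics import structural_similarity

def ssim_to_original(original: np.ndarray, generated: np.ndarray) -> float:
    """Structural similarity between an original item image and its personalized version.

    Both inputs are expected to be same-sized uint8 RGB arrays of shape (H, W, 3).
    """
    return structural_similarity(original, generated, channel_axis=-1)
```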

Conclusion

The I-AM-G framework represents a significant step forward in personalized media. By combining the semantic understanding of Large Language Models with the visual creativity of Diffusion models, it offers a way to escape the “one-size-fits-all” world of current recommendations.

Key Takeaways:

  • Rewrite: Explicitly extracting user interest tags helps the model understand what to generate.
  • Retrieve: Looking up similar items provides the visual “vibe” that text alone cannot convey.
  • Result: A system that can take a generic movie poster or piece of clothing and tailor it to your specific aesthetic preferences.

While the system still faces challenges—such as the computational cost of retrieval and occasional “hallucinations” in generated details (like text on posters)—it opens up exciting possibilities for the future of e-commerce and entertainment. Soon, the internet might look a little different for everyone, uniquely tailored to who we are and what we love.