Imagine you have just launched a new streaming service. A new user signs up. You know their age and location, but you have zero data on what movies they actually like. What do you recommend?
If you recommend a romantic comedy to a horror fan, they might churn immediately. This is the classic Cold Start Problem in recommender systems. The algorithm needs data to learn preferences, but it needs to make good recommendations to get that data. Traditionally, the system has to “explore” (make random guesses) before it can “exploit” (make smart choices), leading to a poor user experience in the early stages.
For years, researchers have tried to solve this by “warm starting” models using historical data from similar campaigns. But collecting that data is expensive, slow, and riddled with privacy concerns.
A new paper, “Jump Starting Bandits with LLM-Generated Prior Knowledge,” proposes a fascinating solution: If we don’t have real user data yet, why not hallucinate it? By using Large Language Models (LLMs) to simulate user personas and their preferences, we can train our algorithms on synthetic humans before they ever meet a real one.
In this post, we will tear down this methodology, explore how LLMs can act as user simulators, and look at the empirical evidence showing how this approach significantly reduces the cost of learning.
Background: The Contextual Bandit
To understand the solution, we first need to understand the architecture being used: the Contextual Multi-Armed Bandit (CB).
The “Multi-Armed Bandit” name comes from a gambler standing in front of a row of slot machines (one-armed bandits). The gambler must decide which arms to pull to maximize their payout. Some machines pay out often; others rarely. The gambler faces the Exploration vs. Exploitation dilemma: do they stick with the machine they know pays out moderately well (exploit), or do they risk pulling a new machine that might pay out a jackpot or nothing at all (explore)?
In a Contextual Bandit, the gambler gets a hint before pulling the lever. This hint is the context (e.g., the user’s demographic info, device type, or past behavior). The goal is to learn a policy that selects the best “arm” (recommendation) based on the context.
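To make the explore/exploit trade-off concrete, here is a minimal sketch of a contextual bandit that keeps one linear reward model per arm and explores with an epsilon-greedy rule. The class name, feature dimensions, and epsilon value are illustrative choices, not anything prescribed by the paper.

```python
import numpy as np

class EpsilonGreedyLinearBandit:
    """Toy contextual bandit: one linear reward model per arm,
    epsilon-greedy exploration."""

    def __init__(self, n_arms: int, context_dim: int, epsilon: float = 0.1):
        self.n_arms = n_arms
        self.epsilon = epsilon
        # Per-arm ridge-regression statistics: A = X^T X + I, b = X^T r
        self.A = [np.eye(context_dim) for _ in range(n_arms)]
        self.b = [np.zeros(context_dim) for _ in range(n_arms)]

    def select_arm(self, context: np.ndarray) -> int:
        if np.random.rand() < self.epsilon:      # explore: pull a random arm
            return np.random.randint(self.n_arms)
        # exploit: pick the arm whose fitted model predicts the highest reward
        scores = [context @ np.linalg.solve(A, b) for A, b in zip(self.A, self.b)]
        return int(np.argmax(scores))

    def update(self, arm: int, context: np.ndarray, reward: float) -> None:
        self.A[arm] += np.outer(context, context)
        self.b[arm] += reward * context

# Usage: in practice the context would be an embedding of the user's profile.
bandit = EpsilonGreedyLinearBandit(n_arms=4, context_dim=8)
ctx = np.random.randn(8)
arm = bandit.select_arm(ctx)
bandit.update(arm, ctx, reward=1.0)
```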
The Metric: Regret
How do we measure success? We use a metric called Regret. Regret is the difference between the reward you could have gotten if you chose the optimal arm, and the reward you actually got.
\[
\text{Regret}(T) = \sum_{t=1}^{T} \Big( r\big(x_t, \pi^*(x_t)\big) - r\big(x_t, \pi(x_t)\big) \Big)
\]
In this equation:
- \(\pi^*\) is the optimal policy (the perfect choice).
- \(\pi\) is our current policy.
- \(x_t\) is the context observed at step \(t\), and \(r\) is the reward.
- We want to minimize the cumulative difference between the two policies over the horizon (\(T\)).
The problem with standard Contextual Bandits is that at \(t=0\), the regret is high. The model knows nothing. It makes random guesses. The cumulative regret climbs steeply until the model learns.
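In a simulation where the true expected rewards are known, cumulative regret can be tracked directly. A minimal sketch, assuming a `true_reward(context, arm)` function that stands in for the simulator's ground truth:

```python
import numpy as np

def cumulative_regret(true_reward, contexts, chosen_arms, n_arms):
    """true_reward(context, arm) -> expected reward under the simulator.
    Instantaneous regret is the gap between the best available arm
    and the arm the policy actually chose."""
    total, curve = 0.0, []
    for ctx, arm in zip(contexts, chosen_arms):
        rewards = np.array([true_reward(ctx, a) for a in range(n_arms)])
        total += rewards.max() - rewards[arm]
        curve.append(total)          # cumulative regret up to step t
    return curve
```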
The Core Method: Contextual Bandits with LLM Initialization (CBLI)
The researchers propose a framework called CBLI (Contextual Bandits with LLM Initialization). The core philosophy is simple: LLMs have been trained on essentially the entire internet. They contain a massive repository of human knowledge, cultural norms, and behavioral patterns. While an LLM might not perfectly predict a specific individual’s behavior, it can simulate a “better than random” baseline.
The process works in two distinct phases:
- Pre-training (The Simulation): We generate synthetic users and ask an LLM to simulate their preferences. We train the Bandit on this “fake” data.
- Online Learning (The Reality): We deploy the pre-trained Bandit to real users. Because it has already “seen” synthetic examples, it starts with a working theory of what users like, rather than guessing randomly.
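A compact sketch of the two phases, reusing the `EpsilonGreedyLinearBandit` class from the earlier snippet. The helper functions and the toy "everyone prefers arm 2" feedback are stand-ins for the LLM-driven pieces described in the next two steps.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_synthetic_users(n):
    """Hypothetical helper: yields (persona_text, context_vector) pairs.
    Here the personas are faked; in CBLI they come from an LLM (see Step 1)."""
    for _ in range(n):
        yield "synthetic persona", rng.normal(size=8)

def llm_preference(persona, arm):
    """Hypothetical stand-in for asking an LLM how much this persona likes
    the chosen arm (see Step 2 for how the paper elicits preferences)."""
    return float(arm == 2)   # toy answer: pretend everyone favours arm 2

# Phase 1: pre-train on synthetic users -- no real traffic is involved.
bandit = EpsilonGreedyLinearBandit(n_arms=4, context_dim=8)  # class from the sketch above
for persona, ctx in generate_synthetic_users(1000):
    arm = bandit.select_arm(ctx)
    bandit.update(arm, ctx, llm_preference(persona, arm))

# Phase 2: deploy. The same bandit keeps learning online from real feedback,
# but it starts with a working theory instead of a blank slate.
```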

Step 1: Generating Synthetic Users
First, we sample user features from a population distribution. For example, if we are simulating a charity drive, we might generate a profile like: “A 35-year-old accountant from Toronto who enjoys hiking and is motivated by environmental causes.” This text description is transformed into a mathematical vector (embedding) that the Bandit can understand.
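A sketch of what this sampling-and-embedding step might look like. The attribute lists, the profile template, and the use of a sentence-transformer model are my assumptions; the paper's exact feature construction may differ.

```python
import random
from sentence_transformers import SentenceTransformer  # any text-embedding model works

AGES = range(22, 70)
JOBS = ["accountant", "teacher", "nurse", "software developer"]
CITIES = ["Toronto", "Austin", "Manchester"]
HOBBIES = ["hiking", "cooking", "gaming", "gardening"]
CAUSES = ["environmental causes", "education", "disaster relief"]

def sample_persona() -> str:
    """Sample user features from a (toy) population distribution and
    render them as a natural-language profile."""
    return (f"A {random.choice(AGES)}-year-old {random.choice(JOBS)} from "
            f"{random.choice(CITIES)} who enjoys {random.choice(HOBBIES)} "
            f"and is motivated by {random.choice(CAUSES)}.")

embedder = SentenceTransformer("all-MiniLM-L6-v2")
persona = sample_persona()
context = embedder.encode(persona)   # the vector the bandit sees as "context"
```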
Step 2: Simulating Preferences (The Secret Sauce)
This is where the paper offers a crucial technical insight. How do you ask an LLM what a user likes?
If you give an LLM a user profile and an email draft and ask, “On a scale of 0 to 100, how likely is this user to donate?”, the LLM struggles. It tends to output clustered, unhelpful numbers (e.g., giving everything a score of 75).
Instead, the authors use Pairwise Comparisons. They present the LLM with the user persona and two different options (Arm A and Arm B) and ask: “Which of these two messages is more aligned with your interests?”
Algorithm 1 in the paper details this process:
- Adopt the persona of the user.
- Compare options \(k_1\) and \(k_2\).
- Record the winner.
- Repeat for various pairs to build a reward distribution.
The researchers found that LLMs are significantly better at ranking options than scoring them. As shown below, pairwise ranking (right) produces a much more distinct distribution of preferences compared to raw scoring (left), which results in flat, useless data.

By aggregating these pairwise “wins,” the system constructs a synthetic reward dataset. The Contextual Bandit is then pre-trained on this data using standard algorithms (like LinUCB).
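Here is a hedged sketch of that pipeline: query an LLM for pairwise winners in character as the persona, then convert win counts into per-arm reward estimates. It assumes the OpenAI Python SDK, and the model name and prompt wording are illustrative rather than the paper's exact setup.

```python
from itertools import combinations
from collections import Counter
from openai import OpenAI   # assumes the OpenAI Python SDK; any chat LLM would do

client = OpenAI()
ARMS = ["Formal", "Emotional/Narrative", "Informative/Educational", "Personal/Relatable"]

def pairwise_winner(persona: str, option_a: str, option_b: str) -> str:
    """Ask the LLM, in character as the persona, which of two email styles
    it prefers. The prompt is illustrative, not the paper's exact wording."""
    prompt = (
        f"Adopt the following persona: {persona}\n"
        "Which of these two fundraising email styles is more aligned with "
        "your interests? Answer with exactly 'A' or 'B'.\n"
        f"A: {option_a}\nB: {option_b}"
    )
    reply = client.chat.completions.create(
        model="gpt-4o-mini",   # any chat model works here
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content.strip().upper()
    return option_a if reply.startswith("A") else option_b

def synthetic_rewards(persona: str) -> dict:
    """Aggregate pairwise wins into per-arm reward estimates in [0, 1]."""
    wins = Counter()
    for a, b in combinations(ARMS, 2):
        wins[pairwise_winner(persona, a, b)] += 1
    n_matches = len(ARMS) - 1   # each arm appears in exactly 3 pairs
    return {arm: wins[arm] / n_matches for arm in ARMS}
```

Each persona's per-arm win rates, paired with its context embedding, form one batch of synthetic (context, arm, reward) examples; stacking these across personas gives the dataset used to warm-start a LinUCB-style bandit.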
Experiment 1: The Charity Email Campaign
To test this, the authors designed a scenario involving a humanitarian organization soliciting donations.
The Setup:
- Users: 1,000 synthetic donor profiles generated by GPT-4o (simulating real people).
- Arms (Choices): The Bandit must choose one of four writing styles for the donation email:
  - Formal
  - Emotional/Narrative
  - Informative/Educational
  - Personal/Relatable
- The Oracle: GPT-4o acts as the “ground truth” (representing the actual human). The goal is to see if a Bandit pre-trained on responses from other models (like GPT-3.5 or Mistral) can predict the preferences of the GPT-4o “humans.”
The Results: The results were compelling. The graph below shows the accumulated regret over 1,000 steps.
- The Brown line (Not Pretrained) skyrockets. This represents the standard “cold start.” It makes many mistakes early on.
- The Purple line (Similarity) uses basic cosine similarity between user text and email text. It’s better than random, but not great.
- The remaining lines (GPT-4o, GPT-3.5, Claude, Mistral) represent the CBLI method. Even when a smaller, cheaper model like GPT-3.5 generates the synthetic data, the regret is drastically lower.

This proves that the “fake” prior knowledge generated by an LLM is accurate enough to guide the Bandit toward the right email styles very early in the process.
The authors also quantified exactly how much "fake" data you need. The table below shows that pre-training with just 100 synthetic users reduces regret by roughly 5-9%; increasing that to 1,000 users yields a 14-25% reduction.

Experiment 2: Real-World Verification (Conjoint Analysis)
Critics might argue: “Experiment 1 just shows that LLMs are good at predicting other LLMs. What about real humans?”
To address this, the researchers utilized a Conjoint Survey Experiment. This is a type of survey used in social science where people are shown two complex options (like two political candidates or two products) and forced to choose one.
The Setup:
- Dataset: A real-world survey regarding COVID-19 vaccine preferences (N=1970 participants).
- Context: Demographics (age, gender, politics, income).
- Arms: Hypothetical vaccines with different attributes (efficacy, side effects, origin country).
- Task: The Bandit must predict which vaccine the real human participant chose in the survey.
This setup used a Sleeping Bandit framework. Unlike the email experiment, where all four styles were always available, here the "arms" (specific vaccine profiles) changed every round. The algorithm had to learn which attributes drove preference.
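Conceptually, a sleeping bandit scores only the arms that happen to be offered in the current round, so each arm must be described by features (e.g., encoded vaccine attributes) rather than by a fixed index. A minimal sketch, with made-up dimensions:

```python
import numpy as np

class SleepingLinearBandit:
    """Sketch of a bandit whose available arms change every round.
    Each arm is a feature vector (e.g. encoded vaccine attributes), so
    the model learns one weight vector over (context, arm-feature) pairs."""

    def __init__(self, dim: int):
        self.A = np.eye(dim)     # ridge-regression statistics
        self.b = np.zeros(dim)

    def select(self, context, available_arms) -> int:
        theta = np.linalg.solve(self.A, self.b)
        scores = [np.concatenate([context, arm]) @ theta for arm in available_arms]
        return int(np.argmax(scores))

    def update(self, context, arm, reward: float) -> None:
        x = np.concatenate([context, arm])
        self.A += np.outer(x, x)
        self.b += reward * x

# Each round offers a fresh pair of hypothetical vaccine profiles:
bandit = SleepingLinearBandit(dim=6 + 4)   # 6 context dims + 4 attribute dims
ctx = np.random.randn(6)
vaccines = [np.random.randn(4), np.random.randn(4)]
choice = bandit.select(ctx, vaccines)
bandit.update(ctx, vaccines[choice], reward=1.0)
```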
The Results: Once again, CBLI shined. The model was pre-trained on synthetic users (generated based on the survey demographics) and synthetic preferences (what an LLM thought those people would choose). It was then tested against the actual human choices.

As the paper's Figure 3 shows, the non-pretrained model (purple line) accumulates the most regret. The models pre-trained on LLM hallucinations (blue, orange, green) performed significantly better on real human data.
This is a critical finding: LLM-generated preferences are essentially a low-fidelity proxy for human preferences. They aren’t perfect, but they are directionally correct enough to save the algorithm from flailing blindly at the start.
Implications for Privacy and Data Scarcity
One of the most interesting sub-experiments in the paper addressed privacy. Often, we cannot send rich user profiles (income, political views, health status) to an external LLM API due to privacy regulations (GDPR, HIPAA).
The researchers tested how the model performed when stripping away personal data.
- Full: Uses all demographic info.
- No Personal: Removes demographics, keeps only generic context.

Remarkably, even the “No Personal” baseline (where the LLM had almost no specific info about the user) achieved a 14.79% reduction in regret after 1,000 steps compared to a cold start. This suggests that LLMs capture universal or “average” human preferences that act as a safety net for the algorithm, even without invasive data collection.
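One way such an ablation could be wired up is to render the same profile with or without sensitive fields before it ever reaches the LLM. The field names below are illustrative, not taken from the paper:

```python
SENSITIVE_FIELDS = {"income", "politics", "health_status"}

def render_persona(profile: dict, include_personal: bool = True) -> str:
    """Render a user profile as text for the LLM prompt.
    With include_personal=False, sensitive attributes never leave the system."""
    fields = {k: v for k, v in profile.items()
              if include_personal or k not in SENSITIVE_FIELDS}
    return ", ".join(f"{k}: {v}" for k, v in fields.items())

profile = {"age": 42, "gender": "female", "income": "high",
           "politics": "independent", "health_status": "asthma"}
print(render_persona(profile, include_personal=False))
# -> "age: 42, gender: female"
```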
Conclusion
The “Cold Start” problem has long been a tax on new personalization systems—you have to accept bad performance to buy future good performance. The CBLI framework effectively acts as a loan to pay that tax.
By using LLMs to hallucinate a synthetic population and their preferences, we can:
- Jumpstart Learning: Skip the random guessing phase.
- Save Money: Reduce the cost of acquiring interaction data.
- Preserve Privacy: Train on synthetic surrogates before exposing the model to sensitive real-world data.
While the authors note limitations—specifically that LLMs can harbor biases that might skew the initial policy—the empirical evidence is clear. A Bandit that has “practiced” in a simulation is far superior to one learning on the job.
For students and practitioners in Machine Learning, this paper represents a shift in how we view LLMs: not just as chatbots, but as generative simulators that can provide prior distributions for classical control and reinforcement learning problems.