How to Teach AI to Write Like Thousands of Humans: A New Approach to Synthetic Data for Person ReID
In the rapidly evolving world of computer vision, data is the new oil. But for specific tasks, like Text-to-Image Person Re-identification (ReID), that oil is running dry. The cost of manually annotating millions of images with detailed textual descriptions is astronomical.
Naturally, researchers have turned to Multimodal Large Language Models (MLLMs)—like GPT-4V or LLaVA—to generate synthetic captions. It sounds like the perfect solution: let the AI label the data. However, there is a catch. MLLMs are often too consistent. They tend to speak in a monotonous, “average” style, lacking the rich linguistic diversity that human annotators naturally provide.
In this post, we dive deep into a fascinating paper titled “Modeling Thousands of Human Annotators for Generalizable Text-to-Image Person Re-identification”. The authors introduce a novel framework called Human Annotator Modeling (HAM). Instead of settling for robotic descriptions, they teach an MLLM to roleplay thousands of different human annotators, each with unique writing styles. This results in a massive, diverse dataset that significantly boosts the performance of person retrieval systems.
The Problem: The “Boring AI” Bottleneck
Text-to-Image Person ReID is the task of retrieving images of a specific person from a gallery based on a textual description (e.g., “A young woman in a red dress and white sneakers carrying a black backpack”).
To train robust models for this task, you need massive datasets. While we have millions of unlabeled pedestrian images, we lack the text descriptions to go with them. Previous attempts to automate this using MLLMs faced a diversity problem.
If you ask a standard MLLM to describe a person, it will likely give you a grammatically perfect, standard description. If you ask it 1,000 times, you get the same sentence structure 1,000 times. Human language, however, is messy and varied. One person might say “black hair,” while another says “straight shoulder-length dark tresses.” This variation is crucial for training generalizable models.

As shown in Figure 1, previous methods (Top) tried to force diversity using rigid templates (e.g., “[gender] wears [shoes]”). This is limited and artificial. The new approach (Bottom) takes a different route: it extracts “style features” from real human data, clusters them, and teaches the MLLM to mimic these specific human styles.
The Solution: Human Annotator Modeling (HAM)
The core contribution of this paper is the HAM framework. The goal is to simulate the preferences of thousands of different annotators. The framework operates in three main stages: Style Feature Extraction, Clustering, and Prompt Learning.

1. Style Feature Extraction
How do you separate what someone is saying from how they are saying it?
The authors propose a clever method to isolate “style.” They take an existing human-written caption (e.g., “A man in a blue shirt”) and use a Large Language Model (LLM) to strip away the identity-specific information, replacing concrete attributes with generic placeholder words.
For example:
- Original: “A woman with long brown hair wearing a red graphic t-shirt.”
- Processed: “A person with [hairstyle] wearing a [top].”
By removing the specific visual content (red, brown, graphic), what remains is the sentence structure and wording preference—the “style.” This processed text is then fed into the CLIP Text Encoder to produce a vector representation. This vector is the Style Feature.
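To make this concrete, here is a minimal sketch of the style-feature extraction step. The `anonymize_caption` function is a hypothetical stand-in for the LLM rewriting step described above, and the embedding uses the Hugging Face CLIP text encoder; both are illustrative assumptions rather than the authors' exact setup.

```python
import re
import torch
from transformers import CLIPTokenizer, CLIPTextModelWithProjection

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")
text_encoder.eval()

def anonymize_caption(caption: str) -> str:
    """Toy stand-in for the LLM rewriting step: replace concrete attributes
    with generic placeholders so only structure and wording remain."""
    caption = re.sub(r"long brown hair|short black hair", "[hairstyle]", caption)
    caption = re.sub(r"red graphic t-shirt|blue shirt", "[top]", caption)
    return caption

@torch.no_grad()
def style_feature(caption: str) -> torch.Tensor:
    """Encode the content-stripped caption with the CLIP text encoder."""
    stripped = anonymize_caption(caption)
    inputs = tokenizer(stripped, return_tensors="pt", padding=True, truncation=True)
    embeds = text_encoder(**inputs).text_embeds        # shape (1, 512)
    return torch.nn.functional.normalize(embeds, dim=-1)[0]

style_vec = style_feature("A woman with long brown hair wearing a red graphic t-shirt.")
```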
2. Clustering: Finding the Personas
Once the authors had extracted style features from thousands of real human annotations, the next step was to organize them. They used clustering algorithms (like KMeans) to gather similar style vectors into the same cluster.
If the algorithm finds \(K_1\) clusters, it effectively identifies \(K_1\) different “personas” or writing styles found in the training data. One cluster might represent annotators who are very verbose; another might represent those who are terse and focus only on clothing colors.
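A minimal sketch of this step, assuming `style_feats` is an (N, d) array of style features produced as above; the cluster count here is only an illustrative choice, since the framework models far more personas.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for the real style features (N captions, 512-dim CLIP embeddings).
style_feats = np.random.randn(10_000, 512).astype(np.float32)

K1 = 100  # illustrative; the framework models thousands of annotator personas
kmeans = KMeans(n_clusters=K1, n_init=10, random_state=0).fit(style_feats)

persona_of_caption = kmeans.labels_         # which persona "wrote" each caption
persona_centers = kmeans.cluster_centers_   # (K1, 512) representative styles
```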
3. Prompt Learning
Now comes the “inception” part: teaching the MLLM to adopt these personas.
Instead of writing a manual instruction like “Write in a verbose style,” the authors use Prompt Learning. They assign a learnable vector (a “soft prompt”), denoted as \(\mathbf{P}_i\), to each style cluster.
The input to the MLLM is the concatenation of three parts:

\[ \left[\,\mathbf{V};\ \mathbf{P}_i;\ \mathbf{T}\,\right] \]
Here:
- \(\mathbf{V}\) represents the image features (what the AI sees).
- \(\mathbf{T}\) represents the text tokens (the description).
- \(\mathbf{P}_i\) is the learnable prompt representing a specific human style.
The model is trained to generate the original human caption when conditioned on the image and the specific style prompt \(\mathbf{P}_i\). Crucially, the parameters of the MLLM (the “brain”) are frozen. Only the prompt vectors \(\mathbf{P}_i\) and a small adapter layer are updated. This ensures the MLLM retains its vast knowledge while learning to “steer” its output style.
The training uses a standard auto-regressive loss function, ensuring the generated text matches the target human text token by token:

\[ \mathcal{L} = -\sum_{t=1}^{|\mathbf{T}|} \log p_{\theta}\!\left(\mathbf{T}_t \,\middle|\, \mathbf{V}, \mathbf{P}_i, \mathbf{T}_{<t}\right) \]
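Below is a rough PyTorch sketch of how such soft-prompt conditioning could look. The MLLM interface (`encode_image`, `embed_tokens`, `decoder`) and the adapter are assumed names for illustration; the essential point is that only the style prompts and the small adapter receive gradients while the MLLM itself stays frozen.

```python
import torch
import torch.nn as nn

class StylePrompts(nn.Module):
    """One learnable soft prompt P_i per style cluster."""
    def __init__(self, num_styles: int, prompt_len: int, hidden: int):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(num_styles, prompt_len, hidden) * 0.02)

    def forward(self, style_ids: torch.Tensor) -> torch.Tensor:
        return self.prompts[style_ids]                   # (B, prompt_len, hidden)

def training_step(mllm, adapter, style_prompts, images, style_ids, text_ids):
    """Condition the frozen MLLM on [V; P_i; T] and apply the auto-regressive
    loss on the human caption tokens. The `mllm.*` calls are assumed interfaces."""
    with torch.no_grad():                                # vision tower + LM stay frozen
        img_feats = mllm.encode_image(images)            # (B, n_img, d_vision)
        T = mllm.embed_tokens(text_ids)                  # (B, n_txt, hidden)
    V = adapter(img_feats)                               # small trainable adapter
    P = style_prompts(style_ids)                         # (B, prompt_len, hidden)
    inputs = torch.cat([V, P, T], dim=1)

    logits = mllm.decoder(inputs_embeds=inputs).logits   # (B, seq_len, vocab)
    # Predict each caption token from everything before it (teacher forcing).
    caption_logits = logits[:, -text_ids.size(1):-1, :]
    loss = nn.functional.cross_entropy(
        caption_logits.reshape(-1, caption_logits.size(-1)),
        text_ids[:, 1:].reshape(-1),
    )
    return loss   # backprop updates only style_prompts and the adapter
```

At annotation time the mechanism runs in reverse: pick a style prompt \(\mathbf{P}_i\) (a persona), feed it together with a new pedestrian image, and the frozen MLLM writes a caption in that persona’s style.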
Going Further: Uniform Prototype Sampling (UPS)
The researchers noticed a limitation of clustering the real data directly (as KMeans does): the cluster centers land where the data is dense, which means the common, “average” styles. If you only model those dense clusters, you miss the rare, unique, or extreme writing styles that make a dataset truly diverse.
To fix this, they introduced Uniform Prototype Sampling (UPS).
Instead of just looking at where the data is, they looked at the Style Feature Space itself. They calculated the mean (\(\mu_s\)) and standard deviation (\(\sigma_s\)) of the style features across the entire dataset:

\[ \mu_s = \frac{1}{N}\sum_{j=1}^{N} \mathbf{s}_j, \qquad \sigma_s = \sqrt{\frac{1}{N}\sum_{j=1}^{N} \left(\mathbf{s}_j - \mu_s\right)^2} \]

where \(\mathbf{s}_j\) is the style feature of the \(j\)-th human caption and \(N\) is the total number of captions.
Using these statistics, they defined a “bounding box” for valid style features. Within this space, they performed uniform sampling to generate new cluster centers, \(\mathbf{c}_i\).

By sampling uniformly, they force the model to learn styles that are distributed evenly across the possible style space, rather than just the most common ones. This captures the “long tail” of human expression—the unique ways people describe things that don’t show up often but are crucial for robust training.
The final set of prompts includes both the density-based clusters (KMeans) and the uniform clusters (UPS), giving the best of both worlds: accurate representation of common styles and diverse coverage of rare styles.
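Here is a minimal sketch of UPS, assuming the “bounding box” spans one standard deviation around the mean in each dimension (the exact extent is a detail of the paper); the key point is that prototypes are drawn uniformly over the style space rather than only where human data is dense.

```python
import numpy as np
from sklearn.cluster import KMeans

def uniform_prototype_sampling(style_feats: np.ndarray, k2: int, seed: int = 0) -> np.ndarray:
    """Sample K2 style prototypes uniformly inside a box around the data."""
    rng = np.random.default_rng(seed)
    mu = style_feats.mean(axis=0)              # mu_s
    sigma = style_feats.std(axis=0)            # sigma_s
    low, high = mu - sigma, mu + sigma         # assumed per-dimension bounding box
    return rng.uniform(low, high, size=(k2, style_feats.shape[1]))

style_feats = np.random.randn(10_000, 512).astype(np.float32)   # stand-in features
kmeans_centers = KMeans(n_clusters=50, n_init=10, random_state=0).fit(style_feats).cluster_centers_
ups_centers = uniform_prototype_sampling(style_feats, k2=50)

# Final prototype set: common styles (KMeans) + even coverage of rare styles (UPS).
all_prototypes = np.concatenate([kmeans_centers, ups_centers], axis=0)
```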
Experiments and Results
To prove this works, the authors created a new database called HAM-PEDES, containing 1 million images annotated using their method. They then trained standard ReID models on this data and tested them on real-world benchmarks (CUHK-PEDES, ICFG-PEDES, and RSTPReid).
Does Style Modeling Help?
The first question is whether modeling styles is better than using templates. The ablation study below compares “Static captions” (basic MLLM output), “Template-based” captions, and the HAM approach.

Key Takeaways from Table 1:
- Static captions (Row 1) perform poorly. The lack of diversity hurts the model.
- Templates (Rows 2-3) help, but adding more templates (even 6.8K of them!) yields diminishing returns.
- HAM (Row 6) significantly outperforms templates.
- HAM + UPS (Row 13) provides the best performance. The combination of modeling common styles (KMeans) and exploring the full style space (UPS) yields a massive jump in accuracy (Rank-1 increases from ~35% to ~60% on CUHK-PEDES).
Comparison with Other Datasets
How does HAM-PEDES compare to other massive synthetic datasets like SYNTH-PEDES or those generated by generic MLLMs?

As seen in Table 2, models pre-trained on HAM-PEDES (bottom rows) significantly outperform those trained on other datasets, even when those datasets are larger or use more captions per image. On the RSTPReid benchmark, the HAM approach achieves a Rank-1 accuracy of 58.85%, beating the previous best by a wide margin.
This superiority holds true even when fine-tuning the models, as shown in Table 3 below. The initial parameters learned from HAM-PEDES provide a much better starting point for the model.

The Importance of Scale
One of the promises of synthetic data is scalability. Does adding more HAM-generated data continue to improve performance?

Figure 3 confirms that the performance scales almost linearly with data size. As the authors increased the pre-training data from 0.1M to 1.0M images, the Rank-1 accuracy (blue line) steadily climbed across all three datasets. This suggests that the diversity provided by HAM prevents the model from hitting a “saturation point” early on.
State-of-the-Art Performance
Finally, the authors compared their final model against current state-of-the-art (SOTA) methods.

The results in Table 4 are conclusive. When using HAM-PEDES for pre-training, the ReID model achieves new SOTA results on all benchmarks. For example, on the challenging RSTPReid dataset, their method (combined with the RDE architecture) achieves 72.50% Rank-1 accuracy, surpassing previous CLIP-based methods by a significant margin.
Conclusion and Implications
The “Modeling Thousands of Human Annotators” paper presents a sophisticated solution to the synthetic data problem. It recognizes that diversity is not just about content (what is in the image) but also about style (how it is described).
By extracting style features, clustering them into personas, and using prompt learning to control an MLLM, the HAM framework generates captions that feel distinct and human. The addition of UPS ensures that even rare description styles are represented, preventing the AI from regressing to the mean.
Key Takeaways:
- Templates are dead: Hard-coded templates cannot capture the nuance of human language.
- Style is a feature: Treating writing style as a mathematical vector allows us to manipulate and sample it effectively.
- Better Data > Better Architectures: The huge performance jumps came not from inventing a new ReID network, but from creating better training data for existing ones.
This methodology has implications far beyond Person ReID. Imagine applying HAM to medical imaging reports, e-commerce product descriptions, or creative writing assistants. Any field that relies on diverse, high-quality text data could benefit from teaching AI to mimic not just a human, but thousands of them.