The quest for Artificial General Intelligence often feels like a hardware race—bigger clusters, more GPUs. But seasoned researchers know that the bottleneck is increasingly data quality. To build AI agents that surpass average human intelligence, we need training data that encapsulates superhuman knowledge.

In the realm of computer vision, specifically image captioning, we have a significant problem. Most existing training datasets consist of naive, generic descriptions. If you show a model a picture of a “common iguana,” a standard dataset might just label it “a lizard on a branch.” This offers minimal utility.

Conversely, the internet is full of “alt-text”—metadata uploaded by users. This text often contains expert-level specifics (e.g., specific species names, dates, or locations) but is frequently noisy, grammatically incorrect, or only loosely aligned with the image pixels.

This blog post explores Altogether, a new research paper from Meta FAIR and collaborators. The researchers propose a principled approach to fix this dilemma. Instead of captioning images from scratch (which loses specific details) or using raw noisy alt-text, they introduce a method to re-align existing alt-text to the image. This approach creates a high-quality, synthetic dataset that significantly boosts performance in image captioning, text-to-image generation, and zero-shot classification.

The Core Problem: Specificity vs. Quality

Current approaches to improving image captions generally fall into two camps:

  1. Captioning from Scratch: Models (often large proprietary ones like GPT-4V) look at an image and generate a description. While the output is grammatically polished, these models often hallucinate or miss specific entities they don’t recognize. They might reduce a specific “1992 University of Miami T-shirt” to just “a green shirt.”
  2. Using Raw Alt-Text: This preserves the specific “1992 University of Miami” text but might include irrelevant metadata (like filenames) or miss visual descriptions entirely.

The authors of Altogether identified a key insight: The original creator who wrote the alt-text is likely the subject matter expert. They know the specific dog breed or the location of the holiday photo. An AI (or a random human annotator) looking at the image later cannot recover that lost context.

Therefore, the goal shouldn’t be to rewrite the caption, but to refine it—keeping the rich entities from the alt-text while aligning the structure and visual details to the actual image content.

The Altogether Method

The Altogether approach is twofold: a human annotation strategy to create a fine-tuning dataset, and a parameter-efficient model architecture trained to automate this process at scale.

1. The Annotation Strategy: Iterative Refinement

The researchers realized that asking humans to write dense captions from scratch is hard and often leads to generic results. Instead, they employed an iterative process shown in the Venn diagram below.

A Venn diagram illustrating caption quality improvement via multiple rounds of re-aligning previous captions.

As illustrated in Figure 1:

  • Round 1 (Alt-text): We start with the raw metadata. It contains the specific entity (“common iguana”) but lacks visual context.
  • Round 2: Annotators refine this text, aligning it with visual cues (describing the color and position).
  • Round N: The final caption is a dense, precise description that retains the expert knowledge (“iguana”) while adding descriptive fidelity (“grey head,” “green body,” “climbing on a brown tree branch”). A minimal code sketch of this loop appears below.
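To make this loop concrete, here is a minimal Python sketch of the round-based refinement. The `realign` function is a placeholder for either a human annotator (during dataset creation) or the trained captioner (for automation at scale); its name, the identity stub, and the default of three rounds are illustrative assumptions, not the paper’s code.

```python
# Minimal sketch of the round-based refinement loop described above.
# `realign` is a placeholder for a human annotator or the trained captioner;
# it is NOT the paper's implementation.

def realign(image, previous_caption: str) -> str:
    # Placeholder: keep the entities from `previous_caption`, but rewrite the
    # text so it describes what is actually visible in `image`.
    return previous_caption  # identity stub, for illustration only

def iterative_refinement(image, alt_text: str, num_rounds: int = 3) -> str:
    caption = alt_text                     # round 1 starts from raw alt-text, e.g. "common iguana"
    for _ in range(num_rounds):
        caption = realign(image, caption)  # each round edits the output of the previous round
    return caption                         # e.g. "A common iguana with a grey head and green body
                                           #       climbing on a brown tree branch."
```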

2. The Model Architecture

How do we teach a neural network to perform this “re-alignment”? The researchers designed a lightweight captioner based on the ClipCap architecture.

The architecture needs to process two distinct inputs: the image itself and the original alt-text.

Architecture diagram showing the flow from image and alt-text to the text decoder.

As shown in Figure 2, the pipeline works as follows:

  1. Image Encoder: A frozen CLIP image encoder processes the input image.
  2. Mapping Network: A Transformer converts the CLIP embeddings into a fixed sequence of “visual tokens” (vectors that the language model can understand).
  3. Text Decoder: A trainable Language Model (OPT 1.3B) receives the visual tokens and the tokenized alt-text.

The magic happens in the decoder. Because it attends to the alt-text tokens, it can “copy” specific entities (like the name of the iguana) into the final output. Because it attends to the visual tokens, it ensures the description actually matches the picture (e.g., removing text about objects that aren’t visible).
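To make the data flow concrete, here is a rough PyTorch-style sketch of a ClipCap-style re-aligner built with Hugging Face transformers. The frozen CLIP image encoder and the OPT 1.3B decoder come from the paper’s description above, but the specific CLIP checkpoint, the number of visual tokens, and the shape of the mapping network are assumptions for illustration; the actual implementation may differ.

```python
# Rough sketch of the re-aligner: frozen CLIP encoder -> mapping network ->
# OPT decoder that also reads the alt-text as ordinary tokens. Shapes and
# the mapping design are simplified assumptions, not the paper's code.
import torch
import torch.nn as nn
from transformers import CLIPVisionModel, OPTForCausalLM

class ReAligner(nn.Module):
    def __init__(self, num_visual_tokens: int = 40):
        super().__init__()
        self.clip = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
        self.clip.requires_grad_(False)                      # image encoder stays frozen
        self.opt = OPTForCausalLM.from_pretrained("facebook/opt-1.3b")
        clip_dim = self.clip.config.hidden_size
        opt_dim = self.opt.config.hidden_size
        self.num_visual_tokens = num_visual_tokens
        # Mapping network: turn the CLIP embedding into a fixed-length visual prefix.
        self.project = nn.Linear(clip_dim, opt_dim * num_visual_tokens)
        layer = nn.TransformerEncoderLayer(opt_dim, nhead=8, batch_first=True)
        self.mapper = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, pixel_values, text_input_ids, labels=None):
        with torch.no_grad():
            clip_out = self.clip(pixel_values=pixel_values).pooler_output    # (B, clip_dim)
        prefix = self.project(clip_out).view(-1, self.num_visual_tokens,
                                             self.opt.config.hidden_size)
        visual_tokens = self.mapper(prefix)                                  # (B, k, opt_dim)
        # Alt-text (and, during training, the target caption) enters as normal tokens.
        text_embeds = self.opt.get_input_embeddings()(text_input_ids)
        inputs_embeds = torch.cat([visual_tokens, text_embeds], dim=1)
        if labels is not None:
            # No loss on the visual prefix; the caller should also mask the
            # alt-text prompt positions with -100 so loss falls only on the target caption.
            ignore = torch.full(visual_tokens.shape[:2], -100,
                                dtype=labels.dtype, device=labels.device)
            labels = torch.cat([ignore, labels], dim=1)
        return self.opt(inputs_embeds=inputs_embeds, labels=labels)
```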

The Mathematics of Re-alignment

To formalize this, let’s look at the standard image captioning objective. Usually, a model predicts the probability of a caption \(t\) (a sequence of tokens \(t_k\)) given the image \(i\):

\[
p_\theta(t \mid i) = \prod_{k} p_\theta(t_k \mid t_{<k},\, i)
\]

However, Altogether changes the conditioning. The model is now conditioned on the image \(i\) and the previous caption (alt-text) \(t'\):

\[
p_\theta(t \mid i,\, t') = \prod_{k} p_\theta(t_k \mid t_{<k},\, i,\, t')
\]

During the iterative annotation process described earlier, the caption \(t_{(r)}\) produced in round \(r\) becomes the input alt-text \(t'_{(r+1)}\) for round \(r+1\):

\[
t'_{(r+1)} = t_{(r)}
\]

This simple shift in training objective allows the model to act as an intelligent editor rather than just a generator.
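Concretely, here is a hedged sketch of how one training example could be packed so that the decoder sees the alt-text \(t'\) as context while the loss covers only the re-aligned caption \(t\). The alt-text-then-target layout and the -100 masking convention are assumptions borrowed from standard causal-LM training, not the paper’s released code (the visual tokens are prepended separately as embeddings, as in the architecture sketch above).

```python
# Sketch of packing one training example: the decoder reads the alt-text t'
# as a prompt, and the loss is applied only to the re-aligned caption t.
# The layout and -100 masking are assumptions, not the paper's pipeline.
from typing import List, Tuple

IGNORE_INDEX = -100  # tokens with this label contribute no loss

def pack_example(alt_text_ids: List[int],
                 target_ids: List[int],
                 sep_id: int) -> Tuple[List[int], List[int]]:
    input_ids = alt_text_ids + [sep_id] + target_ids
    labels = [IGNORE_INDEX] * (len(alt_text_ids) + 1) + target_ids
    return input_ids, labels

# Toy token ids: the model conditions on the alt-text but is only trained
# to predict the re-aligned caption.
input_ids, labels = pack_example(alt_text_ids=[11, 42, 7],
                                 target_ids=[42, 8, 9, 2], sep_id=0)
assert len(input_ids) == len(labels)
```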

Does It Work? Experimental Results

The researchers evaluated their model against state-of-the-art baselines, including massive proprietary models like GPT-4V. They utilized a subset of the WIT (Wikipedia Image-Text) dataset for testing, which is known for having rich, entity-heavy descriptions.

Qualitative Analysis

The ability of the model to filter noise while keeping signal is best demonstrated by examples.

Table showing qualitative examples of corrected captions.

In Table 3, notice the second row (the seashell).

  • Alt-text: “conch”, “a rock”.
  • Re-aligned Caption: “A photo of a conch shell on a sandy beach…”

The model kept the specific term “conch” (which a generic model might have just called a “shell”) but removed the incorrect tag “a rock” because the visual tokens didn’t support it. This demonstrates the model effectively rejecting hallucinations in the alt-text based on visual evidence.

Human Evaluation

Metrics like BLEU and CIDEr are notoriously poor at capturing factual correctness in dense captions. The researchers conducted a human study to see which captions people actually preferred.

Bar chart comparing human preference across different models.

Figure 3 shows a clear win for Altogether (Round 3) in purple.

  1. Alignment: It has less hallucination than even GPT-4V.
  2. Specificity: It contains significantly more named entities and specific details.
  3. Usefulness: It retains the useful parts of the alt-text better than models that caption from scratch.

Downstream Task: Text-to-Image Generation

One of the most valuable applications for better captions is training image generators (like Stable Diffusion or DALL-E). If the training captions are better, the image generator should follow prompts better.

The researchers trained a Latent Diffusion Model (LDM) using their synthetic captions versus original alt-text.

Table showing CLIP similarity scores for text-to-image generation.

Table 5 demonstrates that models trained on Altogether (Round 3) synthetic captions achieve significantly higher CLIP scores (29.8 vs 27.0). This means the generated images are much more semantically aligned with the text prompts.
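As a point of reference, CLIP similarity between a prompt and a generated image can be computed roughly as below with the Hugging Face CLIP API. The checkpoint choice and the 0–100 scaling are illustrative assumptions; the paper’s exact evaluation protocol may differ.

```python
# Rough sketch of a CLIP similarity score between a prompt and a generated
# image. Checkpoint and scaling are assumptions, not the paper's setup.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, prompt: str) -> float:
    inputs = processor(text=[prompt], images=image, return_tensors="pt",
                       padding=True, truncation=True)
    with torch.no_grad():
        img_feat = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_feat = model.get_text_features(input_ids=inputs["input_ids"],
                                           attention_mask=inputs["attention_mask"])
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    return 100.0 * (img_feat * txt_feat).sum().item()  # cosine similarity, scaled
```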

Downstream Task: Zero-Shot Classification

Finally, does this data help discriminative models (like CLIP itself)? The researchers trained CLIP models using different mixtures of real alt-text and synthetic captions.

Line graph showing zero-shot accuracy vs. synthetic caption ratio.

Figure 4 reveals a fascinating nuance. While synthetic captions are great, you shouldn’t rely on them 100% for classification tasks.

  • The orange line (Avg 26 Tasks) peaks around a 15% mix of synthetic captions.
  • If you replace all data with synthetic captions (ratio 1.0), performance drops.

Why? Likely because synthetic captions—while clean—might “smooth over” some of the messy, long-tail concepts found in raw alt-text that are crucial for zero-shot classification on diverse datasets. However, supplementing real data with 15% re-aligned data provides a clear boost (roughly +1.1% accuracy).
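One plausible way to picture the mixing experiment is a per-sample coin flip at the reported sweet spot (around 15% synthetic). The sketch below is an assumption about how such a mixture could be built, not the paper’s actual data pipeline.

```python
# Toy sketch of mixing raw alt-text with re-aligned synthetic captions at a
# fixed ratio. Per-sample Bernoulli sampling is an assumed implementation.
import random

def choose_caption(alt_text: str, synthetic_caption: str,
                   synthetic_ratio: float = 0.15) -> str:
    """With probability `synthetic_ratio`, train on the re-aligned synthetic
    caption; otherwise keep the raw alt-text for this image."""
    return synthetic_caption if random.random() < synthetic_ratio else alt_text
```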

Conclusion

The Altogether paper provides a compelling blueprint for the future of dataset creation. It challenges the trend of relying solely on “black box” proprietary models to clean data.

By treating image captioning as a re-alignment task rather than a generation task, we can:

  1. Preserve Intelligence: Keep the specific entities and expert knowledge buried in metadata.
  2. Ensure Alignment: Use visual encoders to filter out noise and hallucinations.
  3. Scale Efficiently: Use lightweight decoders to process billions of images.

For students and researchers, this highlights an important lesson: data isn’t just about quantity. The structural relationship between your metadata (alt-text) and your signal (pixels) is a resource that can be exploited to build smarter, more accurate models.