Introduction
There is an old adage that says, “an image is worth a thousand words.” However, if you look at how we currently train Artificial Intelligence to understand images, the reality is much closer to “an image is worth a dozen words.”
State-of-the-art Vision-Language Models (VLMs)—the AI systems responsible for understanding photos and generating art—are largely trained on datasets scraped from the web. These datasets rely on “alt-text,” the short, often SEO-driven captions hidden in website code. While helpful, alt-text is rarely descriptive. It might say “Canon EOS R6” (camera metadata) or “Europe vacation” (location), but it rarely describes the visual scene, lighting, textures, or spatial relationships in detail.
This creates a ceiling for AI performance. If models are trained on vague, short text, they produce vague, short answers. They hallucinate details because they haven’t learned to be precise.
Enter ImageInWords (IIW), a new framework introduced by researchers from Google DeepMind, Google Research, and the University of Washington. Their goal was to shatter the “dozen words” barrier and curate hyper-detailed image descriptions—often exceeding 200 words—that capture every nuance of a scene. By combining human expertise with machine efficiency, they have created a dataset and a fine-tuning method that significantly outperforms existing benchmarks.
In this post, we will dissect how the ImageInWords framework works, why “human-in-the-loop” is the secret ingredient, and how this new data allows models to reconstruct images with startling fidelity.
The Problem: The “Data Diet” of VLMs
To understand why ImageInWords is necessary, we first need to look at the “diet” of current vision models.
Most VLMs consume massive datasets like LAION or COCO. While these are large, they are noisy. To fix this, "dense" captioning datasets such as DCI and DOCCI hire humans to write longer descriptions. However, these human-annotated datasets suffer from three main issues:
- Inconsistency: Without strict guidelines, one annotator might focus on colors while another focuses on actions.
- Fatigue: Writing a 200-word paragraph about a single image is mentally exhausting. Quality drops as annotators get tired.
- Hallucination: Surprisingly, even humans “hallucinate” in captions, assuming details that aren’t strictly visible (e.g., describing a person as “sad” when their face is neutral).
Purely model-generated captions are cheaper but suffer even worse from hallucinations and lack of grounding. The IIW researchers propose a middle ground: a human-in-the-loop framework where machines do the heavy lifting to start, and humans refine and polish the result.
The ImageInWords Framework
The core innovation of this paper isn’t just the dataset itself, but the process used to create it. The researchers designed a pipeline that breaks the daunting task of “describe everything” into manageable steps, using a method called Seeded Sequential Annotation.
1. The Power of Seeding
Starting from a blank page is hard. Editing a draft is easier. IIW uses a high-performing VLM to generate “seed” captions. These initial drafts might be imperfect, but they give human annotators a foundation to build upon. As the project progressed, the researchers actually used the data they collected to re-train the VLM, creating an active learning loop where the seeds got better and better over time.
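
To make the loop concrete, here is a minimal sketch of how a seed-then-refine pipeline with active learning could be wired together. The function and type names (`annotation_round`, `active_learning_loop`, and the callables standing in for the VLM, human annotators, and fine-tuning) are illustrative assumptions, not code from the IIW release.

```python
from typing import Callable, List, Tuple

# Hypothetical sketch of a seed-then-refine annotation loop with active
# learning. The callables below stand in for a real VLM captioner, human
# annotators, and a fine-tuning routine; none of these come from IIW's code.

Captioner = Callable[[str], str]            # image path -> seed caption
Refiner = Callable[[str, str], str]         # (image path, draft) -> refined draft
Trainer = Callable[[Captioner, List[Tuple[str, str]]], Captioner]

def annotation_round(captioner: Captioner, images: List[str],
                     refiners: List[Refiner]) -> List[Tuple[str, str]]:
    """One round: the VLM drafts a seed, then humans refine it sequentially."""
    refined = []
    for image in images:
        draft = captioner(image)               # machine seed
        for refine in refiners:                # sequential human passes
            draft = refine(image, draft)       # add detail, fix errors
        refined.append((image, draft))
    return refined

def active_learning_loop(captioner: Captioner, batches: List[List[str]],
                         refiners: List[Refiner], retrain: Trainer) -> Captioner:
    """Feed each batch of human-refined data back into the seed model."""
    dataset: List[Tuple[str, str]] = []
    for batch in batches:
        dataset += annotation_round(captioner, batch, refiners)
        captioner = retrain(captioner, dataset)  # better seeds for the next batch
    return captioner
```

The important design choice is that the captioner returned after each batch seeds the next one, which is how the drafts improve over time.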
2. Task 1: Object-Level Descriptions
Before describing the whole scene, the framework focuses attention on individual components. In Task 1, the system identifies salient objects using object detection.

As shown above, the interface presents the annotator with bounding boxes and generated descriptions for specific objects. The human's job is not yet to write a novel, but to verify and refine these "Lego blocks" of the image. They fix the bounding boxes, correct labels, and ensure that each object's attributes (color, texture, shape) are accurate.

In the augmented view (above), you can see how the human annotator has refined the data. They might merge two boxes (e.g., “tire” and “wheel”) or add missing details. This granular focus ensures that when the final description is written, no small object is overlooked.
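
As a rough sketch, a Task 1 record might look something like the data structure below. The field names and the box-merging helper are guesses for illustration, not the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Illustrative schema for a Task 1 object record; field names are guesses.

@dataclass
class ObjectAnnotation:
    label: str                                    # e.g. "wheel"
    box: Tuple[int, int, int, int]                # (x_min, y_min, x_max, y_max)
    description: str                              # attribute-rich text
    attributes: List[str] = field(default_factory=list)  # color, texture, shape

def merge_objects(a: ObjectAnnotation, b: ObjectAnnotation,
                  label: str) -> ObjectAnnotation:
    """Merge two overlapping detections (e.g. 'tire' + 'wheel') into one object."""
    box = (min(a.box[0], b.box[0]), min(a.box[1], b.box[1]),
           max(a.box[2], b.box[2]), max(a.box[3], b.box[3]))
    return ObjectAnnotation(
        label=label,
        box=box,
        description=f"{a.description} {b.description}".strip(),
        attributes=sorted(set(a.attributes + b.attributes)),
    )
```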
3. Task 2: Image-Level Synthesis
Once the objects are defined, Task 2 is about weaving them together. The annotator is given the refined object descriptions and a global seed caption. Their goal is to create a cohesive narrative.

The workflow illustrated in Figure 3 above demonstrates this progression: the system moves from isolated object data (left) to a comprehensive, flowing description (right).
To ensure high quality, the researchers employed Sequential Augmentation. Instead of asking one person to write the perfect description, the task is split across multiple rounds:
- Annotator A drafts the description.
- Annotator B sees the image and Annotator A’s draft, and adds missing details or fixes errors.
- Annotator C refines it further.
This acts like a peer-review system. It reduces the cognitive load on any single person and results in significantly richer text.

The data in Figure 2 (above) proves why this works. As annotation rounds progress, the token count (length/detail) increases, but the time spent per round decreases. Furthermore, the agreement between annotators (Jaccard Similarity) improves, indicating the descriptions are converging on a high-quality “truth.”
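
The Jaccard similarity mentioned here is easy to compute yourself. Below is a minimal sketch over word tokens; the authors' exact tokenization and normalization may differ.

```python
import re

# Minimal token-level Jaccard similarity; the authors' exact tokenization
# and normalization may differ from this sketch.

def tokens(text: str) -> set:
    """Lowercase word tokens, punctuation ignored."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def jaccard_similarity(draft_a: str, draft_b: str) -> float:
    """|A ∩ B| / |A ∪ B| over the two drafts' token sets."""
    a, b = tokens(draft_a), tokens(draft_b)
    return len(a & b) / len(a | b) if (a | b) else 1.0

# Agreement between two (made-up) consecutive annotation rounds.
round_2 = "A yellow pendant lamp with a gridded outer sphere hangs from the ceiling."
round_3 = "A warm yellow pendant lamp with a gridded outer sphere hangs from a white ceiling."
print(round(jaccard_similarity(round_2, round_3), 2))  # ~0.79
```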
4. Human-in-the-Loop Learning
An interesting side effect of this sequential process is that the human workers actually trained each other. By seeing the edits and additions made by previous annotators, the pool of workers implicitly learned the high standards of the project, improving their own initial drafts over time.

The Result: Richer Data
So, what does an ImageInWords description look like compared to previous datasets? It is significantly longer and more linguistically diverse.
While datasets like COCO or even DCI might average 15 to 100 words, IIW descriptions average over 200 tokens. They contain roughly 50% more nouns and verbs than the closest competitor.
The researchers also focused on “readability” and writing style. They didn’t just want a list of objects; they wanted a description that flowed naturally.

The charts above show various readability metrics (like Flesch-Kincaid grade level). The blue bars (IIW) consistently show a more “mature” writing style compared to DCI (orange) and DOCCI (green). This indicates that IIW descriptions read less like a robot’s list and more like professional prose.
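
For reference, the Flesch-Kincaid grade level can be approximated with a few lines of code. The syllable counter below is a crude vowel-group heuristic, so treat the output as illustrative rather than the exact numbers reported in the paper.

```python
import re

# Crude, self-contained Flesch-Kincaid grade estimator. Real readability
# libraries count syllables more carefully than this vowel-group heuristic.

def count_syllables(word: str) -> int:
    """Approximate syllables as runs of vowels (at least one per word)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text: str) -> float:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (len(words) / sentences) + 11.8 * (syllables / len(words)) - 15.59

caption = ("The fixture's outer sphere is adorned with a network of squares, "
           "bathed in a warm, yellowish glow that spills onto the ceiling.")
print(round(flesch_kincaid_grade(caption), 1))
```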
Guidelines for “Painting a Picture”
A key part of achieving this quality was the instruction given to annotators. They were told to write as if they were instructing a painter to reproduce the image without seeing it. This included specific attention to Camera Angles.

By explicitly labeling the perspective—whether it’s a “Dutch tilt,” “Bird’s eye view,” or “Worm’s eye view”—the descriptions provide critical spatial context that simple object lists ignore.
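
One simple way to make that perspective vocabulary explicit in an annotation tool is a small controlled enum, as sketched below; the label set shown is illustrative, not the project's official taxonomy.

```python
from enum import Enum

# Illustrative controlled vocabulary for camera perspective; the exact label
# set used by the IIW annotators may differ.

class CameraAngle(Enum):
    EYE_LEVEL = "eye-level shot"
    BIRDS_EYE = "bird's eye view"
    WORMS_EYE = "worm's eye view"
    DUTCH_TILT = "Dutch tilt"

def perspective_prefix(angle: CameraAngle) -> str:
    """Open the description with explicit spatial context for the 'painter'."""
    return f"The image is captured from a {angle.value}. "

print(perspective_prefix(CameraAngle.WORMS_EYE))
```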
Experiments: Does it Work?
Collecting data is expensive. The big question is whether this hyper-detailed data actually improves AI models. The researchers conducted extensive experiments to find out.
1. Human Side-by-Side (SxS) Evaluation
First, they simply asked humans to compare IIW descriptions against those from other datasets and GPT-4V. They rated them on comprehensiveness, specificity, and hallucinations.

The results (Table 2) were overwhelmingly in favor of IIW. On specificity and comprehensiveness, IIW beat the competitors by margins of over 60%. Even against GPT-4V (Table 3 below), IIW descriptions were preferred, highlighting that even powerful LLMs still miss visual nuances that human-guided frameworks catch.

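Conceptually, the SxS results boil down to a preference margin: the share of raters preferring IIW minus the share preferring the alternative. The sketch below shows that aggregation with made-up rating labels and counts, purely for illustration.

```python
from collections import Counter

# Toy aggregation of side-by-side (SxS) ratings into a preference margin.
# The rating labels and counts are made up for illustration.

def preference_margin(ratings) -> float:
    """Share of raters preferring IIW minus share preferring the alternative."""
    counts = Counter(ratings)
    total = sum(counts.values())
    return (counts["iiw_better"] - counts["other_better"]) / total

ratings = ["iiw_better"] * 70 + ["neutral"] * 20 + ["other_better"] * 10
print(f"{preference_margin(ratings):+.0%}")  # +60% on this toy sample
```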
2. Fine-Tuning Vision Models
The researchers fine-tuned a PaLI-3 5B model using the IIW dataset. They wanted to see if a model trained on this data would learn to “see” better.
To test this, they used the fine-tuned models to generate descriptions for images, and then checked whether those descriptions alone carried enough information to handle complex reasoning tasks.

Table 13 shows the results on reasoning benchmarks (like distinguishing “the person pours milk” from “the milk is poured”). The IIW-trained model (PaLI-3 + IIW) achieved the highest accuracy on challenging subsets like Winoground and Visual Genome Relations (VG-R), outperforming larger models like InstructBLIP and LLaVA. This proves that training on detailed descriptions teaches the model to understand relationships and attributes, not just identify objects.
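
To see why detailed descriptions help on these benchmarks, consider a toy version of the evaluation: first generate a description of the image, then score each candidate statement against it. The real protocol is more sophisticated than the word-order heuristic below, which is only meant to show the structure of caption-mediated reasoning.

```python
import re
from itertools import combinations

# Toy caption-mediated reasoning: generate a detailed description first, then
# score each candidate statement against it. The real evaluation is more
# sophisticated; this word-order heuristic only illustrates the task structure.

STOP_WORDS = {"the", "a", "an", "is", "are"}

def ordered_pairs(text: str) -> set:
    """All ordered pairs of content words, preserving their left-to-right order."""
    toks = [t for t in re.findall(r"[a-z']+", text.lower()) if t not in STOP_WORDS]
    return set(combinations(toks, 2))

def pick_statement(description: str, candidates: list) -> str:
    """Choose the candidate whose word order best agrees with the description."""
    desc = ordered_pairs(description)
    return max(candidates, key=lambda c: len(ordered_pairs(c) & desc))

description = ("A person in a blue apron tilts a carton and pours milk "
               "into a glass standing on the counter.")
candidates = ["the person pours the milk", "the milk pours the person"]
print(pick_statement(description, candidates))  # "the person pours the milk"
```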
3. The Ultimate Test: Text-to-Image Reconstruction
Perhaps the most compelling experiment was Text-to-Image (T2I) reconstruction. If a description is truly “hyper-detailed,” you should be able to feed it into an image generator (like Imagen or DALL-E) and get an image that looks almost exactly like the original photograph.
The researchers took images, generated descriptions using models trained on DCI, DOCCI, and IIW, and then fed those text descriptions into a T2I model. Humans then ranked which generated image looked most like the original.

In the example above, look at the “Original Image” of the yellow lamp.
- DCI’s description (“A medium-close-up view…”) results in a generic yellow lamp.
- IIW’s description is much richer: “The fixture’s outer sphere is adorned with a network of squares… bathed in a warm, yellowish glow.”
- The resulting image from the IIW prompt (Ranked 1st) captures the specific grid texture and lighting mood that the others miss.

The statistics back this up. Figure 18(b) shows that regardless of the sentence length, images generated from IIW descriptions (yellow bars) consistently achieved “Rank 1” (most similar to original) compared to DCI and DOCCI. This confirms that the IIW framework captures the salient visual information—the stuff that actually makes the image look like itself.
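
The evaluation loop behind these rank-1 statistics can be sketched in a few lines. The `describers`, `generate_image`, and `human_rank` callables are placeholders for captioning models, a text-to-image model, and the human rating interface; none of them are real APIs.

```python
from typing import Callable, Dict, List

# Sketch of the reconstruction evaluation loop behind the rank-1 statistics.
# `describers`, `generate_image`, and `human_rank` are placeholders for
# captioning models, a text-to-image model, and a human rating interface.

def reconstruction_eval(originals: List[str],
                        describers: Dict[str, Callable[[str], str]],
                        generate_image: Callable[[str], str],
                        human_rank: Callable[[str, Dict[str, str]], List[str]],
                        ) -> Dict[str, int]:
    """Count how often each description source yields the most faithful image."""
    rank1_counts = {name: 0 for name in describers}
    for original in originals:
        reconstructions = {name: generate_image(describe(original))
                           for name, describe in describers.items()}
        ranking = human_rank(original, reconstructions)  # most similar first
        rank1_counts[ranking[0]] += 1
    return rank1_counts
```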
Fine-Tuning Tasks Overview
To achieve these results, the models weren’t just trained to “predict the next word.” The researchers employed a variety of fine-tuning tasks derived from their rich annotation data.

As shown in Figure 13, the model is trained on a mix of:
- Region Tasks: Predicting bounding boxes for descriptions.
- Salient Object Tasks: Listing all important items in a scene.
- Description Tasks: Generating the full, hyper-detailed paragraph.
This multi-task approach ensures the model stays grounded (knows where things are) while learning to be expressive (knows how to describe them).
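
A plausible way to assemble such a multi-task mixture from one annotated image record is sketched below. The prompt templates, mixing weights, and record schema are illustrative guesses, not the configuration used to fine-tune PaLI-3.

```python
import random

# Illustrative multi-task mixture builder. The prompt templates, mixing
# weights, and record schema are guesses, not the PaLI-3 fine-tuning config.

TASK_TEMPLATES = {
    "region":      "Describe the object in region {box}.",
    "salient":     "List the salient objects in this image.",
    "description": "Describe this image in as much detail as possible.",
}
MIX_WEIGHTS = {"region": 0.3, "salient": 0.2, "description": 0.5}

def sample_training_example(record: dict) -> dict:
    """Turn one annotated image record into an (image, prompt, target) triple."""
    task = random.choices(list(MIX_WEIGHTS), weights=list(MIX_WEIGHTS.values()))[0]
    if task == "region":
        obj = random.choice(record["objects"])
        prompt = TASK_TEMPLATES["region"].format(box=obj["box"])
        target = obj["description"]
    elif task == "salient":
        prompt = TASK_TEMPLATES["salient"]
        target = ", ".join(obj["label"] for obj in record["objects"])
    else:
        prompt = TASK_TEMPLATES["description"]
        target = record["detailed_description"]
    return {"image": record["image"], "prompt": prompt, "target": target}
```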
Conclusion and Implications
ImageInWords represents a shift in how we think about vision-language data. For years, the field has chased quantity—scraping billions of noisy image-text pairs from the web. This paper argues for quality.
By designing a rigorous, human-in-the-loop framework, the researchers created a dataset where the text is finally as rich as the pixels it describes.
- The annotations are more comprehensive and specific.
- Models trained on this data reason better about visual relationships.
- The descriptions allow for high-fidelity image reconstruction.
While the dataset itself (around 9k images) is small compared to web-scale databases, its impact on fine-tuning suggests that a small amount of “gold standard” data can be more valuable than mountains of noise. As VLMs continue to evolve, frameworks like ImageInWords will be essential to teach AI not just to label an image, but to truly understand and describe the world it sees.
For students and researchers, IIW provides a new benchmark (IIW-Eval) and a blueprint for how to curate high-quality multimodal data. It reminds us that in the age of automation, human guidance remains the key to unlocking the next level of AI capability.