Introduction
In the rapidly evolving landscape of Artificial Intelligence, Vision-Language Models (VLMs) have become ubiquitous. Models like GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet can describe complex images, interpret charts, and answer questions about the visual world with startling accuracy. However, these proprietary models are “walled gardens.” We interact with them via APIs, but we don’t know exactly how they were built or what data they were trained on.
This lack of transparency has created a significant hurdle for the open-source community. To compete, many “open” models rely on a technique called distillation. Essentially, researchers feed images into a proprietary model (like GPT-4V), ask it to generate detailed captions, and then use that synthetic data to train their own smaller models. While effective, this creates a dependency loop: open models are merely shadows of proprietary ones, learning to mimic their outputs rather than learning from foundational data. As a result, the scientific community has been missing a crucial piece of the puzzle: How do we build a state-of-the-art VLM from scratch, using only open data?
Enter Molmo and PixMo.
In a new paper from the Allen Institute for AI (AI2), researchers introduce a family of open VLMs (Molmo) and, perhaps more importantly, a massive collection of open datasets (PixMo) collected without the help of proprietary models. This blog post dives deep into how they achieved this, the innovative data collection techniques they used, and the architectural choices that allow Molmo to rival, and in several cases outperform, leading proprietary models.

The Core Problem: The Distillation Trap
To understand why Molmo is significant, we first need to understand the status quo. As shown in the comparison chart above, most top-tier models are fully closed. The “Open Weights” models that do exist often rely on “Distilled” data.
Distillation is akin to a student copying the homework of the smartest kid in class. The student (the open model) might get good grades (benchmarks) by mimicking the answers, but they haven’t necessarily learned the underlying principles. If the smartest kid (the proprietary model) hallucinates or has a bias, the student copies that too.
The researchers behind Molmo set out to prove that high-quality, human-annotated data—collected efficiently—can outperform synthetic data distilled from proprietary giants.
Part 1: PixMo — Data is the Differentiator
The paper argues that the secret sauce of performant VLMs isn’t necessarily a magical new architecture, but rather high-quality multimodal data. They introduce PixMo (Pixels for Molmo), a suite of datasets designed to teach models distinct visual skills.

The team focused on two main stages of training: Pre-training (teaching the model to “see” and describe) and Fine-tuning (teaching the model to follow instructions and interact).
1. PixMo-Cap: The “Speech” Trick for Dense Captions
For pre-training, VLMs need millions of image-text pairs. The standard approach is to scrape the web for images and their alt-text. However, web alt-text is often short, noisy, or irrelevant.
The researchers needed “dense captions”—paragraphs that describe every detail of an image. But asking human annotators to type 200-word descriptions is slow, expensive, and mentally draining. Annotators often get bored and write short descriptions that cover only the most salient objects.
The Solution: They asked annotators to speak instead of type. Annotators recorded themselves describing images for 60 to 90 seconds. Speaking is faster than typing and naturally leads to more descriptive, stream-of-consciousness detail. These audio recordings were transcribed and then polished by a language-only LLM (which is allowed, as it’s not a VLM) to create the final caption.

This method resulted in PixMo-Cap, a dataset of 712,000 images with highly detailed captions, collected efficiently without relying on GPT-4V.
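To make the pipeline concrete, here is a minimal sketch of the speech-first captioning flow, assuming the open-source `whisper` package for transcription and a simple stand-in for the LLM polishing step; it illustrates the idea rather than reproducing the authors' actual tooling.

```python
# Sketch of the speech-first captioning pipeline (illustrative, not AI2's code).
import whisper  # pip install openai-whisper

FILLERS = {"um", "uh", "erm"}  # crude filler list for the stand-in polish step

def transcribe_spoken_caption(audio_path: str) -> str:
    """Turn a 60-90 second spoken description into raw text."""
    model = whisper.load_model("base")
    return model.transcribe(audio_path)["text"]

def polish_transcript(raw: str) -> str:
    """Stand-in for the real polishing step, where a language-only LLM rewrites
    the transcript as a fluent dense caption. Here we just strip filler words."""
    return " ".join(w for w in raw.split() if w.lower() not in FILLERS)

def build_dense_caption(audio_path: str) -> str:
    return polish_transcript(transcribe_spoken_caption(audio_path))
```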
2. PixMo-AskModelAnything: Human-in-the-Loop QA
For fine-tuning, the model needs to answer questions. The researchers developed a pipeline where human annotators worked alongside a language-only LLM.
- An annotator selects an image and asks a question.
- The system provides OCR (text recognized in the image) and the dense caption to a text-only LLM.
- The LLM suggests an answer.
- The human accepts, rejects, or edits the answer.
This creates high-quality Question-Answer (QA) pairs that are conversational and accurate; a minimal sketch of the loop appears below.
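In the sketch, `ask_text_llm` is a hypothetical placeholder for any language-only LLM call, and the human review is simulated with console input; it illustrates the workflow, not the authors' annotation tool.

```python
# Human-in-the-loop QA sketch (illustrative; not the authors' annotation tool).

def ask_text_llm(prompt: str) -> str:
    """Hypothetical hook: plug in any text-only LLM API here."""
    return "(draft answer from the LLM)"

def draft_answer(question: str, dense_caption: str, ocr_text: str) -> str:
    # The LLM never sees the pixels -- only the dense caption and OCR transcript.
    prompt = (
        "Answer a question about an image you cannot see.\n"
        f"Dense caption: {dense_caption}\n"
        f"Text visible in the image (OCR): {ocr_text}\n"
        f"Question: {question}\nAnswer:"
    )
    return ask_text_llm(prompt)

def review(question: str, dense_caption: str, ocr_text: str) -> dict | None:
    answer = draft_answer(question, dense_caption, ocr_text)
    verdict = input(f"Draft: {answer}\n[a]ccept / [e]dit / [r]eject? ").strip().lower()
    if verdict == "r":
        return None                      # discard the pair entirely
    if verdict == "e":
        answer = input("Corrected answer: ")
    return {"question": question, "answer": answer}
```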

3. PixMo-Points: Grounding Language in Pixels
One of the most exciting contributions of this paper is the focus on pointing. Most object detection datasets use bounding boxes (drawing a rectangle around an object). Bounding boxes are slow to draw.
The Solution: Just click. The researchers collected PixMo-Points, where annotators simply pointed (clicked) on objects described in text. This is much faster, allowing them to collect 2.3 million annotations.

Why is this important? It allows the model to “ground” its answers. If you ask, “How many people are in this photo?”, the model doesn’t just guess a number; it internally “points” to each person to count them.
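The data format itself is almost trivially simple. The sketch below contrasts a bounding-box label with a point label; the field names and the percentage-based coordinates are assumptions for illustration, not the exact PixMo-Points schema.

```python
# Toy comparison of annotation formats (illustrative; exact PixMo schema may differ).
from dataclasses import dataclass

@dataclass
class BoxAnnotation:      # traditional detection label: two carefully placed corners
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float

@dataclass
class PointAnnotation:    # PixMo-Points style: a single click
    label: str
    x: float              # e.g. % of image width
    y: float              # e.g. % of image height

# One click per object is enough to ground "How many people are in this photo?"
people = [PointAnnotation("person", 12.5, 40.0),
          PointAnnotation("person", 55.0, 38.2),
          PointAnnotation("person", 81.3, 42.7)]
print(len(people))  # -> 3
```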

4. Synthetic Skills
Finally, to cover areas where human annotation is too hard or rare, they generated synthetic data using code rendering (a toy example of this idea follows the list). This included:
- PixMo-Clocks: 826k images of clocks to teach time-reading.
- PixMo-Docs: Rendered charts, tables, and documents to teach OCR and data analysis.
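To give a flavour of what “code rendering” means in practice, here is a toy clock generator using Pillow; it is a simplified stand-in for the kind of script that produces PixMo-Clocks-style training pairs, not the authors' actual pipeline.

```python
# Toy synthetic clock renderer (illustrative only).
import math, random
from PIL import Image, ImageDraw

def render_clock(size: int = 224) -> tuple[Image.Image, str]:
    hour, minute = random.randint(0, 11), random.randint(0, 59)
    img = Image.new("RGB", (size, size), "white")
    draw = ImageDraw.Draw(img)
    cx = cy = size // 2
    draw.ellipse([4, 4, size - 4, size - 4], outline="black", width=3)

    def hand(angle_deg: float, length: float, width: int):
        rad = math.radians(angle_deg - 90)                     # 12 o'clock points up
        draw.line([cx, cy, cx + length * math.cos(rad),
                   cy + length * math.sin(rad)], fill="black", width=width)

    hand(30 * hour + minute / 2, 0.30 * size, 5)               # hour hand
    hand(6 * minute, 0.42 * size, 3)                           # minute hand
    return img, f"{hour or 12}:{minute:02d}"                   # ground-truth label

image, label = render_clock()
image.save(f"clock_{label.replace(':', '')}.png")              # training pair: (image, label)
```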

Part 2: The Molmo Architecture
While the data is the star of the show, the architecture (how the model is built) is the engine. Molmo follows a standard but highly optimized design.

The architecture consists of four main parts (a minimal sketch follows the list):
- Pre-processor: Handles image cropping and resizing.
- Vision Encoder: A Vision Transformer (ViT) that turns images into mathematical features. They used OpenAI’s CLIP ViT-L/14 (and also experimented with fully open encoders like MetaCLIP).
- Connector: An MLP (Multi-Layer Perceptron) that projects visual features into the language model’s space.
- LLM: A decoder-only language model (like Qwen2 or OLMo) that generates the text response.
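The sketch below wires the encoder, connector, and LLM together in PyTorch (the pre-processor's output is represented by the pre-cut patches). All dimensions and module choices are toy assumptions, not Molmo's real configuration; the point is simply that visual tokens are projected and concatenated with text tokens before the decoder reads them.

```python
# Schematic of the four-part design (toy dimensions, not Molmo's configuration).
import torch
import torch.nn as nn

class ToyVLM(nn.Module):
    def __init__(self, patch_dim=14 * 14 * 3, vit_dim=256, llm_dim=512, vocab=32000):
        super().__init__()
        self.vision_encoder = nn.Linear(patch_dim, vit_dim)        # stand-in for the ViT
        self.connector = nn.Sequential(                            # MLP projector
            nn.Linear(vit_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim))
        self.llm_block = nn.TransformerEncoderLayer(               # stand-in for one decoder-only
            d_model=llm_dim, nhead=8, batch_first=True)            # LLM block (causal mask omitted)
        self.lm_head = nn.Linear(llm_dim, vocab)

    def forward(self, image_patches, text_embeddings):
        vision_tokens = self.connector(self.vision_encoder(image_patches))
        # Visual tokens are prepended to the text tokens and read like ordinary words.
        sequence = torch.cat([vision_tokens, text_embeddings], dim=1)
        return self.lm_head(self.llm_block(sequence))

model = ToyVLM()
patches = torch.randn(1, 576, 14 * 14 * 3)   # e.g. a 24x24 grid of 14-pixel patches
text = torch.randn(1, 32, 512)               # already-embedded prompt tokens
logits = model(patches, text)                # shape: (1, 608, 32000)
```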
The Innovation: Multiscale & Overlapping Crops
A major challenge in VLMs is resolution. If you shrink a high-res photo of a document to \(336 \times 336\) pixels, the text becomes unreadable. To solve this, researchers usually crop the image into tiles.
However, standard tiling has a flaw: context fragmentation. If a word or object is split perfectly down the middle by a crop line, the model might fail to recognize it because the separate tiles don’t “talk” to each other effectively at the borders.
The Solution: Overlapping Crops. Molmo uses a multi-crop strategy where the tiles overlap slightly. This ensures that every pixel in the image appears in at least one crop with sufficient surrounding context.
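A small sketch of how such an overlapping layout can be computed: the stride between tiles is made smaller than the tile size, so neighbouring crops share a margin of pixels. The crop size and overlap values here are illustrative, not Molmo's exact settings.

```python
# Overlapping-crop layout sketch (illustrative crop/overlap values).
def overlapping_crops(width: int, height: int, crop: int = 336, overlap: int = 56):
    stride = crop - overlap                      # step smaller than the crop => shared borders
    boxes = []
    for top in range(0, max(height - overlap, 1), stride):
        for left in range(0, max(width - overlap, 1), stride):
            l = min(left, max(width - crop, 0))  # clamp so crops never leave the image
            t = min(top, max(height - crop, 0))
            boxes.append((l, t, l + crop, t + crop))
    return boxes

for box in overlapping_crops(1000, 800):
    print(box)   # each pixel falls inside at least one 336x336 tile, with shared borders
```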

As visualized below, the image is turned into a sequence of tokens. The LLM reads the “patches” of the image just like it reads words in a sentence.

Part 3: The Training Pipeline
The training process for Molmo is refreshingly simple compared to many competitors. It avoids complex multi-stage pipelines in favor of two distinct phases.
Phase 1: Pre-training (Learning to See)
The model is trained on the PixMo-Cap dataset (the speech-based dense captions). The goal here is simple: given an image, generate the caption.
The “Length Hint” Innovation: The researchers found that simply training on captions wasn’t enough. Sometimes you want a short caption; sometimes you want a novel. To give the model control, they added a “length hint” to the training prompt. They tell the model roughly how many characters the caption should be.

This simple trick prevents the model from rambling when it shouldn’t, or being too terse when detail is needed.
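A sketch of what length conditioning can look like at training time; the exact prompt wording and the unit of the hint used in the paper are assumptions here. The point is only that the target length is derived from the ground-truth caption and fed to the model as part of the prompt.

```python
# Length-hinted captioning prompt (hypothetical template, not the paper's exact format).
def caption_prompt(length_hint: int) -> str:
    return f"Describe this image in about {length_hint} characters:"

caption = ("A wooden desk holds an open laptop, two spiral notebooks, "
           "and a half-finished cup of coffee near the window.")
prompt = caption_prompt(len(caption))   # the hint comes from the ground-truth caption itself
print(prompt, caption, sep="\n")        # the model learns: bigger hint -> more detail
```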
Phase 2: Instruction Fine-tuning (Learning to Behave)
After pre-training, the model understands images but isn’t necessarily helpful. Fine-tuning trains it to follow instructions (“Count the apples,” “Read this chart”).
They mix the PixMo datasets with established academic datasets. The chart below shows the data mixture. Note the heavy reliance on their new PixMo data (Green and Blue segments) compared to traditional academic datasets (Purple).

Experiments and Results
So, does building from scratch with open data actually work? The results suggest a resounding “Yes.”
Academic Benchmarks
The researchers evaluated Molmo on 11 diverse benchmarks, ranging from document reading (DocVQA) to general visual questions (VQA v2.0) and math (MathVista).
Key Takeaways from the Leaderboard:
- Molmo-72B (based on Qwen2-72B) outperforms open-weight competitors like LLaVA OneVision and Qwen2-VL.
- More impressively, it outperforms proprietary giants like Claude 3.5 Sonnet and Gemini 1.5 Pro on average academic scores.
- It sits just behind GPT-4o, effectively closing the gap between open and closed AI.
Human Evaluation
Academic benchmarks can be gamed. The ultimate test is human preference. The team ran a “Chatbot Arena” style evaluation where humans compared Molmo’s answers against other models side-by-side.

The Elo ratings (a ranking system used in chess and video games) derived from this study placed Molmo-72B second overall, sandwiched between GPT-4o and Gemini 1.5 Pro.
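For readers unfamiliar with Elo, the standard formulation (stated here generically, not as the study's exact setup) computes an expected win probability for each pairwise comparison and then nudges the ratings toward the observed outcome:

\[
E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}, \qquad R_A' = R_A + K\,(S_A - E_A),
\]

where \(S_A\) is 1 if model A wins, 0.5 for a tie, and 0 for a loss, and \(K\) controls how quickly ratings move.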

The Power of Pointing
One area where Molmo truly dominates is counting. Because it was trained on the PixMo-Points dataset, it handles counting questions differently. Instead of guessing a number (“There are 5 people”), it generates a point for every person it finds and then sums them up. This “Chain-of-Thought” via pointing drastically reduces hallucinations.
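In practice this means the model's answer carries its own evidence: each counted object corresponds to an explicit location that can be parsed and verified. The snippet below shows the idea with a hypothetical XML-like point markup; the real model's output syntax may differ.

```python
# Counting by parsing point-grounded output (hypothetical markup format).
import re

response = ('Counting the people: <point x="12.5" y="40.0"/> '
            '<point x="55.0" y="38.2"/> <point x="81.3" y="42.7"/>')

points = re.findall(r'<point x="([\d.]+)" y="([\d.]+)"\s*/>', response)
print(f"There are {len(points)} people.")   # the count is backed by explicit coordinates
```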

Why It Works: Ablation Studies
The paper includes extensive “ablation studies”—experiments where they remove one piece of the system to see how important it is. These studies provided a crucial insight: Caption Quality is King.
The researchers found a strong correlation between the quality of the captions used in pre-training and the model’s performance on downstream tasks (like answering questions about charts).

This validates their investment in the “speech-to-text” captioning pipeline. By focusing on getting the best possible description of an image, they built a foundation that transferred to every other visual task.
Conclusion: The Future is Open
The Molmo paper is a landmark for open-source AI. It demonstrates that the community does not need to rely on the “crumbs” of proprietary models (distillation) to compete. By innovating in data collection—specifically through speech-based captioning and point-based grounding—the authors created a model that rivals the best in the world.
Key Takeaways for Students and Researchers:
- Data > Architecture: The improvements in Molmo came largely from PixMo (the data), not complex architectural changes.
- Avoid Distillation: You can build better models by collecting your own high-quality human data rather than mimicking GPT-4.
- Simplicity Wins: A standard ViT + LLM architecture, when trained correctly, is powerful.
- Pointing is Powerful: Grounding language in pixels (x,y coordinates) is a highly effective way to improve spatial reasoning and counting.
For those interested in exploring further, the team has released the model weights, the code, and crucially, the PixMo datasets. This allows the scientific community to finally inspect, understand, and improve upon the foundations of state-of-the-art vision-language modeling.