Introduction: The Dilemma of “All-in-One” AI
In the rapidly evolving world of Artificial Intelligence, there is a massive race to build the ultimate “All-in-One” model. We’ve seen Multimodal Large Language Models (MLLMs) like GPT-4 and LLaVA that can see, read, and reason. We’ve seen Generative models like Stable Diffusion and Sora that can create breathtaking images and videos.
Naturally, the industry’s instinct has been to smash these capabilities together into a single, gigantic neural network—a “Jack-of-All-Trades” that can generate, edit, segment, and answer questions all at once. While models like Emu3 and Omni-Gen have made strides here, they face a significant hurdle: conflicting objectives. Training a model to strictly reason about an image (understanding) often conflicts with training it to dream up new pixels (generation). Furthermore, these massive omni-models are incredibly expensive to train and difficult to scale. If a better image generator comes out next month, you have to retrain your entire massive model to use it.
Enter Olympus.

Proposed by researchers from the University of Oxford and Microsoft, Olympus takes a fundamentally different approach. Instead of trying to be the tool for every job, Olympus acts as the Universal Task Router. It functions like a master conductor or a project manager. It uses the reasoning power of an MLLM to understand what you want, and then it intelligently delegates the heavy lifting to specialized “expert” models.
Whether you want to generate a 3D model of a dragon, edit a video of a fire, or simply ask what a person is wearing, Olympus handles the workflow. As shown in Figure 1 above, it covers over 20 different computer vision tasks, seamlessly bridging the gap between understanding and generation.
In this deep dive, we will explore how Olympus works, the clever “routing token” mechanism that drives it, and why this modular approach might be the future of scalable AI.
Background: Standing on the Shoulders of Giants
To understand why Olympus is such a clever solution, we need to look at the current landscape of Multimodal AI.
The Two Worlds of Vision AI
Currently, vision tasks are generally split into two camps:
- Vision-Language Understanding (e.g., Visual Question Answering): These are models like LLaVA or MiniGPT-4. They take an image and text as input and output text. They excel at answering questions like “What color is the car?” or “Explain this meme.”
- Vision Generation & Editing: These are diffusion models (like Stable Diffusion) or GANs. They take text (or images) as input and output new pixels. They excel at “Draw a cyberpunk city” or “Make this photo look like a painting.”
The “All-in-One” Struggle
Recent works have tried to unify these by training massive transformers that predict “next tokens” regardless of whether those tokens represent text or image patches. While promising, this approach is computationally demanding. For instance, Omni-Gen requires over 100 high-end GPUs and complex multi-stage training.
The “Tool-Use” Predecessors
The idea of using an LLM to call external tools isn’t entirely new. HuggingGPT was a pioneer in this space, using ChatGPT to interpret user prompts and call models from the Hugging Face hub. However, HuggingGPT relied on “prompt engineering”—essentially asking ChatGPT nicely to pick a tool. It wasn’t trained to be a router, making it prone to errors and harder to control for complex, multi-step visual tasks.
Olympus improves upon this by fine-tuning the MLLM specifically for routing. It doesn’t just guess; it learns a specialized vocabulary to control expert models with high precision.
The Core Method: Olympus as a Controller
The philosophy behind Olympus is simple: Let the MLLM do what it does best (contextual understanding) and let specialized models do what they do best (pixel-level manipulation).
The Architecture
The Olympus framework uses a Multimodal Large Language Model (specifically based on architectures like Mipha or LLaVA) as the central brain. When a user provides an instruction and an image, Olympus analyzes the request to determine the nature of the task.

As illustrated in Figure 3, the workflow splits into two paths:
- Direct Solving (Internal): If the user asks a question like “What is the animal doing?”, the MLLM uses its own weights to generate a text response. No external tools are needed.
- Routing (External): If the user asks to “Add books near the dog,” the MLLM recognizes this as an editing task. Instead of trying to manipulate the pixels itself, it generates a response containing Task-Specific Routing Tokens and a refined prompt.
This output is then parsed, and the appropriate expert model (in this case, an image editing model like InstructPix2Pix) is triggered to perform the action.
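To make that hand-off concrete, here is a minimal sketch of the two-path logic in Python. The token format follows the examples above, but the `mllm` and `experts` objects are illustrative placeholders, not Olympus’s actual interface.

```python
import re

# A routed response embeds a task token, e.g. "<image_edit>add books near the dog</image_edit>";
# a direct answer is plain text. The `mllm` and `experts` objects are illustrative placeholders.
ROUTING_TOKEN = re.compile(r"<(?P<task>\w+)>(?P<prompt>.*?)</(?P=task)>", re.DOTALL)

def handle_request(mllm, experts, image, instruction):
    response = mllm.generate(image=image, text=instruction)
    match = ROUTING_TOKEN.search(response)
    if match is None:
        # Direct solving: the MLLM answers with its own weights (e.g. VQA).
        return response
    # Routing: hand the refined prompt to the matching expert model.
    expert = experts[match.group("task")]          # e.g. experts["image_edit"]
    return expert.run(image=image, prompt=match.group("prompt"))
```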
Task-Specific Routing Tokens
This is the “secret sauce” of Olympus. The researchers expanded the vocabulary of the MLLM with specific tokens representing different capabilities.

Looking at the table above, you can see how granular these controls are. There isn’t just a generic “do vision” token. There are specific tokens for:
- <image_gen> for creating images.
- <image_edit> for modifying existing ones.
- <3D_gen_text> for creating 3D assets from text.
- <video_ref_seg> for segmenting objects in video.
By training the model to output these specific XML-like tags, Olympus ensures that the hand-off between the “brain” (MLLM) and the “hands” (Expert Models) is unambiguous.
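Mechanically, expanding an MLLM’s vocabulary like this is straightforward. Here is a rough sketch using Hugging Face transformers, with Phi-2 standing in for the full multimodal model (Olympus itself builds on Mipha/LLaVA-style MLLMs) and only a handful of the 20+ routing tokens shown:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Stand-in language backbone for illustration; Olympus builds on
# Mipha/LLaVA-style MLLMs rather than this exact checkpoint.
BASE = "microsoft/phi-2"

# A handful of the task-specific routing tokens (the real set covers 20+ tasks).
ROUTING_TOKENS = [
    "<image_gen>", "</image_gen>",
    "<image_edit>", "</image_edit>",
    "<3D_gen_text>", "</3D_gen_text>",
    "<video_ref_seg>", "</video_ref_seg>",
]

tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE)

# Register the tokens so the tokenizer never splits them apart, then grow the
# embedding matrix so each token gets its own learnable embedding.
tokenizer.add_tokens(ROUTING_TOKENS, special_tokens=True)
model.resize_token_embeddings(len(tokenizer))
```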
The Expert Squad
Olympus doesn’t reinvent the wheel for every task. It leverages the best-in-class open-source models available today.

As shown in Table 9, Olympus utilizes industry standards like Stable Diffusion XL for generation, ControlNet for guided creation, and Wonder3D for 3D modeling. This modular design means that if a better version of Stable Diffusion is released tomorrow, Olympus can utilize it immediately without needing to be re-trained. It just needs to route the request to the new model.
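Conceptually, the whole expert squad is just a lookup table keyed by routing token. The sketch below pairs tokens with the public checkpoints mentioned in this post; the identifiers are illustrative, not the paper’s exact configuration.

```python
# Routing token -> expert checkpoint. Pairings follow the models named in this
# post (SDXL, InstructPix2Pix, ControlNet, Wonder3D); the identifiers are
# illustrative public checkpoint names, not the paper's exact configuration.
EXPERTS = {
    "image_gen":     "stabilityai/stable-diffusion-xl-base-1.0",
    "image_edit":    "timbrooks/instruct-pix2pix",
    "pose_to_image": "lllyasviel/control_v11p_sd15_openpose",
    "3D_gen_text":   "wonder3d",   # placeholder identifier for the 3D expert
}

# Upgrading an expert is a one-line change; the router never needs re-training
# because it only ever emits the routing token, not the model name.
EXPERTS["image_gen"] = "next-gen-image-model"   # hypothetical swap
```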
Training the Router: The OlympusInstruct Dataset
You can’t just tell a standard LLM to output <image_edit> tokens and expect it to work. It needs to be trained. Since no dataset existed that combined visual conversation with this specific type of tool routing, the researchers built their own: OlympusInstruct.
They utilized GPT-4o to generate a massive set of instruction-response pairs. They carefully designed prompts to ensure diversity in language style, tone, and complexity.

Figure 4 shows how they prompted GPT-4o. They asked for variations in phrasing (“Is there a way to…”, “I would appreciate if you could…”) and complexity (short, moderate, extended).
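Each training example pairs a natural-language instruction with a response that embeds the routing token and a refined prompt. Here is a made-up record in the spirit of OlympusInstruct, not an actual entry from the dataset:

```python
# A made-up sample in the spirit of OlympusInstruct (not an actual entry).
sample = {
    "image": "dog_on_sofa.jpg",
    "instruction": "Is there a way to add a few books next to the dog?",
    "response": "<image_edit>add a few books next to the dog</image_edit>",
}
```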
The resulting dataset is substantial:

As seen in the statistics above, the dataset contains over 446,000 training samples. Notice the large bar for “Chain-of-Action”—we’ll discuss that next, as it’s one of Olympus’s most powerful features.
The training process itself is a standard next-token prediction task, fine-tuning the MLLM to predict the correct routing tokens and refined prompts based on the visual and textual input.
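Put differently, this is ordinary supervised fine-tuning: the instruction (plus image features) forms the context, and the loss is computed only on the response tokens, which now include the routing tokens. A minimal sketch of that label masking, assuming a tokenizer extended as in the earlier snippet (the base checkpoint is again a stand-in):

```python
import torch
from transformers import AutoTokenizer

# Tokenizer extended with routing tokens, as in the earlier snippet
# (the checkpoint is an illustrative stand-in, not the paper's base model).
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
tokenizer.add_tokens(["<image_edit>", "</image_edit>"], special_tokens=True)

def build_labels(instruction, response):
    """Standard next-token objective: supervise only the response
    (routing token + refined prompt), mask the instruction with -100."""
    prompt_ids = tokenizer(instruction, add_special_tokens=False).input_ids
    target_ids = tokenizer(response, add_special_tokens=False).input_ids
    input_ids = torch.tensor(prompt_ids + target_ids)
    labels = torch.tensor([-100] * len(prompt_ids) + target_ids)
    return input_ids, labels

# Hypothetical sample; in the real pipeline, image features are also prepended.
ids, labels = build_labels(
    "USER: Add books near the dog. ASSISTANT: ",
    "<image_edit>add a few books near the dog</image_edit>",
)
```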

Chain-of-Action: The Power of Sequencing
Real-world requests are rarely simple. A user might say, “Generate a picture of a castle, and then turn it into a winter scene.” A standard model might struggle to do both in one shot.
Olympus introduces Chain-of-Action capabilities. Because it uses routing tokens, it can string them together.
If a user says: “Generate a majestic castle based on this pose, and then add green trees.”
Olympus predicts:
<pose_to_image>a majestic castle</pose_to_image> THEN <image_edit>adding green trees</image_edit>
It executes the first model, takes the result, feeds it into the second model, and returns the final output.
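Here is a rough sketch of how such a chain could be executed, reusing the token pattern from the routing sketch above; the THEN separator and the `expert.run(...)` interface are assumptions based on the example, not the paper’s actual executor.

```python
import re

ROUTING_TOKEN = re.compile(r"<(?P<task>\w+)>(?P<prompt>.*?)</(?P=task)>", re.DOTALL)

def run_chain(response, experts, image=None):
    """Execute routed steps left to right, feeding each output into the next."""
    result = image
    for step in ROUTING_TOKEN.finditer(response):
        expert = experts[step.group("task")]        # e.g. experts["image_edit"]
        result = expert.run(prompt=step.group("prompt"), image=result)
    return result

# e.g. run_chain("<pose_to_image>a majestic castle</pose_to_image> THEN "
#                "<image_edit>adding green trees</image_edit>", experts, pose_image)
```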

The dataset includes over 64,000 examples of these chained actions, with some instructions involving up to 5 sequential tasks. This turns Olympus from a simple router into a complex workflow automation tool.
Experiments and Results
Does this modular approach actually work better than previous methods? The researchers put Olympus to the test against HuggingGPT and standard MLLM benchmarks.
Routing Performance
The most critical metric for Olympus is: Does it pick the right tool?
The researchers created a benchmark called OlympusBench. They compared Olympus against HuggingGPT (powered by GPT-4o, a much larger model).

The results in Table 2 are striking. Olympus achieves 94.75% accuracy in routing single tasks, compared to 81.35% for HuggingGPT (GPT-4o). This gap of more than 13 percentage points demonstrates the value of fine-tuning specifically for routing rather than relying on the zero-shot reasoning of a general chatbot.
The gap becomes even wider for complex, multi-step tasks.

In Chain-of-Action scenarios (Table 13), Olympus maintains high precision (91.82%), drastically outperforming the prompt-engineering approach.
Multimodal Understanding
One fear with fine-tuning a model for a new task (routing) is “catastrophic forgetting”—that it might lose its original ability to understand images.

Table 11 shows that Olympus (based on Mipha-3B) holds its own against state-of-the-art models like LLaVA and Qwen-VL. In some benchmarks like VizWiz and MM-Vet, it even shows slight improvements. This confirms that teaching the model to route tasks does not degrade its visual reasoning capabilities.
Qualitative Results: Seeing is Believing
Numbers are great, but in Computer Vision, the proof is in the pixels. Let’s look at what Olympus can actually create.

In Figure 10, we see the versatility:
- Column 1: It takes a depth map (top) and generates a resort. It takes a scribble (bottom) and makes a realistic chair.
- Column 4: It performs complex segmentation and video generation.
But the most impressive demonstrations come from its Chain-of-Action and Iterative capabilities.

In Figure 11 (Column 1), we see distinct editing tasks: adding flowers to a cat or brightening a video. In Column 2, it transforms a 2D car image into a 3D model.
These examples prove that Olympus effectively bridges the gap between different modalities. It flows from Text -> Image -> 3D -> Video within a single unified interface.
Efficiency
Finally, a note on training cost. Training massive “All-in-One” models often requires thousands of GPU hours.

As shown in Figure 7, adding the capability to route 20 different tasks only increased the training time by about 23.6% compared to the base model. This is incredibly efficient because Olympus isn’t learning how to generate pixels; it’s only learning how to ask the experts to do it.
Conclusion: The Future is Modular
Olympus represents a shift in thinking about Multimodal AI. Rather than building a monolith, it builds a manager.
Key Takeaways:
- Unified Control: Olympus uses a single interface to control over 20 diverse vision tasks.
- Scalability: It integrates existing state-of-the-art models (like Stable Diffusion and Wonder3D) rather than re-training them, making it easy to upgrade.
- High Accuracy: By fine-tuning specifically for routing (using the OlympusInstruct dataset), it vastly outperforms prompt-based controllers like HuggingGPT.
- Complex Workflows: The Chain-of-Action capability allows for multi-step creative processes that mimic human workflows.
As the AI field produces more and more specialized “expert” models—for better 3D rendering, smoother video, or more precise medical imaging—frameworks like Olympus will become essential. They provide the connective tissue that turns a collection of isolated tools into a cohesive, intelligent system.
In the end, Olympus shows us that you don’t need to be a Jack-of-All-Trades to be a master. You just need to know who to call.