Creating an AI that can navigate and act in a complex, open-ended world like Minecraft is one of the grand challenges in artificial intelligence. The goal isn’t just to build an agent that performs single tasks; it’s to create one that can understand diverse instructions, plan multi-step actions, and execute them skillfully—much like a human player.

In recent years, AI agents have adopted a two-part strategy: a high-level planner and a low-level policy. The planner—often a Multimodal Large Language Model (MLLM) such as GPT-4V—serves as the brain, breaking down a complex goal like “I need a wooden sword” into a series of sub-goals:

  1. Chop a tree
  2. Craft planks
  3. Craft sticks
  4. Craft a sword

The policy is the muscle, turning each sub-goal into the control commands needed to play the game—keyboard presses and mouse movements.

While planners have become remarkably powerful, the overall performance of Minecraft agents has hit a bottleneck: the policy. Existing policies struggle to translate the planner’s sub-goals into precise actions. They lack an intuitive grasp of language and, more importantly, fail to model the cause-and-effect link between what the agent does and what it sees next.

This is where the new paper “Optimus-2: Multimodal Minecraft Agent with Goal-Observation-Action Conditioned Policy” makes its mark. The researchers present Optimus-2, a new agent that reimagines how policies connect goals, observations, and actions through a novel framework called the Goal-Observation-Action Conditioned Policy (GOAP).


The Problem with Today’s AI Policies

To appreciate what makes Optimus-2 different, it helps to look at where current goal-conditioned policies fall short.

A diagram comparing a general agent framework with existing policies and the new GOAP policy. The left side shows an MLLM planner breaking down a task. The right side contrasts a Transformer-XL-based policy with the more advanced GOAP architecture.

Figure 1. Existing goal-conditioned policies vs. Optimus-2’s proposed GOAP. GOAP introduces an Action-guided Behavior Encoder to capture causal relationships and an MLLM backbone to enhance language comprehension.

Existing approaches often rely on architectures such as Transformer-XL, which suffer from two key weaknesses:

  1. They ignore causality. When you play a game, the action you take—say, moving forward—directly causes your next view of the world. Policies in previous agents, however, treat each observation in isolation. They only look at the current frame and the goal, ignoring that the observation is a result of the previous action. Without modeling this cause-and-effect relationship, the policy misses a fundamental part of world dynamics.

  2. They underuse language signals. Most agents use minimal goal encoders that turn an instruction like “chop a tree” into a fixed text embedding. That embedding is then simply added to visual features, losing much of the rich semantic meaning. These models can’t handle nuanced or open-ended language in human-like ways.

To overcome these limitations, Optimus-2 introduces GOAP, a policy that aligns the goal, observation, and action sequences more intelligently.


Introducing Optimus-2 and the GOAP Policy

Optimus-2 combines a state-of-the-art MLLM planner with the novel GOAP policy. The planner decomposes instructions into actionable sub-goals, and GOAP executes them while modeling dynamics and semantics at a deeper level.

An overview of the Optimus-2 architecture, showing how text, image, and behavior tokens are fed into a Large Language Model to predict actions.

Figure 2. Overview of Optimus-2. The planner interprets the task and generates sub-goals, while GOAP uses multimodal inputs—text, image, and behavior tokens—to predict the next action.

GOAP’s architecture consists of two interconnected components:

  • Action-guided Behavior Encoder (ABE): captures how observations and actions evolve over time.
  • MLLM Backbone: aligns this multimodal understanding with the language sub-goal to predict the next action.

The Core Method: How GOAP Works

At its core, GOAP learns the connection among the goal \( g \), observation \( o \), and action \( a \). The policy \( p_{\theta} \) aims to predict the next action \( a_{t+1} \) given the historical observations \( o_{1:t} \) and the current goal \( g \):

\[ \min_{\theta} \sum_{t=1}^{T} -\log p_{\theta}\left(a_{t+1} \mid o_{1:t}, g\right) \]
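To make the objective concrete, here is a minimal PyTorch-style sketch of the behavior-cloning loss, assuming a `policy` callable that returns one set of action logits per timestep and a single discretized action head for simplicity (the real agent uses VPT’s factored keyboard-and-mouse action space):

```python
import torch.nn.functional as F

def behavior_cloning_loss(policy, observations, goal, actions):
    """Mean negative log-likelihood of the demonstrated actions.

    observations: frames o_1..o_T
    goal:         encoded sub-goal g
    actions:      (T,) indices of the demonstrated actions a_{t+1}
    """
    logits = policy(observations, goal)      # (T, num_actions): one prediction per timestep
    return F.cross_entropy(logits, actions)  # averages -log p(a_{t+1} | o_{1:t}, g) over t
```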

1. Action-guided Behavior Encoder: Learning From History

The Action-guided Behavior Encoder enables GOAP to reason about what caused what. It turns the ongoing observation–action sequence into informative behavior tokens through two modules: the Causal Perceiver and the History Aggregator.

Causal Perceiver — Linking Actions and Observations

At each timestep \( t \), the observation \( o_t \) is passed through a Vision Transformer (ViT) to generate image features \( v_t \):

\[ v_t \leftarrow \mathsf{VE}(o_t) \]

Rather than treating these features as standalone, the Causal Perceiver fuses them with the action embedding \( a_t \) using cross-attention:

\[ Q = v_{t}W_v^Q, \quad K = a_{t}W_a^K, \quad V = a_{t}W_a^V \]

\[ \hat{v}_t = \mathrm{Softmax}\!\left(\frac{QK^T}{\sqrt{d}}\right)V \]

This mechanism adjusts the visual representation based on the previous action, explicitly modeling the causal link between what the agent did and what it now sees.
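A compact PyTorch-style sketch of this cross-attention step follows. The tensor shapes are assumptions on my part; in particular, the action may be embedded as one or more tokens (e.g., one per keyboard/mouse component):

```python
import torch
import torch.nn as nn

class CausalPerceiverSketch(nn.Module):
    """Minimal sketch of the cross-attention above: visual tokens attend to action tokens."""

    def __init__(self, dim):
        super().__init__()
        self.w_q = nn.Linear(dim, dim, bias=False)  # W_v^Q applied to image features
        self.w_k = nn.Linear(dim, dim, bias=False)  # W_a^K applied to the action embedding
        self.w_v = nn.Linear(dim, dim, bias=False)  # W_a^V applied to the action embedding

    def forward(self, v_t, a_t):
        # v_t: (num_patches, dim) ViT features of the current frame
        # a_t: (num_action_tokens, dim) embedding of the accompanying action
        q, k, v = self.w_q(v_t), self.w_k(a_t), self.w_v(a_t)
        attn = torch.softmax(q @ k.T / (q.shape[-1] ** 0.5), dim=-1)
        return attn @ v  # action-adjusted visual features, i.e. v-hat_t in the equation
```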

History Aggregator — Remembering the Past

Once the visual features have been enriched with causal context, the History Aggregator maintains long-term dependencies. It applies a history-attention operation in which the current causally-enriched features \( \hat{v}_t \) act as the query, while past behavior tokens provide the keys and values:

\[ \hat{B}_t = \mathrm{Softmax}\!\left(\frac{QK^T}{\sqrt{d}}\right)V \]

It continuously integrates information from past behavior tokens \( [B_1, B_2, \dots, B_{t-1}] \), compressing older experiences through a Memory Bank so that the sequence length stays manageable. The result is a set of behavior tokens that encapsulate the complete trajectory up to time \( t \).
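Here is a hedged sketch of how such a history-attention step with a bounded memory could look. The mean-pooling and oldest-first eviction are my own simplifications, not the paper’s exact Memory Bank design:

```python
import torch
import torch.nn as nn

class HistoryAggregatorSketch(nn.Module):
    """Sketch: current causally-enriched features query a bank of past behavior tokens."""

    def __init__(self, dim, max_memory=128):
        super().__init__()
        self.w_q = nn.Linear(dim, dim, bias=False)
        self.w_k = nn.Linear(dim, dim, bias=False)
        self.w_v = nn.Linear(dim, dim, bias=False)
        self.max_memory = max_memory
        self.memory = []  # compressed summaries of B_1 .. B_{t-1}

    def forward(self, v_hat_t):
        # v_hat_t: (num_tokens, dim) output of the Causal Perceiver for timestep t
        if self.memory:
            past = torch.cat(self.memory, dim=0)  # (M, dim) compressed history
            q, k, v = self.w_q(v_hat_t), self.w_k(past), self.w_v(past)
            attn = torch.softmax(q @ k.T / (q.shape[-1] ** 0.5), dim=-1)
            b_t = v_hat_t + attn @ v  # behavior tokens B_t, enriched with history
        else:
            b_t = v_hat_t  # first timestep: nothing to attend to yet
        # Memory Bank stand-in: keep a mean-pooled summary per step, evict the oldest when full.
        self.memory.append(b_t.mean(dim=0, keepdim=True).detach())
        if len(self.memory) > self.max_memory:
            self.memory.pop(0)
        return b_t
```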

2. MLLM Backbone: Understanding the Goal

Once GOAP has a compressed representation of what has happened so far, it must align that understanding with the current goal described in natural language. For this, it uses a strong MLLM backbone (DeepSeek-VL-1.3B).

The backbone processes three input streams:

  1. Text tokens (sub-goal g)
  2. Image tokens (current visual \( v_t \))
  3. Behavior tokens (historical summary \( B_t \))

It fuses these modalities to predict the next action embedding \( \bar{a}_{t+1} \):

\[ \bar{a}_{t+1} \leftarrow \mathrm{MLLM}\left([g, v_t, B_t]\right) \]

This embedding is then converted into real control commands via a pretrained Action Head from VPT:

\[ a_{t+1} \leftarrow \mathrm{AH}(\bar{a}_{t+1}) \]
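Putting the pieces together, a single prediction step might look like the sketch below. The module names and the HuggingFace-style `inputs_embeds` interface are illustrative assumptions, not the authors’ code:

```python
import torch

def goap_predict_action(mllm, action_head, goal_tokens, image_tokens, behavior_tokens):
    """Sketch of one GOAP prediction step (interfaces are illustrative).

    goal_tokens:     (L_g, dim) embedded sub-goal text g
    image_tokens:    (L_v, dim) current visual features v_t
    behavior_tokens: (L_b, dim) history summary B_t from the behavior encoder
    """
    # Concatenate the three streams into a single multimodal sequence.
    inputs = torch.cat([goal_tokens, image_tokens, behavior_tokens], dim=0).unsqueeze(0)
    hidden = mllm(inputs_embeds=inputs).last_hidden_state  # assumes a HuggingFace-style interface
    a_bar = hidden[0, -1]                                  # predicted action embedding for t+1
    return action_head(a_bar)                              # VPT-style head decodes keyboard/mouse commands
```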

The model is trained end-to-end using both a behavior cloning loss and a KL-divergence term against a teacher model, ensuring stable knowledge transfer from previous large-scale policies.
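A minimal sketch of such a combined objective, with the weighting between the two terms left as an assumption:

```python
import torch.nn.functional as F

def goap_loss(student_logits, teacher_logits, actions, kl_weight=1.0):
    """Behavior cloning plus KL distillation from a teacher policy (weighting is illustrative)."""
    bc = F.cross_entropy(student_logits, actions)
    kl = F.kl_div(
        F.log_softmax(student_logits, dim=-1),  # student log-probabilities
        F.softmax(teacher_logits, dim=-1),      # teacher probabilities
        reduction="batchmean",
    )
    return bc + kl_weight * kl
```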


The MGOA Dataset: High-Quality Data for Better Training

Training a data-hungry model like GOAP requires aligned examples of goals, observations, and actions. However, existing Minecraft datasets are limited—some lack alignment, others aren’t public. To fill this gap, the authors created the Minecraft Goal-Observation-Action (MGOA) dataset.

A comparison of the new MGOA dataset with existing popular datasets for Minecraft agent training.

Table 1. MGOA provides ~30M aligned goal–observation–action samples across 8 atomic tasks, exceeding existing datasets in scale and coverage.

They designed an automated pipeline to generate high-quality data at scale.

A diagram illustrating the automated pipeline for generating the MGOA dataset, from instruction generation with GPT-4 to data filtering.

Figure 9. Automated dataset generation pipeline. GPT-4 creates natural-language instructions, a Minecraft agent executes them, and successful interactions are recorded as goal–observation–action pairs.

Using GPT-4 to create goal descriptions and a trained agent (STEVE-1) to execute them, they recorded all gameplay, but only retained episodes where the task was completed successfully and efficiently. This yielded 25,000 videos and about 30 million samples, offering a benchmark dataset with precise multimodal alignment.
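Conceptually, each retained episode yields aligned samples of the shape sketched below. This is a hypothetical schema to illustrate the goal–observation–action alignment and the filtering rule, not the dataset’s actual file format:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class MGOASample:
    """One aligned goal-observation-action sample (field names are illustrative)."""
    goal: str            # natural-language sub-goal, e.g. "chop a tree"
    frames: List[bytes]  # encoded observation frames o_1 .. o_T
    actions: List[dict]  # recorded keyboard/mouse actions a_1 .. a_T

def keep_episode(task_completed: bool, steps_used: int, step_budget: int) -> bool:
    """Conceptual filtering rule: retain only successful, reasonably efficient episodes."""
    return task_completed and steps_used <= step_budget
```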


Experiments and Results

The researchers evaluated Optimus-2 across three types of tasks:

  1. Atomic Tasks — short-term skills (chop logs, collect seeds).
  2. Long-Horizon Tasks — chains of subtasks (crafting complex items).
  3. Open-Ended Instruction Tasks — flexible natural language instructions.

Atomic Tasks: Mastering the Basics

A table showing the performance of GOAP compared to other policies on four basic atomic tasks in Minecraft.

Table 2. GOAP outperforms previous policies such as GROOT and STEVE‑1 across all atomic tasks.

GOAP achieved sizable gains over the previous state-of-the-art. For example, it improved success rates by up to 35% when mining stone, showing that the model effectively learned causal and historical context to master foundational skills.

Long-Horizon Tasks: Chaining Skills Together

A table comparing the success rate of Optimus-2 against other SOTA agents and a human baseline on complex, long-horizon task groups.

Table 3. Optimus‑2 surpasses previous SOTA agents across all long-horizon task categories, closing the gap to human-level performance.

Optimus-2 achieved the highest success rates across seven task groups—from Wood to Redstone—demonstrating its ability to sequence sub-goals and execute complex behavior patterns. In the Diamond and Redstone categories, it came closest to human-level proficiency.

Open-Ended Instructions: Real Understanding

A table demonstrating GOAP’s superior performance on tasks with open-ended natural language instructions, where other policies struggle or fail.

Table 4. GOAP successfully executes open-ended language instructions that other agents fail to understand.

When tested with flexible natural-language instructions—such as “I need some iron ores, what should I do?”—GOAP dramatically outperformed prior models.

A visual example showing how Optimus-2 successfully interprets and executes an open-ended instruction that causes other agents like VPT and STEVE-1 to fail.

Figure 3. Only Optimus‑2 correctly interprets and executes the instruction “I need some iron ores.” Other agents fail due to limited language comprehension.

This capability stems from the inclusion of an MLLM backbone within the policy itself, allowing the agent to reason through instructions rather than treating them as static goal embeddings.


Why GOAP Works: Ablation Insights

The authors conducted ablation studies to confirm the importance of each architectural component.

An ablation study table showing that removing components of the Action-guided Behavior Encoder significantly degrades performance on atomic tasks.

Table 5. Removing the Causal Perceiver or History Aggregator sharply reduces performance, proving their necessity.

  • Action-guided Behavior Encoder matters. Removing either the Causal Perceiver (CP) or the History Aggregator (HA) led to a ~40–45% average drop in atomic task performance, proving that modeling causality and long-term history is essential.

A bar chart showing a massive drop in success rate on open-ended tasks when the LLM backbone is replaced with a Transformer-XL.

Figure 4. LLM vs. Transformer‑XL: replacing the LLM backbone severely degrades open-ended instruction task performance.

  • MLLM backbone matters. Replacing it with a standard Transformer-XL obliterated success on open-ended tasks, highlighting that MLLM language priors are vital for instruction comprehension.

A bar chart comparing training results on different datasets, proving the effectiveness of the new MGOA dataset.

Figure 5. GOAP trained with the MGOA dataset (green bars) outperforms models trained on prior datasets such as OCD.

  • MGOA dataset matters. Training on the new MGOA dataset improved performance by roughly 70% on average compared to using only the earlier contractor dataset (OCD), showing how crucial aligned, high-quality data is to policy learning.

Visualizing Behavior Representations

An intuitive way to see the difference is through t‑SNE embeddings of task representations.

A t-SNE visualization comparing the latent representations from ViT, MineCLIP, and the Action-guided Behavior Encoder. The proposed encoder produces clearly separated clusters for different tasks.

Figure 6. The Action-guided Behavior Encoder (right) produces cleanly separated clusters corresponding to different atomic tasks, unlike ViT or MineCLIP.

The clusters produced by GOAP are sharply distinct for each task, confirming that its encoder learns semantically meaningful and task-aware behavioral representations.


Conclusion and Outlook

Optimus-2 marks a leap forward in building AI agents capable of understanding, planning, and acting in complex open-world environments. By addressing the policy bottleneck head-on, it introduces:

  1. An Action-guided Behavior Encoder that explicitly models the causal link between actions and observations while managing long-term history through attention and memory.
  2. An MLLM-enhanced policy backbone, enabling true multimodal reasoning and comprehension of open-ended language instructions.

Combined with the large-scale, high-quality MGOA dataset, the Optimus‑2 framework empowers the research community to train agents that think and act more like humans. The results show clear gains on atomic, long-horizon, and open-ended tasks, bringing the dream of fully autonomous, instruction-following AI within reach.

Optimus-2 doesn’t just play Minecraft—it learns, reasons, and creates within it. This fusion of powerful language models with causal, multimodal action understanding represents a major step toward intelligent agents that can master not only games, but the complex worlds beyond them.