Imagine you are in your kitchen, hands covered in flour, trying to follow a complex recipe for croissants. You have the text instructions, but you aren’t sure if your dough looks right. You ask a voice assistant, “Is this rolled out enough?”
A standard Large Language Model (LLM) might hallucinate a generic response or simply read the next step of the text. It cannot see your dough, nor can it show you a video clip of what it should look like. This is the “multimodal gap” in procedural planning. While LLMs excel at text, real-world tasks—cooking, furniture assembly, DIY repairs—are inherently visual.
In a recent paper titled “Show and Guide: Instructional-Plan Grounded Vision and Language Model,” researchers from NOVA LINCS propose a solution: MM-PlanLLM. This is a novel architecture designed not just to chat, but to ground the conversation in a specific plan (like a recipe) and seamlessly switch between text and video to guide the user.
In this post, we will break down how MM-PlanLLM works, how it aligns visual data with textual steps, and why this represents a significant step forward for AI assistants.
The Core Problem: Grounding and Multimodality
To be a helpful assistant for complex tasks, an AI needs to satisfy three conditions:
- Plan-Grounding: It must stick to the specific steps of the plan (don’t skip from step 1 to step 10).
- Visual Understanding: It needs to understand user-uploaded images to track progress.
- Visual Demonstration: It should be able to retrieve specific video moments that illustrate a step.
Existing models often fail here. Some are great at general image description but terrible at following sequential plans. Others are strict plan-followers but are “blind” (text-only).
As shown in Figure 1, MM-PlanLLM aims to bridge this. It aligns the textual plan with visual inputs (user images) and visual outputs (video clips), guiding the user through stages of text-plan alignment, text-video alignment, and vision-plan alignment.

The MM-PlanLLM Approach
The researchers propose a model that handles three specific tasks simultaneously to achieve this fluid interaction. Let’s look at them in detail.
1. Plan-Grounded Answer Generation (PGAG)
This is the baseline capability. The model must generate a text response (\(R\)) based on the dialogue history (\(D\)) and the user’s latest query (\(U\)). Unlike a standard chatbot, this generation is conditioned on the specific procedural plan (\(P\)).
The mathematical objective is to minimize the difference between the generated words and the ground truth response, formulated as a cross-entropy loss:

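The paper's exact notation isn't reproduced here, but a cross-entropy objective of this kind is typically written as the negative log-likelihood of each response token, conditioned on the plan, the dialogue history, and the user request. A plausible form (notation illustrative):

\[
\mathcal{L}_{\text{PGAG}} = -\sum_{t=1}^{|R|} \log p_\theta\!\left(r_t \mid r_{<t},\, P,\, D,\, U\right)
\]

where \(r_t\) is the \(t\)-th token of the response \(R\) and \(\theta\) denotes the model parameters.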
This ensures the model answers questions like “What do I do next?” correctly according to the written recipe.
2. Conversational Video Moment Retrieval (CVMR)
This is where the model goes beyond text. If a user asks, “Show me how to fold the dough,” the model needs to find the exact frames in an instructional video that correspond to that specific step.
The researchers treat this as a retrieval problem. They introduce a special token, [RET]. When the model generates this token, it triggers a search mechanism. The model calculates the probability of a specific video frame (\(f_k\)) being relevant to the current dialogue context.
The loss function here combines the retrieval loss (finding the right frame) with the text generation loss (responding to the user):

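The authors' exact formulation may differ, but a common way to write such an objective is a contrastive (softmax) retrieval term over the candidate frames, added to the usual language-modeling loss. Writing \(z\) for the projected [RET] representation and \(z_k\) for the projected embedding of frame \(f_k\) (this notation is ours), one plausible form is:

\[
p(f_k \mid D, U, P) = \frac{\exp\!\big(\langle z,\, z_k \rangle\big)}{\sum_{j} \exp\!\big(\langle z,\, z_j \rangle\big)},
\qquad
\mathcal{L}_{\text{CVMR}} = \mathcal{L}_{\text{gen}} - \log p(f^{*} \mid D, U, P)
\]

where \(f^{*}\) is the ground-truth frame and \(\mathcal{L}_{\text{gen}}\) is the same token-level cross-entropy term as in the previous task.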
3. Visually-Informed Step Generation (VSG)
This task addresses the “Am I doing this right?” scenario. The user uploads an image (\(I\)) of their current progress. The model must analyze this image, compare it to the plan, and determine which step the user is on or what they should do next.
The model generates the appropriate response (\(R\)) conditioned on the image and the plan:

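In the same spirit as the first task, this can be sketched as a cross-entropy objective that additionally conditions on the user image (again, the notation below is illustrative rather than the paper's):

\[
\mathcal{L}_{\text{VSG}} = -\sum_{t=1}^{|R|} \log p_\theta\!\left(r_t \mid r_{<t},\, I,\, P,\, D,\, U\right)
\]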
Architecture: How It Fits Together
To achieve these three tasks without training a massive model from scratch, the researchers use a modular architecture. They combine a pre-trained Large Language Model (LLM) backbone (specifically exploring Llama-2) with a Visual Encoder (CLIP ViT).
The key innovation lies in how these two giants talk to each other. They don’t just concatenate features; they use Task-Specific Projection Layers.
![Figure 2: Comprehensive illustration of the MM-PlanLLM architecture, including the 3 training stages employed for model training. *Denotes the [RET] token embedding representations and the Language Modeling Head of the LLM remain trainable.](/en/paper/2409.19074/images/005.jpg#center)
As illustrated in Figure 2:
- For Video Retrieval (Top): The dialogue context and candidate video frames are processed separately. Linear layers (\(W_t\) and \(W_i\)) project both the text [RET] token and the visual frames into a shared "retrieval space." If the text embedding and the video frame embedding are close in this space, it's a match.
- For Step Generation (Bottom): When a user uploads an image, it is encoded by the visual encoder. A projection layer (\(W_c\)) maps this visual embedding into the LLM's own embedding space. Effectively, the image becomes a word that the LLM can "read."
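To make the projection idea concrete, here is a minimal PyTorch-style sketch. The dimensions, module names, and the use of plain linear layers are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultimodalProjections(nn.Module):
    """Illustrative sketch of the task-specific projection layers.

    Assumed sizes: the LLM hidden size (e.g. 4096 for Llama-2-7B),
    the CLIP ViT feature size (e.g. 768), and a shared retrieval space.
    """

    def __init__(self, llm_dim=4096, vis_dim=768, ret_dim=256):
        super().__init__()
        self.W_t = nn.Linear(llm_dim, ret_dim)  # projects the [RET] token embedding
        self.W_i = nn.Linear(vis_dim, ret_dim)  # projects candidate video frames
        self.W_c = nn.Linear(vis_dim, llm_dim)  # maps a user image into the LLM space

    def retrieval_scores(self, ret_token_emb, frame_embs):
        """Cosine similarity between the projected [RET] token and each frame."""
        q = F.normalize(self.W_t(ret_token_emb), dim=-1)  # (ret_dim,)
        k = F.normalize(self.W_i(frame_embs), dim=-1)     # (num_frames, ret_dim)
        return k @ q                                       # (num_frames,)

    def image_as_token(self, image_emb):
        """Turn a CLIP image embedding into a pseudo-token the LLM can read."""
        return self.W_c(image_emb)                         # (llm_dim,)


# Usage sketch: score candidate frames for a "show me this step" request.
proj = MultimodalProjections()
ret_emb = torch.randn(4096)      # hidden state of the generated [RET] token (dummy)
frames = torch.randn(120, 768)   # CLIP features for 120 candidate video frames (dummy)
best_frame = proj.retrieval_scores(ret_emb, frames).argmax().item()
```

The same retrieval-score computation is what the [RET] token triggers in the CVMR task: the frame with the highest similarity in the shared space is returned to the user.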
The Multistage Training Strategy
You cannot simply throw complex dialogue data at this architecture and hope it learns. The researchers devised a three-stage training process to gradually build the model’s capabilities:
Stage 1: Visual Projection Layers (Bootstrapping): The LLM and Visual Encoder are frozen. Only the linear projection layers are trained using massive datasets of image-caption pairs (CC3M). This teaches the model the basic correspondence between visual concepts and text.
Stage 2: Task Data Specialization: The model is fine-tuned on domain-specific data—in this case, cooking. They use the Tasty Dataset, which contains recipes annotated with specific video timestamps. This bridges the gap between general images and specific instructional actions (e.g., distinguishing “cutting” from “chopping”).
Stage 3: Multimodal Plan-Grounded Dialogue: Finally, the model is trained on full conversations. Since high-quality, multimodal instructional dialogue data doesn’t exist at scale, the authors generated a synthetic dataset called TastyVidDial. This stage unfreezes the LLM and fine-tunes the whole system to handle interleaved text and visual requests.
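A rough sketch of how such a freeze/unfreeze schedule might look in PyTorch-style code is shown below. The parameter grouping is a simplification (per the Figure 2 caption, the [RET] token embedding and the LLM's language-modeling head also remain trainable, a detail omitted here), so treat this as an illustration rather than the authors' training script.

```python
def set_trainable(module, flag):
    """Freeze or unfreeze every parameter of a module."""
    for p in module.parameters():
        p.requires_grad = flag

def configure_stage(stage, llm, visual_encoder, projections):
    """Illustrative freeze/unfreeze schedule for the three training stages."""
    if stage == 1:
        # Stage 1: bootstrap the projection layers on image-caption pairs (CC3M).
        set_trainable(llm, False)
        set_trainable(visual_encoder, False)
        set_trainable(projections, True)
    elif stage == 2:
        # Stage 2: specialize on recipe/video data (Tasty); projections keep training.
        set_trainable(llm, False)
        set_trainable(visual_encoder, False)
        set_trainable(projections, True)
    elif stage == 3:
        # Stage 3: full plan-grounded dialogue (TastyVidDial); the LLM is unfrozen
        # so the whole system adapts to interleaved text and visual requests.
        set_trainable(llm, True)
        set_trainable(visual_encoder, False)  # assumption: visual encoder stays frozen
        set_trainable(projections, True)
```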
Experiments and Results
The researchers evaluated MM-PlanLLM against several baselines, including FROMAGe (a state-of-the-art multimodal model) and PlanLLM (a text-only plan-grounded model).
Text-Only Performance
A major concern when adding multimodal capabilities is “catastrophic forgetting”—does the model become dumber at text because it’s focusing on pictures?
Table 1 shows the performance on text-only plan following. MM-PlanLLM achieves a BERTScore of 83.28, not far behind the specialized text-only PlanLLM (88.66). This indicates that the multimodal training did not substantially degrade its ability to follow text instructions.

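For readers unfamiliar with the metric: BERTScore compares generated and reference text using contextual embeddings rather than exact n-gram overlap. A minimal usage sketch with the bert-score package (not the authors' evaluation script, and with made-up example sentences):

```python
# pip install bert-score
from bert_score import score

candidates = ["Roll the dough into a 20 cm square and chill for 30 minutes."]
references = ["Roll out the dough to a 20 cm square, then refrigerate it for 30 minutes."]

# Returns precision, recall, and F1 tensors; papers typically report F1 (scaled to 0-100).
P, R, F1 = score(candidates, references, lang="en")
print(f"BERTScore F1: {F1.mean().item() * 100:.2f}")
```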
Multimodal Performance
When it comes to the new tasks—retrieving videos and analyzing user images—MM-PlanLLM dominates the baselines.
In the Conversational Video Moment Retrieval (CVMR) task, MM-PlanLLM significantly outperformed FROMAGe. This is largely because standard multimodal models are trained on static images and captions, whereas MM-PlanLLM’s training included specific alignment between plan steps and video timestamps.
Multimodal Alignment
One of the most impressive results is how well the model aligns text with specific video moments.
Figure 3 plots the similarity between the text representation of a step and the video frames. The similarity peaks at "Distance 0" (the correct step) and drops off as you move to previous or future steps. This indicates the model isn't just guessing; it has genuinely learned which part of the video matches the text.
![Figure 3: Text-query to visual plan alignment. MM-PlanLLM effectively learns to align textual [RET] token representations with that of the target step frames. We remove outliers for clarity.](/en/paper/2409.19074/images/008.jpg#center)
Similarly, Figure 4 shows the reverse: when given an image, how close is the generated text to the correct plan step? The vast majority of generated answers align perfectly (Distance 0) or are adjacent steps (Distance 1), showing strong situational awareness.

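A back-of-the-envelope version of the Figure 3 analysis can be reproduced in a few lines: group the cosine similarities between each query's [RET] representation and every frame by how far (in plan steps) the frame is from the target step, then average per bucket. The tensor shapes and names below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F
from collections import defaultdict

def similarity_by_step_distance(ret_embs, frame_embs, frame_step_ids, target_steps):
    """Average cosine similarity between [RET] embeddings and frames,
    bucketed by the signed distance between a frame's step and the target step.

    ret_embs:       (num_queries, d) projected [RET] representations
    frame_embs:     (num_frames, d)  projected frame representations
    frame_step_ids: (num_frames,)    plan step index of each frame
    target_steps:   (num_queries,)   plan step index each query refers to
    """
    sims = F.normalize(ret_embs, dim=-1) @ F.normalize(frame_embs, dim=-1).T
    buckets = defaultdict(list)
    for q, target in enumerate(target_steps):
        for f, step in enumerate(frame_step_ids):
            buckets[int(step - target)].append(sims[q, f].item())
    return {dist: sum(vals) / len(vals) for dist, vals in sorted(buckets.items())}
```

If the model has learned the alignment, the bucket at distance 0 should have the highest average similarity, mirroring the peak in Figure 3.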
Ablation Study: Did the Training Stages Matter?
The researchers conducted an ablation study to see if their complex three-stage training was actually necessary.
Table 3 reveals the answer is a resounding “yes.”
- Phase 1 only: Performance is poor (near random).
- Phase 1 + 2: Significant jump in capability.
- Phase 1 + 2 + 3: The model achieves its full potential, particularly in retrieval accuracy (R@1 jumps from 3.45 to 6.72).
This confirms that domain specialization and dialogue-specific training are crucial for this type of task.

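As a reminder of what the retrieval metric measures: Recall@k (R@k) is the fraction of queries for which the ground-truth frame appears among the top-k retrieved candidates. A minimal sketch with hypothetical inputs:

```python
def recall_at_k(ranked_frame_ids, gold_frame_ids, k=1):
    """Percentage of queries whose gold frame appears in the top-k ranked candidates."""
    hits = sum(gold in ranked[:k] for ranked, gold in zip(ranked_frame_ids, gold_frame_ids))
    return 100.0 * hits / len(gold_frame_ids)

# Example: 3 queries, and only the first one ranks its gold frame first -> R@1 = 33.3
print(recall_at_k([[7, 2, 9], [4, 1, 0], [5, 6, 3]], [7, 1, 8], k=1))
```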
Qualitative Analysis: Seeing the Model in Action
Numbers are great, but how does it look in practice?
Figure 6 shows the Video Moment Retrieval in action. The left column shows the target step (e.g., “Add pumpkin puree”), and the right shows the frames the model retrieved. Green boxes indicate successes. The model is generally able to filter through a long video and pick out the exact moment an ingredient is added or a technique is performed.

Finally, Figure 7 illustrates a real dialogue. The user asks, “Done, what is the next step?” or “How do I do step 2?”. The model fluidly switches between explaining the step in text and providing a video demonstration, mimicking a helpful human instructor.

Conclusion
MM-PlanLLM marks a notable step in the maturation of Multimodal Large Language Models. By moving beyond simple image captioning and tackling instructional grounding, the authors have created a blueprint for the next generation of AI assistants.
The implications extend far beyond cooking. The same architecture could be applied to fixing a car engine, assembling flat-pack furniture, or guiding a medical student through a procedure. By tightly coupling the “Plan” (text) with the “Reality” (vision), MM-PlanLLM brings us closer to AI that can truly see, guide, and assist in the physical world.
While the model has limitations—such as a relatively small context window and a reliance on synthetic training data—the multistage training approach and the specialized projection layers offer a robust path forward for future research in plan-grounded AI.