Imagine you are in your kitchen, hands covered in flour, trying to follow a complex recipe for croissants. You have the text instructions, but you aren’t sure if your dough looks right. You ask a voice assistant, “Is this rolled out enough?”

A standard Large Language Model (LLM) might hallucinate a generic response or simply read the next step of the text. It cannot see your dough, nor can it show you a video clip of what it should look like. This is the “multimodal gap” in procedural planning. While LLMs excel at text, real-world tasks—cooking, furniture assembly, DIY repairs—are inherently visual.

In a recent paper titled “Show and Guide: Instructional-Plan Grounded Vision and Language Model,” researchers from NOVA LINCS propose a solution: MM-PlanLLM. This is a novel architecture designed not just to chat, but to ground the conversation in a specific plan (like a recipe) and seamlessly switch between text and video to guide the user.

In this post, we will break down how MM-PlanLLM works, how it aligns visual data with textual steps, and why this represents a significant step forward for AI assistants.

The Core Problem: Grounding and Multimodality

To be a helpful assistant for complex tasks, an AI needs to satisfy three conditions:

  1. Plan-Grounding: It must stick to the specific steps of the plan (don’t skip from step 1 to step 10).
  2. Visual Understanding: It needs to understand user-uploaded images to track progress.
  3. Visual Demonstration: It should be able to retrieve specific video moments that illustrate a step.

Existing models often fail here. Some are great at general image description but terrible at following sequential plans. Others are strict plan-followers but are “blind” (text-only).

As shown in Figure 1, MM-PlanLLM aims to bridge this gap. It aligns the textual plan with visual inputs (user images) and visual outputs (video clips), guiding the user through text-plan, text-video, and vision-plan alignment.

Figure 1: Example of a plan-grounded multimodal dialogue. The proposed model has the ability to understand and respond to multimodal input, provide relevant information from multiple knowledge sources, and guide the user through a complex task while adhering to a structured plan.

The MM-PlanLLM Approach

The researchers propose a model that handles three specific tasks simultaneously to achieve this fluid interaction. Let’s look at them in detail.

1. Plan-Grounded Answer Generation (PGAG)

This is the baseline capability. The model must generate a text response (\(R\)) based on the dialogue history (\(D\)) and the user’s latest query (\(U\)). Unlike a standard chatbot, this generation is conditioned on the specific procedural plan (\(P\)).

The mathematical objective is to minimize the difference between the generated words and the ground truth response, formulated as a cross-entropy loss:

\[
\mathcal{L}_{\text{PGAG}} = -\sum_{t=1}^{|R|} \log p_\theta\big(r_t \mid r_{<t},\, D,\, U,\, P\big)
\]

This ensures the model answers questions like “What do I do next?” correctly according to the written recipe.
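For intuition, here is a minimal sketch of this objective in PyTorch/Hugging Face style, where the plan, dialogue history, and user query are simply concatenated into the prompt and the loss is masked to the response tokens. The prompt template and function names are illustrative assumptions, not the paper's implementation:

```python
import torch

def pgag_loss(model, tokenizer, plan, dialogue, query, response, device="cpu"):
    """Cross-entropy over the response tokens, conditioned on the plan P,
    dialogue history D, and user query U (a sketch, not the paper's code)."""
    prompt = f"Plan: {plan}\nDialogue: {dialogue}\nUser: {query}\nAssistant: "
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    response_ids = tokenizer(response, return_tensors="pt",
                             add_special_tokens=False).input_ids.to(device)
    input_ids = torch.cat([prompt_ids, response_ids], dim=1)

    # Mask the prompt so the loss only covers the response tokens (R).
    labels = input_ids.clone()
    labels[:, :prompt_ids.shape[1]] = -100  # -100 is ignored by the loss

    return model(input_ids=input_ids, labels=labels).loss
```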

2. Conversational Video Moment Retrieval (CVMR)

This is where the model goes beyond text. If a user asks, “Show me how to fold the dough,” the model needs to find the exact frames in an instructional video that correspond to that specific step.

The researchers treat this as a retrieval problem. They introduce a special token, [RET]. When the model generates this token, it triggers a search mechanism. The model calculates the probability of a specific video frame (\(f_k\)) being relevant to the current dialogue context.

The loss function here combines the retrieval loss (finding the right frame) with the text generation loss (responding to the user):

\[
\mathcal{L}_{\text{CVMR}} = \mathcal{L}_{\text{gen}} \;-\; \log \frac{\exp\big(\mathrm{sim}(h_{[\text{RET}]},\, f_k)\big)}{\sum_{j}\exp\big(\mathrm{sim}(h_{[\text{RET}]},\, f_j)\big)}
\]

where \(h_{[\text{RET}]}\) is the model's representation of the [RET] token, \(\mathrm{sim}(\cdot,\cdot)\) is the similarity in the shared retrieval space, and the sum runs over the candidate frames.
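To make the retrieval side concrete, here is a minimal sketch of a frame-retrieval head and a combined loss, assuming a contrastive softmax over candidate frames and linear projections mirroring the \(W_t\) and \(W_i\) layers described in the architecture section below. The dimensions, temperature, and equal loss weighting are assumptions, not values from the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RetrievalHead(nn.Module):
    """Projects the [RET] hidden state and frame features into a shared
    retrieval space and scores each candidate frame (illustrative only)."""
    def __init__(self, text_dim, vis_dim, ret_dim=256):
        super().__init__()
        self.w_t = nn.Linear(text_dim, ret_dim)  # text-side projection (W_t)
        self.w_i = nn.Linear(vis_dim, ret_dim)   # image-side projection (W_i)

    def forward(self, ret_hidden, frame_feats):
        # ret_hidden: (B, text_dim) hidden state of the generated [RET] token
        # frame_feats: (B, num_frames, vis_dim) candidate video frame features
        q = F.normalize(self.w_t(ret_hidden), dim=-1)   # (B, ret_dim)
        k = F.normalize(self.w_i(frame_feats), dim=-1)  # (B, num_frames, ret_dim)
        return torch.einsum("bd,bfd->bf", q, k)         # per-frame similarity

def cvmr_loss(frame_scores, target_frame_idx, gen_loss, temperature=0.07):
    """Softmax retrieval loss over candidate frames, added to the usual
    text-generation loss (the equal weighting is an assumption)."""
    ret_loss = F.cross_entropy(frame_scores / temperature, target_frame_idx)
    return gen_loss + ret_loss
```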

3. Visually-Informed Step Generation (VSG)

This task addresses the “Am I doing this right?” scenario. The user uploads an image (\(I\)) of their current progress. The model must analyze this image, compare it to the plan, and determine which step the user is on or what they should do next.

The model generates the appropriate response (\(R\)) conditioned on the image and the plan:

\[
\mathcal{L}_{\text{VSG}} = -\sum_{t=1}^{|R|} \log p_\theta\big(r_t \mid r_{<t},\, D,\, U,\, I,\, P\big)
\]
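A rough sketch of how this image conditioning can be wired, assuming a CLIP-style image embedding, a linear projection into the LLM embedding space (the \(W_c\) layer discussed below), and a Hugging Face-style causal LM that accepts `inputs_embeds`. The helper names and number of visual tokens are hypothetical:

```python
import torch
import torch.nn as nn

class VisualPrefix(nn.Module):
    """Maps a CLIP image embedding into the LLM token-embedding space
    (the W_c projection), so the image is read like an extra 'word'."""
    def __init__(self, vis_dim, llm_dim, num_visual_tokens=1):
        super().__init__()
        self.num_visual_tokens = num_visual_tokens
        self.w_c = nn.Linear(vis_dim, llm_dim * num_visual_tokens)

    def forward(self, image_feat):                        # (B, vis_dim)
        out = self.w_c(image_feat)                        # (B, llm_dim * k)
        return out.view(image_feat.size(0), self.num_visual_tokens, -1)

def vsg_loss(llm, embed_tokens, visual_prefix, image_feat, prompt_ids, labels):
    """LM loss over the response, conditioned on the projected image plus the
    plan/dialogue text already contained in prompt_ids (illustrative)."""
    text_embeds = embed_tokens(prompt_ids)                # (B, T, llm_dim)
    img_embeds = visual_prefix(image_feat)                # (B, k, llm_dim)
    inputs_embeds = torch.cat([img_embeds, text_embeds], dim=1)

    # Prepend ignore-labels so the visual positions contribute no loss.
    pad = torch.full(img_embeds.shape[:2], -100,
                     dtype=labels.dtype, device=labels.device)
    return llm(inputs_embeds=inputs_embeds,
               labels=torch.cat([pad, labels], dim=1)).loss
```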

Architecture: How It Fits Together

To achieve these three tasks without training a massive model from scratch, the researchers use a modular architecture. They combine a pre-trained Large Language Model (LLM) backbone (specifically exploring Llama-2) with a Visual Encoder (CLIP ViT).

The key innovation lies in how these two giants talk to each other. They don’t just concatenate features; they use Task-Specific Projection Layers.

Figure 2: Comprehensive illustration of the MM-PlanLLM architecture, including the 3 training stages employed for model training. *Denotes that the [RET] token embedding representations and the Language Modeling Head of the LLM remain trainable.

As illustrated in Figure 2:

  • For Video Retrieval (Top): The dialogue context and candidate video frames are processed separately. Linear layers (\(W_t\) and \(W_i\)) project both the text [RET] token and the visual frames into a shared “retrieval space.” If the text embedding and the video frame embedding are close in this space, it’s a match.
  • For Step Generation (Bottom): When a user uploads an image, it is encoded by the visual encoder. A projection layer (\(W_c\)) maps this visual embedding into the LLM’s own embedding space. Effectively, the image becomes a word that the LLM can “read.”
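Putting the retrieval path together, a simplified view of what inference might look like when the model emits [RET], reusing the hypothetical `RetrievalHead` sketched earlier. This illustrates the mechanism under stated assumptions; it is not the authors' code:

```python
import torch

@torch.no_grad()
def answer_with_clip(model, tokenizer, retrieval_head, prompt, frame_feats,
                     ret_token="[RET]", max_new_tokens=128):
    """Generate a reply; if it contains [RET], score the candidate video
    frames against the [RET] hidden state and return the best frame index."""
    inputs = tokenizer(prompt, return_tensors="pt")
    seq = model.generate(**inputs, max_new_tokens=max_new_tokens)   # (1, T)
    reply = tokenizer.decode(seq[0], skip_special_tokens=False)

    ret_id = tokenizer.convert_tokens_to_ids(ret_token)
    if not (seq[0] == ret_id).any():
        return reply, None  # plain text answer, no video clip needed

    # Re-encode the full sequence and take the hidden state at the [RET] position.
    hidden = model(seq, output_hidden_states=True).hidden_states[-1]  # (1, T, d)
    ret_pos = (seq[0] == ret_id).nonzero()[-1].item()
    ret_hidden = hidden[:, ret_pos, :]                                # (1, d)

    # frame_feats: (1, num_frames, vis_dim) features for the candidate frames
    scores = retrieval_head(ret_hidden, frame_feats)
    return reply, scores.argmax(dim=-1).item()
```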

The Multistage Training Strategy

You cannot simply throw complex dialogue data at this architecture and hope it learns. The researchers devised a three-stage training process to gradually build the model’s capabilities:

  1. Stage 1: Visual Projection Layers (Bootstrapping): The LLM and Visual Encoder are frozen. Only the linear projection layers are trained, using a large dataset of image-caption pairs (CC3M). This teaches the model the basic correspondence between visual concepts and text.

  2. Stage 2: Task Data Specialization: The model is fine-tuned on domain-specific data—in this case, cooking. They use the Tasty Dataset, which contains recipes annotated with specific video timestamps. This bridges the gap between general images and specific instructional actions (e.g., distinguishing “cutting” from “chopping”).

  3. Stage 3: Multimodal Plan-Grounded Dialogue: Finally, the model is trained on full conversations. Since high-quality, multimodal instructional dialogue data doesn’t exist at scale, the authors generated a synthetic dataset called TastyVidDial. This stage unfreezes the LLM and fine-tunes the whole system to handle interleaved text and visual requests.
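In code, the staged schedule boils down to toggling which parameter groups are trainable. A schematic sketch follows; the module handles and the exact trainable set per stage are assumptions based on the description above and Figure 2:

```python
def set_trainable(module, flag):
    """Freeze or unfreeze all parameters of a module."""
    for p in module.parameters():
        p.requires_grad = flag

def configure_stage(stage, llm, visual_encoder, projection_layers):
    """Stage 1: only the projection layers learn (image-caption pairs, CC3M).
    Stage 2: same trainable set, fine-tuned on recipe/video data (Tasty).
    Stage 3: the LLM backbone is unfrozen for full dialogues (TastyVidDial).
    Per Figure 2, the [RET] embedding and LM head also stay trainable;
    that detail is omitted here for brevity."""
    set_trainable(visual_encoder, False)    # CLIP encoder frozen throughout
    set_trainable(projection_layers, True)  # W_t, W_i, W_c always train
    set_trainable(llm, stage == 3)          # backbone only unfrozen in Stage 3
```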

Experiments and Results

The researchers evaluated MM-PlanLLM against several baselines, including FROMAGe (a state-of-the-art multimodal model) and PlanLLM (a text-only plan-grounded model).

Text-Only Performance

A major concern when adding multimodal capabilities is “catastrophic forgetting”—does the model become dumber at text because it’s focusing on pictures?

Table 1 shows the performance on text-only plan following. MM-PlanLLM achieves a BERTScore of 83.28, within about five points of the specialized text-only PlanLLM (88.66). This indicates that adding multimodal capabilities did not severely degrade its ability to follow text instructions.

Table 1: Instructional plan following generation results, on automatic metrics. PlanLLM results as reported in (Glória-Silva et al., 2024)

Multimodal Performance

When it comes to the new tasks—retrieving videos and analyzing user images—MM-PlanLLM dominates the baselines.

In the Conversational Video Moment Retrieval (CVMR) task, MM-PlanLLM significantly outperformed FROMAGe. This is largely because standard multimodal models are trained on static images and captions, whereas MM-PlanLLM’s training included specific alignment between plan steps and video timestamps.

Multimodal Alignment

One of the most impressive results is how well the model aligns text with specific video moments.

Figure 3 plots the similarity between the text representation of a step and the video frames. The similarity peaks at “Distance 0” (the correct step) and drops off as you move to earlier or later steps. This indicates the model isn’t just guessing; it genuinely learns which part of the video matches the text.

Figure 3: Text-query to visual plan alignment. MM-PlanLLM effectively learns to align textual [RET] token representations with those of the target step frames. We remove outliers for clarity.

Similarly, Figure 4 shows the reverse: when given an image, how close is the generated text to the correct plan step? The vast majority of generated answers align perfectly (Distance 0) or are adjacent steps (Distance 1), showing strong situational awareness.

Figure 4: Image-query to text plan alignment. Most similar plan step to the provided visual input, as measured by BERTScore using the generated answer.
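For readers who want to reproduce this kind of analysis, a hedged sketch using the `bert-score` package might look like the following; the exact evaluation protocol in the paper may differ, and the step-distance convention here is an assumption:

```python
from bert_score import score

def step_distance(generated_answer, plan_steps, true_step_idx):
    """Return the signed distance between the ground-truth step and the plan
    step most similar to the generated answer, by BERTScore F1 (0 = aligned)."""
    candidates = [generated_answer] * len(plan_steps)
    _, _, f1 = score(candidates, plan_steps, lang="en", verbose=False)
    return int(f1.argmax()) - true_step_idx
```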

Ablation Study: Did the Training Stages Matter?

The researchers conducted an ablation study to see if their complex three-stage training was actually necessary.

Table 3 reveals the answer is a resounding “yes.”

  • Stage 1 only: Performance is poor (near random).
  • Stages 1 + 2: A significant jump in capability.
  • Stages 1 + 2 + 3: The model reaches its full potential, particularly in retrieval accuracy (R@1 jumps from 3.45 to 6.72).

This confirms that domain specialization and dialogue-specific training are crucial for this type of task.

Table 3: Impact of the training stages on model performance across the three main tasks.

Qualitative Analysis: Seeing the Model in Action

Numbers are great, but how does it look in practice?

Figure 6 shows the Video Moment Retrieval in action. The left column shows the target step (e.g., “Add pumpkin puree”), and the right shows the frames the model retrieved. Green boxes indicate successes. The model is generally able to filter through a long video and pick out the exact moment an ingredient is added or a technique is performed.

Figure 6: Five examples of CVMR results from the TastyVidDial test set. These examples demonstrate that the model is adept at identifying the key elements that characterize the target frame.

Finally, Figure 7 illustrates a real dialogue. The user asks, “Done, what is the next step?” or “How do I do step 2?”. The model fluidly switches between explaining the step in text and providing a video demonstration, mimicking a helpful human instructor.

Figure 7: Real multimodal dialogues carried out by a volunteer interacting with MM-PlanLLM. These dialogues showcase the model’s ability to carry out full conversations with interleaved multimodal requests.

Conclusion

MM-PlanLLM reflects a growing maturity in the field of Multimodal Large Language Models. By moving beyond simple image captioning and tackling instructional grounding, the authors have created a blueprint for the next generation of AI assistants.

The implications extend far beyond cooking. The same architecture could be applied to fixing a car engine, assembling flat-pack furniture, or guiding a medical student through a procedure. By tightly coupling the “Plan” (text) with the “Reality” (vision), MM-PlanLLM brings us closer to AI that can truly see, guide, and assist in the physical world.

While the model has limitations—such as a relatively small context window and a reliance on synthetic training data—the multistage training approach and the specialized projection layers offer a robust path forward for future research in plan-grounded AI.