If you have ever searched for a tutorial on “how to fix a leaking faucet” or “history of the Roman Senate,” you have likely encountered the “People Also Ask” section on search engines. Increasingly, these suggestions point not just to text articles, but to specific chapters within videos. This feature is incredibly useful, but it presents a massive challenge for Artificial Intelligence: How can a machine watch a video and automatically generate meaningful, deep questions about the specific entities (people, places, concepts) discussed within it?

Most current AI models are good at surface-level observation. They can look at a frame and ask, “What color is the cat?” or “How many people are in the room?” However, they struggle to ask “Entity-centric Information-seeking” (ECIS) questions—the kind that facilitate learning, like “What is the significance of the Julian laws discussed in this chapter?”

In the research paper “ECIS-VQG: Generation of Entity-centric Information-seeking Questions from Videos,” researchers introduce a novel framework to bridge this gap. They propose a new dataset and a sophisticated multimodal architecture that teaches AI to ask the questions that actually matter.

The Problem: AI Asks Shallow Questions

Question Generation (QG) has been studied extensively for text, but Video QG is still in its infancy. Existing approaches usually fall into two traps:

  1. Transcript dependency: They only read the subtitles, ignoring the visual richness of the video.
  2. Visual superficiality: They focus on common objects and attributes (e.g., “What is the person holding?”), which are rarely what a user searching for information cares about.

The researchers define a new goal: ECIS (Entity-centric Information-seeking) questions. These are questions that target specific entities within the video context—such as a specific landmark, a chemical process, or a historical figure—and seek detailed information about them.

Figure 1: An example of the “People Also Ask” (PAA) module on Bing, which displays a question alongside a relevant video thumbnail. The goal of this research is to automate the generation of these high-quality, video-linked questions.

As illustrated in Figure 1, the practical application is clear. If an AI can generate these questions accurately, it can power search engines, educational tools, and video-based chatbots.

Why Current Models Fail

To understand why a new approach is needed, we can look at how traditional models handle video content. In the example below (Figure 2), a traditional QG model asks, “Is the food really cheap?” While grammatically correct, this question is vague. It lacks context.

In contrast, the proposed ECIS system generates: “What is special about outdoor dining area at Khaja Ghar?” This question identifies the entity (“Khaja Ghar”) and seeks specific information (“outdoor dining area”).

Figure 2: A comparison of generated questions. The traditional model (top) asks generic questions, while the ECIS model (bottom) asks specific, entity-rich questions derived from the video content.

The Solution: The ECIS-VQG Architecture

Generating these complex questions requires more than just a language model; it requires a system that can “see,” “read,” and “reason” simultaneously. The researchers developed a pipeline that processes four distinct types of data from a video (see the sketch after this list):

  1. Video Title: The global context.
  2. Chapter Title: The local topic.
  3. Transcript: The spoken content.
  4. Visuals: Frame captions and video embeddings.
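To make these inputs concrete, here is a minimal sketch of how one chapter’s data might be bundled before generation. The class and field names are illustrative assumptions, not the paper’s actual code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ChapterInputs:
    """One example for the question generator (illustrative field names)."""
    video_title: str           # global context, e.g. a travel vlog title
    chapter_title: str         # local topic, e.g. "Khaja Ghar"
    transcript: str            # spoken content (ASR / subtitles) for this chapter
    frame_captions: List[str]  # BLIP captions for sampled frames
    # CLIP video embeddings are kept separately as tensors, not text.

    def to_text_prompt(self) -> str:
        """Flatten the textual fields into a single encoder input string."""
        captions = " ".join(self.frame_captions)
        return (
            f"video title: {self.video_title} | chapter: {self.chapter_title} | "
            f"transcript: {self.transcript} | frames: {captions}"
        )
```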

Step 1: Filtering the Noise

Not every part of a video is worth asking questions about. An “Intro” or “Outro” chapter usually contains filler content. The researchers first developed a Chapter Title Classifier based on BERT. This module categorizes chapter titles into three types (a minimal classifier sketch appears below):

  • Useless (UL): e.g., “Intro,” “Subscribe.”
  • Self-Complete Questions (SCQ): Titles that are already good questions (e.g., “What is a Wormhole?”).
  • Not Self-Complete (NSC): Titles that contain keywords but aren’t full questions (e.g., “Structure of DNA”).

The system discards the “Useless” chapters and passes the informative ones to the generator.
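Below is a minimal sketch of what such a classifier could look like with Hugging Face Transformers. The base checkpoint, label order, and inference code are assumptions for illustration; the paper’s classifier would be fine-tuned on its annotated chapter titles.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumed setup: a BERT encoder with three labels (UL, SCQ, NSC).
LABELS = ["UL", "SCQ", "NSC"]
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(LABELS)
)  # in practice, a checkpoint fine-tuned on labeled chapter titles


def classify_chapter_title(title: str) -> str:
    """Return the predicted category for a chapter title."""
    inputs = tokenizer(title, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return LABELS[logits.argmax(dim=-1).item()]


# "UL" chapters are dropped; "SCQ" titles are already usable questions;
# "NSC" titles are passed on to the question generator.
for title in ["Intro", "What is a Wormhole?", "Structure of DNA"]:
    print(title, "->", classify_chapter_title(title))
```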

Step 2: Multimodal Representation

The core of the architecture is how it handles the “Not Self-Complete” chapters to turn them into full questions. This is where the ECIS Questions Generator comes in.

The architecture (Figure 3) is a masterclass in multimodal fusion. It employs a Transformer-based encoder-decoder (specifically leveraging models like BART and T5).

Figure 3: The complete architecture, showing the input representations, the chapter title classifier, and the Transformer encoder-decoder. Notice how visual inputs (blue) and textual inputs (orange) are processed and fused to generate the final question.
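As a reference point for the text side, here is a minimal sketch of encoder-decoder question generation with an off-the-shelf BART checkpoint. The prompt format is an assumption, and a base (untrained) checkpoint will not produce ECIS-style questions; the snippet only illustrates the generation interface the full model builds on.

```python
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

# Assumed prompt format: concatenate the textual inputs with separators.
source = (
    "video title: Top Street Food in Kathmandu | chapter: Khaja Ghar | "
    "transcript: ... locals love the outdoor seating here ... | "
    "frames: people eating at wooden tables outside a small restaurant"
)
inputs = tokenizer(source, return_tensors="pt", truncation=True, max_length=1024)

# Generate a candidate question with beam search.
output_ids = model.generate(**inputs, max_new_tokens=32, num_beams=4)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```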

Handling Visuals

Raw video frames are heavy and noisy. To make them useful for the language model, the researchers use a clever two-pronged approach (sketched in code below):

  1. Captioning & Summarization: They extract frames and use a model called BLIP to generate captions for them. Since raw captions can be disjointed, they use GPT-3.5-Turbo to summarize the frame captions and the transcript into a coherent paragraph.
  2. Embeddings: They use CLIP, a model trained to understand images and text together, to create vector embeddings of the video clips.
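A simplified sketch of this visual pipeline using public BLIP and CLIP checkpoints is shown below. The specific checkpoints and frame-sampling strategy are assumptions; the paper additionally condenses the captions and transcript with GPT-3.5-Turbo before generation.

```python
import torch
from PIL import Image
from transformers import (
    BlipForConditionalGeneration,
    BlipProcessor,
    CLIPModel,
    CLIPProcessor,
)

blip_proc = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")


def caption_frame(frame: Image.Image) -> str:
    """Generate a BLIP caption for a single sampled video frame."""
    inputs = blip_proc(images=frame, return_tensors="pt")
    out = blip.generate(**inputs, max_new_tokens=30)
    return blip_proc.decode(out[0], skip_special_tokens=True)


def embed_frames(frames: list) -> torch.Tensor:
    """CLIP image embeddings for the sampled frames (one vector per frame)."""
    inputs = clip_proc(images=frames, return_tensors="pt")
    with torch.no_grad():
        return clip.get_image_features(**inputs)

# The frame captions and transcript are then summarized into one coherent
# paragraph by an LLM, while the CLIP embeddings enter the generator
# through cross-attention (see the next subsection).
```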

Fusion Mechanism

The model doesn’t just concatenate these inputs. It uses a Cross-Attention mechanism. The text tokens (from the transcript and titles) act as the “query,” while the video embeddings act as the “key” and “value.” This allows the model to “pay attention” to specific visual features that are relevant to the current text being processed.
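A minimal PyTorch sketch of this kind of text-to-video cross-attention is shown below. The dimensions, projection layer, and residual combination are illustrative assumptions rather than the paper’s exact fusion module.

```python
import torch
import torch.nn as nn


class TextVideoCrossAttention(nn.Module):
    """Fuse text encoder states with video embeddings via cross-attention.

    Text tokens act as queries; video embeddings act as keys and values.
    """

    def __init__(self, text_dim: int = 768, video_dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, text_dim)  # align video dim to text dim
        self.cross_attn = nn.MultiheadAttention(text_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(text_dim)

    def forward(self, text_states: torch.Tensor, video_embeds: torch.Tensor) -> torch.Tensor:
        # text_states: (batch, text_len, text_dim); video_embeds: (batch, n_clips, video_dim)
        video = self.video_proj(video_embeds)
        attended, _ = self.cross_attn(query=text_states, key=video, value=video)
        # Residual connection keeps the original textual signal intact.
        return self.norm(text_states + attended)


fusion = TextVideoCrossAttention()
text_states = torch.randn(2, 128, 768)  # e.g. BART encoder outputs
video_embeds = torch.randn(2, 16, 512)  # e.g. CLIP embeddings for 16 clips
fused = fusion(text_states, video_embeds)
print(fused.shape)  # torch.Size([2, 128, 768])
```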

Step 3: Contrastive Loss

A major innovation in this paper is the training objective. Standard models use Cross-Entropy Loss, which simply encourages the model to predict the correct next word. However, this often leads to generic, “safe” questions.

To force the model to be specific, the researchers added a Contrastive Loss. They pair the generated question with a “negative” example—a generic, non-entity-centric question (e.g., “What is in the image?”). The model is penalized if its output is too similar to the generic question and rewarded if it is distinct.
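One common way to express such an objective is a margin-based loss over sentence embeddings, sketched below. The exact formulation, the choice of embeddings, and the loss weighting in the paper may differ; this is an illustrative assumption.

```python
import torch
import torch.nn.functional as F


def contrastive_question_loss(
    generated_emb: torch.Tensor,  # embedding of the generated question
    reference_emb: torch.Tensor,  # embedding of the gold ECIS question
    negative_emb: torch.Tensor,   # embedding of a generic question, e.g. "What is in the image?"
    margin: float = 0.5,
) -> torch.Tensor:
    """Pull the generation toward the reference and away from the generic negative."""
    pos_sim = F.cosine_similarity(generated_emb, reference_emb, dim=-1)
    neg_sim = F.cosine_similarity(generated_emb, negative_emb, dim=-1)
    # Margin formulation: the loss is zero once the positive similarity
    # beats the negative similarity by at least `margin`.
    return torch.clamp(margin - pos_sim + neg_sim, min=0.0).mean()

# The overall training objective would then combine the two terms, e.g.
#   loss = cross_entropy_loss + lambda_c * contrastive_question_loss(...)
# where the weighting lambda_c is an assumption here, not the paper's value.
```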

The Data: VIDEOQUESTIONS Dataset

One of the biggest hurdles in this field was the lack of data. Existing datasets focus on movies or daily activities, not information-heavy content. To solve this, the authors curated VIDEOQUESTIONS, a dataset comprising 411 YouTube videos from categories like Education, Science & Technology, and Travel.

They manually annotated over 2,200 questions, ensuring they were entity-centric. The data analysis (Figure 4) shows that while chapter titles are short, the transcripts and frame captions provide the dense context necessary for deep question generation.

Figure 4: Length distribution of chapter titles, frame captions, video titles, and transcripts. Transcripts (cyan) and frame captions (orange) provide the bulk of the textual data, scaling with video duration.

Experiments and Results

The researchers compared their method (specifically variations of BART and T5) against several baselines, including standard T5 models and large language models (LLMs) like Alpaca, GPT-3.5, and even GPT-4o.

Quantitative Success

The results were compelling. As shown in Table 1, the proposed model (BART trained with Contrastive Loss and multimodal inputs, Row E) outperformed the others across major metrics such as BLEU, ROUGE, and METEOR.

Table 1: Main results. The best-performing model (Row E) uses BART with Cross-Entropy + Contrastive Loss (CC), summarized inputs, and CLIP embeddings (\(E_C\)).

Key takeaways from the results:

  • Contrastive Loss Works: Models trained with the combined loss function (CC) consistently beat those trained with cross-entropy alone.
  • Multimodality Matters: Adding video embeddings (using CLIP) provided a statistically significant boost, proving that the model is actually utilizing the visual information, not just reading the transcript.
  • Summarization Helps: Using GPT-3.5 to clean up the noisy frame captions and transcripts before feeding them into the generator improved performance.
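For readers who want to score their own generations against reference questions, the standard metrics can be computed with the Hugging Face evaluate library, as in the small sketch below. The example strings are illustrative, not taken from the dataset.

```python
import evaluate

predictions = ["What is special about the outdoor dining area at Khaja Ghar?"]
references = ["What makes the outdoor dining area at Khaja Ghar special?"]

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")
meteor = evaluate.load("meteor")

print("BLEU:", bleu.compute(predictions=predictions, references=references))
print("ROUGE:", rouge.compute(predictions=predictions, references=references))
print("METEOR:", meteor.compute(predictions=predictions, references=references))
```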

Qualitative Analysis

Numbers tell only half the story. Looking at the actual questions generated helps visualize the improvement.

In Table 2 (below), we see the difference between a “Non-ECIS” question (likely from a standard model) and the “ECIS Generated Question.” For a video about plants, the generic model asks, “What will we do with the roots?” The ECIS model asks, “Do you have to cut the branches from the root of Syngonium Cuttings?” The inclusion of the entity “Syngonium Cuttings” makes the question significantly more valuable for search and retrieval.

Table 2: Comparison of generic questions vs. those generated by the proposed system. The ECIS questions are specific and self-complete.

The authors also conducted human evaluations, asking experts to rate questions on Context Relevance, Engagement, and Fluency. The proposed BART model achieved the highest scores, verifying that the generated questions are not just mathematically similar to the ground truth, but actually read better to humans.

Conclusion

The ECIS-VQG paper marks a significant step forward in how AI interacts with video content. By moving away from generic object recognition and towards entity-centric understanding, the researchers have paved the way for smarter video search and educational tools.

Three innovations defined their success:

  1. A new problem definition that prioritizes information-seeking over simple description.
  2. A curated dataset of real-world, information-heavy YouTube videos.
  3. A robust architecture that fuses visual and textual data while using contrastive loss to avoid generic outputs.

As video content continues to dominate the internet, the ability for AI to “watch” a video and ask the right questions will be essential for making that content accessible and searchable. This research brings us one step closer to that reality.