When we watch a movie, we don’t just see a sequence of images; we see a story told through a specific language. A “low-angle shot” makes a character look powerful. A “smash cut” creates sudden shock. A “match cut” draws a thematic connection between two different times or places. As humans, we intuitively understand this visual grammar.
However, for Video Large Language Models (Vid-LLMs)—the AI systems designed to understand video content—this “grammar” of film has largely been a foreign language. While modern AI has become exceptionally good at identifying what is happening in a video (e.g., “a man is running”), it has historically struggled to understand how the video is constructed (e.g., “the camera is tracking the man with a handheld shake to imply urgency”).
This gap in understanding is what a new research paper, “VEU-Bench: Towards Comprehensive Understanding of Video Editing,” aims to bridge. The researchers introduce a comprehensive benchmark to test AI on video editing concepts, a sophisticated pipeline to generate training data, and a new expert model named Oscars.
In this post, we will dissect this paper to understand why Video Editing Understanding (VEU) is the next frontier for computer vision, how the researchers built a dataset to teach it, and the surprising ways that learning film technique actually helps AI understand reality better.
The Problem: Nouns vs. Verbs in Video
To understand why current AI models struggle with video, we can use a linguistic analogy. Current video models are excellent at recognizing “nouns”—the objects, people, and settings within a frame. They are also getting better at “verbs,” the actions unfolding inside a scene.
But video editing introduces a different layer of verbs and adjectives: the techniques used to present the content. These include:
- Shot Attributes: Angles, sizes, and camera movement.
- Cuts: The harsh or smooth breaks between shots.
- Transitions: The visual effects bridging scenes.
Previous benchmarks focused mainly on simple classification—asking a model, “Is this a close-up?” But true understanding requires more. It requires reasoning (identifying why a shot changed) and judging (evaluating the artistic function of a cut).
Because editing elements are abstract concepts derived from specialized techniques (you can’t “see” a cut in the real world; it is an artificial construct), they require a higher level of abstract reasoning. This is where VEU-Bench comes in.
Introducing VEU-Bench
The researchers propose VEU-Bench (Video Editing Understanding Benchmark), a framework designed to evaluate models across a wide spectrum of editing tasks. Unlike previous attempts that looked at isolated elements, VEU-Bench is holistic.
The Three-Level Hierarchy
As illustrated in the overview below, the benchmark spans 10 dimensions, organized into three ascending levels of cognitive difficulty.

The structure is vital for understanding the scope of the research:
- Recognition (The “What”):
  - Can the model identify the shot type? (e.g., “This is an over-the-shoulder shot.”)
  - This is the most basic level, usually formatted as multiple-choice questions.
- Reasoning (The “Why” and “How”):
  - This level requires the model to provide evidence. It asks questions about dynamic changes.
  - Example: “Explain the camera motion.”
  - Answer: “The camera tilts vertically upward from the woman’s mouth to her eyes…”
- Judging (The “Intent”):
  - This is the most advanced level. It asks the model to interpret the function of an edit in the context of the story.
  - Example: “What is the function of this cut?”
  - Answer: “This emphasizes the reaction of the character.”
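To make the hierarchy concrete, the sketch below shows what QA samples at each level could look like as structured data. The field names and schema are illustrative guesses rather than the benchmark’s actual format; the questions and answers reuse the examples above.

```python
# Illustrative only -- not the paper's actual schema. Rough shape of
# VEU-Bench-style samples at each cognitive level.
recognition_sample = {
    "level": "recognition",
    "dimension": "shot_type",
    "question": "What type of shot is this?",
    "options": ["over-the-shoulder", "point-of-view", "insert", "two-shot"],
    "answer": "over-the-shoulder",            # multiple choice
}

reasoning_sample = {
    "level": "reasoning",
    "dimension": "shot_motion",
    "question": "Explain the camera motion in this shot.",
    "answer": "The camera tilts vertically upward from the woman's mouth "
              "to her eyes.",                 # open-ended, evidence-based
}

judging_sample = {
    "level": "judging",
    "dimension": "cut_type",
    "question": "What is the function of this cut?",
    "answer": "It emphasizes the reaction of the character.",  # interpretive
}
```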
The 10 Dimensions of Editing
The benchmark categorizes video editing components into three physical categories:
- Intra-frame (Single Frame): Features visible in a still image, such as Shot Size, Shot Angle, Shot Location, Shot Subject, Shot Type, and Shot Color.
- Intra-shot (Temporal): Features that change over time within a continuous shot, specifically Shot Motion (pan, tilt, zoom) and Shot Speed (slow motion, timelapse).
- Inter-shot (Between Scenes): The relationship between two different shots, covering Cut Types (match cut, jump cut) and Transitions (wipes, dissolves).
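For reference, the full taxonomy can be restated as a small data structure. Everything below comes directly from the list above; the snake_case identifiers are my own paraphrasing, not the paper’s labels.

```python
# The 10 editing dimensions, grouped by the three categories described above
# (identifiers paraphrased for illustration).
VEU_DIMENSIONS = {
    "intra_frame": [          # visible in a single still frame
        "shot_size", "shot_angle", "shot_location",
        "shot_subject", "shot_type", "shot_color",
    ],
    "intra_shot": [           # change over time within one continuous shot
        "shot_motion",        # pan, tilt, zoom
        "shot_speed",         # slow motion, timelapse
    ],
    "inter_shot": [           # relationship between two different shots
        "cut_type",           # match cut, jump cut
        "transition",         # wipes, dissolves
    ],
}

assert sum(len(v) for v in VEU_DIMENSIONS.values()) == 10
```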
Constructing the Dataset
Creating a benchmark for such abstract concepts is difficult. You cannot simply scrape the internet and expect high-quality labels for “match cuts” or “narrative emphasis.”
The researchers curated a dataset of 30,000 videos and roughly 50,000 QA samples. They sourced raw videos from existing datasets like AVE, MovieCuts, and AutoTransition, but realized that existing labels were insufficient for deep reasoning.

As shown in the statistics above, the dataset covers a wide distribution of video durations (mostly short clips suitable for AI context windows) and diverse answer lengths. But the real innovation lies in how they generated the complex reasoning and judging annotations without relying solely on expensive human experts or potentially hallucinating AI.
The Ontology-Based Annotation Pipeline
To solve the data quality issue, the authors developed a semi-automated pipeline that combines a Knowledge Base with Multimodal Large Language Models (MLLMs).

Here is how the pipeline works, step-by-step:
- Knowledge Base Construction: The researchers built a professional knowledge base derived from film editing textbooks. This database defines abstract patterns. For example, it defines a “Match Cut” as connecting two similarly shaped objects across frames.
- Attribute Selection: For a specific video clip, the system identifies the relevant editing tag (e.g., “Match Cut”).
- MLLM Rewriting: This is the clever part. An MLLM is given the video and the abstract definition from the Knowledge Base. It is then tasked with rewriting the abstract definition to fit the specific video content.
  - Abstract Rule: “Connects similarly shaped objects.”
  - Video Specific: “The cut connects the circular shape of the bone thrown in the air to the circular shape of the spaceship in orbit.”
This “rewriting” strategy ensures the generated answers are both theoretically correct (based on the knowledge base) and contextually accurate (based on the video).
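In code terms, the flow can be pictured roughly as follows. The knowledge-base contents, function names, and prompt wording are all hypothetical; only the overall look-up-then-rewrite flow is taken from the paper’s description.

```python
# A hedged sketch of the ontology-based annotation flow. The knowledge base,
# helpers, and prompt text are hypothetical illustrations of the three-step
# process described above, not the authors' implementation.

KNOWLEDGE_BASE = {
    # Abstract definitions distilled from film-editing textbooks.
    "match_cut": "A cut that connects two similarly shaped objects across frames.",
    "jump_cut":  "A cut between two shots of the same subject that breaks continuity.",
}

def build_rewrite_prompt(tag: str, definition: str) -> str:
    """Ask an MLLM to ground an abstract definition in a specific clip."""
    return (
        f"This video contains a '{tag}'. "
        f"Its textbook definition is: {definition} "
        "Rewrite this definition so that it describes the concrete objects "
        "and motion visible in this specific video."
    )

def annotate_clip(video_path: str, editing_tag: str, mllm_call) -> str:
    # Steps 1-2: select the relevant editing tag and fetch its abstract definition.
    definition = KNOWLEDGE_BASE[editing_tag]
    # Step 3: MLLM rewriting -- the model receives the video plus the definition
    # and returns a video-specific explanation (e.g., bone -> spaceship).
    prompt = build_rewrite_prompt(editing_tag, definition)
    return mllm_call(video=video_path, prompt=prompt)
```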
Measuring Success: A New Scoring System
How do you grade an AI on an open-ended question like “Why was this transition effective?” Standard metrics like exact word matching fail here.
The researchers adopted a method using GPT-4 as a judge, but they added a twist to prevent the judge from being biased toward flowery language. They introduced a scoring equation that balances two factors:
- Pattern Matching (PM): Does the answer align with the professional definition in the ontology?
- Information Matching (IM): Does the answer correctly identify the specific visual objects in the video?
The matching score is computed from these two factors. To get the overall score for open-ended tasks (\(S_{oe}\)), the researchers then combine accuracy (did the model get the main category right?) with the match score.
This rigorous scoring system ensures that a model only gets high marks if it understands both the film theory and the video content.
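The paper’s exact equations are not reproduced in this post, but as rough intuition for how the pieces fit together, a simple combination consistent with the description above might look like the following (an illustrative reconstruction, not the authors’ formula):

```latex
% Illustrative reconstruction only -- not the paper's exact formulation.
% S_match rewards answers that align with the ontology (PM) *and* are
% grounded in the video's actual content (IM); S_oe then folds in whether
% the main category was identified correctly (Acc).
S_{\text{match}} = \frac{\text{PM} + \text{IM}}{2},
\qquad
S_{oe} = \frac{\text{Acc} + S_{\text{match}}}{2}
```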
Experiments: How Do Current Models Perform?
The researchers tested 11 state-of-the-art Video LLMs, including open-source heavyweights like LLaVA-Video and proprietary giants like GPT-4o and Gemini-1.5-Pro.
The results, visualized in the radar chart below, were stark.

As you can see, most models (the cluster of shapes in the center) have very limited coverage. They perform decently on simple tasks like identifying the Subject or Location, but their performance collapses on technical dimensions like Transitions, Cuts, and Speed.
Some key takeaways from the baseline evaluation:
- Random Guessing: In complex categories like Shot Motion, some models performed worse than random chance.
- Reasoning Gap: Models struggled significantly more with Reasoning and Judging than with simple Recognition. They might guess a shot is a “Close-up,” but they cannot explain why.
Enter “Oscars”: The Expert Model
To prove that this capability can be learned, the researchers fine-tuned a model they call Oscars (named after the Academy Awards). They utilized the Qwen2-VL-7B model as a base and trained it on the VEU-50K dataset they created.
The results were transformative.

Looking at Table 2:
- Oscars (the far-right column) dominates the board.
- It surpasses its base model (Qwen2-VL-7B) by 39.6%.
- Remarkably, it outperforms the commercial giant Gemini-1.5-Pro by 4% and achieves performance comparable to GPT-4o.
- The improvement is most visible in the hardest categories: Reasoning regarding Cuts and Transitions saw massive jumps in accuracy.
This demonstrates that the “blindness” of current AI to video editing isn’t a permanent limitation of the technology—it’s simply a lack of specialized training data.
The “Ripple Effect”: Does Editing Help General Understanding?
One of the most fascinating findings of the paper appears in the secondary experiments. The researchers asked: If we teach a model to understand cuts and camera moves, does it get better at understanding general video content?
The answer is yes.

Table 3 (top of the image above) shows Oscars’ performance on general video benchmarks (VideoMME, MVBench, TempCompass) that have nothing to do with editing theory.
- Attribute Perception: +7.3%
- Temporal Order: +8.5%
- Unexpected Action: +6.0%
By forcing the model to pay attention to cuts and camera motion, the model seemingly developed a better grasp of time and cause-and-effect. Understanding that a “cut” changes the scene helps the model realize that the action has shifted, preventing it from confusing events happening in different locations.
The Role of Prompt Engineering
The researchers also analyzed how to best ask these questions. They tested Simple Prompts, Context Prompts (which include definitions), and Guided Prompts.

The ablation study reveals that Context Prompts (the orange bars) generally provide a significant boost, especially for smaller open-source models like VideoLLaMA2. This suggests that even if a model hasn’t been fine-tuned, providing it with the “glossary” of film terms in the prompt can unlock better performance.
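To make the distinction concrete, here is roughly what the three prompt styles could look like for a cut-type question. The wording is illustrative and not copied from the paper.

```python
# Illustrative prompt styles (wording is hypothetical, not the paper's).
question = "What type of cut connects these two shots?"

simple_prompt = question

# Context prompt: prepend the "glossary" of relevant film terms.
context_prompt = (
    "Definitions: A match cut connects two similarly shaped objects across "
    "frames. A jump cut breaks continuity between two shots of the same "
    "subject.\n"
    + question
)

# Guided prompt: additionally steer the model through the reasoning steps.
guided_prompt = (
    context_prompt
    + "\nFirst describe what is visible at the end of the first shot and the "
      "start of the second, then choose the cut type that best explains the "
      "relationship."
)
```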
Validating the Judge
Finally, to ensure their automated scoring system was fair, the researchers compared their LLM-based scores against human evaluation.

The scatter plots show a strong positive correlation (a coefficient of 0.86) between the automated scores and human judgment when using the full Pattern Matching + Information Matching method (Left). This validates that VEU-Bench is a reliable proxy for human-level evaluation.
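Reproducing this kind of sanity check is straightforward in principle: collect paired automated and human scores and compute a correlation coefficient. The snippet below is a generic sketch (Pearson correlation on placeholder data), not the authors’ evaluation code.

```python
# Generic sketch: correlate automated judge scores with human ratings.
# (Placeholder data; the real study would use per-sample scores.)
from scipy.stats import pearsonr

auto_scores  = [0.82, 0.40, 0.95, 0.61, 0.73]   # PM + IM based judge scores
human_scores = [0.80, 0.35, 0.90, 0.65, 0.70]   # expert ratings, same samples

r, p_value = pearsonr(auto_scores, human_scores)
print(f"Pearson correlation: {r:.2f} (p={p_value:.3g})")
```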
Conclusion
The VEU-Bench paper marks a significant step forward in multimodal AI. It highlights that true video understanding goes beyond recognizing objects; it requires understanding the syntax of the visual medium.
By treating video editing as a structured language with nouns (shots), verbs (cuts), and grammar (transitions), the researchers have shown that:
- Current models are largely “illiterate” in film language.
- We can automate the creation of high-quality textbooks (datasets) to teach them.
- Models like Oscars can master this language, rivaling proprietary giants.
- Most importantly, learning this abstract language improves the model’s ability to understand general reality.
As Vid-LLMs continue to evolve, benchmarks like VEU-Bench will be essential in moving us from AI that simply “watches” video to AI that truly “comprehends” it.