We’ve all seen the incredible leaps in AI-powered video generation. Models like Sora can turn a simple text prompt into a stunning, photorealistic clip. But what happens when you need to create a video that doesn’t just look good, but actually teaches something?

Think about the educational videos you see on YouTube channels like 3Blue1Brown—they are packed with precise animations, clear formulas, and a logical flow that guides you through complex topics. These videos don’t just entertain; they build and reinforce knowledge step-by-step.

Current video generation models, which operate in pixel space, struggle with this. They excel at producing aesthetic textures and fluid motion, but they lack the fine-grained control essential for structured educational content. Text often comes out garbled, diagrams can be inconsistent, and the logical progression of a lesson is nearly impossible to enforce.

The authors of CODE2VIDEO, from the National University of Singapore, propose a radical rethinking of how such videos should be produced: generate code that renders the video instead of generating pixels directly. This code-centric approach offers the controllability, interpretability, and scalability that educational videos demand.

In this article, we’ll explore their groundbreaking system—Code2Video—an agent-based framework that uses Large Language Models (LLMs) to write the Python code that renders high-quality educational videos. We’ll also unpack MMMC, a new benchmark designed to evaluate not just a video’s visual appeal, but its ability to teach effectively.



Figure 1: Overview of Code2Video. The system uses three agents—Planner, Coder, Critic—to turn a learning topic into an educational video, evaluated on the MMMC benchmark for efficiency, aesthetics, and knowledge transfer.


The Problem with Pixels for Pedagogy

Standard text-to-video models (diffusion-based or autoregressive) work by predicting frames or denoising latent representations—synthesizing videos pixel by pixel. While this suffices for short, visually pleasing clips, it breaks down for educational content:

  • Temporal Coherence: An effective lesson follows a narrative arc—introducing concepts, illustrating with examples, and scaffolding toward more advanced ideas. Pixel-based generators have no inherent grasp of pedagogical progression.
  • Spatial Clarity: Visual precision matters. In math tutorials, a single formula obscured by an animation or illegible text undermines comprehension. Pixel models often fail at precise object placement and crisp text rendering.
  • Controllability & Editability: Need to adjust one number in a formula or tweak animation timing? In pixel-generated outputs, this requires regenerating the entire clip—risking loss of quality and consistency.

Code solves these problems. Every position, timing, and element is explicitly defined. The video is re-rendered from code with predictable changes, lending both structure and flexibility.
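
To make this concrete, below is a minimal sketch of what “video as code” looks like, written for the Manim Community edition (the same family of library behind 3Blue1Brown’s animations). The scene and topic are illustrative and not taken from the paper; the point is simply that every position, every duration, and the order of steps live in plain source.

```python
# A minimal "video as code" sketch using the Manim Community edition.
# The topic and layout are illustrative; editing any line below and
# re-rendering changes exactly that element and nothing else.
from manim import DOWN, UP, FadeIn, MathTex, Scene, Text, Write

class PythagorasIntro(Scene):
    def construct(self):
        # Title pinned explicitly to the top edge of the canvas.
        title = Text("The Pythagorean Theorem").to_edge(UP)

        # The formula is typeset exactly; no risk of garbled glyphs.
        formula = MathTex(r"a^2 + b^2 = c^2")

        # A caption anchored directly below the formula.
        caption = Text("Holds for every right triangle").scale(0.6)
        caption.next_to(formula, DOWN)

        # Timing is explicit: 1.5 s for the title, 2 s for the formula.
        self.play(Write(title), run_time=1.5)
        self.play(Write(formula), run_time=2)
        self.play(FadeIn(caption))
        self.wait(2)
```

Rendering with `manim -pql pythagoras.py PythagorasIntro` produces the clip; changing the formula string or a `run_time` value and re-rendering yields exactly the intended edit, with everything else untouched.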


MMMC and TeachQuiz: Evaluating Educational Effectiveness

Before diving into how Code2Video works, we need to understand how success is measured. Traditional metrics for video quality fall short for educational material—clarity, logical sequencing, and knowledge transfer matter more than cinematic flair.

MMMC (Massive Multi-discipline Multimodal Coding) is the benchmark introduced in this work. It comprises:

  • 117 professionally produced educational videos from 3Blue1Brown, across 13 subjects (calculus, physics, topology, etc.)
  • All videos use Manim, a powerful Python library for mathematical animation, making them ideal references for code-generated output.

Figure 2: MMMC spans diverse subject categories and carefully curated learning topics of varying lengths, ensuring evaluations are comprehensive.

Evaluation involves three dimensions:

  1. Efficiency: Average generation time and token consumption.
  2. Aesthetics (VLM-as-a-Judge): A Vision-Language Model assesses videos on five criteria—Element Layout (EL), Attractiveness (AT), Logic Flow (LF), Visual Consistency (VC), and Accuracy & Depth (AD)—each on a 100-point scale (a minimal sketch of such a rubric-based judge appears after this list).
  3. TeachQuiz: The most novel metric—quantifying knowledge transfer.
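
As a concrete illustration of the second dimension, here is a hypothetical sketch of rubric-based VLM-as-a-Judge scoring. The `query_vlm` callable, the JSON reply format, and the rubric wording are assumptions for illustration; the paper’s actual judging prompts are not reproduced here.

```python
# Hypothetical sketch of VLM-as-a-Judge aesthetics scoring.
# `query_vlm` is an assumed helper that sends sampled video frames plus a
# text prompt to a vision-language model and returns its text reply.
import json

AESTHETIC_CRITERIA = {
    "EL": "Element Layout: are elements well placed and non-overlapping?",
    "AT": "Attractiveness: is the video visually engaging?",
    "LF": "Logic Flow: do scenes follow a coherent pedagogical order?",
    "VC": "Visual Consistency: are style and notation stable across scenes?",
    "AD": "Accuracy & Depth: is the content correct and sufficiently deep?",
}

def judge_aesthetics(frames, query_vlm):
    """Ask the VLM to rate each criterion on a 0-100 scale; return a score dict."""
    rubric = "\n".join(f"- {key}: {desc}" for key, desc in AESTHETIC_CRITERIA.items())
    prompt = (
        "You are judging an educational video from its sampled frames.\n"
        f"Rate each criterion from 0 to 100:\n{rubric}\n"
        'Reply with JSON only, e.g. {"EL": 85, "AT": 70, "LF": 90, "VC": 80, "AD": 75}.'
    )
    reply = query_vlm(frames=frames, prompt=prompt)  # hypothetical VLM call
    return json.loads(reply)
```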


Figure 3: TeachQuiz isolates the educational value of a video by comparing scores before and after learning from it.

TeachQuiz: Forcing the Model to Learn from the Video

TeachQuiz works in two stages:

  1. Unlearning: The VLM is prompted to “forget” a target concept—blocking prior knowledge—and answer a multiple-choice quiz. Scores drop sharply.
  2. Learning-from-Video: The unlearned model watches the AI-generated educational video, then retakes the quiz using only the video’s content.

The TeachQuiz score is the gain:

\[ \widetilde{S}(\mathcal{V}) = S(\mathcal{V}) - S(\mathcal{V} \mid \text{unlearn}) \]

This isolates the video’s teaching power, independent of pre-existing knowledge. Higher scores mean better instructional quality.
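
In code, the metric is just a difference between two quiz accuracies taken under the unlearning instruction, once without and once with the video. The sketch below is a hypothetical stand-in for the paper’s quiz-answering and grading pipeline, with `take_quiz` as an assumed callable:

```python
def teachquiz(take_quiz, quiz, video):
    """Compute the TeachQuiz gain S~(V) = S(V) - S(V | unlearn).

    `take_quiz(quiz, unlearn, reference)` is a hypothetical callable: it has the
    VLM answer the multiple-choice quiz and returns accuracy in [0, 100].
    `unlearn=True` prepends the "forget this concept" instruction, and
    `reference` optionally supplies the generated video as the only allowed
    source of knowledge.
    """
    # Stage 1: unlearned baseline, with no video available.
    baseline = take_quiz(quiz, unlearn=True, reference=None)
    # Stage 2: the unlearned model answers again after watching the video.
    after_learning = take_quiz(quiz, unlearn=True, reference=video)
    # The difference isolates how much knowledge the video itself transferred.
    return after_learning - baseline
```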


The Code2Video Framework: Three Agents, One Goal

Figure 4: The complete Code2Video pipeline, orchestrated by the Planner, Coder, and Critic agents, turns a topic query into a pedagogically coherent, visually clear educational video.

1. The Planner – Designing the Lesson Blueprint

The Planner acts as the instructional architect:

  • Outline Generation: Breaks the topic into logically ordered sections, tailored to the audience.
  • Storyboard Construction: Expands the outline into lecture lines paired with animation descriptions.
  • External Database Access: Retrieves reference images and visual assets for clarity and consistency—cached for reuse.

Output: A structured teaching plan ensuring logical flow and visual cohesion.
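
The paper does not publish the plan’s exact schema, but conceptually it is structured data: an ordered outline plus, for each section, lecture lines paired with animation descriptions and any retrieved assets. A hypothetical example:

```python
# Hypothetical sketch of a Planner output; field names and values are
# illustrative, not the authors' exact schema.
teaching_plan = {
    "topic": "Introduction to the Fourier Transform",
    "audience": "undergraduate, first exposure",
    "outline": [
        "What problem does the Fourier transform solve?",
        "Decomposing a signal into rotating vectors",
        "The transform as a formula",
    ],
    "storyboard": [
        {
            "section": "Decomposing a signal into rotating vectors",
            "lecture_line": "Any periodic signal can be rebuilt from circular motions.",
            "animation": "Draw a square wave, then overlay the sum of three rotating phasors.",
            "assets": ["epicycles_reference.png"],  # retrieved from the external database
        },
        # ...one entry per lecture line, in teaching order
    ],
}
```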


2. The Coder – Engineering the Animations

The Coder turns the storyboard into executable Manim code.

Challenges: LLMs rarely produce perfect, runnable code in one shot. Syntax and runtime errors can derail rendering.

Solutions:

  • Parallel Code Generation: Builds code for each section independently—speeding generation and isolating errors.
  • ScopeRefine Debugging: A hierarchical fix strategy (sketched in code after this list):
    • Line Scope: Attempt minimal, local fixes.
    • Block Scope: Expand to relevant code block if needed.
    • Global Scope: Regenerate section only as a last resort.
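
Here is a hedged sketch of that escalation logic; `render` and `ask_llm_to_fix` are hypothetical stand-ins for the real rendering and repair calls, and the attempt budget is illustrative:

```python
def scope_refine(section_code, render, ask_llm_to_fix, max_attempts=3):
    """Escalate from line-level to block-level to whole-section repairs."""
    for scope in ("line", "block", "global"):      # cheapest scope first
        for _ in range(max_attempts):
            ok, error = render(section_code)       # try to render this section
            if ok:
                return section_code
            # "line": patch only the line implicated by the traceback;
            # "block": rewrite the enclosing code block;
            # "global": regenerate the whole section as a last resort.
            section_code = ask_llm_to_fix(section_code, error, scope=scope)
    return section_code  # best effort after exhausting the repair budget
```

Because each storyboard section compiles to its own scene, sections can be generated and repaired concurrently (for example with Python’s `concurrent.futures`), which is what keeps wall-clock time manageable, as the efficiency ablation later shows.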

3. The Critic – Refining the Visual Layout

Even bug-free code can yield flawed visuals: overlapping elements, occluded text, or an unbalanced layout.


Figure 5: Visual Anchor Prompt discretizes the canvas into a grid system, enabling precise spatial guidance.

Visual Anchor Prompt: Transforms continuous positioning into discrete anchors on a 6×6 grid (see the sketch after the list below). The Critic uses:

  • Occupancy Table: Tracks element positions and scales.
  • Actionable Grid Instructions: e.g., “Move cat icon from D2 to B2.”
  • Iterative feedback loop with the Coder—ensuring final layout clarity and legibility.
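
To see why discretization helps, here is a hypothetical sketch of the anchor grid and the feedback format. The 6×6 layout follows the paper, but the A1–F6 labels, the occupancy representation, and the instruction wording are illustrative assumptions:

```python
# Hypothetical sketch of a Visual Anchor grid and Critic feedback.
COLS, ROWS = "ABCDEF", range(1, 7)
# Anchor name -> center of its cell, on a canvas normalized to [0, 1] x [0, 1].
ANCHORS = {
    f"{col}{row}": ((i + 0.5) / 6, (j + 0.5) / 6)
    for i, col in enumerate(COLS)
    for j, row in enumerate(ROWS)
}

def find_conflicts(occupancy):
    """occupancy: {element_name: anchor}. Return anchors claimed by 2+ elements."""
    cells = {}
    for name, anchor in occupancy.items():
        cells.setdefault(anchor, []).append(name)
    return {anchor: names for anchor, names in cells.items() if len(names) > 1}

def critic_instruction(element, src, dst):
    """Actionable feedback the Coder can translate directly into a position edit."""
    return f"Move {element} from {src} to {dst}."

# Example: a formula and an icon both occupy cell D2, so the Critic asks for
# the icon to be relocated to the free cell B2.
occupancy = {"sum formula": "D2", "cat icon": "D2", "title": "A1"}
for anchor, names in find_conflicts(occupancy).items():
    print(critic_instruction(names[1], anchor, "B2"))  # Move cat icon from D2 to B2.
```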

Results: Code vs. Pixels

Table 1: Compared against human-made videos, pixel-based models, and direct code generation across efficiency, aesthetics, and TeachQuiz, Code2Video substantially outperforms pixel-based baselines and approaches human quality on key metrics.

Key Findings:

  • Pixel-Based Diffusion Models: Near-zero scores on Logic Flow and TeachQuiz; outputs incoherent for teaching.
  • Direct Manim Code Generation: Large gains—validates the code-centric approach.
  • Code2Video’s Agentic Framework: Significant boost over direct code generation—up to 50% higher aesthetic scores, 46% better TeachQuiz with Claude Opus 4.1.
  • Closing the Human Gap: Still trailing 3Blue1Brown in storytelling nuance, but remarkably effective in structured delivery.


Figure 6: Clear text, stable layouts, coherent animations—Code2Video vs. pixel-based Veo3.


Why It Works: Component Ablations


Table 2: Removing the Planner severely degrades quality—pedagogical structure is vital.

  • Planner: Most critical—without it, aesthetics and TeachQuiz drop ~40 points.
  • External Database & Visual Anchors: Stabilize layouts and strengthen conceptual grounding.
  • Critic: Polishes the final layout, removing residual defects such as overlaps and occlusions.


Table 3: Parallelization and ScopeRefine are indispensable for practical generation times.

Efficiency insights:

  • Removing Parallel Execution → generation time grows from about 15 min to 86+ min per topic.
  • Removing ScopeRefine → Debugging costs explode.

Human Study: Real Learners, Real Insights


Table 4: Human preferences align with VLM evaluations but exhibit higher sensitivity to visual flaws.

Observations:

  • Sensitivity to Layout: Humans penalized even minor occlusions more heavily than VLMs.
  • Completion Willingness: Long videos—even high-quality ones—saw lower completion rates among younger viewers.
  • Correlation: Strong link between visual appeal and learning outcomes (r = 0.971).

Conclusion: A New Chapter for Generative Educational Media

Code2Video signals a shift in generative video—from pixel synthesis to code synthesis:

  • Control & Interpretability: Explicit scripting of every visual and timing detail.
  • Agentic Collaboration: Planner, Coder, and Critic decompose the creative process—boosting quality and reliability.
  • Innovative Evaluation: TeachQuiz and MMMC benchmark redefine what “good” means for educational videos.

While the artistry of human educators remains the gold standard, Code2Video offers a clear path toward scalable, high-quality, AI-assisted teaching media—where lessons are architected in the precise language of code, and every frame serves learning.