Imagine watching a soccer match where the commentator screams “Goal!” two minutes after the ball has already hit the net. It would be disorienting, annoying, and largely useless. Yet, this is the precise problem plaguing Artificial Intelligence when we try to teach it to understand sports.
For years, researchers have been trying to build systems that can automatically narrate sports videos. The potential is immense: from automated highlights to assisting visually impaired fans. However, current models often fail to sound professional or accurate.
In this deep dive, we will explore a fascinating paper titled “MatchTime: Towards Automatic Soccer Game Commentary Generation.” The researchers identify a critical bottleneck in existing datasets—temporal misalignment—and propose a sophisticated pipeline to fix it. We will break down how they cleaned up the data, the architecture of their new commentary model (MatchVoice), and why “aligning” time is the secret ingredient to better AI storytelling.
The Problem: When “Live” Isn’t Live
To train an AI to commentate, you need data. Specifically, you need video clips of soccer matches paired with the text of what the commentator said. The go-to dataset for this has been SoccerNet-Caption.
However, there is a major flaw in how this data was collected. The text commentaries were scraped from live text broadcast websites. If you follow sports online, you know the issue: a goal happens, the writer types it out, and it appears on the feed 30 seconds to 2 minutes later.
When you train a model on this data, you are essentially showing it a video of a player celebrating and telling it, “This is a video of a corner kick,” because the text is lagging two minutes behind the visuals.

As shown in Figure 1 above, existing datasets are rife with these “Temporal Misalignments.” The researchers verified this by manually watching 49 matches and correcting the timestamps.
The results of their investigation were startling. As illustrated in the histogram below, the offset between what you see and what the text says can exceed 100 seconds.

The average absolute offset was nearly 17 seconds. For a fast-paced game like soccer, 17 seconds is an eternity. This noise confuses the AI, making it far harder to learn the relationship between a visual action (a slide tackle) and the corresponding language (“a fierce challenge”).
The Solution: The MatchTime Pipeline
To solve this, the authors couldn’t just manually fix thousands of hours of video. They needed an automated way to synchronize the text commentary with the video events. They developed a two-stage pipeline: Coarse Alignment followed by Fine-Grained Alignment.
Stage 1: Coarse Alignment with ASR and LLMs
The first step relies on a clever insight: the video file usually contains audio of the actual broadcast commentary. This audio is perfectly synchronized with the video.
The researchers used WhisperX, an automatic speech recognition (ASR) tool, to transcribe the background audio. However, live audio is messy—commentators stutter, scream, or go on tangents. It’s not a clean list of events.
To fix this, they utilized LLaMA-3, a Large Language Model. They fed the transcribed audio into LLaMA-3 with a prompt to summarize the narration into clear “event descriptions” with timestamps.
Simultaneously, they took the scraped textual commentary (the ones with the bad timestamps) and asked LLaMA-3 to match them with the ASR summaries based on semantic similarity. This gets the text roughly in the right neighborhood of the video timeline.
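To make the coarse-alignment idea concrete, here is a minimal sketch of what this stage could look like in code. It assumes WhisperX is installed and uses a hypothetical `query_llm` helper as a stand-in for whatever LLaMA-3 serving setup is available; it is an illustration of the workflow, not the authors' actual pipeline.

```python
# Sketch of Stage 1 (coarse alignment): transcribe broadcast audio with WhisperX,
# then use an LLM to summarize events and roughly match scraped commentary to them.
# `query_llm` is hypothetical: plug in your own LLaMA-3 chat interface.
import whisperx

def transcribe_broadcast(audio_path: str, device: str = "cuda") -> list[dict]:
    """Transcribe in-game broadcast audio into timestamped segments."""
    model = whisperx.load_model("large-v2", device)
    audio = whisperx.load_audio(audio_path)
    result = model.transcribe(audio, batch_size=16)
    # Each segment looks like {"start": 12.3, "end": 15.8, "text": "..."}
    return result["segments"]

def coarse_align(segments: list[dict], scraped_comments: list[str], query_llm) -> list[dict]:
    """Summarize noisy ASR into timestamped events, then match each scraped
    commentary line to the closest event by semantic similarity."""
    transcript = "\n".join(f"[{s['start']:.0f}s] {s['text']}" for s in segments)
    events = query_llm(
        "Summarize this live soccer narration into short event descriptions, "
        "keeping the timestamps:\n" + transcript
    )
    aligned = []
    for comment in scraped_comments:
        answer = query_llm(
            "Which timestamped event below best matches this commentary? "
            f"Reply with the timestamp only.\nCommentary: {comment}\nEvents:\n{events}"
        )
        aligned.append({"comment": comment, "rough_time": answer})
    return aligned
```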
Stage 2: Fine-Grained Temporal Alignment
The ASR method gets us close, but it’s not perfect. Sometimes the audio is missing, or the commentator is talking about a player’s history rather than the action on screen. To achieve precision, the authors turned to Contrastive Learning.
They designed a model to look at the text and the video frames and determine mathematically which frame matches the text best.

As shown in panel (b) of the figure above, the model uses two encoders (based on the CLIP architecture):
- Text Encoder: Converts the commentary sentence into a mathematical vector (\(C\)).
- Visual Encoder: Converts video frames into mathematical vectors (\(V\)).
The goal is to find the video frame \(V_j\) that is most similar to the commentary \(C_i\). This is calculated using an affinity matrix:
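One reasonable way to write each entry, consistent with the \(C_i\)/\(V_j\) notation above (the paper’s exact formulation may differ), is as a normalized dot product:

\[
A_{ij} = \frac{C_i \cdot V_j}{\lVert C_i \rVert \, \lVert V_j \rVert}
\]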

This equation essentially computes the cosine similarity between the text and every candidate video frame. The higher the value, the more likely that frame represents the text.
The model is trained using a contrastive loss function. In simple terms, this loss function punishes the model if it matches a caption to the wrong frame and rewards it for matching the correct frame (based on a small set of manually annotated data).
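For readers who want the math, a standard InfoNCE-style formulation of this idea (a generic version, not necessarily the paper’s exact loss) looks like:

\[
\mathcal{L}_i = -\log \frac{\exp\!\left(A_{i j^{*}} / \tau\right)}{\sum_{j} \exp\!\left(A_{ij} / \tau\right)}
\]

where \(j^{*}\) is the manually annotated ground-truth frame for commentary \(C_i\) and \(\tau\) is a temperature hyperparameter.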

Once trained, the model scans the video around the “rough” timestamp provided by Stage 1. It looks at frames from 45 seconds before to 30 seconds after. The frame with the highest similarity score determines the new, corrected timestamp.
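Putting Stage 2 together, re-timestamping a caption boils down to scanning the candidate window and taking the most similar frame. The sketch below uses a stock CLIP model from Hugging Face as a stand-in for the paper’s fine-tuned encoders; the function names and window handling are illustrative assumptions.

```python
# Sketch of fine-grained alignment: score every frame in the window around the
# rough timestamp and move the caption to the most similar one.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def realign(caption: str, frames: list[Image.Image], frame_times: list[float]) -> float:
    """frames/frame_times: frames sampled from (rough_time - 45s, rough_time + 30s)."""
    inputs = processor(text=[caption], images=frames, return_tensors="pt", padding=True)
    out = model(**inputs)
    text_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)   # (1, d)
    img_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)  # (N, d)
    affinity = (text_emb @ img_emb.T).squeeze(0)  # cosine similarity per frame
    best = int(affinity.argmax())
    return frame_times[best]  # corrected timestamp for this caption
```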
The result of this massive data cleaning effort is a new dataset called MatchTime.
The Model: MatchVoice
With a clean, aligned dataset in hand, the researchers turned their attention to the generative task: building an AI that can watch a video clip and produce professional commentary. They call this model MatchVoice.
Architecture Breakdown
MatchVoice is a Video-Language Model (VLM). Its job is to translate pixels into sentences.

The architecture, depicted above, consists of three main stages:
- Visual Encoder: The video is fed into a pre-trained visual encoder (such as CLIP, or the soccer-specific Baidu feature extractor). This converts the raw frames into a sequence of feature vectors representing the visual content. Importantly, these parameters are usually “frozen” (not updated during training) to save compute and retain pre-trained knowledge.
- Temporal Aggregator: A video clip contains many frames, and simply feeding all of them to a language model is inefficient. The researchers use a Perceiver-like aggregator: an attention mechanism looks at the stream of visual features and compresses it into a fixed number of “summary” tokens that capture the most important temporal information (like the motion of a ball or a player’s run).
- LLM Decoder: The compressed visual information is passed through a projection layer (a simple MLP) to translate it into the “language space” of the LLM. Finally, LLaMA-3 acts as the decoder: it takes these visual tokens as a prompt and generates the commentary text token by token.
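To make the data flow concrete, here is a heavily simplified PyTorch sketch of the aggregator-plus-projection part. The module names and dimensions are my own illustrative assumptions, not the authors’ code; the frozen visual encoder and the LLaMA-3 decoder are treated as black boxes.

```python
import torch
import torch.nn as nn

class PerceiverAggregator(nn.Module):
    """Compress a variable-length sequence of frame features into a fixed
    number of learned query tokens via cross-attention (Perceiver-style)."""
    def __init__(self, feat_dim: int, num_queries: int = 32, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, feat_dim))
        self.cross_attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(feat_dim, feat_dim * 4), nn.GELU(),
                                 nn.Linear(feat_dim * 4, feat_dim))

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, feat_dim) from the frozen visual encoder
        q = self.queries.unsqueeze(0).expand(frame_feats.size(0), -1, -1)
        summary, _ = self.cross_attn(q, frame_feats, frame_feats)
        return summary + self.ffn(summary)               # (batch, num_queries, feat_dim)

class CommentaryPrefix(nn.Module):
    """Project the summary tokens into the LLM's embedding space so they can be
    prepended to the text prompt as 'visual tokens'."""
    def __init__(self, feat_dim: int, llm_dim: int, num_queries: int = 32):
        super().__init__()
        self.aggregator = PerceiverAggregator(feat_dim, num_queries)
        self.proj = nn.Sequential(nn.Linear(feat_dim, llm_dim), nn.GELU(),
                                  nn.Linear(llm_dim, llm_dim))

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        return self.proj(self.aggregator(frame_feats))   # (batch, num_queries, llm_dim)

# Usage sketch:
#   feats = frozen_encoder(frames)                         # (B, T, feat_dim)
#   prefix = CommentaryPrefix(feat_dim=768, llm_dim=4096)(feats)
#   text = llm.generate(inputs_embeds=torch.cat([prefix, prompt_embeds], dim=1))
```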
Experiments and Results
The researchers conducted extensive experiments to answer two questions:
- Did the alignment pipeline actually fix the timestamps?
- Does the MatchVoice model actually write better commentary?
1. Does Alignment Work?
The short answer is yes. The researchers compared the “Offset” (time difference between text and action) before and after using their pipeline.

As shown in Table 2, the Average Absolute Offset dropped from 13.89 seconds to just 6.89 seconds.
More impressively, look at the window10 row. This measures how often the text is within 10 seconds of the action. Without alignment, only 35.32% of captions were accurate. With the MatchTime pipeline (labeled Contrastive-Align with pre-processing), this jumped to 80.73%.
2. Does Better Data Mean Better Commentary?
The researchers trained MatchVoice on both the original (messy) data and the new (aligned) MatchTime data. They compared the results against several baseline models.

Table 3 reveals several key findings:
- Zero-shot fails: General-purpose video models (like Video-LLaMA) perform terribly on soccer. They lack domain knowledge.
- Alignment is King: Look at the jump in performance when shifting from the “Trained on original SoccerNet” block to the “Trained on our aligned MatchTime” block. Every single metric improves.
- State-of-the-Art: The full MatchVoice model (especially when using Baidu visual features and LoRA tuning) achieves the highest scores across the board (e.g., a CIDEr score of 42.00 vs. 11.97 for the baseline).
Qualitative Analysis
Numbers are great, but does the commentary actually sound good? Let’s look at some examples.

In Figure 5, we can see the difference:
- SN-Caption (Baseline): Often produces generic or incorrect statements.
- MatchVoice (Ours): Captures nuance. In example (b), it correctly identifies a corner kick leading to a header. In example (d), it identifies a player asking for medical attention.
The alignment visualization below further proves the point. The orange text represents the original timestamp (often completely missing the action), while the green text shows the corrected timestamp aligned with the specific frame of the event.

One interesting ablation study focused on the Window Size—how much video context the model needs to see.

Table 4 shows that a 30-second window is the “Goldilocks” zone. If the window is too short (10s), the model misses context. If it’s too long (60s), the model might get confused by multiple events happening in sequence.
Conclusion and Future Implications
The “MatchTime” paper highlights a fundamental truth in Machine Learning: Data quality is often more important than model complexity.
By accepting that existing datasets were fundamentally broken due to time lags, the researchers built a robust pipeline to fix the data first. Their two-stage alignment process—using audio-based coarse alignment followed by vision-based fine alignment—created a superior training set.
The resulting model, MatchVoice, demonstrates that when AI sees the action at the exact moment the commentary describes it, it learns to narrate significantly better.
Key Takeaways:
- Misalignment is pervasive: Scraping live feeds leads to massive time delays that hurt AI training.
- Automated Cleanup is possible: Combining ASR, LLMs, and Contrastive Learning allows data correction to scale with only a small amount of manual annotation.
- Context Matters: A 30-second video window provides the optimal context for soccer commentary.
While the model still has limitations—it struggles to distinguish very similar actions (like a free kick vs. a corner) and doesn’t yet recognize specific player names consistently—this work paves the way for the next generation of automated sports broadcasting. In the near future, that AI commentator might just be accurate enough to keep you on the edge of your seat.