Imagine you are standing on a stage, translating a speech from English to Japanese in real-time. The speaker says, “I am not here to say that men are to blame…”
If you wait for the full sentence to finish before speaking, you are performing consecutive interpretation. But if you start translating immediately after “I am not here to…” while the speaker keeps talking, you are performing simultaneous interpretation (SI).
For humans, this is exhausting. For machines, it is an algorithmic nightmare—especially between languages with vastly different grammatical structures, like English and Japanese. English puts the verb early (SVO: Subject-Verb-Object), while Japanese waits until the very end (SOV: Subject-Object-Verb). To translate accurately, a machine usually wants to wait for the verb. To translate fast, it can’t afford to wait.
This trade-off between latency (delay) and quality is the central problem of Simultaneous Machine Translation (SiMT).
Today, we are diving into a fascinating paper titled “Simultaneous Interpretation Corpus Construction by Large Language Models in Distant Language Pair.” The researchers propose a novel solution to the data scarcity problem in SiMT: using Large Language Models (LLMs) like GPT-4 to synthesize high-quality training data that mimics the strategies of professional human interpreters.
The Core Problem: Why is SiMT So Hard?
In standard “offline” Machine Translation (like typing a paragraph into Google Translate), the model sees the entire source sentence before generating a translation. It can look ahead, rearrange words, and perfect the grammar.
In SiMT, the model receives the input as a stream. It must decide when to READ more words and when to WRITE (translate) a word.
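To make that concrete, here is a minimal sketch of the streaming loop. The `policy` and `decode_one_token` callables are hypothetical stand-ins for a real SiMT model; they are not from the paper.

```python
# Minimal sketch of a SiMT decision loop (illustrative only).
# `policy` decides READ vs. WRITE; `decode_one_token` emits one target token.

def simultaneous_translate(source_stream, policy, decode_one_token, max_len=200):
    read_so_far = []      # source tokens received so far
    written = []          # target tokens emitted so far
    source_done = False

    while len(written) < max_len:
        if not source_done and policy(read_so_far, written) == "READ":
            try:
                read_so_far.append(next(source_stream))  # wait for the next source word
            except StopIteration:
                source_done = True                       # the speaker has finished
        else:
            token = decode_one_token(read_so_far, written)  # WRITE a target token
            if token == "<eos>":
                break
            written.append(token)
    return written
```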
The Word Order Chasm
The challenge peaks when translating between “distant” language pairs.
- English: “I ate an apple.”
- Japanese: “Watashi wa (I) ringo wo (apple) tabeta (ate).”
If the system translates word-for-word as they come in (“I ate…”), the Japanese output might sound unnatural or require a verb that hasn’t been spoken yet. If it waits for the English sentence to finish to get the word order right, the latency becomes unacceptable for a live conversation.
The Data Bottleneck
To teach AI to handle this, we need training data: pairs of English speech and Japanese translations that follow simultaneous interpretation styles (preserving the order of information chunks).
- Human SI Corpora: Recordings of real interpreters are rare, expensive to transcribe, and often “noisy” (containing summaries, omissions, or errors).
- Offline Corpora: Standard translation datasets rearrange the word order too much, forcing the SiMT model to wait too long.
This brings us to the researchers’ solution: The LLM-SI-Corpus.
The Solution: Synthesizing Interpretation Data
The authors propose a method to convert existing, high-quality offline Speech Translation (ST) corpora into “Interpretation-style” data using LLMs.
The idea is to take a standard English sentence and ask an LLM (specifically GPT-3.5 or GPT-4) to rewrite the Japanese translation so that it follows the word order of the English source, essentially mimicking a technique called Chunk-Wise Monotonic Translation (CWMT).
The Pipeline
Let’s look at how the researchers visualized the data landscape.

As shown in Figure 1, the source material is TED Talks.
- Offline Translation: This is the standard data where the Japanese is grammatically perfect but structurally different from English (high latency).
- NAIST-SIC: This is real human interpretation data. It’s valuable but inconsistent.
- LLM-SI-Corpus (Ours): This is the new contribution. The researchers use the offline text and process it through an LLM to create a “perfect” simultaneous interpretation dataset.
The “CWMT” Strategy
The core of their prompt engineering is based on Chunk-Wise Monotonic Translation (CWMT). This is a guideline taught to human interpreters. It involves:
- Chunking: Breaking the English sentence into meaningful phrases (clauses, prepositional phrases).
- Translating: Translating each chunk locally.
- Connecting: Stitching the translated chunks together in the original English order, using connecting words (conjunctions, demonstratives) to make it flow in Japanese.
This strategy minimizes the need to look ahead (reducing latency) while keeping the content accurate.
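As a toy illustration of that workflow (the helper functions here are hypothetical, not from the paper), CWMT can be sketched as:

```python
def chunk_wise_monotonic_translation(english_sentence,
                                     split_into_chunks,
                                     translate_chunk,
                                     connective_for):
    """Toy CWMT pipeline: chunk, translate locally, then join in source order."""
    chunks = split_into_chunks(english_sentence)             # 1. chunking
    japanese_chunks = [translate_chunk(c) for c in chunks]   # 2. local translation
    output = []
    for i, jp in enumerate(japanese_chunks):                 # 3. connect, no reordering
        if i > 0:
            output.append(connective_for(japanese_chunks[i - 1], jp))
        output.append(jp)
    return "".join(output)
```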
Creating the Corpus: The Prompt
How do you get an LLM to think like a simultaneous interpreter? You have to be specific. The researchers designed a structured prompt that forces the LLM to follow the CWMT workflow.

Figure 2 details this process. The prompt acts as a “System Instruction” for the AI:
- Role: “You are a skilled simultaneous interpreter.”
- Step 1 (Chunking): Split the source text based on grammatical boundaries (like relative pronouns or prepositions).
- Step 2 (Translation): Translate each chunk.
- Step 3 (Concatenation): Connect them naturally without reordering the chunks.
The output is requested in JSON format, ensuring the researchers can extract the aligned chunks for analysis. By applying this prompt to the NAIST-SIC-Aligned-ST corpus (based on TED Talks), they generated a massive dataset of “ideal” simultaneous translations.
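As an illustration only, a CWMT-style request to the OpenAI chat API might look like the sketch below. The paper does request JSON output, but the exact prompt wording, JSON schema, and model identifier here are assumptions, not the authors' implementation.

```python
from openai import OpenAI
import json

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPT = """You are a skilled simultaneous interpreter (English -> Japanese).
1. Split the source text into chunks at grammatical boundaries
   (relative pronouns, prepositions, conjunctions).
2. Translate each chunk into Japanese.
3. Concatenate the translated chunks in the original order, adding
   connective words where needed, without reordering the chunks.
Return JSON of the form:
{"chunks": [{"source": "...", "target": "..."}], "translation": "..."}"""

def si_style_translation(english_sentence: str, model: str = "gpt-4o") -> dict:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": english_sentence},
        ],
        response_format={"type": "json_object"},  # force parseable JSON output
    )
    return json.loads(response.choices[0].message.content)
```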
Visualizing the Style Difference
To understand why this matters, we need to compare the different translation styles.

Table 4 provides a concrete example.
- Source: “And (1) I’m / (2) not here to / (3) say that / (4) men are to / (5) blame…”
- OFFLINE: The translation completely flips the structure. It moves concepts from the end of the English sentence to the beginning of the Japanese sentence. An SiMT model trained on this would have to wait until it hears “blame” or “crisis” before it can start generating the beginning of the Japanese sentence.
- SI (Simultaneous Interpretation): This style follows the English order much more closely. Phrase (4) “men” and (5) “blame” appear early in the translation, just like in the source.
The LLM-SI-Corpus aims to replicate the SI style but with the consistency of a machine.
Experiments: Does it Work?
The researchers fine-tuned Speech-to-Text translation models using their new LLM-generated corpus and compared them against models trained on standard data.
Experimental Setup
- Task: Speech-to-Text Translation (English Audio -> Japanese Text).
- Policy: They used a “wait-k” policy. This is a fixed rule where the model first reads k chunks of source speech, then alternates between reading one more chunk and writing one target token. By varying k, they can measure the trade-off between latency and quality (see the sketch after this list).
- Metrics:
- Latency: Average Lagging (AL) — how far behind the speaker is the translation? (Lower is better).
- Quality: BLEU (text overlap), BLEURT, and COMET (semantic similarity).
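For intuition, here is a rough token-level sketch of the wait-k read/write schedule and the standard Average Lagging computation. The paper operates on speech segments rather than text tokens, so treat this as an illustration of the idea, not the authors' evaluation code.

```python
def wait_k_schedule(k, src_len, tgt_len):
    """g[t-1] = number of source units read before emitting target token t."""
    return [min(k + t - 1, src_len) for t in range(1, tgt_len + 1)]

def average_lagging(g, src_len, tgt_len):
    """Average Lagging (lower = less delay), token-level version."""
    gamma = tgt_len / src_len
    # cut-off: first target position at which the full source has been read
    tau = next((t for t, read in enumerate(g, start=1) if read >= src_len), len(g))
    return sum(g[t - 1] - (t - 1) / gamma for t in range(1, tau + 1)) / tau

# Example: a wait-3 schedule lags roughly 3 source units behind the speaker.
g = wait_k_schedule(k=3, src_len=10, tgt_len=12)
print(average_lagging(g, src_len=10, tgt_len=12))
```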
Key Results
The results were impressive. Let’s look at the performance on the tst-COMMON dataset.

In Figure 3, we are looking at Quality vs. Latency.
- X-Axis (Latency): Lower values (left) mean the system is faster.
- Y-Axis (Quality): Higher values (up) mean the translation is better.
- The Goal: We want lines to be in the top-left corner (Fast and Accurate).
Observations:
- Low Latency Dominance: Look at the Green (GPT-3.5) and Red (GPT-4) lines. At low latency (left side of the graph, where AL < 1000ms), they consistently score higher on BLEU, BLEURT, and COMET than the Orange (Offline) baseline.
- Semantic Stability: In metrics like COMET and BLEURT (which measure meaning rather than exact word matching), the LLM corpora maintain high quality even when the system is forced to be very fast.
- Beating Human Data: The Purple (SIC) line represents the model trained on real human interpretation data. It generally performs worse. This confirms that human data, while “real,” is often too noisy (filled with errors and fillers) to be good training data for these models.
How Does it Compare to Manually Created CWMT?
The researchers also tested their model against a test set that was manually annotated by humans following strict CWMT guidelines (the “Chunk-wise” dataset).

Figure 5 shows the results on this specialized test set. Here, the LLM-SI-Corpus models (Green/Red) dominate across the board. This validates that the LLMs successfully learned the “Chunk-Wise” strategy requested in the prompt. They are producing translations that align perfectly with the structure expected of a high-quality simultaneous interpreter.
Discussion: Qualitative Analysis
Numbers are great, but what does the actual output look like? Does the LLM really preserve the word order better?
Let’s examine Table 11, which compares the outputs of different models.

- Source: “(1) I just came back from a community that / (2) holds the secret / (3) to human survival.”
- Offline/Reference: Translates in the order (3) -> (2) -> (1). In Japanese, this structure (Human survival’s secret holding community) is grammatically natural but requires hearing the end of the sentence before saying the beginning.
- GPT-3.5 / GPT-4: They translate in the order (1) -> (2) -> (3). “Just came back from a community. It holds a secret. To human survival.”
The LLM models successfully twisted the Japanese grammar to fit the English flow. This allows the SiMT system to output phrase (1) immediately, drastically reducing the lag for the listener.
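One simple way to quantify this kind of order preservation, not a metric reported in the paper but a useful illustration, is Kendall's tau over the positions of aligned chunks: 1.0 means the translation keeps the source chunk order, -1.0 means it is fully reversed.

```python
from itertools import combinations

def kendall_tau(order):
    """Kendall's tau of a permutation vs. the identity.
    order[i] = position in the translation of the i-th source chunk."""
    pairs = list(combinations(range(len(order)), 2))
    concordant = sum(1 for i, j in pairs if order[i] < order[j])
    discordant = len(pairs) - concordant
    return (concordant - discordant) / len(pairs)

# Offline translation of the Table 11 example emits chunks as (3) -> (2) -> (1)
print(kendall_tau([2, 1, 0]))   # -1.0, fully reordered
# The SI-style translation keeps the English order (1) -> (2) -> (3)
print(kendall_tau([0, 1, 2]))   #  1.0, fully monotonic
```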
GPT-3.5 vs. GPT-4
Interestingly, the researchers found that while GPT-4 generally produces higher quality and more fluent text, GPT-3.5 was sometimes stricter about adhering to the monotonic word order. GPT-4 occasionally “improved” the sentence by reordering it slightly for better flow, which technically adds latency. However, both models vastly outperformed the offline baselines in terms of the latency-quality trade-off.
Conclusion and Future Implications
This paper presents a compelling argument for using synthetic data in simultaneous translation. The LLM-SI-Corpus demonstrates that we don’t necessarily need thousands of hours of expensive human interpretation recordings to train effective SiMT systems.
Key Takeaways:
- LLMs as Data Generators: LLMs can be prompted to follow complex linguistic guidelines (like CWMT) to convert offline text into interpretation-style text.
- Better than Human Data: Synthetic data is cleaner and more consistent than noisy human interpretation transcripts, leading to better model training.
- Solving the Word Order Puzzle: The resulting corpus effectively teaches models to handle the English-Japanese word order difference, allowing for high-quality translation with significantly lower latency.
This approach is highly scalable. The cost of generating this corpus was relatively low (approx. $20 for GPT-3.5), suggesting that this method could be easily applied to other difficult language pairs (like German-English or Chinese-English) to democratize real-time translation technology.
As LLMs continue to improve, the line between “translation” and “interpretation” in AI will likely blur, bringing us closer to a universal translator that works as fast as we speak.