Imagine you are a simultaneous interpreter at the United Nations. A diplomat is speaking in English, and you are interpreting into Japanese. The English speaker starts a sentence: “The crucial agreement that we signed yesterday…”

In English, the verb (signed) comes early. But Japanese is a Subject-Object-Verb (SOV) language: the verb comes last. A natural Japanese translation might require you to wait until the very end of the English sentence to know what happened to the agreement. But you can’t wait. You have to speak now. If you wait too long, the speaker will be three sentences ahead, and you will lose track.

This is the fundamental challenge of Simultaneous Speech Translation (SiST): the trade-off between Latency (how far behind the translation is) and Quality (how accurate and natural the translation is).

In a recent paper titled “Simul-MuST-C: Simultaneous Multilingual Speech Translation Corpus Using Large Language Model,” researchers from the Nara Institute of Science and Technology propose a novel solution. They utilize Large Language Models (LLMs) to create a new dataset that mimics the strategies of human interpreters, specifically a method known as the “Salami Technique.”

In this post, we will explore how they re-engineered translation data to teach AI models to translate faster without losing the plot.

The Problem: The Word Order Bottleneck

To train AI models for translation, we typically use “offline” translation corpora. These are datasets containing sentence pairs (e.g., an English sentence and its perfect Japanese translation). These translations are usually created by humans who have the luxury of reading the entire source sentence before writing the translation.
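
To make that concrete, here is roughly what one record in such a corpus looks like (a sketch with hypothetical field names, not MuST-C’s actual schema):

```python
# An illustrative offline speech-translation record. Field names are
# hypothetical; the real MuST-C format pairs audio segments with source
# transcripts and offline-written target translations.
sample = {
    "audio": "ted_00001_0042.wav",  # speech segment from a TED Talk
    "source_en": "Some individual services are now available online.",
    "target_ja": "一部の個別サービスは現在オンラインで利用できます。",  # written with the whole sentence in view
}
```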

While this results in high-quality, natural text, it is terrible for simultaneous translation.

In offline translation, the translator often flips the word order entirely to suit the target language’s grammar. If an AI model trains on this data, it learns to wait for the end of the sentence to figure out the word order. In a live setting, that waiting creates unacceptable delays.

The researchers illustrate this problem clearly in the image below:

Figure 1: An example of an English-Japanese parallel sentence. The top shows standard MuST-C translation with crossing arrows indicating reordering. The bottom shows Simul-MuST-C preserving word order.

Look at the top half of Figure 1. The arrows crisscross chaotically. This represents the standard MuST-C dataset. To translate the first English phrase (“Some individual services”), the model has to jump to the end of the Japanese sentence.

Now look at the bottom half (Simul-MuST-C). The arrows are parallel. The Japanese translation follows the English word order almost linearly. This concept is called Monotonicity. The more monotonic the translation, the easier it is to translate in real-time.

The Inspiration: The Salami Technique

Human interpreters don’t translate sentences; they translate units of meaning. When a sentence is complex, they chop it up into small, manageable segments—like slicing a salami. They translate each slice immediately, connecting them as they go.

This Salami Technique allows interpreters to maintain the word order of the source language (English) while producing understandable output in the target language (Japanese, German, or Chinese), even if the grammar isn’t “textbook perfect.”

The researchers asked a crucial question: Can we use LLMs to rewrite existing datasets using the Salami Technique, and use that data to train better SiST models?

The Core Method: Building Simul-MuST-C

Collecting real data from professional simultaneous interpreters is incredibly expensive and difficult. Instead of hiring humans, the researchers turned to GPT-4o.

They took the MuST-C v2.0 dataset (a popular multilingual speech translation corpus based on TED Talks) and processed it through an LLM. Their goal was to transform the “offline” translations into “simultaneous-style” translations.

The 3-Step Prompting Strategy

The researchers designed a specific prompt template to force the LLM to think like an interpreter. As shown in Figure 2, the process involves three distinct steps:

  1. Segmentation: The LLM breaks the long English source sentence into short, meaningful chunks (the “salami slices”).
  2. Translation: It translates each chunk individually into the target language.
  3. Combination: It combines the translated chunks into a single, linear sentence.

Figure 2: The prompt template and its example for constructing the Simul-MuST-C. It shows the task definition, instructions, and input/output structure.
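
To make the three steps concrete, here is a minimal sketch of what such a prompt could look like, assuming the OpenAI chat API; the wording is illustrative, not the paper’s exact template:

```python
# A hedged sketch of the three-step Salami prompting idea using GPT-4o.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SALAMI_PROMPT = """\
You are a simultaneous interpreter. Translate the English sentence into
Japanese using the Salami Technique:
1. Segmentation: split the sentence into short meaning units, in order.
2. Translation: translate each unit into Japanese on its own.
3. Combination: join the translated units into one sentence that keeps
   the original English word order as much as possible.
Return only the final combined Japanese sentence.

English: {source}
"""

def salami_translate(source: str) -> str:
    """Ask the LLM for a monotonic, interpreter-style translation."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": SALAMI_PROMPT.format(source=source)}],
        temperature=0.0,  # deterministic-ish output for corpus construction
    )
    return response.choices[0].message.content.strip()

print(salami_translate("The crucial agreement that we signed yesterday will reshape trade policy."))
```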

By explicitly instructing the model to use the Salami Technique, the researchers generated a new dataset called Simul-MuST-C for three language pairs:

  • English-to-Japanese (En-Ja)
  • English-to-Chinese (En-Zh)
  • English-to-German (En-De)

These languages were chosen because they differ from English to varying degrees. Japanese is very different (SOV vs. English’s SVO), Chinese is broadly similar but still requires some reordering, and German is structurally the closest to English of the three.

Did It Work? Analyzing Monotonicity

Before training any translation models, the researchers needed to verify if the new dataset was actually more monotonic (linear) than the original.

They measured this with a word-alignment-based monotonicity score: the higher the number, the more closely the target word order tracks the source.
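
One common way to compute such a score, given word alignments between the two sentences, is Kendall’s tau over the aligned target positions (the paper’s exact metric may differ); a minimal sketch:

```python
# Kendall's tau over word alignments as a monotonicity proxy:
# 1.0 = identical word order, lower values = more reordering.
from itertools import combinations

def monotonicity(alignment: list[tuple[int, int]]) -> float:
    """Kendall's tau of target positions, visited in source order."""
    targets = [tgt for _, tgt in sorted(alignment)]
    pairs = list(combinations(range(len(targets)), 2))
    if not pairs:
        return 1.0
    concordant = sum(1 for i, j in pairs if targets[i] < targets[j])
    discordant = sum(1 for i, j in pairs if targets[i] > targets[j])
    return (concordant - discordant) / len(pairs)

# Parallel arrows (bottom of Figure 1): perfectly monotonic.
print(monotonicity([(0, 0), (1, 1), (2, 2), (3, 3)]))  # 1.0
# Crossing arrows (top of Figure 1): fully reversed order.
print(monotonicity([(0, 3), (1, 2), (2, 1), (3, 0)]))  # -1.0
```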

Table 2: Comparison of word order monotonicity. Simul-MuST-C shows higher scores across all language pairs compared to MuST-C.

As Table 2 shows, the improvement is striking, particularly for English-to-Japanese (En-Ja).

  • MuST-C (Original): 0.572 monotonicity score.
  • Simul-MuST-C (New): 0.815 monotonicity score.

This confirms that GPT-4o successfully rewrote the Japanese translations to follow the English word order. The improvement for Chinese (En-Zh) and German (En-De) was smaller, largely because those languages naturally align better with English to begin with.

Here is a concrete example of what this looks like in text:

Table 3: Text examples showing word order monotonicity. In En-Ja, the new dataset keeps “60 to 80 percent” at the end, matching the English source.

In Table 3, look at the English-Japanese example.

  • Source: “…at the 60 to 80 percent level.” (appears at the end of the sentence).
  • MuST-C: Moves this phrase to the beginning of the Japanese sentence (Label 4).
  • Simul-MuST-C: Keeps this phrase at the end of the Japanese sentence, matching the English flow.

Experimental Setup

To test if this new data actually helps AI models, the researchers trained Speech-to-Text translation models using two different datasets:

  1. Baseline: Trained on the original MuST-C (standard translations).
  2. Proposed: Trained on Simul-MuST-C (Salami-style translations).

They evaluated the models using a Wait-k policy. This is a common strategy in simultaneous translation where the model first waits for k source words (e.g., k = 3, 5, or 7), then alternates between reading one new source word and writing one target word. A small k means low latency (fast), while a large k usually means better quality.
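
To see what that schedule looks like, here is a minimal, model-free sketch of the wait-k read/write policy (a real system drives an incremental translation model with these actions):

```python
# Wait-k as a pure read/write schedule: read k source tokens up front,
# then alternate one WRITE per READ until both sides are exhausted.
def wait_k_actions(source_len: int, target_len: int, k: int) -> list[str]:
    actions, read, written = [], 0, 0
    while written < target_len:
        if read < min(k + written, source_len):
            actions.append("READ")   # still behind the wait-k frontier
            read += 1
        else:
            actions.append("WRITE")  # emit one target token
            written += 1
    return actions

# k=3: the model stays exactly 3 source tokens ahead of its output,
# then flushes the tail once the source is fully read.
print(wait_k_actions(source_len=6, target_len=6, k=3))
# ['READ', 'READ', 'READ', 'WRITE', 'READ', 'WRITE', 'READ', 'WRITE',
#  'READ', 'WRITE', 'WRITE', 'WRITE']
```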

The Metrics

  • Quality: Measured using BLEU (n-gram overlap with a reference translation) and COMET/COMET-QE (neural metrics of semantic quality; COMET-QE requires no reference).
  • Latency: Measured using Average Token Delay (ATD), essentially how long, on average, the user waits for each translated token. A sketch of both sides follows below.
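
Here is a sketch of how these two sides can be computed, assuming the sacrebleu package for BLEU (COMET/COMET-QE work similarly via the unbabel-comet package but need a downloaded checkpoint) and a simplified delay measure in the spirit of ATD, not the paper’s exact formula:

```python
# Quality: corpus BLEU via sacrebleu (pip install sacrebleu).
import sacrebleu

hypotheses = ["the agreement that we signed yesterday is crucial"]
references = ["the crucial agreement that we signed yesterday"]
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.1f}")

# Latency: a simplified token-delay average (illustrative, not the exact
# ATD definition). For each emitted target token, record how many source
# tokens had been read at emission time, then average.
def average_delay(read_counts_at_emission: list[int]) -> float:
    return sum(read_counts_at_emission) / len(read_counts_at_emission)

# wait-3 on a 6-token source (see the wait-k sketch above):
print(average_delay([3, 4, 5, 6, 6, 6]))  # 5.0
```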

The Results

The experiments revealed that the effectiveness of the Simul-MuST-C dataset depends heavily on the grammatical distance between the languages.

1. English-to-Japanese (The Big Winner)

Because English and Japanese have such different word orders, forcing the model to learn from “monotonic” data had a massive impact.

Figure 3: Results for En-Ja. The Simul-MuST-C model (red lines) consistently shows lower latency and better reference-free quality (COMET-QE) than the baseline.

Figure 3 presents the results for En-Ja. Let’s break down the COMET-QE vs. ATD graph (bottom right).

  • The Red Line (Simul-MuST-C) is to the left of the Green Line (MuST-C). This means for the same quality, the new model has significantly lower latency.
  • The gap widens as k increases. This indicates the Simul-MuST-C model finishes translating much faster.

Interestingly, when looking at BLEU scores (top left), the baseline seems competitive. However, BLEU compares the output to a reference translation. Since the standard reference translations are “offline” (heavily reordered), the baseline model gets a higher score simply for mimicking that reordered style. When using COMET-QE (which doesn’t rely on reference translations and looks at meaning), the Simul-MuST-C model is superior.

2. English-to-Chinese (Moderate Improvement)

For Chinese, the results were positive but less dramatic than Japanese.

Table 5: Example of generated sentences in En-Zh. Simul-MuST-C maintains the position of the word “program” relative to the source.

Table 5 shows a qualitative comparison. In the MuST-C output, the word for “program” is pushed to the end of the sentence. In the Simul-MuST-C output, it appears early, matching the English “There is a program…”. This alignment means the model has to buffer less of the source before committing to an output, allowing it to generate translations sooner.

3. English-to-German (Minimal Change)

For English-to-German, the difference was negligible. Since German word order is already quite similar to English, the original dataset was already highly monotonic (a score above 0.92). Applying the Salami Technique didn’t change the structure enough to make a major difference in model performance.

Discussion: The Quality-Latency Trade-off

This research highlights a fascinating nuance in AI translation: Naturalness vs. Speed.

Ideally, a translation should be perfectly natural (native-sounding) and instantaneous. But in simultaneous interpretation, you often have to sacrifice a bit of grammatical perfection to keep up with the speaker.

The Simul-MuST-C dataset teaches models to make that same sacrifice. By training on data that follows the source word order, the model learns to produce translations that might feel slightly “interpreted” (segmented) rather than “written” (polished), but they arrive much faster and are semantically accurate.

This is particularly vital for grammatically distant pairs like English-Japanese. Without this technique, the model is forced to hallucinate or wait silently until the sentence ends—both of which are failures in a live conversation.

Conclusion

The researchers successfully demonstrated that we don’t need expensive, human-recorded interpretation data to train simultaneous translation systems. By using LLMs like GPT-4o to simulate the “Salami Technique,” we can generate vast amounts of training data whose target side follows the source word order.

Key Takeaways:

  1. Monotonicity Matters: Aligning word order between source and target reduces latency.
  2. LLMs as Data Generators: GPT-4o effectively simulated human interpreting strategies to create high-quality training data.
  3. Distance Dictates Impact: The method is a game-changer for grammatically distant languages (En-Ja) but less critical for similar ones (En-De).

This work paves the way for real-time translation systems that are not just accurate, but truly simultaneous, bridging communication gaps faster than ever before.