Introduction
In the rapidly evolving world of Natural Language Processing (NLP), we often take for granted how much data is available—for English. If you want to train a chatbot to answer questions about history, science, or pop culture in English, you have access to massive datasets like SQuAD or HotpotQA. But what happens if you want to build that same system for Swahili, Finnish, or Bengali?
The data simply isn’t there in the same volume.
This creates a significant “digital divide.” To bridge it, researchers rely on Automatic Question Generation (QG). The goal of QG is to have an AI read a paragraph of text (context) and an answer, and then generate the corresponding question. If we can do this automatically, we can synthesize massive training datasets for low-resource languages without hiring thousands of human annotators.
However, there is a catch. Most current methods rely on Cross-Lingual Transfer (XLT). This usually involves training a model on English data and then asking it to perform the task in a target language. The problem? The models often suffer from “interrogative code-switching.” They might use words from the target language, but they structure the sentence like English, or worse, they leave the crucial question words (Who, What, Where) in English. Imagine a generated question that looks like: “How long is Pyhäjärven pituus?” instead of the proper Finnish structure.
In this post, we will dive deep into a new research paper that proposes QuIST (Questions by learning Interrogative Structures in Target languages). This novel method solves the code-switching problem by teaching small language models the structure of questions in a target language using just a handful of examples, without requiring massive training datasets in that language.
The Background: Cross-Lingual Transfer and Its Flaws
Before understanding QuIST, we need to understand the baseline approach: Zero-shot Cross-lingual Transfer.
In a typical setup, researchers take a Multilingual Pretrained Language Model (mPLM), such as mBERT or mT5. These models have seen text in 100+ languages during their pre-training phase. To teach them Question Generation (QG), researchers fine-tune them on a high-quality English QA dataset.
The logic is that the model learns the concept of creating a question. When you feed it a context in Korean, it relies on its internal multilingual knowledge to generate a Korean question.
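To make this baseline concrete, here is a minimal sketch of answer-aware QG fine-tuning with Hugging Face Transformers. The input template, checkpoint name, and examples are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch of the zero-shot cross-lingual transfer baseline (not the
# authors' exact code): fine-tune a multilingual seq2seq model on English
# answer-aware QG, then run inference directly on another language.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/mt5-small"  # small checkpoint chosen only for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def build_input(context: str, answer: str) -> str:
    # A common answer-aware QG format; the exact template is an assumption.
    return f"answer: {answer} context: {context}"

# --- Training (English only): (context, answer) -> question pairs ---
source = build_input(
    context="The first World Cup was held in Uruguay in 1930.",
    answer="1930",
)
labels = tokenizer("When was the first World Cup held?", return_tensors="pt").input_ids
inputs = tokenizer(source, return_tensors="pt")
loss = model(**inputs, labels=labels).loss  # an optimizer step would use this loss

# --- Inference (e.g., Korean): the model must rely on multilingual pretraining ---
korean_input = build_input(context="한국의 수도는 서울이다.", answer="서울")
generated = model.generate(**tokenizer(korean_input, return_tensors="pt"), max_new_tokens=32)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```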
The Problem: Interrogative Code-Switching
While this sounds good in theory, in practice, the model often experiences “catastrophic forgetting” regarding the target language’s grammar. Because it was fine-tuned heavily on English QG data, it overfits to English sentence structures.
The result is Interrogative Code-Switching. The model generates a sentence that is partially in the target language but retains English interrogative particles (like “When did” or “How many”) or English word order. This renders the generated data useless for training robust QA systems in the target language.
Previous attempts to fix this involved freezing parts of the model or using “adapters,” but these solutions often required some amount of training data in the target language or complex architectural changes.
The Solution: QuIST
The researchers propose a method called QuIST that is both simple and highly effective. The core philosophy is to separate the content of the question from the structure of the question.
QuIST operates in two distinct stages:
- Question Type Classification (QTC): Figure out what kind of question we need (e.g., asking about a person, a place, or a time).
- Question Generation (QG) with Exemplars: Use a few standard examples (exemplars) of that question type to guide the model’s generation process.
Let’s break down the architecture.

As shown in Figure 2 above, the process works as follows:
Stage 1: Question Type Classification (QTC)
A single answer can prompt different questions depending on the context. For example, the number “1930” could be the answer to “When was the first World Cup?” (Time) or “How many items were sold?” (Quantity).
Therefore, the system first needs to understand the intent. The researchers defined eight universal question types based on English interrogatives:
- When
- Where
- What
- Which
- Who
- Why
- How (way/manner)
- How (number/quantity)
The QTC model (based on mBERT) takes the Answer and the Context as input and predicts one of these eight labels. Crucially, this classifier is trained only on English data, but because mBERT is multilingual, it works surprisingly well when inference is run on other languages.
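A hedged sketch of this stage is below: an mBERT sequence classifier over the eight types, with the answer and context as the two input segments. The label order and input pairing are assumptions for illustration; in the actual method the classifier is fine-tuned on English QA data.

```python
# Sketch of Stage 1 (QTC): an mBERT classifier over the eight question types.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

QUESTION_TYPES = ["When", "Where", "What", "Which",
                  "Who", "Why", "How (manner)", "How (quantity)"]

qtc_tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
qtc_model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=len(QUESTION_TYPES)
)  # fine-tuned on English data in the real pipeline

def predict_question_type(answer: str, context: str) -> str:
    # Answer and context form the two text segments of the encoder input.
    enc = qtc_tokenizer(answer, context, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = qtc_model(**enc).logits
    return QUESTION_TYPES[int(logits.argmax(dim=-1))]

# Thanks to mBERT's multilingual encoder, this also works on non-English inputs.
print(predict_question_type("1930", "The first FIFA World Cup was held in 1930."))
```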
Stage 2: Generation with Exemplars
This is where QuIST shines. Once the system knows the question type (e.g., “Where”), it retrieves a set of Question Exemplars.
These exemplars are simply a small list of generic questions in the target language that fit that type. For example, if the target language is Korean and the type is “Where,” the exemplars might look like:
- Where do you find a lead out? (shown in English here; the actual exemplar is written in Korean)
- Where is the library? (shown in English here; the actual exemplar is written in Korean)
The QG model (based on mT5) receives three inputs:
- The Context (the paragraph).
- The Answer (the specific span of text).
- The Exemplars (the template questions).
The model is trained to look at the exemplars to understand the syntactic structure (how to form a “Where” question in this language) and look at the context/answer to get the semantic content.
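The sketch below shows one way to pack these three inputs into a single mT5 prompt. The separator token, field names, and checkpoint are assumptions made for illustration; the paper defines its own input format.

```python
# Sketch of Stage 2: context + answer + type-specific exemplars in one mT5 input.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

qg_tokenizer = AutoTokenizer.from_pretrained("google/mt5-large")
qg_model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-large")

def build_qg_input(context: str, answer: str, exemplars: list[str]) -> str:
    # "<exemplar>" is a hypothetical separator, not the paper's exact token.
    exemplar_block = " ".join(f"<exemplar> {q}" for q in exemplars)
    return f"answer: {answer} context: {context} {exemplar_block}"

def generate_question(context: str, answer: str, exemplars: list[str]) -> str:
    inputs = qg_tokenizer(build_qg_input(context, answer, exemplars),
                          return_tensors="pt", truncation=True)
    output = qg_model.generate(**inputs, max_new_tokens=48, num_beams=4)
    return qg_tokenizer.decode(output[0], skip_special_tokens=True)
```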
The Training vs. Inference Trick
Here is the most important takeaway for students: The model never sees the target language during training.
- Training: The model is trained using English contexts, English answers, and English exemplars. It learns the skill of using exemplars to guide its sentence structure.
- Inference: When deployed for a new language (e.g., Swahili), the researchers simply swap the English exemplars for Swahili exemplars. Because the model learned the skill of following the exemplar’s structure, it automatically adapts to the Swahili sentence structure without any parameter updates.
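The asymmetry boils down to swapping one lookup table at inference time, as in this minimal sketch (the exemplar sentences are illustrative; in the real setup they are hand-written by native speakers):

```python
# Training uses only English exemplars; inference swaps in target-language ones
# for the predicted question type, with no parameter updates.
ENGLISH_EXEMPLARS = {
    "Where": ["Where is the library?", "Where was the treaty signed?"],
    "When": ["When did the war end?", "When was the company founded?"],
}

SWAHILI_EXEMPLARS = {
    # Illustrative Swahili "Where" questions (hand-written in practice).
    "Where": ["Maktaba iko wapi?", "Mkataba ulitiwa saini wapi?"],
}

def pick_exemplars(question_type: str, mode: str) -> list[str]:
    # Training: always English. Inference: the target language, weights frozen.
    table = ENGLISH_EXEMPLARS if mode == "train" else SWAHILI_EXEMPLARS
    return table[question_type]

train_exemplars = pick_exemplars("Where", mode="train")      # English structure
test_exemplars = pick_exemplars("Where", mode="inference")   # Swahili structure
```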

Table 1 illustrates this efficiency. While baseline models often require target language data (\(Q^{tgt}\) or \(S^{tgt}\)) during training, QuIST only requires English data (\(C-Q-A^{en}\)) and English exemplars (\(Q^{en}\)) for training. The target language exemplars are only needed at the very end, during inference.
Experimental Results
The researchers evaluated QuIST across nine linguistically diverse languages, including Bengali (bn), German (de), Finnish (fi), Korean (ko), and Swahili (sw). They compared it against several strong baselines, including standard fine-tuned mT5 models and adapter-based approaches.
Automatic Evaluation
The primary metrics used were BLEU, METEOR, and ROUGE-L, which measure how closely the generated text matches human-written references.
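For readers who want to reproduce this kind of scoring, here is a minimal sketch of computing ROUGE-L with the rouge-score package; the example strings are made up and this is not the authors' evaluation script.

```python
# ROUGE-L measures overlap via the longest common subsequence between the
# generated question and the human-written reference.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)

reference = "When was the first World Cup held?"
generated = "When did the first World Cup take place?"

scores = scorer.score(reference, generated)
print(scores["rougeL"].fmeasure)
```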

Table 2 presents the ROUGE-L scores. Here, we see distinct trends:
- Baselines Struggle: The simple Baseline_EncDec (fine-tuning the encoder and decoder on English) performs terribly on languages like Bengali (0.72) and Korean (2.17), likely due to severe code-switching.
- QuIST Dominates: QuIST (specifically utilizing 15 exemplars, noted as QuIST15) significantly outperforms the baselines. For instance, in Finnish, QuIST scores 38.79 compared to the best baseline’s 20.26.
- Competing with Giants: The table also compares QuIST (using a relatively small 1.2B parameter model) against GPT-3.5-turbo. QuIST achieves comparable, and in some cases (like Korean and Swahili), superior performance to the massive commercial LLM.
Analyzing Code-Switching
The core motivation of this paper was to stop models from using English grammar in non-English questions. Did it work?

Figure 3 provides a stark visualization. The bars represent the percentage of generated questions that contain code-switching (using the wrong language’s words or grammar).
- Look at B-EncDec (the first bar in each group): it is almost 100% for many languages. The model has completely forgotten how to speak the target language.
- Look at QuIST (the darker blue bar): the rate of code-switching drops dramatically, often to below 10-20%. This confirms that providing the model with exemplars effectively reminds it of the target language’s interrogative structure.
Human Evaluation
Automatic metrics don’t always tell the whole story, especially with grammar. The authors employed native speakers to rate the questions on Grammar, Clarity, and Answerability.

Table 3 highlights the human preference. In languages like German (de) and Indonesian (id), QuIST achieves near-perfect scores for clarity and answerability.
- A Note on Swahili (sw): Interestingly, while QuIST performed well on automated metrics, human evaluators rated it lower than GPT-3.5 on answerability. This suggests that while QuIST gets the grammar right (fixing the code-switching), it sometimes struggles to grab the correct content in very low-resource languages compared to massive models like GPT.
To visualize this, let’s look at a specific Swahili example.

In Figure 4, we see the challenge.
- The Context discusses Malawi, Zambia, and Zimbabwe.
- Baseline_EncDec produces a “Frankenstein” sentence mixing English (“Along with…”) and Swahili.
- QuIST produces a grammatically correct Swahili sentence: “Ni nchi ipi iliyohesabiwa kuwa sehemu ya Afrika ya Kusini?” (Which country is considered part of South Africa?).
- However, the human ground truth was asking specifically about the current name of Northern Rhodesia. While QuIST’s grammar was perfect, the semantic focus was slightly off, which explains the variation in human ratings.
Why This Matters: Data Augmentation
The ultimate goal of QG is often to generate synthetic data to train other AI models (Question Answering systems). If QuIST generates better questions, does that help us build better QA bots?

Table 4 confirms this utility. The researchers trained QA models using synthetic data generated by the different methods.
- English-only: Training a QA model only on English data yields an average score of 49.86.
- QuIST-augmented: Training on data generated by QuIST boosts that average to 59.65.
- Notably, QuIST-generated data proved more effective for training downstream models than data generated by GPT-3.5-turbo (57.79). This is a massive win for open-source, efficient research.
Analysis of the Method
How Many Exemplars Do You Need?
One might wonder: do we need thousands of example questions? The paper explores this in detail.

Table 5 and the comparisons in Table 2 (QuIST1, QuIST5, QuIST10, QuIST15) show that performance generally improves as you add more exemplars, up to about 10 or 15.
- Human vs. Machine Translation: Table 5 (Row 1 vs Row 2) shows that using human-written exemplars (QuIST) is far superior to using machine-translated exemplars. This highlights that high-quality, native syntactic structures are essential for the model to learn properly.
Static vs. Dynamic Exemplars
Should the exemplars be random every time, or fixed? The researchers found that using a static (fixed) set of exemplars for each question type worked best.

As seen in Table 7, the “Static” approach consistently outperforms the “Dynamic” one. This suggests that the model benefits from stability—seeing the same high-quality structural templates repeatedly during training allows it to internalize the pattern more effectively than constantly switching templates.
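The distinction is easy to picture in code. The sketch below contrasts the two strategies; the exemplar pool and subset size are illustrative assumptions.

```python
# Static: one fixed exemplar subset per question type, reused for every example.
# Dynamic: a freshly sampled subset for each example.
import random

WHERE_POOL = [
    "Where is the library?",
    "Where was the author born?",
    "Where did the battle take place?",
    "Where is the company headquartered?",
    "Where was the photograph taken?",
]

STATIC_WHERE_EXEMPLARS = WHERE_POOL[:3]  # fixed once, seen repeatedly

def dynamic_where_exemplars(k: int = 3) -> list[str]:
    return random.sample(WHERE_POOL, k)  # changes every call
```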
Prompting GPT-3.5
The researchers also explored if their method (QTC + Exemplars) could help Large Language Models like GPT-3.5.

Using templates like the one in Figure 6, they found that even powerful LLMs benefit from this structured approach, particularly in low-resource languages.
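As a rough illustration of this idea, the sketch below builds a prompt that embeds target-language exemplars for the predicted question type and sends it to the chat API. The wording of the prompt, the example strings, and the exemplar contents are assumptions, not the paper's actual template from Figure 6.

```python
# Hedged sketch: exemplar-guided prompting of gpt-3.5-turbo for QG.
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

def build_prompt(context: str, answer: str, exemplars: list[str]) -> str:
    exemplar_lines = "\n".join(f"- {q}" for q in exemplars)
    return (
        "Write one question in the same language and interrogative structure "
        "as the example questions below. The question must be answerable from "
        "the context, with the given answer.\n\n"
        f"Example questions:\n{exemplar_lines}\n\n"
        f"Context: {context}\nAnswer: {answer}\nQuestion:"
    )

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": build_prompt(
        context="Zambia was formerly known as Northern Rhodesia.",
        answer="Zambia",
        # In practice these would be hand-written target-language exemplars.
        exemplars=["Which country borders Tanzania to the south?"],
    )}],
)
print(response.choices[0].message.content)
```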
Conclusion
The QuIST paper presents a compelling narrative for the future of multilingual AI. It moves away from the “brute force” approach of massive data collection or massive model scaling. Instead, it relies on a clever linguistic insight: languages differ in vocabulary, but interrogative structures are repetitive and learnable.
By decoupling the “what to ask” (content) from the “how to ask” (structure), QuIST allows a model trained solely on English to generate high-quality questions in languages as diverse as Finnish and Telugu.
Key Takeaways for Students:
- Efficiency: You don’t always need to train on the target language to generate text in it.
- Exemplars: Providing “templates” (exemplars) during inference can guide a model’s syntax, solving the code-switching problem.
- Scalability: This method can be applied to any new language simply by writing 10-15 example questions, making it incredibly scalable for low-resource languages.
As AI continues to expand globally, techniques like QuIST will be essential in ensuring that technology serves speakers of all languages, not just English.