Imagine you are an English learner whose native language is Japanese. You are chatting with a friend and you type: “According to the test, my shortcomings are 靴下 and ご主人様.” (The Japanese words mean “socks” and “master.”)
To a bilingual speaker, this sentence makes perfect sense. It’s a classic example of Code-Switching (CSW)—the fluid alternation between two or more languages in a single conversation. It is a sign of linguistic competence, not confusion. However, if you feed that sentence into a standard Grammatical Error Correction (GEC) system (like the ones powering your favorite writing assistants), it will likely fail. It might flag the Japanese characters as “spelling errors,” try to delete them, or hallucinate English words in their place.
For millions of multilingual speakers, this is a daily frustration. Standard NLP tools demand monolingual purity, while human communication is increasingly hybrid.
In this post, we are diving deep into a fascinating research paper that tackles this exact problem. The researchers propose a novel pipeline that uses Large Language Models (LLMs) to generate synthetic training data, allowing them to train a GEC model that respects code-switching while fixing English grammar.
If you are a student of NLP, this paper is a masterclass in how to solve the “low-resource” data problem using modern generative AI.
The Problem: Why GEC Stumbles on Code-Switching
Grammatical Error Correction (GEC) has made massive strides in recent years. Modern systems can fix complex syntax errors, subject-verb agreement, and punctuation with high accuracy. However, these systems are almost exclusively trained on monolingual data.
When a monolingual model encounters a sentence like “When we call ダッシュボード, do we actually mean a glove compartment?” (ダッシュボード means “dashboard”), it faces two main issues:
- Representation Failure: Sequence-to-Sequence (Seq2Seq) models (like T5) often treat foreign tokens as noise. They might try to “translate” the Japanese into English or simply omit it, destroying the user’s intent.
- Ambiguity: Edit-based models (like GECToR) might recognize the tokens but struggle with the “switching points”—the boundaries where one language ends and another begins. The syntax often shifts at these boundaries, confusing the model about which grammatical rules apply.
The biggest hurdle to fixing this? Data Scarcity. There are massive datasets for English GEC, but datasets containing grammatically corrected code-switched text are virtually non-existent. Without data, you cannot train the model.
The Solution: Synthetic Data Generation
Since the researchers couldn’t find a large enough dataset of code-switched errors and corrections, they decided to build one. This is the core contribution of the paper: a robust methodology for Synthetic Data Generation.
They approached this in two steps:
- Step 1: Generate grammatically correct Code-Switched text.
- Step 2: Inject grammatical errors into that text to create “source” (wrong) and “target” (correct) pairs.
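To make the shape of the pipeline concrete, here is a minimal sketch in Python. The function names are mine, not the paper’s, and the two stubs correspond to the techniques covered in the sections below.

```python
# A minimal sketch of the two-step pipeline. Function names are hypothetical;
# the two stubs are sketched concretely in the sections that follow.

def generate_csw_sentences(seed_examples: list[str], n: int) -> list[str]:
    """Step 1 (stub): produce grammatically correct code-switched sentences."""
    raise NotImplementedError  # see the LLM prompting sketch below

def inject_errors(sentence: str) -> str:
    """Step 2 (stub): corrupt a correct sentence to create the 'source' side."""
    raise NotImplementedError  # see the error injection sketch below

def build_gec_pairs(seed_examples: list[str], n: int) -> list[dict]:
    """Pair each corrupted sentence with its clean original."""
    targets = generate_csw_sentences(seed_examples, n)
    return [{"source": inject_errors(t), "target": t} for t in targets]
```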
Step 1: Generating Code-Switched Text
The authors explored three different methods to generate mixed-language sentences. Understanding why the first two failed is just as important as understanding why the third succeeded.
Attempt 1: Translation-Based
The team took English sentences, parsed them into syntactic trees, and then randomly selected specific sub-trees (phrases) to translate into another language using Machine Translation. While this worked technically, it relied heavily on the quality of the parser and the translator. It often resulted in unnatural switches that didn’t reflect how humans actually speak.
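For intuition, here is a simplified sketch of this approach. spaCy noun chunks stand in for the paper’s randomly selected syntactic sub-trees, and translate() is a placeholder for whatever MT system you plug in.

```python
# Simplified translation-based CSW generation: parse, pick a phrase,
# translate only that span. Noun chunks stand in for arbitrary sub-trees.
import random
import spacy

nlp = spacy.load("en_core_web_sm")

def translate(text: str, target_lang: str) -> str:
    """Placeholder: plug in any machine translation model or API here."""
    raise NotImplementedError

def translation_based_csw(sentence: str, target_lang: str = "ja") -> str:
    doc = nlp(sentence)
    chunks = list(doc.noun_chunks)
    if not chunks:
        return sentence
    span = random.choice(chunks)  # randomly choose one phrase to switch
    before = doc[: span.start].text_with_ws
    after = doc[span.end :].text
    return f"{before}{translate(span.text, target_lang)} {after}".strip()
```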
Attempt 2: Parallel Corpus-Based
They used the Europarl corpus (specifically English paired with Spanish, French, and German). By aligning words between the languages, they swapped out equivalent phrases. Again, this was limited by the alignment algorithms and didn’t capture the nuanced “style” of learner language.
Attempt 3: LLM Prompting (The Winner)
This is where the paper leverages the power of modern LLMs. The researchers realized that code-switching isn’t just random word swapping; it has specific pragmatic functions (like quoting, emphasizing, or bridging a lexical gap).
They used OpenAI’s GPT-3.5 with a “few-shot” prompting strategy. They fed the LLM examples of authentic code-switched sentences from a small existing dataset (Lang-8) and asked it to generate new sentences following similar switching patterns but with different topics.
This approach was superior because LLMs inherently possess “world knowledge” and a grasp of pragmatics. If the prompt showed a sentence where a Japanese word was used because there was no English equivalent, the LLM could generate a new sentence mimicking that specific reason for switching.
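Here is an illustrative version of this setup using the OpenAI Python client. The prompt wording, seed sentences, and sampling settings below are my own guesses at the spirit of the approach, not the authors’ released prompt.

```python
# Illustrative few-shot generation of code-switched sentences.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical seed examples standing in for authentic Lang-8 sentences.
FEW_SHOT_EXAMPLES = [
    "I really miss my 故郷 when I see photos of the countryside.",
    "We ordered tapas and my friend said it was muy rico.",
]

prompt = (
    "Below are sentences written by English learners who naturally "
    "code-switch into their native language. Generate 5 new sentences "
    "that follow similar switching patterns (quoting, emphasis, lexical "
    "gaps) but cover different topics. Keep the English grammatically "
    "correct.\n\nExamples:\n" + "\n".join(FEW_SHOT_EXAMPLES)
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.9,  # higher temperature for more varied sentences
)
print(response.choices[0].message.content)
```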
Validating the Data
How do we know the LLM data was actually good? The researchers compared their synthetic datasets against genuine code-switched data using several linguistic metrics:
- CMI (Code Mixing Index): Measures the level of mixing.
- Burstiness: Measures whether switches happen in clusters or are spread out.
- I-Index (Probability of Switching): How likely a switch is to occur.

As shown in Table 1, the “LLM CSW” dataset (second column) aligns much more closely with the “Genuine CSW” data (first column) across almost every metric compared to the Translation or Corpus-based methods. For example, the I-Index (switching probability) for the LLM data is 0.21, identical to the genuine data. The Translation method overshot this at 0.30, creating unnatural text.
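If you want to compute two of these metrics yourself, here is a back-of-the-envelope implementation, assuming each token already carries a language tag (“en”, “ja”, or “other” for language-independent tokens like punctuation). The definitions follow the commonly used formulations.

```python
# Quick-and-dirty CMI and I-Index over token-level language tags.
from collections import Counter

def cmi(lang_tags: list[str]) -> float:
    """Code Mixing Index: 0 = monolingual, higher = more mixed."""
    n = len(lang_tags)
    u = sum(1 for t in lang_tags if t == "other")
    if n == u:
        return 0.0
    counts = Counter(t for t in lang_tags if t != "other")
    return 100 * (1 - max(counts.values()) / (n - u))

def i_index(lang_tags: list[str]) -> float:
    """Probability that any given token boundary is a switch point."""
    tags = [t for t in lang_tags if t != "other"]
    switches = sum(1 for a, b in zip(tags, tags[1:]) if a != b)
    return switches / max(len(tags) - 1, 1)

# "When we call ダッシュボード do we actually mean a glove compartment"
tags = ["en", "en", "en", "ja", "en", "en", "en", "en", "en", "en", "en"]
print(cmi(tags), i_index(tags))  # ~9.09, 0.2
```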
Step 2: Injecting Errors
Now that they had correct code-switched sentences, they needed to “break” them to create training data for the GEC model. They used two strategies:
- Rule-based Injection (PIE): They randomly inserted specific types of errors that are common among English as a Second Language (ESL) learners. These included errors with nouns, pronouns, punctuation, and determiners.
- Backtranslation: They took the correct sentences, translated them to another language, and then translated them back to English. This process naturally introduces the kind of phrasing errors and awkwardness found in learner writing.
The result was the creation of Syn-CSW PIE and Syn-CSW Rev-GECToR, two massive synthetic datasets ready for training.
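To make the rule-based strategy concrete, here is a toy corruptor in the spirit of PIE-style injection. The real pipeline uses a much richer inventory of learner-error patterns; this sketch just drops determiners, confuses prepositions, and deletes final punctuation.

```python
# Toy rule-based error injection: corrupt a correct sentence while
# leaving any code-switched tokens untouched.
import random

DETERMINERS = {"a", "an", "the"}
PREP_SWAPS = {"in": "on", "on": "in", "at": "in", "to": "at"}

def inject_errors(sentence: str, p: float = 0.3) -> str:
    out = []
    for tok in sentence.split():
        if tok.lower() in DETERMINERS and random.random() < p:
            continue  # determiner-deletion error
        if tok.lower() in PREP_SWAPS and random.random() < p:
            tok = PREP_SWAPS[tok.lower()]  # preposition-confusion error
        out.append(tok)
    corrupted = " ".join(out)
    if corrupted and corrupted[-1] in ".!?" and random.random() < p:
        corrupted = corrupted[:-1]  # punctuation-deletion error
    return corrupted

target = "According to the test, my shortcomings are 靴下 and ご主人様."
print(inject_errors(target))  # e.g. "According at the test, my shortcomings..."
```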
The Model: GECToR Architecture
For the actual correction model, the researchers chose GECToR (Grammatical Error Correction: Tag, Not Rewrite).
Why GECToR?
Unlike Seq2Seq models (like ChatGPT or T5) which read a sentence and rewrite it entirely, GECToR works by tagging each token. It looks at a word and decides: Keep, Delete, Append, or Replace.
This is crucial for code-switching. A Seq2Seq model often sees a Japanese or Spanish word and gets confused, potentially omitting it in the output. GECToR is less likely to accidentally delete the non-English text because its default action is to “Keep” a token unless it is explicitly wrong.
The researchers modified GECToR by adding a specific class to the error detection head to handle CSW tokens, essentially teaching the model to recognize foreign script as a valid part of the sentence structure rather than noise.
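To see why tagging is safer, here is a minimal sketch of how a tag sequence turns into a correction. The tag names mimic GECToR’s scheme ($KEEP, $DELETE, $APPEND_x, $REPLACE_x); the Transformer encoder that actually predicts the tags is elided.

```python
# Apply GECToR-style edit tags to a token sequence.

def apply_tags(tokens: list[str], tags: list[str]) -> list[str]:
    out = []
    for tok, tag in zip(tokens, tags):
        if tag == "$DELETE":
            continue
        if tag.startswith("$REPLACE_"):
            out.append(tag[len("$REPLACE_"):])
        else:
            out.append(tok)  # $KEEP is the default, safe action
            if tag.startswith("$APPEND_"):
                out.append(tag[len("$APPEND_"):])
    return out

tokens = ["When", "we", "call", "ダッシュボード", "do", "we", "mean", "it"]
tags   = ["$KEEP", "$KEEP", "$KEEP", "$APPEND_,",  # CSW token kept as-is
          "$KEEP", "$KEEP", "$KEEP", "$REPLACE_it?"]
print(" ".join(apply_tags(tokens, tags)))
# -> "When we call ダッシュボード , do we mean it?"
```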
The Training Curriculum
Training a neural network is a bit like teaching a student. You don’t start with the hardest exam questions. You start with general knowledge and slowly specialize. The authors used a 3-Stage Training Schedule to fine-tune their model.
Stage 1: Pre-training
The model was first exposed to a massive dataset (the “1 Billion Word” corpus) mixed with their synthetic CSW data. This stage builds a robust foundation of English grammar.
Stage 2: Fine-tuning
In this stage, the model was trained on a mixture of high-quality GEC datasets (like NUCLE and FCE) along with the synthetic CSW datasets.

As seen in Table 4, the Lang-8 corpus dominates this stage (80.54%), but the Syn-CSW PIE data (5.73%) ensures the model begins to see code-switching patterns mixed in with standard learner errors.
Stage 3: Refinement
This is the final polish. The model is trained on the highest quality data available (W&I Locness) and a specific sampling of genuine and synthetic CSW data.

Table 5 shows a shift in strategy. Here, the W&I Locness dataset makes up the majority (67.23%), but the contribution of synthetic CSW data is significant (nearly 28% combined). This forces the model to specialize in the target task immediately before testing.
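Schematically, the whole curriculum boils down to sequential fine-tuning over increasingly specialized data mixtures. In the sketch below, only the percentages quoted above come from the paper; the rest is illustrative scaffolding.

```python
# The three-stage schedule as sequential fine-tuning over data mixtures.
from typing import Callable

STAGES = [
    ("1: pre-training", {"1B-word corpus + synthetic errors": "bulk",
                         "Syn-CSW": "some"}),
    ("2: fine-tuning",  {"Lang-8": "80.54%", "NUCLE/FCE/others": "rest",
                         "Syn-CSW PIE": "5.73%"}),
    ("3: refinement",   {"W&I Locness": "67.23%",
                         "genuine + synthetic CSW": "~28%"}),
]

def run_curriculum(train_stage: Callable[[str, dict], None]) -> None:
    """Each stage starts from the weights produced by the previous one."""
    for name, mixture in STAGES:
        train_stage(name, mixture)

# Plug in a real trainer here; for now, just print the plan.
run_curriculum(lambda name, mix: print(f"Stage {name}: {mix}"))
```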
Experimental Results
So, did it work? The researchers compared their model against two baselines:
- T5-Small: A standard Seq2Seq model.
- Standard GECToR: A state-of-the-art monolingual model.
They evaluated the models using Precision, Recall, and F0.5 Score (a metric that weighs precision higher than recall, which is standard in GEC because you don’t want to “correct” things that aren’t actually wrong).
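For reference, F0.5 is just the F-beta score with beta = 0.5, which makes precision count for more than recall:

```python
# F-beta: F0.5 = (1 + 0.5^2) * P * R / (0.5^2 * P + R)

def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# A model that is precise but conservative still scores well:
print(round(f_beta(0.70, 0.40), 3))  # 0.609
```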

Note: In the table above, the column “Genuine CSW” represents performance on the specific code-switched test set.
The results in Table 2 are striking:
- T5-Small failed spectacularly on the CSW data, achieving an F0.5 score of only 13.09. It simply couldn’t handle the mixed languages.
- Standard GECToR did better (53.67), but still struggled.
- The Proposed Model (Stage 3) achieved an F0.5 score of 55.02 on the CSW data.
But the real magic happened during Inference Tweaking. By adjusting the confidence thresholds (how sure the model needs to be before making a correction), the researchers boosted the F0.5 score on Code-Switched text to 63.71.
This represents a massive improvement over the baseline T5 model and a statistically significant improvement over the standard GECToR system.
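In GECToR-style systems, inference tweaking typically means two knobs: a confidence bias added to the probability of the $KEEP tag, and a minimum error probability below which edits are suppressed. Here is a sketch; the exact values are illustrative, not the paper’s tuned hyperparameters.

```python
# Inference tweaking: bias the model toward "do nothing" and drop
# low-confidence edits, trading recall for precision.
import numpy as np

def pick_tags(tag_probs: np.ndarray, tag_vocab: list[str],
              keep_bias: float = 0.2, min_error_prob: float = 0.6) -> list[str]:
    keep_idx = tag_vocab.index("$KEEP")
    biased = tag_probs.copy()
    biased[:, keep_idx] += keep_bias  # make "keep" more attractive
    tags = []
    for row in biased:
        best = int(row.argmax())
        if best != keep_idx and row[best] < min_error_prob:
            best = keep_idx  # suppress low-confidence edits entirely
        tags.append(tag_vocab[best])
    return tags

tag_vocab = ["$KEEP", "$DELETE", "$APPEND_,"]
probs = np.array([[0.50, 0.45, 0.05],   # marginal edit -> suppressed
                  [0.10, 0.85, 0.05]])  # confident edit -> kept
print(pick_tags(probs, tag_vocab))      # ['$KEEP', '$DELETE']
```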
The “Trade-Off”
You might notice in the table that the model’s performance on the standard “BEA-2019 Test” (pure English) dipped slightly compared to the standard GECToR (66.59 vs 71.22).
This highlights a classic “Specialist vs. Generalist” trade-off. By optimizing the model to be inclusive of code-switching, it became slightly less aggressive or accurate on purely monolingual text. However, for the target audience—ESL learners who code-switch—this is a worthy trade. The model is no longer “breaking” their sentences.
Visualizing the Improvements
The improvement isn’t just in the numbers; it’s in the types of errors corrected.
- Baseline Model: Often flagged CSW tokens as `NOUN` errors or `SPELLING` errors.
- New Model: Correctly identified `PUNCTUATION`, `VERB TENSE`, and `DETERMINER` errors around the code-switched text without disturbing the foreign words.

Looking at the error breakdown (Table 6 in the paper), the proposed model achieved a recall of 46.43% on Pronoun errors in CSW text—a notoriously difficult category because pronouns often drop or switch in multilingual speech.
Conclusion and Implications
This research paper offers a blueprint for making NLP more inclusive. By using LLMs to “hallucinate” realistic training data, the authors overcame the data scarcity bottleneck that has plagued Code-Switching research for years.
Here are the key takeaways for students:
- LLMs as Data Generators: When you don’t have enough data, you might be able to generate it. Few-shot prompting with GPT-3.5 proved more effective than complex translation pipelines.
- Architecture Matters: For tasks involving mixed languages, tagging architectures (like GECToR) often behave more predictably than generation architectures (like Seq2Seq).
- Curriculum Learning: A multi-stage training schedule allows a model to learn general rules before adapting to specific, difficult distributions.
Most importantly, this work validates the linguistic identity of ESL learners. Instead of forcing students to strip their native language from their writing to satisfy an algorithm, this model meets them where they are. It corrects their English grammar without penalizing their multilingualism. As we move toward more globalized AI, this kind of sensitivity to “real-world” language use will be essential.