Introduction: The Linguistic Divide in AI

The current landscape of Artificial Intelligence is marked by a massive linguistic disparity. While Large Language Models (LLMs) like GPT-4 and Claude have revolutionized how we interact with technology, their capabilities are heavily skewed toward high-resource languages—primarily English.

For the 237 million native speakers of Bangla—the fifth most spoken language in the world—this gap is palpable. While proprietary giants like GPT-4 perform reasonably well, they are closed systems. Meanwhile, open-source attempts to build “Bangla LLMs” have largely struggled, often failing to outperform even the base models they were built upon.

Why has it been so hard to build a good open-source Bangla LLM? The answer usually lies in the data.

In this deep dive, we will explore a new research paper titled “TigerLLM - A Family of Bangla Large Language Models.” The researchers take a different approach to solving this problem. Instead of throwing massive amounts of low-quality data at a model, they focus on precision, culture, and educational content. They introduce two new models (1B and 9B parameters) that set a new state-of-the-art for Bangla, proving that in the world of LLMs, quality often beats quantity.

The Problem: Garbage In, Garbage Out

To understand why TigerLLM is significant, we first need to look at the existing landscape of Bangla NLP. Previous initiatives often relied on a method called model distillation using translated datasets.

Here is the typical (and flawed) workflow for many low-resource language models:

  1. Take a high-quality English instruction dataset (like Alpaca or OpenOrca).
  2. Run it through Google Translate to convert it to Bangla.
  3. Fine-tune a model on this translated text.
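
To make the contrast concrete, here is a minimal Python sketch of that translate-then-finetune recipe. It is purely illustrative: the translate_to_bangla helper is hypothetical, standing in for whatever machine-translation API a given project calls before fine-tuning.

```python
# Minimal sketch of the flawed translate-then-finetune recipe (illustrative only).
# `translate_to_bangla` is a hypothetical stand-in for a machine-translation API call.

def translate_to_bangla(text: str) -> str:
    """Hypothetical wrapper around a machine-translation service."""
    raise NotImplementedError("Plug in an MT provider here.")

def build_translated_dataset(english_pairs: list[dict]) -> list[dict]:
    """Translate an English instruction dataset (e.g., Alpaca-style) into Bangla, pair by pair."""
    translated = []
    for pair in english_pairs:
        translated.append({
            "instruction": translate_to_bangla(pair["instruction"]),
            "response": translate_to_bangla(pair["response"]),
        })
    return translated  # Idiomatic phrasing and cultural nuance are typically lost at this step.
```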

The result? Models that speak “Translated Bangla”—awkward phrasing, loss of cultural nuance, and grammatical errors. As shown in the comparative analysis below, many existing initiatives like titu-Gemma or Bangla-LLaMA rely on these translated datasets and often lack reproducibility.

Table 1: Comparative analysis of Bangla LLM initiatives and their methodological approaches.

As you can see in Table 1, TigerLLM distinguishes itself by avoiding translated datasets entirely. Instead, it relies on two novel, native resources: the Bangla-TextBook corpus and the Bangla-Instruct dataset.

Ingredient 1: The Bangla-TextBook Corpus

The researchers argue that for a model to truly understand a language, it needs to understand the educational foundation of that language’s speakers.

Most LLMs scrape the web (Common Crawl), which is noisy and full of errors. The TigerLLM team took a different route. They curated the Bangla-TextBook corpus, consisting of 10 million tokens derived exclusively from open-source educational materials published by the National Curriculum and Textbook Board of Bangladesh.

This corpus covers Grades 6 through 12, spanning literature, science, history, and more. While 10 million tokens is small compared to the trillions used to train GPT-4, textbooks offer exceptionally high information density and grammatical correctness. This allows the model to learn formal, correct Bangla rather than internet slang or broken translations.

Ingredient 2: The Bangla-Instruct Pipeline

The second, and perhaps most innovative, component is the Bangla-Instruct dataset: a collection of 100,000 instruction-response pairs designed to teach the model how to follow commands and answer questions.

Rather than translating English instructions, the authors designed a sophisticated pipeline involving human volunteers and two state-of-the-art teacher models: Claude-3.5-Sonnet and GPT-4o.

The Seed-Generate-Filter Cycle

The process begins with humans. Fifty university students from Bangladesh created 500 “seed tasks” covering diverse topics like cultural heritage, mathematics, and local social issues.

As illustrated in Figure 1 below, the pipeline uses these seeds to generate new, synthetic data that retains the quality of the original human inputs.

Figure 1: The Bangla-Instruct generation pipeline.

  1. Seed Pool: The process starts with the 500 human-written tasks.
  2. Instruction Generation: Claude-3.5 is prompted to look at the seeds and generate new instructions that follow similar linguistic patterns.
  3. Task Identification: GPT-4o analyzes the new instruction to determine what type of task it is (e.g., open-ended generation, classification).
  4. Response Drafting: Claude-3.5 writes a detailed response to the new instruction.
  5. Filtering: This is the crucial quality control step.
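
The five steps above are straightforward to express in code. Below is a minimal sketch of the generation portion of the cycle, assuming hypothetical claude_generate and gpt4o_classify helpers that wrap the respective teacher-model APIs; it is an illustration, not the authors' implementation.

```python
# Sketch of the seed-generate-draft portion of the cycle (illustrative, not the authors' code).
# `claude_generate` and `gpt4o_classify` are hypothetical wrappers around the teacher-model APIs.
import random

def generate_instruction(seed_pool: list[str], claude_generate) -> str:
    """Step 2: prompt Claude-3.5 with a few sampled seeds to produce a new Bangla instruction."""
    examples = "\n".join(random.sample(seed_pool, k=min(3, len(seed_pool))))
    prompt = (
        "Here are example Bangla tasks:\n" + examples +
        "\nWrite one new Bangla instruction with a similar style and topic coverage."
    )
    return claude_generate(prompt)

def build_candidate_pair(seed_pool: list[str], claude_generate, gpt4o_classify) -> dict:
    """Steps 2-4: new instruction, task identification, then a drafted response."""
    instruction = generate_instruction(seed_pool, claude_generate)
    task_type = gpt4o_classify(instruction)                               # e.g., "classification"
    response = claude_generate(f"Task type: {task_type}\n{instruction}")  # detailed Bangla response
    return {"instruction": instruction, "response": response, "task_type": task_type}
```

Each candidate pair then moves on to the filtering step described next.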

The Four-Pillar Filter

Automated data generation can spiral out of control if unchecked. To prevent this, the researchers implemented a strict filtering function, with GPT-4o acting as the judge.

The acceptance of a generated pair \((i, r)\) is determined by the following logic:

\[
\mathcal{F}(i, r) = \mathcal{L}(i, r) \,\land\, \mathcal{C}(i, r) \,\land\, \mathcal{Q}(i, r) \,\land\, \mathcal{N}(i, r)
\]

The function \(\mathcal{F}\) accepts a data pair only if it satisfies all four conditions:

  • \(\mathcal{L}\) (Language): Is the grammar correct? Is the Bengali Word Ratio > 95%?
  • \(\mathcal{C}\) (Culture): Is it culturally sensitive? Does it avoid religious or political bias?
  • \(\mathcal{Q}\) (Quality): Is the response coherent and factually accurate?
  • \(\mathcal{N}\) (Novelty): Is this task sufficiently different from what we already have?

Approximately 63% of generated pairs pass this filter, ensuring that the final dataset of 100,000 pairs is clean, diverse, and natively Bangla.
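
A minimal sketch of how this four-pillar check could be wired up is shown below. The gpt4o_judge helper is a hypothetical LLM-as-judge call returning per-criterion verdicts; only the Bengali-word-ratio part of the language check is implemented directly, via a Unicode range test.

```python
# Sketch of the four-pillar filter. `gpt4o_judge` is a hypothetical judge call that
# returns booleans for grammar, culture, quality, and novelty.

def bengali_word_ratio(text: str) -> float:
    """Fraction of whitespace-separated words written mostly in the Bengali Unicode block (U+0980-U+09FF)."""
    words = text.split()
    if not words:
        return 0.0
    def is_bengali(word: str) -> bool:
        bengali_chars = sum(1 for ch in word if "\u0980" <= ch <= "\u09FF")
        return bengali_chars > len(word) / 2
    return sum(is_bengali(w) for w in words) / len(words)

def accept_pair(instruction: str, response: str, gpt4o_judge) -> bool:
    """Return True only if all four criteria (L, C, Q, N) hold for the pair."""
    verdict = gpt4o_judge(instruction, response)  # hypothetical LLM-as-judge call
    language_ok = bengali_word_ratio(instruction + " " + response) > 0.95 and verdict["grammar"]
    return language_ok and verdict["culture"] and verdict["quality"] and verdict["novelty"]
```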

The Architecture: Training TigerLLM

With the data prepared, the researchers moved to the training phase. They chose two strong base models to build upon: LLaMA 3.2 (1B) and Gemma 2 (9B).

The training process was divided into two distinct stages: Continual Pretraining and Finetuning.

Figure 2: Evolution of TigerLLM showing the pretraining and finetuning stages.

Stage 1: Continual Pretraining

In this stage, the base models (which already understand general language concepts) are immersed in the Bangla-TextBook corpus. The goal is to adapt the models’ internal weights to the specific nuances of the Bangla language.

This was done on a cluster of NVIDIA A100 GPUs. As shown in the loss curve below, the model rapidly learned the patterns of the textbook data, with the loss stabilizing near zero, indicating successful knowledge absorption.

Figure 3: Continual Pretraining - Loss per Steps.
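
For readers who want to see what this stage looks like in practice, here is a minimal sketch of continual pretraining with the Hugging Face transformers Trainer. The checkpoint name, corpus file path, and every hyperparameter not reported in the post are assumptions, not the authors' exact setup.

```python
# Sketch of continual pretraining on a Bangla-TextBook-style corpus with Hugging Face transformers.
# Checkpoint name, file path, and hyperparameters are assumptions for illustration.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "meta-llama/Llama-3.2-1B"                # base model behind TigerLLM (1B)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token             # Llama tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Plain-text corpus, one passage per line (path is a placeholder).
corpus = load_dataset("text", data_files={"train": "bangla_textbook.txt"})["train"]
tokenized = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=2048),
    batched=True, remove_columns=["text"],
)

args = TrainingArguments(
    output_dir="tigerllm-1b-continual-pretrain",
    per_device_train_batch_size=4,      # assumption
    gradient_accumulation_steps=8,      # assumption
    learning_rate=2e-5,                 # assumption
    num_train_epochs=1,                 # assumption
    bf16=True,
    logging_steps=50,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # causal-LM objective
)
trainer.train()
```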

Stage 2: Instruction Finetuning

Once the models “learned” the textbook content, they needed to learn how to behave as assistants. This is where the Bangla-Instruct dataset comes in. The researchers used full fine-tuning (rather than parameter-efficient methods like LoRA) to maximize performance.

The training dynamics here were stable: a sharp initial drop in loss as the model learned the instruction-response format, followed by steady convergence.

Figure 4: Finetuning - Loss per Steps.
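
One detail worth seeing concretely is how instruction-response pairs become training sequences for full fine-tuning. The sketch below uses an assumed prompt template; the Bangla headers are illustrative, not the authors' exact format.

```python
# Sketch of formatting a Bangla-Instruct pair into a single training sequence for full
# supervised fine-tuning. The template is an assumption, not the authors' exact format.

def format_example(pair: dict, eos_token: str = "</s>") -> str:
    """Concatenate instruction and response into one sequence for the causal-LM objective."""
    return (
        f"### নির্দেশনা:\n{pair['instruction']}\n\n"   # "Instruction:" header in Bangla
        f"### উত্তর:\n{pair['response']}{eos_token}"   # "Response:" header in Bangla
    )

example = {"instruction": "বাংলাদেশের জাতীয় ফুল কী?",   # "What is the national flower of Bangladesh?"
           "response": "বাংলাদেশের জাতীয় ফুল শাপলা।"}   # "The national flower of Bangladesh is the water lily."
print(format_example(example))
```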

Under the Hood: Hyperparameters

For students interested in reproducing these results, the choice of hyperparameters is critical. The researchers provided transparent documentation of their settings. For the 1B model, they used a learning rate of 1e-5 and a batch size of 16.

Table 4: Final set of hyperparameters for finetuning TigerLLM (1B).

For the larger 9B model, adjustments were made, including a lower learning rate of 1e-6 to ensure stability during the fine-tuning of the larger parameter set.

Table 5: Final set of hyperparameters for finetuning TigerLLM (9B).
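
Expressed as Hugging Face TrainingArguments, the reported settings look roughly like this; values the post does not state (epochs, warmup, scheduler, and the 9B batch size) are assumptions:

```python
# Reported finetuning hyperparameters for TigerLLM, expressed as Hugging Face TrainingArguments.
# Anything not stated in the post is an assumption, marked below.
from transformers import TrainingArguments

finetune_args_1b = TrainingArguments(
    output_dir="tigerllm-1b-sft",
    learning_rate=1e-5,                # reported for the 1B model
    per_device_train_batch_size=16,    # reported batch size
    num_train_epochs=3,                # assumption
    warmup_ratio=0.03,                 # assumption
    lr_scheduler_type="cosine",        # assumption
    bf16=True,                         # assumption
)

finetune_args_9b = TrainingArguments(
    output_dir="tigerllm-9b-sft",
    learning_rate=1e-6,                # reported: lower LR for stability at 9B
    per_device_train_batch_size=16,    # assumption: batch size not restated for the 9B model
    bf16=True,                         # assumption
)
```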

Experiments & Results: David vs. Goliath

The true test of TigerLLM is how it performs against other models on standardized benchmarks. The researchers tested on five key Bangla benchmarks, including MMLU-bn (Massive Multitask Language Understanding) and BanglaQuaD (Question Answering).

The results, presented in Table 2, are striking.

Table 2: Performance comparison of TigerLLM with other models on various Bangla-specific benchmarks.

Key Takeaways from the Results:

  1. Tiny Giant: The TigerLLM (1B) model is exceptionally efficient. Despite having only 1 billion parameters, it achieves a score of 0.61 on MMLU-bn. Compare this to the base LLaMA 3.2 model, which scored only 0.22, or the “Bangla-LLaMA” initiative, which scored 0.02.
  2. Beating Proprietary Models: The TigerLLM (9B) model scores 0.72 on MMLU-bn, outperforming GPT-3.5 (0.55) and Gemini-Flash 1.5 (0.66). This is a remarkable achievement for an open-source model that can run on far more modest hardware.
  3. The Failure of Translation: Models like Titu-LLM and Bangla-LLaMA, which used translated datasets, often performed worse than the base models they started with. This validates the authors’ hypothesis that bad data damages model capabilities.
  4. Consistency: TigerLLM wins across the board—whether it’s reasoning, coding, or general knowledge.

Conclusion

The development of TigerLLM offers a crucial lesson for the AI community, particularly for those working in low-resource settings. It demonstrates that you do not need trillions of tokens or massive proprietary clusters to build state-of-the-art models.

By prioritizing high-quality, domain-specific data (like textbooks) and using a rigorous, culturally aware pipeline for instruction generation, the researchers created a model that punches well above its weight class.

TigerLLM not only provides a powerful tool for Bangla speakers today but also establishes a reproducible blueprint for other under-represented languages. With the release of the models, the dataset, and the corpus, the “AI divide” has become just a little bit smaller.