We live in an age of data abundance. From relational databases to Wikipedia infoboxes, structured tables hold a massive amount of the world’s knowledge. For English speakers, accessing this data is becoming increasingly intuitive thanks to Large Language Models (LLMs) that can perform Table Question Answering (TableQA). You can ask, “Which year had the highest revenue?” and the model retrieves the answer from a financial table.
But what if you speak Bengali or Hindi?
Despite having hundreds of millions of speakers, these are considered “low-resource” languages in the NLP world because they lack the massive, annotated datasets required to train sophisticated neural models. Simply translating English datasets doesn’t work well for tables, which have rigid structures and cultural contexts that get lost in translation.
In this post, we will dive into a recent research paper, “Table Question Answering for Low-resourced Indic Languages,” which proposes a novel, budget-friendly pipeline to solve this problem. We will explore how the researchers generated massive datasets for Bengali and Hindi without expensive manual annotation and how they trained models that outperform general-purpose LLMs like GPT-3.5 on these specific tasks.
The Core Problem: Why TableQA is Hard
TableQA is fundamentally different from standard text-based Question Answering (QA). In text QA, the answer is usually a span of text hidden within a paragraph. In TableQA, the model must:
- Understand Structure: It needs to know which cell belongs to which row and column header.
- Perform Reasoning: Questions often require aggregation (sums, counts, averages), comparison (greater than, less than), or logic (filtering rows).
For low-resource languages, two major hurdles exist:
- Data Scarcity: There are no large-scale TableQA datasets for languages like Bengali.
- Cultural Disconnect: Translating an English dataset about “US Presidents” into Bengali doesn’t help a model learn about “West Bengal Road Networks.” The cultural entities don’t match the language.
The researchers addressed this by building a fully automatic pipeline to generate high-quality training data, creating the BanglaTabQA and HindiTabQA datasets.
The Solution: A Scalable Data Generation Pipeline
The heart of this research is a three-step methodology designed to create synthetic yet high-quality training data. The goal was to take raw tables from the web and turn them into pairs of Natural Language Questions and Answer Tables.
Step 1: Extracting Native Tables
Instead of translating English tables, the researchers scraped tables directly from the Bengali and Hindi Wikipedia dumps. This ensures that the content—names, places, and events—is culturally relevant and linguistically natural.
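The paper builds from full Bengali and Hindi Wikipedia dumps, but as a rough, hypothetical illustration of harvesting native-language tables, here is a minimal single-page sketch with pandas (this is not the authors' extraction code, and the article URL is chosen arbitrarily):

```python
from io import StringIO

import pandas as pd
import requests

# Hypothetical example: pull the wikitables from one Bengali Wikipedia article.
# The actual pipeline parses offline Wikipedia dumps rather than live pages.
url = "https://bn.wikipedia.org/wiki/পশ্চিমবঙ্গ"  # "West Bengal" article, for illustration
html = requests.get(url, headers={"User-Agent": "tableqa-demo/0.1"}).text

# pandas.read_html turns every <table> element into a DataFrame (requires lxml).
tables = pd.read_html(StringIO(html))
for i, table in enumerate(tables[:3]):
    print(f"Table {i}: {table.shape[0]} rows x {table.shape[1]} columns")
```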
Step 2: The SQL Bridge
How do you teach a machine to ask questions about a table? The researchers used SQL (Structured Query Language) as an intermediate step.
They used templates from the SQUALL dataset to generate SQL queries. However, they introduced a clever twist: Code-Mixed SQL. They kept the SQL keywords (SELECT, WHERE, COUNT) in English but filled the table names, columns, and values with the native Bengali or Hindi terms extracted in Step 1.
For example, a query might look like:
```sql
SELECT count(district) FROM table WHERE road_section = 'Shimlapal...'
```
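As a toy sketch of this step (not the authors' generation code), filling a SQUALL-style template with native content words might look like the following; the Bengali fillers and their translations are purely illustrative:

```python
# English SQL keywords stay fixed in the template; table-specific slots are
# filled with Bengali column names and cell values from the extracted table.
template = "SELECT count({col}) FROM table WHERE {filter_col} = '{value}'"

query = template.format(
    col="জেলা",             # "district" (illustrative)
    filter_col="সড়ক অংশ",   # "road section" (illustrative)
    value="শিমলাপাল",        # "Shimlapal" (illustrative)
)
print(query)
# SELECT count(জেলা) FROM table WHERE সড়ক অংশ = 'শিমলাপাল'
```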
Step 3: From Code to Conversation
The final step is converting that rigid SQL query into a fluid Natural Language Question (NQ). To do this, they trained a sequence-to-sequence model (called SQL2NQ) to translate the code-mixed queries into natural Bengali or Hindi questions.
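In code, SQL2NQ inference is plain sequence-to-sequence generation. The checkpoint name below is a placeholder (the paper trains its own model), but the call pattern is standard Hugging Face seq2seq usage:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Placeholder name for a fine-tuned SQL2NQ model; not a published checkpoint.
model_name = "your-org/sql2nq-bn"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

sql = "SELECT count(জেলা) FROM table WHERE সড়ক অংশ = 'শিমলাপাল'"
inputs = tokenizer(sql, return_tensors="pt")
outputs = model.generate(**inputs, num_beams=4, max_new_tokens=64)
question = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(question)  # ideally: a natural Bengali question about counting districts
```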
The entire workflow is visualized below:

Figure 1: The automated pipeline. Notice how the system starts with a Wikipedia table (left), generates a code-mixed SQL query (center), and transforms it into a natural Bengali question while simultaneously executing the SQL to find the correct answer (right).
Quality Control: Filtering the Noise
Automated generation often produces noisy or nonsensical data. To ensure quality, the researchers implemented a strict quality control (QC) mechanism.
They used a sentence similarity model (LaBSE) to compare the generated natural language question against the original SQL query. If the semantic meaning didn’t match closely enough, the sample was discarded.
Determining the “cutoff” point for quality was crucial. The researchers plotted the similarity scores of positive pairs (matching SQL and Question) against “hard negatives” (unrelated pairs).

Figure 2: The distribution of similarity scores. The blue area represents valid SQL-Question pairs, which peak near 1.0. The orange area represents negative pairs. The researchers selected a threshold of 0.74 (where the orange tail ends) to ensure only high-quality data entered the training set.
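A minimal sketch of this filter, using the publicly available LaBSE checkpoint from the sentence-transformers library (the paper's exact scoring code may differ, and the example pair is illustrative):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/LaBSE")
THRESHOLD = 0.74  # cutoff chosen from the score distributions in Figure 2

def keep_pair(sql: str, question: str) -> bool:
    """Keep a (SQL, question) pair only if its LaBSE cosine similarity clears the threshold."""
    emb = model.encode([sql, question], convert_to_tensor=True, normalize_embeddings=True)
    return util.cos_sim(emb[0], emb[1]).item() >= THRESHOLD

sql = "SELECT count(জেলা) FROM table WHERE সড়ক অংশ = 'শিমলাপাল'"
question = "শিমলাপাল সড়ক অংশে কতটি জেলা আছে?"  # "How many districts are on the Shimlapal road section?"
print(keep_pair(sql, question))
```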
Analyzing the Dataset
The resulting dataset, BanglaTabQA, is massive, containing over 19,000 tables and 2 million training samples. But is it complex enough to teach a model reasoning?
To answer this, we can look at the complexity of the SQL queries generated. The number of keywords in a SQL query is a good proxy for difficulty (e.g., a simple lookup has fewer keywords than a filtered aggregation).

Figure 3: Query complexity distribution. Most queries contain 3 to 5 SQL keywords, providing a healthy mix of simple and moderately complex reasoning tasks.
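A quick sketch of that proxy: tokenize each query and count how many tokens belong to a small SQL keyword vocabulary (the authors' exact keyword list is not reproduced here):

```python
import re

SQL_KEYWORDS = {
    "select", "from", "where", "group", "order", "by", "having", "distinct",
    "count", "sum", "avg", "min", "max", "and", "or", "limit",
}

def keyword_count(query: str) -> int:
    """Rough complexity proxy: number of SQL keywords in the query."""
    tokens = re.findall(r"[a-zA-Z_]+", query.lower())
    return sum(tok in SQL_KEYWORDS for tok in tokens)

print(keyword_count("SELECT count(জেলা) FROM table WHERE সড়ক অংশ = 'শিমলাপাল'"))
# -> 4  (SELECT, count, FROM, WHERE)
```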
Furthermore, the dataset isn’t limited to simple “fetch this value” tasks. It covers a wide range of operations essential for TableQA, including arithmetic, sorting, and filtering.

Figure 4: Distribution of operation types. “Filtering” (e.g., finding rows that match a condition) and “Arithmetic” (e.g., counting or summing values) make up a significant portion of the dataset, ensuring models learn true computational reasoning.
Mathematical Formulation
Before looking at the results, it is helpful to understand how the model actually “reads” a table. Neural networks handle sequences of text, not grids. Therefore, the table must be linearized.
The input to the model is a concatenation of the Question (\(Q\)) and the Table (\(T\)). The table is flattened row by row, using special tokens to mark column headers and row breaks.
The input sequence looks like this:

\[ X = q_1, \dots, q_{|Q|},\ \langle\text{header}\rangle,\ h_1, \dots, h_M,\ \langle\text{row}\rangle,\ t_{1,1}, \dots, t_{1,M},\ \dots,\ \langle\text{row}\rangle,\ t_{N,1}, \dots, t_{N,M} \]
Here, \(q\) represents the question tokens, \(h\) represents headers, and \(t\) represents table cells.
The model is then trained to generate the Answer Table (\(T_{out}\)), which is also a linearized sequence containing the resulting headers and values:

\[ T_{out} = \langle\text{header}\rangle,\ h^{out}_1, \dots, h^{out}_{M'},\ \langle\text{row}\rangle,\ t^{out}_{1,1}, \dots, t^{out}_{1,M'},\ \dots \]
This formulation allows standard Encoder-Decoder models (like BART or T5) to process structured tabular data as if it were a translation task.
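To make the formulation concrete, here is a small linearization sketch; the `<header>` and `<row>` markers stand in for whatever special tokens the model's tokenizer actually defines:

```python
# Flatten a question plus table into a single sequence, row by row.
def linearize(question: str, headers: list[str], rows: list[list[str]]) -> str:
    parts = [question, "<header>", " | ".join(headers)]
    for row in rows:
        parts.append("<row>")
        parts.append(" | ".join(row))
    return " ".join(parts)

headers = ["জেলা", "সড়ক অংশ"]                        # district, road section (illustrative)
rows = [["বাঁকুড়া", "শিমলাপাল"], ["পুরুলিয়া", "ঝালদা"]]
print(linearize("শিমলাপাল সড়ক অংশ কোন জেলায়?", headers, rows))
```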
Experiments and Results
The researchers trained several models on their new datasets, specifically fine-tuning multilingual models like mBART-50 and M2M100. They compared these against strong baselines, including GPT-3.5, GPT-4, and OdiaGenAI (a Llama-based model tuned for Bengali).
Key Findings
- Specialized Models Win: The models trained on BanglaTabQA (denoted as BnTQA-mBart) significantly outperformed GPT-3.5 and open-source Llama baselines.
- On Table Exact Match, BnTQA-mBart achieved 35.88% on the test set.
- GPT-4 achieved 26.83%.
- GPT-3.5 achieved only 6.04%.
This highlights that while GPT-4 is powerful, a smaller model fine-tuned on high-quality, language-specific data can outperform generic LLMs in specialized tasks.
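Table Exact Match asks whether the generated answer table matches the reference exactly. A hedged sketch of that check (the paper's precise normalization rules are not reproduced here):

```python
def table_exact_match(pred: list[list[str]], gold: list[list[str]]) -> bool:
    """True if predicted and reference answer tables agree cell by cell after trimming whitespace."""
    norm = lambda t: [[cell.strip() for cell in row] for row in t]
    return norm(pred) == norm(gold)

pred = [["মোট"], ["৩"]]   # header "total", value "3" in Bengali numerals (illustrative)
gold = [["মোট"], ["৩"]]
print(table_exact_match(pred, gold))  # True
```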
The Importance of Numeric Understanding: One major failure mode for baseline models was the inability to handle Bengali numerals. Fine-tuning on native-script data allowed the new models to perform arithmetic operations directly in the native script.
Cross-Lingual Transfer: Perhaps the most exciting result is the Zero-shot Cross-lingual Transfer. The researchers took the model trained only on Bengali (BnTQA) and tested it on the Hindi dataset (HindiTabQA).
- Without seeing a single Hindi training example, the Bengali model could answer questions and reason about Hindi tables.
- After simple post-processing (translating the output script from Bengali to Hindi), the model achieved remarkable accuracy, proving that the model learned the logic of TableQA, not just the text patterns.
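One concrete piece of such post-processing is digit transliteration. The sketch below maps Bengali numerals to Devanagari numerals; the paper's full post-processing may cover more than digits:

```python
# Map each Bengali digit to its Devanagari counterpart.
BENGALI_DIGITS = "০১২৩৪৫৬৭৮৯"
DEVANAGARI_DIGITS = "०१२३४५६७८९"
DIGIT_MAP = str.maketrans(BENGALI_DIGITS, DEVANAGARI_DIGITS)

def bengali_to_devanagari_digits(text: str) -> str:
    return text.translate(DIGIT_MAP)

print(bengali_to_devanagari_digits("৩"))  # -> "३"
```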
Validation Scores
The validation scores for the intermediate SQL-to-Question models were also analyzed. The Hindi model showed higher ROUGE scores (a metric for text overlap) than the Bengali model. This suggests that the Bengali questions were perhaps more linguistically diverse or less strictly aligned with the SQL keywords, making the generation task slightly harder but potentially more natural.

Table 1: Validation scores for the SQL-to-Question generation models. Higher scores indicate the model is successfully learning to translate structured queries into natural language.
Conclusion and Implications
This research offers a blueprint for democratizing AI. By moving away from translation-based approaches and embracing automated, structure-aware generation pipelines, the authors demonstrated that we can build state-of-the-art systems for low-resource languages on a limited budget.
The release of BanglaTabQA and HindiTabQA fills a critical gap, providing the first large-scale benchmarks for TableQA in Indic languages. Moreover, the methodology is generalizable. The same pipeline—scraping local tables, generating code-mixed SQL, and filtering with semantic similarity—can be applied to Swahili, Vietnamese, or any other language with a web presence.
For students and researchers, this highlights an important lesson: data quality and relevance often matter more than model size. A 600-million parameter model trained on native data can beat a trillion-parameter generic model when the cultural and linguistic context is preserved.