Introduction

In the era of Big Data, we are drowning in tables. From financial quarterly reports to sports statistics and election results, tabular data is the backbone of structured information. However, raw tables are often dense and difficult for humans to parse quickly. This is where Long-Form Table Question Answering (LFTQA) comes in.

Imagine asking an AI, “How did the voting demographics change in North Carolina between the 2008 and 2012 elections based on this spreadsheet?” The AI shouldn’t just give you a number; it should write a coherent paragraph analyzing the shift, citing specific rows, and synthesizing the data into a narrative.

Recent advancements in Large Language Models (LLMs) like GPT-4 and Llama-3 have made them exceptionally good at this task. They can read a table and generate fluent, human-like responses. But there is a hidden problem lurking beneath this progress: How do we know if the AI is actually right?

For years, researchers have relied on automated metrics like BLEU and ROUGE to grade AI text. These metrics work by comparing the AI’s output to a “correct” reference answer, counting word overlaps. But what happens when the AI gives a correct answer using completely different words? Or worse, what if the AI writes a beautiful, flowing paragraph that is mathematically wrong?

This blog post explores a critical research paper, “Revisiting Automated Evaluation for Long-form Table Question Answering,” which exposes the deep flaws in our current evaluation systems. The researchers introduce a new meta-evaluation dataset, LFTQA-Eval, and demonstrate that the tools we’ve trusted for years are essentially useless for reasoning tasks, prompting a shift toward LLM-based evaluation methods.

Background: The Challenge of Long-Form Table QA

Before diving into the solution, we must understand the complexity of the task. LFTQA is not simple retrieval. It requires reasoning.

Standard Question Answering might ask, “What is the capital of France?” The answer is a single entity: “Paris.” LFTQA, however, requires the model to perform multiple mental hops:

  1. Scanning: Locating relevant rows and columns.
  2. Aggregation: Summing numbers, comparing dates, or finding averages.
  3. Synthesis: Weaving those facts into a paragraph.

Figure 1: An example of the Long-form Table Question Answering (LFTQA) task investigated in the paper.

As shown in Figure 1, looking at election data requires the model to understand parties, percentages, and names, and then generate a sentence like “David Price won with a significant margin of 74.4%.”

The Evaluation Gap

In traditional Natural Language Processing (NLP), we use metrics that measure n-gram overlap.

  • BLEU: Checks whether the machine produced the same short word sequences (n-grams of up to four words) as the human reference.
  • ROUGE: Checks how many of the reference’s words and phrases the machine’s output recalled.

These work well for translation (say, English to French), where there are only so many ways to render a sentence correctly. But in LFTQA, an AI can be factually correct but lexically different.

  • Reference: “The revenue increased by 50%.”
  • AI: “Sales figures jumped by half.”

A standard metric sees “revenue” vs “sales” and “50%” vs “half” and gives a low score, even though the logic is perfect. Conversely, an AI might say “The revenue increased by 10%,” which looks very similar to the reference textually (high score) but is factually a hallucination.
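To make the mismatch concrete, here is a tiny, self-contained sketch of the kind of n-gram overlap scoring that BLEU and ROUGE are built on. It is a simplified ROUGE-1-style F1, not the official implementation, but it reproduces the failure mode described above:

```python
from collections import Counter

def unigram_f1(reference: str, candidate: str) -> float:
    """Simplified ROUGE-1-style score: F1 over overlapping unigrams."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    overlap = sum((Counter(ref_tokens) & Counter(cand_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

reference = "The revenue increased by 50%."
paraphrase = "Sales figures jumped by half."      # correct, different wording
hallucination = "The revenue increased by 10%."   # wrong number, same wording

print(unigram_f1(reference, paraphrase))     # ~0.2: the faithful answer is punished
print(unigram_f1(reference, hallucination))  # ~0.8: the hallucination is rewarded
```

The correct paraphrase scores roughly 0.2 while the hallucinated answer scores roughly 0.8, which is exactly the inversion that motivates the study.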

The researchers set out to prove that this gap exists and to quantify just how unreliable these metrics are.

The Core Method: Constructing LFTQA-Eval

To prove that automated metrics are failing, the researchers couldn’t just guess. They needed a “Gold Standard”—a set of data where human experts meticulously graded the AI’s performance. They called this benchmark LFTQA-Eval.

The construction of this benchmark was a multi-step process involving data collection, LLM generation, and rigorous human annotation.

1. Data Sources

The team utilized two existing high-quality datasets for table reasoning:

  • FeTaQA: Focuses on free-form answers based on Wikipedia tables.
  • QTSumm: Focuses on query-focused summarization, requiring deeper reasoning and longer answers.

Table 1: Basic statistics of the FeTaQA and QTSumm test sets used in the paper's experiments.

Table 1 highlights the differences between these datasets. Note that QTSumm requires significantly longer answers (avg 67.8 words) compared to FeTaQA (18.9 words), making it a harder test bed for long-form coherence.

2. Generating AI Responses

To get a representative sample of modern capabilities, the researchers didn’t just test one model. They collected outputs from eight distinct LLMs, ranging from open-source models to proprietary giants:

  • Open Source: Llama-2 & 3, Mistral, DeepSeek, Qwen.
  • Proprietary: GPT-3.5 and GPT-4.

They randomly sampled 150 examples from the development sets, resulting in nearly 3,000 unique responses to evaluate.

3. The Human Annotation (The Ground Truth)

This is the most critical part of the study. Automated metrics are only “good” if they agree with human judgment. Therefore, the researchers hired human annotators to grade every single AI response on two specific criteria:

  1. Faithfulness: Does the answer contain only true information found in the table? Does it avoid hallucinating numbers or facts?
  2. Comprehensiveness: Does the answer include all the relevant information asked for by the question?

Fluency (grammar and spelling) was excluded because modern LLMs rarely make grammatical errors anymore; the real problems are hallucinated facts and omitted information.
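Concretely, each graded response in LFTQA-Eval can be thought of as a record pairing one model output with these two human ratings. The schema below is a hypothetical illustration (the field names and the 1-5 scale are assumptions for clarity, not the paper's released format):

```python
from dataclasses import dataclass

@dataclass
class LFTQAEvalRecord:
    """Hypothetical shape of one LFTQA-Eval annotation (illustrative only)."""
    table_id: str            # source table from FeTaQA or QTSumm
    question: str            # the user's question about the table
    model_name: str          # which of the eight LLMs produced the answer
    generated_answer: str    # the long-form answer being graded
    faithfulness: int        # human rating: only facts supported by the table?
    comprehensiveness: int   # human rating: covers everything the question asks?
```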

Methodology: Measuring Correlation

Once they had the Human Scores (the truth) and the Automated Scores (from BLEU, ROUGE, etc.), the next step was to see if they matched.

If a metric is good, it should give a high score to an answer that humans rated 5/5, and a low score to an answer humans rated 1/5. This relationship is measured using Pearson Correlation.

The formula used for instance-level correlation is:

\[
r_{\mathrm{ins}}(H, M) = \frac{\sum_{i} \mathcal{C}(H_i, M_i)}{n},
\]

Here, \(H_i\) and \(M_i\) are the human and metric scores assigned to the system outputs for example \(i\), \(\mathcal{C}\) computes the Pearson correlation between them, and \(n\) is the number of examples. A correlation of 1.0 means the metric tracks human judgment perfectly. A correlation near 0.0 means the metric is essentially random noise.
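In code, this amounts to computing one Pearson correlation per example and averaging them. A minimal sketch, assuming the human and metric scores are grouped by example (the variable names and toy numbers are illustrative, not the paper's data):

```python
import numpy as np
from scipy.stats import pearsonr

def instance_level_correlation(human_scores, metric_scores):
    """Average per-example Pearson correlation between human and metric scores.

    Both arguments are lists of lists: one inner list per example, holding the
    scores assigned to each system's output for that example.
    """
    per_example = []
    for h, m in zip(human_scores, metric_scores):
        r, _ = pearsonr(h, m)   # correlation across systems for one example
        per_example.append(r)
    return float(np.mean(per_example))

# Toy data: 2 table-question pairs, 4 system outputs each
human  = [[5, 4, 2, 1], [3, 5, 2, 4]]
metric = [[0.9, 0.7, 0.4, 0.2], [0.3, 0.8, 0.1, 0.6]]
print(instance_level_correlation(human, metric))
```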

Experiments & Results: The Failure of Traditional Metrics

The results of the study were stark. The researchers tested a wide variety of metrics:

  • n-gram metrics: BLEU, ROUGE, METEOR.
  • Embedding metrics: BERTScore (uses semantic vector similarity).
  • Fact-checking metrics: TAPAS-Acc (specialized for tables).
  • LLM-based metrics: G-Eval (using GPT-4 to grade the answer).
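For readers who want to run the traditional baselines themselves, the n-gram and embedding metrics are available as off-the-shelf packages. A quick sketch (the package choices here are common community implementations, not necessarily the exact ones used in the paper):

```python
# pip install rouge-score bert-score sacrebleu
import sacrebleu
from rouge_score import rouge_scorer
from bert_score import score as bert_score

reference = "The revenue increased by 50%."
candidate = "Sales figures jumped by half."

# BLEU: n-gram precision against the reference
bleu = sacrebleu.sentence_bleu(candidate, [reference])
print("BLEU:", bleu.score)

# ROUGE-L: longest-common-subsequence overlap
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
print("ROUGE-L F1:", scorer.score(reference, candidate)["rougeL"].fmeasure)

# BERTScore: semantic similarity via contextual embeddings
# (downloads a pretrained model on first use)
P, R, F1 = bert_score([candidate], [reference], lang="en")
print("BERTScore F1:", F1.item())
```

Note that none of these metrics ever look at the table itself; they only compare the two strings, which is precisely the limitation the study quantifies.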

The correlation results are displayed in Table 2, and they are quite shocking for anyone relying on traditional NLP evaluations.

Table 2: Results of instance-level Pearson correlations between automatic metrics and human judgments on the FeTaQA and QTSumm datasets.

Key Takeaways from the Results:

  1. Traditional Metrics are Broken: Look at BERTScore and TAPAS-Acc. Their correlations are consistently below 0.1, and sometimes near zero (e.g., 0.008 for Faithfulness on FeTaQA). This effectively means these metrics are statistically irrelevant for judging table reasoning.
  2. BLEU and ROUGE are Mediocre: While slightly better than BERTScore, scoring around 0.2 to 0.4, they are still weak. You cannot reliably tell whether an LFTQA system is working based on a ROUGE score.
  3. G-Eval is the Winner (but not perfect): The rows at the bottom of Table 2 show G-Eval backed by GPT-4. It achieves the highest correlations, reaching up to 0.66. This confirms that using a strong LLM to grade another LLM is currently our best option, vastly outperforming mathematical word-counting.

Why Do The Metrics Fail? A Case Study Analysis

To understand why the numbers in Table 2 are so low, the researchers conducted a qualitative analysis. They looked at specific examples where the metrics gave a bad score to a good answer (or vice versa). They identified three main culprits.

1. The Ambiguity of Questions

Sometimes, the problem isn’t the metric or the model, but the question itself. If a question is vague, the “Gold Standard” reference answer might interpret it one way, while the AI interprets it another way.

Table 3: Case studies on evaluation errors due to the effects of questions.

As seen in Table 3, a question like “Who were the top three scorers?” might be interpreted as asking for a list of names, or a total combined score. If the AI provides names but the reference provides a sum, metrics like ROUGE will punish the AI heavily, even if the user would have been happy with the names.

2. The Flaw in Ground Truth (Reference Answers)

We often assume the human-written reference answer in a dataset is perfect. The study reveals this is false. Reference answers often contain “fluff”—extra details not requested by the user.

Table 4: Case studies on evaluation errors due to the effects of ground truth answers.

In Table 4, we see a clear example of this.

  • The Question: Asks specifically for the track with the lowest and highest BPM.
  • The Generated Answer: Correctly identifies “Rhythm & Police” and “Mission: Impossible Theme.”
  • The Ground Truth: Mentions the BPM values (175 and 195) but fails to name the tracks.

In this case, the AI actually did a better job than the human reference. However, because the AI’s answer didn’t match the text of the flawed reference, automated metrics would give it a failing grade. This highlights a critical limitation of reference-based evaluation: if your reference is bad, your evaluation is meaningless.

3. Verbosity vs. Conciseness

LLMs are often tuned to be concise and direct. Human reference answers in these datasets tend to be narrative and verbose.

Table 5: Case studies on evaluation errors due to the effects of generated answers.

Table 5 illustrates this mismatch.

  • Ground Truth: “Between the years 1980 to 1985 altogether, Agderfly added three airplane models…” (Long, flowery sentences).
  • AI Answer: “The quantity… is 3 and their build years are…” (Direct, parallel structure).

The AI provides exactly the same facts in far fewer words. Recall-oriented metrics like ROUGE penalize the AI for “missing” the extra words present in the human narrative, even though the informational content is identical.

The Solution: LLM-Based Evaluation (G-Eval)

Given these failures, the paper advocates for using LLMs as judges. This technique, often called G-Eval, involves prompting a powerful model (like GPT-4) to act as the evaluator.

Instead of counting matching words, G-Eval is given a rubric. It reads the table, the question, and the answer, and then reasons about whether the answer is correct.

Figure 3: G-Eval for Evaluating the Comprehensiveness of the LLM-generated answer.

Figure 3 shows the exact prompt used to evaluate Comprehensiveness. Notice how it instructs the model to:

  1. Review the table and question to understand the scope.
  2. Analyze the answer for missing information.
  3. Assign a rating from 1 to 5.

This “Chain of Thought” approach allows the evaluator to understand that “Sales jumped by half” is the same as “Revenue increased by 50%,” solving the synonym problem that plagues BLEU and ROUGE.
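To give a flavor of what this looks like in practice, here is a minimal sketch of a G-Eval-style comprehensiveness judge. The rubric wording and the use of the OpenAI client are illustrative assumptions, not the paper's exact prompt or setup:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RUBRIC = """You are evaluating an answer to a question about a table.
Rate the answer's COMPREHENSIVENESS on a scale of 1 to 5:
1 = misses most of the information the question asks for, 5 = covers all of it.
First reason step by step about what the question requires and what the answer
actually includes, then finish with a line of the form: Rating: <1-5>."""

def geval_comprehensiveness(table_text: str, question: str, answer: str) -> str:
    """Ask a judge LLM to grade one generated answer against the table."""
    prompt = (
        f"{RUBRIC}\n\nTable:\n{table_text}\n\n"
        f"Question: {question}\n\nAnswer to evaluate: {answer}"
    )
    response = client.chat.completions.create(
        model="gpt-4",  # the judge named in the post; any strong LLM could serve
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content  # chain of thought + "Rating: N"
```

A production setup would parse the numeric rating out of the response and perhaps average several samples, but the core idea is visible: because the judge reads the table itself rather than comparing strings to a reference, a paraphrase like “Sales jumped by half” is no longer penalized.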

Conclusion & Implications

The research presented in “Revisiting Automated Evaluation for Long-form Table Question Answering” serves as a wake-up call for the NLP community.

The Key Takeaways:

  1. Stop trusting BLEU/ROUGE for reasoning tasks. If you are building a system to analyze data tables, these metrics may mislead you into thinking your model is performing poorly when it is actually doing well (or vice versa).
  2. Dataset quality matters. Evaluation is only as good as the ground truth. If reference answers are bloated or inaccurate, reference-based metrics are fundamentally flawed.
  3. The future is Model-based Evaluation. While not perfect, using GPT-4 or similar models to grade outputs aligns much more closely with human judgment.

As we move toward AI agents that perform complex data analysis for us, ensuring they are faithful to the source data is paramount. This paper provides the roadmap for how we should—and shouldn’t—grade their homework. By moving away from surface-level text matching and toward semantic evaluation, we can build more reliable and trustworthy data assistants.