Introduction
Imagine you are asking a Large Language Model (LLM) to summarize a complex financial report into a neat, easy-to-read table. The model churns out a grid of numbers and headers. At a glance, it looks perfect. The columns align, the formatting is crisp, and the headers look professional.
But is it actually good?
In the world of Natural Language Processing (NLP), we have become very good at generating text. However, generating structured data—like tables—from unstructured text is a different beast. More importantly, evaluating whether that generated table is accurate is a notoriously difficult problem.
If an AI generates a table where the “Revenue” and “Profit” columns are swapped, standard evaluation metrics might not catch the severity of the error. Conversely, if the AI generates a perfect table but uses “Q1 Earnings” as a header instead of “First Quarter Revenue,” traditional metrics might punish it for not matching the reference exactly, even though the meaning is identical.
This brings us to a fascinating paper titled “Is This a Bad Table?”, from researchers at Adobe Research. They argue that we have been judging tables all wrong. They propose a new method, TABEVAL, which stops looking at tables as grids of text and starts looking at them as collections of facts.
In this post, we will break down why evaluating tables is so hard, how TABEVAL uses “Table Unrolling” to solve it, and what this means for the future of document generation.
The Problem with Existing Metrics
To understand the innovation of this paper, we first need to look at how we currently grade AI-generated tables.
Most Text-to-Table systems are evaluated using metrics like Exact Match or BERTScore.
- Exact Match is rigid. It checks whether the cell values in the generated table are identical to those in the reference table.
- BERTScore is more flexible; it uses embeddings to see if words are semantically similar.
However, these metrics have a fatal flaw: they often evaluate cells or rows in isolation, or they rely too heavily on the structural layout matching the reference.
Consider a “Reference Table” about an election. Now imagine the AI generates a “Predicted Table.”
- The Good but Penalized Table: The AI generates the correct data but changes the column order or uses synonyms for headers (e.g., “Votes” vs. “Total Ballots”). Traditional metrics often mark this as a failure because the structure doesn’t align with the reference.
- The Bad but Rewarded Table: The AI copies the exact words from the reference but puts them in the wrong cells (e.g., swapping two candidates’ vote counts). Because the content is similar, metrics like BERTScore might give this a high rating, failing to realize that the factual relationship (who got how many votes) is now wrong.
The researchers highlight that we need a metric that captures semantics (meaning) rather than just syntax (structure).
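To make the second failure mode concrete, here is a toy sketch (my own illustration, not from the paper; the candidate names and the second vote count are invented) of how a bag-of-cells comparison rewards a table with swapped values, while a fact-level comparison catches the error:

```python
# Toy illustration only: swapped vote counts fool cell-level overlap.
reference = {
    ("Candidate A", "Votes"): "448,143",
    ("Candidate B", "Votes"): "310,000",
}
# Hypothetical prediction: same strings, but the vote counts are swapped.
predicted = {
    ("Candidate A", "Votes"): "310,000",
    ("Candidate B", "Votes"): "448,143",
}

# A bag-of-cells match only asks whether the same strings appear somewhere.
ref_cells = set(reference.values())
pred_cells = set(predicted.values())
cell_overlap = len(ref_cells & pred_cells) / len(ref_cells)
print(f"Cell-level overlap: {cell_overlap:.0%}")   # 100% -- looks perfect

# A fact-level check compares (row, column, value) triples instead.
fact_overlap = len(set(reference.items()) & set(predicted.items())) / len(predicted)
print(f"Fact-level overlap: {fact_overlap:.0%}")   # 0% -- both facts are wrong
```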
The Solution: TABEVAL
The core contribution of this paper is TABEVAL, a new evaluation strategy. The intuition is simple but powerful: A table is just a compact way of presenting a list of facts. To judge a table, we shouldn’t compare grids; we should compare the list of facts contained within them.
The process works in two main stages, as illustrated below.

As shown in Figure 1, the pipeline moves from the raw tables (left) to a list of statements (middle), and finally to a score (right). Let’s break down these stages.
Stage 1: Table Unrolling (TabUnroll)
You cannot easily compare two tables if they have different layouts. To solve this, the authors propose TabUnroll, a Chain-of-Thought prompting strategy that uses an LLM to “unroll” a table into a series of atomic natural language statements.
How does the LLM know how to do this? The authors guide the model with a specific schema:
- Identify Headers and Rows: The model first parses the structure.
- Find Primary Keys: The model looks for the “anchor” of the row—the value that uniquely identifies it (e.g., a “Candidate Name” or a “Year”).
- Construct Atomic Statements: The model combines the Primary Key with other column values to create simple sentences.
For example, looking at the table in Figure 1:
- Instead of just seeing the cell “448,143” under “Democratic,” the model generates the statement: “Brad Henry received 448,143 votes.”
- It creates another statement: “Brad Henry is a candidate from the Democratic Party.”
This transforms a structural object (a table) into a semantic object (a list of assertions). This creates a level playing field where the layout of the columns no longer matters—only the information does.
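To make this concrete, here is a minimal sketch of what a TabUnroll-style step could look like. The prompt wording and the `call_llm` helper are assumptions for illustration; the paper’s actual Chain-of-Thought prompt is more elaborate:

```python
# Sketch of a TabUnroll-style prompt (assumed wording; not the paper's exact prompt).
# `call_llm` is a hypothetical helper that sends a prompt to any chat LLM
# and returns its text response.

UNROLL_PROMPT = """You are given a table in markdown format.
Step 1: Identify the column headers and the rows.
Step 2: For each row, identify the primary key -- the value that uniquely
        identifies the row (e.g., a candidate name or a year).
Step 3: Combine the primary key with every other cell in the row to write
        simple, atomic natural-language statements, one per line.

Table:
{table}

Atomic statements:"""

def unroll_table(table_markdown: str, call_llm) -> list[str]:
    """Unroll a table into a list of atomic factual statements."""
    response = call_llm(UNROLL_PROMPT.format(table=table_markdown))
    # One statement per line; drop blanks.
    return [line.strip() for line in response.splitlines() if line.strip()]
```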
Stage 2: Entailment-Based Scoring
Once both the Reference Table (the ground truth) and the Predicted Table (the AI generation) have been unrolled into lists of statements, the problem shifts. We now need to compare List A to List B.
The researchers use Natural Language Inference (NLI) to measure Entailment. Entailment asks: Does the truth of Statement A guarantee the truth of Statement B?
They calculate three key metrics:
1. Precision (Correctness)
Precision measures how much of the information in the Predicted table is actually true according to the Reference table. If the AI hallucinates a number that isn’t in the reference, Precision drops.
The formula is defined as:

\[
\text{Precision} = \frac{1}{|P|} \sum_{p_i \in P} \max_{g_j \in G} E(g_j, p_i)
\]

where \(P\) is the set of statements unrolled from the predicted table, \(G\) is the set unrolled from the reference table, and \(E(g_j, p_i)\) is the entailment score of \(p_i\) given \(g_j\).
In plain English: for every statement in the generated table (\(p_i\)), we find the statement in the reference table (\(g_j\)) that best supports it, and we average these best-match scores.
2. Recall (Completeness)
Recall measures how much of the Reference information was successfully captured by the AI. If the AI leaves out a row or a column, Recall drops.

\[
\text{Recall} = \frac{1}{|G|} \sum_{g_j \in G} \max_{p_i \in P} E(p_i, g_j)
\]

Here, we do the reverse: for every fact in the ground truth (\(g_j\)), we check whether the generated table contains a statement (\(p_i\)) that covers it, and again average these best-match scores.
3. F1 Score (Overall Quality)
Finally, they compute the F1 score, which is the harmonic mean of Precision and Recall, providing a single summary score for the table’s quality.
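Putting the three metrics together, here is a minimal sketch of the scoring stage, assuming both tables have already been unrolled into statement lists and that `entailment_score(premise, hypothesis)` returns a value in [0, 1] from some off-the-shelf NLI model (both names are placeholders, not the paper’s code):

```python
def tabeval_scores(generated: list[str], reference: list[str], entailment_score) -> dict:
    """Score a predicted table's unrolled statements against the reference's.

    `entailment_score(premise, hypothesis)` is assumed to return a value in
    [0, 1], e.g. the entailment probability from an NLI model.
    Both statement lists are assumed to be non-empty.
    """
    # Precision: each generated statement is credited by the reference
    # statement that best supports (entails) it.
    precision = sum(
        max(entailment_score(g, p) for g in reference) for p in generated
    ) / len(generated)

    # Recall: each reference statement is credited by the generated
    # statement that best covers it.
    recall = sum(
        max(entailment_score(p, g) for p in generated) for g in reference
    ) / len(reference)

    # F1: harmonic mean of Precision and Recall.
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

Here, `entailment_score` could wrap, for instance, the entailment probability of a standard MNLI-trained classifier; the paper’s exact NLI setup may differ.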
A New Standard: The DescToTTo Dataset
To validate their new metric, the researchers needed a good dataset. Existing datasets for Text-to-Table generation had limitations:
- WikiBio: Too simple (mostly key-value pairs).
- Rotowire: Too domain-specific (strictly sports statistics).
The authors introduced DescToTTo, a curated dataset of 1,250 diverse Wikipedia tables paired with text descriptions.

As seen in Table 1 above, DescToTTo (left column) covers a “Wikipedia” domain similar to WikiTableText but features significantly longer text descriptions (Avg. text length 155.94) and supports complex multi-row/column structures, making it a much more realistic benchmark for modern document generation tasks.
Experiments and Results
The researchers compared TABEVAL against standard metrics (Exact Match, chrF, and BERTScore) across four datasets. They also gathered human ratings to see which metric actually aligned with how real people judge tables.
The results were illuminating.
Visualizing the Difference
The most compelling evidence comes from looking at examples where standard metrics failed but TABEVAL succeeded.

Let’s look closely at Figure 2:
- Table A (Top Right): This table contains the correct information but uses different column headers compared to the reference.
- BERTScore (BS) gives it a low F1 of 37.7 because the headers don’t match the reference strings.
- TABEVAL gives it a high F1 of 99.5. The unrolling process realized that the facts were identical, regardless of the header names.
- Table B (Bottom Left): This table has significant errors (hallucinated numbers).
- BERTScore gives it a dangerously high F1 of 100, likely because the surface tokens look very similar to the reference.
- TABEVAL correctly identifies the factual errors, dropping the F1 score to 81.5.
This visual confirmation proves that TABEVAL is “reading” the table like a human would, focusing on content accuracy rather than surface presentation.
Correlation with Human Judgment
To prove this statistically, the authors calculated the correlation between the automatic metrics and human ratings. A higher correlation means the metric is a better proxy for human quality assessment.
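The computation itself is straightforward. A quick sketch, with made-up scores purely for illustration (the real values are reported in Table 2):

```python
# Illustrative only: how metric-vs-human correlation can be computed.
from scipy.stats import pearsonr

human_correctness = [4.0, 2.5, 5.0, 3.0, 1.5]       # hypothetical human ratings
metric_scores     = [0.82, 0.41, 0.95, 0.60, 0.30]  # hypothetical TABEVAL F1 scores

r, p_value = pearsonr(metric_scores, human_correctness)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")
```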

Table 2 shows the Pearson correlations. The rows labeled O-C (Ours with Claude) and O-G (Ours with GPT-4) consistently outperform the baselines (BS, chrF, E) in almost every category.
- Correctness (Corct): Look at the DescToTTo column. Standard metrics like BERTScore (BS) have a correlation of only 0.21 to 0.27 (for L-IFT models). TABEVAL jumps up to 0.39, a significant improvement.
- Completeness (Compl): The trend continues, with TABEVAL showing stronger alignment with human perceptions of whether the table is “complete.”
Model Performance Comparison
Finally, the paper used these metrics to compare different AI models (GPT-4, GPT-3.5, LLaMa-2) on the task of table generation.

Table 3 highlights a major discrepancy. If you look at the DescToTTo dataset using standard metrics (BS), GPT-4 scores 41.78, significantly lower than the supervised L-IFT model (a fine-tuned LLaMa-2), which scores 63.01.
A user looking at this might think, “Wow, LLaMa-2 is much better than GPT-4 at tables!”
But look at the O-G (TABEVAL) scores. GPT-4 scores 68.92 while L-IFT scores 55.91. TABEVAL reveals that GPT-4 is actually generating semantically superior tables, but because it doesn’t blindly copy the training data’s structure (which supervised models like L-IFT do), standard metrics were unfairly penalizing it.
Conclusion and Implications
The research presented in “Is This a Bad Table?” offers a critical correction to how we build and evaluate generative AI.
The key takeaway is that structure does not equal semantics. As we rely more on LLMs to automate document creation, we cannot rely on metrics that simply check if the right words are in the right grid cells. We need metrics that verify if the information conveyed is true.
TABEVAL provides a robust path forward by:
- Deconstructing tables into atomic facts (Unrolling).
- Verifying facts using logical entailment (NLI).
- Rewarding variation in structure as long as the data remains accurate.
This approach ensures that good tables aren’t penalized for being creative with headers, and bad tables aren’t rewarded just for looking pretty. For students and researchers entering the field of NLP, this paper serves as a reminder: always question your evaluation metrics. The goal of AI isn’t just to generate tokens; it’s to generate meaning.