Introduction: The “Hallucination” in the Spreadsheet

We are all familiar with the superpowers of Large Language Models (LLMs). They can write poetry, summarize emails, and generate code. But we are also painfully familiar with their kryptonite: facts. When you ask an LLM a question that requires precise data analysis—like “What is the ranking of California’s wildfire severity compared to the national average?”—the model often hallucinates. It acts like a student who didn’t study for the test but is trying to bluff their way through an essay.

To fix this, the industry adopted Retrieval Augmented Generation (RAG). RAG connects the LLM to a library of text documents, allowing it to “look up” answers before writing. This works wonders for unstructured text. But what happens when the answer isn’t in a document, but hidden inside a massive SQL database? What if the answer requires math, aggregation, or filtering?

Retrieving text isn’t enough. We need Analytics Augmented Generation (AAG).

In this post, we break down a fascinating paper from Northwestern University that introduces SATYRN, a platform designed to bridge the gap between LLMs and structured data. Unlike tools that simply ask an LLM to write SQL code (which often breaks), SATYRN uses a deterministic, neurosymbolic approach to guarantee that the facts fed to the LLM are accurate.

The Problem with “Code Interpreters”

Before diving into SATYRN, we need to understand the status quo. The current state-of-the-art method for handling data with LLMs is the “Code Interpreter” approach (used by OpenAI and others).

In this model, you give the LLM a schema and a question, and the LLM attempts to write a Python script or SQL query to get the answer. While impressive, this is inherently non-deterministic. If the LLM hallucinates a column name, writes bad logic, or fails to handle a join correctly, the execution fails, or worse—it succeeds but returns the wrong number. The LLM is acting as both the analyst and the writer, and it is rarely qualified for the former.

SATYRN proposes a different path: separate the analysis from the generation. By using a symbolic engine to handle the data logic and an LLM to handle the prose, SATYRN achieves significantly higher accuracy without sacrificing fluency.

Core Method: The SATYRN Architecture

The SATYRN platform is built on a “neurosymbolic” architecture. This means it combines neural networks (LLMs) with symbolic processing (hard-coded logic and structured query plans).

The workflow is a four-step pipeline designed to ensure that the LLM is never guessing numbers. It only sees facts that have already been verified.

The high-level approach of SATYRN and its analytics augmented generation.

As shown in Figure 1 above, the process flows from the user’s request into a strict analytical pipeline before any language generation occurs. Let’s break down the four critical components that make this work.

1. The Ring: Making Data “LLM-Ready”

Raw database schemas are often messy. Column names like col_x_99 are meaningless to an LLM, and knowing that a column is an “integer” doesn’t tell you if it represents dollars, years, or ID numbers.

SATYRN wraps the database in a semantic layer called a Ring. The Ring defines:

  • Entities: Real-world objects (e.g., “Wildfire”, “State”).
  • Attributes: Properties with semantic types (e.g., distinguishing a “Metric” like acreage from an “Identifier” like an ID).
  • Relationships: Pre-defined joins between tables.

An example of a SATYRN ring with two entities defined: State and Wildfire.

By using the Ring (Figure 2), SATYRN abstracts away the complexity of the underlying database. The system doesn’t need to figure out SQL joins at runtime; they are encoded in the Ring. This allows the system to be domain-agnostic—whether the data is about wildfires, court cases, or urban housing, the logic remains the same.
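
To make the idea concrete, here is a minimal sketch of what a Ring-style semantic layer could look like if written out as Python data structures. The class names, fields, and “kind” labels are assumptions made for this post, not the paper’s actual Ring format, but they capture the three ingredients listed above: entities, semantically typed attributes, and pre-defined joins.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Illustrative sketch of a Ring-style semantic layer. The class and field
# names are assumptions made for this post, not SATYRN's configuration format.

@dataclass
class Attribute:
    name: str               # human-readable name for templates and the LLM
    column: str             # underlying database column (e.g. "col_x_99")
    kind: str               # semantic type, e.g. "Metric" or "Identifier"
    unit: Optional[str] = None

@dataclass
class Entity:
    name: str
    table: str
    attributes: List[Attribute] = field(default_factory=list)

@dataclass
class Relationship:
    name: str
    source: str             # entity on one side of the join
    target: str             # entity on the other side
    join: str               # join condition, defined once and reused at runtime

# A two-entity ring roughly mirroring the wildfire example in Figure 2.
wildfire_ring = {
    "entities": [
        Entity("Wildfire", "fires", [
            Attribute("acreage burned", "acres", kind="Metric", unit="acres"),
            Attribute("year", "year", kind="Identifier"),
        ]),
        Entity("State", "states", [
            Attribute("state name", "name", kind="Identifier"),
        ]),
    ],
    "relationships": [
        Relationship("occurred in", source="Wildfire", target="State",
                     join="fires.state_id = states.id"),
    ],
}
```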

2. SQR: The Language of Analysis

If we aren’t letting the LLM write SQL, how do we query the data? The researchers developed a new intermediate language called Structured Question Representation (SQR).

SQR is a directed acyclic graph (DAG) where each node represents an atomic operation—like retrieve, filter, groupby, or average.

An SQR plan, in textual and graph forms, for determining average wildfire size for each state in 2020.

In Figure 3, you can see how a question about “average wildfire size” is broken down into logical steps. Unlike SQL, which is a declarative block of text, SQR is composable. The system can take small SQR snippets (templates) and snap them together to build complex queries without syntax errors.
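
As an illustration only, here is how that plan might be sketched as a list of operation nodes in Python. The node keys and the compose helper are assumptions for this post rather than SATYRN’s actual SQR syntax, but they show why a graph of atomic operations composes more safely than a block of SQL text.

```python
# Illustrative sketch of an SQR-style plan as a graph of atomic operations,
# mirroring the "average wildfire size per state in 2020" example. The node
# keys and the compose helper are assumptions for this post, not SQR syntax.

sqr_plan = [
    {"id": "n1", "op": "retrieve", "entity": "Wildfire", "inputs": []},
    {"id": "n2", "op": "filter", "attribute": "year", "value": 2020, "inputs": ["n1"]},
    {"id": "n3", "op": "groupby", "attribute": "State", "inputs": ["n2"]},
    {"id": "n4", "op": "average", "attribute": "acreage burned", "inputs": ["n3"]},
]

def compose(plan, snippet):
    """Snap a reusable snippet onto the end of an existing plan.

    Because every node names its inputs explicitly, snippets are appended
    as data rather than spliced into a query string, so there is no syntax
    to get wrong.
    """
    last_id = plan[-1]["id"]
    return plan + [dict(node, inputs=[last_id]) for node in snippet]

# Reuse a generic "top 3" snippet on top of the averaging plan.
top3_snippet = [{"id": "n5", "op": "limit", "count": 3, "order": "descending"}]
ranking_plan = compose(sqr_plan, top3_snippet)
```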

3. Report Blueprints

Users rarely want just a single number; they want a report. SATYRN uses Report Blueprints to define standard analytical workflows. A blueprint acts as a recipe. For example, a “Ranking Report” blueprint knows that to rank entities, it needs to:

  1. Calculate the metric for the target entity.
  2. Count the total entities.
  3. Find the target’s rank.
  4. Identify the top 3 entities for comparison.

An instantiation of a report blueprint for ranking California against other states by average wildfire size.

Figure 4 illustrates how a user’s request (Target: California, Metric: Wildfire Size) is mapped onto a blueprint. The system doesn’t “guess” what information is relevant. The blueprint deterministically dictates exactly which SQR plans must be executed to generate a comprehensive report.
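
A minimal sketch of such a recipe, assuming a simple function-based representation (the function name and the plan descriptions below are illustrative, not the paper’s blueprint format):

```python
# Illustrative sketch of a ranking blueprint: a fixed recipe that, given a
# target and a metric, deterministically lists the analyses a report needs.
# The function and the plan descriptions are assumptions made for this post.

def ranking_blueprint(target, metric, group_entity):
    """Return the SQR plan specifications a ranking report always requires."""
    return [
        {"name": "target_metric", "plan": ("average", metric, "for", target)},
        {"name": "total_entities", "plan": ("count", group_entity)},
        {"name": "target_rank", "plan": ("rank", target, "by", metric)},
        {"name": "top_entities", "plan": ("top", 3, group_entity, "by", metric)},
    ]

# "How does California rank by wildfire size?" maps onto the blueprint
# directly; no model has to guess which analyses are relevant.
required_plans = ranking_blueprint(target="California",
                                   metric="acreage burned",
                                   group_entity="State")
```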

4. The SQR Composer and Analysis Engine

The magic happens in the SQR Composer. It takes the abstract requirements from the Blueprint, looks up the specific data paths in the Ring, and assembles them into executable plans.

This figure shows the process SATYRN uses to generate a set of executable SQR plans to derive the information required for a report.

Once composed (Figure 8), the Analysis Engine converts the SQR plan into the native query language of the database (like SQL) and executes it.
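
To see roughly what that translation step involves, here is a heavily simplified sketch of emitting SQL for the “average wildfire size per state in 2020” plan, with the join supplied by the Ring rather than guessed by a model. The function and its parameters are assumptions for this post, not SATYRN’s engine.

```python
# Heavily simplified sketch of the Analysis Engine step: emit SQL for the
# "average wildfire size per state in 2020" plan, reusing the join already
# encoded in the Ring instead of asking a model to infer it at runtime.
# The function and its parameters are assumptions for this post.

def average_per_group_sql(fact_table, group_table, join, metric_col, group_col, year):
    """Translate an averaging-by-group plan into a plain SQL string."""
    return (
        f"SELECT {group_table}.{group_col}, "
        f"AVG({fact_table}.{metric_col}) AS avg_{metric_col} "
        f"FROM {fact_table} JOIN {group_table} ON {join} "
        f"WHERE {fact_table}.year = {year} "
        f"GROUP BY {group_table}.{group_col};"
    )

sql = average_per_group_sql(
    fact_table="fires", group_table="states",
    join="fires.state_id = states.id",   # pre-defined in the Ring
    metric_col="acres", group_col="name", year=2020,
)
# No generative model touches this step, so the numbers the query returns are
# exactly the numbers the database holds.
```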

Crucially, the engine does not hand the LLM a raw data table. Instead, it converts the result rows into Natural Language Statements.

For example, instead of passing a row [Oregon, 4, 280] to the LLM, SATYRN generates the text: “Oregon is ranked 4 according to median fire size.” and “The median fire size for Oregon is 280.00 acres.”

Factual statements generated across three domains for the same ranking blueprint.

As seen in Figure 10, these statements serve as the “context” for the LLM. The LLM is essentially told: “Here is a list of true facts. Rewrite them into a fluent report.” This drastically reduces the cognitive load on the model. It doesn’t have to calculate; it just has to communicate.
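
A minimal sketch of this hand-off, assuming simple string templates (the template wording and the prompt below are illustrative, not taken from the paper):

```python
# Illustrative sketch of the final hand-off: verified result rows are rendered
# into declarative statements with fixed templates, and only those statements
# reach the LLM. Template wording and prompt are assumptions for this post.

STATEMENT_TEMPLATES = [
    "{state} is ranked {rank} according to median fire size.",
    "The median fire size for {state} is {size:.2f} acres.",
]

def rows_to_statements(rows):
    """Turn result rows such as ('Oregon', 4, 280) into factual sentences."""
    statements = []
    for state, rank, size in rows:
        for template in STATEMENT_TEMPLATES:
            statements.append(template.format(state=state, rank=rank, size=size))
    return statements

facts = rows_to_statements([("Oregon", 4, 280)])

# The LLM's only job is to rewrite verified facts fluently, not to compute them.
prompt = ("Here is a list of true facts. Rewrite them into a fluent report.\n- "
          + "\n- ".join(facts))
```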

Experiments & Results

To test SATYRN, the researchers generated 200 reports across 8 different domains (including Healthcare, Criminal Justice, and Environmental Sustainability). They compared SATYRN against:

  1. Unaugmented GPT-4: Asking GPT-4 to write the report without data access.
  2. Code Interpreter (GPT-4): Giving GPT-4 the data files and letting it write its own code.
  3. SATYRN-Table (Ablation): Using SATYRN’s pipeline but passing raw data tables to the LLM instead of natural language statements.

They evaluated the reports on Accuracy (are the numbers right?), Fluency (does it read well?), and Coherence.

The Accuracy Gap

The results were striking. SATYRN significantly outperformed the baselines in factual accuracy.

The fraction of claims classified as factual, rather than confabulated or refuted.

As shown in Figure 5, SATYRN (the three bars on the far right) achieved roughly 86% accuracy, regardless of whether it used GPT-4 or the smaller, open-source Mistral-7B.

Compare this to the Code Interpreter (the second group from left), which only achieved 57% accuracy. The Code Interpreter often failed because it wrote incorrect Python pandas logic or misunderstood the data structure.

Types of Errors

It is also revealing to look at how the models failed.

The fraction of claims that were classified as factual, confabulated, or refuted.

Figure 6 breaks down the error types:

  • Confabulated (Orange): The model made things up. Unaugmented GPT-4 (far left) did this constantly, which is expected.
  • Refuted (Red): The model had the data but got the answer wrong. The Code Interpreter had a massive “Refuted” rate. It tried to calculate the answer but failed.

SATYRN (far right) minimizes both red and orange bars.

The Power of Small Models

Perhaps the most interesting finding in this paper is hidden in the ablation study. When the researchers provided data as tables (SATYRN-Table), the small model (Mistral-7B) struggled, achieving only 55% accuracy. But when the data was converted into natural language statements (the full SATYRN method), Mistral-7B’s accuracy jumped to 89%, matching GPT-4.

This suggests that expensive, massive models aren’t necessary for high-quality reporting if the data is pre-processed into clear, declarative statements.

The average number of claims made for each configuration across all domains.

Finally, Figure 7 shows that SATYRN isn’t just accurate; it’s productive. It generates a high volume of verifiable claims (the gray and blue bars on the right), whereas unaugmented models tend to be vague and produce fewer concrete claims.

Conclusion and Implications

SATYRN represents a shift in how we think about “AI for Data.” Rather than relying on the LLM to be a “jack of all trades” that writes queries, executes code, and drafts text, SATYRN advocates for specialization.

By using a symbolic engine (SQR and the Ring) to handle the rigid logic of databases, and an LLM to handle the fluidity of language, we get the best of both worlds. The system is:

  1. Deterministic: The analytics plan is generated by rules, not probability.
  2. Model-Agnostic: It works just as well with local, open-source models as it does with GPT-4.
  3. Accurate: It drastically reduces hallucinations and calculation errors compared to Code Interpreters.

For students and developers looking at the future of RAG, the lesson is clear: when it comes to structured data, don’t just retrieve—analyze. And never ask an LLM to do a database’s job.