Introduction
Large Language Models (LLMs) have mastered the art of fluency. They can write poetry, summarize emails, and even generate code. However, there is one frontier where they notoriously struggle: strict logical reasoning. If you present an LLM with a complex First-Order Logic (FOL) puzzle involving multiple variables, quantifiers (“everyone,” “some”), and negations, it often hallucinates a convincing but logically unsound answer.
To bridge this gap, researchers have been training models on “synthetic data”—logic puzzles generated by algorithms. The idea is simple: if the model sees enough examples of perfect logic, it might learn the underlying rules. However, existing methods for generating this data have been somewhat rigid, relying on domain-specific algorithms that generate proofs first and then translate them into text. This limits the variety and linguistic naturalness of the training data.
In the paper “Scaling Synthetic Logical Reasoning Datasets with Context-Sensitive Declarative Grammars,” Damien Sileo presents a novel framework called Unigram. This approach flips the script on how we generate reasoning data. Instead of building a proof tree and translating it, Unigram generates natural language and logical formulas simultaneously using flexible grammars, and then uses an external theorem prover to determine the answer.
The results are startling. By training a relatively small model (DeBERTa-v3-base, which has only 184 million parameters) on this new dataset, the researchers achieved state-of-the-art accuracy on human-authored logic benchmarks, surpassing the performance of the massive GPT-4.
Background: The Struggle for Synthetic Logic
To understand why this paper is significant, we first need to look at how synthetic reasoning datasets are usually created.
The Standard: Forward Inference
Previous datasets like RuleTaker or LogicNLI rely on forward inference or proof-tree generation.
- Start with Axioms: The system starts with a set of logical rules (e.g., “If X is A, then X is B”).
- Generate a Proof: It algorithmically constructs a valid derivation path (a proof tree).
- Translate: Finally, it translates that specific proof into English.
While this ensures the logic is correct, it introduces sampling bias: the generator only produces problems that fit its specific proof-generation algorithm. This often creates “an illusion of reasoning depth,” where the model learns to recognize the shape of the proof rather than the underlying logical semantics. Furthermore, generating “Neutral” examples (where the premise neither proves nor disproves the hypothesis) or “Contradictions” in this framework requires complex, unnatural workarounds.
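To make that recipe concrete, here is a toy sketch of forward chaining (my own illustration, not the actual RuleTaker or LogicNLI code): rules are applied to known facts up to a fixed depth, and the final derivation step becomes the hypothesis.

```python
# A toy forward-chaining generator in the spirit of RuleTaker-style pipelines
# (illustrative only; not the actual code behind those datasets).
facts = {("mary", "a")}               # "Mary is A."
rules = [("a", "b"), ("b", "c")]      # "If X is A, then X is B.", "If X is B, then X is C."
proof = []

for _ in range(2):                    # proof depth is fixed up front
    new_facts = set()
    for entity, prop in facts:
        for antecedent, consequent in rules:
            derived = (entity, consequent)
            if prop == antecedent and derived not in facts:
                new_facts.add(derived)
                proof.append(f"{entity} is {antecedent}, so {entity} is {consequent}.")
    facts |= new_facts

# The hypothesis is read off the last derivation step, so the label
# ("entailment") is decided by construction rather than by a solver.
print(proof[-1])                      # mary is b, so mary is c.
```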
The Goal: Natural Language Inference (NLI)
The paper focuses on the Natural Language Inference format, which consists of three parts:
- Premise: A set of facts and rules (e.g., “All students are happy. Mary is a student.”).
- Hypothesis: A statement to test (e.g., “Mary is happy.”).
- Label: The logical relationship:
- Entailment: The hypothesis must be true based on the premise.
- Contradiction: The hypothesis must be false.
- Neutral: The truth of the hypothesis cannot be determined from the premise.
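In practice, each training instance is just such a triple; a minimal illustration (the field names here are mine, not the released dataset’s schema):

```python
# One NLI instance as a plain record (field names are illustrative,
# not the actual Unigram-FOL schema).
example = {
    "premise": "All students are happy. Mary is a student.",
    "hypothesis": "Mary is happy.",
    "label": "entailment",  # one of: "entailment", "contradiction", "neutral"
}
```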
The challenge is to generate millions of these pairs that are linguistically diverse and logically complex without manual human effort.
Core Method: The Unigram Framework
The researchers propose Unigram, a declarative generation framework. Rather than writing code to generate a proof, they write a grammar that describes how English sentences and logical formulas relate to each other.
1. Binding Two Languages
The core innovation of Unigram is that it generates two outputs at the same time: natural language (English) and logical code in TPTP, the standard input syntax for first-order theorem provers.
In a standard Context-Free Grammar (CFG), you might have a rule that says a Sentence leads to a NounPhrase and a VerbPhrase. In Unigram, rules bind two languages. A rule defines:
- Input Types: What arguments does this rule need?
- Realizers: Two templates—one for English and one for the logical representation (TPTP).
- Constraints: Conditions that must be met for the rule to apply.
For example, a rule might define how to express a “universal affirmative” statement:
- English Realizer: “Everyone who is {0} is {1}.”
- Logic Realizer: `![X]: ({0}[X] => {1}[X])` (for all X, if X is {0}, then X is {1}).
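As a rough sketch of how such a rule could be encoded (the class and field names here are hypothetical, not the actual Unigram API), consider:

```python
from dataclasses import dataclass

# A hypothetical encoding of a Unigram-style rule: one rule, two synchronized
# realizers. Names (Rule, english, tptp, constraints) are illustrative only.
@dataclass
class Rule:
    output_type: str
    input_types: tuple            # argument types the rule consumes
    english: str                  # natural-language template
    tptp: str                     # logical template (TPTP syntax)
    constraints: tuple = ()       # e.g. "arguments must be distinct"

universal_affirmative = Rule(
    output_type="sentence",
    input_types=("predicate", "predicate"),
    english="Everyone who is {0} is {1}.",
    tptp="![X]: ({0}(X) => {1}(X))",
    constraints=("distinct_arguments",),  # forbids "everyone who is tall is tall"
)

def realize(rule: Rule, args: tuple) -> tuple:
    """Instantiate both templates with the same arguments, keeping them in sync."""
    return rule.english.format(*args), rule.tptp.format(*args)

print(realize(universal_affirmative, ("curious", "happy")))
# ('Everyone who is curious is happy.', '![X]: (curious(X) => happy(X))')
```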
2. Context-Sensitive Constraints
Standard grammars are often “context-free,” meaning they don’t know what happened previously in the sentence generation. Unigram is context-sensitive. It maintains a “state” of the generation.
This allows for crucial semantic constraints, such as:
- Distinctness: Preventing redundant statements like “Mary likes Mary” or “If Mary is happy then Mary is happy.”
- Constraint Checking: Ensuring that the generated predicates don’t logically crash into each other (e.g., ensuring someone isn’t described as both “tall” and “short” unless that’s the intended contradiction).
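A rough sketch of what such stateful checks amount to (hypothetical helper functions; the real framework expresses these constraints declaratively inside the grammar):

```python
# Illustrative state-aware checks (hypothetical helpers; Unigram declares such
# constraints inside grammar rules rather than hard-coding them like this).
MUTUALLY_EXCLUSIVE = {frozenset({"tall", "short"}), frozenset({"blue", "green"})}

def distinct(args):
    """Reject argument tuples that repeat an item, e.g. ('mary', 'mary')."""
    return len(set(args)) == len(args)

def consistent_with_state(entity, new_predicate, state):
    """Reject predicates that clash with what the premise already asserts
    about this entity (unless a contradiction is the intended label)."""
    already_asserted = state.get(entity, set())
    return all(frozenset({new_predicate, old}) not in MUTUALLY_EXCLUSIVE
               for old in already_asserted)

state = {"mary": {"tall"}}
print(distinct(("mary", "mary")))                     # False -> blocks "Mary likes Mary"
print(consistent_with_state("mary", "short", state))  # False -> clashes with "tall"
print(consistent_with_state("mary", "happy", state))  # True
```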
3. Solver-Based Labeling
Instead of trying to generate a premise that leads to a specific hypothesis (which is hard), Unigram generates a premise and a hypothesis independently (but relatedly) and then asks: “What is the relationship between these two?”
To answer this, the system feeds the generated TPTP logical code into Vampire, a powerful external First-Order Logic theorem prover.
- If Vampire proves that `Premise AND (NOT Hypothesis)` is impossible, the label is Entailment.
- If Vampire proves that `Premise AND Hypothesis` is impossible, the label is Contradiction.
- Otherwise, the label is Neutral.
This approach allows the generation of much more complex and varied logical structures because the generator doesn’t need to know the answer in advance—the solver figures it out later.
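A simplified sketch of that labeling step, assuming a `vampire` binary on the PATH and the usual SZS status line in its output (both are assumptions of this sketch; the paper’s actual pipeline may invoke the prover differently):

```python
import subprocess
import tempfile

def proves(axioms_tptp: str, conjecture_tptp: str) -> bool:
    """Ask Vampire whether the axioms entail the conjecture.
    Proving a conjecture is the same as showing axioms AND NOT(conjecture)
    is unsatisfiable. A timeout simply counts as 'not proved' here."""
    problem = axioms_tptp + f"\nfof(goal, conjecture, {conjecture_tptp}).\n"
    with tempfile.NamedTemporaryFile("w", suffix=".p", delete=False) as f:
        f.write(problem)
        path = f.name
    # "-t 10" sets a 10-second time limit; check the flag against your Vampire version.
    result = subprocess.run(["vampire", "-t", "10", path],
                            capture_output=True, text=True)
    return "SZS status Theorem" in result.stdout

def label(premise_tptp: str, hypothesis_tptp: str) -> str:
    if proves(premise_tptp, hypothesis_tptp):           # Premise AND NOT Hypothesis impossible
        return "entailment"
    if proves(premise_tptp, f"~({hypothesis_tptp})"):   # Premise AND Hypothesis impossible
        return "contradiction"
    return "neutral"

premise = ("fof(a1, axiom, ![X]: (student(X) => happy(X))).\n"
           "fof(a2, axiom, student(mary)).")
print(label(premise, "happy(mary)"))                    # entailment
```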
4. Better Logical Modeling
The paper introduces several logical nuances that were missing from previous datasets:
- Explicit Finite vs. Open Domains: In logic, proving something for “everyone in the room” is different from proving it for “everyone in the universe.” Unigram models this distinction, creating problems that test whether the model understands the scope of quantifiers.
  - Finite: “Mary and Paul are the only people in the room.” (Reasoning by enumerating the listed individuals is possible here.)
  - Open: “Everyone anywhere…” (no such enumeration is available).
- Realistic Predicates: Previous datasets often used random adjectives (“The sad mouse is green”). Unigram uses semantically consistent predicates: it ensures that “Blue” and “Green” are treated as mutually exclusive (you can’t be both), whereas “Rich” and “Happy” are not. This forces the model to learn real-world logical constraints.
- Constraining Material Conditionals: In formal logic, “If P then Q” (\(P \rightarrow Q\)) is true whenever P is false; this is the material conditional. In natural language, however, a sentence like “If unicorns exist, then the moon is made of cheese” sounds absurd even though it is vacuously true (because unicorns don’t exist), and such statements routinely confuse models. Unigram adds constraints that prevent these vacuous conditionals from appearing inside negations, making the data cleaner and more robust.
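Here is what these modeling choices can look like at the TPTP level (hand-written illustrations for this post, not excerpts from the released dataset):

```python
# Hand-written TPTP axioms illustrating the modeling choices above
# (illustrative only; not excerpts from the released Unigram-FOL data).

# Realistic predicates: colours exclude each other, while "rich" and "happy" do not.
color_exclusion = "fof(colors, axiom, ![X]: ~(blue(X) & green(X)))."

# Finite domain: "Mary and Paul are the only people in the room."
# A universal claim about the room can now be checked member by member.
room_closure = "fof(room, axiom, ![X]: (room(X) => (X = mary | X = paul)))."

# Open domain: "Everyone anywhere who is curious is happy."
# No closure axiom exists, so enumeration over named individuals is not enough.
open_rule = "fof(anywhere, axiom, ![X]: (curious(X) => happy(X)))."
```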
Experiments & Results
The researchers created a dataset called Unigram-FOL containing 100k generated examples. They then fine-tuned DeBERTa-v3 models (both Base and Large versions) on this data and tested them against various benchmarks.
The Benchmarks
- FOLIO: A dataset of logic problems written by expert human annotators. This is the gold standard.
- RuleTaker / LogicNLI / FLD: Other synthetic datasets (used for comparison).
- WANLI: A difficult NLI dataset requiring diverse reasoning.
Key Findings
The results show a clear hierarchy of performance.

As the paper’s main results table shows, the Unigram-FOL models (both DeBERTa-base and DeBERTa-large) consistently outperform models trained on previous synthetic datasets (RuleTaker, LogicNLI, FLD).
1. Beating GPT-4: Perhaps the most striking result is the performance on FOLIO.
- GPT-4 (prompted with an external solver) achieves roughly 72.5% accuracy on FOLIO.
- DeBERTa-Large fine-tuned on Unigram-FOL achieves 82.2%.
- When Unigram-FOL is combined with another dataset (FLD), performance jumps to 88.6%.
This demonstrates that a specialized model orders of magnitude smaller than a frontier LLM can outperform it on logical reasoning tasks, provided the training data is of high enough quality.
2. The “Realism” Ablation: The researchers performed “ablation studies” (removing features to see what matters).
- Removing Realistic Predicates: When they replaced the semantically grounded predicates with random adjectives (as in LogicNLI), performance dropped (from 64.4% to 62.4% overall for base models, with larger drops on specific logical fragments).
- Removing Conditional Constraints: Allowing “vacuous” material conditionals also hurt performance.
This shows that it’s not just the quantity of data that matters, but its logical quality: modeling real-world constraints such as open domains and mutually exclusive predicates is crucial.
Conclusion and Implications
The “Unigram” paper teaches us a valuable lesson about the current state of AI: Data quality reigns supreme.
By shifting from procedural proof-generation to a declarative grammar backed by a robust solver, the authors created a dataset that captures the nuances of First-Order Logic better than any previous attempt. They demonstrated that we don’t necessarily need trillion-parameter models to solve complex reasoning tasks. A 184-million parameter model, when fed a diet of high-quality, linguistically diverse, and logically sound synthetic data, can out-reason the giants of the industry.
For students and researchers, this opens up exciting avenues. It suggests that “neuro-symbolic” AI—combining the flexibility of neural networks with the rigor of symbolic logic (via solvers like Vampire)—is a powerful path forward. Future work could extend this grammar-based approach to even more complex domains, such as planning, modal logic (reasoning about possibility/necessity), or even code verification.
The code and datasets are publicly available, allowing anyone to experiment with generating their own logic worlds and training models to master them.