Introduction: The Illusion of Intelligence

Large Language Models (LLMs) like GPT-4 and Gemini have captivated the world with their ability to write code, compose poetry, and pass standardized tests. When you chat with these models, their fluency can easily be mistaken for deep understanding. They seem to reason, argue, and deduce. But are they actually performing logical reasoning, or are they simply excellent pattern matchers mimicking the structure of an argument?

This distinction is critical. True human intelligence involves multi-step logical reasoning—the ability to take a set of premises, apply inference rules, and chain them together to reach a novel conclusion. If you know that “All men are mortal” and “Socrates is a man,” you don’t need to have memorized the sentence “Socrates is mortal” to know it’s true. You derive it.

While LLMs have shown promise, existing benchmarks often let them off easy. They test simple, single-step logic or focus on very narrow types of reasoning. To truly stress-test these models, we need a tougher exam.

Enter Multi-LogiEval, a new benchmark proposed by researchers at Arizona State University. This paper introduces a comprehensive dataset designed to evaluate LLMs on multi-step reasoning across three distinct types of logic. The results are eye-opening: while models shine at the surface level, their “intelligence” often crumbles as the reasoning chain gets deeper.

The Gap in Current Benchmarks

Before diving into the solution, we must understand the problem with current evaluation methods. Most existing logical reasoning datasets suffer from two main limitations:

  1. Simplicity: They often focus on single-step reasoning (e.g., “A implies B, A is true, therefore B”).
  2. Limited Scope: They usually stick to one type of logic, often ignoring the messy, non-monotonic reasoning used in real life.

As shown in the comparison table below, previous datasets like LogicNLI or ProntoQA miss critical components. Some lack multi-step capabilities, while others ignore Non-Monotonic (NM) logic entirely.

Table 1: Comparison of Multi-LogiEval with existing datasets and benchmarks

Multi-LogiEval fills this gap by covering three logic types—Propositional Logic (PL), First-Order Logic (FOL), and Non-Monotonic (NM) reasoning—and specifically testing how models handle increasing depths of reasoning, from 1 to 5 steps.

The Core Method: Building a Logic Torture Test

The researchers did not simply scrape the internet for logic puzzles. They built a synthetic, rigorous dataset from the ground up using a two-stage process: generating rule combinations and then translating those into natural language.

1. The Logic Types

To be comprehensive, the dataset covers three domains:

  • Propositional Logic (PL): This deals with propositions (statements that are true or false) and connectives like “and,” “or,” and “if… then.”
  • First-Order Logic (FOL): This adds complexity with quantifiers (like “for all” or “there exists”) and predicates.
  • Non-Monotonic (NM) Reasoning: This is closer to how humans think. It deals with defaults and exceptions. For example, “Birds fly” is generally true, but if you learn the bird is a penguin, you retract that conclusion.
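
To make the default-and-exception behavior concrete, here is a minimal sketch in Python (my illustration, not anything from the paper): the conclusion “this bird flies” holds by default and is retracted once the penguin fact arrives.

```python
# Minimal sketch of non-monotonic (default) reasoning; illustrative only.
# Default rule: birds fly, unless a known exception (here, being a penguin) blocks it.

def can_fly(facts: set) -> bool:
    """Apply the default 'birds fly' unless an exception defeats it."""
    return "bird" in facts and "penguin" not in facts

facts = {"bird"}
print(can_fly(facts))   # True: the default conclusion holds

facts.add("penguin")    # new information arrives
print(can_fly(facts))   # False: the earlier conclusion is retracted, not just weakened
```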

The researchers utilized over 30 inference rules. You can see the foundational rules for PL and FOL below. These are the mathematical building blocks of the dataset.

Table 2: Inference rules that establish the relationship between premises and their corresponding conclusions.
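
For orientation, here are a few standard textbook rules of this kind, written in the same notation the paper uses (the benchmark’s exact rule set is the one listed in Table 2):

  • Modus Ponens (PL): \(((p \to q) \land p) \vdash q\)
  • Modus Tollens (PL): \(((p \to q) \land \neg q) \vdash \neg p\)
  • Hypothetical Syllogism (PL): \(((p \to q) \land (q \to r)) \vdash (p \to r)\)
  • Universal Instantiation (FOL): \(\forall x\, P(x) \vdash P(a)\)
  • Existential Generalization (FOL): \(P(a) \vdash \exists x\, P(x)\)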

2. Chaining Rules for Multi-Step Reasoning

The “Multi” in Multi-LogiEval stands for multi-step. The researchers created chains where the conclusion of one inference rule becomes the premise for the next.

Imagine a chain of dominoes. The first rule might deduce logical statement \(Q\) from \(P\). The next rule takes \(Q\) and combines it with a new premise \(R\) to deduce \(S\). This continues up to five levels deep (Depth-5).

Figure 2: Process for combining multiple logical inference rules for PL and FOL

As illustrated above, this chaining process is rigorous. The system ensures that the conclusion at depth \(D\) is logically entailed by the context provided. If the model gets the final answer right, it theoretically must have traversed the entire logical chain correctly.
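
As a rough sketch of this chaining idea (simplified, not the authors’ generation code), each step can be treated as a template whose conclusion becomes a premise of the next step:

```python
# Simplified sketch of rule chaining; not the authors' generation pipeline.
# Every step here is modus ponens: from "P -> Q" and "P", conclude "Q".
# The conclusion of step i is reused as a premise at step i + 1.

def build_chain(depth: int):
    """Return the premises (the 'context') and the final conclusion of a chain."""
    premises = ["p1"]                          # the starting fact
    for i in range(1, depth + 1):
        premises.append(f"p{i} -> p{i + 1}")   # the conditional consumed at step i
    conclusion = f"p{depth + 1}"               # what the model must ultimately derive
    return premises, conclusion

context, conclusion = build_chain(depth=5)
print("Context:   ", context)
# ['p1', 'p1 -> p2', 'p2 -> p3', 'p3 -> p4', 'p4 -> p5', 'p5 -> p6']
print("Conclusion:", conclusion)               # 'p6' needs five chained inference steps
```

In the real dataset the individual steps draw on the full mix of rules in Table 2 rather than repeating modus ponens, but the depth semantics are the same: answering correctly at Depth-5 requires deriving every intermediate conclusion along the way.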

3. From Symbols to Stories

The logic formulas (like \(((p \to q) \land p) \vdash q\)) are perfect for computers but unnatural for LLMs trained on text. The researchers used a “teacher” model (Claude-2) to translate these symbolic chains into natural language stories.

They constructed elaborate prompts that defined the logical rules and asked the model to wrap them in a coherent narrative involving real-world concepts (like “studying for exams” or “weather conditions”) rather than abstract variables like \(X\) and \(Y\).

Figure 3: Data generation prompt for PL and FOL

The prompt structure, shown above, ensures diversity and formatting consistency. The result is a dataset of “Context” and “Question” pairs. The Context contains the story (the premises), and the Question asks about a logical conclusion derived from that story.
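
Concretely, each generated item reduces to a small context/question record. A hypothetical Depth-2 PL instance might be shaped like the sketch below (field names and the story itself are illustrative, not copied from the dataset):

```python
# Illustrative shape of one generated example; field names and content are
# assumptions for exposition, not the dataset's exact schema.
example = {
    "logic_type": "PL",
    "depth": 2,
    "rule_combination": ["Hypothetical Syllogism", "Modus Ponens"],
    "context": (
        "If Maria studies for the exam, she passes it. "
        "If Maria passes the exam, she celebrates with her friends. "
        "Maria studied for the exam."
    ),
    "question": "Does this imply that Maria celebrates with her friends?",
    "answer": "Yes",
}
```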

Here is what the final data looks like across the different logic types:

Table 4: NL examples of different rule combinations for all three logic types.

Notice how the Non-Monotonic example (bottom row) deals with “usually” and exceptions (Jim vs. Pam getting free lunch), representing a more nuanced type of reasoning than the strict mathematical logic of PL and FOL.

Designing the Experiments

With the dataset generated and manually validated (removing roughly 14% of samples that had logical errors), the researchers put today’s top models to the test.

The Task: A binary classification problem. Given the context and the question, does the conclusion logically follow? The model must answer “Yes” or “No.”

The Prompting Strategy: They used Zero-shot Chain-of-Thought (CoT). This means they didn’t give the models examples of how to solve the specific problem (Zero-shot), but they did ask the models to “think step-by-step” before answering. This is crucial because we want to measure the model’s innate reasoning ability based on its pre-training, not its ability to copy a pattern from a few examples.

The Contenders:

  • Proprietary Models: GPT-4, ChatGPT, Gemini-Pro.
  • Open Source Models: Yi-34B, Orca-2-13B, Mistral-7B.
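
Putting the task and the zero-shot CoT setup together, the evaluation loop can be pictured roughly as the sketch below. Here, call_llm is a stand-in for whichever model API is under test, and the prompt wording is illustrative rather than the paper’s exact template.

```python
import re

# Sketch of a zero-shot CoT evaluation loop. `call_llm` is a placeholder for a
# real model API; the prompt wording is illustrative, not the paper's template.

COT_TEMPLATE = (
    "Context: {context}\n"
    "Question: {question}\n"
    "Let's think step by step, then answer with 'Yes' or 'No'."
)

def extract_label(generation: str) -> str:
    """Pull the final Yes/No verdict out of a free-form reasoning chain."""
    matches = re.findall(r"\b(yes|no)\b", generation, flags=re.IGNORECASE)
    return matches[-1].capitalize() if matches else "Unknown"

def evaluate(examples, call_llm) -> float:
    """Return accuracy of `call_llm` on a list of {'context', 'question', 'answer'} dicts."""
    correct = 0
    for ex in examples:
        prompt = COT_TEMPLATE.format(context=ex["context"], question=ex["question"])
        prediction = extract_label(call_llm(prompt))
        correct += int(prediction == ex["answer"])
    return correct / len(examples)
```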

Results: The Depth Trap

The results expose a fundamental weakness in current LLMs: a lack of logical endurance.

When the reasoning is shallow (Depth-1), most models perform admirably. However, as the chain of logic grows longer, performance degrades—sometimes catastrophically.

The Performance Cliff

The graph below is the most telling visualization in the paper. Look at the downward slope for almost every model.

Figure 1: Performance (avg. accuracy across each depth for PL & FOL) of various LLMs on Multi-LogiEval.

  • GPT-4 (Blue Diamonds): It is the most robust, starting near 98% accuracy at Depth-1. However, even GPT-4 dips significantly as complexity increases, hovering around 65-70% at deeper levels.
  • Orca-2-13B (Purple Stars): This model illustrates the struggle of smaller, open-source models. It starts strong but crashes to near 10% accuracy at Depth-5—which is far worse than random guessing.
  • The “Depth Effect”: The steep decline indicates that models struggle to maintain a coherent “thread” of truth over multiple steps. An error at step 2 propagates to steps 3, 4, and 5, compounding the failure.

The Numerical Breakdown

For a granular look at the accuracy, we can examine the specific numbers across logic types.

Table 6: Evaluation of LLMs in terms of accuracy on Multi-LogiEval.

A few key takeaways from this table:

  1. Classical Logic is Hard: In First-Order Logic (FOL), open-source models like Orca and Yi-34B drop to single-digit or low double-digit accuracy at Depth-5.
  2. Random Baseline: On a balanced Yes/No task, a uniform random guesser would score about 50%, but the answer classes here are not evenly split, so the researchers computed a weighted random baseline that reflects the Yes/No distribution at each depth; for Depth-5 it works out to roughly 83.33% (a worked illustration follows this list). As seen in Table 6, every single model underperformed this baseline at Depth-5 on average. This suggests that for highly complex logic, current LLMs might be “hallucinating” logic rather than reasoning.
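
For intuition, consider an illustrative split (these numbers are not from the paper): if five out of every six Depth-5 answers were “Yes,” a guesser that always answers “Yes” would already score \(5/6 \approx 83.33\%\). Falling below that line means a model is doing worse than simply betting on the more frequent label.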

The Non-Monotonic Surprise

You might notice in Table 6 that for Non-Monotonic (NM) logic, performance actually increases or stays stable as depth increases for some models. This seems counter-intuitive.

The researchers explain that constructing deep NM chains is difficult. To achieve Depth-5 in NM, they combined one Non-Monotonic rule with several standard Propositional Logic rules. As the depth increased, the ratio of “standard” logic to “fuzzy” NM logic shifted. The added standard rules helped ground the models, improving their performance relative to the ambiguity of shallow, purely non-monotonic problems.

Qualitative Analysis: Why Do They Fail?

The researchers didn’t just look at the scores; they analyzed the “reasoning chains” the models generated.

  1. Mapping Failures: At Depth-1, models often failed to map the natural language back to the logical rule. For example, failing to realize that “John is not at home” satisfies the \(\neg P\) condition.
  2. Context Length vs. Information: Surprisingly, models performed slightly better at Depth-3 than Depth-2 in some cases. The researchers hypothesize that the slightly longer context provided more “connective tissue” for the models to latch onto.
  3. The Verbosity Trap: ChatGPT (as opposed to GPT-4) tended to generate much longer reasoning chains at Depth-5. However, length did not correlate with accuracy. The model would often argue itself in circles or lose the plot entirely, highlighting that verbosity is not the same as logic.
  4. Open Source Size: Interestingly, the smaller Mistral-7B often outperformed the larger Orca-2-13B and Yi-34B at higher depths. This suggests that model architecture and training quality (specifically reasoning-focused training) matter more than raw parameter count when it comes to logic.

Conclusion and Implications

Multi-LogiEval serves as a reality check for the AI industry. While we celebrate LLMs for their linguistic prowess, their logical core remains brittle. The sharp drop in performance as reasoning depth increases indicates that these models mimic reasoning steps rather than executing a robust logical algorithm.

For students and researchers, this paper highlights crucial future directions:

  • Neuro-Symbolic AI: The failure of pure LLMs suggests we might need to combine neural networks with traditional symbolic solvers (like Prover9) to handle heavy logic, rather than relying on the LLM to do it all in-context.
  • Better Training Data: We need more datasets like Multi-LogiEval included in pre-training to teach models how to reason, not just what the answer looks like.

As we move toward AGI, benchmarks like Multi-LogiEval will be essential. They remind us that true intelligence isn’t just about knowing the answer—it’s about the validity of the path you took to get there.