Large Language Models (LLMs) have become exceptionally good at generating procedural text. If you ask a state-of-the-art model for a recipe to bake a cake, it will likely produce a perfectly coherent list of steps: mix the dry ingredients, beat the eggs, combine them, and bake at a specific temperature. On the surface, it looks like the model understands the process.
But there is a significant difference between memorizing a sequence of words and understanding the causal logic that binds those steps together. Does the model know why you must mix the flour before baking? Does it understand that you can chop the nuts while the oven preheats, but you can’t frost the cake before it cools?
This distinction—between mimicking structure and understanding causality—is the focus of a recent paper titled “CAT-BENCH: Benchmarking Language Model Understanding of Causal and Temporal Dependencies in Plans.” The researchers introduce a new benchmark designed to stress-test an LLM’s ability to reason about time and cause within natural language plans. The results reveal that while LLMs can write recipes, they often struggle to understand the fundamental dependencies within them.
In this deep dive, we will explore the architecture of CAT-BENCH, analyze the surprising failures of top-tier models like GPT-4 and Gemini, and discuss why standard prompting techniques like Chain-of-Thought might not work the way we expect them to in planning tasks.
The Problem: Generation vs. Understanding
Planning is a central component of decision-making in Artificial Intelligence. Traditionally, AI planning has been studied in rigid, simulated environments like “Blocksworld” (where an agent stacks blocks) or using formal languages like PDDL (Planning Domain Definition Language). While these environments allow for perfect logical checks, they don’t reflect the messiness of the real world.
Real-world plans are often expressed in natural language—instruction manuals, medical guidelines, and cooking recipes. An agent operating in the real world needs to understand the “preconditions” (what must be true before an action) and “effects” (what becomes true after an action).
The researchers posit that if an LLM truly understands a plan, it should be able to identify the temporal dependencies between steps.
- Dependent Steps: Step A must happen before Step B (e.g., you must peel the banana before eating it).
- Independent Steps (Parallel): Step A and Step B can happen in any order without affecting the outcome (e.g., chopping onions and chopping garlic).
Current LLMs are often “biased” towards linearity. Because they are trained to predict the next token, they tend to assume that if Step 1 is written before Step 2, then Step 2 must depend on Step 1. CAT-BENCH was created to expose this heuristic.
Introducing CAT-BENCH
To rigorously test this capability, the authors built CAT-BENCH (Causal and Temporal Benchmark). They utilized the Recipe Flow Graph Corpus, a dataset of 300 English cooking recipes that have been annotated with directed acyclic graphs (DAGs). In these graphs, nodes represent steps, and edges represent dependencies.
If a path exists from Step \(i\) to Step \(j\) in the graph, they are dependent. If no path exists, they are independent (non-dependent).
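As a concrete illustration, here is a minimal sketch of how step pairs could be labeled from such a graph, assuming the annotations are loaded as a networkx `DiGraph` whose nodes are step numbers (the `label_step_pairs` function and the toy graph are ours for illustration, not the authors’ code):

```python
import networkx as nx
from itertools import combinations

def label_step_pairs(flow_graph: nx.DiGraph) -> dict:
    """Label every pair of steps as DEP or NONDEP.

    A pair (i, j) is dependent if the flow graph contains a directed
    path between the two steps in either direction; otherwise the two
    steps can be executed in any order relative to each other.
    """
    labels = {}
    for i, j in combinations(sorted(flow_graph.nodes), 2):
        dependent = nx.has_path(flow_graph, i, j) or nx.has_path(flow_graph, j, i)
        labels[(i, j)] = "DEP" if dependent else "NONDEP"
    return labels

# Toy graph: 1 -> 2 -> 4 and 3 -> 4 (steps 2 and 3 are parallel branches).
g = nx.DiGraph([(1, 2), (2, 4), (3, 4)])
print(label_step_pairs(g))  # (2, 3) comes out as NONDEP, (1, 4) as DEP
```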
Using this data, the researchers generated 4,260 questions spanning 57 unique plans. The questions are binary (Yes/No) and fall into two categories:
- DEP (Dependent): Questions about steps that must occur in a specific order.
  - Example: “Must Step 6 happen before Step 8?”
  - Reasoning: Testing knowledge of preconditions and effects.
- NONDEP (Non-Dependent): Questions about steps that are independent.
  - Example: “Must Step 7 happen after Step 6?”
  - Reasoning: Testing knowledge of step independence and parallel execution.

As shown in Figure 1, the benchmark isolates specific pairs of steps. In the “Almond Flour Chocolate Cake” example, the model is asked if Step 6 must happen before Step 8. The correct answer requires understanding that ingredients must be in the bowl (Step 6) before they can be stirred (Step 8). Conversely, adding almonds and adding milk might be done in parallel, making them non-dependent.
The benchmark enables two specific evaluation tasks:
- Step Order Prediction: A binary classification task (Yes/No).
- Step Order Explanation: The model must explain why a dependency exists (or doesn’t).
Experimental Setup
The researchers tested a wide variety of models, ranging from open-source options like Llama3-8B to proprietary giants like GPT-4-Turbo, GPT-4o, Claude 3.5 Sonnet, and the Gemini 1.5 family.
They explored three prompting strategies to see whether they could coax better reasoning out of the models (a prompt-format sketch follows the list):
- (A): Answer Only. The model simply predicts “Yes” or “No”.
- (A + E): Answer then Explain. The model gives the answer, then provides the reasoning.
- (E + A): Explain then Answer. Also known as Chain-of-Thought (CoT), where the model reasons first to guide its final answer.
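As a rough illustration of the three settings (the exact prompt wording used in the paper may differ), they can be thought of as variations on a single template:

```python
RECIPE = "1. Preheat the oven to 180C.\n2. Mix the dry ingredients.\n..."  # full numbered plan
QUESTION = "Must Step 6 happen before Step 8?"

PROMPTS = {
    # (A): answer only
    "A": f"{RECIPE}\n\n{QUESTION}\nAnswer with Yes or No.",
    # (A + E): commit to the answer first, then justify it
    "A+E": f"{RECIPE}\n\n{QUESTION}\nAnswer with Yes or No, then explain your answer.",
    # (E + A): classic Chain-of-Thought: reason first, answer last
    "E+A": f"{RECIPE}\n\n{QUESTION}\nThink through the dependencies step by step, then answer Yes or No.",
}

for name, prompt in PROMPTS.items():
    print(f"--- {name} ---\n{prompt}\n")
```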
Key Results: How Good are LLMs at Planning?
The results were surprisingly underwhelming. On a balanced dataset (where random guessing would yield 50% accuracy), the best zero-shot models barely outperformed chance when asked only for an answer.
1. The “Dependency Bias”
One of the most critical findings is that LLMs are heavily biased towards predicting dependence. They generally assume that the order in which steps are listed in the text is the only valid order.

Table 1 highlights this struggle. Look at the NONDEP columns (questions about independent steps). The Recall (R) for many models is abysmally low in the (A) setting. For example, GPT-4o has a NONDEP Recall of only 0.19.
This means that for questions where the answer is “No, these steps don’t depend on each other,” GPT-4o incorrectly said “Yes, they do” over 80% of the time. The models are relying on the temporal order of the text as a heuristic for causal dependence.
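To make the metric concrete, here is a small sketch of how per-class recall can be computed with scikit-learn; the toy labels below are invented for illustration, not the paper’s data:

```python
from sklearn.metrics import classification_report

# Gold labels and model predictions for a handful of questions.
# "Yes" means the steps are dependent, "No" means they are independent.
gold = ["Yes", "Yes", "No", "No", "No", "No", "No"]
pred = ["Yes", "Yes", "Yes", "Yes", "Yes", "No", "Yes"]  # a "Yes"-happy model

# Recall for the "No" class is the fraction of truly independent pairs
# that the model actually answered "No" to, i.e. the NONDEP recall.
print(classification_report(gold, pred, labels=["Yes", "No"], digits=2))
```

In this toy run the model answers “No” for only one of the five independent pairs, giving a NONDEP recall of 0.20, roughly the behavior reported for GPT-4o in the (A) setting.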
2. Explanations Improve Accuracy
Interestingly, asking the model to explain its reasoning (A + E) significantly improved performance across the board.
- Gemini 1.5 Pro saw its F1 score jump from 0.55 (Answer only) to 0.73 (Answer + Explanation).
- GPT-4o jumped from 0.49 to 0.70.
Forcing the model to generate an explanation seems to ground it better in the context of the recipe, helping it overcome the simple “text order” heuristic. However, even with explanations, the best F1 score of 0.73 indicates that the models still get the causal logic wrong roughly a quarter of the time.
3. Human Evaluation of Explanations
Since the models were generating explanations, the researchers had to evaluate the quality of that text. They employed human annotators to rate the explanations on a Likert scale of 1 to 5.

Table 2 reveals a disconnect between the models’ confidence and human assessment.
- MODAVG (Modified Average) accounts for cases where the prediction was wrong.
- The scores hover around 2.6 to 2.9 out of 5.
- This suggests that even when models are right, their explanations are often mediocre. Worse, models are capable of hallucinating convincing-sounding explanations for completely incorrect answers. Llama3-8B, for instance, often justified the opposite of the correct answer.
Deep Dive: Robustness and Consistency
A robust AI shouldn’t change its mind about a fact just because you phrased the question differently. The researchers introduced two metrics to measure consistency.
Temporal Consistency (TC)
If a model says “Step A must happen before Step B,” it should logically also say “Step B must happen after Step A.” If the model answers “Yes” to the first and “No” to the second, it is temporally inconsistent.
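A minimal sketch of this check (the data layout is ours; the benchmark aggregates over all mirrored question pairs):

```python
def temporally_consistent(answer_before: str, answer_after: str) -> bool:
    """answer_before: Yes/No to "Must Step A happen before Step B?"
    answer_after:  Yes/No to "Must Step B happen after Step A?"
    The two questions are logically equivalent, so the answers must match."""
    return answer_before == answer_after

# Temporal Consistency = fraction of mirrored question pairs with matching answers.
mirrored_answers = [("Yes", "Yes"), ("Yes", "No"), ("No", "No")]
tc = sum(temporally_consistent(b, a) for b, a in mirrored_answers) / len(mirrored_answers)
print(f"TC = {tc:.0%}")  # 67% on this toy set
```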
Order Contrastive Consistency (OCC)
This is a clever test. For NONDEP (independent) steps, the order in the text doesn’t matter. The researchers took recipes and physically swapped the order of independent steps in the input text (e.g., putting “mix dry ingredients” after “mix wet ingredients” in the text, even though they are parallel tasks). If the model is truly reasoning about the plan, swapping the text order shouldn’t change its answer about the dependency.
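A rough sketch of the idea, where `query_model` is a placeholder for asking the Yes/No question against a given recipe text, and the exact renumbering conventions of the actual benchmark are glossed over:

```python
def swap_steps(steps: list[str], i: int, j: int) -> list[str]:
    """Return a copy of the plan with the text of steps i and j (1-indexed) swapped."""
    swapped = steps.copy()
    swapped[i - 1], swapped[j - 1] = swapped[j - 1], swapped[i - 1]
    return swapped

def order_contrastive_consistent(steps: list[str], i: int, j: int, query_model) -> bool:
    """For an independent pair (i, j), swapping the two steps in the input
    text should not change the model's answer about their dependency."""
    question = f"Must Step {i} happen before Step {j}?"
    original = query_model("\n".join(steps), question)
    swapped = query_model("\n".join(swap_steps(steps, i, j)), question)
    return original == swapped
```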

Table 3 shows high inconsistency.
- GPT-4o has decent Temporal Consistency (79.86%) but terrible Order Contrastive Consistency (47.96%).
- This supports the hypothesis that the model over-relies on the position of sentences in the prompt. When the researchers swapped the sentence order, the model often changed its answer about the dependency, suggesting it was not analyzing the logic of the recipe but merely the sequence of the words.
The Chain-of-Thought Anomaly
Perhaps the most surprising finding in the paper is related to Chain-of-Thought (CoT) prompting. Standard wisdom in the LLM community suggests that “Let’s think step by step” (reasoning before answering) yields the best results.
However, on CAT-BENCH, the Answer-then-Explain (A+E) method outperformed Explain-then-Answer (E+A/CoT).

As seen in Table 4, while CoT (E+A) is better than just answering (A), it lags behind (A+E). Why would reasoning first hurt the model?
The researchers found that CoT often led the model into hallucinations. When asked to reason step-by-step, the model would sometimes invent details about the recipe to support a linear narrative.

Figure 4 provides a stark example. The model claims Step 10 cannot happen after Step 3 because “the eggplant needs to be cooked.” The problem? There is no eggplant in the recipe. It is a lentil soup recipe. The Chain-of-Thought process allowed the model to drift away from the source text and hallucinate ingredients, leading to incorrect causal reasoning.
Where Do Models Fail? Error Analysis
The researchers categorized the types of errors models make when they fail to identify dependencies.
1. Multi-hop Dependency Failure
Models struggle with the transitivity of dependencies (A must happen before B, B must happen before C, therefore A must happen before C).

In Figure 5 (lower box), the model fails to realize that cooling the cake (Step 10) depends on combining the flour (Step 2). It correctly notes that cooling follows baking, but misses the deeper link that baking required the earlier mixing steps.
2. Distance Bias
The researchers analyzed how the distance between steps in the text affected accuracy.

Figure 8 illustrates a “Distance Bias.” Models are much more likely to predict a dependency if the steps are far apart in the text (Distant). They assume that if Step 1 is at the start and Step 20 is at the end, they must be dependent. Generating explanations (the red bars) helps mitigate this bias significantly compared to just answering (blue bars), but the tendency remains.
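Here is a sketch of how this kind of breakdown can be computed; the “near”/“distant” threshold and the record format below are assumptions for illustration, not the paper’s analysis script:

```python
from collections import defaultdict

def accuracy_by_distance(records, near_threshold: int = 3) -> dict:
    """Bucket questions by how far apart the two steps sit in the text
    and report accuracy per bucket.

    records: iterable of (step_i, step_j, gold_answer, model_answer) tuples.
    """
    buckets = defaultdict(lambda: [0, 0])  # bucket -> [num_correct, num_total]
    for i, j, gold, pred in records:
        bucket = "near" if abs(i - j) <= near_threshold else "distant"
        buckets[bucket][0] += int(gold == pred)
        buckets[bucket][1] += 1
    return {name: correct / total for name, (correct, total) in buckets.items()}

records = [(1, 2, "No", "No"), (2, 3, "Yes", "Yes"), (1, 12, "No", "Yes"), (3, 20, "No", "Yes")]
print(accuracy_by_distance(records))  # {'near': 1.0, 'distant': 0.0}
```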
Conclusion: The Illusion of Understanding
CAT-BENCH serves as a reality check for the capabilities of Large Language Models in planning domains. While LLMs are fluent and can reproduce the structure of a plan, their grasp of the underlying logic—the causal web of preconditions and effects—is brittle.
The key takeaways for students and researchers are:
- Don’t mistake fluency for logic: Just because a plan looks readable doesn’t mean the steps are logically sound.
- Beware the Linear Bias: Models struggle to understand that real-world actions can happen in parallel; they are biased toward the sequential order of the input text.
- Prompting Matters: For this specific task, asking for the answer followed by an explanation worked better than standard Chain-of-Thought, likely because it constrained the model to the binary decision before it could hallucinate details.
- Verification is Key: We cannot yet rely on LLMs to autonomously validate critical plans (like safety procedures or chemical synthesis) without human oversight or external verification tools.
CAT-BENCH provides a standardized way to measure progress in this area. Until models can score highly on metrics like Order Contrastive Consistency and NONDEP recall, we should view their “planning” abilities as sophisticated pattern matching rather than true causal reasoning.