Large Language Models (LLMs) like GPT-4 and Llama-3 are impressive polymaths. They can write poetry, debug code, and summarize history. But for all their sophistication, they often struggle with a concept that a primary school student grasps intuitively: Time.

Specifically, LLMs struggle with relative temporal understanding. If you tell a model that “John finished dinner before he went for a walk,” and then ask, “Did John go for a walk after dinner?”, a human immediately knows the answer is “Yes.” LLMs, however, frequently get confused by these logical entanglements. They suffer from temporal inconsistency: they might correctly answer one version of the question but contradict themselves when the question is phrased slightly differently.

In this post, we are doing a deep dive into the paper “Counterfactual-Consistency Prompting for Relative Temporal Understanding in Large Language Models” by Jongho Kim and Seung-won Hwang. We will explore how these researchers developed a novel prompting method that uses “counterfactuals”—hypothetical, time-flipped scenarios—to force models to be logically consistent, significantly improving their ability to reason about time.

The Problem: When LLMs Lose Track of Time

To understand the solution, we first have to understand the specific failure mode of the models. The researchers focus on relative event understanding. This isn’t about calculating dates (e.g., “What is 10 days after May 1st?”); it is about understanding the sequence and properties of events based on context.

The core issue is inconsistency.

Imagine a model is given a story and asked two questions:

  1. Q1: Does Event A happen after Event B?
  2. Q2: Does Event A happen before Event B?

Logically, if the answer to Q1 is “No,” the answer to Q2 must be “Yes” (assuming they don’t happen simultaneously). However, standard prompting techniques, and even the famous Chain-of-Thought (CoT) reasoning, treat these questions in isolation.

As illustrated below, existing approaches can lead the model to answer “Yes” to both questions, creating a paradox where Event A is both before and after Event B.

Figure 1: Example of leveraging counterfactual questions to resolve temporal inconsistency in LLMs.

In Figure 1 (a) above, you can see the standard failure mode. The model generates a chain of thought for the first question and concludes “Yes.” It then generates a separate chain of thought for the conflicting question and also concludes “Yes.” The model lacks a global view of the timeline.

The researchers propose a solution shown in Figure 1 (b): Counterfactual-Consistency Prompting (CCP). Instead of answering the question in a vacuum, the model is prompted to generate a counterfactual question (essentially asking, “What if the timeline were reversed?”) and to use the answer to that hypothetical to constrain the final result.

The Core Insight: Reasoning via Constraints

Why does this work? The intuition relies on the logical interdependence of temporal relations. Temporal reasoning imposes constraints: an event with a specific duration cannot also have a drastically different duration, and if Event A happens before Event B, the reverse cannot be true.

The authors formalize this logic with the following implication:

Equation 1, the logical implication of temporal relations:

\[
r_{2}(e_{1}, e_{2}) \;\Longrightarrow\; \neg\, r_{1}(e_{1}, e_{2}), \qquad r_{1}, r_{2} \in \mathcal{V},\; r_{1} \neq r_{2}
\]

In this equation:

  • \(r_{2}(e_{1}, e_{2})\) represents a counterfactual temporal relation (e.g., “before”).
  • \(r_{1}(e_{1}, e_{2})\) represents the original relation (e.g., “after”).
  • \(\mathcal{V}\) is the set of valid/coherent relations.

Simply put: If the model determines that the counterfactual scenario (\(r_2\)) is true, it is logically constrained to accept that the original scenario (\(r_1\)) is false (or vice-versa). By explicitly forcing the model to evaluate the counterfactual, we create a temporal constraint that guides the model toward consistency.
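As a concrete instance, with \(r_2\) = “before” and \(r_1\) = “after” (as in the bullets above): if the model accepts that \(e_1\) is before \(e_2\), it is constrained to reject that \(e_1\) is after \(e_2\).

\[
\text{before}(e_{1}, e_{2}) \;\Longrightarrow\; \neg\,\text{after}(e_{1}, e_{2})
\]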

How Counterfactual-Consistency Prompting (CCP) Works

The CCP method is a multi-step process designed to approximate these constraints during inference. It doesn’t require fine-tuning the model; it works entirely through clever prompting.

Step 1: Generating Temporally Counterfactual Questions

Standard “data augmentation” in machine learning often involves swapping words to create new training data. However, the researchers needed something more specific: Temporally Counterfactual Questions.

They prompt the model to generate a modified version of the user’s question where the temporal semantics are flipped. This is dynamic—the model writes its own counterfactuals based on the context.

  • Original: “Did they get married after they moved to Maine?”
  • Generated Counterfactual: “Did they get married before they moved to Maine?”

This applies to more than just ordering. It covers duration, frequency, and stationarity. For example, if the text says an empire lasted centuries, a counterfactual might ask if it lasted “1 year.”

Figure 4: Examples of MCTACO Question Types. MCTACO covers various temporal aspects including event duration, frequency, stationarity, ordering, and typical time.

As shown in Figure 4, the method is versatile. Whether the question is about how often something happens (Frequency) or how long it takes (Duration), the model can generate a “mirror universe” question that challenges the original premise.
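To make Step 1 concrete, here is a minimal sketch of what the counterfactual-generation call might look like in code. The `ask_llm` helper, the function name, and the prompt wording are illustrative assumptions, not the paper's exact prompt.

```python
def ask_llm(prompt: str) -> str:
    """Placeholder for a call to whichever LLM you are using.
    Assumed to return the model's text completion for `prompt`."""
    raise NotImplementedError

def generate_counterfactual_question(context: str, question: str) -> str:
    """Step 1 of CCP (sketch): ask the model to rewrite the question so that
    its temporal semantics are flipped (before <-> after, a plausible
    duration <-> an implausible one, and so on)."""
    prompt = (
        f"Context: {context}\n"
        f"Original question: {question}\n"
        "Rewrite the question so that its temporal meaning is reversed, "
        "for example by swapping 'before' and 'after' or by proposing an "
        "implausible duration or frequency. Return only the rewritten question."
    )
    return ask_llm(prompt).strip()

# Illustrative behavior (matching the example above):
#   "Did they get married after they moved to Maine?"
#   -> "Did they get married before they moved to Maine?"
```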

Step 2: Aggregating Predictions

Once the counterfactual question is generated, the model answers it. But here is the critical part: the model doesn’t just trust the counterfactual answer blindly (because the model could be wrong about that, too!).

Instead, the method uses Self-Consistency Aggregation. The model re-weights its final prediction by considering the probability distributions of both the original question and the counterfactual question together.

Equation 2, the probability aggregation:

\[
P\big(Y \,\big|\, Q,\; Q^{c},\; Y^{c}\big)
\]

This formula represents the aggregation. The final prediction \(P(Y)\) isn’t just based on the original question \(Q\). It is a function of the original question plus the generated counterfactual questions \(Q^{c}\) and their answers \(Y^{c}\).

If the model says there is a 60% chance “A is before B” (Original) but a 90% chance “A is NOT after B” (Counterfactual), the aggregation step allows the stronger signal from the counterfactual to correct the original answer.
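As a rough sketch of this re-weighting idea, suppose we can read off a yes-probability for each question (for example, from the model's token probabilities). The equal-weight averaging rule below is only an illustration of the aggregation in Equation 2, not the paper's exact formula.

```python
def aggregate_yes_probability(p_yes_original: float,
                              p_yes_counterfactual: float) -> float:
    """Combine evidence from the original and counterfactual questions.

    A "yes" to the counterfactual (flipped) question is evidence for "no"
    on the original question, so we average the original yes-probability
    with the complement of the counterfactual yes-probability. The equal
    weighting here is only an illustration of the aggregation idea.
    """
    return 0.5 * (p_yes_original + (1.0 - p_yes_counterfactual))

# Worked example from the text: 60% for "A is before B" (original) and
# only 10% for "A is after B" (counterfactual), i.e. 90% for "A is NOT
# after B". The aggregate leans more firmly toward "before":
print(aggregate_yes_probability(0.6, 0.1))  # 0.75
```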

Experimental Setup

To prove that CCP works, the researchers tested it against strong baselines, including standard prompting (SP), Chain-of-Thought (CoT), Self-Consistency, and Multi-Agent Debate.

They utilized three distinct datasets focusing on relative time:

  1. TempEvalQA-Bi: Focuses on explicit event ordering (Before vs. After).
  2. TRACIE: A more complex dataset involving implicit events where the timeline must be inferred from a story.
  3. MCTACO: A diverse dataset covering duration, frequency, and typical times (as seen in Figure 4).

The Metrics: They measured Accuracy (ACC) and, crucially, Inconsistency (INC). The INC score measures how often the model contradicts itself. A lower INC score is better.
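To make the INC metric concrete, here is one plausible way such a score could be computed for before/after question pairs; the paper's exact definition may differ.

```python
def inconsistency_rate(answer_pairs: list[tuple[str, str]]) -> float:
    """Fraction of (Q_before, Q_after) answer pairs that contradict each other.

    For a pair like "Does A happen before B?" / "Does A happen after B?",
    and assuming the events are not simultaneous, answering the same way to
    both questions ("yes"/"yes" or "no"/"no") is a contradiction. This is one
    plausible scoring rule, not necessarily the paper's exact one.
    """
    contradictions = sum(1 for before_ans, after_ans in answer_pairs
                         if before_ans == after_ans)
    return contradictions / len(answer_pairs)

# Example: the model contradicts itself on one of three question pairs.
print(inconsistency_rate([("yes", "no"), ("yes", "yes"), ("no", "yes")]))  # 0.33...
```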

Results: Consistency Breeds Accuracy

The results were compelling. CCP outperformed baselines across the board, particularly in reducing inconsistency.

For example, on the Llama-3-70B model:

  • Standard Prompting had an inconsistency rate (INC) of roughly 40%.
  • Chain-of-Thought (CoT) reduced this slightly to 31%.
  • CCP (the paper's method) slashed the inconsistency rate to 19.2%.

This improvement in consistency led to higher accuracy scores. When the model stops contradicting itself, it naturally becomes more accurate.

Generated vs. Retrieved Questions

One might ask: “Do we really need the model to generate a new question? Can’t we just find a similar question in the dataset and use that as a comparison?”

The researchers compared their generative approach against a “Retrieval” baseline (Ret.Q).

Figure 2: Comparison between counterfactual example collection methods on MCTACO with Llama-3-8B.

Figure 2 shows the Inconsistency (INC) scores for Retrieved Questions (Blue) vs. CCP (Orange). Remember, lower is better.

Across almost every category—Duration, Frequency, Ordering—CCP achieved lower inconsistency. This suggests that dynamically generating a specific counterfactual for the exact context at hand is far more effective than trying to retrieve a loosely related example from a database. The model needs a custom-tailored constraint to reason effectively.

The Importance of Aggregation

Another hypothesis: “Maybe the aggregation math isn’t necessary. Maybe we can just ask the counterfactual question, flip the answer, and use that?”

The researchers tested this “Direct Answer” (Dir.A) approach against the full CCP method.

Figure 3: Comparison between different counterfactual leveraging methods with the Llama-3-8B model.

Figure 3 illustrates the results on the TempEvalQA and TRACIE datasets. The Green bars (Direct Answer) show significantly higher inconsistency than the Blue bars (CCP).

This validates the importance of the aggregation step (Equation 2). The model sometimes hallucinates on the counterfactual question, too. By weighing the original and the counterfactual answers together, the model performs a “sanity check,” leading to much more robust performance.

Less is More

In the era of “Big Data,” we often assume that more context is always better. One might think that generating 5 or 10 counterfactual questions would provide even more constraints and better accuracy.

Surprisingly, the research suggests the opposite.

Figure 6: Inconsistency changes with the different number of counterfactual questions. The Llama-3-8B model is used.

Figure 6 plots the Inconsistency rate against the number of counterfactual questions. As you can see, the inconsistency rate rises as more questions are added (from 1 to 7).

The best performance comes from generating just one high-quality counterfactual. Why? The authors suggest that piling on multiple counterfactuals adds noise. It introduces too much conflicting information into the context window, overwhelming the model’s reasoning capabilities. This mirrors findings in other areas of NLP where “contrastive reasoning” degrades if too many contrasts are introduced at once.

Conclusion and Implications

The paper “Counterfactual-Consistency Prompting for Relative Temporal Understanding” offers a sophisticated yet elegant solution to a stubborn problem in AI. It highlights that LLMs, for all their power, often lack basic logical consistency when dealing with the abstract flow of time.

Key takeaways for students and practitioners:

  1. Consistency \(\neq\) Accuracy, but they are linked. You cannot have a reliable model that believes \(A < B\) and \(B < A\). Fixing consistency is often a prerequisite for fixing accuracy.
  2. Self-Correction via Counterfactuals. Asking “What if this were false?” is a powerful reasoning heuristic. It forces the model to verify its own logic.
  3. Prompting is Programming. This method didn’t require retraining the massive Llama-3 or GPT-4 models. It simply required a smarter algorithmic approach to how we ask questions.

While the method has limitations—it still struggles with arithmetic reasoning involving absolute dates (e.g., calculating years)—it represents a significant step forward in making LLMs grounded, logical reasoners rather than just stochastic parrots. As we move toward agents that need to plan and schedule tasks in the real world, this kind of temporal consistency will be essential.