Introduction

“She got a high-paying job because she graduated from a top university.”

When we read a sentence like this, our brains instantly form a causal link. We assume the degree caused the job offer. But did it? Perhaps she was simply a brilliant coder, and she would have landed that job regardless of her alma mater. To determine whether the degree was the true cause, we would ideally need to see a parallel universe: one where she didn’t go to that university but had the exact same skills and background, and then observe whether she still got the job.

In the field of Natural Language Processing (NLP), this task is called Event Causality Identification (ECI). It is the process of extracting causal relations between events in text. Traditionally, AI models have solved this by looking for linguistic patterns—keywords like “because,” “due to,” or “therefore.”

However, language is messy. We use causal words informally all the time. Relying on patterns often leads to “specious” (false) causality. Even advanced Large Language Models (LLMs) like GPT-4 often act as “causal parrots,” repeating correlations they have seen in training data rather than understanding the underlying mechanics of cause and effect.

In a fascinating new paper, Event Causality Identification with Synthetic Control, researchers from UPenn and the Allen Institute for AI propose a shift in how we approach this problem. Instead of looking for keywords, they adopt a rigorous framework from economics called the Rubin Causal Model. They attempt the seemingly impossible: constructing that “parallel universe” within the text domain to estimate causality mathematically.

The Background: Counterfactuals and Economics

To understand the paper’s contribution, we first need to understand the difficulty of causal inference.

The Rubin Causal Model (RCM)

The gold standard for causality is the Rubin Causal Model. It posits that to know if Event A (\(e_1\)) caused Event B (\(e_2\)), we must compare two outcomes:

  1. The Observed Outcome: What actually happened after \(e_1\).
  2. The Counterfactual Outcome: What would have happened if \(e_1\) had not occurred (denoted as \(\neg e_1\)).

The causal effect is the difference between the probabilities of these two outcomes.

Figure 1: An example illustrating the temporal ordering of treatment event e_1, observed outcome e_2, and pretreatment events on a time axis.

As shown in Figure 1, imagine a timeline involving a protagonist, Alex.

  • Pretreatment (\(e_{-1}, e_0\)): Alex goes to the gym and feels hungry.
  • Treatment (\(e_1\)): Alex walks into a restaurant.
  • Outcome (\(e_2\)): Alex orders food.

To prove that walking into the restaurant caused him to order food, we need to intervene. We imagine a scenario (\(\neg e_1\)) where Alex doesn’t walk into the restaurant (perhaps he opens a delivery app instead). If he still orders food in this alternative timeline, then walking into the restaurant wasn’t the specific cause of his ordering food.

The paper formalizes this “treatment effect” (\(\Delta\)) as the difference in probabilities:

\[
\Delta = P(e_1 \prec e_2) - P(\neg e_1 \prec e_2)
\]

Here, the symbol \(\prec\) indicates temporal order (occurrence before). The challenge, obviously, is that we cannot observe both timelines simultaneously. Alex either went to the restaurant or he didn’t. In the physical world, we solve this with Randomized Controlled Trials (RCTs)—splitting people into two groups. But in a static text narrative, we cannot split the protagonist in two.
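To make the arithmetic concrete, here is a worked example with made-up numbers (ours, not the paper’s). Suppose Alex orders food in 90% of timelines where he enters the restaurant, but also in 70% of timelines where he doesn’t:

\[
\Delta = P(e_1 \prec e_2) - P(\neg e_1 \prec e_2) = 0.9 - 0.7 = 0.2.
\]

A clearly positive \(\Delta\) indicates that the treatment genuinely raises the probability of the outcome; a \(\Delta\) near zero indicates the outcome would likely have happened anyway.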

The Inspiration: Basque Country Economics

Since we cannot clone Alex, the researchers looked to a method used in economics called Synthetic Control.

In the early 2000s, economists wanted to study the economic impact of terrorism in the Basque Country (a region in Spain). They couldn’t “pause” terrorism to see what would happen to the GDP. Nor could they find another region that was exactly like the Basque Country but peaceful.

Instead, they created a “Synthetic Basque Country.” They took bits and pieces from other regions (Catalonia, Madrid, etc.) and weighted them mathematically to create a composite region that looked exactly like the Basque Country before the terrorism started.

Figure 2: A line graph showing per capita GDP for the Basque Country versus a synthetic control region.

As Figure 2 illustrates, the “Synthetic” line tracks the “Actual” line closely until the terrorism begins (the divergence point). The gap that opens up afterwards represents the estimated causal cost of the conflict.

The authors of this paper asked a bold question: Can we do this for text? Can we build a “Synthetic Alex” from other stories to see what he would have done if he hadn’t entered that restaurant?

The Core Method: Synthesizing Stories

The proposed method is a pipeline designed to construct this “synthetic twin” from a large corpus of text. The process involves four main stages: Retrieval, Synthesis, Inversion, and Estimation.

Figure 3: An illustration of the system architecture showing the study unit, retrieved control group, and the merging process.

Figure 3 provides a high-level overview. We start with our specific story (the Study Unit). We then search a massive database for “non-contemporary control groups”—other stories that are similar but not identical. Finally, we merge them into a Synthetic Control Unit.

Let’s break down the technical steps.

1. Retrieval (Finding the Raw Materials)

We need to find stories that resemble our protagonist’s situation before the treatment event. The researchers use a large dataset of narratives (specifically TinyStories for their experiments).

To ensure the matching is based on the situation and not specific names, they first anonymize the text (e.g., changing “Timmy” to “a boy”). They then use BM25 (a standard retrieval algorithm) and embedding similarity to find documents whose pre-treatment events resemble those of our study unit (a short retrieval sketch follows the checklist below).

Crucially, they filter these retrieved stories to ensure:

  • The pre-treatment context matches.
  • The “treatment” event (e.g., entering the restaurant) does not happen in the retrieved story.
  • There is some other intervention instead.
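To make the retrieval step concrete, here is a minimal sketch of BM25 scoring over anonymized pre-treatment contexts using the rank_bm25 Python package. The toy corpus and query are our own illustrations, not the paper’s data; the full pipeline would additionally re-rank with embedding similarity and apply the filters above.

```python
# pip install rank-bm25
from rank_bm25 import BM25Okapi

# Toy corpus of anonymized pre-treatment contexts (illustrative only).
corpus = [
    "a boy goes to the gym and feels hungry",
    "a girl finishes soccer practice and feels hungry",
    "a man walks home from work in the rain",
]
bm25 = BM25Okapi([doc.split() for doc in corpus])

# The study unit's pre-treatment context, anonymized ("Alex" -> "a person").
query = "a person goes to the gym and feels hungry".split()

# Score every candidate control story against the query and rank them.
scores = bm25.get_scores(query)
for score, doc in sorted(zip(scores, corpus), reverse=True):
    print(f"{score:.3f}  {doc}")
```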

2. Synthesis (Creating the Twin)

This is where the math gets interesting. We rarely find a single story that perfectly matches Alex’s situation. However, a weighted combination of 5 or 10 different stories can come very close.

The researchers use text embeddings (specifically text-embedding-ada-002) to turn the stories into vector representations. They then perform a Ridge Regression optimization.

They try to find a set of weights (\(w\)) such that the weighted sum of the control stories (\(u_j\)) is as close as possible to the study unit (\(u_{study}\)):

\[
w^{*} = \arg\min_{w} \left\lVert u_{study} - \sum_{j} w_j\, u_j \right\rVert_2^2 + \lambda \lVert w \rVert_2^2
\]

  • \(u_{study}\): The embedding of our protagonist’s history.
  • \(u_j\): The embeddings of the retrieved stories.
  • \(\lambda\): A regularization term to prevent overfitting (ensuring we don’t just copy one story).

Once these weights (\(w\)) are calculated, they tell us how much “influence” each retrieved story should have. We then take the outcomes (\(o_j\)) of those retrieved stories and combine them using the same weights to create a Synthetic Outcome embedding:

\[
o_{synthetic} = \sum_{j} w_j^{*}\, o_j
\]
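Here is a minimal numpy/scikit-learn sketch of this optimization. The embeddings are random placeholders standing in for text-embedding-ada-002 vectors; only the dimensionality (1536) matches the real model.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
dim, n_controls = 1536, 8  # ada-002 embeddings are 1536-dimensional

# Placeholder embeddings: the study unit's pre-treatment context (u_study),
# the retrieved controls' contexts (U), and the controls' outcomes (O).
u_study = rng.normal(size=dim)
U = rng.normal(size=(n_controls, dim))  # one row per control story
O = rng.normal(size=(n_controls, dim))  # outcome embedding per control story

# Solve  min_w ||u_study - sum_j w_j u_j||^2 + lambda ||w||^2  via ridge.
ridge = Ridge(alpha=1.0, fit_intercept=False)
ridge.fit(U.T, u_study)  # columns of U.T are the control embeddings
w = ridge.coef_          # one weight per control story

# Combine the control outcomes with the same weights.
o_synthetic = w @ O      # = sum_j w_j * o_j
print(w.round(3), o_synthetic.shape)
```

One design note: the ridge penalty \(\lambda\) (alpha above) spreads weight across several controls instead of letting a single near-duplicate story dominate, which is exactly the overfitting concern mentioned earlier.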

3. Inversion (Reading the Result)

At this point, \(o_{synthetic}\) is just a vector of numbers. It represents the “concept” of what happens in the counterfactual timeline. To make sense of it, we need to turn it back into English.

This process is called Model Inversion. The authors utilize a state-of-the-art technique called Vec2Text (Morris et al., 2023).

\[
\hat{t} = \mathrm{Vec2Text}(o_{synthetic}) \approx \arg\min_{t} \left\lVert \phi(t) - o_{synthetic} \right\rVert_2,
\]

where \(\phi(\cdot)\) is the embedding model used throughout the pipeline.

This step generates a textual description of the synthetic outcome. For example, if the synthetic embedding suggests “satisfying hunger via delivery,” the Vec2Text model might output: “The boy opened an app and ordered pizza.”
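As a sketch of what this looks like in practice, the snippet below uses the open-source vec2text package, following the usage shown in its repository; exact function names and options may vary across versions, and the random tensor is only a placeholder for \(o_{synthetic}\).

```python
# pip install vec2text
import torch
import vec2text

# Load a corrector model trained to invert text-embedding-ada-002 vectors.
corrector = vec2text.load_pretrained_corrector("text-embedding-ada-002")

# Placeholder for the synthetic outcome embedding (ada-002 is 1536-d).
o_synthetic = torch.randn(1, 1536)

# Iteratively refine a text hypothesis whose embedding approaches the target.
texts = vec2text.invert_embeddings(
    embeddings=o_synthetic,
    corrector=corrector,
    num_steps=20,
)
print(texts[0])
```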

4. Estimation

Finally, the system compares the actual outcome (\(e_2\)) with the synthetic outcome.

  • Scenario A: In the real story, Alex ordered food. In the synthetic story (where he didn’t enter the restaurant), he also ordered food (via an app). Conclusion: entering the restaurant was not the cause; he was just hungry.
  • Scenario B: In the real story, Alex ordered food. In the synthetic story, he kept walking and went to a park. Conclusion: entering the restaurant was the cause.

The similarity between the real and synthetic outcomes is evaluated using GPT-3.5-turbo to make the final causal determination.
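One plausible way to implement this comparison is a short judgment prompt to gpt-3.5-turbo via the OpenAI Python client, as sketched below. The prompt wording and decision rule are our own illustration, not the paper’s actual prompt.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

actual = "Alex ordered food at the restaurant."
synthetic = "The boy opened an app and ordered pizza."

prompt = (
    f"Real outcome: {actual}\n"
    f"Counterfactual outcome: {synthetic}\n"
    "Do these two outcomes describe essentially the same event? "
    "Answer yes or no."
)

resp = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
)
answer = resp.choices[0].message.content.strip().lower()

# Same outcome with and without the treatment => the treatment was not
# the cause (Scenario A); diverging outcomes => it was (Scenario B).
print("not causal" if answer.startswith("yes") else "causal")
```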

Experiments and Results

The researchers tested their “Synthetic Control” method against several baselines, including:

  • GPT-4-turbo (Zero-shot): Simply asking GPT-4 “Did event A cause event B?”
  • Counterfactual Prompting: Asking GPT-4 to imagine the counterfactual without the rigorous retrieval process.
  • ROCK and COLA: Previous state-of-the-art models grounded in the Rubin Causal Model but using different matching techniques.

They used the COPES dataset, a benchmark of causal stories. Specifically, they focused on COPES-hard, a subset of the data where standard LLMs struggle significantly, producing a high rate of false positives.

The Results

The results were compelling. As shown in Table 1 below, the Synthetic Control method outperformed the others, particularly in Precision.

Table 1: Comparison of model performances on the COPES-hard dataset.

Key Takeaways from the Data:

  1. Highest Precision (0.2663): The Synthetic Control method is much better at avoiding false positives. This is the most critical metric for causality: we want to avoid saying A caused B when it was actually just a coincidence. The method achieved a 29.8% improvement in precision over GPT-4-turbo.
  2. Balanced F1 Score: While GPT-4 has high recall (it guesses “yes” a lot, catching most true causes but also flagging many false ones), the Synthetic Control method achieves the highest F1 score (0.3930), representing the best balance between finding true causes and rejecting false ones.
  3. Efficiency: It is worth noting that this method utilizes GPT-3.5 and embedding models, yet it outperforms the much larger and more expensive GPT-4-turbo on this reasoning task.

Conclusion and Implications

This paper represents a significant step forward in making AI “reason” rather than just “predict.” By borrowing the Synthetic Control method from economics, the authors demonstrated that rigorous causal inference over text is possible without relying on the largest and most expensive generative models.

The implications are broad:

  • Beyond “Stochastic Parrots”: This method provides a structured, mathematical framework for causality that forces the model to check its work against a constructed counterfactual.
  • Bias Mitigation: By relying on retrieved historical data rather than the internal biases of an LLM, this approach helps mitigate the “hallucination” of causal relationships based on common stereotypes or linguistic patterns.
  • Interdisciplinary AI: This is a prime example of how concepts from social sciences (econometrics) can solve hard problems in computer science.

As we move toward AI systems that make decisions in the real world—from legal analysis to medical diagnosis—distinguishing between “what usually happens” (correlation) and “what made this happen” (causation) is vital. The Synthetic Control method offers a promising path toward that future.