Introduction

One of the most fundamental questions in science, policy-making, and everyday logic is: “Did X cause Y?”

In a perfect world, we would run controlled experiments to answer this. We would force people to smoke to see if they get lung cancer, or control interest rates in a vacuum to see how inflation reacts. But in the real world, we rarely have that luxury. We are often stuck with observational data—snapshots of what happened without any control over the variables.

Determining causality from observation is already difficult, but it becomes exponentially harder when we introduce latent variables. These are the unobserved “ghosts” in the machine—factors that influence our observed variables but aren’t in our dataset. For example, if you are analyzing the link between “Ice Cream Sales” and “Drowning Incidents,” a latent variable is “Summer Heat.” If you don’t account for it, you might incorrectly conclude that ice cream causes drowning.

Traditionally, researchers use Directed Acyclic Graphs (DAGs) to map these relationships. But standard DAGs often break down when latent variables are involved. To fix this, we use more complex structures like Maximal Ancestral Graphs (MAGs). The catch? Most algorithms try to learn the entire global structure of the graph just to answer a simple question about two variables. Imagine trying to map every street in New York City just to find out if your street connects to the one next door. It is inefficient and computationally expensive.

In this post, we will dive into a research paper titled “Local Identifying Causal Relations in the Presence of Latent Variables”. The researchers propose a novel method called LocICR (Local Identifying Causal Relations). This algorithm allows us to zoom in on a specific pair of variables and determine their causal relationship locally, ignoring the rest of the massive network, even when hidden confounders are lurking.

The Problem with Hidden Variables

To understand the solution, we first need to understand the mess created by latent variables.

In a standard causal model, we assume Causal Sufficiency, meaning there are no hidden common causes for any pair of variables we observe. When this holds, we use DAGs. However, in fields like medicine or sociology, this assumption rarely holds.

Let’s look at a medical example involving oxygen distribution in a child’s body.

Figure 1. (a) The underlying graph where Cardiac Mixing is a latent variable. (b) A DAG over observed variables. (c) A MAG characterizing the causal relations.

In Figure 1(a), we see the “Underlying Graph.” Notice the variable Cardiac Mixing in the dashed box. This is a latent variable—it exists, it affects the system, but we cannot measure it directly.

If we ignore this latent variable and try to fit a standard DAG to the observed data, we get Figure 1(b). Look at the red arrow from Hypoxia in \(O_2\) to Hypoxia distribution. The DAG claims there is a direct causal link. However, looking back at the underlying graph (a), there is no such edge! The relationship is actually confounded by the hidden Cardiac Mixing. The DAG has given us a “false positive” for causality.

This is where Maximal Ancestral Graphs (MAGs) come in, shown in Figure 1(c). A MAG is designed to handle latent variables. In addition to directed edges (\(\rightarrow\)), MAGs use bi-directed edges (\(\leftrightarrow\)) to represent confounding caused by latent variables. In Figure 1(c), the MAG correctly identifies that Hypoxia in \(O_2\) and Hypoxia distribution are merely associated (via a hidden common cause), not directly causal.
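
If you like to think in code, here is one minimal way to write down these edge types. The encoding below is purely illustrative (my own naming, not the paper’s notation or any particular library’s API):

```python
from enum import Enum

class Mark(Enum):
    """Edge endpoint marks used in MAGs and PAGs (illustrative encoding)."""
    TAIL = "-"     # a "tail" end: no arrowhead at this endpoint
    ARROW = ">"    # an arrowhead at this endpoint
    CIRCLE = "o"   # an undetermined mark (appears in PAGs, not in MAGs)

# Each edge stores the pair of marks at its two endpoints (u, v):
#   u -> v  : (TAIL, ARROW)    u is a cause of v
#   u <-> v : (ARROW, ARROW)   u and v share a latent confounder
#   u o-> v : (CIRCLE, ARROW)  PAG edge whose u-side mark is still undetermined
mag_edges = {
    ("Hypoxia_in_O2", "Hypoxia_distribution"): (Mark.ARROW, Mark.ARROW),  # associated via a latent cause
}
```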

The Goal: From Global to Local

The current state-of-the-art methods usually follow a “Global Structure Learning” path:

  1. Take the data.
  2. Learn the entire graph (usually represented as a Partial Ancestral Graph or PAG, which represents a class of possible MAGs).
  3. Look at the specific edge between X and Y to see if it’s causal.

If you have thousands of genes or social variables, learning the whole graph is computationally prohibitive (structure learning is NP-hard in general), and most of that work is wasted if you only care about one pair. The authors of this paper ask: Can we skip the global map and just look at the neighborhood of X and Y to find the answer?

Classifying Causal Relationships

Before we get to the “how,” we must define the “what.” When we look at a pair of variables \((X, Y)\) in a graph with latent variables, we can’t always be 100% sure of the direction. Instead, we classify the relationship into three buckets based on the Markov Equivalence Class (MEC): the set of all MAGs that imply exactly the same conditional independencies and are therefore indistinguishable from the observational data alone.

Figure 6. Overview of Causal Relations in the Markov equivalence class (MEC) of a true MAG.

As shown in Figure 6, the relationships are:

  1. Invariant Non-Ancestor: In every valid graph that explains the data, there is no directed path from X to Y. (X definitely does not cause Y).
  2. Invariant Ancestor: In every valid graph, there is a directed path from X to Y. (X definitely causes Y).
  • Explicit: There is a specific directed path common to all graphs.
  • Implicit: Every graph has a directed path, but it may take a different route.
  3. Possible Ancestor: In some valid graphs X causes Y, but in others, it doesn’t. (We can’t be sure).

The goal of the LocICR algorithm is to correctly place a pair \((X, Y)\) into one of these buckets using only local information.
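
As a quick mental model, these buckets can be written down as a small enum. The names are mine, chosen to mirror the paper’s terminology, and are reused in later sketches:

```python
from enum import Enum, auto

class CausalRelation(Enum):
    """Possible verdicts for a pair (X, Y) within the Markov equivalence class."""
    INVARIANT_NON_ANCESTOR = auto()       # no MAG in the MEC has a directed path from X to Y
    EXPLICIT_INVARIANT_ANCESTOR = auto()  # one specific directed path is shared by every MAG
    IMPLICIT_INVARIANT_ANCESTOR = auto()  # every MAG has some directed path, but the route can differ
    POSSIBLE_ANCESTOR = auto()            # some MAGs have a directed path, others do not
```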

The Core Method: LocICR

The genius of this paper lies in defining “Local Characterizations.” The authors prove mathematically that you don’t need to check the whole world to know if X causes Y; you only need to check a specific set of neighbors.

1. The Local Markov Property and “Augmented” Sets

To do this locally, the authors utilize the Markov Blanket. In a general sense, the Markov Blanket of X (\(MB(X)\)) is the smallest set of variables that contains all the information you need to predict X. If you know \(MB(X)\), learning anything else in the network tells you nothing new about X.

Figure 2. The illustrative example for MB, where X is the target of interest and the blue vertices belong to MB(X).

However, simply knowing the neighbors isn’t enough because of the latent variables. The authors introduce a crucial concept: the Augmented Parent Set, denoted as \(Pa^*(X)\).

The Augmented Parent Set (\(Pa^*\))

Usually, “parents” are direct causes. But in a PAG (Partial Ancestral Graph), we deal with “Augmented Parents.” These include direct parents and nodes connected via specific “arrow-collider paths” (paths that look like \(X \leftarrow \dots \leftarrow V\)).
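
To make this a bit more concrete, here is a deliberately simplified sketch of reading direct parents off a local graph stored with the endpoint marks introduced earlier. Note the simplification: the paper’s full \(Pa^*\) additionally includes vertices reached through the arrow-collider paths mentioned above, which this sketch omits:

```python
def direct_parents(edges, x):
    """Collect every V with V -> x from an {(u, v): (mark_u, mark_v)} edge map.

    Simplified stand-in: the paper's augmented parent set Pa*(x) also adds
    vertices reached via arrow-collider paths, which are not handled here.
    """
    parents = set()
    for (u, v), (mark_u, mark_v) in edges.items():
        if v == x and mark_u is Mark.TAIL and mark_v is Mark.ARROW:
            parents.add(u)   # edge stored as u -> x
        if u == x and mark_u is Mark.ARROW and mark_v is Mark.TAIL:
            parents.add(v)   # edge stored as x <- v
    return parents
```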

Why does this matter? The authors derived a local rule (Theorem 1) that acts as a litmus test for non-causality:

Theorem 1 (The Non-Causal Test): X is an Invariant Non-Ancestor of Y if and only if X and Y are independent given X’s Augmented Parent Set.

Mathematically:

\[ X \perp\!\!\!\perp Y \mid Pa^{*}(X, \mathcal{P}) \]

If conditioning on this specific local set makes the dependence between X and Y vanish, then X does not cause Y.
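
In practice, “independent given the Augmented Parent Set” is checked with an ordinary conditional independence test on the data. As one concrete example, here is a Fisher-z partial-correlation test, which is reasonable if you are willing to assume roughly linear-Gaussian data; the function name and interface are mine, not the paper’s:

```python
import numpy as np
from scipy import stats

def fisher_z_independent(data, x, y, cond, alpha=0.05):
    """Return True if X _||_ Y | cond according to a Fisher-z partial-correlation test.

    `data` is an (n_samples, n_variables) array; x, y and the entries of `cond`
    are column indices. Assumes roughly linear-Gaussian data.
    """
    cols = [x, y] + list(cond)
    corr = np.corrcoef(data[:, cols], rowvar=False)
    prec = np.linalg.pinv(corr)                          # inverse correlation matrix
    r = -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])   # partial correlation of x, y given cond
    r = np.clip(r, -0.999999, 0.999999)
    z = 0.5 * np.log((1 + r) / (1 - r))                  # Fisher z-transform
    stat = np.sqrt(data.shape[0] - len(cond) - 3) * abs(z)
    p_value = 2 * (1 - stats.norm.cdf(stat))
    return p_value > alpha                               # large p-value: treat as independent

# Theorem 1 in this notation: if fisher_z_independent(data, X, Y, Pa_star_of_X)
# returns True, then X is an invariant non-ancestor of Y.
```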

The Augmented Undirected Neighbor Set (\(Ne^*\))

What if they are not independent? Does that mean X causes Y? Not necessarily—it could be a spurious correlation or reverse causality. To confirm causality, the authors introduce a second set: the Augmented Undirected Neighbor Set (\(Ne^*\)).

This set accounts for the uncertainty in the graph (edges that we can’t fully orient yet, denoted by circles \(\circ\)).

Theorem 2 (The Causal Test): X is an Explicit Invariant Ancestor (definite cause) of Y if they remain dependent even after conditioning on both the Augmented Parents and the Augmented Neighbors.
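
Putting the two theorems side by side, the decision logic looks roughly like the sketch below (reusing the CausalRelation enum from earlier). This is a paraphrase of the idea rather than the paper’s actual procedure; the case it leaves open (implicit invariant ancestor versus possible ancestor) requires the additional rules developed in the paper:

```python
def classify_pair(ci_test, x, y, pa_star, ne_star):
    """Rough decision logic of Theorems 1 and 2 (a paraphrase, not the paper's algorithm)."""
    if ci_test(x, y, pa_star):
        # Theorem 1: independence given Pa*(X) rules out X causing Y.
        return CausalRelation.INVARIANT_NON_ANCESTOR
    if not ci_test(x, y, pa_star | ne_star):
        # Theorem 2: dependence survives conditioning on Pa*(X) and Ne*(X).
        return CausalRelation.EXPLICIT_INVARIANT_ANCESTOR
    # Remaining cases (implicit invariant ancestor vs. possible ancestor)
    # need the finer-grained analysis from the paper.
    return None
```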

2. The Algorithm in Action

The LocICR algorithm combines these theorems into a process. It doesn’t look at the whole graph. It builds a “Local PAG” around X and Y.

Let’s walk through an example execution provided by the authors. Suppose we want to find the relationship between node J and node F.

Step 1: Learn the Local Structure

First, the algorithm identifies the Markov Blanket of the target variable J. It looks at the data and finds that the variables \(A, G, F, B, C, D, H\) are relevant. It then performs independence tests only within this group to draw a local map.

Figure 8. The sequential process of step one in Algorithm 1… learning local results.

In Figure 8, you can see the algorithm evolving:

  • (a) The Ground Truth (what we hope to find).
  • (b) It learns the structure around J.
  • (d) It expands slightly to learn the structure around G (a neighbor).
  • (f) It learns around F.
  • (h) It stitches these local views together into a consistent Local PAG.

Notice it never touched nodes far away from this cluster. It focused entirely on the relevant neighborhood.
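
The paper’s step one is a full local structure-learning routine that produces a local PAG, which is more than a blog post can reimplement. Still, the core idea of discovering a Markov blanket with conditional independence tests can be sketched in a few lines; below is a simplified grow-shrink-style stand-in that reuses the Fisher-z test from earlier:

```python
def markov_blanket(data, target, candidates, ci_test):
    """Grow-shrink style Markov blanket discovery (a simplified stand-in for LocICR's step one)."""
    mb = set()
    # Grow: add any candidate that is dependent on the target given the current blanket.
    changed = True
    while changed:
        changed = False
        for v in candidates:
            if v != target and v not in mb and not ci_test(data, target, v, list(mb)):
                mb.add(v)
                changed = True
    # Shrink: drop false positives that become independent given the rest of the blanket.
    for v in list(mb):
        if ci_test(data, target, v, list(mb - {v})):
            mb.remove(v)
    return mb

# Example call (the column index of J is hypothetical):
# mb_J = markov_blanket(data, target=J, candidates=range(data.shape[1]),
#                       ci_test=fisher_z_independent)
```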

Step 2 & 3: Identify Sets and Test

Once the local graph is built, the algorithm identifies the sets we discussed: \(Pa^*\) (Augmented Parents) and \(Ne^*\) (Neighbors).

Figure 9. The process of steps two and three in Algorithm 1 in the example.

In Figure 9:

  • (a) Shows the final local structure.
  • (c) It identifies \(Pa^*(J) = \{A, G\}\). These are the variables that block non-causal paths.

Finally, the algorithm runs the statistical test: Is J independent of F given \(\{A, G\}\)?

In this example, the answer is No. The dependence remains. Based on the logic of Theorem 2 (and because the neighbor set is empty here), the algorithm concludes: J is an Explicit Invariant Ancestor of F.
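
Tying the earlier sketches together, the final check for this example would look something like the following; the variable-to-column mapping is hypothetical, and the outcome comment simply restates the paper’s result:

```python
# Hypothetical column indices for the variables in the example.
A, G, F, J = 0, 6, 5, 9

ci = lambda x, y, cond: fisher_z_independent(data, x, y, list(cond))

# Pa*(J) = {A, G} and Ne*(J) is empty in the learned local PAG.
relation = classify_pair(ci, J, F, pa_star={A, G}, ne_star=set())
# In the paper's example the dependence survives conditioning on {A, G},
# so this would come back as CausalRelation.EXPLICIT_INVARIANT_ANCESTOR.
```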

Experiments and Results

Does this actually work better than the “Global” approach? The authors tested LocICR against seven state-of-the-art algorithms, including:

  • PC-ITC / RFCI-Zhang: Methods that learn the global structure first.
  • RFCI-LVIDA: A method that estimates causal effects globally.

They used four benchmark networks (MILDEW, ALARM, WIN95PTS, ANDES) and generated synthetic data with random latent variables.

Accuracy (F1 Score)

The primary metric is the Weighted F1 Score, which balances Precision (how many predicted causes were real) and Recall (how many real causes were found).
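
For reference, the weighted F1 score is the standard multi-class version of the metric: compute an F1 score per relation class and average them, weighting each class by how many pairs fall into it. A toy computation with scikit-learn (the labels below are made up, not taken from the paper):

```python
from sklearn.metrics import f1_score

# Made-up ground-truth vs. predicted relation labels for five (X, Y) pairs.
y_true = ["non_ancestor", "ancestor", "possible", "ancestor", "non_ancestor"]
y_pred = ["non_ancestor", "ancestor", "possible", "ancestor", "ancestor"]

# "weighted" averages the per-class F1 scores, weighting each class by its support.
print(f1_score(y_true, y_pred, average="weighted"))
```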

Figure 4. Performance of eight algorithms on four benchmark networks

Look at the left-hand charts in Figure 4 (The “WF1 Score” columns).

  • The Red Square line (LocICR) is consistently at the top.
  • This indicates that LocICR is not only matching but often outperforming global methods in correctly classifying relationships.
  • Global methods often struggle because errors in a distant part of the graph can propagate and ruin the structure near X and Y. Local methods are insulated from these distant errors.

Efficiency (Number of Tests)

The real killer feature is efficiency. Look at the right-hand charts in Figure 4 (The “nTest” columns).

  • The Y-axis represents the number of conditional independence tests performed.
  • The Red Square line (LocICR) is consistently near the bottom.
  • Global methods (like the Cyan and Purple lines) require drastically more statistical tests.

Interpretation: LocICR gets the right answer more often, and it does so with significantly less computational work.

Real-World Application

The authors didn’t stop at simulations. They applied LocICR to a real-world dataset: the General Social Survey. They looked at variables like Father’s Occupation, Son’s Education, and Son’s Income.

Using domain knowledge, we generally accept that a father’s occupation influences a son’s education, which influences the son’s income.

  • LocICR Result: It identified Father’s Occupation as an invariant ancestor of Son’s Education.
  • LocICR Result: It identified Father’s Occupation as an invariant ancestor of Son’s Income.

It also correctly identified non-causal pairs. For example, it found that Son’s Income is an invariant non-ancestor of Number of Siblings (which makes sense—your income today cannot retroactively change how many siblings you were born with).

Conclusion and Implications

Causal discovery is often viewed as an “all-or-nothing” task: either you map the whole system, or you know nothing. This paper challenges that view.

By developing a mathematically sound method to identify local causal relationships in the presence of latent variables, the authors have provided a tool that is scalable to massive datasets. If a biologist wants to know if Gene A causes Gene B, they no longer need to map the entire human genome’s interaction network to find out. They can just look at the neighborhood.

Key Takeaways:

  1. Latent Variables matter: Ignoring them leads to false conclusions, as fitting a standard DAG did in the hypoxia example.
  2. Global is slow: Learning the whole graph is unnecessary for specific questions.
  3. Local is sufficient: By using Augmented Parent Sets and local PAGs, we can determine for a specific pair whether X causes Y, or rule it out, with theoretical guarantees.
  4. Efficiency: LocICR provides state-of-the-art accuracy with a fraction of the computational cost.

This approach opens the door for more robust causal inference in complex, data-rich fields like epidemiology, genetics, and economics, where hidden variables are always part of the equation.