Causality
\( \newcommand{\states}{\mathcal{S}} \newcommand{\actions}{\mathcal{A}} \newcommand{\observations}{\mathcal{O}} \newcommand{\rewards}{\mathcal{R}} \newcommand{\traces}{\mathbf{e}} \newcommand{\transition}{P} \newcommand{\reals}{\mathbb{R}} \newcommand{\naturals}{\mathbb{N}} \newcommand{\complexs}{\mathbb{C}} \newcommand{\field}{\mathbb{F}} \newcommand{\numfield}{\mathbb{F}} \newcommand{\expected}{\mathbb{E}} \newcommand{\var}{\mathbb{V}} \newcommand{\by}{\times} \newcommand{\partialderiv}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\defineq}{\stackrel{{\tiny\mbox{def}}}{=}} \newcommand{\defeq}{\stackrel{{\tiny\mbox{def}}}{=}} \newcommand{\eye}{\Imat} \newcommand{\hadamard}{\odot} \newcommand{\trans}{\top} \newcommand{\inv}{{-1}} \newcommand{\argmax}{\operatorname{argmax}} \newcommand{\Prob}{\mathbb{P}} \newcommand{\avec}{\mathbf{a}} \newcommand{\bvec}{\mathbf{b}} \newcommand{\cvec}{\mathbf{c}} \newcommand{\dvec}{\mathbf{d}} \newcommand{\evec}{\mathbf{e}} \newcommand{\fvec}{\mathbf{f}} \newcommand{\gvec}{\mathbf{g}} \newcommand{\hvec}{\mathbf{h}} \newcommand{\ivec}{\mathbf{i}} \newcommand{\jvec}{\mathbf{j}} \newcommand{\kvec}{\mathbf{k}} \newcommand{\lvec}{\mathbf{l}} \newcommand{\mvec}{\mathbf{m}} \newcommand{\nvec}{\mathbf{n}} \newcommand{\ovec}{\mathbf{o}} \newcommand{\pvec}{\mathbf{p}} \newcommand{\qvec}{\mathbf{q}} \newcommand{\rvec}{\mathbf{r}} \newcommand{\svec}{\mathbf{s}} \newcommand{\tvec}{\mathbf{t}} \newcommand{\uvec}{\mathbf{u}} \newcommand{\vvec}{\mathbf{v}} \newcommand{\wvec}{\mathbf{w}} \newcommand{\xvec}{\mathbf{x}} \newcommand{\yvec}{\mathbf{y}} \newcommand{\zvec}{\mathbf{z}} \newcommand{\Amat}{\mathbf{A}} \newcommand{\Bmat}{\mathbf{B}} \newcommand{\Cmat}{\mathbf{C}} \newcommand{\Dmat}{\mathbf{D}} \newcommand{\Emat}{\mathbf{E}} \newcommand{\Fmat}{\mathbf{F}} \newcommand{\Gmat}{\mathbf{G}} \newcommand{\Hmat}{\mathbf{H}} \newcommand{\Imat}{\mathbf{I}} \newcommand{\Jmat}{\mathbf{J}} \newcommand{\Kmat}{\mathbf{K}} \newcommand{\Lmat}{\mathbf{L}} \newcommand{\Mmat}{\mathbf{M}} \newcommand{\Nmat}{\mathbf{N}} \newcommand{\Omat}{\mathbf{O}} \newcommand{\Pmat}{\mathbf{P}} \newcommand{\Qmat}{\mathbf{Q}} \newcommand{\Rmat}{\mathbf{R}} \newcommand{\Smat}{\mathbf{S}} \newcommand{\Tmat}{\mathbf{T}} \newcommand{\Umat}{\mathbf{U}} \newcommand{\Vmat}{\mathbf{V}} \newcommand{\Wmat}{\mathbf{W}} \newcommand{\Xmat}{\mathbf{X}} \newcommand{\Ymat}{\mathbf{Y}} \newcommand{\Zmat}{\mathbf{Z}} \newcommand{\Sigmamat}{\boldsymbol{\Sigma}} \newcommand{\identity}{\Imat} \newcommand{\epsilonvec}{\boldsymbol{\epsilon}} \newcommand{\thetavec}{\boldsymbol{\theta}} \newcommand{\phivec}{\boldsymbol{\phi}} \newcommand{\muvec}{\boldsymbol{\mu}} \newcommand{\sigmavec}{\boldsymbol{\sigma}} \newcommand{\jacobian}{\mathbf{J}} \newcommand{\ind}{\perp\!\!\!\!\perp} \newcommand{\bigoh}{\text{O}} \)
- tags
- Statistics, Machine Learning, AI
This note discusses the statistics and AI notions of causality. It also serves as a place to link to various resources and to my notes on those resources.
Common Cause principle
It is well known that statistical properties alone do not determine causal structures.
Reichenbach’s common cause principle:
If two random variables \(X\) and \(Y\) are statistically dependent, then there exists a third variable \(Z\) that causally influences both. (As a special case, \(Z\) may coincide with either \(X\) or \(Y\).) Furthermore, this variable \(Z\) screens \(X\) and \(Y\) from each other in the sense that given \(Z\), they become independent.
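A quick toy illustration of the screening-off claim (my own simulation, assuming \(Z\) is a common cause of \(X\) and \(Y\)):

```python
# Toy simulation: Z causally influences both X and Y.
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=10_000)
x = z + 0.5 * rng.normal(size=10_000)
y = -z + 0.5 * rng.normal(size=10_000)

# X and Y are dependent (strong negative correlation)...
print(np.corrcoef(x, y)[0, 1])

# ...but regressing Z out of both leaves (nearly) uncorrelated residuals:
# given Z, X and Y become independent.
rx = x - z * (x @ z) / (z @ z)
ry = y - z * (y @ z) / (z @ z)
print(np.corrcoef(rx, ry)[0, 1])
```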
While this principle lays out a primitive causal model, estimating such a model from data (especially when the data are not causally sufficient) is extremely difficult, and in many cases the model is not identifiable. While this should give us pause about the motivation for using causality in machine learning, making assumptions about the data-generating process, without hard assumptions on the internal causal model the agent uses, might produce powerful techniques or insights for designing algorithms. This direction might also provide groundwork for understanding when problems in perception are hopeless (Scholkopf et al. 2012). With this in mind we will cautiously move forward.
Terms and special cases
Disentangled model
https://arxiv.org/abs/2102.11107
Independent Causal Mechanisms Principle
Markov Factorizations
Domain adaptation
Classes of Causal models
Causal hierarchy
Some assumptions for tractability
These are defined under a simple functional causal model where \(C\) is a cause variable (with independent noise \(N_C\)), the function \(\phi\) is a deterministic mechanism, and \(E = \phi(C, N_E)\), where \(N_E\) is independent noise.
Causal Sufficiency
Assume two independent noise variables \(N_C\) and \(N_E\) with distributions \(P(N_C)\) and \(P(N_E)\). The function \(\phi\) and the noise \(N_E\) jointly determine the conditional \(P(E \mid C)\) via \(E = \phi(C, N_E)\). This conditional can be thought of as the mechanism transforming cause \(C\) into effect \(E\).
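As a concrete special case (a standard example, not specific to this note): if the mechanism is additive with Gaussian noise, the conditional is a Gaussian centered at the mechanism's output,

\[ E = \phi(C) + N_E, \quad N_E \sim \mathcal{N}(0, \sigma^2) \quad \Rightarrow \quad P(E \mid C = c) = \mathcal{N}\!\left(\phi(c), \sigma^2\right). \]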
Independence of mechanism and input
The mechanism \(\phi\) is independent of the distribution of the cause: the conditional \(P(E \mid C)\) contains no information about \(P(C)\), and vice versa.
Richness of functional causal models
Two-variable functional causal models are so “rich” that, without restricting the function class, the causal direction cannot be inferred from the joint distribution alone.
Additive noise models
The additive noise model assumes for some function \(\phi\)
\[ E = \phi( C) + N_E \]
The importance of this is that, under generic conditions (the notable exception being \(\phi\) linear with \(N_E\) Gaussian), two real-valued random variables \(X\) and \(Y\) admit an ANM fit in at most one direction, which is taken to be the causal direction.
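A minimal sketch of how this asymmetry is used for direction inference (my own illustration in the spirit of Hoyer et al. 2009; the gradient-boosted regressor and the kNN mutual-information score are stand-ins for any regressor and any independence test, e.g. HSIC):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import mutual_info_regression

def residual_dependence(cause, effect):
    """Fit effect = f(cause) + residual; estimate MI between cause and residual."""
    model = GradientBoostingRegressor().fit(cause.reshape(-1, 1), effect)
    resid = effect - model.predict(cause.reshape(-1, 1))
    return mutual_info_regression(cause.reshape(-1, 1), resid, random_state=0)[0]

def anm_direction(x, y):
    """Accept the direction whose residuals look more independent of the input."""
    return "x->y" if residual_dependence(x, y) < residual_dependence(y, x) else "y->x"

# Toy check: nonlinear mechanism with additive noise, x is the true cause.
rng = np.random.default_rng(0)
x = rng.uniform(-2.0, 2.0, 500)
y = np.tanh(2.0 * x) + 0.1 * rng.normal(size=500)
print(anm_direction(x, y))  # expected: "x->y"
```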
The Causal Markov condition
Let \(G\) be a causal graph with vertex set \(\mathbf{V}\) and \(P\) be a probability distribution over the vertices in \(\mathbf{V}\) generated by the causal structure represented by \(G\). \(G\) and \(P\) satisfy the Causal Markov condition if and only if for every \(W \in \mathbf{V}\), \(W\) is independent of \(\mathbf{V} \setminus (\text{\bf Descendants}(W) \cup \text{\bf Parents}(W))\) given \(\text{\bf Parents}(W)\).
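Equivalently (for distributions with densities, this is the standard Markov factorization), \(P\) factorizes over the graph; e.g. for the chain \(X \rightarrow Z \rightarrow Y\):

\[ P(\mathbf{V}) = \prod_{W \in \mathbf{V}} P(W \mid \text{\bf Parents}(W)), \qquad P(x, z, y) = P(x)\, P(z \mid x)\, P(y \mid z), \]

which gives \(Y \perp X \mid Z\): the effect is screened from the indirect cause by the direct one.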
The Causal Minimality Condition
The Faithfulness Condition
Causal Model
As it relates to time series
Granger Causality
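A time series \(X\) is said to Granger-cause \(Y\) if past values of \(X\) improve predictions of \(Y\) beyond what past values of \(Y\) alone provide. A minimal sketch using the statsmodels implementation (the lag order and toy data are illustrative choices of mine):

```python
import numpy as np
from statsmodels.tsa.stattools import grangercausalitytests

# Toy series where past values of x drive y.
rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.8 * x[t - 1] + 0.1 * rng.normal()

# Tests whether the second column (x) Granger-causes the first (y);
# prints F-test / chi-squared p-values per lag.
grangercausalitytests(np.column_stack([y, x]), maxlag=2)
```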
Questions
Inferring the causal model from data.
Say we observe the data from the toy examples in inference’s blog. How do you disentangle this to find the causal mechanisms? Are you able to recover something like the function \(\phi\)? This seems unrealistic, especially if you just have the data…
Counterfactuals aren’t testable
If we are using a causal model to build some way of doing counterfactual reasoning, how can we measure the performance of models? In toy domains it is easy to generate the correct distributions, but my time in RL has taught me that nothing is so simple when moving beyond toy problems (e.g. distribution shifts can have major implications for the gap between the space the agent is in and what it has seen before). Acting in the world begets distributional change, which affects the original problem. Does causal inference reason about these distributional shifts? What are the knock-on effects?
Inferring which causal model is correct from data
Given the toy examples and two causal models, can we infer which model is correct given the observed data? I suspect the answer to this is no, and that we need to do interventions to uncover the actual causal model. But then I’m confused by the motivation: we are motivated to learn the causal structure of a problem w/o being able to do A/B testing… If we can’t uncover the causal structure w/o intervention and A/B testing, then what is the point of causal inference?
Apparently this can be done w/ additive noise models and nonlinear causal relationships by an extremely simple mechanism described by (Hoyer et al. 2009): regress in each candidate direction and check for independence between the input and the resulting irreducible error (as sketched above under Additive noise models).
Is the independence of mechanism and input a reasonable assumption for learning systems?
I’m reminded of Weapons of Math Destruction. Can we assume the underlying mechanisms generating effects are independent of the distributions we use? I guess you can always separate out the different functions for different variables, i.e. instead of \(C \rightarrow_\phi E\) you have \((C_1, C_2, C_3, \ldots) \rightarrow_\phi E\). So in this instance we will have a multi-input causal structure.
Is the restriction to non-linear w/ additive Gaussian noise true, or is it just invertibility of the causal mechanism?
Can we infer between various causal graphs with unseen latent causes?
How does the causal inference in (Hoyer et al. 2009) depend on model performance?
Causal State Representations
Reading list
Causal state representations
- Learning Causal State Representations of Partially Observable Environments: https://arxiv.org/pdf/1906.10437.pdf
- Towards Causal Representation Learning: https://arxiv.org/abs/2102.11107
Causal effect identifiability
- Causal Effect Identifiability under Partial-Observability: http://proceedings.mlr.press/v119/lee20a.html
- https://arxiv.org/abs/2106.04619
Additive noise models
- Nonlinear causal discovery with additive noise models (Hoyer et al. 2009): https://papers.nips.cc/paper/2008/file/f7664060cc52bc6f3d620bcedc94a4b6-Paper.pdf