Introduction

This vignette is a modification of (Tikka, Hyttinen, and Karvanen 2021).

A causal effect is defined as the distribution \(P(\mathbf{Y} { \, | \, }\textrm{do}(\mathbf{X}),\mathbf{Z})\) where variables \(\mathbf{Y}\) are observed, variables \(\mathbf{X}\) are intervened upon (forced to values irrespective of their natural causes) and variables \(\mathbf{Z}\) are conditioned on. Instead of placing various parametric restrictions based on background knowledge, we are interested in this paper in the question of identifiability: can the causal effect be uniquely determined from the distributions (data) we have and a graph representing our structural knowledge on the generating causal system.

In the most basic setting we are identifying causal effects from a single observational input distribution, corresponding to passively observed data. To solve such problems more generally than what is possible with the back-door adjustment (Spirtes, Glymour, and Scheines 1993; Pearl 2009; Greenland, Robins, and Pearl 1999), Pearl (1995) introduced do-calculus, a set of three rules that together with probability theory enable the manipulation of interventional distributions. Shpitser and Pearl (2006) and Huang and Valtorta (2006) showed that do-calculus is complete by presenting polynomial-time algorithms whose each step can be seen as a rule of do-calculus or as an operation based on basic probability theory. The algorithms have a high practical value because the rules of do-calculus do not by themselves provide an indication on the order in which they should be applied. The algorithms save us from manual application of do-calculus, which is a tedious task in all but the simplest problems.

Since then many extensions of the basic identifiability problem have appeared. In identifiability using surrogate experiments (Bareinboim and Pearl 2012), or \(z\)-identifiability, an experimental distribution is available in addition to the observed probability distribution. For data observed in the presence of selection bias, both algorithmic and graphical identifiability results have been derived (Bareinboim and Tian 2015; Correa, Tian, and Bareinboim 2018). More generally, the presence of missing data necessitates the representation of the missingness mechanism, which poses additional challenges (Mohan, Pearl, and Tian 2013; Shpitser, Mohan, and Pearl 2015). Another dimension of complexity is the number of available data sources. Identification from a mixture of observational and interventional distributions that originate from multiple conceptual domains is known as transportability for which complete solutions exist in a specific setting (Bareinboim and Pearl 2014).

While completeness has been accomplished for a number of basic identifiability problems, there are still many challenging but important extensions to the identifiability problem that have not been studied so far. To find solutions to the more complicated identifiability problems, we present a unified approach to the identification of observational and interventional causal queries by constructing a search algorithm that directly applies the rules of do-calculus. We impose no restrictions to the number or type of known input distributions: we thus provide a solution to problems for which no other algorithmic solutions exist. We also extend to identifiability under missing data together with mechanisms related to selection bias and transportability.

To combat the inherent computational complexity of the search-based approach, we derive rules and techniques that avoid unnecessary computational steps. We are able to detect trivial queries where non-identifiability can be determined directly from the inputs. We also present a search heuristic that considerably speeds up the search in the cases where the effect is indeed identifiable. The approach, called do-search, is provably sound and it retains the completeness in the cases previously proven to be solved by do-calculus rules. We can easily scale up to the problems sizes commonly reported in the literature. The R package dosearch (R Core Team 2024; Tikka, Hyttinen, and Karvanen 2020) provides an implementation of the search algorithm and is available on CRAN. The complete details of do-search can be found in (Tikka, Hyttinen, and Karvanen 2021).

The General Causal Effect Identification Problem

Our presentation is based on Structural Causal Models (SCM) and the language of directed graphs. We assume the reader to be familiar with these concepts and refer them to detailed works on these topics for extended discussion and descriptions, such as (Pearl 2009) and (Koller and Friedman 2009).

Following the standard set-up of do-calculus (Pearl 1995), we assume that the causal structure can be represented by a semi-Markovian causal graph \(G\) over a set of vertices \(\mathbf{V}\). The directed edges correspond to direct causal relations between the variables (relative to \(\mathbf{V}\)); directed edges do not form any cycles. Confounding of any two observed variables in \(\mathbf{V}\) by some unobserved common cause is represented by a bidirected edge between the variables.

In a non-parametric setting, the problem of expressing a causal quantity of interest in terms of available information has been be described in various ways depending on the context. When available data are affected by selection bias or missing data, a typical goal is to “recover” some joint or marginal distributions. If data are available from multiple conceptual domains, a distribution is “transported” from the source domains, from which a combination of both observational and experimental data are available, to a target domain. The aforementioned can be expressed in the SCM framework by equipping the graph of the model with special vertices. However, on a fundamental level these problems are simply variations of the original identifiability problem of causal effects and as such, our goal is to represent them as a single generalized identifiability problem.

The general form for a causal identifiability problem that we consider is formulated as follows.

Input: A set of known distributions of the form \(P(\mathbf{A}_i | \textrm{do}(\mathbf{B}_i), \mathbf{C}_i)\), a query \(P(\mathbf{Y} { \, | \, }\textrm{do}(\mathbf{X}), \mathbf{Z})\) and a semi-Markovian causal graph \(G\) over \(\mathbf{V}\).
Task: Output a formula for the query \(P(\mathbf{Y} { \, | \, }\textrm{do}(\mathbf{X}),\mathbf{Z})\) over the input distributions, or decide that it is not identifiable.

Here \(\mathbf{A}_i,\mathbf{B}_i, \mathbf{C}_i\) are disjoint subsets of \(\mathbf{V}\) for all \(i\), and \(\mathbf{X},\mathbf{Y},\mathbf{Z}\) are disjoint subsets of \(\mathbf{V}\). The causal graph \(G\) may contain vertices which describe mechanisms related to transportability and selection bias. In the following subsections we explain several important special cases of this problem definition, some that have been considered in the literature and some which have not been.

The SCM framework can be extended to describe missing data mechanisms. For each variable \(V_i\), two special vertices are added to the causal graph. The vertex \(V_i^*\) is the observed proxy variable which is linked to the true variable \(V_i\) via the missingness mechanism (Little and Rubin 1986; Mohan, Pearl, and Tian 2013): \[\begin{equation} V_i^* = \begin{cases} V_i, & \mathrm{if}\; R_{V_i} = 1, \\ \textrm{NA}, & \mathrm{if}\; R_{V_i} = 0, \end{cases} \end{equation}\] where \(\textrm{NA}\) denotes a missing value and \(R_{V_i}\) is called the response indicator (of \(V_i\)). In other words, the variable \(V_i^*\) that is actually observed matches the true value \(V_i\) if it is not missing (\(R_{V_i} = 1\)). We note that in this formulation, each true variable has its own response indicator, meaning that we do not consider shared indicators between variables or multiple indicators for a single variable. Furthermore, if there is no missingness associated with a given variable \(V_i\) meaning that it is fully observed, the corresponding response indicator \(R_{V_i}\) always has the value \(1\). The omission of a proxy variable and a response indicators of a specific variable from a graph encodes the assumption that the variable in question if fully observed. Note that intervention nodes are added for true variables and response indicators but not for proxy variables. On a symbolic level one could intervene on proxy variables, however we are only interested in interventions that keep equation 1 intact.

The dosearch package

We implemented do-search in C++ and constructed an R interface using the Rcpp package (Eddelbuettel and Francois 2011). This interface is provided by the R package dosearch.

library(dosearch)

Calling the search from R is straightforward via the primary function that carries the name of package.

dosearch(
  data, query, graph, 
  transportability = NULL, selection_bias = NULL, missing_data = NULL,
  control = list()
)

The required inputs of the function are data, query and graph. Argument data is used to encode the set of known input distributions in the general identifiability problem as a character string, where each distribution is separated by a new line. For example, if we have access to distributions \(P(W), P(Y { \, | \, }X)\), and \(P(Z { \, | \, }\textrm{do}(X), W)\), we would write

data <- "
  P(W)
  P(Y|X)
  P(Z|do(X),W)
"

The individual distributions can also be given as a list of character vectors of length one:

data <- list(
  "P(W)",
  "P(Y|X)",
  "P(Z|do(X),W)"
)

The \(\textrm{do}(\cdot)\)-operator can either precede or succeed conditioning variables, but it must appear only once in a given term, meaning that expressions such as P(Y|do(A),B,do(C)) are not allowed, but should instead be given as P(Y|B,do(A,C)) or P(Y|do(A,C),B). If variable sets are desired, each member of the set has to be included explicitly.

Argument query is used to describe the query of the general identifiability problem as a character string, similarly as data. If we are interested in identifying \(P(Y { \, | \, }\textrm{do}(X), W)\) we would write

query <- "P(Y|do(X),W)"

Instead of describing distributions via text, it is also possible to use the following structure that encodes the role of each variable via a numeric vector:

query <- c(Y = 0, X = 1, W = 2)

Given a distribution of the form \(P(\mathbf{A} { \, | \, }\textrm{do}(\mathbf{B}),\mathbf{C})\) and a variable \(V\), a value 0 means that \(V \in \mathbf{A}\), value 1 means that \(V \in \mathbf{B}\) and value 2 means that \(V \in \mathbf{C}\). This format can also be used to input data as a list of numeric vectors:

data <- list(
  c(W = 0),
  c(Y = 0, X = 2),
  c(Z = 0, X = 1, W = 2)
)

Finally, graph encodes the semi-Markovian graph \(G\) of the causal model as a character string with each edge on its own line. A directed edge from \(X\) to \(Y\) is given as X -> Y and a bidirected edge between \(X\) and \(Y\) is given as X <-> Y. Intervention nodes should not be given explicitly, since they are added automatically after calling dosearch. Furthermore, only vertices with incoming or outgoing edges should be included in graph. As an example, we can encode a simple back-door graph with an added unobserved confounded between \(X\) and \(Y\) as follows:

graph <- "
  X -> Y
  Z -> X
  Z -> Y
  X <-> Y
"

Alternatively, one may use igraph graphs (Csardi and Nepusz 2006) in the syntax of the causaleffect package (Tikka and Karvanen 2017) or DAGs created using the dagitty package.

library("igraph")
graph <- graph.formula(X -+ Y, Z -+ X, Z -+ Y, X -+ Y, Y -+ X)
graph <- set_edge_attr(graph, "description", 4:5, "U")

library("dagitty")
graph <- dagitty("dag{X -> Y; Z -> X; Z -> Y; X <-> Y}")

The next two optional parameters, and , are used to denote those vertices of \(G\) that should be understood as either transportability nodes or selection bias nodes, respectively. Providing these parameters may increase search performance in relevant problems. Both of these parameters should be given as character strings, where individual variables are separated by a comma, for example transportability = "S,T". Parameter missing_data, as the name suggests, is used to define missingness mechanisms (1) as a character string, where individual mechanisms are separated by a comma. In order to describe that \(R_X\) is the response indicator of \(X\) we would write R_X : X, which also implicitly defines that X* is the proxy variable of X.

The list control can be used to set various additional parameters that are not directly related to the identifiability problem itself, but more so to the output of the search and other auxiliary details, such as benchmarking and obtaining derivations that show how the query distribution can be reached from the inputs using do-calculus. One such control parameter determines whether to use the search heuristic or not. Documentation of the dosearch package contains detailed information on the full list of control parameters.

The return object of dosearch is a list with three components by default. The first component, identifiable, is a logical value that takes the value TRUE when the target distribution described by query is identifiable from the inputs. The second component, formula, is a character string describing the target distribution in terms of the inputs in LaTeX syntax if the target is identifiable. Otherwise this component is just an empty character string. The third component call contains the arguments of the original function call.

References

Bareinboim, E., and J. Pearl. 2012. “Causal Inference by Surrogate Experiments: Z-Identifiability.” In Proceedings of the 28th Conference on Uncertainty in Artificial Intelligence, edited by N. de Freitas and K. Murphy, 113–20. AUAI Press.

———. 2014. “Transportability from Multiple Environments with Limited Experiments: Completeness Results.” In Proceedings of the 27th Annual Conference on Neural Information Processing Systems, 280–88.

Bareinboim, E., and J. Tian. 2015. “Recovering Causal Effects from Selection Bias.” In Proceedings of the 29th AAAI Conference on Artificial Intelligence, 3475–81.

Correa, J., J. Tian, and E. Bareinboim. 2018. “Generalized Adjustment Under Confounding and Selection Biases.” In Proceedings of the 32nd AAAI Conference on Artificial Intelligence.

Csardi, G., and T. Nepusz. 2006. “The igraph Software Package for Complex Network Research.” InterJournal Complex Systems: 1695. https://igraph.org.

Eddelbuettel, D., and R. Francois. 2011. “Rcpp: Seamless R and C++ Integration.” Journal of Statistical Software 40 (8): 1–18. https://doi.org/10.18637/jss.v040.i08.

Greenland, S., J. M. Robins, and J. Pearl. 1999. “Confounding and Collapsibility in Causal Inference.” Statistical Science 14 (1): 29–46. https://doi.org/10.1214/ss/1009211805.

Huang, Y., and M. Valtorta. 2006. “Pearl’s Calculus of Intervention Is Complete.” In Proceedings of the 22nd Conference on Uncertainty in Artificial Intelligence, 217–24. AUAI Press.

Koller, D., and N. Friedman. 2009. Probabilistic Graphical Models: Principles and Techniques. MIT Press.

Little, R. J. A., and D. B. Rubin. 1986. Statistical Analysis with Missing Data. New York, NY, USA: John Wiley & Sons, Inc.

Mohan, K., J. Pearl, and J. Tian. 2013. “Graphical Models for Inference with Missing Data.” In Proceedings of the 26th International Conference on Neural Information Processing Systems, 1277–85.

Pearl, J. 1995. “Causal Diagrams for Empirical Research.” Biometrika 82 (4): 669–88. https://doi.org/10.2307/2337329.

———. 2009. Causality: Models, Reasoning, and Inference. Second. Cambridge University Press.

R Core Team. 2024. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Shpitser, I., K. Mohan, and J. Pearl. 2015. “Missing Data as a Causal and Probabilistic Problem.” In Proceedings of the 31st Conference on Uncertainty in Artificial Intelligence, edited by Marina Meila and Tom Heskes, 802–11. AUAI Press.

Shpitser, I., and J. Pearl. 2006. “Identification of Joint Interventional Distributions in Recursive Semi-Markovian Causal Models.” In Proceedings of the 21st National Conference on Artificial Intelligence – Volume 2, 1219–26. AAAI Press.

Spirtes, P., C. Glymour, and R. Scheines. 1993. Causation, Prediction, and Search. Springer-Verlag.

Tikka, S., A. Hyttinen, and J. Karvanen. 2020. dosearch: Causal Effect Identification from Multiple Incomplete Data Sources. https://cran.r-project.org/package=dosearch.

———. 2021. “Causal Effect Identification from Multiple Incomplete Data Sources: A General Search-Based Approach.” Journal of Statistical Software 99 (5): 1–40. https://doi.org/10.18637/jss.v099.i05.

Tikka, S., and J. Karvanen. 2017. “Identifying Causal Effects with the R Package causaleffect.” Journal of Statistical Software 76 (12): 1–30. https://doi.org/10.18637/jss.v076.i12.