On the nature of causation and correlation: elections and cancer

Credit to Randall Munroe.
With election season approaching, everyone wants to know how the future of the United States' leadership will shape up. When we turn to data, we make predictions by drawing inferences from the past and present, as statisticians such as Nate Silver would explain. As the title suggests, in this post I discuss the conditions under which we can use experimental data to deduce a causal relationship between two or more variables.

Scientists design randomized controlled experiments through which they can infer causal relations among different phenomena and variables. Understanding how an object emits blackbody radiation in the absence of all external light, for example, first requires understanding how an object emits light at all. From that, a scientist may be able to infer that the object's own state causes it to emit radiation in this way. Still, randomized controlled experiments have limits and caveats no matter what experiment is being performed. This leaves scientists with the questions of how to infer other types of explanations and what sort of causal relationships we can truly establish for a system.

Israeli-American computer scientist and philosopher Judea Pearl laid out much of the research related to theories of causality. Causal inference is still debated among scientists and philosophers, but its premises, arguments, and conclusions can give us an understanding of correlation that doesn't fall prey to errors in reasoning.

Pearl's causal calculus is a set of three simple, powerful algebraic rules that can be used to make inferences about causal relationships. In particular, I’ll talk about the ways in which causal inference is possible, and I'll also go into detail on the limits of these methods.

To see how causal explanations work, consider the relationship between smoking and lung cancer. Several decades ago, the U.S. Surgeon General published a report that put forward the claim that cigarette smoking causes lung cancer. But the report came under attack not just from tobacco companies, but also from some of the world’s most prominent statisticians, including the great Ronald Fisher. There could have been other factors at play in the complicated relationship between smoking and lung cancer, such as genetics, environmental factors, and even personal characteristics such as age or race. In fact, one would have to understand the relationship between lung cancer and an individual's decision to smoke, not just the simple correlation between smoking and lung cancer. The way these factors relate to one another was most likely much more nuanced than the bare claim that cigarette smoking causes cancer. Understanding the specifics of these relationships is necessary for individuals to make healthy and beneficial decisions in their lives.

A randomized controlled experiment, as briefly discussed earlier, may be possible, but how it should be done requires explanation. A scientist could perform an experiment in which each person is forced either to smoke or not to smoke and, with an appropriately sized sample of smokers and non-smokers, observe the cancer rates in the two groups. The scientist would have to keep all other factors equal and ensure that the groups are truly random and large enough to support the generalization that smoking causes lung cancer. The scientist could then determine whether there is a causal relationship.

Yet, in reality, things are not so simple. Experiments such as these are time-consuming, difficult to maintain, and rely on controlling for many factors that complicate the issue being observed. That is before we even account for the ethical and legal problems of performing such an experiment. This raises a fundamental issue for scientists seeking to explore the relationships between activities like smoking and their associated health consequences.

The causal models we construct to analyze the smoking-cancer connection allow us to draw diagrams that posit a hidden factor at play in both smoking and lung cancer. Mathematically speaking, each arrow in such a diagram indicates that one factor causes another. Since we don't know exactly what the hidden factor is or how it behaves, we keep the illustration simple and clean: a single hidden node with one arrow pointing to smoking and another pointing to lung cancer.

There is also a third possibility: that the combination of smoking and a hidden factor contributes to lung cancer. This makes our correlations and relationships even more complicated, but it allows for more nuanced and detailed accounts of them. We could, for instance, argue that smoking itself reduces the probability of lung cancer while some hidden factor increases it by enough that we still observe an overall increase in cancer risk among smokers. These possibilities may seem to hinge on unnecessarily complicated premises, but, given how many factors are at play in the empirical evidence on smoking and cancer, they give us much more room for building accurate arguments about the issue.


For the notion of causality to make sense, we need to constrain the class of graphs that can be used in a causal model. We don't want any loops to occur in the graph. If there were loops, we wouldn't be able to discern an appropriate causal relationship, as any node in the loop would "cause" itself: following the arrows, the state of that node would depend on its own state. In investigating causal relationships, we create models that we keep updating as more nodes and relationships are added to the graph. We generally want to avoid methods that introduce some random indirect variable affecting every vertex of the graph. We want as much certainty as possible when generating the graphs.
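The no-loops requirement is easy to check mechanically. The following is a minimal sketch, assuming a graph stored as a dictionary that maps each node to the set of its direct causes; the node names and the `is_acyclic` helper are hypothetical, chosen to match the smoking example. It relies on Python's standard `graphlib` module (Python 3.9 or later), which raises `CycleError` when the graph contains a loop.

```python
from graphlib import TopologicalSorter, CycleError


def is_acyclic(parents):
    """Return True if the graph {node: set of direct causes} has no directed loop."""
    try:
        # prepare() raises CycleError if any directed cycle exists
        TopologicalSorter(parents).prepare()
        return True
    except CycleError:
        return False


# Hypothetical model: hidden -> smoking, hidden -> cancer, smoking -> cancer
valid_model = {
    "hidden": set(),
    "smoking": {"hidden"},
    "cancer": {"hidden", "smoking"},
}

# Same graph plus an arrow cancer -> hidden, which closes a loop
looped_model = {
    "hidden": {"cancer"},
    "smoking": {"hidden"},
    "cancer": {"hidden", "smoking"},
}

print(is_acyclic(valid_model))   # True
print(is_acyclic(looped_model))  # False
```

In the second graph, the hidden factor would be among its own causes, which is exactly the situation a causal model has to rule out.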

Causal models can generally be described using conditional probabilities, as in the equation

$$p(x_1, x_2, \ldots) = \prod_j p(x_j \mid \mbox{pa}(x_j)).$$

Here pa(x_j) denotes the parents of x_j, that is, the variables with arrows pointing directly into it. The equation says that the joint probability of all the variables is the product, over every variable, of the probability of that variable given its direct causes.
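To make the factorization concrete, here is a small sketch for the three-node smoking graph, with a hidden factor z, smoking x, and lung cancer y, assuming arrows z -> x, z -> y, and x -> y. Every probability below is a made-up number chosen only for illustration, and the helper names are hypothetical.

```python
from itertools import product

# Made-up probabilities for illustration only (not real data)
p_z = {0: 0.7, 1: 0.3}            # p(z): the hidden factor has no parents
p_x1_given_z = {0: 0.2, 1: 0.6}   # p(x=1 | z): smoking depends on z
p_y1_given_xz = {                 # p(y=1 | x, z), keyed by (x, z)
    (0, 0): 0.01, (0, 1): 0.05,
    (1, 0): 0.03, (1, 1): 0.10,
}


def bernoulli(p_one, value):
    """Probability of a binary value, given the probability that it equals 1."""
    return p_one if value == 1 else 1.0 - p_one


def joint(z, x, y):
    """p(z, x, y) = p(z) * p(x | z) * p(y | x, z): one factor per node given its parents."""
    return (
        p_z[z]
        * bernoulli(p_x1_given_z[z], x)
        * bernoulli(p_y1_given_xz[(x, z)], y)
    )


# The eight joint probabilities form a proper distribution
total = sum(joint(z, x, y) for z, x, y in product((0, 1), repeat=3))
print(round(total, 10))  # 1.0
```

Each factor conditions only on the node's parents, so the three small tables are enough to recover the full joint distribution.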

Simpson's paradox illustrates the problems that arise when we group individuals by various factors and the trends we observe when we do so. A clear example comes from a study of gender bias in graduate school admissions at the University of California, Berkeley. The fall 1973 data showed that men applying were more likely than women to be admitted. Of the 8442 men who applied, 44% were admitted, while of the 4321 women who applied, 35% were admitted. However, when individual departments were examined, six were significantly biased against men, while only four were significantly biased against women. In fact, once the data were pooled with a correction for department, they showed a "small but statistically significant bias in favor of women."

The paper went on to argue that women tended to apply to departments with low rates of admission for all or most applicants (such as English), while men tended to apply to departments with high rates of admission (such as engineering and chemistry).

From a purely causal point of view (that is, in understanding which factors determine other factors), this result seems paradoxical. Making clear, educated statements about whether an individual is likely to be accepted to Berkeley may hinge more on the assumptions and premises behind our conclusions than on easy-to-digest, generalizable summaries. Two variables which appear correlated can become anti-correlated when another factor is taken into account.
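The reversal is easy to reproduce with a toy example. The sketch below uses two hypothetical departments with made-up applicant counts (these are not the Berkeley figures): women have the higher admission rate in each department, yet men have the higher rate once the departments are pooled, because most women apply to the harder department.

```python
# (applicants, admitted) per department and group.
# Made-up numbers for illustration, NOT the Berkeley data.
data = {
    "A": {"men": (800, 480), "women": (200, 130)},  # easier department
    "B": {"men": (200, 40), "women": (800, 200)},   # harder department
}


def rate(applicants, admitted):
    """Admission rate for one group in one department."""
    return admitted / applicants


def pooled_rate(group):
    """Admission rate for a group with all departments lumped together."""
    applicants = sum(data[dept][group][0] for dept in data)
    admitted = sum(data[dept][group][1] for dept in data)
    return admitted / applicants


for dept, groups in data.items():
    print(dept, {g: round(rate(*counts), 2) for g, counts in groups.items()})
print("pooled", {g: round(pooled_rate(g), 2) for g in ("men", "women")})
```

In this toy table women are admitted at 65% versus 60% in department A and 25% versus 20% in department B, while the pooled rates are 52% for men and 33% for women. The choice of department acts as the additional factor that flips the sign of the association.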

In any case, causality itself seems to fly under the radar of too many scientists. To ascertain the truth and validity of arguments about causality and to use them in any discipline, one must come to understand the nature of causality itself. By building such arguments and recognizing the limitations of these methods of inquiry, we can refine our understanding of the universe and make our predictions and inferences with more certainty.

There are ways to resolve Simpson's paradox, though. With a causal Bayesian network (an acyclic graph like those we've been working with; let's say X causes Y), we can measure how changing X would change Y and so determine the relationship between them. As with our smoking example, there may be ethical and logistical reasons why this is not possible. One could also show that an extra variable Z correlates with both X and Y because it lies between them: X causes Z, which in turn causes Y. Finally, one might have an indirect variable which affects both X and Y, so that Z causes X and Z causes Y. As explained with the graphs illustrating smoking and lung cancer, we generally want our measurements to rule out these hidden variables in order to determine how a causal model works.
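The last case, where Z causes both X and Y, can be sketched numerically. The example below assumes a graph in which X has no effect on Y at all, and all probabilities are made up for illustration. It compares what we would observe when X is left to follow Z with what we would see if X were forced to a value, which amounts to cutting the arrow from Z to X so that Z keeps its own distribution. The function names are hypothetical.

```python
# Made-up probabilities for a graph Z -> X and Z -> Y, with no arrow X -> Y
p_z = {0: 0.5, 1: 0.5}            # p(z)
p_x1_given_z = {0: 0.2, 1: 0.8}   # p(x=1 | z)
p_y1_given_xz = {                 # p(y=1 | x, z), keyed by (x, z); x is ignored
    (0, 0): 0.1, (1, 0): 0.1,
    (0, 1): 0.5, (1, 1): 0.5,
}


def p_x_given_z(x, z):
    """p(x | z) for binary x."""
    return p_x1_given_z[z] if x == 1 else 1.0 - p_x1_given_z[z]


def observed(x):
    """p(y=1 | x) when X is left alone, so that X carries information about Z."""
    num = sum(p_z[z] * p_x_given_z(x, z) * p_y1_given_xz[(x, z)] for z in (0, 1))
    den = sum(p_z[z] * p_x_given_z(x, z) for z in (0, 1))
    return num / den


def intervened(x):
    """p(y=1) when X is forced to x: Z -> X is cut, so Z keeps its own distribution."""
    return sum(p_z[z] * p_y1_given_xz[(x, z)] for z in (0, 1))


print(round(observed(1), 2), round(observed(0), 2))      # 0.42 0.18
print(round(intervened(1), 2), round(intervened(0), 2))  # 0.3 0.3
```

Under these assumptions the observed rates differ sharply between X = 1 and X = 0, yet forcing X changes nothing, because the whole association runs through Z. This is the kind of hidden variable the measurements need to rule out.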