Correlation is not Causation

By Jimmy Fisher
Nov 03, 2024
in Philosophy

One of the sloppiest habits undermining credible data science today, in epidemiology and machine learning alike, is the conflation of association with causality.

Within any scientific endeavor concerned with identifying factors that, if altered, can reasonably be expected to affect outcomes, researchers try to isolate the variables contributing to the outcomes being studied. This is most plainly observed in the physical sciences and their applications, such as the engineering of aircraft: either airplanes fly at the end of runways, or they do not.

However, in social sciences like sociology and the more abstract branches of psychology, wherein the stories told not only supply retrospective motivational context but are treated as objective constructs acting upon populations as airplane wings upon the wind, it can be a tricky task to separate theoretical explanations from the predictors themselves.

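Consider social determinants of health, of which the CDC's academic experts inform us there are five: economic stability, education access and quality, health care access and quality, neighborhood and built environment, and social and community context. This area of study is predicated on the idea that differences in health observed across populations are at least partially determined by these community factors.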

Social Determinants of Health. Retrieved from CDC.gov

This is not to take a stance on the presence or lack of empirical support for this claim but to highlight that the entire body of study is constrained by a lack of statistical differentiation between how such factors, however operationalized, contribute to observed outcomes. For example, data indicate that educational attainment is semi-predictive of economic stability, and it is well known that healthcare access and community context are significantly determined by socioeconomic status. Cyclical poverty is predicted by single-parent households, teenage pregnancy, and dropping out of high school, factors disproportionately observed in low-income, white rural households and low-income, black urban households alike. Because these variables are so tightly interrelated at a population level, it is difficult to discern how they interact and contribute to distributions of collective outcomes: describing one of them is indirectly describing them all.
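To make "describing one of them is indirectly describing them all" concrete, here is a minimal sketch using synthetic data only. The variable names (`ses`, `education`, `income`) and the noise levels are assumptions chosen for illustration, not estimates from any real dataset. Two predictors are generated from one shared latent factor, and their correlation is converted into a variance inflation factor (VIF), which for exactly two predictors equals 1 / (1 - r^2).

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
n = 5000

# Hypothetical latent factor standing in for socioeconomic status.
ses = [rng.gauss(0, 1) for _ in range(n)]

# Two synthetic predictors that are mostly the latent factor plus a little noise.
education = [s + rng.gauss(0, 0.3) for s in ses]
income = [s + rng.gauss(0, 0.3) for s in ses]

r = pearson(education, income)

# With exactly two predictors, the variance inflation factor is 1 / (1 - r^2).
vif = 1 / (1 - r ** 2)

print(f"correlation between predictors: {r:.3f}")
print(f"variance inflation factor:      {vif:.2f}")
```

A VIF well above 1 means a regression has little independent information with which to separate the two predictors' contributions, which is the statistical form of the ambiguity described above.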

Including more data has only compounded this ambiguity. Giant public datasets indicate that relatively poor physical and mental health are associated with such poor communities, which are very often beset by worse schools, higher pollution, lower health literacy, greater chronic disease burdens, fewer high-paying jobs, more fast food restaurants, higher illicit drug use, more binge drinking and homicide, and several other contributing factors.

So what is causing what?

Statistically speaking, we do not know. Public health advocates argue that this web of dispossessing circumstances is caused by one thing or another, and perhaps they are right in whole or in part, but in the absence of sufficiently detailed data or advanced statistical methods that control for the confounding collinearities associated with socioeconomic status and population density, epidemiologists cannot statistically know.
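As a rough illustration of what "controlling for" a confounder means, the sketch below simulates a case where the confounder is actually measured. Everything here is synthetic and hypothetical: `ses`, `fast_food`, and `disease` are made-up variables, and the data are generated so that `fast_food` has no direct effect on `disease` at all. The adjustment shown is one simple option (regress both variables on the confounder, then regress the residuals on each other), not the only or the best method.

```python
import random


def ols(xs, ys):
    """Return (intercept, slope) of the least-squares line ys ~ xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def residuals(xs, ys):
    """Part of ys left over after removing its linear dependence on xs."""
    intercept, slope = ols(xs, ys)
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


rng = random.Random(1)  # fixed seed so the sketch is repeatable
n = 5000

# Hypothetical confounder that drives both of the other variables.
ses = [rng.gauss(0, 1) for _ in range(n)]
fast_food = [s + rng.gauss(0, 1) for s in ses]
disease = [2 * s + rng.gauss(0, 1) for s in ses]  # no fast_food term by design

# Naive analysis: regress the outcome on the exposure alone.
naive_slope = ols(fast_food, disease)[1]

# Adjusted analysis: strip the confounder out of both, then regress residuals.
adjusted_slope = ols(residuals(ses, fast_food), residuals(ses, disease))[1]

print(f"naive slope:    {naive_slope:.3f}")
print(f"adjusted slope: {adjusted_slope:.3f}")
```

The naive slope is clearly positive even though the simulated exposure does nothing, and it collapses toward zero once the confounder is removed. The catch is that this only works when the confounder is known and measured, which is exactly what is often missing in practice.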

Interestingly, machine learning models do not contend with these complexities, their purpose being merely to predict and not to discover or validate causal mechanisms. Bathing suit use may correlate with ice cream sales, and using umbrellas may predict people getting wet, but this does not imply that bathing suits and umbrellas are proximate causes, because concurrency is not causality either. Associated variables can be predictive of outcomes, and correlated with them, without themselves causing those outcomes.
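The bathing suit example can be sketched as a toy simulation. The numbers and the function `simulate` are invented for illustration: temperature drives both bathing suit use and ice cream sales, and bathing suits have no effect on sales. In the observational setting the two are strongly correlated; when bathing suit use is assigned at random (a stand-in for an experiment), the correlation disappears.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def simulate(n, intervene, seed=0):
    """Correlation between bathing suit use and ice cream sales (synthetic)."""
    rng = random.Random(seed)
    suits, sales = [], []
    for _ in range(n):
        temperature = rng.gauss(25, 5)  # hypothetical common cause
        if intervene:
            # Experiment: suit use is assigned independently of the weather.
            suit_use = rng.gauss(25, 5)
        else:
            # Observation: people wear bathing suits when it is hot.
            suit_use = temperature + rng.gauss(0, 2)
        # Sales depend on temperature only, never on suit_use.
        ice_cream = 3 * temperature + rng.gauss(0, 5)
        suits.append(suit_use)
        sales.append(ice_cream)
    return pearson(suits, sales)


observed = simulate(5000, intervene=False)
intervened = simulate(5000, intervene=True)

print(f"observational correlation: {observed:.3f}")
print(f"correlation under random assignment: {intervened:.3f}")
```

A predictive model trained on the observational data would be right to use bathing suits as a feature, and wrong to conclude that handing out bathing suits sells ice cream.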

The good news is that there are ways to test for this. The bad news is that legitimate data science must admit when the data to properly assess causality are unavailable, as is often the case in areas of study such as socioecological variables predicting population health (i.e., so-called "social" determinants of health).

My own experiences while working with departments of public health, patient advocacy groups, low-income healthcare facilities, public health nurses, and multiple physician associations suggest significant opportunities to meaningfully improve public health, but only after the strengths and weaknesses of the available data are completely considered without preconception.

The broader point here is that data scientists are people, too, filling in the gaps of what hardcore data science can offer with presuppositions that, whether true or not, tend to confuse empiricism with expert opinion and good intentions with credible science. Discriminating between association, correlation, concurrency, and causality is a prerequisite to accurately perceiving and understanding the world in which we live.

If a picture is worth a thousand words, here are tens of thousands: https://www.tylervigen.com/spurious-correlations
